class: title-slide, center, middle count: false .band[ # Impact of tree choice in metagenomics differential abundance studies ### Antoine Bichat <div style = "margin-top: -30px"></div> ### January 14, 2019 - AgroParisTech Work in progress - In collaboration with C. Ambroise (LaMME), <br> M. Mariadassou (MaIAGE) & J. Plassais (Enterome) ] --- class: center, middle, inverse # Context --- ## Microbiota _Ecological community of microorganisms that reside in an environmental niche_ -- .footnote[
Gut: The Inside Story of Our Body's Most Underrated Organ (Giulia Enders) <br>  ] .pull-left[ #### Some figures for human gut * `\(10^{\small{15}}\)` bacterial cells in one gut... * ... weighing 2 kg * More than 1 500 different species * More than 10 millions unique genes <img src="impactoftreechoice_agro_files/figure-html/citations-1.png" width="288" style="display: block; margin: auto;" /> ] -- .footnote[
Gut: The Inside Story of Our Body's Most Underrated Organ (Giulia Enders) <br>
Opstelten et al. (2016), Bokulich et al. (2016), Blander et al. (2017)] .pull-right[ #### Proven associations * Immune system * Crohn's disease * Vaginosis * Diabete * Tobacco * Diet * Antibiotics * Birth mode ] --- ## Data collection .footnote[
Quince et al. (2017)] .center[<img src="img/workflow.png" width="650">] --- count: false ## Data collection .footnote[
Quince et al. (2017)] .center[<img src="img/workflow_ellipse.png" width="650">] --- ## Data - abundances of taxa .footnote[
Ravel et al. (2011)] ``` # A tibble: 122 x 395 Taxa S001 S002 S003 S004 S005 S006 S007 S008 S009 S010 <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 Lactob… 2318 1388 1361 2256 88 1770 1490 119 2136 1790 2 Prevot… 0 1 1 0 525 7 134 753 0 0 3 Megasp… 0 1 0 0 402 0 4 102 0 0 4 Sneath… 0 0 0 0 302 0 35 272 0 0 5 Atopob… 0 1 0 0 84 0 12 54 0 0 6 Strept… 0 0 3 0 0 0 138 4 0 2 7 Dialis… 0 1 0 0 152 4 2 192 0 0 8 Anaero… 0 1 3 2 0 9 12 13 0 0 9 Pepton… 0 1 0 0 7 2 6 50 0 0 10 Eggert… 0 0 0 0 2 0 0 7 0 0 # … with 112 more rows, and 384 more variables ``` -- * Count data (or compositional) data * Zero-inflated data * __Correlation between species__ * Counts spanning several orders of magnitude: `\(1 \rightarrow 10^{\small{8}}\)` --- ## Data - samples information .footnote[
Ravel et al. (2011)] ``` # A tibble: 394 x 6 Sample Ethnic_Group pH Nugent_Score Nugent_Cat CST <chr> <chr> <dbl> <dbl> <chr> <chr> 1 S001 Asian 4 0 low I 2 S002 White 4 0 low II 3 S003 Black 4 1 low III 4 S004 Asian 4.7 0 low I 5 S005 Black 5 6 intermediate IV 6 S006 White 4 0 low I 7 S007 White 4.7 1 low II 8 S008 White 5.8 9 high IV 9 S009 White 4.4 0 low III 10 S010 Black 4.4 1 low I # … with 384 more rows ``` --- ## Data - taxonomy .footnote[
Ravel et al. (2011)] ``` # A tibble: 129 x 5 Phylum Class Order Family Genus <chr> <chr> <chr> <chr> <chr> 1 Actinobacteria Actinobacteria Actinomycetal… Actinomycetac… Actinobaculum 2 Actinobacteria Actinobacteria Actinomycetal… Actinomycetac… Actinomyces 3 Actinobacteria Actinobacteria Actinomycetal… Actinomycetac… Arcanobacter… 4 Actinobacteria Actinobacteria Actinomycetal… Actinomycetac… Mobiluncus 5 Actinobacteria Actinobacteria Actinomycetal… Actinomycetac… Varibaculum # … with 124 more rows ``` --- class: nologo count: false ## Data - taxonomy ``` # A tibble: 129 x 5 Phylum Class Order Family Genus <chr> <chr> <chr> <chr> <chr> 1 Actinobacteria Actinobacteria Actinomycetal… Actinomycetac… Actinobaculum 2 Actinobacteria Actinobacteria Actinomycetal… Actinomycetac… Actinomyces 3 Actinobacteria Actinobacteria Actinomycetal… Actinomycetac… Arcanobacter… 4 Actinobacteria Actinobacteria Actinomycetal… Actinomycetac… Mobiluncus 5 Actinobacteria Actinobacteria Actinomycetal… Actinomycetac… Varibaculum # … with 124 more rows ``` .center[<img src="img/tree_tax.png"/ width="500">] --- class: center, middle, inverse # Differential abundance studies --- ## Statistical issue .center[ #### Healthy VS Diseased <div style = "margin-top: -25px"></div> .center[<img src="img/cohort.png"/ width="375">] ] Is any bacteria more abundant in one group? -- .pull-left[ #### Goals * Biomarker * Companion Diagnostics * Potential targets for drugs ] -- .pull-right[ #### Methods * ANOVA / GLM * Wilcoxon rank sum test * Mixed model effects * ... ] --- ## Multiple testing problem Usual design: * 1500 species in the gut, up to hundreds in studies * Tens or hundreds of patients Need for a controling procedure! -- * Bonferroni (FWER) --- count: false ## Multiple testing problem Usual design: * 1500 species in the gut, up to hundreds in studies * Tens or hundreds of patients Need for a controling procedure! * ~~Bonferroni (FWER)~~ `\(\rightarrow\)` too conservative -- * Benjamini-Hochberg (FDR) -- But BH requires independance and there is lot of dependancies between species... .center[ <img src="img/cor_bact_firm.png" width="300"> ] --- ## Hierarchical FDR .footnote[
Yekutieli (2008)] .pull-left[ `\(\mathcal{T}_t = \left\{H_i \mid \text{Par}(i) = t\right\}\)`, `\(L\)` levels Descending method: * Test the family `\(\mathcal{T}_0\)` * If node `\(t\)` is rejected, test `\(\mathcal{T}_t\)` ] .pull-right[ <img src="img/tree_hyp.png" width="450"> ] -- How to test a family `\(\mathcal{T}_t\)` (simple BH procedure at level `\(q\)` within it): 1. Let `\(P^t_{(1)}\leq \dots\leq P^t_{(m_t)}\)` denote the set of ordered p-values corresponding to the hypotheses in `\(\mathcal{T}_t\)` 2. Let `\(r_t = \max\left\{i\mid P^t_{(i)} \leq i \frac{q}{m_t}\right\}\)` 3. Reject the `\(r_t\)` hypothesis with the smaller p-value -- .center[ __The procedure controls the FDR at level `\(\mathbf{2.88 \times q \times L}\)`__ ] --- count: false ## Hierarchical FDR .footnote[
Yekutieli (2008)] .pull-left[ `\(\mathcal{T}_t = \left\{H_i \mid \text{Par}(i) = t\right\}\)`, `\(L\)` levels Descending method: * Test the family `\(\mathcal{T}_0\)` * If node `\(t\)` is rejected, test `\(\mathcal{T}_t\)` ] .pull-right[ <img src="img/tree_hyp_1.png" width="450"> ] How to test a family `\(\mathcal{T}_t\)` (simple BH procedure at level `\(q\)` within it): 1. Let `\(P^t_{(1)}\leq \dots\leq P^t_{(m_t)}\)` denote the set of ordered p-values corresponding to the hypotheses in `\(\mathcal{T}_t\)` 2. Let `\(r_t = \max\left\{i\mid P^t_{(i)} \leq i \frac{q}{m_t}\right\}\)` 3. Reject the `\(r_t\)` hypothesis with the smaller p-value .center[ __The procedure controls the FDR at level `\(\mathbf{2.88 \times q \times L}\)`__ ] --- count: false ## Hierarchical FDR .footnote[
Yekutieli (2008)] .pull-left[ `\(\mathcal{T}_t = \left\{H_i \mid \text{Par}(i) = t\right\}\)`, `\(L\)` levels Descending method: * Test the family `\(\mathcal{T}_0\)` * If node `\(t\)` is rejected, test `\(\mathcal{T}_t\)` ] .pull-right[ <img src="img/tree_hyp_2.png" width="450"> ] How to test a family `\(\mathcal{T}_t\)` (simple BH procedure at level `\(q\)` within it): 1. Let `\(P^t_{(1)}\leq \dots\leq P^t_{(m_t)}\)` denote the set of ordered p-values corresponding to the hypotheses in `\(\mathcal{T}_t\)` 2. Let `\(r_t = \max\left\{i\mid P^t_{(i)} \leq i \frac{q}{m_t}\right\}\)` 3. Reject the `\(r_t\)` hypothesis with the smaller p-value .center[ __The procedure controls the FDR at level `\(\mathbf{2.88 \times q \times L}\)`__ ] --- count: false ## Hierarchical FDR .footnote[
Yekutieli (2008)] .pull-left[ `\(\mathcal{T}_t = \left\{H_i \mid \text{Par}(i) = t\right\}\)`, `\(L\)` levels Descending method: * Test the family `\(\mathcal{T}_0\)` * If node `\(t\)` is rejected, test `\(\mathcal{T}_t\)` ] .pull-right[ <img src="img/tree_hyp_3.png" width="450"> ] How to test a family `\(\mathcal{T}_t\)` (simple BH procedure at level `\(q\)` within it): 1. Let `\(P^t_{(1)}\leq \dots\leq P^t_{(m_t)}\)` denote the set of ordered p-values corresponding to the hypotheses in `\(\mathcal{T}_t\)` 2. Let `\(r_t = \max\left\{i\mid P^t_{(i)} \leq i \frac{q}{m_t}\right\}\)` 3. Reject the `\(r_t\)` hypothesis with the smaller p-value .center[ __The procedure controls the FDR at level `\(\mathbf{2.88 \times q \times L}\)`__ ] --- ## Which tree? .pull-left[Taxonomic tree? <img src="img/tree_tax.png"> ] -- .pull-right[Correlation tree? <img src="img/tree_cor.png"> ] -- <br> .center[__Are they similar for the procedure?__] --- class: center, middle, inverse # Comparison of trees --- ## Correlation tree * Pairwise correlation matrix `C` is computed from Spearman correlation after removing shared zeros between two species ```r cor(x[x+y > 0], y[x+y > 0], method = "spearman") ``` -- * Dissimilarity matrix `D` is obtained after a transformation of `C` ```r D <- 1 - C ``` -- * Hierarchical clustering is used to create the correlation tree `tree_cor` from `D` ```r tree_cor <- hclust(D) ``` -- <br> .center[__Conditionally to abundances, the construction of the correlation tree is independant of samples covariables__] --- ## Billera-Holmes-Vogtamnn distance .footnote[
Billera, Holmes & Vogtmann (2001)] The BHV distance between two trees is the total length of intern branch that one needs to contract or extend to transform one tree into the other -- .pull-left[ .center[ <img src="img/dBHV_1.png" height=300/> ] ] -- .pull-right[ .center[ <img src="img/dBHV_2.png" height=300/> ] ] --- count: false ## Billera-Holmes-Vogtamnn distance .footnote[
Billera, Holmes & Vogtmann (2001)] The BHV distance between two trees is the total length of intern branch that one needs to contract or extend to transform one tree into the other .pull-left[ .center[ <img src="img/dBHV_3.png" height=300/> ] ] -- .pull-right[ .center[ <img src="img/dBHV_4.png" height=300/> ] ] -- <br> .center[__There exists a distance on the space of trees__] --- ## Quantifying distance between trees * __trees of primary interest__ * correlation tree on original data * taxonomic tree -- * __what is the confident region for the correlation tree?__ -- * `\(N_B\)` correlation trees on boostrapped data -- * __are trees significantly closer than two random trees?__ -- * `\(N_{R_1}\)` trees created by random shuffling of correlation tree tip labels * `\(N_{R_2}\)` trees created by random shuffling of taxonomic tree tip labels -- <br> We compute all pairwise distances between these trees --- ## Random shuffling .center[<img src="img/shuffling.png", height="500">] --- ## Dataset .footnote[
Ravel et al. (2011)] * Vaginal microbiome of non pregnant women sequenced by 16S * 40 different genera after filtering (~ 30 %) * 394 women with various Nugent score <br> .pull-left[<img src="img/tree_tax.png">] .pull-right[<img src="img/tree_cor.png">] --- ## Distance to the correlation tree .center[<img src="img/comp_boxplot.png" width="650">] --- ## Principal Coordinates Analysis .center[<img src="img/comp_pcoa.png" width="650">] --- class: center, middle, inverse # Application --- ## Dataset .footnote[
Caporaso et al. (2011), Sankaran & Holmes (2014)] * Small subset of the `GlobalPatterns` dataset narrowed to Chlamydiae phylum * Sequenced by 454 * 21 different species * 26 samples representing 7 environments: soil, ocean, feces, skin... * Find which bacteria are differentially abundant bewtween environments * Association using Fisher statistic --- ## Detected species with different corrections .center[<img src="img/venn.png" height=450/>] --- count: false ## Detected species with different corrections .center[<img src="img/venn_bh.png" height=450/>] --- count: false ## Detected species with different corrections .center[<img src="img/venn_phy.png" height=450/>] --- count: false ## Detected species with different corrections .center[<img src="img/venn_cor.png" height=450/>] --- count: false ## Detected species with different corrections .center[<img src="img/venn_rand.png" height=450/>] --- ## Back to abundances .center[<img src="img/abundances.png" height=500/>] --- ## Representation of evidences on trees .center[<img src="img/trees_w_pvalues.png" height=500/>] --- class: center, middle, inverse # Reproductibility --- ##
package `{correlationtree}` <br> Available (soon) on my
GitHub account: <a href="https://github.com/abichat" target="_blank">@abichat</a> <br> ```r devtools::install_github("abichat/correlationtree") ``` <br> * Computation of correlation trees * Comparison of tree distances * Shiny application --- ## Shiny app .footnote[
Pasolli et al. (2017)] Allow to interactively <img src="img/hex-shiny.png" align="right" height=140/> * Select data from standardized datasets available in package `{curatedMetagenomicData}` and filter it -- * Visualize abundances by categorical variables -- * Compute tree distances using Billera-Holmes-Vogtman, Robinson-Foulds and Cophenetic distances -- .pull-left[<img src="img/shinyapp_1.png">] .pull-right[<img src="img/shinyapp_2.png">] --- class: center, middle, inverse # Perspectives --- ## Perspectives <br> * Work with other hierarchical correction methods (bayesian smoothing) -- <br> * Test the correlation tree on simulated data -- <br> * Release the package -- <br> * Publish the article --- class: center, middle, inverse count: false <br> # Thanks for you attention! <br> ####
<a href="https://github.com/abichat" target="_blank">@abichat</a> <div style = "margin-top: -10px"></div> ####
<a href="https://twitter.com/_abichat" target="_blank">@_abichat</a> <div style = "margin-top: -10px"></div> ####
<a href="https://www.linkedin.com/in/antoinebichat" target="_blank">antoinebichat</a> <div style = "margin-top: -10px"></div> ####
<a href="https://abichat.github.io" target="_blank">abichat.github.io</a> <div style = "margin-top: -10px"></div> ####
<a href="mailto:abichat@enterome.com?subject=Presentation">abichat@enterome.com</a>