class: title-slide, center, middle count: false <br> # Impact of tree choice in metagenomics differential abundance studies <br> ## Antoine Bichat <div style = "margin-top: -30px"></div> <!-- ###
@_abichat --> ### June 6, 2019 - Journées de Statistique - Nancy <br> #### In collaboration with C. Ambroise (LaMME), M. Mariadassou (MaIAGE) & J. Plassais (Enterome) --- class: center, middle, inverse # Context --- ## Microbiota _Ecological community of microorganisms that reside in an environmental niche_ -- .footnote[📘 Gut: The Inside Story of Our Body's Most Underrated Organ (Giulia Enders) <br> 📄 Opstelten et al. (2016), Bokulich et al. (2016), Blander et al. (2017)] .pull-left[ #### Some figures for human gut * `\(10^{\small{14}}\)` bacterial cells in one gut... * ... weighing 2 kg * More than 1 500 different species * More than 10 millions unique genes <img src="index_files/figure-html/citations-1.png" width="288" style="display: block; margin: auto;" /> ] -- .pull-right[ #### Proven associations * Immune system * Crohn's disease * Vaginosis * Diabete * Tobacco * Diet * Antibiotics * Birth mode ] --- ## Data - abundances of taxa .footnote[📄 Ravel et al. (2011)] ``` # A tibble: 122 x 395 Taxa S001 S002 S003 S004 S005 S006 S007 S008 S009 S010 <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 Lactob… 2318 1388 1361 2256 88 1770 1490 119 2136 1790 2 Prevot… 0 1 1 0 525 7 134 753 0 0 3 Megasp… 0 1 0 0 402 0 4 102 0 0 4 Sneath… 0 0 0 0 302 0 35 272 0 0 5 Atopob… 0 1 0 0 84 0 12 54 0 0 6 Strept… 0 0 3 0 0 0 138 4 0 2 7 Dialis… 0 1 0 0 152 4 2 192 0 0 8 Anaero… 0 1 3 2 0 9 12 13 0 0 9 Pepton… 0 1 0 0 7 2 6 50 0 0 10 Eggert… 0 0 0 0 2 0 0 7 0 0 # … with 112 more rows, and 384 more variables ``` -- * Count data (or compositional) data * Zero-inflated data * Correlation between species * Counts spanning several orders of magnitude: `\(1 \rightarrow 10^{\small{8}}\)` --- ## Data - taxonomy .footnote[📄 Ravel et al. (2011)] ``` # A tibble: 129 x 5 Phylum Class Order Family Genus <chr> <chr> <chr> <chr> <chr> 1 Actinobacteria Actinobacteria Actinomycetal… Actinomycetac… Actinobaculum 2 Actinobacteria Actinobacteria Actinomycetal… Actinomycetac… Actinomyces 3 Actinobacteria Actinobacteria Actinomycetal… Actinomycetac… Arcanobacter… 4 Actinobacteria Actinobacteria Actinomycetal… Actinomycetac… Mobiluncus 5 Actinobacteria Actinobacteria Actinomycetal… Actinomycetac… Varibaculum # … with 124 more rows ``` --- class: nologo count: false ## Data - taxonomy ``` # A tibble: 129 x 5 Phylum Class Order Family Genus <chr> <chr> <chr> <chr> <chr> 1 Actinobacteria Actinobacteria Actinomycetal… Actinomycetac… Actinobaculum 2 Actinobacteria Actinobacteria Actinomycetal… Actinomycetac… Actinomyces 3 Actinobacteria Actinobacteria Actinomycetal… Actinomycetac… Arcanobacter… 4 Actinobacteria Actinobacteria Actinomycetal… Actinomycetac… Mobiluncus 5 Actinobacteria Actinobacteria Actinomycetal… Actinomycetac… Varibaculum # … with 124 more rows ``` .center[<img src="img/tree_tax.png"/ width="500">] --- class: center, middle, inverse # Differential abundance studies --- ## Statistical issue Univariate tests on hundred of taxa Need for a multiple testing controling procedure! -- .footnote[📄 Philippot et al. (2010)] <br> .pull-left[ A hierarchy is available <img src="img/coherence.png" width="400"> ] .pull-right[ Can we use it to do it better? ] --- count: false ## Statistical issue .footnote[📄 Philippot et al. (2010)] Univariate tests on hundred of taxa Need for a multiple testing controling procedure! <br> .pull-left[ A hierarchy is available <img src="img/coherence.png" width="400"> ] .pull-right[ Can we use it to do it better? <br> * Hierarchical FDR * z-scores smoothing ] --- ## Hierarchical FDR .footnote[📄 Yekutieli (2008)] `\(\mathcal{T}_t = \left\{H_i \mid \text{Par}(i) = t\right\}\)` Descending method: * Test the family `\(\mathcal{T}_0\)` * If node `\(t\)` is rejected, test `\(\mathcal{T}_t\)` with a classical BH procedure at level `\(q\)` <br> .center[ <img src="img/tree_hyp.png" width="420"> ] --- count: false ## Hierarchical FDR .footnote[📄 Yekutieli (2008)] `\(\mathcal{T}_t = \left\{H_i \mid \text{Par}(i) = t\right\}\)` Descending method: * Test the family `\(\mathcal{T}_0\)` * If node `\(t\)` is rejected, test `\(\mathcal{T}_t\)` with a classical BH procedure at level `\(q\)` <br> .center[ <img src="img/tree_hyp_1.png" width="420"> ] --- count: false ## Hierarchical FDR .footnote[📄 Yekutieli (2008)] `\(\mathcal{T}_t = \left\{H_i \mid \text{Par}(i) = t\right\}\)` Descending method: * Test the family `\(\mathcal{T}_0\)` * If node `\(t\)` is rejected, test `\(\mathcal{T}_t\)` with a classical BH procedure at level `\(q\)` <br> .center[ <img src="img/tree_hyp_2.png" width="420"> ] --- count: false ## Hierarchical FDR .footnote[📄 Yekutieli (2008)] `\(\mathcal{T}_t = \left\{H_i \mid \text{Par}(i) = t\right\}\)` Descending method: * Test the family `\(\mathcal{T}_0\)` * If node `\(t\)` is rejected, test `\(\mathcal{T}_t\)` with a classical BH procedure at level `\(q\)` <br> .center[ <img src="img/tree_hyp_3.png" width="420"> ] -- <br> .center[ __This procedure controls the FDR at level__ `\(\large{1.44 \times q \times \frac{\#\text{discoveries } +\text{ } \#\text{families tested}}{\#\text{discoveries } +\text{ } 1}}\)` ] --- ## z-scores smoothing .footnote[📄 Xiao et al. (2017)] Denote by `\(\mathbf{z}\)` the vector of observed z-values and `\(\mathbf{\mu}\)` the vector of "true" z-values -- <br> Assume that `\(\mathbf{z} | \mathbb{\mu} \sim \mathcal{N}_n \left( \mathbb{\mu}, \sigma^2 \mathbf{I}_m \right)\)` and `\(\mathbf{\mu} \sim \mathcal{N}_m\left(\gamma \mathbf{1} , \tau^2 \mathbf{C}_{\rho} \right)\)` -- then `$$\mathbf{z} \sim \mathcal{N}_m \left(\gamma \mathbf{1},\tau^2 \mathbf{C}_{\rho} + \sigma^2 \mathbf{I}_m\right)$$` and Bayes formula gives `$$\mathbb{\mu}^* = \left(\mathbf{I}_m + \frac{\sigma_0^2}{\tau_0^2} \mathbf{C}_{\rho_0}^{-1}\right)^{-1}\left(\frac{\sigma_0^2}{\tau_0^2} \mathbf{C}_{\rho_0}^{-1}\gamma_0 \mathbf{1} + \mathbf{z}\right)$$` with `\(\sigma_0\)`, `\(\tau_0\)`, `\(\rho_0\)` and `\(\gamma_0\)` hypermarameters -- <br> After smoothing, a multiple testing correction could be done on smoothed values --- ## Which tree? #### Taxonomy? Phylogeny? * Proxy for correlations at high-level niches * Not so much for low-level niches? * Not available everytime --- class: nologo count: false ## Which tree? #### Taxonomy? Phylogeny? * Proxy for correlations at high-level niches * Not so much for low-level niches? * Not available everytime #### Correlation tree? * Actual correlation between taxa * Computed from data using pairwise correlation .center[ <img src="img/tree_cor.png" width="350"> ] --- class: center, middle, inverse # Comparison of trees --- ## Billera-Holmes-Vogtamnn distance .footnote[📄 Billera, Holmes & Vogtmann (2001)] The BHV distance is the length of the unique shortest path between the trees on treespace .pull-left[ .center[ <img src="img/dBHV_2.png" height=300/> ] ] .pull-right[ .center[ <img src="img/dBHV_4.png" height=300/> ] ] --- ## Quantifying distance between trees * __trees of primary interest__ * <span style="color:#C77CFF">correlation tree on original data</span> * <span style="color:#F8766D">taxonomy</span> -- * __what is the confident region for the correlation tree?__ -- * <span style="color:#00BFC4">correlation trees on boostrapped data (resampling on samples)</span> -- * __are trees significantly closer than two random trees?__ -- * <span style="color:#7CAE00">trees created by random shuffling of correlation tree tip labels</span> * <span style="color:#FFA500">trees created by random shuffling of taxonomy tip labels</span> -- <br> We compute all pairwise distances between these trees --- ## Random shuffling .center[<img src="img/shuffling.png", height="500">] --- ## Dataset .footnote[📄 Ravel et al. (2011)] * Vaginal microbiome of non pregnant women sequenced by 16S * 40 different genera after filtering (~ 30 %) <br> .pull-left[ <img src="img/tree_tax.png"> .center[Taxonomy] ] .pull-right[ <img src="img/tree_cor.png"> .center[Correlation tree] ] --- ## Pairwise distances .pull-left[ .center[<img src="img/prez_ravel_comp_box.png" width="650">] ] -- .pull-right[ .center[<img src="img/prez_ravel_comp_pcoa.png" width="650">] ] -- <br> 😀 The <span style="color:#C77CFF">correlation tree</span> is different from the <span style="color:#F8766D">taxonomy</span> --- class: center, middle, inverse # Evaluation of hFDR --- ## Dataset .footnote[📄 Caporaso et al. (2011), Sankaran & Holmes (2014)] * Small subset of the `GlobalPatterns` dataset narrowed to Chlamydiae phylum * 21 different OTUs * 26 samples representing 9 very different environments: soil, ocean, feces, skin... -- ## Method * Find which bacteria are differentially abundant between environments * Association using Fisher statistic (ANOVA) * Correction with hierarchical FDR --- ## Abundances of detected species .center[<img src="img/realdata_chlamydia_box.png" height=500/>] --- ## Representation of evidences on trees .center[<img src="img/tree_mirors.png" height=500/>] --- ## But... `\(\alpha = 0.10\)` is only the family-level FDR. The _a posteriori_ global FDR is: * `\(\alpha' = 0.32\)` for phylogenetic correction * `\(\alpha' = 0.324\)` for correlation correction -- A BH procedure at the same global FDR level leads to 15 discoveries (+5) -- <br> 😀 Using correlation tree instead of taxonomy yields more results ☹️ Vanilla BH beats hFDR for a given level --- class: center, middle, inverse # Evaluation of z-scores smoothing --- ## Dataset .footnote[📄 Zeller et al. (2014)] * Dataset from cancer study * 119 different genera (after filtering) * 199 samples: 42 adenoma, 91 carcinoma and 66 control -- ## Method * Find which bacteria are differentially abundant between diseases * Association using Kruskal-Wallis test * Correction with hierarchical p-value smoothing --- ## Impact of the tree .center[<img src="img/zeller-lines-bold.png" height=430/>] -- 😊 z-scores smoothing is slightly better than vanilla BH ☹️ All hierachies give highly similar results --- class: center, middle, inverse # Simulations --- ## Workflow .footnote[📄 Brito et al. (2016)] * Simulate DA taxa starting from an homogeneous dataset * Correction with BH and hierarchical p-value smoothing .center[<img src="img/workflow_simu.png" height=450/>] --- ## Evaluation .center[<img src="img/simu_np_lines_prez.png" height=390/>] -- 😀 Using correlation tree instead of taxonomy yields more results ☹️ Vanilla BH is better 🤔 Taxonomy is worst than random trees --- class: center, middle, inverse # Conclusions --- ## Conclusions 😀 Correlation tree and taxonomy are very different 😀 Replacing taxonomy with correlationn increases the TPR -- <br> ☹️ Vanilla BH is more powerful than hFDR ☹️ Bayesian smoothing does not really depend on the tree for z-scores smoothing -- <br> 📦 `correlationtree` 📦 `yatah`
📦 `evabic`
--- class: title-slide, center, middle count: false <br> # Thanks for your attention! <br> # Questions? <br> ####
<a href="https://abichat.github.io" target="_blank">abichat.github.io</a>  
<a href="https://github.com/abichat" target="_blank">@abichat</a>  
<a href="https://twitter.com/_abichat" target="_blank">@_abichat</a>