Impact of tree choice in metagenomics differential abundance studies

class: title-slide, center, middle
count: false

.band[
# Impact of tree choice in metagenomics differential abundance studies

### Antoine Bichat
<div style = "margin-top: -30px"></div>
### March 11, 2019 - StatInfOmics

Work in progress - In collaboration with C. Ambroise (LaMME), 
M. Mariadassou (MaIAGE) & J. Plassais (Enterome)
]

---
class: center, middle, inverse

# Context

---
## Microbiota

_Ecological community of microorganisms that reside in an environmental niche_

.footnote[ Gut: The Inside Story of Our Body's Most Underrated Organ (Giulia Enders) &emsp;]

.pull-left[

#### Some figures for human gut

* `$10^{\small{14}}$` bacterial cells in one gut...

* ... weighing 2 kg

* More than 1&#8239;500 different species

* More than 10 millions unique genes

]

--
.footnote[ Gut: The Inside Story of Our Body's Most Underrated Organ (Giulia Enders) Opstelten et al. (2016), Bokulich et al. (2016), Blander et al. (2017)]

.pull-right[

#### Proven associations

* Immune system

* Crohn's disease

* Vaginosis

* Diabete

* Tobacco

* Diet

* Antibiotics

* Birth mode

]

---
## Data collection

.footnote[ Quince et al. (2017)]

.center[<img src="img/workflow.png" width="650">]

---
count: false
## Data collection

.footnote[ Quince et al. (2017)]

.center[<img src="img/workflow_ellipse.png" width="650">]

---
## Data - abundances of taxa

.footnote[ Ravel et al. (2011)]

```
 # A tibble: 122 x 395
 Taxa S001 S002 S003 S004 S005 S006 S007 S008 S009 S010
 <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Lactob… 2318 1388 1361 2256 88 1770 1490 119 2136 1790
 2 Prevot… 0 1 1 0 525 7 134 753 0 0
 3 Megasp… 0 1 0 0 402 0 4 102 0 0
 4 Sneath… 0 0 0 0 302 0 35 272 0 0
 5 Atopob… 0 1 0 0 84 0 12 54 0 0
 6 Strept… 0 0 3 0 0 0 138 4 0 2
 7 Dialis… 0 1 0 0 152 4 2 192 0 0
 8 Anaero… 0 1 3 2 0 9 12 13 0 0
 9 Pepton… 0 1 0 0 7 2 6 50 0 0
 10 Eggert… 0 0 0 0 2 0 0 7 0 0
 # … with 112 more rows, and 384 more variables
```

* Count data (or compositional) data

* Zero-inflated data

* __Correlation between species__

* Counts spanning several orders of magnitude: `$1 \rightarrow 10^{\small{8}}$`

---
## Data - samples information

.footnote[ Ravel et al. (2011)]

```
 # A tibble: 394 x 6
 Sample Ethnic_Group pH Nugent_Score Nugent_Cat CST 
 <chr> <chr> <dbl> <dbl> <chr> <chr>
 1 S001 Asian 4 0 low I 
 2 S002 White 4 0 low II 
 3 S003 Black 4 1 low III 
 4 S004 Asian 4.7 0 low I 
 5 S005 Black 5 6 intermediate IV 
 6 S006 White 4 0 low I 
 7 S007 White 4.7 1 low II 
 8 S008 White 5.8 9 high IV 
 9 S009 White 4.4 0 low III 
 10 S010 Black 4.4 1 low I 
 # … with 384 more rows
```

---
## Data - taxonomy

.footnote[ Ravel et al. (2011)]

```
 # A tibble: 129 x 5
 Phylum Class Order Family Genus 
 <chr> <chr> <chr> <chr> <chr> 
 1 Actinobacteria Actinobacteria Actinomycetal… Actinomycetac… Actinobaculum
 2 Actinobacteria Actinobacteria Actinomycetal… Actinomycetac… Actinomyces 
 3 Actinobacteria Actinobacteria Actinomycetal… Actinomycetac… Arcanobacter…
 4 Actinobacteria Actinobacteria Actinomycetal… Actinomycetac… Mobiluncus 
 5 Actinobacteria Actinobacteria Actinomycetal… Actinomycetac… Varibaculum 
 # … with 124 more rows
```

---
class: nologo
count: false
## Data - taxonomy

.center[<img src="img/tree_tax.png"/ width="500">]

---
class: center, middle, inverse

# Differential abundance studies

---
## Statistical issue

.center[
#### Healthy VS Diseased
<div style = "margin-top: -25px"></div>
.center[<img src="img/cohort.png"/ width="375">]
]

Is any bacteria more abundant in one group?

.pull-left[
#### Goals
* Biomarker

* Companion Diagnostics

* Potential targets for drugs
]

.pull-right[
#### Methods
* ANOVA / GLM

* Wilcoxon rank sum test

* Mixed model effects

* ...
]

---
## Multiple testing problem

Usual design:

* 1&#8239;500 species in the gut

* Up to hundreds in a single study

Need for a controling procedure!

* Bonferroni (FWER)

---
count: false

## Multiple testing problem

Usual design:

* 1&#8239;500 species in the gut

* Up to hundreds in a single study

Need for a controling procedure!

* ~~Bonferroni (FWER)~~ `$\rightarrow$` too conservative

* Benjamini-Hochberg (FDR)

---
## Benjamini-Hochberg

.footnote[ Benjamini et al. (1995, 2001)]

The False Discovery Rate is defined as

`$$\text{FDR} =\mathbb{E}\left[\frac{V}{R \vee  1}\right]$$`

Benjamini-Hochberg procedure:

* Order the p-values as `$p_{(1)} \leq p_{(2)} \leq \ldots \leq p_{(m)}$` and let `$p_{(0)}=0$`

* Consider the rank `$\hat{\ell} = \text{max}\left\{\ell \in \{0, 1, \ldots, m\} \mid p_{(\ell)} \leq \frac{\alpha\ell}{m} \right\}$`

* Reject the `$\hat{\ell}$` hypothesis corresponding to the smallest p-values

This procedure controls the FDR at level `$\frac{m_{0}}{m}\alpha \leq \alpha$` and holds when the test statistics are independent or PRDS

---
## Incorporate taxonomy in the analysis?

.footnote[ Philippot et al. (2010), Koeppel & Wu (2012)]

* Taxonomy is a proxy for the structure of the data

* Coherent with ecological niches

* Independant from count data and metadata

.center[
 <img src="img/coherence.png" width="550">
]

---
## Hierarchical FDR

.footnote[ Yekutieli (2008)]

`$\mathcal{T}_t = \left\{H_i \mid \text{Par}(i) = t\right\}$`

Descending method:

* Test the family `$\mathcal{T}_0$`

* If node `$t$` is rejected, test `$\mathcal{T}_t$` with a classical BH procedure at level `$q$`

.center[
 <img src="img/tree_hyp.png" width="450">
]

---
count: false
## Hierarchical FDR

.footnote[ Yekutieli (2008)]

`$\mathcal{T}_t = \left\{H_i \mid \text{Par}(i) = t\right\}$`

Descending method:

* Test the family `$\mathcal{T}_0$`

* If node `$t$` is rejected, test `$\mathcal{T}_t$` with a classical BH procedure at level `$q$`

.center[
 <img src="img/tree_hyp_1.png" width="450">
]

---
count: false

## Hierarchical FDR

.footnote[ Yekutieli (2008)]

`$\mathcal{T}_t = \left\{H_i \mid \text{Par}(i) = t\right\}$`

Descending method:

* Test the family `$\mathcal{T}_0$`

* If node `$t$` is rejected, test `$\mathcal{T}_t$` with a classical BH procedure at level `$q$`

.center[
 <img src="img/tree_hyp_2.png" width="450">
]

---
count: false
## Hierarchical FDR

.footnote[ Yekutieli (2008)]

`$\mathcal{T}_t = \left\{H_i \mid \text{Par}(i) = t\right\}$`

Descending method:

* Test the family `$\mathcal{T}_0$`

* If node `$t$` is rejected, test `$\mathcal{T}_t$` with a classical BH procedure at level `$q$`

.center[
 <img src="img/tree_hyp_3.png" width="450">
]

.center[
__This procedure controls the FDR at level__

`$\large{1.44 \times q \times \frac{\#\text{discoveries } +\text{ } \#\text{families tested}}{\#\text{discoveries } +\text{ } 1}}$`
]

---
## Bayesian smoothing

.footnote[ Xiao et al. (2017)]

Denote by `$\mathbf{z}$` the vector of observed z-values and `$\mathbf{\mu}$` the vector of "true" z-values

Assume that 
`$\mathbf{z} | \mathbb{\mu} \sim \mathcal{N}_n \left( \mathbb{\mu}, \sigma^2 \mathbf{I}_m \right)$` and `$\mathbf{\mu} \sim \mathcal{N}_m\left(\gamma \mathbf{1} , \tau^2 \mathbf{C}_{\rho} \right)$`

then 
`$$\mathbf{z} \sim \mathcal{N}_m \left(\gamma \mathbf{1},\tau^2 \mathbf{C}_{\rho} + \sigma^2 \mathbf{I}_m\right)$$`

and Bayes formula gives 
 
`$$\mathbb{\mu}^* = \left(\mathbf{I}_m + \frac{\sigma_0^2}{\tau_0^2} \mathbf{C}_{\rho_0}^{-1}\right)^{-1}\left(\frac{\sigma_0^2}{\tau_0^2} \mathbf{C}_{\rho_0}^{-1}\gamma_0 \mathbf{1} + \mathbf{z}\right)$$`

with `$\sigma_0$`, `$\tau_0$`, `$\rho_0$` and `$\gamma_0$` hypermarameters

After smoothing, a multiple testing correction could be done on smoothed values

---
## Which tree?

#### Taxonomic tree?

* Proxy for correlations at high-level niches

* Not so much for subtle niches?

* Non available everytime

---
class: nologo
count: false
## Which tree?

#### Taxonomic tree?

* Proxy for correlations at high-level niches

* Not so much for subtel niches?

* Non available everytime

#### Correlation tree?

* Actual correlation between taxa

* Computed from data using pairwise correlation

.center[
<img src="img/tree_cor.png" width="350">
]

---
class: center, middle, inverse

# Comparison of trees

---
## Billera-Holmes-Vogtmann distance on treespace

.footnote[ Billera, Holmes & Vogtmann (2001)]

.pull-left[
 .center[
 <img src="img/dBHV_4.png" height=300/>
 ]
]

.pull-right[
* The treespace is a CAT(0) space so there is a unique shortest path connecting any two trees

* The BHV distance is the length of the unique shortest path between the trees on treespace

* Solution of a max flow min cut algorithm in `$O(n_{\text{leaf}}^{\small{4}})$`

* Other distances exist like Robison-Foulds (for topology) or Cophenetic (by vectorisation) distances
]

---
count: false
## Billera-Holmes-Vogtmann distance on treespace

.footnote[ Billera, Holmes & Vogtmann (2001)]

.pull-left[
 .center[
 <img src="img/dBHV_1.png" height=300/>
 ]
]

.pull-right[
* The treespace is a CAT(0) space so there is a unique shortest path connecting any two trees

* The BHV distance is the length of the unique shortest path between the trees on treespace

* Solution of a max flow min cut algorithm in `$O(n_{\text{leaf}}^{\small{4}})$`

* Other distances exist like Robison-Foulds (for topology) or Cophenetic (by vectorisation) distances
]

---
count: false
## Billera-Holmes-Vogtmann distance on treespace

.footnote[ Billera, Holmes & Vogtmann (2001)]

.pull-left[
 .center[
 <img src="img/dBHV_2.png" height=300/>
 ]
]

.pull-right[
* The treespace is a CAT(0) space so there is a unique shortest path connecting any two trees

* The BHV distance is the length of the unique shortest path between the trees on treespace

* Solution of a max flow min cut algorithm in `$O(n_{\text{leaf}}^{\small{4}})$`

* Other distances exist like Robison-Foulds (for topology) or Cophenetic (by vectorisation) distances
]

---
count: false
## Billera-Holmes-Vogtmann distance on treespace

.footnote[ Billera, Holmes & Vogtmann (2001)]

.pull-left[
 .center[
 <img src="img/dBHV_4.png" height=300/>
 ]
]

.pull-right[
* The treespace is a CAT(0) space so there is a unique shortest path connecting any two trees

* The BHV distance is the length of the unique shortest path between the trees on treespace

* Solution of a max flow min cut algorithm in `$O(n_{\text{leaf}}^{\small{4}})$`

* Other distances exist like Robison-Foulds (for topology) or Cophenetic (by vectorisation) distances
]

---
## Quantifying distance between trees

* __trees of primary interest__

* correlation tree on original data
  
  * taxonomic tree
  
--

* __what is the confident region for the correlation tree?__

* `$N_B$` correlation trees on boostrapped data (resampling on samples)

* __are trees significantly closer than two random trees?__

* `$N_{R_1}$` trees created by random shuffling of correlation tree tip labels
  
  * `$N_{R_2}$` trees created by random shuffling of taxonomic tree tip labels

We compute all pairwise distances between these trees

---
## Random shuffling

.center[<img src="img/shuffling.png", height="500">]

---
## Dataset

.footnote[ Ravel et al. (2011)]

* Vaginal microbiome of non pregnant women sequenced by 16S

* 40 different genera after filtering (~&#8239;30&#8239;%)

* 394 women with various Nugent score

.pull-left[ 
 <img src="img/tree_tax.png">
 .center[Taxonomic tree]
]

.pull-right[
 <img src="img/tree_cor.png">
 .center[Correlation tree] 
]

---
## Pairwise distances

.pull-left[
#### Distances to the correlation tree
.center[<img src="img/comp_boxplot.png" width="650">]
]

.pull-right[
#### Principal Coordinates Analysis
.center[<img src="img/comp_pcoa.png" width="650">]
]

.center[
__The correlation tree is different from the taxonomic tree__
]

---
class: center, middle, inverse

# Application

---
## Dataset

.footnote[ Caporaso et al. (2011), Sankaran & Holmes (2014)]

* Small subset of the `GlobalPatterns` dataset narrowed to Chlamydiae phylum

* Sequenced by 454

* 21 different species

* 26 samples representing 7 very different environments: soil, ocean, feces, skin...

* Find which bacteria are differentially abundant between environments

* Association using Fisher statistic

---
## Correlations

.center[<img src="img/cormat.png" height=450/>]

---
## Detected species with different corrections

.center[<img src="img/venn_0.png" height=450/>]

---
count: false
## Detected species with different corrections

.center[<img src="img/venn_rand.png" height=450/>]

---
count: false
## Detected species with different corrections

.center[<img src="img/venn_phy.png" height=450/>]

---
count: false
## Detected species with different corrections

.center[<img src="img/venn_cor.png" height=450/>]

---
## Back to abundances

.center[<img src="img/abundances.png" height=500/>]

---
## Representation of evidences on trees

.center[<img src="img/trees_w_pvalues.png" height=500/>]

---
class: center, middle, inverse

# Simulations

---
## Differentially abundant dataset

* Take a homogeneous dataset

* Arbitrarly assign a group (A or B) to each sample

* Select taxa that will be differentially abundant

* Apply a fold change on these taxa only in one group

---
## Analysis

.footnote[ Brito et al. (2016), Pasolli et al. (2017)]

* Stool from healthy patients dataset came from Curated Metagenomics Data

* 5 trees are used:

* correlation tree
  
  * taxonomy
  
  * random correlation tree
  
  * random taxonomy
  
  * oracle tree

---
class: nologo
## Results: FDR

.center[
<img src="img/FDRlines.png" width=570/>
]

---
class: nologo
## Results: TPR

.center[
<img src="img/TPRlines.png" width=570/>
]

---
class: center, middle, inverse
count: false

# Thanks for you attention!

#### <a href="https://github.com/abichat" target="_blank">@abichat</a>
<div style = "margin-top: -10px"></div>
#### <a href="https://twitter.com/_abichat" target="_blank">@_abichat</a>
<div style = "margin-top: -10px"></div>
#### <a href="https://www.linkedin.com/in/antoinebichat" target="_blank">antoinebichat</a>
<div style = "margin-top: -10px"></div>
#### <a href="https://abichat.github.io" target="_blank">abichat.github.io</a>
<div style = "margin-top: -10px"></div>
#### <a href="mailto:abichat@enterome.com?subject=Presentation">abichat@enterome.com</a>