Impact of tree choice in metagenomics differential abundance studies

class: title-slide, center, middle
count: false

# Impact of tree choice in metagenomics differential abundance studies

## Antoine Bichat
<div style = "margin-top: -30px"></div>

### June 6, 2019 - Journées de Statistique - Nancy

#### In collaboration with C. Ambroise (LaMME), M. Mariadassou (MaIAGE) & J. Plassais (Enterome)

---
class: center, middle, inverse

# Context

---
## Microbiota

_Ecological community of microorganisms that reside in an environmental niche_

.footnote[📘 Gut: The Inside Story of Our Body's Most Underrated Organ (Giulia Enders) 📄 Opstelten et al. (2016), Bokulich et al. (2016), Blander et al. (2017)]

.pull-left[

#### Some figures for human gut

* `$10^{\small{14}}$` bacterial cells in one gut...

* ... weighing 2 kg

* More than 1&#8239;500 different species

* More than 10 millions unique genes

]

.pull-right[

#### Proven associations

* Immune system

* Crohn's disease

* Vaginosis

* Diabete

* Tobacco

* Diet

* Antibiotics

* Birth mode

]

---
## Data - abundances of taxa

.footnote[📄 Ravel et al. (2011)]

```
 # A tibble: 122 x 395
 Taxa S001 S002 S003 S004 S005 S006 S007 S008 S009 S010
 <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Lactob… 2318 1388 1361 2256 88 1770 1490 119 2136 1790
 2 Prevot… 0 1 1 0 525 7 134 753 0 0
 3 Megasp… 0 1 0 0 402 0 4 102 0 0
 4 Sneath… 0 0 0 0 302 0 35 272 0 0
 5 Atopob… 0 1 0 0 84 0 12 54 0 0
 6 Strept… 0 0 3 0 0 0 138 4 0 2
 7 Dialis… 0 1 0 0 152 4 2 192 0 0
 8 Anaero… 0 1 3 2 0 9 12 13 0 0
 9 Pepton… 0 1 0 0 7 2 6 50 0 0
 10 Eggert… 0 0 0 0 2 0 0 7 0 0
 # … with 112 more rows, and 384 more variables
```

* Count data (or compositional) data

* Zero-inflated data

* Correlation between species

* Counts spanning several orders of magnitude: `$1 \rightarrow 10^{\small{8}}$`

---
## Data - taxonomy

.footnote[📄 Ravel et al. (2011)]

```
 # A tibble: 129 x 5
 Phylum Class Order Family Genus 
 <chr> <chr> <chr> <chr> <chr> 
 1 Actinobacteria Actinobacteria Actinomycetal… Actinomycetac… Actinobaculum
 2 Actinobacteria Actinobacteria Actinomycetal… Actinomycetac… Actinomyces 
 3 Actinobacteria Actinobacteria Actinomycetal… Actinomycetac… Arcanobacter…
 4 Actinobacteria Actinobacteria Actinomycetal… Actinomycetac… Mobiluncus 
 5 Actinobacteria Actinobacteria Actinomycetal… Actinomycetac… Varibaculum 
 # … with 124 more rows
```

---
class: nologo
count: false
## Data - taxonomy

.center[<img src="img/tree_tax.png"/ width="500">]

---
class: center, middle, inverse

# Differential abundance studies

---
## Statistical issue

Univariate tests on hundred of taxa

Need for a multiple testing controling procedure!

.footnote[📄 Philippot et al. (2010)]

.pull-left[
A hierarchy is available

<img src="img/coherence.png" width="400">
]

.pull-right[
Can we use it to do it better?
]

---
count: false
## Statistical issue

.footnote[📄 Philippot et al. (2010)]

Univariate tests on hundred of taxa

Need for a multiple testing controling procedure!

.pull-left[
A hierarchy is available

<img src="img/coherence.png" width="400">
]

.pull-right[

Can we use it to do it better?

* Hierarchical FDR

* z-scores smoothing
]

---
## Hierarchical FDR

.footnote[📄 Yekutieli (2008)]

`$\mathcal{T}_t = \left\{H_i \mid \text{Par}(i) = t\right\}$`

Descending method:

* Test the family `$\mathcal{T}_0$`

* If node `$t$` is rejected, test `$\mathcal{T}_t$` with a classical BH procedure at level `$q$`

.center[
 <img src="img/tree_hyp.png" width="420">
]

---
count: false
## Hierarchical FDR

.footnote[📄 Yekutieli (2008)]

`$\mathcal{T}_t = \left\{H_i \mid \text{Par}(i) = t\right\}$`

Descending method:

* Test the family `$\mathcal{T}_0$`

* If node `$t$` is rejected, test `$\mathcal{T}_t$` with a classical BH procedure at level `$q$`

.center[
 <img src="img/tree_hyp_1.png" width="420">
]

---
count: false

## Hierarchical FDR

.footnote[📄 Yekutieli (2008)]

`$\mathcal{T}_t = \left\{H_i \mid \text{Par}(i) = t\right\}$`

Descending method:

* Test the family `$\mathcal{T}_0$`

* If node `$t$` is rejected, test `$\mathcal{T}_t$` with a classical BH procedure at level `$q$`

.center[
 <img src="img/tree_hyp_2.png" width="420">
]

---
count: false
## Hierarchical FDR

.footnote[📄 Yekutieli (2008)]

`$\mathcal{T}_t = \left\{H_i \mid \text{Par}(i) = t\right\}$`

Descending method:

* Test the family `$\mathcal{T}_0$`

* If node `$t$` is rejected, test `$\mathcal{T}_t$` with a classical BH procedure at level `$q$`

.center[
 <img src="img/tree_hyp_3.png" width="420">
]

.center[
__This procedure controls the FDR at level__

`$\large{1.44 \times q \times \frac{\#\text{discoveries } +\text{ } \#\text{families tested}}{\#\text{discoveries } +\text{ } 1}}$`
]

---
## z-scores smoothing

.footnote[📄 Xiao et al. (2017)]

Denote by `$\mathbf{z}$` the vector of observed z-values and `$\mathbf{\mu}$` the vector of "true" z-values

Assume that 
`$\mathbf{z} | \mathbb{\mu} \sim \mathcal{N}_n \left( \mathbb{\mu}, \sigma^2 \mathbf{I}_m \right)$` and `$\mathbf{\mu} \sim \mathcal{N}_m\left(\gamma \mathbf{1} , \tau^2 \mathbf{C}_{\rho} \right)$`

then 
`$$\mathbf{z} \sim \mathcal{N}_m \left(\gamma \mathbf{1},\tau^2 \mathbf{C}_{\rho} + \sigma^2 \mathbf{I}_m\right)$$`

and Bayes formula gives 
 
`$$\mathbb{\mu}^* = \left(\mathbf{I}_m + \frac{\sigma_0^2}{\tau_0^2} \mathbf{C}_{\rho_0}^{-1}\right)^{-1}\left(\frac{\sigma_0^2}{\tau_0^2} \mathbf{C}_{\rho_0}^{-1}\gamma_0 \mathbf{1} + \mathbf{z}\right)$$`

with `$\sigma_0$`, `$\tau_0$`, `$\rho_0$` and `$\gamma_0$` hypermarameters

After smoothing, a multiple testing correction could be done on smoothed values

---
## Which tree?

#### Taxonomy? Phylogeny?

* Proxy for correlations at high-level niches

* Not so much for low-level niches?

* Not available everytime

---
class: nologo
count: false
## Which tree?

#### Taxonomy? Phylogeny?

* Proxy for correlations at high-level niches

* Not so much for low-level niches?

* Not available everytime

#### Correlation tree?

* Actual correlation between taxa

* Computed from data using pairwise correlation

.center[
<img src="img/tree_cor.png" width="350">
]

---
class: center, middle, inverse

# Comparison of trees

---
## Billera-Holmes-Vogtamnn distance

.footnote[📄 Billera, Holmes & Vogtmann (2001)]

The BHV distance is the length of the unique shortest path between the trees on treespace

.pull-left[
 .center[
<img src="img/dBHV_2.png" height=300/>
 ]
]

.pull-right[
 .center[
 <img src="img/dBHV_4.png" height=300/>
 ]
]

---
## Quantifying distance between trees

* __trees of primary interest__

* correlation tree on original data
 
 * taxonomy
 
--

* __what is the confident region for the correlation tree?__

* correlation trees on boostrapped data (resampling on samples)

* __are trees significantly closer than two random trees?__

* trees created by random shuffling of correlation tree tip labels
 
 * trees created by random shuffling of taxonomy tip labels

We compute all pairwise distances between these trees

---
## Random shuffling

.center[<img src="img/shuffling.png", height="500">]

---
## Dataset

.footnote[📄 Ravel et al. (2011)]

* Vaginal microbiome of non pregnant women sequenced by 16S

* 40 different genera after filtering (~&#8239;30&#8239;%)

.pull-left[ 
 <img src="img/tree_tax.png">
 .center[Taxonomy]
]

.pull-right[
 <img src="img/tree_cor.png">
 .center[Correlation tree] 
]

---
## Pairwise distances

.pull-left[
.center[<img src="img/prez_ravel_comp_box.png" width="650">]
]

.pull-right[
.center[<img src="img/prez_ravel_comp_pcoa.png" width="650">]
]

😀 The correlation tree is different from the taxonomy

---
class: center, middle, inverse

# Evaluation of hFDR

---
## Dataset

.footnote[📄 Caporaso et al. (2011), Sankaran & Holmes (2014)]

* Small subset of the `GlobalPatterns` dataset narrowed to Chlamydiae phylum

* 21 different OTUs

* 26 samples representing 9 very different environments: soil, ocean, feces, skin...

## Method

* Find which bacteria are differentially abundant between environments

* Association using Fisher statistic (ANOVA)

* Correction with hierarchical FDR

---
## Abundances of detected species

.center[<img src="img/realdata_chlamydia_box.png" height=500/>]

---
## Representation of evidences on trees

.center[<img src="img/tree_mirors.png" height=500/>]

---
## But...

`$\alpha = 0.10$` is only the family-level FDR.

The _a posteriori_ global FDR is:

* `$\alpha' = 0.32$` for phylogenetic correction

* `$\alpha' = 0.324$` for correlation correction

A BH procedure at the same global FDR level leads to 15 discoveries (+5)

😀 Using correlation tree instead of taxonomy yields more results

☹️ Vanilla BH beats hFDR for a given level

---
class: center, middle, inverse

# Evaluation of z-scores smoothing

---
## Dataset

.footnote[📄 Zeller et al. (2014)]

* Dataset from cancer study

* 119 different genera (after filtering)

* 199 samples: 42 adenoma, 91 carcinoma and 66 control

## Method

* Find which bacteria are differentially abundant between diseases

* Association using Kruskal-Wallis test

* Correction with hierarchical p-value smoothing

---
## Impact of the tree

.center[<img src="img/zeller-lines-bold.png" height=430/>]

😊 z-scores smoothing is slightly better than vanilla BH

☹️ All hierachies give highly similar results

---
class: center, middle, inverse

# Simulations

---
## Workflow

.footnote[📄 Brito et al. (2016)]

* Simulate DA taxa starting from an homogeneous dataset

* Correction with BH and hierarchical p-value smoothing

.center[<img src="img/workflow_simu.png" height=450/>]

---
## Evaluation

.center[<img src="img/simu_np_lines_prez.png" height=390/>]

😀 Using correlation tree instead of taxonomy yields more results

☹️ Vanilla BH is better

🤔 Taxonomy is worst than random trees

---
class: center, middle, inverse

# Conclusions

---

## Conclusions

😀 Correlation tree and taxonomy are very different

😀 Replacing taxonomy with correlationn increases the TPR

☹️ Vanilla BH is more powerful than hFDR

☹️ Bayesian smoothing does not really depend on the tree for z-scores smoothing

📦 `correlationtree`

📦 `yatah`

📦 `evabic`

---
class: title-slide, center, middle
count: false

# Thanks for your attention!

# Questions?

#### <a href="https://abichat.github.io" target="_blank">abichat.github.io</a> &emsp; <a href="https://github.com/abichat" target="_blank">@abichat</a> &emsp; <a href="https://twitter.com/_abichat" target="_blank">@_abichat</a>