Impact of tree choice in metagenomics differential abundance studies

class: title-slide, center, middle
count: false

.band[
# Impact of tree choice in metagenomics differential abundance studies

### Antoine Bichat
<div style = "margin-top: -30px"></div>
### January 14, 2019 - AgroParisTech

Work in progress - In collaboration with C. Ambroise (LaMME), 
M. Mariadassou (MaIAGE) & J. Plassais (Enterome)
]

---
class: center, middle, inverse

# Context

---
## Microbiota

_Ecological community of microorganisms that reside in an environmental niche_

.footnote[ Gut: The Inside Story of Our Body's Most Underrated Organ (Giulia Enders) &emsp;]

.pull-left[

#### Some figures for human gut

* `\(10^{\small{15}}\)` bacterial cells in one gut...

* ... weighing 2 kg

* More than 1&#8239;500 different species

* More than 10 millions unique genes

]

--
.footnote[ Gut: The Inside Story of Our Body's Most Underrated Organ (Giulia Enders) Opstelten et al. (2016), Bokulich et al. (2016), Blander et al. (2017)]

.pull-right[

#### Proven associations

* Immune system

* Crohn's disease

* Vaginosis

* Diabete

* Tobacco

* Diet

* Antibiotics

* Birth mode

]

---
## Data collection

.footnote[ Quince et al. (2017)]

.center[<img src="img/workflow.png" width="650">]

---
count: false
## Data collection

.footnote[ Quince et al. (2017)]

.center[<img src="img/workflow_ellipse.png" width="650">]

---
## Data - abundances of taxa

.footnote[ Ravel et al. (2011)]

```
 # A tibble: 122 x 395
 Taxa S001 S002 S003 S004 S005 S006 S007 S008 S009 S010
 <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Lactob… 2318 1388 1361 2256 88 1770 1490 119 2136 1790
 2 Prevot… 0 1 1 0 525 7 134 753 0 0
 3 Megasp… 0 1 0 0 402 0 4 102 0 0
 4 Sneath… 0 0 0 0 302 0 35 272 0 0
 5 Atopob… 0 1 0 0 84 0 12 54 0 0
 6 Strept… 0 0 3 0 0 0 138 4 0 2
 7 Dialis… 0 1 0 0 152 4 2 192 0 0
 8 Anaero… 0 1 3 2 0 9 12 13 0 0
 9 Pepton… 0 1 0 0 7 2 6 50 0 0
 10 Eggert… 0 0 0 0 2 0 0 7 0 0
 # … with 112 more rows, and 384 more variables
```

* Count data (or compositional) data

* Zero-inflated data

* __Correlation between species__

* Counts spanning several orders of magnitude: `\(1 \rightarrow 10^{\small{8}}\)`

---
## Data - samples information

.footnote[ Ravel et al. (2011)]

```
 # A tibble: 394 x 6
 Sample Ethnic_Group pH Nugent_Score Nugent_Cat CST 
 <chr> <chr> <dbl> <dbl> <chr> <chr>
 1 S001 Asian 4 0 low I 
 2 S002 White 4 0 low II 
 3 S003 Black 4 1 low III 
 4 S004 Asian 4.7 0 low I 
 5 S005 Black 5 6 intermediate IV 
 6 S006 White 4 0 low I 
 7 S007 White 4.7 1 low II 
 8 S008 White 5.8 9 high IV 
 9 S009 White 4.4 0 low III 
 10 S010 Black 4.4 1 low I 
 # … with 384 more rows
```

---
## Data - taxonomy

.footnote[ Ravel et al. (2011)]

```
 # A tibble: 129 x 5
 Phylum Class Order Family Genus 
 <chr> <chr> <chr> <chr> <chr> 
 1 Actinobacteria Actinobacteria Actinomycetal… Actinomycetac… Actinobaculum
 2 Actinobacteria Actinobacteria Actinomycetal… Actinomycetac… Actinomyces 
 3 Actinobacteria Actinobacteria Actinomycetal… Actinomycetac… Arcanobacter…
 4 Actinobacteria Actinobacteria Actinomycetal… Actinomycetac… Mobiluncus 
 5 Actinobacteria Actinobacteria Actinomycetal… Actinomycetac… Varibaculum 
 # … with 124 more rows
```

---
class: nologo
count: false
## Data - taxonomy

.center[<img src="img/tree_tax.png"/ width="500">]

---
class: center, middle, inverse

# Differential abundance studies

---
## Statistical issue

.center[
#### Healthy VS Diseased
<div style = "margin-top: -25px"></div>
.center[<img src="img/cohort.png"/ width="375">]
]

Is any bacteria more abundant in one group?

.pull-left[
#### Goals
* Biomarker

* Companion Diagnostics

* Potential targets for drugs
]

.pull-right[
#### Methods
* ANOVA / GLM

* Wilcoxon rank sum test

* Mixed model effects

* ...
]

---
## Multiple testing problem

Usual design:

* 1500 species in the gut, up to hundreds in studies

* Tens or hundreds of patients

Need for a controling procedure!

* Bonferroni (FWER)

---
count: false

## Multiple testing problem

Usual design:

* 1500 species in the gut, up to hundreds in studies

* Tens or hundreds of patients

Need for a controling procedure!

* ~~Bonferroni (FWER)~~ `\(\rightarrow\)` too conservative

* Benjamini-Hochberg (FDR)

But BH requires independance and there is lot of dependancies between species...

.center[
<img src="img/cor_bact_firm.png" width="300">
]

---
## Hierarchical FDR

.footnote[ Yekutieli (2008)]

.pull-left[

`\(\mathcal{T}_t = \left\{H_i \mid \text{Par}(i) = t\right\}\)`, `\(L\)` levels

Descending method:

* Test the family `\(\mathcal{T}_0\)`

* If node `\(t\)` is rejected, test `\(\mathcal{T}_t\)`
]

.pull-right[
<img src="img/tree_hyp.png" width="450">
]

How to test a family `\(\mathcal{T}_t\)` (simple BH procedure at level `\(q\)` within it):

1. Let `\(P^t_{(1)}\leq \dots\leq P^t_{(m_t)}\)` denote the set of ordered p-values corresponding to the hypotheses in `\(\mathcal{T}_t\)`

2. Let `\(r_t = \max\left\{i\mid P^t_{(i)} \leq i \frac{q}{m_t}\right\}\)`

3. Reject the `\(r_t\)` hypothesis with the smaller p-value

.center[
__The procedure controls the FDR at level `\(\mathbf{2.88 \times q \times L}\)`__
]

---
count: false
## Hierarchical FDR

.footnote[ Yekutieli (2008)]

.pull-left[

`\(\mathcal{T}_t = \left\{H_i \mid \text{Par}(i) = t\right\}\)`, `\(L\)` levels

Descending method:

* Test the family `\(\mathcal{T}_0\)`

* If node `\(t\)` is rejected, test `\(\mathcal{T}_t\)`
]

.pull-right[
<img src="img/tree_hyp_1.png" width="450">
]

How to test a family `\(\mathcal{T}_t\)` (simple BH procedure at level `\(q\)` within it):

1. Let `\(P^t_{(1)}\leq \dots\leq P^t_{(m_t)}\)` denote the set of ordered p-values corresponding to the hypotheses in `\(\mathcal{T}_t\)`

2. Let `\(r_t = \max\left\{i\mid P^t_{(i)} \leq i \frac{q}{m_t}\right\}\)`

3. Reject the `\(r_t\)` hypothesis with the smaller p-value

.center[
__The procedure controls the FDR at level `\(\mathbf{2.88 \times q \times L}\)`__
]

---
count: false
## Hierarchical FDR

.footnote[ Yekutieli (2008)]

.pull-left[

`\(\mathcal{T}_t = \left\{H_i \mid \text{Par}(i) = t\right\}\)`, `\(L\)` levels

Descending method:

* Test the family `\(\mathcal{T}_0\)`

* If node `\(t\)` is rejected, test `\(\mathcal{T}_t\)`
]

.pull-right[
<img src="img/tree_hyp_2.png" width="450">
]

How to test a family `\(\mathcal{T}_t\)` (simple BH procedure at level `\(q\)` within it):

1. Let `\(P^t_{(1)}\leq \dots\leq P^t_{(m_t)}\)` denote the set of ordered p-values corresponding to the hypotheses in `\(\mathcal{T}_t\)`

2. Let `\(r_t = \max\left\{i\mid P^t_{(i)} \leq i \frac{q}{m_t}\right\}\)`

3. Reject the `\(r_t\)` hypothesis with the smaller p-value

.center[
__The procedure controls the FDR at level `\(\mathbf{2.88 \times q \times L}\)`__
]

---
count: false
## Hierarchical FDR

.footnote[ Yekutieli (2008)]

.pull-left[

`\(\mathcal{T}_t = \left\{H_i \mid \text{Par}(i) = t\right\}\)`, `\(L\)` levels

Descending method:

* Test the family `\(\mathcal{T}_0\)`

* If node `\(t\)` is rejected, test `\(\mathcal{T}_t\)`
]

.pull-right[
<img src="img/tree_hyp_3.png" width="450">
]

How to test a family `\(\mathcal{T}_t\)` (simple BH procedure at level `\(q\)` within it):

1. Let `\(P^t_{(1)}\leq \dots\leq P^t_{(m_t)}\)` denote the set of ordered p-values corresponding to the hypotheses in `\(\mathcal{T}_t\)`

2. Let `\(r_t = \max\left\{i\mid P^t_{(i)} \leq i \frac{q}{m_t}\right\}\)`

3. Reject the `\(r_t\)` hypothesis with the smaller p-value

.center[
__The procedure controls the FDR at level `\(\mathbf{2.88 \times q \times L}\)`__
]

---
## Which tree?

.pull-left[Taxonomic tree?

<img src="img/tree_tax.png"> 
]

.pull-right[Correlation tree?

<img src="img/tree_cor.png"> ]

.center[__Are they similar for the procedure?__]

---
class: center, middle, inverse

# Comparison of trees

---
## Correlation tree

* Pairwise correlation matrix `C` is computed from Spearman correlation after removing shared zeros between two species

```r
cor(x[x+y > 0], y[x+y > 0], method = "spearman")
```
--

* Dissimilarity matrix `D` is obtained after a transformation of `C`

```r
D <- 1 - C
```
--

* Hierarchical clustering is used to create the correlation tree `tree_cor` from `D`

```r
tree_cor <- hclust(D)
```

.center[__Conditionally to abundances, the construction of the correlation tree is independant of samples covariables__]

---
## Billera-Holmes-Vogtamnn distance

.footnote[ Billera, Holmes & Vogtmann (2001)]

The BHV distance between two trees is the total length of intern branch that one needs to contract or extend to transform one tree into the other

.pull-left[
 .center[
<img src="img/dBHV_1.png" height=300/>
 ]
]

.pull-right[
 .center[
 <img src="img/dBHV_2.png" height=300/>
 ]
]

---
count: false

## Billera-Holmes-Vogtamnn distance

.footnote[ Billera, Holmes & Vogtmann (2001)]

The BHV distance between two trees is the total length of intern branch that one needs to contract or extend to transform one tree into the other

.pull-left[
 .center[
<img src="img/dBHV_3.png" height=300/>
 ]
]

.pull-right[
 .center[
 <img src="img/dBHV_4.png" height=300/>
 ]
]

.center[__There exists a distance on the space of trees__]

---
## Quantifying distance between trees

* __trees of primary interest__

* correlation tree on original data
  
  * taxonomic tree
  
--

* __what is the confident region for the correlation tree?__

* `\(N_B\)` correlation trees on boostrapped data

* __are trees significantly closer than two random trees?__

* `\(N_{R_1}\)` trees created by random shuffling of correlation tree tip labels
  
  * `\(N_{R_2}\)` trees created by random shuffling of taxonomic tree tip labels

We compute all pairwise distances between these trees

---
## Random shuffling

.center[<img src="img/shuffling.png", height="500">]

---
## Dataset

.footnote[ Ravel et al. (2011)]

* Vaginal microbiome of non pregnant women sequenced by 16S

* 40 different genera after filtering (~&#8239;30&#8239;%)

* 394 women with various Nugent score

.pull-left[<img src="img/tree_tax.png">]

.pull-right[<img src="img/tree_cor.png">]

---
## Distance to the correlation tree

.center[<img src="img/comp_boxplot.png" width="650">]

---
## Principal Coordinates Analysis

.center[<img src="img/comp_pcoa.png" width="650">]

---
class: center, middle, inverse

# Application

---
## Dataset

.footnote[ Caporaso et al. (2011), Sankaran & Holmes (2014)]

* Small subset of the `GlobalPatterns` dataset narrowed to Chlamydiae phylum

* Sequenced by 454

* 21 different species

* 26 samples representing 7 environments: soil, ocean, feces, skin...

* Find which bacteria are differentially abundant bewtween environments

* Association using Fisher statistic

---
## Detected species with different corrections

.center[<img src="img/venn.png" height=450/>]

---
count: false
## Detected species with different corrections

.center[<img src="img/venn_bh.png" height=450/>]

---
count: false
## Detected species with different corrections

.center[<img src="img/venn_phy.png" height=450/>]

---
count: false
## Detected species with different corrections

.center[<img src="img/venn_cor.png" height=450/>]

---
count: false
## Detected species with different corrections

.center[<img src="img/venn_rand.png" height=450/>]

---
## Back to abundances

.center[<img src="img/abundances.png" height=500/>]

---
## Representation of evidences on trees

.center[<img src="img/trees_w_pvalues.png" height=500/>]

---
class: center, middle, inverse

# Reproductibility

---
## package `{correlationtree}`

Available (soon) on my GitHub account: <a href="https://github.com/abichat" target="_blank">@abichat</a>

```r
devtools::install_github("abichat/correlationtree")
```

* Computation of correlation trees

* Comparison of tree distances

* Shiny application

---
## Shiny app

.footnote[ Pasolli et al. (2017)]

Allow to interactively <img src="img/hex-shiny.png" align="right" height=140/>

* Select data from standardized datasets available in package `{curatedMetagenomicData}` and filter it

* Visualize abundances by categorical variables

* Compute tree distances using Billera-Holmes-Vogtman, Robinson-Foulds and Cophenetic distances

.pull-left[<img src="img/shinyapp_1.png">]
.pull-right[<img src="img/shinyapp_2.png">]

---
class: center, middle, inverse

# Perspectives

---
## Perspectives

* Work with other hierarchical correction methods (bayesian smoothing)

* Test the correlation tree on simulated data

* Release the package

* Publish the article

---
class: center, middle, inverse
count: false

# Thanks for you attention!

#### <a href="https://github.com/abichat" target="_blank">@abichat</a>
<div style = "margin-top: -10px"></div>
#### <a href="https://twitter.com/_abichat" target="_blank">@_abichat</a>
<div style = "margin-top: -10px"></div>
#### <a href="https://www.linkedin.com/in/antoinebichat" target="_blank">antoinebichat</a>
<div style = "margin-top: -10px"></div>
#### <a href="https://abichat.github.io" target="_blank">abichat.github.io</a>
<div style = "margin-top: -10px"></div>
#### <a href="mailto:abichat@enterome.com?subject=Presentation">abichat@enterome.com</a>