Modeling in the Tidyverse

class: title-slide, center, middle
count: false

# Modeling in the Tidyverse
<div style = "margin-top: -40px"></div>
## and applications to metagenomics data

### Antoine Bichat
<div style = "margin-top: -30px"></div>
### November 23, 2018 - AgroParisTech

---
class: center, middle, inverse

# Tidyverse

---
## Metapackage

```r
library(tidyverse)
```

```
  ── Attaching packages ─────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
```

```
  ✔ ggplot2 3.1.0     ✔ purrr   0.2.5
  ✔ tibble  1.4.2     ✔ dplyr   0.7.8
  ✔ tidyr   0.8.2     ✔ stringr 1.3.1
  ✔ readr   1.1.1     ✔ forcats 0.3.0
```

```
  ── Conflicts ────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
  ✖ dplyr::filter() masks stats::filter()
  ✖ dplyr::lag()    masks stats::lag()
```

.footnote[<a href="https://abichat.github.io/slides/introtidyverse" target="_blank"> Introduction to the Tidyverse</a>]

```r
theme_set(theme_minimal())
options("tibble.max_extra_cols" = 5)
```

---
## Data analysis pipeline

.footnote[<a href="https://r4ds.had.co.nz" target="_blank"> R for data science (G. Grolemund & H. Wickham)</a>]

---
class: center, middle, inverse

# Dplyr

---
## List of dplyr's useful functions

.pull-left[

* `mutate()`

* `rename()`

* `select()`

* `arrange()`, `arrange(desc())`

* `pull()`

* `filter()`

* `distinct()`

* `top_n()`

* `sample_n()`, `sample_frac()`
]

.pull-right[

* `*_if()`, `*_at()`, `*_all()`

* `*_join()`

* `group_by()`

* `summarise()`

* `first()`, `last()`, `nth()`

* `n()`, `n_distinct()`

* `row_numbers()`

* `lead()`, `lag()`, `cummean()` ...
]

---
class: center, middle, inverse

# Tidyr

---

## List of tidyr's useful functions

.pull-left[
* `spread()` & `gather()`

* `separate()` & `unite()`

* `nest()` & `unnest()`

* `fill()`
]

.pull-right[
<img src="img/tidyr-spread-gather.gif", width=300> 
]

```r
pryr::object_size(iris)
```

```
  7.2 kB
```

```r
pryr::object_size(gather(iris, key = "Part", value = "Size", -Species))
```

```
  13.8 kB
```

.footnote[<a href="https://github.com/gadenbuie/tidyexplain" target="_blank"> Animations of tidyverse verbs (G. Aden-Buie)</a>]

---
class: center, middle, inverse

# Purrr

---
## Functionnal programming

* Never use `for` loops

* Forget all `*apply()` functions and friends for lists and dataframes

* `apply`, `lapply()`, `sapply()`, `tapply()`, `mapply()`, ...
  
  * `Map()`, `Reduce()`
  
  * `Vectorize()`

* Embrace `map()` function and variants

* `map()`, `map_dbl()`, `map_chr()`, `map_lgl()`, ...
  
  * `map2()`, `map2_dbl()`, `map2_chr()`, `map2_lgl()`, ...
  
  * `pmap()`, `pmap_dbl()`, `pmap_chr()`, `pmap_lgl()`, ...
  
  * `walk()`, `imap()` and variants

---
## `map*()`

* `map()` applies the function to each element and returns a list

* `map_lgl()`, `map_int()`, `map_dbl()` and `map_chr()` applies the function to each element and returns a atomic vector of the corresponding type

---
## `map*()`

```r
map(1:5, rnorm)
```

```
  [[1]]
  [1] 1.370958
  
  [[2]]
  [1] -0.5646982  0.3631284
  
  [[3]]
  [1]  0.6328626  0.4042683 -0.1061245
  
  [[4]]
  [1]  1.51152200 -0.09465904  2.01842371 -0.06271410
  
  [[5]]
  [1]  1.3048697  2.2866454 -1.3888607 -0.2787888 -0.1333213
```

---
count: false
## `map*()`

```r
map(1:5, rnorm) %>% 
  map_dbl(mean)
```

```
  [1]  1.3709584 -0.1007849  0.3103355  0.8431431  0.3581088
```

---
count: false
## `map*()`

```r
map(1:100, rnorm) %>% 
  map_dbl(mean) %>% 
  tibble(Mean = .) %>%       # Faster than 1:n() and rownames_to_column("N")
  mutate(N = row_number()) # Slightly different from rownames_to_column("N")
```

```
 # A tibble: 100 x 2
 Mean N
 <dbl> <int>
 1 1.37 1
 2 -0.101 2
 3 0.310 3
 4 0.843 4
 5 0.358 5
 6 -0.622 6
 7 -0.185 7
 8 0.0243 8
 9 -0.612 9
 10 0.314 10
 # ... with 90 more rows
```

---
count: false
## `map*()`

```r
map(1:100, rnorm) %>% 
  map_dbl(mean) %>% 
  tibble(Mean = .) %>% 
  mutate(N = row_number()) %>% 
  ggplot(aes(x = N, y = Mean)) +
    geom_hline(yintercept = 0, color = "red", alpha = 0.5) +
    geom_line()
```

---
## A better way with `compose()` and `map_df()`

```r
mean_rnorm <- compose(tibble, mean, rnorm)
mean_rnorm(5)
```

```
 # A tibble: 1 x 1
 out
 <dbl>
 1 -0.345
```

```r
map_dfr(1:10, mean_rnorm) %>% rownames_to_column("N")
```

```
 # A tibble: 10 x 2
 N out
 <chr> <dbl>
 1 1 1.37 
 2 2 -0.101 
 3 3 0.310 
 4 4 0.843 
 5 5 0.358 
 6 6 -0.622 
 7 7 -0.185 
 8 8 0.0243
 9 9 -0.612 
 10 10 0.314
```

---
## `map*()`

```r
map(1:4, rnorm, n = 6)
```

```
  [[1]]
  [1] 2.3709584 0.4353018 1.3631284 1.6328626 1.4042683 0.8938755
  
  [[2]]
  [1] 3.511522 1.905341 4.018424 1.937286 3.304870 4.286645
  
  [[3]]
  [1] 1.6111393 2.7212112 2.8666787 3.6359504 2.7157471 0.3435446
  
  [[4]]
  [1] 1.559533 5.320113 3.693361 2.218692 3.828083 5.214675
```

---
## `map*()`

```r
map(1:4, rnorm, n = 6) %>% 
  map_dbl(function(x) x[2])
```

```
  [1] 0.4353018 1.9053410 2.7212112 5.3201133
```

---
count: false
## `map*()`

```r
map(1:4, rnorm, n = 6) %>% 
  map_dbl(~ .[2])
```

```
  [1] 0.4353018 1.9053410 2.7212112 5.3201133
```

---
count: false
## `map*()`

```r
map(1:4, rnorm, n = 6) %>% 
  map_dbl(2)
```

```
  [1] 0.4353018 1.9053410 2.7212112 5.3201133
```

---
## Why `map()` is better

.footnote[<a href="https://colinfay.me/happy-dev-purrr/" target="_blank"> Happy dev with {purrr} (C.Fay)</a>]

* Stable and consistent grammar

```r
apply(X, MARGIN, FUN, ...)
lapply(X, FUN, ...)
sapply(X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE)
tapply(X, INDEX, FUN = NULL, ..., default = NA, simplify = TRUE)
mapply(FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE)
eapply(env, FUN, ..., all.names = FALSE, USE.NAMES = TRUE)
...
```

```r
map(.x, .f, ...)
map_if(.x, .p, .f, ...)
map_chr(.x, .f, ...)
map_int(.x, .f, ...)
map_dbl(.x, .f, ...)
...
```

---
count: false
## Why `map()` is better

.footnote[<a href="https://colinfay.me/happy-dev-purrr/" target="_blank"> Happy dev with {purrr} (C.Fay)</a>]

* Stable and consistent grammar

* Type stability

```r
sapply(iris$Sepal.Length, as.numeric) %>% class()
```

```
  [1] "numeric"
```

```r
sapply(iris$Sepal.Length, as.data.frame) %>% class()
```

```
  [1] "list"
```

```r
map_dbl(iris$Sepal.Length, as.numeric) %>% class()
```

```
  [1] "numeric"
```

```r
map_df(iris$Sepal.Length, as.data.frame) %>% class()
```

```
  [1] "data.frame"
```

---
count: false
## Why `map()` is better

.footnote[<a href="https://colinfay.me/happy-dev-purrr/" target="_blank"> Happy dev with {purrr} (C.Fay)</a>]

* Stable and consistent grammar

* Type stability

* Anonymous functions & verbosity

```r
lapply(list, function(x) x + 2)
mapply(function(x, y) x + y, list1, list2)
lapply(list, function(x) x[[2]])
lapply(list, function(x) x$foo)
```

```r
map(list, ~ . + 2)
map2(list1, list2, ~ .x + .y)
map(list, 2)
map(list, "foo")
```

---
count: false
## Why `map()` is better

.footnote[<a href="https://colinfay.me/happy-dev-purrr/" target="_blank"> Happy dev with {purrr} (C.Fay)</a>]

* Stable and consistent grammar

* Type stability

* Anonymous functions & verbosity

* But it's slightly slower... (~ 40 ns per element)

---
## `map2*()`

`map2()` and `map2_*()` are variants of `map()` and `map_*()` which work with two arguments

---
## `map2*()`

```r
map2(1:4, c(2, 5, 5, 10), runif, n = 5)
```

```
  [[1]]
  [1] 1.970967 1.618838 1.333427 1.346748 1.398485
  
  [[2]]
  [1] 4.354078 2.116809 4.246386 4.031830 2.513793
  
  [[3]]
  [1] 3.522176 4.028826 4.351215 4.965634 4.519089
  
  [[4]]
  [1] 7.398931 9.098138 5.136844 5.627720 8.968951
```

---
## `pmap*()`

`pmap()` and `pmap_*()` are generalized versions of `map()` and `map_*()` which work with any number of arguments

---
## `pmap*()`

```r
pmap(list(n = c(2, 3, 2, 5), min = 1:4, max = c(2, 5, 5, 10)), runif)
```

```
  [[1]]
  [1] 1.693205 1.240545
  
  [[2]]
  [1] 2.128966 2.421437 2.649156
  
  [[3]]
  [1] 3.958797 3.394821
  
  [[4]]
  [1] 8.316135 4.047308 6.252940 7.086446 4.009423
```

---
## `*walk*()` functions

`walk()` and variants work like `map()` but silently apply the function and return the list unchanged

Useful for saving files or plotting

---
class: nologo
## `imap*()`

`imap(x, fun)` is equivalent to `map2(x, names(x), fun)`

```r
imap(swiss, ~ ggplot(swiss, aes(x = Fertility, y = .x)) + 
       geom_point() + labs(x = "Fertility", y = .y)) %>%
  gridExtra::grid.arrange(grobs = ., ncol = 3)
```

---
count:false
class: nologo
## `imap*()`

`imap(x, fun)` is equivalent to `map2(x, names(x), fun)`

```r
swiss %>%
  imap(., function(x, y) ggplot(., aes(x = Fertility, y = x)) + 
         geom_point() + labs(x = "Fertility", y = y)) %>%
  gridExtra::grid.arrange(grobs = ., ncol = 3)
```

---
## `reduce()`

`reduce(list(x1, x2, x3), f)` is equivalent to `f(f(x1, x2), x3)`

---
## `reduce()`

```r
rerun(4, sample(1:10, 6)) 
```

```
  [[1]]
  [1] 3 4 5 7 2 8
  
  [[2]]
  [1] 10  6  9  1  2  7
  
  [[3]]
  [1]  7  4 10  9  5  6
  
  [[4]]
  [1]  4  7  8  2 10  1
```

```r
rerun(4, sample(1:10, 6)) %>% 
  reduce(intersect)
```

---
count: false
## `reduce()`

```r
rerun(4, sample(1:10, 6))
```

```
  [[1]]
  [1] 3 4 5 7 2 8
  
  [[2]]
  [1] 10  6  9  1  2  7
  
  [[3]]
  [1]  7  4 10  9  5  6
  
  [[4]]
  [1]  4  7  8  2 10  1
```

```r
rerun(4, sample(1:10, 6)) %>% 
  reduce(intersect)
```

```
  [1] 7
```

---
class: center, middle, inverse

# Tibble

---
## Nice printing

```r
mtcars
```

```
                       mpg cyl  disp  hp drat    wt  qsec vs am gear carb
  Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
  Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
  Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
  Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
  Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
  Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
  Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
  Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
  Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
  Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
  Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
  Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
  Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
  Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
  Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
  Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
  Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
  Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
  Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
  Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
  Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
  Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
  AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
  Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
  Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
  Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
  Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
  Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
  Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
  Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
  Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
  Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
```

---
count: false
## Nice printing

```r
as_tibble(mtcars)
```

```
 # A tibble: 32 x 11
 mpg cyl disp hp drat wt qsec vs am gear carb
 * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
 # ... with 22 more rows
```

---
count: false
## Nice printing ... but don't use row names!

```r
as_tibble(mtcars, rownames = "model")
```

```
 # A tibble: 32 x 12
 model mpg cyl disp hp drat wt qsec vs am gear carb
 <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Mazd… 21 6 160 110 3.9 2.62 16.5 0 1 4 4
 2 Mazd… 21 6 160 110 3.9 2.88 17.0 0 1 4 4
 3 Dats… 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
 4 Horn… 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
 5 Horn… 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
 6 Vali… 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
 7 Dust… 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
 8 Merc… 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
 9 Merc… 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
 10 Merc… 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
 # ... with 22 more rows
```

---
## Consistancy in subsetting

```r
class(iris[, 1:2])
```

```
  [1] "data.frame"
```

```r
class(iris[, 1])
```

---
count: false
## Consistancy in subsetting

```r
class(iris[, 1:2])
```

```
  [1] "data.frame"
```

```r
class(iris[, 1])
```

```
  [1] "numeric"
```

--

```r
iris_tbl <- as_tibble(iris)
class(iris_tbl[, 1:2])
```

```
  [1] "tbl_df"     "tbl"        "data.frame"
```

```r
class(iris_tbl[, 1])
```

```
  [1] "tbl_df"     "tbl"        "data.frame"
```

---
class: center, middle, noslidenumber, nologo

---
## A little function

```r
divisors <- function(N){
 if(N == 1) {return(1)}
 if(N == 2) {return(1:2)}
 c(1, (2:(N/2))[N %% 2:(N/2) == 0], N)
}
```

```r
map(4:8, divisors)
```

```
  [[1]]
  [1] 1 2 4
  
  [[2]]
  [1] 1 5
  
  [[3]]
  [1] 1 2 3 6
  
  [[4]]
  [1] 1 7
  
  [[5]]
  [1] 1 2 4 8
```

---
## List-columns

```r
tibble(N = 1:50)
```

```
 # A tibble: 50 x 1
 N
 <int>
 1 1
 2 2
 3 3
 4 4
 5 5
 6 6
 7 7
 8 8
 9 9
 10 10
 # ... with 40 more rows
```

---
count: false
## List-columns

```r
tibble(N = 1:50) %>% 
  mutate(Divisors = map(N, divisors))
```

```
 # A tibble: 50 x 2
 N Divisors 
 <int> <list> 
 1 1 <dbl [1]>
 2 2 <int [2]>
 3 3 <dbl [2]>
 4 4 <dbl [3]>
 5 5 <dbl [2]>
 6 6 <dbl [4]>
 7 7 <dbl [2]>
 8 8 <dbl [4]>
 9 9 <dbl [3]>
 10 10 <dbl [4]>
 # ... with 40 more rows
```

---
count: false
## List-columns

```r
tibble(N = 1:50) %>% 
  mutate(Divisors = map(N, divisors)) %>% 
  pull(Divisors)
```

```
  [[1]]
  [1] 1
  
  [[2]]
  [1] 1 2
  
  [[3]]
  [1] 1 3
  
  [[4]]
  [1] 1 2 4
  
  [[5]]
  [1] 1 5
  
  [[6]]
  [1] 1 2 3 6
  
  [[7]]
  [1] 1 7
  
  [[8]]
  [1] 1 2 4 8
  
  [[9]]
  [1] 1 3 9
  
  [[10]]
  [1]  1  2  5 10
  
  [[11]]
  [1]  1 11
  
  [[12]]
  [1]  1  2  3  4  6 12
  
  [[13]]
  [1]  1 13
  
  [[14]]
  [1]  1  2  7 14
  
  [[15]]
  [1]  1  3  5 15
  
  [[16]]
  [1]  1  2  4  8 16
  
  [[17]]
  [1]  1 17
  
  [[18]]
  [1]  1  2  3  6  9 18
  
  [[19]]
  [1]  1 19
  
  [[20]]
  [1]  1  2  4  5 10 20
  
  [[21]]
  [1]  1  3  7 21
  
  [[22]]
  [1]  1  2 11 22
  
  [[23]]
  [1]  1 23
  
  [[24]]
  [1]  1  2  3  4  6  8 12 24
  
  [[25]]
  [1]  1  5 25
  
  [[26]]
  [1]  1  2 13 26
  
  [[27]]
  [1]  1  3  9 27
  
  [[28]]
  [1]  1  2  4  7 14 28
  
  [[29]]
  [1]  1 29
  
  [[30]]
  [1]  1  2  3  5  6 10 15 30
  
  [[31]]
  [1]  1 31
  
  [[32]]
  [1]  1  2  4  8 16 32
  
  [[33]]
  [1]  1  3 11 33
  
  [[34]]
  [1]  1  2 17 34
  
  [[35]]
  [1]  1  5  7 35
  
  [[36]]
  [1]  1  2  3  4  6  9 12 18 36
  
  [[37]]
  [1]  1 37
  
  [[38]]
  [1]  1  2 19 38
  
  [[39]]
  [1]  1  3 13 39
  
  [[40]]
  [1]  1  2  4  5  8 10 20 40
  
  [[41]]
  [1]  1 41
  
  [[42]]
  [1]  1  2  3  6  7 14 21 42
  
  [[43]]
  [1]  1 43
  
  [[44]]
  [1]  1  2  4 11 22 44
  
  [[45]]
  [1]  1  3  5  9 15 45
  
  [[46]]
  [1]  1  2 23 46
  
  [[47]]
  [1]  1 47
  
  [[48]]
   [1]  1  2  3  4  6  8 12 16 24 48
  
  [[49]]
  [1]  1  7 49
  
  [[50]]
  [1]  1  2  5 10 25 50
```

---
count: false
## List-columns

```r
tibble(N = 1:50) %>% 
  mutate(Divisors = map(N, divisors),
         NbDivisors = map(Divisors, length)) 
```

```
 # A tibble: 50 x 3
 N Divisors NbDivisors
 <int> <list> <list> 
 1 1 <dbl [1]> <int [1]> 
 2 2 <int [2]> <int [1]> 
 3 3 <dbl [2]> <int [1]> 
 4 4 <dbl [3]> <int [1]> 
 5 5 <dbl [2]> <int [1]> 
 6 6 <dbl [4]> <int [1]> 
 7 7 <dbl [2]> <int [1]> 
 8 8 <dbl [4]> <int [1]> 
 9 9 <dbl [3]> <int [1]> 
 10 10 <dbl [4]> <int [1]> 
 # ... with 40 more rows
```

---
count: false
## List-columns

```r
tibble(N = 1:50) %>% 
  mutate(Divisors = map(N, divisors),
         NbDivisors = map_dbl(Divisors, length)) 
```

```
 # A tibble: 50 x 3
 N Divisors NbDivisors
 <int> <list> <dbl>
 1 1 <dbl [1]> 1
 2 2 <int [2]> 2
 3 3 <dbl [2]> 2
 4 4 <dbl [3]> 3
 5 5 <dbl [2]> 2
 6 6 <dbl [4]> 4
 7 7 <dbl [2]> 2
 8 8 <dbl [4]> 4
 9 9 <dbl [3]> 3
 10 10 <dbl [4]> 4
 # ... with 40 more rows
```

---
count: false
## List-columns

```r
tibble(N = 1:50) %>% 
 mutate(Divisors = map(N, divisors),
 NbDivisors = map_dbl(Divisors, length)) %>% 
 ggplot(aes(x = N, y = NbDivisors)) +
 geom_col(aes(fill = NbDivisors < 3), color = "grey30") +
 scale_fill_viridis_d(begin = 0.6) +
 labs(x = "N", y = "Number of divisors of N") +
 theme(legend.position = "none")
```

---
# `nest()`

```r
iris %>% 
  as_tibble()
```

```
 # A tibble: 150 x 5
 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
 <dbl> <dbl> <dbl> <dbl> <fct> 
 1 5.1 3.5 1.4 0.2 setosa 
 2 4.9 3 1.4 0.2 setosa 
 3 4.7 3.2 1.3 0.2 setosa 
 4 4.6 3.1 1.5 0.2 setosa 
 5 5 3.6 1.4 0.2 setosa 
 6 5.4 3.9 1.7 0.4 setosa 
 7 4.6 3.4 1.4 0.3 setosa 
 8 5 3.4 1.5 0.2 setosa 
 9 4.4 2.9 1.4 0.2 setosa 
 10 4.9 3.1 1.5 0.1 setosa 
 # ... with 140 more rows
```

---
count:false
# `nest()`

```r
iris %>% 
  group_by(Species) %>% 
  summarise(N = n())
```

```
 # A tibble: 3 x 2
 Species N
 <fct> <int>
 1 setosa 50
 2 versicolor 50
 3 virginica 50
```

---
count:false
# `nest()`

```r
iris %>% 
  group_by(Species) %>% 
  nest()
```

```
 # A tibble: 3 x 2
 Species data 
 <fct> <list> 
 1 setosa <tibble [50 × 4]>
 2 versicolor <tibble [50 × 4]>
 3 virginica <tibble [50 × 4]>
```

---
count:false
# `nest()`

```r
iris %>% 
  group_by(Species) %>% 
  nest() %>% 
  pull(data) %>% 
  first()
```

```
 # A tibble: 50 x 4
 Sepal.Length Sepal.Width Petal.Length Petal.Width
 <dbl> <dbl> <dbl> <dbl>
 1 5.1 3.5 1.4 0.2
 2 4.9 3 1.4 0.2
 3 4.7 3.2 1.3 0.2
 4 4.6 3.1 1.5 0.2
 5 5 3.6 1.4 0.2
 6 5.4 3.9 1.7 0.4
 7 4.6 3.4 1.4 0.3
 8 5 3.4 1.5 0.2
 9 4.4 2.9 1.4 0.2
 10 4.9 3.1 1.5 0.1
 # ... with 40 more rows
```

---
class: center, middle, inverse

# Tidymodels

---
## Metapackage

```r
library(tidymodels)
```

```
  ── Attaching packages ────────────────────────────────────────────────────────────────────────────── tidymodels 0.0.1 ──
```

```
  ✔ rsample   0.0.2     ✔ yardstick 0.0.2
  ✔ recipes   0.1.4     ✔ infer     0.4.0
  ✔ broom     0.5.0
```

```
  ── Conflicts ───────────────────────────────────────────────────────────────────────────────── tidymodels_conflicts() ──
  ✖ rsample::fill()    masks tidyr::fill()
  ✖ dplyr::filter()    masks stats::filter()
  ✖ recipes::fixed()   masks stringr::fixed()
  ✖ dplyr::lag()       masks stats::lag()
  ✖ recipes::prepper() masks rsample::prepper()
  ✖ yardstick::spec()  masks readr::spec()
  ✖ recipes::step()    masks stats::step()
  ✖ yardstick::tidy()  masks recipes::tidy(), broom::tidy()
```

```r
tidy <- broom::tidy
```

---
class: center, middle, inverse

# Broom

---
## Dataset

```r
df_pines <- read_csv("pines.txt")
df_pines
```

```
 # A tibble: 39 x 3
 Species Diameter Height
 <chr> <dbl> <int>
 1 White 21.2 127
 2 White 20.2 119
 3 White 24.6 135
 4 White 23 132
 5 White 27.2 130
 6 White 18.6 130
 7 White 17.3 110
 8 White 10 75
 9 White 19.7 110
 10 White 22.3 124
 # ... with 29 more rows
```

---
## Classical output of linear regression

```r
reg_simple <- lm(Height ~ Diameter * Species, data = df_pines)
reg_simple
```

```
  
  Call:
  lm(formula = Height ~ Diameter * Species, data = df_pines)
  
  Coefficients:
             (Intercept)                Diameter           SpeciesYellow  
                 41.2747                  3.9789                -35.8390  
  Diameter:SpeciesYellow  
                  0.5293
```

---
count:false
## Classical output of linear regression

```r
reg_simple <- lm(Height ~ Diameter * Species, data = df_pines)
summary(reg_simple)
```

```
  
  Call:
  lm(formula = Height ~ Diameter * Species, data = df_pines)
  
  Residuals:
       Min       1Q   Median       3Q      Max 
  -19.5014  -8.2424  -0.7899   7.5008  28.3598 
  
  Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
  (Intercept)             41.2747     6.8238   6.049 6.66e-07 ***
  Diameter                 3.9789     0.3926  10.136 5.95e-12 ***
  SpeciesYellow          -35.8390    10.5955  -3.382  0.00178 ** 
  Diameter:SpeciesYellow   0.5293     0.5899   0.897  0.37572    
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  
  Residual standard error: 11.36 on 35 degrees of freedom
  Multiple R-squared:  0.8744,	Adjusted R-squared:  0.8637 
  F-statistic: 81.25 on 3 and 35 DF,  p-value: 7.69e-16
```

---
count:false
## Classical output of linear regression

```r
reg_simple <- lm(Height ~ Diameter * Species, data = df_pines)
coef(summary(reg_simple))
```

```
                            Estimate Std. Error    t value     Pr(>|t|)
  (Intercept)             41.2747285  6.8237545  6.0486831 6.661673e-07
  Diameter                 3.9789200  0.3925661 10.1356694 5.951760e-12
  SpeciesYellow          -35.8390158 10.5955323 -3.3824649 1.781368e-03
  Diameter:SpeciesYellow   0.5292551  0.5898638  0.8972496 3.757164e-01
```

```r
coef(reg_simple)
```

```
             (Intercept)               Diameter          SpeciesYellow 
              41.2747285              3.9789200            -35.8390158 
  Diameter:SpeciesYellow 
               0.5292551
```

---
## `tidy()`

Presents the outputs of the model in a tibble

```r
tidy(reg_simple)
```

```
 # A tibble: 4 x 5
 term estimate std.error statistic p.value
 <chr> <dbl> <dbl> <dbl> <dbl>
 1 (Intercept) 41.3 6.82 6.05 6.66e- 7
 2 Diameter 3.98 0.393 10.1 5.95e-12
 3 SpeciesYellow -35.8 10.6 -3.38 1.78e- 3
 4 Diameter:SpeciesYellow 0.529 0.590 0.897 3.76e- 1
```

---
## `glance()`

Returns summary informations about the model in a single-row tibble

```r
glance(reg_simple)
```

```
 # A tibble: 1 x 11
 r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
 * <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1 0.874 0.864 11.4 81.2 7.69e-16 4 -148. 306. 314.
 # ... with 2 more variables: deviance <dbl>, df.residual <int>
```

---
## `augment()`

Adds information about each observation in the dataset

```r
(augmented <- augment(reg_simple, df_pines))
```

```
 # A tibble: 39 x 10
 Species Diameter Height .fitted .se.fit .resid .hat .sigma .cooksd
 * <chr> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 White 21.2 127 126. 3.16 1.37 0.0775 11.5 3.32e-4
 2 White 20.2 119 122. 2.94 -2.65 0.0668 11.5 1.04e-3
 3 White 24.6 135 139. 4.13 -4.16 0.132 11.5 5.86e-3
 4 White 23 132 133. 3.64 -0.790 0.103 11.5 1.55e-4
 5 White 27.2 130 150. 4.98 -19.5 0.192 10.9 2.17e-1
 6 White 18.6 130 115. 2.65 14.7 0.0545 11.2 2.56e-2
 7 White 17.3 110 110. 2.52 -0.110 0.0491 11.5 1.27e-6
 8 White 10 75 81.1 3.47 -6.06 0.0935 11.5 8.10e-3
 9 White 19.7 110 120. 2.83 -9.66 0.0623 11.4 1.28e-2
 10 White 22.3 124 130. 3.45 -6.00 0.0921 11.5 7.81e-3
 # ... with 29 more rows, and 1 more variable: .std.resid <dbl>
```

---
## Graph of residuals

```r
augmented %>% 
  arrange(.fitted) %>% 
  mutate(.lowess = lowess(.fitted, .resid)[[2]]) %>% 
  ggplot() +
    aes(x = .fitted) +
    geom_hline(yintercept = 0, alpha = 0.5, linetype = "dashed") +
    geom_line(aes(y = .lowess), color = "saddlebrown") +
    geom_point(aes(y = .resid, color = Species)) +
    labs(x = "Fitted values", y = "Residuals")
```

---
## Quantile-quantile plot

```r
ggplot(augmented) +
  aes(sample = .resid) +
  geom_qq_line(color = "red") +
  geom_qq()+
  labs(x = "Theoretical quantiles", y = "Standardized residuals")
```

---
## Leverage

```r
augmented %>% 
  arrange(.hat) %>% 
  mutate(.lowess = lowess(.hat, .std.resid)[[2]]) %>% 
  ggplot() +
    aes(x = .hat) +
    geom_vline(xintercept = 0, alpha = 0.5, linetype = "dashed") +
    geom_hline(yintercept = 0, alpha = 0.5, linetype = "dashed") +
    geom_line(aes(y = .lowess), color = "saddlebrown") +
    geom_point(aes(y = .std.resid, color = Species)) +
    labs(x = "Leverage", y = "Standardized residuals")
```

---
## Cook's distances

```r
augmented %>% 
  rowid_to_column("ID") %>% 
  ggplot() +
    aes(x = ID, y = .cooksd, fill = Species) +
    geom_col(color = "grey30") +
    labs(x = "Observation", y = "Cook's Distance")
```

---
## Wilcoxon test

```r
w_test <- wilcox.test(Height ~ Species, data = df_pines)
w_test
```

```
  
  	Wilcoxon rank sum test with continuity correction
  
  data:  Height by Species
  W = 269.5, p-value = 0.0241
  alternative hypothesis: true location shift is not equal to 0
```

```r
tidy(w_test)
```

```
 # A tibble: 1 x 4
 statistic p.value method alternative
 <dbl> <dbl> <chr> <chr> 
 1 270. 0.0241 Wilcoxon rank sum test with continuity co… two.sided
```

---
## Tidying methods for your package

If you want to include `tidy()`, `glance()` and `augment()` methods for your models:

1. Re-export the `tidy()`, `glance()` or `augment()` generics from the {modelgenerics} package

2. Implement appropriate tidying methods in your own package

3. Test the tidying methods

General guidelines:

* Reach a minimum 90% test coverage for new tidiers

* `tidy()`, `glance()` and `augment()` methods must return tibbles

* Use new tidyverse packages and follow the tidyverse style conventions

* ...

Complete information are on <a href="https://broom.tidyverse.org/articles/adding-tidiers.html" target="_blank"> Adding new tidiers to broom</a> and <a href="https://broom.tidyverse.org/articles/external-tidiers.html" target="_blank"> Exporting tidying methods from a package</a>

---
class: center, middle, inverse

# Rsample

---
## Estimation of parameters

```r
bootstrap_pins <-
 df_pines %>% 
 rowid_to_column("ID") %>% 
 bootstraps(times = 500)
bootstrap_pins
```

```
 # Bootstrap sampling 
 # A tibble: 500 x 2
 splits id 
 <list> <chr> 
 1 <S3: rsplit> Bootstrap001
 2 <S3: rsplit> Bootstrap002
 3 <S3: rsplit> Bootstrap003
 4 <S3: rsplit> Bootstrap004
 5 <S3: rsplit> Bootstrap005
 6 <S3: rsplit> Bootstrap006
 7 <S3: rsplit> Bootstrap007
 8 <S3: rsplit> Bootstrap008
 9 <S3: rsplit> Bootstrap009
 10 <S3: rsplit> Bootstrap010
 # ... with 490 more rows
```

---
## `rsplit` object

```r
bootstrap_pins %>% 
  pull(splits) %>% 
  first()
```

```
 <39/12/39>
```

---
count:false
## `rsplit` object

```r
bootstrap_pins %>% 
  pull(splits) %>% 
  first() %>% 
  as_tibble() 
```

```
 # A tibble: 39 x 4
 ID Species Diameter Height
 <int> <chr> <dbl> <int>
 1 11 White 5.6 55
 2 16 White 14.4 107
 3 1 White 21.2 127
 4 15 White 12.6 108
 5 34 Yellow 12.2 57
 6 14 White 13.1 116
 7 19 White 7 56
 8 24 Yellow 14.4 84
 9 20 White 9.3 70
 10 8 White 10 75
 # ... with 29 more rows
```

---
## Analysis & assessment sets

```r
bootstrap_pins %>% 
  pull(splits) %>% 
  first() %>% 
  as_tibble() %>% # analysis()
  pull(ID) %>% 
  table()
```

```
  .
   1  3  4  5  8 10 11 12 13 14 15 16 17 18 19 20 21 22 24 26 27 28 29 31 33 
   2  1  1  1  1  1  1  1  1  1  1  2  1  2  3  1  2  1  1  2  1  1  2  3  2 
  34 36 
   2  1
```

```r
bootstrap_pins %>% 
  pull(splits) %>% 
  first() %>% 
  as_tibble(data = "assessment") %>% # assessment()
  pull(ID) %>% 
  table()
```

```
  .
   2  6  7  9 23 25 30 32 35 37 38 39 
   1  1  1  1  1  1  1  1  1  1  1  1
```

---
## Bootstrapped models

```r
my_lm <- partial(lm, formula = Height ~ Diameter * Species)
```

```r
(bootstrap_pins <-
 bootstrap_pins %>% 
 mutate(reg = map(splits, my_lm)))
```

```
 # Bootstrap sampling 
 # A tibble: 500 x 3
 splits id reg 
 * <list> <chr> <list> 
 1 <S3: rsplit> Bootstrap001 <S3: lm>
 2 <S3: rsplit> Bootstrap002 <S3: lm>
 3 <S3: rsplit> Bootstrap003 <S3: lm>
 4 <S3: rsplit> Bootstrap004 <S3: lm>
 5 <S3: rsplit> Bootstrap005 <S3: lm>
 6 <S3: rsplit> Bootstrap006 <S3: lm>
 7 <S3: rsplit> Bootstrap007 <S3: lm>
 8 <S3: rsplit> Bootstrap008 <S3: lm>
 9 <S3: rsplit> Bootstrap009 <S3: lm>
 10 <S3: rsplit> Bootstrap010 <S3: lm>
 # ... with 490 more rows
```

---
count: false
## Bootstrapped models

```r
my_lm <- partial(lm, formula = Height ~ Diameter * Species)
```

```r
(bootstrap_pins <-
 bootstrap_pins %>% 
 mutate(reg = map(splits, my_lm),
 tidied = map(reg, tidy)))
```

```
 # Bootstrap sampling 
 # A tibble: 500 x 4
 splits id reg tidied 
 * <list> <chr> <list> <list> 
 1 <S3: rsplit> Bootstrap001 <S3: lm> <tibble [4 × 5]>
 2 <S3: rsplit> Bootstrap002 <S3: lm> <tibble [4 × 5]>
 3 <S3: rsplit> Bootstrap003 <S3: lm> <tibble [4 × 5]>
 4 <S3: rsplit> Bootstrap004 <S3: lm> <tibble [4 × 5]>
 5 <S3: rsplit> Bootstrap005 <S3: lm> <tibble [4 × 5]>
 6 <S3: rsplit> Bootstrap006 <S3: lm> <tibble [4 × 5]>
 7 <S3: rsplit> Bootstrap007 <S3: lm> <tibble [4 × 5]>
 8 <S3: rsplit> Bootstrap008 <S3: lm> <tibble [4 × 5]>
 9 <S3: rsplit> Bootstrap009 <S3: lm> <tibble [4 × 5]>
 10 <S3: rsplit> Bootstrap010 <S3: lm> <tibble [4 × 5]>
 # ... with 490 more rows
```

---
count: false
## Bootstrapped models

```r
my_lm <- partial(lm, formula = Height ~ Diameter * Species)
```

```r
(bootstrap_pins <-
 bootstrap_pins %>% 
 mutate(reg = map(splits, my_lm),
 tidied = map(reg, tidy)) %>% 
 unnest(tidied)) 
```

```
 # A tibble: 2,000 x 6
 id term estimate std.error statistic p.value
 <chr> <chr> <dbl> <dbl> <dbl> <dbl>
 1 Bootstrap001 (Intercept) 35.0 4.93 7.09 2.95e- 8
 2 Bootstrap001 Diameter 4.32 0.299 14.4 2.57e-16
 3 Bootstrap001 SpeciesYellow -37.0 8.26 -4.48 7.56e- 5
 4 Bootstrap001 Diameter:SpeciesYel… 0.387 0.516 0.750 4.58e- 1
 5 Bootstrap002 (Intercept) 49.6 6.64 7.47 9.53e- 9
 6 Bootstrap002 Diameter 3.68 0.394 9.32 5.19e-11
 7 Bootstrap002 SpeciesYellow -53.4 9.91 -5.38 5.00e- 6
 8 Bootstrap002 Diameter:SpeciesYel… 1.37 0.605 2.26 2.99e- 2
 9 Bootstrap003 (Intercept) 33.4 7.70 4.34 1.14e- 4
 10 Bootstrap003 Diameter 4.31 0.449 9.59 2.52e-11
 # ... with 1,990 more rows
```

---
## Bootstrapped models

```r
bootstrap_pins %>% 
  mutate(term = fct_relevel(term, "Diameter:SpeciesYellow", after = Inf)) %>%
  ggplot(aes(x = estimate, y = term, fill = term)) +
    ggridges::geom_density_ridges() + 
    theme(legend.position = "none")
```

---
## Other resampling techniques

* `initial_split(prop = 3/4)`, `training()`, `testing()`

* `loo_cv()`

* `mc_cv()`

* `vfold_cv()`

* `group_vfold_cv()`

* `rolling_origin()`

* `nested_cv()`

Since the original data is not modified, does not make an automatic copy so creating `$N$` bootstraps of a data set does not create an object that is `$N$`-fold larger in memory.

```r
pryr::object_size(bootstraps(mtcars, 100)) / pryr::object_size(mtcars)
```

```
  15.2 B
```

---
class: center, middle, inverse

# Your turn!
## Application to human microbiome

---
class: nologo
## Data

```r
df_abund <- read_csv("abund_table.txt")
df_abund
```

```
 # A tibble: 256 x 654
 Clade SRS011086 SRS011090 SRS011098 SRS011111 SRS011115 SRS011126
 <chr> <int> <int> <int> <int> <int> <int>
 1 k__B… 0 0 0 0 0 0
 2 k__B… 0 0 0 0 0 0
 3 k__B… 0 0 0 0 0 0
 4 k__B… 0 0 0 0 0 0
 5 k__B… 0 0 0 0 0 0
 6 k__B… 0 0 0 0 0 0
 7 k__B… 0 5366 0 0 1182 0
 8 k__B… 0 0 0 0 0 0
 9 k__B… 0 0 0 0 0 0
 10 k__B… 0 0 0 0 0 0
 # ... with 246 more rows, and 647 more variables: SRS011140 <int>,
 # SRS011144 <int>, SRS011152 <int>, SRS011243 <int>, SRS011247 <int>, …
```

```r
df_abund$Clade[1:2]
```

---
## Data

```r
df_samples <- read_csv("samples_table.txt")
df_samples
```

```
 # A tibble: 653 x 6
 Sample Body_site Body_subsite Age Age_category Gender
 <chr> <chr> <chr> <int> <chr> <chr> 
 1 SRS011086 oralcavity tongue_dorsum 30 adult female
 2 SRS011090 oralcavity buccal_mucosa 30 adult female
 3 SRS011098 oralcavity supragingival_plaque 30 adult female
 4 SRS011111 oralcavity posterior_fornix 30 adult female
 5 SRS011115 oralcavity tongue_dorsum 28 adult male 
 6 SRS011126 oralcavity supragingival_plaque 28 adult male 
 7 SRS011140 oralcavity tongue_dorsum 25 adult male 
 8 SRS011144 oralcavity buccal_mucosa 25 adult male 
 9 SRS011152 oralcavity supragingival_plaque 25 adult male 
 10 SRS011243 oralcavity tongue_dorsum 27 adult female
 # ... with 643 more rows
```

---
## Create a taxonomic table

Like this:

```
 # A tibble: 256 x 6
 Kingdom Phylum Class Order Family Genus 
 <chr> <chr> <chr> <chr> <chr> <chr> 
 1 Archaea Euryarcha… Methanobac… Methanobac… Methanobac… Methanobreviba…
 2 Archaea Euryarcha… Methanobac… Methanobac… Methanobac… Methanosphaera 
 3 Bacteria Acidobact… Acidobacte… Acidobacte… Acidobacte… Candidatus_Chl…
 4 Bacteria Acidobact… Acidobacte… Acidobacte… Acidobacte… Granulicella 
 5 Bacteria Acidobact… Acidobacte… Acidobacte… Acidobacte… Terriglobus 
 6 Bacteria Actinobac… Actinobact… Actinomyce… Actinomyce… Actinobaculum 
 7 Bacteria Actinobac… Actinobact… Actinomyce… Actinomyce… Actinomyces 
 8 Bacteria Actinobac… Actinobact… Actinomyce… Actinomyce… Mobiluncus 
 9 Bacteria Actinobac… Actinobact… Actinomyce… Actinomyce… Varibaculum 
 10 Bacteria Actinobac… Actinobact… Actinomyce… Brevibacte… Brevibacterium 
 # ... with 246 more rows
```

Hint: use `tidyr::separate()` and `stringr`

---
count: false
## Create a taxonomic table

```r
df_tax <-
 df_abund %>% 
 select(Clade) %>% 
 separate(Clade, sep = "\\|",
 into = c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus")) %>% 
 mutate_all(str_remove, ".__") %>% 
 arrange_all()
df_tax
```

---
## Put `df_abund` in a tidy way

Like this:

```
 # A tibble: 167,168 x 3
 Genus Sample Abundance
 <chr> <chr> <dbl>
 1 Bacteroides SRS011086 0
 2 Parabacteroides SRS011086 0
 3 Alistipes SRS011086 0
 4 Burkholderiales_noname SRS011086 0
 5 Faecalibacterium SRS011086 0
 6 Roseburia SRS011086 0
 7 Subdoligranulum SRS011086 0
 8 Bacteroidales_noname SRS011086 0
 9 Parasutterella SRS011086 0
 10 Oscillibacter SRS011086 0
 # ... with 167,158 more rows
```

Hint: use a regular expression and `tidyr::gather()`

---
count: false
## Put `df_abund` in a tidy way

```r
df_abund_long <-
 df_abund %>% 
 mutate(Genus = str_remove(Clade, ".*__")) %>% 
 select(-Clade) %>% 
 gather(key = "Sample", value = "Abundance", -Genus)
df_abund_long
```

---
## Remove genera present in less than 25 samples

```r
df_abund_filtered <-
 df_abund_long %>% 
 group_by(Genus) %>% 
 mutate(Preval = sum(Abundance > 0)) %>% 
 filter(Preval > 25) %>% 
 select(-Preval) %>% 
 ungroup()
df_abund_filtered
```

```
 # A tibble: 85,543 x 3
 Genus Sample Abundance
 <chr> <chr> <dbl>
 1 Bacteroides SRS011086 0
 2 Parabacteroides SRS011086 0
 3 Alistipes SRS011086 0
 4 Burkholderiales_noname SRS011086 0
 5 Faecalibacterium SRS011086 0
 6 Roseburia SRS011086 0
 7 Subdoligranulum SRS011086 0
 8 Bacteroidales_noname SRS011086 0
 9 Parasutterella SRS011086 0
 10 Oscillibacter SRS011086 0
 # ... with 85,533 more rows
```

---
## Find differentially abundant genera

* Between subsites `buccal_mucosa` and `supragingival_plaque`

* With a Wilcowon-text: `wilcox.test()`

* Corrected for false discovery rate: `p.adjust(method = "fdr")`

---
## Find differentially abundant genera

```r
df_abund_filtered %>% 
  left_join(df_samples, by = "Sample") %>%
  filter(Body_subsite %in% c("buccal_mucosa", "supragingival_plaque")) %>%
  select(Genus, Body_subsite, Abundance)
```

```
 # A tibble: 32,357 x 3
 Genus Body_subsite Abundance
 <chr> <chr> <dbl>
 1 Bacteroides buccal_mucosa 0
 2 Parabacteroides buccal_mucosa 0
 3 Alistipes buccal_mucosa 0
 4 Burkholderiales_noname buccal_mucosa 0
 5 Faecalibacterium buccal_mucosa 0
 6 Roseburia buccal_mucosa 0
 7 Subdoligranulum buccal_mucosa 5366
 8 Bacteroidales_noname buccal_mucosa 0
 9 Parasutterella buccal_mucosa 0
 10 Oscillibacter buccal_mucosa 0
 # ... with 32,347 more rows
```

---
count: false
## Find differentially abundant genera

```
 # A tibble: 131 x 2
 Genus data 
 <chr> <list> 
 1 Bacteroides <tibble [247 × 2]>
 2 Parabacteroides <tibble [247 × 2]>
 3 Alistipes <tibble [247 × 2]>
 4 Burkholderiales_noname <tibble [247 × 2]>
 5 Faecalibacterium <tibble [247 × 2]>
 6 Roseburia <tibble [247 × 2]>
 7 Subdoligranulum <tibble [247 × 2]>
 8 Bacteroidales_noname <tibble [247 × 2]>
 9 Parasutterella <tibble [247 × 2]>
 10 Oscillibacter <tibble [247 × 2]>
 # ... with 121 more rows
```

---
count: false
## Find differentially abundant genera

```
 # A tibble: 131 x 3
 Genus data wt 
 <chr> <list> <list> 
 1 Bacteroides <tibble [247 × 2]> <S3: htest>
 2 Parabacteroides <tibble [247 × 2]> <S3: htest>
 3 Alistipes <tibble [247 × 2]> <S3: htest>
 4 Burkholderiales_noname <tibble [247 × 2]> <S3: htest>
 5 Faecalibacterium <tibble [247 × 2]> <S3: htest>
 6 Roseburia <tibble [247 × 2]> <S3: htest>
 7 Subdoligranulum <tibble [247 × 2]> <S3: htest>
 8 Bacteroidales_noname <tibble [247 × 2]> <S3: htest>
 9 Parasutterella <tibble [247 × 2]> <S3: htest>
 10 Oscillibacter <tibble [247 × 2]> <S3: htest>
 # ... with 121 more rows
```

---
count: false
class: nologo
## Find differentially abundant genera

```
 # A tibble: 131 x 4
 Genus data wt tidied 
 <chr> <list> <list> <list> 
 1 Bacteroides <tibble [247 × 2]> <S3: htest> <tibble [1 × 4]>
 2 Parabacteroides <tibble [247 × 2]> <S3: htest> <tibble [1 × 4]>
 3 Alistipes <tibble [247 × 2]> <S3: htest> <tibble [1 × 4]>
 4 Burkholderiales_noname <tibble [247 × 2]> <S3: htest> <tibble [1 × 4]>
 5 Faecalibacterium <tibble [247 × 2]> <S3: htest> <tibble [1 × 4]>
 6 Roseburia <tibble [247 × 2]> <S3: htest> <tibble [1 × 4]>
 7 Subdoligranulum <tibble [247 × 2]> <S3: htest> <tibble [1 × 4]>
 8 Bacteroidales_noname <tibble [247 × 2]> <S3: htest> <tibble [1 × 4]>
 9 Parasutterella <tibble [247 × 2]> <S3: htest> <tibble [1 × 4]>
 10 Oscillibacter <tibble [247 × 2]> <S3: htest> <tibble [1 × 4]>
 # ... with 121 more rows
```

---
count: false
class: nologo
## Find differentially abundant genera

```r
df_abund_filtered %>% 
  left_join(df_samples, by = "Sample") %>%
  filter(Body_subsite %in% c("buccal_mucosa", "supragingival_plaque")) %>%
  select(Genus, Body_subsite, Abundance) %>% 
  group_by(Genus) %>% 
  nest() %>% 
  mutate(wt = map(data, ~ wilcox.test(formula = Abundance ~ Body_subsite, 
                                      data = .)),
         tidied = map(wt, tidy)) %>% 
  unnest(tidied, .drop = TRUE)
```

```
 # A tibble: 131 x 5
 Genus statistic p.value method alternative
 <chr> <dbl> <dbl> <chr> <chr> 
 1 Bacteroides 8248 0.142 Wilcoxon rank sum test wi… two.sided 
 2 Parabacteroid… 7722. 0.651 Wilcoxon rank sum test wi… two.sided 
 3 Alistipes 8073 0.127 Wilcoxon rank sum test wi… two.sided 
 4 Burkholderial… 7621 0.959 Wilcoxon rank sum test wi… two.sided 
 5 Faecalibacter… 7645 0.882 Wilcoxon rank sum test wi… two.sided 
 6 Roseburia 7754 0.358 Wilcoxon rank sum test wi… two.sided 
 7 Subdoligranul… 7824 0.579 Wilcoxon rank sum test wi… two.sided 
 8 Bacteroidales… 7808 0.0720 Wilcoxon rank sum test wi… two.sided 
 9 Parasutterella 7621 0.959 Wilcoxon rank sum test wi… two.sided 
 10 Oscillibacter 7684. 0.529 Wilcoxon rank sum test wi… two.sided 
 # ... with 121 more rows
```

---
count: false
class: nologo
## Find differentially abundant genera

```
 # A tibble: 55 x 3
 Genus p.value q.value
 <chr> <dbl> <dbl>
 1 Corynebacterium 4.99e-37 2.87e-35
 2 Cardiobacterium 2.74e-37 2.87e-35
 3 Actinomyces 1.27e-36 4.85e-35
 4 Capnocytophaga 3.38e-36 9.72e-35
 5 Kingella 8.21e-36 1.89e-34
 6 Eikenella 2.78e-35 5.33e-34
 7 Actinobaculum 1.75e-33 2.87e-32
 8 Propionibacterium 2.14e-33 3.08e-32
 9 Peptostreptococcaceae_noname 2.88e-31 3.68e-30
 10 Campylobacter 8.68e-31 9.98e-30
 # ... with 45 more rows
```

---
## Make a PCA with samples as observations

```r
df_abund_transp <-
 df_abund_filtered %>% 
 spread(key = Genus, value = Abundance) 
```

* Don't forget to remove the `Sample` column

* Use `prcomp()` for PCA (there is no `broom` methods for `FactoMineR` functions yet)

* Set `scale. = TRUE`

* Keep only the four first principal components

---
## Make a PCA with samples as observations

```r
df_abund_transp %>% 
  select(-Sample) %>% 
  prcomp(scale. = TRUE) %>% 
  augment(df_abund_transp) %>% 
  select(Sample, .fittedPC1:.fittedPC4)
```

```
 # A tibble: 653 x 5
 Sample .fittedPC1 .fittedPC2 .fittedPC3 .fittedPC4
 <chr> <dbl> <dbl> <dbl> <dbl>
 1 SRS011061 4.34 0.430 0.751 -2.60 
 2 SRS011084 10.3 1.45 4.48 -1.19 
 3 SRS011086 -1.51 -2.67 -0.586 0.726
 4 SRS011090 0.499 -0.493 -2.51 -0.314
 5 SRS011098 -1.51 1.42 -1.68 2.12 
 6 SRS011111 0.196 -2.31 -2.72 -3.91 
 7 SRS011115 -9.51 -4.81 9.64 5.33 
 8 SRS011126 -6.38 8.50 2.13 4.04 
 9 SRS011134 6.27 0.489 2.17 -0.815
 10 SRS011140 -2.23 -2.54 0.111 1.86 
 # ... with 643 more rows
```

---
count: false
## Make a PCA with samples as observations

```r
df_abund_transp %>% 
  select(-Sample) %>% 
  prcomp(scale. = TRUE) %>% 
  augment(df_abund_transp) %>% 
  select(Sample, .fittedPC1:.fittedPC4) %>% 
  left_join(df_samples, by = "Sample") -> df_PCA
df_PCA
```

```
 # A tibble: 653 x 10
 Sample .fittedPC1 .fittedPC2 .fittedPC3 .fittedPC4 Body_site
 <chr> <dbl> <dbl> <dbl> <dbl> <chr> 
 1 SRS01… 4.34 0.430 0.751 -2.60 stool 
 2 SRS01… 10.3 1.45 4.48 -1.19 stool 
 3 SRS01… -1.51 -2.67 -0.586 0.726 oralcavi…
 4 SRS01… 0.499 -0.493 -2.51 -0.314 oralcavi…
 5 SRS01… -1.51 1.42 -1.68 2.12 oralcavi…
 6 SRS01… 0.196 -2.31 -2.72 -3.91 oralcavi…
 7 SRS01… -9.51 -4.81 9.64 5.33 oralcavi…
 8 SRS01… -6.38 8.50 2.13 4.04 oralcavi…
 9 SRS01… 6.27 0.489 2.17 -0.815 stool 
 10 SRS01… -2.23 -2.54 0.111 1.86 oralcavi…
 # ... with 643 more rows, and 4 more variables: Body_subsite <chr>,
 # Age <int>, Age_category <chr>, Gender <chr>
```

---
## Make a PCA with samples as observations

```r
ggplot(df_PCA) +
  aes(x = .fittedPC1, y = .fittedPC2, color = Body_site) +
  geom_point(alpha = 0.6)
```

---
count: false
## Make a PCA with samples as observations

```r
ggplot(df_PCA) +
  aes(x = .fittedPC1, y = .fittedPC2, color = Body_subsite) +
  geom_point(alpha = 0.6)
```

---
## Compute k-means for several K

```r
tibble(K = 1:10)
```

```
 # A tibble: 10 x 1
 K
 <int>
 1 1
 2 2
 3 3
 4 4
 5 5
 6 6
 7 7
 8 8
 9 9
 10 10
```

---
count: false
## Compute k-means for several K

```r
tibble(K = 1:10) %>% 
  mutate(kmeans = map(K, kmeans, nstart = 10,
                      x = select(df_abund_transp, -Sample)))
```

```
 # A tibble: 10 x 2
 K kmeans 
 <int> <list> 
 1 1 <S3: kmeans>
 2 2 <S3: kmeans>
 3 3 <S3: kmeans>
 4 4 <S3: kmeans>
 5 5 <S3: kmeans>
 6 6 <S3: kmeans>
 7 7 <S3: kmeans>
 8 8 <S3: kmeans>
 9 9 <S3: kmeans>
 10 10 <S3: kmeans>
```

---
count: false
## Compute k-means for several K

```r
tibble(K = 1:10) %>% 
  mutate(kmeans = map(K, kmeans, nstart = 10,
                      x = select(df_abund_transp, -Sample)), 
         tidied = map(kmeans, tidy),
         glanced = map(kmeans, glance),
         augmented = map(kmeans, augment, df_PCA))
```

```
 # A tibble: 10 x 5
 K kmeans tidied glanced augmented 
 <int> <list> <list> <list> <list> 
 1 1 <S3: kmeans> <tibble [1 × 134]> <tibble [1 × 4… <tibble [653 × 1…
 2 2 <S3: kmeans> <tibble [2 × 134]> <tibble [1 × 4… <tibble [653 × 1…
 3 3 <S3: kmeans> <tibble [3 × 134]> <tibble [1 × 4… <tibble [653 × 1…
 4 4 <S3: kmeans> <tibble [4 × 134]> <tibble [1 × 4… <tibble [653 × 1…
 5 5 <S3: kmeans> <tibble [5 × 134]> <tibble [1 × 4… <tibble [653 × 1…
 6 6 <S3: kmeans> <tibble [6 × 134]> <tibble [1 × 4… <tibble [653 × 1…
 7 7 <S3: kmeans> <tibble [7 × 134]> <tibble [1 × 4… <tibble [653 × 1…
 8 8 <S3: kmeans> <tibble [8 × 134]> <tibble [1 × 4… <tibble [653 × 1…
 9 9 <S3: kmeans> <tibble [9 × 134]> <tibble [1 × 4… <tibble [653 × 1…
 10 10 <S3: kmeans> <tibble [10 × 134… <tibble [1 × 4… <tibble [653 × 1…
```

---
count: false
## Compute k-means for several K

```
 # A tibble: 10 x 8
 K kmeans tidied augmented totss tot.withinss betweenss iter
 <int> <list> <list> <list> <dbl> <dbl> <dbl> <int>
 1 1 <S3: km… <tibble… <tibble [… 7.55e17 7.55e17 -2.42e 4 1
 2 2 <S3: km… <tibble… <tibble [… 7.55e17 3.79e17 3.76e17 1
 3 3 <S3: km… <tibble… <tibble [… 7.55e17 3.15e17 4.40e17 2
 4 4 <S3: km… <tibble… <tibble [… 7.55e17 2.73e17 4.82e17 2
 5 5 <S3: km… <tibble… <tibble [… 7.55e17 2.54e17 5.01e17 2
 6 6 <S3: km… <tibble… <tibble [… 7.55e17 2.35e17 5.20e17 3
 7 7 <S3: km… <tibble… <tibble [… 7.55e17 2.13e17 5.42e17 4
 8 8 <S3: km… <tibble… <tibble [… 7.55e17 2.02e17 5.53e17 4
 9 9 <S3: km… <tibble… <tibble [… 7.55e17 1.84e17 5.71e17 4
 10 10 <S3: km… <tibble… <tibble [… 7.55e17 1.78e17 5.77e17 5
```

---
## Compute k-means for several K

```r
ggplot(df_kmeans) +
  aes(x = K, y = betweenss) +
  geom_line(color = "grey30") +
  geom_point(color = "tomato")
```

---
## Plots of k-means on the principal plane

```r
df_kmeans %>% 
  filter(K %in% 2:4)
```

```
 # A tibble: 3 x 8
 K kmeans tidied augmented totss tot.withinss betweenss iter
 <int> <list> <list> <list> <dbl> <dbl> <dbl> <int>
 1 2 <S3: km… <tibble… <tibble [6… 7.55e17 3.79e17 3.76e17 1
 2 3 <S3: km… <tibble… <tibble [6… 7.55e17 3.15e17 4.40e17 2
 3 4 <S3: km… <tibble… <tibble [6… 7.55e17 2.73e17 4.82e17 2
```

---
count: false
## Plots of k-means on the principal plane

```r
df_kmeans %>% 
  filter(K %in% 2:4) %>% 
  select(K, augmented) %>% 
  unnest(augmented) -> df_clust
df_clust
```

```
 # A tibble: 1,959 x 12
 K Sample .fittedPC1 .fittedPC2 .fittedPC3 .fittedPC4 Body_site
 <int> <chr> <dbl> <dbl> <dbl> <dbl> <chr> 
 1 2 SRS01… 4.34 0.430 0.751 -2.60 stool 
 2 2 SRS01… 10.3 1.45 4.48 -1.19 stool 
 3 2 SRS01… -1.51 -2.67 -0.586 0.726 oralcavi…
 4 2 SRS01… 0.499 -0.493 -2.51 -0.314 oralcavi…
 5 2 SRS01… -1.51 1.42 -1.68 2.12 oralcavi…
 6 2 SRS01… 0.196 -2.31 -2.72 -3.91 oralcavi…
 7 2 SRS01… -9.51 -4.81 9.64 5.33 oralcavi…
 8 2 SRS01… -6.38 8.50 2.13 4.04 oralcavi…
 9 2 SRS01… 6.27 0.489 2.17 -0.815 stool 
 10 2 SRS01… -2.23 -2.54 0.111 1.86 oralcavi…
 # ... with 1,949 more rows, and 5 more variables: Body_subsite <chr>,
 # Age <int>, Age_category <chr>, Gender <chr>, .cluster <chr>
```

---
class: nologo
## Plots of k-means on the principal plane

```r
ggplot(df_clust) +
  aes(x = .fittedPC1, y = .fittedPC2, color = .cluster, shape = Body_site) +
  geom_point(alpha = 0.6) +
  scale_shape_manual(values = 0:2) +
  facet_grid(. ~ K) +
  theme(legend.position = "bottom")
```

---
## Contingency tables of the clusterings

* Use the `janitor::tabyl()` function to create a contingency table by K

```r
df_clust %>% 
  nest(-K)
```

```
 # A tibble: 3 x 2
 K data 
 <int> <list> 
 1 2 <tibble [653 × 11]>
 2 3 <tibble [653 × 11]>
 3 4 <tibble [653 × 11]>
```

---
count: false
## Contingency tables of the clusterings

* Use the `janitor::tabyl()` function to create a contingency table by K

```r
df_clust %>% 
  nest(-K) %>% 
  mutate(CT = map(data, janitor::tabyl, Body_site, .cluster))
```

```
 # A tibble: 3 x 3
 K data CT 
 <int> <list> <list> 
 1 2 <tibble [653 × 11]> <tabyl [3 × 3]>
 2 3 <tibble [653 × 11]> <tabyl [3 × 4]>
 3 4 <tibble [653 × 11]> <tabyl [3 × 5]>
```

---
count: false
## Contingency tables of the clusterings

* Use the `janitor::tabyl()` function to create a contingency table by K

```r
df_clust %>% 
  nest(-K) %>% 
  mutate(CT = map(data, janitor::tabyl, .cluster, Body_site)) %>% 
  pull(CT)
```

```
  [[1]]
   .cluster oralcavity skin stool
          1          0    0   114
          2        479   27    33
  
  [[2]]
   .cluster oralcavity skin stool
          1        347   27    28
          2          0    0   113
          3        132    0     6
  
  [[3]]
   .cluster oralcavity skin stool
          1          0    0    63
          2          0    0    79
          3        344   27     5
          4        135    0     0
```

---
## Agregate the abundance at phylum level

Like this:

```
 # A tibble: 17 x 654
 Phylum SRS011086 SRS011090 SRS011098 SRS011111 SRS011115 SRS011126
 <chr> <int> <int> <int> <int> <int> <int>
 1 Acido… 0 0 0 0 0 0
 2 Actin… 19795422 261680 40149371 70894509 17912678 36577594
 3 Ascom… 0 0 0 0 0 0
 4 Bacte… 5173832 161067 4307084 1532049 31986039 19951810
 5 Basid… 0 0 0 0 0 0
 6 Candi… 0 0 4925 0 59724 3161196
 7 Chlor… 0 0 0 0 0 0
 8 Deino… 0 0 0 0 0 0
 9 Eurya… 0 0 0 0 0 0
 10 Firmi… 52604350 2118788 4312759 10592022 47823300 8196029
 11 Fusob… 319848 122811 1206832 1387 19118873 7588485
 12 Prote… 3646454 1066799 5391435 0 49033824 31115434
 13 Spiro… 0 0 214 0 114106 1021118
 14 Syner… 0 0 0 0 0 0
 15 Tener… 0 0 0 36345 345065 0
 16 Verru… 0 0 0 0 0 0
 17 Virus… 124898 126218 1041431 0 15143 430516
 # ... with 647 more variables: SRS011140 <int>, SRS011144 <int>,
 # SRS011152 <int>, SRS011243 <int>, SRS011247 <int>, …
```

---
count: false
## Agregate the abundance at phylum level

```r
df_abund %>% 
  mutate(Clade = str_remove(Clade,  ".*__")) %>% 
  rename(Genus = Clade)
```

```
 # A tibble: 256 x 654
 Genus SRS011086 SRS011090 SRS011098 SRS011111 SRS011115 SRS011126
 <chr> <int> <int> <int> <int> <int> <int>
 1 Bact… 0 0 0 0 0 0
 2 Para… 0 0 0 0 0 0
 3 Alis… 0 0 0 0 0 0
 4 Burk… 0 0 0 0 0 0
 5 Faec… 0 0 0 0 0 0
 6 Rose… 0 0 0 0 0 0
 7 Subd… 0 5366 0 0 1182 0
 8 Bact… 0 0 0 0 0 0
 9 Para… 0 0 0 0 0 0
 10 Osci… 0 0 0 0 0 0
 # ... with 246 more rows, and 647 more variables: SRS011140 <int>,
 # SRS011144 <int>, SRS011152 <int>, SRS011243 <int>, SRS011247 <int>, …
```

---
count: false
## Agregate the abundance at phylum level

```r
df_abund %>% 
  mutate(Clade = str_remove(Clade,  ".*__")) %>% 
  rename(Genus = Clade)%>% 
  left_join(df_tax, ., by = "Genus")
```

```
 # A tibble: 256 x 659
 Kingdom Phylum Class Order Family Genus SRS011086 SRS011090 SRS011098
 <chr> <chr> <chr> <chr> <chr> <chr> <int> <int> <int>
 1 Archaea Eurya… Meth… Meth… Metha… Meth… 0 0 0
 2 Archaea Eurya… Meth… Meth… Metha… Meth… 0 0 0
 3 Bacter… Acido… Acid… Acid… Acido… Cand… 0 0 0
 4 Bacter… Acido… Acid… Acid… Acido… Gran… 0 0 0
 5 Bacter… Acido… Acid… Acid… Acido… Terr… 0 0 0
 6 Bacter… Actin… Acti… Acti… Actin… Acti… 4949 0 810698
 7 Bacter… Actin… Acti… Acti… Actin… Acti… 9596511 105544 6691711
 8 Bacter… Actin… Acti… Acti… Actin… Mobi… 0 0 0
 9 Bacter… Actin… Acti… Acti… Actin… Vari… 0 0 0
 10 Bacter… Actin… Acti… Acti… Brevi… Brev… 15933 0 1630
 # ... with 246 more rows, and 650 more variables: SRS011111 <int>,
 # SRS011115 <int>, SRS011126 <int>, SRS011140 <int>, SRS011144 <int>, …
```

---
count: false
## Agregate the abundance at phylum level

```r
df_abund %>% 
  mutate(Clade = str_remove(Clade,  ".*__")) %>% 
  rename(Genus = Clade)%>% 
  left_join(df_tax, ., by = "Genus") %>% 
  select(-c(Kingdom, Class:Genus))
```

```
 # A tibble: 256 x 654
 Phylum SRS011086 SRS011090 SRS011098 SRS011111 SRS011115 SRS011126
 <chr> <int> <int> <int> <int> <int> <int>
 1 Eurya… 0 0 0 0 0 0
 2 Eurya… 0 0 0 0 0 0
 3 Acido… 0 0 0 0 0 0
 4 Acido… 0 0 0 0 0 0
 5 Acido… 0 0 0 0 0 0
 6 Actin… 4949 0 810698 0 33598 725860
 7 Actin… 9596511 105544 6691711 0 11614891 7299386
 8 Actin… 0 0 0 142209 0 0
 9 Actin… 0 0 0 4610 0 0
 10 Actin… 15933 0 1630 0 10634 1826
 # ... with 246 more rows, and 647 more variables: SRS011140 <int>,
 # SRS011144 <int>, SRS011152 <int>, SRS011243 <int>, SRS011247 <int>, …
```

---
count: false
## Agregate the abundance at phylum level

```r
df_abund %>% 
  mutate(Clade = str_remove(Clade,  ".*__")) %>% 
  rename(Genus = Clade)%>% 
  left_join(df_tax, ., by = "Genus") %>% 
  select(-c(Kingdom, Class:Genus)) %>% 
  group_by(Phylum) %>% 
  summarise_all(sum) -> df_phylum
df_phylum
```

---
## Transpose the dataset and choose stool samples

```r
(df_phylum_transp_stool <-
 df_phylum %>% 
 gather(key = "Sample", value = "Abundance", -Phylum) %>% 
 spread(key = "Phylum", value = "Abundance") %>% 
 left_join(df_samples, by = "Sample") %>% 
 filter(Body_site == "stool"))
```

```
 # A tibble: 147 x 23
 Sample Acidobacteria Actinobacteria Ascomycota Bacteroidetes
 <chr> <dbl> <dbl> <dbl> <dbl>
 1 SRS01… 0 37836 0 82049580
 2 SRS01… 0 925978 0 199905409
 3 SRS01… 0 80513 0 74422158
 4 SRS01… 0 1101941 1825 112432267
 5 SRS01… 0 277822 0 106439366
 6 SRS01… 0 547275 0 58763094
 7 SRS01… 0 20241 0 119708033
 8 SRS01… 0 47477 0 33479123
 9 SRS01… 0 416032 0 111204570
 10 SRS01… 0 138253 0 76713977
 # ... with 137 more rows, and 18 more variables: Basidiomycota <dbl>,
 # Candidatus_Saccharibacteria <dbl>, Chloroflexi <dbl>,
 # Deinococcus_Thermus <dbl>, Euryarchaeota <dbl>, …
```

---
## Correlation between Firmicutes and Bacteroidetes

```r
ggplot(df_phylum_transp_stool) +
  aes(x = Firmicutes, y = Bacteroidetes) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm")
```

---
## Correlation between Firmicutes and Bacteroidetes

* Create 500 bootstrapped datasets to find the distribution of the correlation

* Use both Pearson and Spearman methods

* Plot it

---
## Correlation between Firmicutes and Bacteroidetes

```r
df_phylum_transp_stool %>% 
  select(Firmicutes, Bacteroidetes) %>% 
  bootstraps(times = 500) 
```

---
count: false
## Correlation between Firmicutes and Bacteroidetes

```
 # Bootstrap sampling 
 # A tibble: 500 x 4
 splits id Firmicutes Bacteroidetes
 * <list> <chr> <list> <list> 
 1 <tibble [147 × 2]> Bootstrap001 <dbl [147]> <dbl [147]> 
 2 <tibble [147 × 2]> Bootstrap002 <dbl [147]> <dbl [147]> 
 3 <tibble [147 × 2]> Bootstrap003 <dbl [147]> <dbl [147]> 
 4 <tibble [147 × 2]> Bootstrap004 <dbl [147]> <dbl [147]> 
 5 <tibble [147 × 2]> Bootstrap005 <dbl [147]> <dbl [147]> 
 6 <tibble [147 × 2]> Bootstrap006 <dbl [147]> <dbl [147]> 
 7 <tibble [147 × 2]> Bootstrap007 <dbl [147]> <dbl [147]> 
 8 <tibble [147 × 2]> Bootstrap008 <dbl [147]> <dbl [147]> 
 9 <tibble [147 × 2]> Bootstrap009 <dbl [147]> <dbl [147]> 
 10 <tibble [147 × 2]> Bootstrap010 <dbl [147]> <dbl [147]> 
 # ... with 490 more rows
```

---
count: false
class: nologo
## Correlation between Firmicutes and Bacteroidetes

```r
df_phylum_transp_stool %>% 
  select(Firmicutes, Bacteroidetes) %>% 
  bootstraps(times = 500) %>%
  mutate(splits = map(splits, as_tibble), 
         Firmicutes = map(splits, pull, Firmicutes),
         Bacteroidetes = map(splits, pull, Bacteroidetes),
         Pearson = map2_dbl(Firmicutes, Bacteroidetes, cor),
         Spearman = map2_dbl(Firmicutes, Bacteroidetes, cor, 
                             method = "spearman"))
```

```
 # Bootstrap sampling 
 # A tibble: 500 x 6
 splits id Firmicutes Bacteroidetes Pearson Spearman
 * <list> <chr> <list> <list> <dbl> <dbl>
 1 <tibble [147 × 2… Bootstrap0… <dbl [147… <dbl [147]> -0.489 -0.515
 2 <tibble [147 × 2… Bootstrap0… <dbl [147… <dbl [147]> -0.510 -0.422
 3 <tibble [147 × 2… Bootstrap0… <dbl [147… <dbl [147]> -0.535 -0.530
 4 <tibble [147 × 2… Bootstrap0… <dbl [147… <dbl [147]> -0.714 -0.634
 5 <tibble [147 × 2… Bootstrap0… <dbl [147… <dbl [147]> -0.542 -0.460
 6 <tibble [147 × 2… Bootstrap0… <dbl [147… <dbl [147]> -0.444 -0.456
 7 <tibble [147 × 2… Bootstrap0… <dbl [147… <dbl [147]> -0.633 -0.681
 8 <tibble [147 × 2… Bootstrap0… <dbl [147… <dbl [147]> -0.392 -0.379
 9 <tibble [147 × 2… Bootstrap0… <dbl [147… <dbl [147]> -0.557 -0.382
 10 <tibble [147 × 2… Bootstrap0… <dbl [147… <dbl [147]> -0.543 -0.525
 # ... with 490 more rows
```

---
count: false
class: nologo
## Correlation between Firmicutes and Bacteroidetes

```
 # A tibble: 1,000 x 2
 Method Correlation
 <chr> <dbl>
 1 Pearson -0.489
 2 Pearson -0.510
 3 Pearson -0.535
 4 Pearson -0.714
 5 Pearson -0.542
 6 Pearson -0.444
 7 Pearson -0.633
 8 Pearson -0.392
 9 Pearson -0.557
 10 Pearson -0.543
 # ... with 990 more rows
```

---
## Correlation between Firmicutes and Bacteroidetes

```r
ggplot(df_cor) +
  aes(x = Correlation, fill = Method) +
  geom_density(alpha = 0.5) +
  geom_vline(xintercept = 0, color = "grey")
```

---
class: center, middle, inverse

# References

---

### Purrr

* <a href="https://stackoverflow.com/questions/45101045/why-use-purrrmap-instead-of-lapply/47123420" target="_blank"> Why use purrr::map instead of lapply? (H. Wickham)</a>

### Tibble

* <a href="https://www.rstudio.com/resources/videos/using-list-cols-in-your-dataframe/" target="_blank"> Putting square pegs in round holes: Using list-cols in your dataframe (J. Bryan)</a>

### Tidymodels

* <a href="https://static1.squarespace.com/static/51156277e4b0b8b2ffe11c00/t/5b75871e21c67c985a06f481/1534428959053/RPharma_18_Kuhn.pdf" target="_blank"> Modeling in the Tidyverse (M. Kuhn)</a>

### Metagenomic data

* <a href="http://waldronlab.io/curatedMetagenomicData/" target="_blank"> Curated Metagenomic Data (L. Waldron)</a>

---
class: center, middle, inverse
count:false

# Thanks for you attention!

#### <a href="https://github.com/abichat" target="_blank">@abichat</a>
<div style = "margin-top: -10px"></div>
#### <a href="https://twitter.com/_abichat" target="_blank">@_abichat</a>
<div style = "margin-top: -10px"></div>
#### <a href="https://www.linkedin.com/in/antoinebichat" target="_blank">antoinebichat</a>
<div style = "margin-top: -10px"></div>
#### <a href="https://abichat.github.io" target="_blank">abichat.github.io</a>
<div style = "margin-top: -10px"></div>
#### <a href="mailto:antoine.bichat@mines-nancy.org?subject=SOTR">antoine.bichat@mines-nancy.org</a>