class: title-slide, center, middle count: false <br> # Modeling in the Tidyverse <div style = "margin-top: -40px"></div> ## and applications to metagenomics data <br><br><br><br><br><br><br><br><br><br><br><br> ### Antoine Bichat <div style = "margin-top: -30px"></div> ### November 23, 2018 - AgroParisTech --- class: center, middle, inverse <br> # Tidyverse <img src="img/hex-tidyverse.png", width=270> --- ## Metapackage ```r library(tidyverse) ``` ``` ── Attaching packages ─────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ── ``` ``` ✔ ggplot2 3.1.0 ✔ purrr 0.2.5 ✔ tibble 1.4.2 ✔ dplyr 0.7.8 ✔ tidyr 0.8.2 ✔ stringr 1.3.1 ✔ readr 1.1.1 ✔ forcats 0.3.0 ``` ``` ── Conflicts ────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ── ✖ dplyr::filter() masks stats::filter() ✖ dplyr::lag() masks stats::lag() ``` .footnote[<a href="https://abichat.github.io/slides/introtidyverse" target="_blank">
Introduction to the Tidyverse</a>] -- <br> ```r theme_set(theme_minimal()) options("tibble.max_extra_cols" = 5) ``` --- ## Data analysis pipeline <br> <center> <img src="img/data-science-pipeline.png", height=250> </center> .footnote[<a href="https://r4ds.had.co.nz" target="_blank">
R for data science (G. Grolemund & H. Wickham)</a>] --- class: center, middle, inverse <br> # Dplyr <img src="img/hex-dplyr.png", width=270> --- ## List of dplyr's useful functions .pull-left[ * `mutate()` * `rename()` * `select()` * `arrange()`, `arrange(desc())` <br> * `pull()` <br> * `filter()` * `distinct()` * `top_n()` * `sample_n()`, `sample_frac()` ] -- .pull-right[ * `*_if()`, `*_at()`, `*_all()` <br> * `*_join()` <br> * `group_by()` * `summarise()` <br> * `first()`, `last()`, `nth()` * `n()`, `n_distinct()` <br> * `row_numbers()` * `lead()`, `lag()`, `cummean()` ... ] --- class: center, middle, inverse <br> # Tidyr <img src="img/hex-tidyr.png", width=270> --- ## List of tidyr's useful functions .pull-left[ * `spread()` & `gather()` * `separate()` & `unite()` * `nest()` & `unnest()` * `fill()` ] -- .pull-right[ <img src="img/tidyr-spread-gather.gif", width=300> ] ```r pryr::object_size(iris) ``` ``` 7.2 kB ``` ```r pryr::object_size(gather(iris, key = "Part", value = "Size", -Species)) ``` ``` 13.8 kB ``` .footnote[<a href="https://github.com/gadenbuie/tidyexplain" target="_blank">
Animations of tidyverse verbs (G. Aden-Buie)</a>] --- class: center, middle, inverse <br> # Purrr <img src="img/hex-purrr.png", width=270> --- ## Functionnal programming * Never use `for` loops -- <br> * Forget all `*apply()` functions and friends for lists and dataframes * `apply`, `lapply()`, `sapply()`, `tapply()`, `mapply()`, ... * `Map()`, `Reduce()` * `Vectorize()` -- <br> * Embrace `map()` function and variants * `map()`, `map_dbl()`, `map_chr()`, `map_lgl()`, ... * `map2()`, `map2_dbl()`, `map2_chr()`, `map2_lgl()`, ... * `pmap()`, `pmap_dbl()`, `pmap_chr()`, `pmap_lgl()`, ... * `walk()`, `imap()` and variants --- ## `map*()` <center> <img src="img/map.png", width=400> </center> -- * `map()` applies the function to each element and returns a list -- * `map_lgl()`, `map_int()`, `map_dbl()` and `map_chr()` applies the function to each element and returns a atomic vector of the corresponding type --- ## `map*()` ```r map(1:5, rnorm) ``` ``` [[1]] [1] 1.370958 [[2]] [1] -0.5646982 0.3631284 [[3]] [1] 0.6328626 0.4042683 -0.1061245 [[4]] [1] 1.51152200 -0.09465904 2.01842371 -0.06271410 [[5]] [1] 1.3048697 2.2866454 -1.3888607 -0.2787888 -0.1333213 ``` --- count: false ## `map*()` ```r map(1:5, rnorm) %>% map_dbl(mean) ``` ``` [1] 1.3709584 -0.1007849 0.3103355 0.8431431 0.3581088 ``` --- count: false ## `map*()` ```r map(1:100, rnorm) %>% map_dbl(mean) %>% tibble(Mean = .) %>% # Faster than 1:n() and rownames_to_column("N") mutate(N = row_number()) # Slightly different from rownames_to_column("N") ``` ``` # A tibble: 100 x 2 Mean N <dbl> <int> 1 1.37 1 2 -0.101 2 3 0.310 3 4 0.843 4 5 0.358 5 6 -0.622 6 7 -0.185 7 8 0.0243 8 9 -0.612 9 10 0.314 10 # ... with 90 more rows ``` --- count: false ## `map*()` ```r map(1:100, rnorm) %>% map_dbl(mean) %>% tibble(Mean = .) %>% mutate(N = row_number()) %>% ggplot(aes(x = N, y = Mean)) + geom_hline(yintercept = 0, color = "red", alpha = 0.5) + geom_line() ``` <img src="index_files/figure-html/unnamed-chunk-9-1.png" width="576" style="display: block; margin: auto;" /> --- ## A better way with `compose()` and `map_df()` ```r mean_rnorm <- compose(tibble, mean, rnorm) mean_rnorm(5) ``` ``` # A tibble: 1 x 1 out <dbl> 1 -0.345 ``` ```r map_dfr(1:10, mean_rnorm) %>% rownames_to_column("N") ``` ``` # A tibble: 10 x 2 N out <chr> <dbl> 1 1 1.37 2 2 -0.101 3 3 0.310 4 4 0.843 5 5 0.358 6 6 -0.622 7 7 -0.185 8 8 0.0243 9 9 -0.612 10 10 0.314 ``` --- ## `map*()` <center> <img src="img/map-arg.png", width=400> </center> ```r map(1:4, rnorm, n = 6) ``` ``` [[1]] [1] 2.3709584 0.4353018 1.3631284 1.6328626 1.4042683 0.8938755 [[2]] [1] 3.511522 1.905341 4.018424 1.937286 3.304870 4.286645 [[3]] [1] 1.6111393 2.7212112 2.8666787 3.6359504 2.7157471 0.3435446 [[4]] [1] 1.559533 5.320113 3.693361 2.218692 3.828083 5.214675 ``` --- ## `map*()` <center> <img src="img/map-arg.png", width=400> </center> ```r map(1:4, rnorm, n = 6) %>% map_dbl(function(x) x[2]) ``` ``` [1] 0.4353018 1.9053410 2.7212112 5.3201133 ``` --- count: false ## `map*()` <center> <img src="img/map-arg.png", width=400> </center> ```r map(1:4, rnorm, n = 6) %>% map_dbl(~ .[2]) ``` ``` [1] 0.4353018 1.9053410 2.7212112 5.3201133 ``` --- count: false ## `map*()` <center> <img src="img/map-arg.png", width=400> </center> ```r map(1:4, rnorm, n = 6) %>% map_dbl(2) ``` ``` [1] 0.4353018 1.9053410 2.7212112 5.3201133 ``` --- ## Why `map()` is better .footnote[<a href="https://colinfay.me/happy-dev-purrr/" target="_blank">
Happy dev with {purrr} (C.Fay)</a>] * Stable and consistent grammar ```r apply(X, MARGIN, FUN, ...) lapply(X, FUN, ...) sapply(X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE) tapply(X, INDEX, FUN = NULL, ..., default = NA, simplify = TRUE) mapply(FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE) eapply(env, FUN, ..., all.names = FALSE, USE.NAMES = TRUE) ... ``` -- VS ```r map(.x, .f, ...) map_if(.x, .p, .f, ...) map_chr(.x, .f, ...) map_int(.x, .f, ...) map_dbl(.x, .f, ...) ... ``` --- count: false ## Why `map()` is better .footnote[<a href="https://colinfay.me/happy-dev-purrr/" target="_blank">
Happy dev with {purrr} (C.Fay)</a>] * Stable and consistent grammar * Type stability ```r sapply(iris$Sepal.Length, as.numeric) %>% class() ``` ``` [1] "numeric" ``` ```r sapply(iris$Sepal.Length, as.data.frame) %>% class() ``` ``` [1] "list" ``` -- VS ```r map_dbl(iris$Sepal.Length, as.numeric) %>% class() ``` ``` [1] "numeric" ``` ```r map_df(iris$Sepal.Length, as.data.frame) %>% class() ``` ``` [1] "data.frame" ``` --- count: false ## Why `map()` is better .footnote[<a href="https://colinfay.me/happy-dev-purrr/" target="_blank">
Happy dev with {purrr} (C.Fay)</a>] * Stable and consistent grammar * Type stability * Anonymous functions & verbosity ```r lapply(list, function(x) x + 2) mapply(function(x, y) x + y, list1, list2) lapply(list, function(x) x[[2]]) lapply(list, function(x) x$foo) ``` -- VS ```r map(list, ~ . + 2) map2(list1, list2, ~ .x + .y) map(list, 2) map(list, "foo") ``` --- count: false ## Why `map()` is better .footnote[<a href="https://colinfay.me/happy-dev-purrr/" target="_blank">
Happy dev with {purrr} (C.Fay)</a>] * Stable and consistent grammar * Type stability * Anonymous functions & verbosity * But it's slightly slower... (~ 40 ns per element) --- ## `map2*()` <center> <img src="img/map2.png", width=400> </center> `map2()` and `map2_*()` are variants of `map()` and `map_*()` which work with two arguments --- ## `map2*()` <center> <img src="img/map2-arg.png", width=400> </center> ```r map2(1:4, c(2, 5, 5, 10), runif, n = 5) ``` ``` [[1]] [1] 1.970967 1.618838 1.333427 1.346748 1.398485 [[2]] [1] 4.354078 2.116809 4.246386 4.031830 2.513793 [[3]] [1] 3.522176 4.028826 4.351215 4.965634 4.519089 [[4]] [1] 7.398931 9.098138 5.136844 5.627720 8.968951 ``` --- ## `pmap*()` <center> <img src="img/pmap-3.png", width=400> </center> `pmap()` and `pmap_*()` are generalized versions of `map()` and `map_*()` which work with any number of arguments --- ## `pmap*()` ```r pmap(list(n = c(2, 3, 2, 5), min = 1:4, max = c(2, 5, 5, 10)), runif) ``` ``` [[1]] [1] 1.693205 1.240545 [[2]] [1] 2.128966 2.421437 2.649156 [[3]] [1] 3.958797 3.394821 [[4]] [1] 8.316135 4.047308 6.252940 7.086446 4.009423 ``` --- ## `*walk*()` functions <center> <img src="img/walk.png", width=300> </center> `walk()` and variants work like `map()` but silently apply the function and return the list unchanged Useful for saving files or plotting --- class: nologo ## `imap*()` `imap(x, fun)` is equivalent to `map2(x, names(x), fun)` -- ```r imap(swiss, ~ ggplot(swiss, aes(x = Fertility, y = .x)) + geom_point() + labs(x = "Fertility", y = .y)) %>% gridExtra::grid.arrange(grobs = ., ncol = 3) ``` <img src="index_files/figure-html/unnamed-chunk-29-1.png" width="720" style="display: block; margin: auto;" /> --- count:false class: nologo ## `imap*()` `imap(x, fun)` is equivalent to `map2(x, names(x), fun)` ```r swiss %>% imap(., function(x, y) ggplot(., aes(x = Fertility, y = x)) + geom_point() + labs(x = "Fertility", y = y)) %>% gridExtra::grid.arrange(grobs = ., ncol = 3) ``` <img src="index_files/figure-html/unnamed-chunk-30-1.png" width="720" style="display: block; margin: auto;" /> --- ## `reduce()` <center> <img src="img/reduce.png", width=400> </center> `reduce(list(x1, x2, x3), f)` is equivalent to `f(f(x1, x2), x3)` --- ## `reduce()` ```r rerun(4, sample(1:10, 6)) ``` ``` [[1]] [1] 3 4 5 7 2 8 [[2]] [1] 10 6 9 1 2 7 [[3]] [1] 7 4 10 9 5 6 [[4]] [1] 4 7 8 2 10 1 ``` -- ```r rerun(4, sample(1:10, 6)) %>% reduce(intersect) ``` --- count: false ## `reduce()` ```r rerun(4, sample(1:10, 6)) ``` ``` [[1]] [1] 3 4 5 7 2 8 [[2]] [1] 10 6 9 1 2 7 [[3]] [1] 7 4 10 9 5 6 [[4]] [1] 4 7 8 2 10 1 ``` ```r rerun(4, sample(1:10, 6)) %>% reduce(intersect) ``` ``` [1] 7 ``` --- class: center, middle, inverse <br> # Tibble <img src="img/hex-tibble.png", width=270> --- ## Nice printing ```r mtcars ``` ``` mpg cyl disp hp drat wt qsec vs am gear carb Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 ``` --- count: false ## Nice printing ```r as_tibble(mtcars) ``` ``` # A tibble: 32 x 11 mpg cyl disp hp drat wt qsec vs am gear carb * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 # ... with 22 more rows ``` --- count: false ## Nice printing ... but don't use row names! ```r as_tibble(mtcars, rownames = "model") ``` ``` # A tibble: 32 x 12 model mpg cyl disp hp drat wt qsec vs am gear carb <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 Mazd… 21 6 160 110 3.9 2.62 16.5 0 1 4 4 2 Mazd… 21 6 160 110 3.9 2.88 17.0 0 1 4 4 3 Dats… 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 4 Horn… 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 5 Horn… 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 6 Vali… 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 7 Dust… 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 8 Merc… 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 9 Merc… 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 10 Merc… 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 # ... with 22 more rows ``` --- ## Consistancy in subsetting ```r class(iris[, 1:2]) ``` ``` [1] "data.frame" ``` -- ```r class(iris[, 1]) ``` --- count: false ## Consistancy in subsetting ```r class(iris[, 1:2]) ``` ``` [1] "data.frame" ``` ```r class(iris[, 1]) ``` ``` [1] "numeric" ``` -- <br> ```r iris_tbl <- as_tibble(iris) class(iris_tbl[, 1:2]) ``` ``` [1] "tbl_df" "tbl" "data.frame" ``` ```r class(iris_tbl[, 1]) ``` ``` [1] "tbl_df" "tbl" "data.frame" ``` --- class: center, middle, noslidenumber, nologo <img src="img/tibble_kiddo.png", width=400> --- ## A little function ```r divisors <- function(N){ if(N == 1) {return(1)} if(N == 2) {return(1:2)} c(1, (2:(N/2))[N %% 2:(N/2) == 0], N) } ``` ```r map(4:8, divisors) ``` ``` [[1]] [1] 1 2 4 [[2]] [1] 1 5 [[3]] [1] 1 2 3 6 [[4]] [1] 1 7 [[5]] [1] 1 2 4 8 ``` --- ## List-columns ```r tibble(N = 1:50) ``` ``` # A tibble: 50 x 1 N <int> 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 # ... with 40 more rows ``` --- count: false ## List-columns ```r tibble(N = 1:50) %>% mutate(Divisors = map(N, divisors)) ``` ``` # A tibble: 50 x 2 N Divisors <int> <list> 1 1 <dbl [1]> 2 2 <int [2]> 3 3 <dbl [2]> 4 4 <dbl [3]> 5 5 <dbl [2]> 6 6 <dbl [4]> 7 7 <dbl [2]> 8 8 <dbl [4]> 9 9 <dbl [3]> 10 10 <dbl [4]> # ... with 40 more rows ``` --- count: false ## List-columns ```r tibble(N = 1:50) %>% mutate(Divisors = map(N, divisors)) %>% pull(Divisors) ``` ``` [[1]] [1] 1 [[2]] [1] 1 2 [[3]] [1] 1 3 [[4]] [1] 1 2 4 [[5]] [1] 1 5 [[6]] [1] 1 2 3 6 [[7]] [1] 1 7 [[8]] [1] 1 2 4 8 [[9]] [1] 1 3 9 [[10]] [1] 1 2 5 10 [[11]] [1] 1 11 [[12]] [1] 1 2 3 4 6 12 [[13]] [1] 1 13 [[14]] [1] 1 2 7 14 [[15]] [1] 1 3 5 15 [[16]] [1] 1 2 4 8 16 [[17]] [1] 1 17 [[18]] [1] 1 2 3 6 9 18 [[19]] [1] 1 19 [[20]] [1] 1 2 4 5 10 20 [[21]] [1] 1 3 7 21 [[22]] [1] 1 2 11 22 [[23]] [1] 1 23 [[24]] [1] 1 2 3 4 6 8 12 24 [[25]] [1] 1 5 25 [[26]] [1] 1 2 13 26 [[27]] [1] 1 3 9 27 [[28]] [1] 1 2 4 7 14 28 [[29]] [1] 1 29 [[30]] [1] 1 2 3 5 6 10 15 30 [[31]] [1] 1 31 [[32]] [1] 1 2 4 8 16 32 [[33]] [1] 1 3 11 33 [[34]] [1] 1 2 17 34 [[35]] [1] 1 5 7 35 [[36]] [1] 1 2 3 4 6 9 12 18 36 [[37]] [1] 1 37 [[38]] [1] 1 2 19 38 [[39]] [1] 1 3 13 39 [[40]] [1] 1 2 4 5 8 10 20 40 [[41]] [1] 1 41 [[42]] [1] 1 2 3 6 7 14 21 42 [[43]] [1] 1 43 [[44]] [1] 1 2 4 11 22 44 [[45]] [1] 1 3 5 9 15 45 [[46]] [1] 1 2 23 46 [[47]] [1] 1 47 [[48]] [1] 1 2 3 4 6 8 12 16 24 48 [[49]] [1] 1 7 49 [[50]] [1] 1 2 5 10 25 50 ``` --- count: false ## List-columns ```r tibble(N = 1:50) %>% mutate(Divisors = map(N, divisors), NbDivisors = map(Divisors, length)) ``` -- ``` # A tibble: 50 x 3 N Divisors NbDivisors <int> <list> <list> 1 1 <dbl [1]> <int [1]> 2 2 <int [2]> <int [1]> 3 3 <dbl [2]> <int [1]> 4 4 <dbl [3]> <int [1]> 5 5 <dbl [2]> <int [1]> 6 6 <dbl [4]> <int [1]> 7 7 <dbl [2]> <int [1]> 8 8 <dbl [4]> <int [1]> 9 9 <dbl [3]> <int [1]> 10 10 <dbl [4]> <int [1]> # ... with 40 more rows ``` --- count: false ## List-columns ```r tibble(N = 1:50) %>% mutate(Divisors = map(N, divisors), NbDivisors = map_dbl(Divisors, length)) ``` ``` # A tibble: 50 x 3 N Divisors NbDivisors <int> <list> <dbl> 1 1 <dbl [1]> 1 2 2 <int [2]> 2 3 3 <dbl [2]> 2 4 4 <dbl [3]> 3 5 5 <dbl [2]> 2 6 6 <dbl [4]> 4 7 7 <dbl [2]> 2 8 8 <dbl [4]> 4 9 9 <dbl [3]> 3 10 10 <dbl [4]> 4 # ... with 40 more rows ``` --- count: false ## List-columns ```r tibble(N = 1:50) %>% mutate(Divisors = map(N, divisors), NbDivisors = map_dbl(Divisors, length)) %>% ggplot(aes(x = N, y = NbDivisors)) + geom_col(aes(fill = NbDivisors < 3), color = "grey30") + scale_fill_viridis_d(begin = 0.6) + labs(x = "N", y = "Number of divisors of N") + theme(legend.position = "none") ``` <img src="index_files/figure-html/listcolumnplot-1.png" width="720" style="display: block; margin: auto;" /> --- # `nest()` ```r iris %>% as_tibble() ``` ``` # A tibble: 150 x 5 Sepal.Length Sepal.Width Petal.Length Petal.Width Species <dbl> <dbl> <dbl> <dbl> <fct> 1 5.1 3.5 1.4 0.2 setosa 2 4.9 3 1.4 0.2 setosa 3 4.7 3.2 1.3 0.2 setosa 4 4.6 3.1 1.5 0.2 setosa 5 5 3.6 1.4 0.2 setosa 6 5.4 3.9 1.7 0.4 setosa 7 4.6 3.4 1.4 0.3 setosa 8 5 3.4 1.5 0.2 setosa 9 4.4 2.9 1.4 0.2 setosa 10 4.9 3.1 1.5 0.1 setosa # ... with 140 more rows ``` --- count:false # `nest()` ```r iris %>% group_by(Species) %>% summarise(N = n()) ``` ``` # A tibble: 3 x 2 Species N <fct> <int> 1 setosa 50 2 versicolor 50 3 virginica 50 ``` --- count:false # `nest()` ```r iris %>% group_by(Species) %>% nest() ``` ``` # A tibble: 3 x 2 Species data <fct> <list> 1 setosa <tibble [50 × 4]> 2 versicolor <tibble [50 × 4]> 3 virginica <tibble [50 × 4]> ``` --- count:false # `nest()` ```r iris %>% group_by(Species) %>% nest() %>% pull(data) %>% first() ``` ``` # A tibble: 50 x 4 Sepal.Length Sepal.Width Petal.Length Petal.Width <dbl> <dbl> <dbl> <dbl> 1 5.1 3.5 1.4 0.2 2 4.9 3 1.4 0.2 3 4.7 3.2 1.3 0.2 4 4.6 3.1 1.5 0.2 5 5 3.6 1.4 0.2 6 5.4 3.9 1.7 0.4 7 4.6 3.4 1.4 0.3 8 5 3.4 1.5 0.2 9 4.4 2.9 1.4 0.2 10 4.9 3.1 1.5 0.1 # ... with 40 more rows ``` --- class: center, middle, inverse <br> # Tidymodels <img src="img/hex-tidymodels.png", width=270> --- ## Metapackage ```r library(tidymodels) ``` ``` ── Attaching packages ────────────────────────────────────────────────────────────────────────────── tidymodels 0.0.1 ── ``` ``` ✔ rsample 0.0.2 ✔ yardstick 0.0.2 ✔ recipes 0.1.4 ✔ infer 0.4.0 ✔ broom 0.5.0 ``` ``` ── Conflicts ───────────────────────────────────────────────────────────────────────────────── tidymodels_conflicts() ── ✖ rsample::fill() masks tidyr::fill() ✖ dplyr::filter() masks stats::filter() ✖ recipes::fixed() masks stringr::fixed() ✖ dplyr::lag() masks stats::lag() ✖ recipes::prepper() masks rsample::prepper() ✖ yardstick::spec() masks readr::spec() ✖ recipes::step() masks stats::step() ✖ yardstick::tidy() masks recipes::tidy(), broom::tidy() ``` -- <br> ```r tidy <- broom::tidy ``` --- class: center, middle, inverse <br> # Broom <img src="img/hex-broom.png", width=270> --- ## Dataset ```r df_pines <- read_csv("pines.txt") df_pines ``` ``` # A tibble: 39 x 3 Species Diameter Height <chr> <dbl> <int> 1 White 21.2 127 2 White 20.2 119 3 White 24.6 135 4 White 23 132 5 White 27.2 130 6 White 18.6 130 7 White 17.3 110 8 White 10 75 9 White 19.7 110 10 White 22.3 124 # ... with 29 more rows ``` --- ## Classical output of linear regression ```r reg_simple <- lm(Height ~ Diameter * Species, data = df_pines) reg_simple ``` ``` Call: lm(formula = Height ~ Diameter * Species, data = df_pines) Coefficients: (Intercept) Diameter SpeciesYellow 41.2747 3.9789 -35.8390 Diameter:SpeciesYellow 0.5293 ``` --- count:false ## Classical output of linear regression ```r reg_simple <- lm(Height ~ Diameter * Species, data = df_pines) summary(reg_simple) ``` ``` Call: lm(formula = Height ~ Diameter * Species, data = df_pines) Residuals: Min 1Q Median 3Q Max -19.5014 -8.2424 -0.7899 7.5008 28.3598 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 41.2747 6.8238 6.049 6.66e-07 *** Diameter 3.9789 0.3926 10.136 5.95e-12 *** SpeciesYellow -35.8390 10.5955 -3.382 0.00178 ** Diameter:SpeciesYellow 0.5293 0.5899 0.897 0.37572 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 11.36 on 35 degrees of freedom Multiple R-squared: 0.8744, Adjusted R-squared: 0.8637 F-statistic: 81.25 on 3 and 35 DF, p-value: 7.69e-16 ``` --- count:false ## Classical output of linear regression ```r reg_simple <- lm(Height ~ Diameter * Species, data = df_pines) coef(summary(reg_simple)) ``` ``` Estimate Std. Error t value Pr(>|t|) (Intercept) 41.2747285 6.8237545 6.0486831 6.661673e-07 Diameter 3.9789200 0.3925661 10.1356694 5.951760e-12 SpeciesYellow -35.8390158 10.5955323 -3.3824649 1.781368e-03 Diameter:SpeciesYellow 0.5292551 0.5898638 0.8972496 3.757164e-01 ``` ```r coef(reg_simple) ``` ``` (Intercept) Diameter SpeciesYellow 41.2747285 3.9789200 -35.8390158 Diameter:SpeciesYellow 0.5292551 ``` --- ## `tidy()` Presents the outputs of the model in a tibble ```r tidy(reg_simple) ``` ``` # A tibble: 4 x 5 term estimate std.error statistic p.value <chr> <dbl> <dbl> <dbl> <dbl> 1 (Intercept) 41.3 6.82 6.05 6.66e- 7 2 Diameter 3.98 0.393 10.1 5.95e-12 3 SpeciesYellow -35.8 10.6 -3.38 1.78e- 3 4 Diameter:SpeciesYellow 0.529 0.590 0.897 3.76e- 1 ``` --- ## `glance()` Returns summary informations about the model in a single-row tibble ```r glance(reg_simple) ``` ``` # A tibble: 1 x 11 r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC * <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl> 1 0.874 0.864 11.4 81.2 7.69e-16 4 -148. 306. 314. # ... with 2 more variables: deviance <dbl>, df.residual <int> ``` --- ## `augment()` Adds information about each observation in the dataset ```r (augmented <- augment(reg_simple, df_pines)) ``` ``` # A tibble: 39 x 10 Species Diameter Height .fitted .se.fit .resid .hat .sigma .cooksd * <chr> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 White 21.2 127 126. 3.16 1.37 0.0775 11.5 3.32e-4 2 White 20.2 119 122. 2.94 -2.65 0.0668 11.5 1.04e-3 3 White 24.6 135 139. 4.13 -4.16 0.132 11.5 5.86e-3 4 White 23 132 133. 3.64 -0.790 0.103 11.5 1.55e-4 5 White 27.2 130 150. 4.98 -19.5 0.192 10.9 2.17e-1 6 White 18.6 130 115. 2.65 14.7 0.0545 11.2 2.56e-2 7 White 17.3 110 110. 2.52 -0.110 0.0491 11.5 1.27e-6 8 White 10 75 81.1 3.47 -6.06 0.0935 11.5 8.10e-3 9 White 19.7 110 120. 2.83 -9.66 0.0623 11.4 1.28e-2 10 White 22.3 124 130. 3.45 -6.00 0.0921 11.5 7.81e-3 # ... with 29 more rows, and 1 more variable: .std.resid <dbl> ``` --- ## Graph of residuals ```r augmented %>% arrange(.fitted) %>% mutate(.lowess = lowess(.fitted, .resid)[[2]]) %>% ggplot() + aes(x = .fitted) + geom_hline(yintercept = 0, alpha = 0.5, linetype = "dashed") + geom_line(aes(y = .lowess), color = "saddlebrown") + geom_point(aes(y = .resid, color = Species)) + labs(x = "Fitted values", y = "Residuals") ``` <img src="index_files/figure-html/resid-1.png" width="576" style="display: block; margin: auto;" /> --- ## Quantile-quantile plot ```r ggplot(augmented) + aes(sample = .resid) + geom_qq_line(color = "red") + geom_qq()+ labs(x = "Theoretical quantiles", y = "Standardized residuals") ``` <img src="index_files/figure-html/qq-1.png" width="576" style="display: block; margin: auto;" /> --- ## Leverage ```r augmented %>% arrange(.hat) %>% mutate(.lowess = lowess(.hat, .std.resid)[[2]]) %>% ggplot() + aes(x = .hat) + geom_vline(xintercept = 0, alpha = 0.5, linetype = "dashed") + geom_hline(yintercept = 0, alpha = 0.5, linetype = "dashed") + geom_line(aes(y = .lowess), color = "saddlebrown") + geom_point(aes(y = .std.resid, color = Species)) + labs(x = "Leverage", y = "Standardized residuals") ``` <img src="index_files/figure-html/cook-1.png" width="576" style="display: block; margin: auto;" /> --- ## Cook's distances ```r augmented %>% rowid_to_column("ID") %>% ggplot() + aes(x = ID, y = .cooksd, fill = Species) + geom_col(color = "grey30") + labs(x = "Observation", y = "Cook's Distance") ``` <img src="index_files/figure-html/unnamed-chunk-67-1.png" width="576" style="display: block; margin: auto;" /> --- ## Wilcoxon test ```r w_test <- wilcox.test(Height ~ Species, data = df_pines) w_test ``` ``` Wilcoxon rank sum test with continuity correction data: Height by Species W = 269.5, p-value = 0.0241 alternative hypothesis: true location shift is not equal to 0 ``` -- <br> ```r tidy(w_test) ``` ``` # A tibble: 1 x 4 statistic p.value method alternative <dbl> <dbl> <chr> <chr> 1 270. 0.0241 Wilcoxon rank sum test with continuity co… two.sided ``` --- ## Tidying methods for your package If you want to include `tidy()`, `glance()` and `augment()` methods for your models: 1. Re-export the `tidy()`, `glance()` or `augment()` generics from the {modelgenerics} package 2. Implement appropriate tidying methods in your own package 3. Test the tidying methods -- <br> General guidelines: * Reach a minimum 90% test coverage for new tidiers * `tidy()`, `glance()` and `augment()` methods must return tibbles * Use new tidyverse packages and follow the tidyverse style conventions * ... Complete information are on <a href="https://broom.tidyverse.org/articles/adding-tidiers.html" target="_blank">
Adding new tidiers to broom</a> and <a href="https://broom.tidyverse.org/articles/external-tidiers.html" target="_blank">
Exporting tidying methods from a package</a> --- class: center, middle, inverse <br> # Rsample <img src="img/hex-rsample.png", width=270> --- ## Estimation of parameters ```r bootstrap_pins <- df_pines %>% rowid_to_column("ID") %>% bootstraps(times = 500) bootstrap_pins ``` ``` # Bootstrap sampling # A tibble: 500 x 2 splits id <list> <chr> 1 <S3: rsplit> Bootstrap001 2 <S3: rsplit> Bootstrap002 3 <S3: rsplit> Bootstrap003 4 <S3: rsplit> Bootstrap004 5 <S3: rsplit> Bootstrap005 6 <S3: rsplit> Bootstrap006 7 <S3: rsplit> Bootstrap007 8 <S3: rsplit> Bootstrap008 9 <S3: rsplit> Bootstrap009 10 <S3: rsplit> Bootstrap010 # ... with 490 more rows ``` --- ## `rsplit` object ```r bootstrap_pins %>% pull(splits) %>% first() ``` ``` <39/12/39> ``` --- count:false ## `rsplit` object ```r bootstrap_pins %>% pull(splits) %>% first() %>% as_tibble() ``` ``` # A tibble: 39 x 4 ID Species Diameter Height <int> <chr> <dbl> <int> 1 11 White 5.6 55 2 16 White 14.4 107 3 1 White 21.2 127 4 15 White 12.6 108 5 34 Yellow 12.2 57 6 14 White 13.1 116 7 19 White 7 56 8 24 Yellow 14.4 84 9 20 White 9.3 70 10 8 White 10 75 # ... with 29 more rows ``` --- ## Analysis & assessment sets ```r bootstrap_pins %>% pull(splits) %>% first() %>% as_tibble() %>% # analysis() pull(ID) %>% table() ``` ``` . 1 3 4 5 8 10 11 12 13 14 15 16 17 18 19 20 21 22 24 26 27 28 29 31 33 2 1 1 1 1 1 1 1 1 1 1 2 1 2 3 1 2 1 1 2 1 1 2 3 2 34 36 2 1 ``` -- ```r bootstrap_pins %>% pull(splits) %>% first() %>% as_tibble(data = "assessment") %>% # assessment() pull(ID) %>% table() ``` ``` . 2 6 7 9 23 25 30 32 35 37 38 39 1 1 1 1 1 1 1 1 1 1 1 1 ``` --- ## Bootstrapped models ```r my_lm <- partial(lm, formula = Height ~ Diameter * Species) ``` -- ```r (bootstrap_pins <- bootstrap_pins %>% mutate(reg = map(splits, my_lm))) ``` ``` # Bootstrap sampling # A tibble: 500 x 3 splits id reg * <list> <chr> <list> 1 <S3: rsplit> Bootstrap001 <S3: lm> 2 <S3: rsplit> Bootstrap002 <S3: lm> 3 <S3: rsplit> Bootstrap003 <S3: lm> 4 <S3: rsplit> Bootstrap004 <S3: lm> 5 <S3: rsplit> Bootstrap005 <S3: lm> 6 <S3: rsplit> Bootstrap006 <S3: lm> 7 <S3: rsplit> Bootstrap007 <S3: lm> 8 <S3: rsplit> Bootstrap008 <S3: lm> 9 <S3: rsplit> Bootstrap009 <S3: lm> 10 <S3: rsplit> Bootstrap010 <S3: lm> # ... with 490 more rows ``` --- count: false ## Bootstrapped models ```r my_lm <- partial(lm, formula = Height ~ Diameter * Species) ``` ```r (bootstrap_pins <- bootstrap_pins %>% mutate(reg = map(splits, my_lm), tidied = map(reg, tidy))) ``` ``` # Bootstrap sampling # A tibble: 500 x 4 splits id reg tidied * <list> <chr> <list> <list> 1 <S3: rsplit> Bootstrap001 <S3: lm> <tibble [4 × 5]> 2 <S3: rsplit> Bootstrap002 <S3: lm> <tibble [4 × 5]> 3 <S3: rsplit> Bootstrap003 <S3: lm> <tibble [4 × 5]> 4 <S3: rsplit> Bootstrap004 <S3: lm> <tibble [4 × 5]> 5 <S3: rsplit> Bootstrap005 <S3: lm> <tibble [4 × 5]> 6 <S3: rsplit> Bootstrap006 <S3: lm> <tibble [4 × 5]> 7 <S3: rsplit> Bootstrap007 <S3: lm> <tibble [4 × 5]> 8 <S3: rsplit> Bootstrap008 <S3: lm> <tibble [4 × 5]> 9 <S3: rsplit> Bootstrap009 <S3: lm> <tibble [4 × 5]> 10 <S3: rsplit> Bootstrap010 <S3: lm> <tibble [4 × 5]> # ... with 490 more rows ``` --- count: false ## Bootstrapped models ```r my_lm <- partial(lm, formula = Height ~ Diameter * Species) ``` ```r (bootstrap_pins <- bootstrap_pins %>% mutate(reg = map(splits, my_lm), tidied = map(reg, tidy)) %>% unnest(tidied)) ``` ``` # A tibble: 2,000 x 6 id term estimate std.error statistic p.value <chr> <chr> <dbl> <dbl> <dbl> <dbl> 1 Bootstrap001 (Intercept) 35.0 4.93 7.09 2.95e- 8 2 Bootstrap001 Diameter 4.32 0.299 14.4 2.57e-16 3 Bootstrap001 SpeciesYellow -37.0 8.26 -4.48 7.56e- 5 4 Bootstrap001 Diameter:SpeciesYel… 0.387 0.516 0.750 4.58e- 1 5 Bootstrap002 (Intercept) 49.6 6.64 7.47 9.53e- 9 6 Bootstrap002 Diameter 3.68 0.394 9.32 5.19e-11 7 Bootstrap002 SpeciesYellow -53.4 9.91 -5.38 5.00e- 6 8 Bootstrap002 Diameter:SpeciesYel… 1.37 0.605 2.26 2.99e- 2 9 Bootstrap003 (Intercept) 33.4 7.70 4.34 1.14e- 4 10 Bootstrap003 Diameter 4.31 0.449 9.59 2.52e-11 # ... with 1,990 more rows ``` --- ## Bootstrapped models ```r bootstrap_pins %>% mutate(term = fct_relevel(term, "Diameter:SpeciesYellow", after = Inf)) %>% ggplot(aes(x = estimate, y = term, fill = term)) + ggridges::geom_density_ridges() + theme(legend.position = "none") ``` <img src="index_files/figure-html/unnamed-chunk-80-1.png" width="576" style="display: block; margin: auto;" /> --- ## Other resampling techniques * `initial_split(prop = 3/4)`, `training()`, `testing()` * `loo_cv()` * `mc_cv()` * `vfold_cv()` * `group_vfold_cv()` * `rolling_origin()` * `nested_cv()` -- <br> Since the original data is not modified,
does not make an automatic copy so creating `\(N\)` bootstraps of a data set does not create an object that is `\(N\)`-fold larger in memory. ```r pryr::object_size(bootstraps(mtcars, 100)) / pryr::object_size(mtcars) ``` ``` 15.2 B ``` --- class: center, middle, inverse # Your turn! ## Application to human microbiome --- class: nologo ## Data ```r df_abund <- read_csv("abund_table.txt") df_abund ``` ``` # A tibble: 256 x 654 Clade SRS011086 SRS011090 SRS011098 SRS011111 SRS011115 SRS011126 <chr> <int> <int> <int> <int> <int> <int> 1 k__B… 0 0 0 0 0 0 2 k__B… 0 0 0 0 0 0 3 k__B… 0 0 0 0 0 0 4 k__B… 0 0 0 0 0 0 5 k__B… 0 0 0 0 0 0 6 k__B… 0 0 0 0 0 0 7 k__B… 0 5366 0 0 1182 0 8 k__B… 0 0 0 0 0 0 9 k__B… 0 0 0 0 0 0 10 k__B… 0 0 0 0 0 0 # ... with 246 more rows, and 647 more variables: SRS011140 <int>, # SRS011144 <int>, SRS011152 <int>, SRS011243 <int>, SRS011247 <int>, … ``` ```r df_abund$Clade[1:2] ``` ``` [1] "k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides" [2] "k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Porphyromonadaceae|g__Parabacteroides" ``` --- ## Data ```r df_samples <- read_csv("samples_table.txt") df_samples ``` ``` # A tibble: 653 x 6 Sample Body_site Body_subsite Age Age_category Gender <chr> <chr> <chr> <int> <chr> <chr> 1 SRS011086 oralcavity tongue_dorsum 30 adult female 2 SRS011090 oralcavity buccal_mucosa 30 adult female 3 SRS011098 oralcavity supragingival_plaque 30 adult female 4 SRS011111 oralcavity posterior_fornix 30 adult female 5 SRS011115 oralcavity tongue_dorsum 28 adult male 6 SRS011126 oralcavity supragingival_plaque 28 adult male 7 SRS011140 oralcavity tongue_dorsum 25 adult male 8 SRS011144 oralcavity buccal_mucosa 25 adult male 9 SRS011152 oralcavity supragingival_plaque 25 adult male 10 SRS011243 oralcavity tongue_dorsum 27 adult female # ... with 643 more rows ``` --- ## Create a taxonomic table Like this: ``` # A tibble: 256 x 6 Kingdom Phylum Class Order Family Genus <chr> <chr> <chr> <chr> <chr> <chr> 1 Archaea Euryarcha… Methanobac… Methanobac… Methanobac… Methanobreviba… 2 Archaea Euryarcha… Methanobac… Methanobac… Methanobac… Methanosphaera 3 Bacteria Acidobact… Acidobacte… Acidobacte… Acidobacte… Candidatus_Chl… 4 Bacteria Acidobact… Acidobacte… Acidobacte… Acidobacte… Granulicella 5 Bacteria Acidobact… Acidobacte… Acidobacte… Acidobacte… Terriglobus 6 Bacteria Actinobac… Actinobact… Actinomyce… Actinomyce… Actinobaculum 7 Bacteria Actinobac… Actinobact… Actinomyce… Actinomyce… Actinomyces 8 Bacteria Actinobac… Actinobact… Actinomyce… Actinomyce… Mobiluncus 9 Bacteria Actinobac… Actinobact… Actinomyce… Actinomyce… Varibaculum 10 Bacteria Actinobac… Actinobact… Actinomyce… Brevibacte… Brevibacterium # ... with 246 more rows ``` -- <br> Hint: use `tidyr::separate()` and `stringr` --- count: false ## Create a taxonomic table ```r df_tax <- df_abund %>% select(Clade) %>% separate(Clade, sep = "\\|", into = c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus")) %>% mutate_all(str_remove, ".__") %>% arrange_all() df_tax ``` ``` # A tibble: 256 x 6 Kingdom Phylum Class Order Family Genus <chr> <chr> <chr> <chr> <chr> <chr> 1 Archaea Euryarcha… Methanobac… Methanobac… Methanobac… Methanobreviba… 2 Archaea Euryarcha… Methanobac… Methanobac… Methanobac… Methanosphaera 3 Bacteria Acidobact… Acidobacte… Acidobacte… Acidobacte… Candidatus_Chl… 4 Bacteria Acidobact… Acidobacte… Acidobacte… Acidobacte… Granulicella 5 Bacteria Acidobact… Acidobacte… Acidobacte… Acidobacte… Terriglobus 6 Bacteria Actinobac… Actinobact… Actinomyce… Actinomyce… Actinobaculum 7 Bacteria Actinobac… Actinobact… Actinomyce… Actinomyce… Actinomyces 8 Bacteria Actinobac… Actinobact… Actinomyce… Actinomyce… Mobiluncus 9 Bacteria Actinobac… Actinobact… Actinomyce… Actinomyce… Varibaculum 10 Bacteria Actinobac… Actinobact… Actinomyce… Brevibacte… Brevibacterium # ... with 246 more rows ``` --- ## Put `df_abund` in a tidy way Like this: ``` # A tibble: 167,168 x 3 Genus Sample Abundance <chr> <chr> <dbl> 1 Bacteroides SRS011086 0 2 Parabacteroides SRS011086 0 3 Alistipes SRS011086 0 4 Burkholderiales_noname SRS011086 0 5 Faecalibacterium SRS011086 0 6 Roseburia SRS011086 0 7 Subdoligranulum SRS011086 0 8 Bacteroidales_noname SRS011086 0 9 Parasutterella SRS011086 0 10 Oscillibacter SRS011086 0 # ... with 167,158 more rows ``` -- <br> Hint: use a regular expression and `tidyr::gather()` --- count: false ## Put `df_abund` in a tidy way ```r df_abund_long <- df_abund %>% mutate(Genus = str_remove(Clade, ".*__")) %>% select(-Clade) %>% gather(key = "Sample", value = "Abundance", -Genus) df_abund_long ``` ``` # A tibble: 167,168 x 3 Genus Sample Abundance <chr> <chr> <dbl> 1 Bacteroides SRS011086 0 2 Parabacteroides SRS011086 0 3 Alistipes SRS011086 0 4 Burkholderiales_noname SRS011086 0 5 Faecalibacterium SRS011086 0 6 Roseburia SRS011086 0 7 Subdoligranulum SRS011086 0 8 Bacteroidales_noname SRS011086 0 9 Parasutterella SRS011086 0 10 Oscillibacter SRS011086 0 # ... with 167,158 more rows ``` --- ## Remove genera present in less than 25 samples -- ```r df_abund_filtered <- df_abund_long %>% group_by(Genus) %>% mutate(Preval = sum(Abundance > 0)) %>% filter(Preval > 25) %>% select(-Preval) %>% ungroup() df_abund_filtered ``` ``` # A tibble: 85,543 x 3 Genus Sample Abundance <chr> <chr> <dbl> 1 Bacteroides SRS011086 0 2 Parabacteroides SRS011086 0 3 Alistipes SRS011086 0 4 Burkholderiales_noname SRS011086 0 5 Faecalibacterium SRS011086 0 6 Roseburia SRS011086 0 7 Subdoligranulum SRS011086 0 8 Bacteroidales_noname SRS011086 0 9 Parasutterella SRS011086 0 10 Oscillibacter SRS011086 0 # ... with 85,533 more rows ``` --- ## Find differentially abundant genera <br> * Between subsites `buccal_mucosa` and `supragingival_plaque` <br> * With a Wilcowon-text: `wilcox.test()` <br> * Corrected for false discovery rate: `p.adjust(method = "fdr")` --- ## Find differentially abundant genera ```r df_abund_filtered %>% left_join(df_samples, by = "Sample") %>% filter(Body_subsite %in% c("buccal_mucosa", "supragingival_plaque")) %>% select(Genus, Body_subsite, Abundance) ``` ``` # A tibble: 32,357 x 3 Genus Body_subsite Abundance <chr> <chr> <dbl> 1 Bacteroides buccal_mucosa 0 2 Parabacteroides buccal_mucosa 0 3 Alistipes buccal_mucosa 0 4 Burkholderiales_noname buccal_mucosa 0 5 Faecalibacterium buccal_mucosa 0 6 Roseburia buccal_mucosa 0 7 Subdoligranulum buccal_mucosa 5366 8 Bacteroidales_noname buccal_mucosa 0 9 Parasutterella buccal_mucosa 0 10 Oscillibacter buccal_mucosa 0 # ... with 32,347 more rows ``` --- count: false ## Find differentially abundant genera ```r df_abund_filtered %>% left_join(df_samples, by = "Sample") %>% filter(Body_subsite %in% c("buccal_mucosa", "supragingival_plaque")) %>% select(Genus, Body_subsite, Abundance) %>% group_by(Genus) %>% nest() ``` ``` # A tibble: 131 x 2 Genus data <chr> <list> 1 Bacteroides <tibble [247 × 2]> 2 Parabacteroides <tibble [247 × 2]> 3 Alistipes <tibble [247 × 2]> 4 Burkholderiales_noname <tibble [247 × 2]> 5 Faecalibacterium <tibble [247 × 2]> 6 Roseburia <tibble [247 × 2]> 7 Subdoligranulum <tibble [247 × 2]> 8 Bacteroidales_noname <tibble [247 × 2]> 9 Parasutterella <tibble [247 × 2]> 10 Oscillibacter <tibble [247 × 2]> # ... with 121 more rows ``` --- count: false ## Find differentially abundant genera ```r df_abund_filtered %>% left_join(df_samples, by = "Sample") %>% filter(Body_subsite %in% c("buccal_mucosa", "supragingival_plaque")) %>% select(Genus, Body_subsite, Abundance) %>% group_by(Genus) %>% nest() %>% mutate(wt = map(data, ~ wilcox.test(formula = Abundance ~ Body_subsite, data = .))) ``` ``` # A tibble: 131 x 3 Genus data wt <chr> <list> <list> 1 Bacteroides <tibble [247 × 2]> <S3: htest> 2 Parabacteroides <tibble [247 × 2]> <S3: htest> 3 Alistipes <tibble [247 × 2]> <S3: htest> 4 Burkholderiales_noname <tibble [247 × 2]> <S3: htest> 5 Faecalibacterium <tibble [247 × 2]> <S3: htest> 6 Roseburia <tibble [247 × 2]> <S3: htest> 7 Subdoligranulum <tibble [247 × 2]> <S3: htest> 8 Bacteroidales_noname <tibble [247 × 2]> <S3: htest> 9 Parasutterella <tibble [247 × 2]> <S3: htest> 10 Oscillibacter <tibble [247 × 2]> <S3: htest> # ... with 121 more rows ``` --- count: false class: nologo ## Find differentially abundant genera ```r df_abund_filtered %>% left_join(df_samples, by = "Sample") %>% filter(Body_subsite %in% c("buccal_mucosa", "supragingival_plaque")) %>% select(Genus, Body_subsite, Abundance) %>% group_by(Genus) %>% nest() %>% mutate(wt = map(data, ~ wilcox.test(formula = Abundance ~ Body_subsite, data = .)), tidied = map(wt, tidy)) ``` ``` # A tibble: 131 x 4 Genus data wt tidied <chr> <list> <list> <list> 1 Bacteroides <tibble [247 × 2]> <S3: htest> <tibble [1 × 4]> 2 Parabacteroides <tibble [247 × 2]> <S3: htest> <tibble [1 × 4]> 3 Alistipes <tibble [247 × 2]> <S3: htest> <tibble [1 × 4]> 4 Burkholderiales_noname <tibble [247 × 2]> <S3: htest> <tibble [1 × 4]> 5 Faecalibacterium <tibble [247 × 2]> <S3: htest> <tibble [1 × 4]> 6 Roseburia <tibble [247 × 2]> <S3: htest> <tibble [1 × 4]> 7 Subdoligranulum <tibble [247 × 2]> <S3: htest> <tibble [1 × 4]> 8 Bacteroidales_noname <tibble [247 × 2]> <S3: htest> <tibble [1 × 4]> 9 Parasutterella <tibble [247 × 2]> <S3: htest> <tibble [1 × 4]> 10 Oscillibacter <tibble [247 × 2]> <S3: htest> <tibble [1 × 4]> # ... with 121 more rows ``` --- count: false class: nologo ## Find differentially abundant genera ```r df_abund_filtered %>% left_join(df_samples, by = "Sample") %>% filter(Body_subsite %in% c("buccal_mucosa", "supragingival_plaque")) %>% select(Genus, Body_subsite, Abundance) %>% group_by(Genus) %>% nest() %>% mutate(wt = map(data, ~ wilcox.test(formula = Abundance ~ Body_subsite, data = .)), tidied = map(wt, tidy)) %>% unnest(tidied, .drop = TRUE) ``` ``` # A tibble: 131 x 5 Genus statistic p.value method alternative <chr> <dbl> <dbl> <chr> <chr> 1 Bacteroides 8248 0.142 Wilcoxon rank sum test wi… two.sided 2 Parabacteroid… 7722. 0.651 Wilcoxon rank sum test wi… two.sided 3 Alistipes 8073 0.127 Wilcoxon rank sum test wi… two.sided 4 Burkholderial… 7621 0.959 Wilcoxon rank sum test wi… two.sided 5 Faecalibacter… 7645 0.882 Wilcoxon rank sum test wi… two.sided 6 Roseburia 7754 0.358 Wilcoxon rank sum test wi… two.sided 7 Subdoligranul… 7824 0.579 Wilcoxon rank sum test wi… two.sided 8 Bacteroidales… 7808 0.0720 Wilcoxon rank sum test wi… two.sided 9 Parasutterella 7621 0.959 Wilcoxon rank sum test wi… two.sided 10 Oscillibacter 7684. 0.529 Wilcoxon rank sum test wi… two.sided # ... with 121 more rows ``` --- count: false class: nologo ## Find differentially abundant genera ```r df_abund_filtered %>% left_join(df_samples, by = "Sample") %>% filter(Body_subsite %in% c("buccal_mucosa", "supragingival_plaque")) %>% select(Genus, Body_subsite, Abundance) %>% group_by(Genus) %>% nest() %>% mutate(wt = map(data, ~ wilcox.test(formula = Abundance ~ Body_subsite, data = .)), tidied = map(wt, tidy)) %>% unnest(tidied, .drop = TRUE) %>% mutate(q.value = p.adjust(p.value, method = "fdr")) %>% select(Genus, p.value, q.value) %>% filter(q.value < 0.05) %>% arrange(q.value) ``` ``` # A tibble: 55 x 3 Genus p.value q.value <chr> <dbl> <dbl> 1 Corynebacterium 4.99e-37 2.87e-35 2 Cardiobacterium 2.74e-37 2.87e-35 3 Actinomyces 1.27e-36 4.85e-35 4 Capnocytophaga 3.38e-36 9.72e-35 5 Kingella 8.21e-36 1.89e-34 6 Eikenella 2.78e-35 5.33e-34 7 Actinobaculum 1.75e-33 2.87e-32 8 Propionibacterium 2.14e-33 3.08e-32 9 Peptostreptococcaceae_noname 2.88e-31 3.68e-30 10 Campylobacter 8.68e-31 9.98e-30 # ... with 45 more rows ``` --- ## Make a PCA with samples as observations ```r df_abund_transp <- df_abund_filtered %>% spread(key = Genus, value = Abundance) ``` -- <br> * Don't forget to remove the `Sample` column <br> * Use `prcomp()` for PCA (there is no `broom` methods for `FactoMineR` functions yet) <br> * Set `scale. = TRUE` <br> * Keep only the four first principal components --- ## Make a PCA with samples as observations ```r df_abund_transp %>% select(-Sample) %>% prcomp(scale. = TRUE) %>% augment(df_abund_transp) %>% select(Sample, .fittedPC1:.fittedPC4) ``` ``` # A tibble: 653 x 5 Sample .fittedPC1 .fittedPC2 .fittedPC3 .fittedPC4 <chr> <dbl> <dbl> <dbl> <dbl> 1 SRS011061 4.34 0.430 0.751 -2.60 2 SRS011084 10.3 1.45 4.48 -1.19 3 SRS011086 -1.51 -2.67 -0.586 0.726 4 SRS011090 0.499 -0.493 -2.51 -0.314 5 SRS011098 -1.51 1.42 -1.68 2.12 6 SRS011111 0.196 -2.31 -2.72 -3.91 7 SRS011115 -9.51 -4.81 9.64 5.33 8 SRS011126 -6.38 8.50 2.13 4.04 9 SRS011134 6.27 0.489 2.17 -0.815 10 SRS011140 -2.23 -2.54 0.111 1.86 # ... with 643 more rows ``` --- count: false ## Make a PCA with samples as observations ```r df_abund_transp %>% select(-Sample) %>% prcomp(scale. = TRUE) %>% augment(df_abund_transp) %>% select(Sample, .fittedPC1:.fittedPC4) %>% left_join(df_samples, by = "Sample") -> df_PCA df_PCA ``` ``` # A tibble: 653 x 10 Sample .fittedPC1 .fittedPC2 .fittedPC3 .fittedPC4 Body_site <chr> <dbl> <dbl> <dbl> <dbl> <chr> 1 SRS01… 4.34 0.430 0.751 -2.60 stool 2 SRS01… 10.3 1.45 4.48 -1.19 stool 3 SRS01… -1.51 -2.67 -0.586 0.726 oralcavi… 4 SRS01… 0.499 -0.493 -2.51 -0.314 oralcavi… 5 SRS01… -1.51 1.42 -1.68 2.12 oralcavi… 6 SRS01… 0.196 -2.31 -2.72 -3.91 oralcavi… 7 SRS01… -9.51 -4.81 9.64 5.33 oralcavi… 8 SRS01… -6.38 8.50 2.13 4.04 oralcavi… 9 SRS01… 6.27 0.489 2.17 -0.815 stool 10 SRS01… -2.23 -2.54 0.111 1.86 oralcavi… # ... with 643 more rows, and 4 more variables: Body_subsite <chr>, # Age <int>, Age_category <chr>, Gender <chr> ``` --- ## Make a PCA with samples as observations ```r ggplot(df_PCA) + aes(x = .fittedPC1, y = .fittedPC2, color = Body_site) + geom_point(alpha = 0.6) ``` <img src="index_files/figure-html/unnamed-chunk-98-1.png" width="576" style="display: block; margin: auto;" /> --- count: false ## Make a PCA with samples as observations ```r ggplot(df_PCA) + aes(x = .fittedPC1, y = .fittedPC2, color = Body_subsite) + geom_point(alpha = 0.6) ``` <img src="index_files/figure-html/unnamed-chunk-99-1.png" width="576" style="display: block; margin: auto;" /> --- ## Compute k-means for several K ```r tibble(K = 1:10) ``` ``` # A tibble: 10 x 1 K <int> 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 ``` --- count: false ## Compute k-means for several K ```r tibble(K = 1:10) %>% mutate(kmeans = map(K, kmeans, nstart = 10, x = select(df_abund_transp, -Sample))) ``` ``` # A tibble: 10 x 2 K kmeans <int> <list> 1 1 <S3: kmeans> 2 2 <S3: kmeans> 3 3 <S3: kmeans> 4 4 <S3: kmeans> 5 5 <S3: kmeans> 6 6 <S3: kmeans> 7 7 <S3: kmeans> 8 8 <S3: kmeans> 9 9 <S3: kmeans> 10 10 <S3: kmeans> ``` --- count: false ## Compute k-means for several K ```r tibble(K = 1:10) %>% mutate(kmeans = map(K, kmeans, nstart = 10, x = select(df_abund_transp, -Sample)), tidied = map(kmeans, tidy), glanced = map(kmeans, glance), augmented = map(kmeans, augment, df_PCA)) ``` ``` # A tibble: 10 x 5 K kmeans tidied glanced augmented <int> <list> <list> <list> <list> 1 1 <S3: kmeans> <tibble [1 × 134]> <tibble [1 × 4… <tibble [653 × 1… 2 2 <S3: kmeans> <tibble [2 × 134]> <tibble [1 × 4… <tibble [653 × 1… 3 3 <S3: kmeans> <tibble [3 × 134]> <tibble [1 × 4… <tibble [653 × 1… 4 4 <S3: kmeans> <tibble [4 × 134]> <tibble [1 × 4… <tibble [653 × 1… 5 5 <S3: kmeans> <tibble [5 × 134]> <tibble [1 × 4… <tibble [653 × 1… 6 6 <S3: kmeans> <tibble [6 × 134]> <tibble [1 × 4… <tibble [653 × 1… 7 7 <S3: kmeans> <tibble [7 × 134]> <tibble [1 × 4… <tibble [653 × 1… 8 8 <S3: kmeans> <tibble [8 × 134]> <tibble [1 × 4… <tibble [653 × 1… 9 9 <S3: kmeans> <tibble [9 × 134]> <tibble [1 × 4… <tibble [653 × 1… 10 10 <S3: kmeans> <tibble [10 × 134… <tibble [1 × 4… <tibble [653 × 1… ``` --- count: false ## Compute k-means for several K ```r tibble(K = 1:10) %>% mutate(kmeans = map(K, kmeans, nstart = 10, x = select(df_abund_transp, -Sample)), tidied = map(kmeans, tidy), glanced = map(kmeans, glance), augmented = map(kmeans, augment, df_PCA)) %>% unnest(glanced) -> df_kmeans df_kmeans ``` ``` # A tibble: 10 x 8 K kmeans tidied augmented totss tot.withinss betweenss iter <int> <list> <list> <list> <dbl> <dbl> <dbl> <int> 1 1 <S3: km… <tibble… <tibble [… 7.55e17 7.55e17 -2.42e 4 1 2 2 <S3: km… <tibble… <tibble [… 7.55e17 3.79e17 3.76e17 1 3 3 <S3: km… <tibble… <tibble [… 7.55e17 3.15e17 4.40e17 2 4 4 <S3: km… <tibble… <tibble [… 7.55e17 2.73e17 4.82e17 2 5 5 <S3: km… <tibble… <tibble [… 7.55e17 2.54e17 5.01e17 2 6 6 <S3: km… <tibble… <tibble [… 7.55e17 2.35e17 5.20e17 3 7 7 <S3: km… <tibble… <tibble [… 7.55e17 2.13e17 5.42e17 4 8 8 <S3: km… <tibble… <tibble [… 7.55e17 2.02e17 5.53e17 4 9 9 <S3: km… <tibble… <tibble [… 7.55e17 1.84e17 5.71e17 4 10 10 <S3: km… <tibble… <tibble [… 7.55e17 1.78e17 5.77e17 5 ``` --- ## Compute k-means for several K ```r ggplot(df_kmeans) + aes(x = K, y = betweenss) + geom_line(color = "grey30") + geom_point(color = "tomato") ``` <img src="index_files/figure-html/unnamed-chunk-104-1.png" width="576" style="display: block; margin: auto;" /> --- ## Plots of k-means on the principal plane ```r df_kmeans %>% filter(K %in% 2:4) ``` ``` # A tibble: 3 x 8 K kmeans tidied augmented totss tot.withinss betweenss iter <int> <list> <list> <list> <dbl> <dbl> <dbl> <int> 1 2 <S3: km… <tibble… <tibble [6… 7.55e17 3.79e17 3.76e17 1 2 3 <S3: km… <tibble… <tibble [6… 7.55e17 3.15e17 4.40e17 2 3 4 <S3: km… <tibble… <tibble [6… 7.55e17 2.73e17 4.82e17 2 ``` --- count: false ## Plots of k-means on the principal plane ```r df_kmeans %>% filter(K %in% 2:4) %>% select(K, augmented) %>% unnest(augmented) -> df_clust df_clust ``` ``` # A tibble: 1,959 x 12 K Sample .fittedPC1 .fittedPC2 .fittedPC3 .fittedPC4 Body_site <int> <chr> <dbl> <dbl> <dbl> <dbl> <chr> 1 2 SRS01… 4.34 0.430 0.751 -2.60 stool 2 2 SRS01… 10.3 1.45 4.48 -1.19 stool 3 2 SRS01… -1.51 -2.67 -0.586 0.726 oralcavi… 4 2 SRS01… 0.499 -0.493 -2.51 -0.314 oralcavi… 5 2 SRS01… -1.51 1.42 -1.68 2.12 oralcavi… 6 2 SRS01… 0.196 -2.31 -2.72 -3.91 oralcavi… 7 2 SRS01… -9.51 -4.81 9.64 5.33 oralcavi… 8 2 SRS01… -6.38 8.50 2.13 4.04 oralcavi… 9 2 SRS01… 6.27 0.489 2.17 -0.815 stool 10 2 SRS01… -2.23 -2.54 0.111 1.86 oralcavi… # ... with 1,949 more rows, and 5 more variables: Body_subsite <chr>, # Age <int>, Age_category <chr>, Gender <chr>, .cluster <chr> ``` --- class: nologo ## Plots of k-means on the principal plane ```r ggplot(df_clust) + aes(x = .fittedPC1, y = .fittedPC2, color = .cluster, shape = Body_site) + geom_point(alpha = 0.6) + scale_shape_manual(values = 0:2) + facet_grid(. ~ K) + theme(legend.position = "bottom") ``` <img src="index_files/figure-html/unnamed-chunk-107-1.png" width="792" style="display: block; margin: auto;" /> --- ## Contingency tables of the clusterings * Use the `janitor::tabyl()` function to create a contingency table by K -- ```r df_clust %>% nest(-K) ``` ``` # A tibble: 3 x 2 K data <int> <list> 1 2 <tibble [653 × 11]> 2 3 <tibble [653 × 11]> 3 4 <tibble [653 × 11]> ``` --- count: false ## Contingency tables of the clusterings * Use the `janitor::tabyl()` function to create a contingency table by K ```r df_clust %>% nest(-K) %>% mutate(CT = map(data, janitor::tabyl, Body_site, .cluster)) ``` ``` # A tibble: 3 x 3 K data CT <int> <list> <list> 1 2 <tibble [653 × 11]> <tabyl [3 × 3]> 2 3 <tibble [653 × 11]> <tabyl [3 × 4]> 3 4 <tibble [653 × 11]> <tabyl [3 × 5]> ``` --- count: false ## Contingency tables of the clusterings * Use the `janitor::tabyl()` function to create a contingency table by K ```r df_clust %>% nest(-K) %>% mutate(CT = map(data, janitor::tabyl, .cluster, Body_site)) %>% pull(CT) ``` ``` [[1]] .cluster oralcavity skin stool 1 0 0 114 2 479 27 33 [[2]] .cluster oralcavity skin stool 1 347 27 28 2 0 0 113 3 132 0 6 [[3]] .cluster oralcavity skin stool 1 0 0 63 2 0 0 79 3 344 27 5 4 135 0 0 ``` --- ## Agregate the abundance at phylum level Like this: ``` # A tibble: 17 x 654 Phylum SRS011086 SRS011090 SRS011098 SRS011111 SRS011115 SRS011126 <chr> <int> <int> <int> <int> <int> <int> 1 Acido… 0 0 0 0 0 0 2 Actin… 19795422 261680 40149371 70894509 17912678 36577594 3 Ascom… 0 0 0 0 0 0 4 Bacte… 5173832 161067 4307084 1532049 31986039 19951810 5 Basid… 0 0 0 0 0 0 6 Candi… 0 0 4925 0 59724 3161196 7 Chlor… 0 0 0 0 0 0 8 Deino… 0 0 0 0 0 0 9 Eurya… 0 0 0 0 0 0 10 Firmi… 52604350 2118788 4312759 10592022 47823300 8196029 11 Fusob… 319848 122811 1206832 1387 19118873 7588485 12 Prote… 3646454 1066799 5391435 0 49033824 31115434 13 Spiro… 0 0 214 0 114106 1021118 14 Syner… 0 0 0 0 0 0 15 Tener… 0 0 0 36345 345065 0 16 Verru… 0 0 0 0 0 0 17 Virus… 124898 126218 1041431 0 15143 430516 # ... with 647 more variables: SRS011140 <int>, SRS011144 <int>, # SRS011152 <int>, SRS011243 <int>, SRS011247 <int>, … ``` ``` # A tibble: 17 x 654 Phylum SRS011086 SRS011090 SRS011098 SRS011111 SRS011115 SRS011126 <chr> <int> <int> <int> <int> <int> <int> 1 Acido… 0 0 0 0 0 0 2 Actin… 19795422 261680 40149371 70894509 17912678 36577594 3 Ascom… 0 0 0 0 0 0 4 Bacte… 5173832 161067 4307084 1532049 31986039 19951810 5 Basid… 0 0 0 0 0 0 6 Candi… 0 0 4925 0 59724 3161196 7 Chlor… 0 0 0 0 0 0 8 Deino… 0 0 0 0 0 0 9 Eurya… 0 0 0 0 0 0 10 Firmi… 52604350 2118788 4312759 10592022 47823300 8196029 11 Fusob… 319848 122811 1206832 1387 19118873 7588485 12 Prote… 3646454 1066799 5391435 0 49033824 31115434 13 Spiro… 0 0 214 0 114106 1021118 14 Syner… 0 0 0 0 0 0 15 Tener… 0 0 0 36345 345065 0 16 Verru… 0 0 0 0 0 0 17 Virus… 124898 126218 1041431 0 15143 430516 # ... with 647 more variables: SRS011140 <int>, SRS011144 <int>, # SRS011152 <int>, SRS011243 <int>, SRS011247 <int>, … ``` --- count: false ## Agregate the abundance at phylum level ```r df_abund %>% mutate(Clade = str_remove(Clade, ".*__")) %>% rename(Genus = Clade) ``` ``` # A tibble: 256 x 654 Genus SRS011086 SRS011090 SRS011098 SRS011111 SRS011115 SRS011126 <chr> <int> <int> <int> <int> <int> <int> 1 Bact… 0 0 0 0 0 0 2 Para… 0 0 0 0 0 0 3 Alis… 0 0 0 0 0 0 4 Burk… 0 0 0 0 0 0 5 Faec… 0 0 0 0 0 0 6 Rose… 0 0 0 0 0 0 7 Subd… 0 5366 0 0 1182 0 8 Bact… 0 0 0 0 0 0 9 Para… 0 0 0 0 0 0 10 Osci… 0 0 0 0 0 0 # ... with 246 more rows, and 647 more variables: SRS011140 <int>, # SRS011144 <int>, SRS011152 <int>, SRS011243 <int>, SRS011247 <int>, … ``` --- count: false ## Agregate the abundance at phylum level ```r df_abund %>% mutate(Clade = str_remove(Clade, ".*__")) %>% rename(Genus = Clade)%>% left_join(df_tax, ., by = "Genus") ``` ``` # A tibble: 256 x 659 Kingdom Phylum Class Order Family Genus SRS011086 SRS011090 SRS011098 <chr> <chr> <chr> <chr> <chr> <chr> <int> <int> <int> 1 Archaea Eurya… Meth… Meth… Metha… Meth… 0 0 0 2 Archaea Eurya… Meth… Meth… Metha… Meth… 0 0 0 3 Bacter… Acido… Acid… Acid… Acido… Cand… 0 0 0 4 Bacter… Acido… Acid… Acid… Acido… Gran… 0 0 0 5 Bacter… Acido… Acid… Acid… Acido… Terr… 0 0 0 6 Bacter… Actin… Acti… Acti… Actin… Acti… 4949 0 810698 7 Bacter… Actin… Acti… Acti… Actin… Acti… 9596511 105544 6691711 8 Bacter… Actin… Acti… Acti… Actin… Mobi… 0 0 0 9 Bacter… Actin… Acti… Acti… Actin… Vari… 0 0 0 10 Bacter… Actin… Acti… Acti… Brevi… Brev… 15933 0 1630 # ... with 246 more rows, and 650 more variables: SRS011111 <int>, # SRS011115 <int>, SRS011126 <int>, SRS011140 <int>, SRS011144 <int>, … ``` --- count: false ## Agregate the abundance at phylum level ```r df_abund %>% mutate(Clade = str_remove(Clade, ".*__")) %>% rename(Genus = Clade)%>% left_join(df_tax, ., by = "Genus") %>% select(-c(Kingdom, Class:Genus)) ``` ``` # A tibble: 256 x 654 Phylum SRS011086 SRS011090 SRS011098 SRS011111 SRS011115 SRS011126 <chr> <int> <int> <int> <int> <int> <int> 1 Eurya… 0 0 0 0 0 0 2 Eurya… 0 0 0 0 0 0 3 Acido… 0 0 0 0 0 0 4 Acido… 0 0 0 0 0 0 5 Acido… 0 0 0 0 0 0 6 Actin… 4949 0 810698 0 33598 725860 7 Actin… 9596511 105544 6691711 0 11614891 7299386 8 Actin… 0 0 0 142209 0 0 9 Actin… 0 0 0 4610 0 0 10 Actin… 15933 0 1630 0 10634 1826 # ... with 246 more rows, and 647 more variables: SRS011140 <int>, # SRS011144 <int>, SRS011152 <int>, SRS011243 <int>, SRS011247 <int>, … ``` --- count: false ## Agregate the abundance at phylum level ```r df_abund %>% mutate(Clade = str_remove(Clade, ".*__")) %>% rename(Genus = Clade)%>% left_join(df_tax, ., by = "Genus") %>% select(-c(Kingdom, Class:Genus)) %>% group_by(Phylum) %>% summarise_all(sum) -> df_phylum df_phylum ``` ``` # A tibble: 17 x 654 Phylum SRS011086 SRS011090 SRS011098 SRS011111 SRS011115 SRS011126 <chr> <int> <int> <int> <int> <int> <int> 1 Acido… 0 0 0 0 0 0 2 Actin… 19795422 261680 40149371 70894509 17912678 36577594 3 Ascom… 0 0 0 0 0 0 4 Bacte… 5173832 161067 4307084 1532049 31986039 19951810 5 Basid… 0 0 0 0 0 0 6 Candi… 0 0 4925 0 59724 3161196 7 Chlor… 0 0 0 0 0 0 8 Deino… 0 0 0 0 0 0 9 Eurya… 0 0 0 0 0 0 10 Firmi… 52604350 2118788 4312759 10592022 47823300 8196029 11 Fusob… 319848 122811 1206832 1387 19118873 7588485 12 Prote… 3646454 1066799 5391435 0 49033824 31115434 13 Spiro… 0 0 214 0 114106 1021118 14 Syner… 0 0 0 0 0 0 15 Tener… 0 0 0 36345 345065 0 16 Verru… 0 0 0 0 0 0 17 Virus… 124898 126218 1041431 0 15143 430516 # ... with 647 more variables: SRS011140 <int>, SRS011144 <int>, # SRS011152 <int>, SRS011243 <int>, SRS011247 <int>, … ``` --- ## Transpose the dataset and choose stool samples -- ```r (df_phylum_transp_stool <- df_phylum %>% gather(key = "Sample", value = "Abundance", -Phylum) %>% spread(key = "Phylum", value = "Abundance") %>% left_join(df_samples, by = "Sample") %>% filter(Body_site == "stool")) ``` ``` # A tibble: 147 x 23 Sample Acidobacteria Actinobacteria Ascomycota Bacteroidetes <chr> <dbl> <dbl> <dbl> <dbl> 1 SRS01… 0 37836 0 82049580 2 SRS01… 0 925978 0 199905409 3 SRS01… 0 80513 0 74422158 4 SRS01… 0 1101941 1825 112432267 5 SRS01… 0 277822 0 106439366 6 SRS01… 0 547275 0 58763094 7 SRS01… 0 20241 0 119708033 8 SRS01… 0 47477 0 33479123 9 SRS01… 0 416032 0 111204570 10 SRS01… 0 138253 0 76713977 # ... with 137 more rows, and 18 more variables: Basidiomycota <dbl>, # Candidatus_Saccharibacteria <dbl>, Chloroflexi <dbl>, # Deinococcus_Thermus <dbl>, Euryarchaeota <dbl>, … ``` --- ## Correlation between Firmicutes and Bacteroidetes ```r ggplot(df_phylum_transp_stool) + aes(x = Firmicutes, y = Bacteroidetes) + geom_point(alpha = 0.7) + geom_smooth(method = "lm") ``` <img src="index_files/figure-html/unnamed-chunk-117-1.png" width="576" style="display: block; margin: auto;" /> --- ## Correlation between Firmicutes and Bacteroidetes * Create 500 bootstrapped datasets to find the distribution of the correlation <br> * Use both Pearson and Spearman methods <br> * Plot it --- ## Correlation between Firmicutes and Bacteroidetes ```r df_phylum_transp_stool %>% select(Firmicutes, Bacteroidetes) %>% bootstraps(times = 500) ``` ``` # Bootstrap sampling # A tibble: 500 x 2 splits id <list> <chr> 1 <S3: rsplit> Bootstrap001 2 <S3: rsplit> Bootstrap002 3 <S3: rsplit> Bootstrap003 4 <S3: rsplit> Bootstrap004 5 <S3: rsplit> Bootstrap005 6 <S3: rsplit> Bootstrap006 7 <S3: rsplit> Bootstrap007 8 <S3: rsplit> Bootstrap008 9 <S3: rsplit> Bootstrap009 10 <S3: rsplit> Bootstrap010 # ... with 490 more rows ``` --- count: false ## Correlation between Firmicutes and Bacteroidetes ```r df_phylum_transp_stool %>% select(Firmicutes, Bacteroidetes) %>% bootstraps(times = 500) %>% mutate(splits = map(splits, as_tibble), Firmicutes = map(splits, pull, Firmicutes), Bacteroidetes = map(splits, pull, Bacteroidetes)) ``` ``` # Bootstrap sampling # A tibble: 500 x 4 splits id Firmicutes Bacteroidetes * <list> <chr> <list> <list> 1 <tibble [147 × 2]> Bootstrap001 <dbl [147]> <dbl [147]> 2 <tibble [147 × 2]> Bootstrap002 <dbl [147]> <dbl [147]> 3 <tibble [147 × 2]> Bootstrap003 <dbl [147]> <dbl [147]> 4 <tibble [147 × 2]> Bootstrap004 <dbl [147]> <dbl [147]> 5 <tibble [147 × 2]> Bootstrap005 <dbl [147]> <dbl [147]> 6 <tibble [147 × 2]> Bootstrap006 <dbl [147]> <dbl [147]> 7 <tibble [147 × 2]> Bootstrap007 <dbl [147]> <dbl [147]> 8 <tibble [147 × 2]> Bootstrap008 <dbl [147]> <dbl [147]> 9 <tibble [147 × 2]> Bootstrap009 <dbl [147]> <dbl [147]> 10 <tibble [147 × 2]> Bootstrap010 <dbl [147]> <dbl [147]> # ... with 490 more rows ``` --- count: false class: nologo ## Correlation between Firmicutes and Bacteroidetes ```r df_phylum_transp_stool %>% select(Firmicutes, Bacteroidetes) %>% bootstraps(times = 500) %>% mutate(splits = map(splits, as_tibble), Firmicutes = map(splits, pull, Firmicutes), Bacteroidetes = map(splits, pull, Bacteroidetes), Pearson = map2_dbl(Firmicutes, Bacteroidetes, cor), Spearman = map2_dbl(Firmicutes, Bacteroidetes, cor, method = "spearman")) ``` ``` # Bootstrap sampling # A tibble: 500 x 6 splits id Firmicutes Bacteroidetes Pearson Spearman * <list> <chr> <list> <list> <dbl> <dbl> 1 <tibble [147 × 2… Bootstrap0… <dbl [147… <dbl [147]> -0.489 -0.515 2 <tibble [147 × 2… Bootstrap0… <dbl [147… <dbl [147]> -0.510 -0.422 3 <tibble [147 × 2… Bootstrap0… <dbl [147… <dbl [147]> -0.535 -0.530 4 <tibble [147 × 2… Bootstrap0… <dbl [147… <dbl [147]> -0.714 -0.634 5 <tibble [147 × 2… Bootstrap0… <dbl [147… <dbl [147]> -0.542 -0.460 6 <tibble [147 × 2… Bootstrap0… <dbl [147… <dbl [147]> -0.444 -0.456 7 <tibble [147 × 2… Bootstrap0… <dbl [147… <dbl [147]> -0.633 -0.681 8 <tibble [147 × 2… Bootstrap0… <dbl [147… <dbl [147]> -0.392 -0.379 9 <tibble [147 × 2… Bootstrap0… <dbl [147… <dbl [147]> -0.557 -0.382 10 <tibble [147 × 2… Bootstrap0… <dbl [147… <dbl [147]> -0.543 -0.525 # ... with 490 more rows ``` --- count: false class: nologo ## Correlation between Firmicutes and Bacteroidetes ```r df_phylum_transp_stool %>% select(Firmicutes, Bacteroidetes) %>% bootstraps(times = 500) %>% mutate(splits = map(splits, as_tibble), Firmicutes = map(splits, pull, Firmicutes), Bacteroidetes = map(splits, pull, Bacteroidetes), Pearson = map2_dbl(Firmicutes, Bacteroidetes, cor), Spearman = map2_dbl(Firmicutes, Bacteroidetes, cor, method = "spearman")) %>% select(Pearson, Spearman) %>% {.[, ]} %>% # To use gather() and not gather.rset() gather(key = "Method", value = "Correlation") -> df_cor df_cor ``` ``` # A tibble: 1,000 x 2 Method Correlation <chr> <dbl> 1 Pearson -0.489 2 Pearson -0.510 3 Pearson -0.535 4 Pearson -0.714 5 Pearson -0.542 6 Pearson -0.444 7 Pearson -0.633 8 Pearson -0.392 9 Pearson -0.557 10 Pearson -0.543 # ... with 990 more rows ``` --- ## Correlation between Firmicutes and Bacteroidetes ```r ggplot(df_cor) + aes(x = Correlation, fill = Method) + geom_density(alpha = 0.5) + geom_vline(xintercept = 0, color = "grey") ``` <img src="index_files/figure-html/unnamed-chunk-124-1.png" width="576" style="display: block; margin: auto;" /> --- class: center, middle, inverse # References
--- <br> ### Purrr * <a href="https://stackoverflow.com/questions/45101045/why-use-purrrmap-instead-of-lapply/47123420" target="_blank">
Why use purrr::map instead of lapply? (H. Wickham)</a> ### Tibble * <a href="https://www.rstudio.com/resources/videos/using-list-cols-in-your-dataframe/" target="_blank">
Putting square pegs in round holes: Using list-cols in your dataframe (J. Bryan)</a> ### Tidymodels * <a href="https://static1.squarespace.com/static/51156277e4b0b8b2ffe11c00/t/5b75871e21c67c985a06f481/1534428959053/RPharma_18_Kuhn.pdf" target="_blank">
Modeling in the Tidyverse (M. Kuhn)</a> ### Metagenomic data * <a href="http://waldronlab.io/curatedMetagenomicData/" target="_blank">
Curated Metagenomic Data (L. Waldron)</a> --- class: center, middle, inverse count:false <br> # Thanks for you attention! <br> ####
<a href="https://github.com/abichat" target="_blank">@abichat</a> <div style = "margin-top: -10px"></div> ####
<a href="https://twitter.com/_abichat" target="_blank">@_abichat</a> <div style = "margin-top: -10px"></div> ####
<a href="https://www.linkedin.com/in/antoinebichat" target="_blank">antoinebichat</a> <div style = "margin-top: -10px"></div> ####
<a href="https://abichat.github.io" target="_blank">abichat.github.io</a> <div style = "margin-top: -10px"></div> ####
<a href="mailto:antoine.bichat@mines-nancy.org?subject=SOTR">antoine.bichat@mines-nancy.org</a>