class: center, middle, inverse, title-slide

# Introduction to the Tidyverse
## How to take care of your data
### Antoine Bichat
State of the R - AgroParisTech
May 2, 2018

---
class: center, middle, inverse

<br>

# {magrittr}

<img src="http://hexb.in/vector/pipe.svg", width=270>

---

# Make Tom eat an apple

* Everyday language

> Tom eats an apple

<br>

Subject - Verb - Complement

<br>

--

.pull-left[
***

<br>

* Programming language

> `eat(Tom, apple)`

<br>

Verb - Subject - Complement
]

---

# Pipe `%>%`

```r
library(magrittr)
```

* `x %>% f()` is equivalent to `f(x)`

* `x %>% f(y)` is equivalent to `f(x, y)`

--

```r
2^mean(log(seq_len(10), base = 2), na.rm = TRUE)
```

```
## [1] 4.528729
```

--

```r
10 %>% 
  seq_len() %>% 
  log(base = 2) %>% 
  mean(na.rm = TRUE) %>% 
  {2^.}
```

```
## [1] 4.528729
```

--

When you read code, `%>%` is pronounced "then"

--

The keyboard shortcut for `%>%` is Ctrl/⌘ + ⇧ + M

---

# Reassignment pipe `%<>%`

* `x %<>% f()` is equivalent to `x <- f(x)`

--

```r
library(ape)
Tree <- "((A,B),(C,D));" %>% read.tree(text = .)
is.rooted(Tree)
```

```
## [1] TRUE
```

```r
Tree %<>% unroot()
is.rooted(Tree)
```

```
## [1] FALSE
```

---

# T-Pipe `%T>%`

* `x %T>% f() %>% g()` is equivalent to `f(x); g(x)` when `f` is a side-effect function

--

```r
iris %>% 
  lm(data = ., Sepal.Length ~ Petal.Length) %T>% 
  plot(1) %>% 
  coefficients()
```

<img src="index_files/figure-html/Tpipe-1.png" width="504" style="display: block; margin: auto;" />

```
##  (Intercept) Petal.Length 
##    4.3066034    0.4089223
```

---
class: center, middle, inverse

<br>

# {dplyr}

<img src="http://hexb.in/vector/dplyr.svg", width=270>

---

# {dplyr}

```r
library(dplyr)
```

{dplyr} is a package that helps you solve the vast majority of your data-manipulation challenges:

* create variables
* pick variables
* reorder observations
* pick observations
* create summaries
* ...

Functions in this package are verbs and work similarly: each takes a data frame as its first argument.

---

# mtcars dataset

```r
DT::datatable(mtcars, class = "compact cell-border", options = list(pageLength = 8))
```
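
---

# Chaining verbs

A minimal sketch, not taken from the original slides, of how the {dplyr} verbs combine with `%>%`: each step receives the data frame produced by the previous one. The name `by_gear` and the choice of columns are illustrative assumptions.

```r
library(dplyr)

# Illustrative pipeline: mean miles per gallon per gear count,
# restricted to 4-cylinder cars, sorted by decreasing mean mpg
by_gear <- mtcars %>%
  filter(cyl == 4) %>%                  # pick observations
  group_by(gear) %>%                    # define groups
  summarise(Count = n(),                # create summaries
            Mean_mpg = mean(mpg)) %>%
  arrange(desc(Mean_mpg))               # reorder observations

by_gear
```

Each verb used here is detailed on its own slide.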
--- # `filter()` ```r mtcars %>% filter(cyl == 4) ``` <table> <thead> <tr> <th style="text-align:right;"> mpg </th> <th style="text-align:right;"> cyl </th> <th style="text-align:right;"> disp </th> <th style="text-align:right;"> hp </th> <th style="text-align:right;"> drat </th> <th style="text-align:right;"> wt </th> <th style="text-align:right;"> qsec </th> <th style="text-align:right;"> vs </th> <th style="text-align:right;"> am </th> <th style="text-align:right;"> gear </th> <th style="text-align:right;"> carb </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 22.8 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 108.0 </td> <td style="text-align:right;"> 93 </td> <td style="text-align:right;"> 3.85 </td> <td style="text-align:right;"> 2.320 </td> <td style="text-align:right;"> 18.61 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:right;"> 24.4 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 146.7 </td> <td style="text-align:right;"> 62 </td> <td style="text-align:right;"> 3.69 </td> <td style="text-align:right;"> 3.190 </td> <td style="text-align:right;"> 20.00 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 2 </td> </tr> <tr> <td style="text-align:right;"> 22.8 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 140.8 </td> <td style="text-align:right;"> 95 </td> <td style="text-align:right;"> 3.92 </td> <td style="text-align:right;"> 3.150 </td> <td style="text-align:right;"> 22.90 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 2 </td> </tr> <tr> <td style="text-align:right;"> 32.4 </td> <td 
style="text-align:right;"> 4 </td> <td style="text-align:right;"> 78.7 </td> <td style="text-align:right;"> 66 </td> <td style="text-align:right;"> 4.08 </td> <td style="text-align:right;"> 2.200 </td> <td style="text-align:right;"> 19.47 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:right;"> 30.4 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 75.7 </td> <td style="text-align:right;"> 52 </td> <td style="text-align:right;"> 4.93 </td> <td style="text-align:right;"> 1.615 </td> <td style="text-align:right;"> 18.52 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 2 </td> </tr> <tr> <td style="text-align:right;"> 33.9 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 71.1 </td> <td style="text-align:right;"> 65 </td> <td style="text-align:right;"> 4.22 </td> <td style="text-align:right;"> 1.835 </td> <td style="text-align:right;"> 19.90 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:right;"> 21.5 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 120.1 </td> <td style="text-align:right;"> 97 </td> <td style="text-align:right;"> 3.70 </td> <td style="text-align:right;"> 2.465 </td> <td style="text-align:right;"> 20.01 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:right;"> 27.3 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 79.0 </td> <td style="text-align:right;"> 66 </td> <td style="text-align:right;"> 4.08 </td> <td 
style="text-align:right;"> 1.935 </td> <td style="text-align:right;"> 18.90 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:right;"> 26.0 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 120.3 </td> <td style="text-align:right;"> 91 </td> <td style="text-align:right;"> 4.43 </td> <td style="text-align:right;"> 2.140 </td> <td style="text-align:right;"> 16.70 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 2 </td> </tr> <tr> <td style="text-align:right;"> 30.4 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 95.1 </td> <td style="text-align:right;"> 113 </td> <td style="text-align:right;"> 3.77 </td> <td style="text-align:right;"> 1.513 </td> <td style="text-align:right;"> 16.90 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 2 </td> </tr> <tr> <td style="text-align:right;"> 21.4 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 121.0 </td> <td style="text-align:right;"> 109 </td> <td style="text-align:right;"> 4.11 </td> <td style="text-align:right;"> 2.780 </td> <td style="text-align:right;"> 18.60 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 2 </td> </tr> </tbody> </table> --- # `select()` ```r mtcars %>% select(mpg, wt, cyl) ``` <table> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> mpg </th> <th style="text-align:right;"> wt </th> <th style="text-align:right;"> cyl </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Mazda RX4 </td> <td style="text-align:right;"> 21.0 </td> <td 
style="text-align:right;"> 2.620 </td> <td style="text-align:right;"> 6 </td> </tr> <tr> <td style="text-align:left;"> Mazda RX4 Wag </td> <td style="text-align:right;"> 21.0 </td> <td style="text-align:right;"> 2.875 </td> <td style="text-align:right;"> 6 </td> </tr> <tr> <td style="text-align:left;"> Datsun 710 </td> <td style="text-align:right;"> 22.8 </td> <td style="text-align:right;"> 2.320 </td> <td style="text-align:right;"> 4 </td> </tr> <tr> <td style="text-align:left;"> Hornet 4 Drive </td> <td style="text-align:right;"> 21.4 </td> <td style="text-align:right;"> 3.215 </td> <td style="text-align:right;"> 6 </td> </tr> <tr> <td style="text-align:left;"> Hornet Sportabout </td> <td style="text-align:right;"> 18.7 </td> <td style="text-align:right;"> 3.440 </td> <td style="text-align:right;"> 8 </td> </tr> <tr> <td style="text-align:left;"> Valiant </td> <td style="text-align:right;"> 18.1 </td> <td style="text-align:right;"> 3.460 </td> <td style="text-align:right;"> 6 </td> </tr> <tr> <td style="text-align:left;"> Duster 360 </td> <td style="text-align:right;"> 14.3 </td> <td style="text-align:right;"> 3.570 </td> <td style="text-align:right;"> 8 </td> </tr> <tr> <td style="text-align:left;"> Merc 240D </td> <td style="text-align:right;"> 24.4 </td> <td style="text-align:right;"> 3.190 </td> <td style="text-align:right;"> 4 </td> </tr> <tr> <td style="text-align:left;"> Merc 230 </td> <td style="text-align:right;"> 22.8 </td> <td style="text-align:right;"> 3.150 </td> <td style="text-align:right;"> 4 </td> </tr> </tbody> </table> -- See also `select_if()` --- # `mutate()` ```r mtcars %>% mutate(cyl2 = 2 * cyl, cyl4 = 2 * cyl2, disp = disp * 0.0163871, drat = NULL) ``` <table> <thead> <tr> <th style="text-align:right;"> mpg </th> <th style="text-align:right;"> cyl </th> <th style="text-align:right;"> disp </th> <th style="text-align:right;"> hp </th> <th style="text-align:right;"> wt </th> <th style="text-align:right;"> qsec </th> <th 
style="text-align:right;"> vs </th> <th style="text-align:right;"> am </th> <th style="text-align:right;"> gear </th> <th style="text-align:right;"> carb </th> <th style="text-align:right;"> cyl2 </th> <th style="text-align:right;"> cyl4 </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 21.0 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 2.621936 </td> <td style="text-align:right;"> 110 </td> <td style="text-align:right;"> 2.620 </td> <td style="text-align:right;"> 16.46 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> 24 </td> </tr> <tr> <td style="text-align:right;"> 21.0 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 2.621936 </td> <td style="text-align:right;"> 110 </td> <td style="text-align:right;"> 2.875 </td> <td style="text-align:right;"> 17.02 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> 24 </td> </tr> <tr> <td style="text-align:right;"> 22.8 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 1.769807 </td> <td style="text-align:right;"> 93 </td> <td style="text-align:right;"> 2.320 </td> <td style="text-align:right;"> 18.61 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 16 </td> </tr> <tr> <td style="text-align:right;"> 21.4 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 4.227872 </td> <td style="text-align:right;"> 110 </td> <td style="text-align:right;"> 3.215 </td> <td 
style="text-align:right;"> 19.44 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> 24 </td> </tr> <tr> <td style="text-align:right;"> 18.7 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 5.899356 </td> <td style="text-align:right;"> 175 </td> <td style="text-align:right;"> 3.440 </td> <td style="text-align:right;"> 17.02 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 16 </td> <td style="text-align:right;"> 32 </td> </tr> <tr> <td style="text-align:right;"> 18.1 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 3.687098 </td> <td style="text-align:right;"> 105 </td> <td style="text-align:right;"> 3.460 </td> <td style="text-align:right;"> 20.22 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> 24 </td> </tr> <tr> <td style="text-align:right;"> 14.3 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 5.899356 </td> <td style="text-align:right;"> 245 </td> <td style="text-align:right;"> 3.570 </td> <td style="text-align:right;"> 15.84 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 16 </td> <td style="text-align:right;"> 32 </td> </tr> </tbody> </table> -- See also `mutate_if()` --- # `arrange()` ```r mtcars %>% arrange(desc(carb), mpg) ``` <table> <thead> <tr> <th style="text-align:right;"> mpg </th> <th style="text-align:right;"> cyl </th> <th 
style="text-align:right;"> disp </th> <th style="text-align:right;"> hp </th> <th style="text-align:right;"> drat </th> <th style="text-align:right;"> wt </th> <th style="text-align:right;"> qsec </th> <th style="text-align:right;"> vs </th> <th style="text-align:right;"> am </th> <th style="text-align:right;"> gear </th> <th style="text-align:right;"> carb </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 15.0 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 301.0 </td> <td style="text-align:right;"> 335 </td> <td style="text-align:right;"> 3.54 </td> <td style="text-align:right;"> 3.570 </td> <td style="text-align:right;"> 14.60 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 8 </td> </tr> <tr> <td style="text-align:right;"> 19.7 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 145.0 </td> <td style="text-align:right;"> 175 </td> <td style="text-align:right;"> 3.62 </td> <td style="text-align:right;"> 2.770 </td> <td style="text-align:right;"> 15.50 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 6 </td> </tr> <tr> <td style="text-align:right;"> 10.4 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 472.0 </td> <td style="text-align:right;"> 205 </td> <td style="text-align:right;"> 2.93 </td> <td style="text-align:right;"> 5.250 </td> <td style="text-align:right;"> 17.98 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 4 </td> </tr> <tr> <td style="text-align:right;"> 10.4 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 460.0 </td> <td style="text-align:right;"> 215 </td> <td style="text-align:right;"> 3.00 </td> <td 
style="text-align:right;"> 5.424 </td> <td style="text-align:right;"> 17.82 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 4 </td> </tr> <tr> <td style="text-align:right;"> 13.3 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 350.0 </td> <td style="text-align:right;"> 245 </td> <td style="text-align:right;"> 3.73 </td> <td style="text-align:right;"> 3.840 </td> <td style="text-align:right;"> 15.41 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 4 </td> </tr> <tr> <td style="text-align:right;"> 14.3 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 360.0 </td> <td style="text-align:right;"> 245 </td> <td style="text-align:right;"> 3.21 </td> <td style="text-align:right;"> 3.570 </td> <td style="text-align:right;"> 15.84 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 4 </td> </tr> <tr> <td style="text-align:right;"> 14.7 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 440.0 </td> <td style="text-align:right;"> 230 </td> <td style="text-align:right;"> 3.23 </td> <td style="text-align:right;"> 5.345 </td> <td style="text-align:right;"> 17.42 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 4 </td> </tr> <tr> <td style="text-align:right;"> 15.8 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 351.0 </td> <td style="text-align:right;"> 264 </td> <td style="text-align:right;"> 4.22 </td> <td style="text-align:right;"> 3.170 </td> <td style="text-align:right;"> 14.50 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> 
<td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 4 </td> </tr> <tr> <td style="text-align:right;"> 17.8 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 167.6 </td> <td style="text-align:right;"> 123 </td> <td style="text-align:right;"> 3.92 </td> <td style="text-align:right;"> 3.440 </td> <td style="text-align:right;"> 18.90 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 4 </td> </tr> <tr> <td style="text-align:right;"> 19.2 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 167.6 </td> <td style="text-align:right;"> 123 </td> <td style="text-align:right;"> 3.92 </td> <td style="text-align:right;"> 3.440 </td> <td style="text-align:right;"> 18.30 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 4 </td> </tr> <tr> <td style="text-align:right;"> 21.0 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 160.0 </td> <td style="text-align:right;"> 110 </td> <td style="text-align:right;"> 3.90 </td> <td style="text-align:right;"> 2.620 </td> <td style="text-align:right;"> 16.46 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 4 </td> </tr> </tbody> </table> --- # `pull()` ```r mtcars %>% pull(qsec) ``` ``` ## [1] 16.46 17.02 18.61 19.44 17.02 20.22 15.84 20.00 22.90 18.30 18.90 ## [12] 17.40 17.60 18.00 17.98 17.82 17.42 19.47 18.52 19.90 20.01 16.87 ## [23] 17.30 15.41 17.05 18.90 16.70 16.90 14.50 15.50 14.60 18.60 ``` --- # `summarise()` ```r mtcars %>% summarise(Mean_mpg = mean(mpg), Var_disp = var(disp)) ``` <table> <thead> <tr> <th style="text-align:right;"> Mean_mpg </th> <th style="text-align:right;"> Var_disp </th> </tr> </thead> <tbody> <tr> <td 
style="text-align:right;"> 20.09062 </td> <td style="text-align:right;"> 15360.8 </td> </tr> </tbody> </table> --- # `group_by()` ```r mtcars %>% group_by(cyl, carb) %>% summarise(Count = n(), Mean_mpg = mean(mpg), Var_disp = var(disp)) ``` <table> <thead> <tr> <th style="text-align:right;"> cyl </th> <th style="text-align:right;"> carb </th> <th style="text-align:right;"> Count </th> <th style="text-align:right;"> Mean_mpg </th> <th style="text-align:right;"> Var_disp </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 27.58 </td> <td style="text-align:right;"> 456.59700 </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 25.90 </td> <td style="text-align:right;"> 731.95200 </td> </tr> <tr> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 19.75 </td> <td style="text-align:right;"> 544.50000 </td> </tr> <tr> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 19.75 </td> <td style="text-align:right;"> 19.25333 </td> </tr> <tr> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 19.70 </td> <td style="text-align:right;"> NA </td> </tr> <tr> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 17.15 </td> <td style="text-align:right;"> 1886.33333 </td> </tr> <tr> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 16.30 </td> <td style="text-align:right;"> 
0.00000 </td> </tr>
<tr> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 13.15 </td> <td style="text-align:right;"> 3340.70000 </td> </tr>
<tr> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 15.00 </td> <td style="text-align:right;"> NA </td> </tr>
</tbody> </table>

---

# Programming with dplyr?

```r
iris %>% 
  group_by(Species) %>% 
  summarise(m = median(Sepal.Length))
```

```
## # A tibble: 3 x 2
##   Species        m
##   <fct>      <dbl>
## 1 setosa      5.00
## 2 versicolor  5.90
## 3 virginica   6.50
```

--

```r
# group_by() looks for a column literally named `group`, hence the error
median_by_group <- function(group, var){
  iris %>% 
    group_by(group) %>% 
    summarise(m = median(var))
}
median_by_group(Species, Sepal.Length)
```

```
## Error in grouped_df_impl(data, unname(vars), drop): Column `group` is unknown
```

---

# `enquo()` and `!!`

Keywords: tidy evaluation & non-standard evaluation

```r
median_by_group <- function(group, var){
  group <- enquo(group)  # capture the expressions passed by the caller
  var <- enquo(var)
  iris %>% 
    group_by(!!group) %>%  # unquote them inside the dplyr verbs
    summarise(m = median(!!var))
}
median_by_group(Species, Sepal.Length)
```

```
## # A tibble: 3 x 2
##   Species        m
##   <fct>      <dbl>
## 1 setosa      5.00
## 2 versicolor  5.90
## 3 virginica   6.50
```

---

# And with strings?

```r
median_by_group <- function(group, var){
  group <- as.name(group)  # convert the strings to symbols
  var <- as.name(var)
  iris %>% 
    group_by(!!group) %>% 
    summarise(m = median(!!var))
}
median_by_group("Species", "Sepal.Length")
```

```
## # A tibble: 3 x 2
##   Species        m
##   <fct>      <dbl>
## 1 setosa      5.00
## 2 versicolor  5.90
## 3 virginica   6.50
```

---
class: center, middle, inverse

<br>

# {tidyr}

<img src="http://hexb.in/vector/tidyr.svg", width=270>

---

# {tidyr}

```r
library(tidyr)
```

{tidyr} is a package that helps you transform messy datasets into tidy datasets.

--

<br>
<br>

Three interrelated rules make a dataset tidy:

.full-width[
.content-box-sotr[
1. Each variable must have its own column
2. Each observation must have its own row
3. Each value must have its own cell
]
]

---

# grades dataset

```r
grades <- tibble(
  Name = c("Tommy", "Mary", "Gary", "Cathy"),
  Sexage = c("m.15", "f.15", "m.16", "f.14"),
  Test1 = c(10, 15, 16, 14),
  Test2 = c(11, 13, 10, 12),
  Test3 = c(12, 13, 17, 10)
)
```

<table>
<thead> <tr> <th style="text-align:left;"> Name </th> <th style="text-align:left;"> Sexage </th> <th style="text-align:right;"> Test1 </th> <th style="text-align:right;"> Test2 </th> <th style="text-align:right;"> Test3 </th> </tr> </thead>
<tbody>
<tr> <td style="text-align:left;"> Tommy </td> <td style="text-align:left;"> m.15 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 11 </td> <td style="text-align:right;"> 12 </td> </tr>
<tr> <td style="text-align:left;"> Mary </td> <td style="text-align:left;"> f.15 </td> <td style="text-align:right;"> 15 </td> <td style="text-align:right;"> 13 </td> <td style="text-align:right;"> 13 </td> </tr>
<tr> <td style="text-align:left;"> Gary </td> <td style="text-align:left;"> m.16 </td> <td style="text-align:right;"> 16 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 17 </td> </tr>
<tr> <td style="text-align:left;"> Cathy </td> <td style="text-align:left;"> f.14 </td> <td style="text-align:right;"> 14 </td> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> 10 </td> </tr>
</tbody> </table>

---

# `separate()`

```r
grades <- grades %>% 
  separate(Sexage, into = c("Sex", "Age"))
```

<table>
<thead> <tr> <th style="text-align:left;"> Name </th> <th style="text-align:left;"> Sex </th> <th style="text-align:left;"> Age </th> <th style="text-align:right;"> Test1 </th> <th style="text-align:right;"> Test2 </th> <th style="text-align:right;"> Test3 </th> </tr> </thead>
<tbody>
<tr> <td style="text-align:left;"> Tommy </td> <td style="text-align:left;"> m </td> <td style="text-align:left;"> 15 </td> <td
style="text-align:right;"> 10 </td> <td style="text-align:right;"> 11 </td> <td style="text-align:right;"> 12 </td> </tr> <tr> <td style="text-align:left;"> Mary </td> <td style="text-align:left;"> f </td> <td style="text-align:left;"> 15 </td> <td style="text-align:right;"> 15 </td> <td style="text-align:right;"> 13 </td> <td style="text-align:right;"> 13 </td> </tr> <tr> <td style="text-align:left;"> Gary </td> <td style="text-align:left;"> m </td> <td style="text-align:left;"> 16 </td> <td style="text-align:right;"> 16 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 17 </td> </tr> <tr> <td style="text-align:left;"> Cathy </td> <td style="text-align:left;"> f </td> <td style="text-align:left;"> 14 </td> <td style="text-align:right;"> 14 </td> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> 10 </td> </tr> </tbody> </table> -- The inverse of `separate()` is `unite()` --- # `gather()` ```r grades <- grades %>% gather(Test1, Test2, Test3, key = Test, value = Grade) ``` <table> <thead> <tr> <th style="text-align:left;"> Name </th> <th style="text-align:left;"> Sex </th> <th style="text-align:left;"> Age </th> <th style="text-align:left;"> Test </th> <th style="text-align:right;"> Grade </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Tommy </td> <td style="text-align:left;"> m </td> <td style="text-align:left;"> 15 </td> <td style="text-align:left;"> Test1 </td> <td style="text-align:right;"> 10 </td> </tr> <tr> <td style="text-align:left;"> Mary </td> <td style="text-align:left;"> f </td> <td style="text-align:left;"> 15 </td> <td style="text-align:left;"> Test1 </td> <td style="text-align:right;"> 15 </td> </tr> <tr> <td style="text-align:left;"> Gary </td> <td style="text-align:left;"> m </td> <td style="text-align:left;"> 16 </td> <td style="text-align:left;"> Test1 </td> <td style="text-align:right;"> 16 </td> </tr> <tr> <td style="text-align:left;"> Cathy </td> <td style="text-align:left;"> f 
</td> <td style="text-align:left;"> 14 </td> <td style="text-align:left;"> Test1 </td> <td style="text-align:right;"> 14 </td> </tr> <tr> <td style="text-align:left;"> Tommy </td> <td style="text-align:left;"> m </td> <td style="text-align:left;"> 15 </td> <td style="text-align:left;"> Test2 </td> <td style="text-align:right;"> 11 </td> </tr> <tr> <td style="text-align:left;"> Mary </td> <td style="text-align:left;"> f </td> <td style="text-align:left;"> 15 </td> <td style="text-align:left;"> Test2 </td> <td style="text-align:right;"> 13 </td> </tr> <tr> <td style="text-align:left;"> Gary </td> <td style="text-align:left;"> m </td> <td style="text-align:left;"> 16 </td> <td style="text-align:left;"> Test2 </td> <td style="text-align:right;"> 10 </td> </tr> <tr> <td style="text-align:left;"> Cathy </td> <td style="text-align:left;"> f </td> <td style="text-align:left;"> 14 </td> <td style="text-align:left;"> Test2 </td> <td style="text-align:right;"> 12 </td> </tr> </tbody> </table> -- The inverse of `gather()` is `spread()` --- # More efficient code: summarising ```r grades %>% group_by(Name) %>% summarise(Mean = mean(Grade)) ``` ``` ## # A tibble: 4 x 2 ## Name Mean ## <chr> <dbl> ## 1 Cathy 12.0 ## 2 Gary 14.3 ## 3 Mary 13.7 ## 4 Tommy 11.0 ``` ```r grades %>% group_by(Test) %>% summarise(Mean = mean(Grade)) ``` ``` ## # A tibble: 3 x 2 ## Test Mean ## <chr> <dbl> ## 1 Test1 13.8 ## 2 Test2 11.5 ## 3 Test3 13.0 ``` --- # More efficient code: plotting ```r library(ggplot2) grades %>% ggplot(aes(Test, Grade, color = Name)) + geom_point() + geom_line(aes(group = Name)) + theme_bw() ``` <img src="index_files/figure-html/plottidy-1.png" width="504" style="display: block; margin: auto;" /> --- class: center, middle, inverse <br> # {tibble} <img src="http://hexb.in/vector/tibble.svg", width=270> --- # Classical data.frame ```r swiss ``` ``` ## Fertility Agriculture Examination Education Catholic ## Courtelary 80.2 17.0 15 12 9.96 ## Delemont 83.1 45.1 6 9 84.84 ## 
Franches-Mnt 92.5 39.7 5 5 93.40 ## Moutier 85.8 36.5 12 7 33.77 ## Neuveville 76.9 43.5 17 15 5.16 ## Porrentruy 76.1 35.3 9 7 90.57 ## Broye 83.8 70.2 16 7 92.85 ## Glane 92.4 67.8 14 8 97.16 ## Gruyere 82.4 53.3 12 7 97.67 ## Sarine 82.9 45.2 16 13 91.38 ## Veveyse 87.1 64.5 14 6 98.61 ## Aigle 64.1 62.0 21 12 8.52 ## Aubonne 66.9 67.5 14 7 2.27 ## Avenches 68.9 60.7 19 12 4.43 ## Cossonay 61.7 69.3 22 5 2.82 ## Echallens 68.3 72.6 18 2 24.20 ## Grandson 71.7 34.0 17 8 3.30 ## Lausanne 55.7 19.4 26 28 12.11 ## La Vallee 54.3 15.2 31 20 2.15 ## Lavaux 65.1 73.0 19 9 2.84 ## Morges 65.5 59.8 22 10 5.23 ## Moudon 65.0 55.1 14 3 4.52 ## Nyone 56.6 50.9 22 12 15.14 ## Orbe 57.4 54.1 20 6 4.20 ## Oron 72.5 71.2 12 1 2.40 ## Payerne 74.2 58.1 14 8 5.23 ## Paysd'enhaut 72.0 63.5 6 3 2.56 ## Rolle 60.5 60.8 16 10 7.72 ## Vevey 58.3 26.8 25 19 18.46 ## Yverdon 65.4 49.5 15 8 6.10 ## Conthey 75.5 85.9 3 2 99.71 ## Entremont 69.3 84.9 7 6 99.68 ## Herens 77.3 89.7 5 2 100.00 ## Martigwy 70.5 78.2 12 6 98.96 ## Monthey 79.4 64.9 7 3 98.22 ## St Maurice 65.0 75.9 9 9 99.06 ## Sierre 92.2 84.6 3 3 99.46 ## Sion 79.3 63.1 13 13 96.83 ## Boudry 70.4 38.4 26 12 5.62 ## La Chauxdfnd 65.7 7.7 29 11 13.79 ## Le Locle 72.7 16.7 22 13 11.22 ## Neuchatel 64.4 17.6 35 32 16.92 ## Val de Ruz 77.6 37.6 15 7 4.97 ## ValdeTravers 67.6 18.7 25 7 8.65 ## V. 
De Geneve 35.0 1.2 37 53 42.34 ## Rive Droite 44.7 46.6 16 29 50.43 ## Rive Gauche 42.8 27.7 22 29 58.33 ## Infant.Mortality ## Courtelary 22.2 ## Delemont 22.2 ## Franches-Mnt 20.2 ## Moutier 20.3 ## Neuveville 20.6 ## Porrentruy 26.6 ## Broye 23.6 ## Glane 24.9 ## Gruyere 21.0 ## Sarine 24.4 ## Veveyse 24.5 ## Aigle 16.5 ## Aubonne 19.1 ## Avenches 22.7 ## Cossonay 18.7 ## Echallens 21.2 ## Grandson 20.0 ## Lausanne 20.2 ## La Vallee 10.8 ## Lavaux 20.0 ## Morges 18.0 ## Moudon 22.4 ## Nyone 16.7 ## Orbe 15.3 ## Oron 21.0 ## Payerne 23.8 ## Paysd'enhaut 18.0 ## Rolle 16.3 ## Vevey 20.9 ## Yverdon 22.5 ## Conthey 15.1 ## Entremont 19.8 ## Herens 18.3 ## Martigwy 19.4 ## Monthey 20.2 ## St Maurice 17.8 ## Sierre 16.3 ## Sion 18.1 ## Boudry 20.3 ## La Chauxdfnd 20.5 ## Le Locle 18.9 ## Neuchatel 23.0 ## Val de Ruz 20.0 ## ValdeTravers 19.5 ## V. De Geneve 18.0 ## Rive Droite 18.2 ## Rive Gauche 19.3 ``` --- # Tibble ```r library(tibble) swiss %>% as_tibble(rownames = "Province") ``` ``` ## # A tibble: 47 x 7 ## Province Fertility Agriculture Examination Education Catholic ## <chr> <dbl> <dbl> <int> <int> <dbl> ## 1 Courtelary 80.2 17.0 15 12 9.96 ## 2 Delemont 83.1 45.1 6 9 84.8 ## 3 Franches-Mnt 92.5 39.7 5 5 93.4 ## 4 Moutier 85.8 36.5 12 7 33.8 ## 5 Neuveville 76.9 43.5 17 15 5.16 ## 6 Porrentruy 76.1 35.3 9 7 90.6 ## 7 Broye 83.8 70.2 16 7 92.8 ## 8 Glane 92.4 67.8 14 8 97.2 ## 9 Gruyere 82.4 53.3 12 7 97.7 ## 10 Sarine 82.9 45.2 16 13 91.4 ## # ... with 37 more rows, and 1 more variable: Infant.Mortality <dbl> ``` -- .full-width[.content-box-sotr[Row names are variables so they must have their own column. `rownames_to_column()` can help you.]] --- # Column names ```r tibble( x = 1:5, `2x` = 2 * (1:5), `Some letters` = letters[1:5], `;-)` = c(TRUE, FALSE, FALSE, TRUE, TRUE) ) ``` ``` ## # A tibble: 5 x 4 ## x `2x` `Some letters` `;-)` ## <int> <dbl> <chr> <lgl> ## 1 1 2. a TRUE ## 2 2 4. b FALSE ## 3 3 6. c FALSE ## 4 4 8. d TRUE ## 5 5 10. 
e TRUE
```

---

# Consistency in subsetting

```r
df <- data.frame(x = 1:9, y = LETTERS[1:9])
tbl <- tibble(x = 1:9, y = LETTERS[1:9])
```

--

```r
class(df[, 1:2])
```

```
## [1] "data.frame"
```

```r
class(tbl[, 1:2])
```

```
## [1] "tbl_df"     "tbl"        "data.frame"
```

--

```r
class(df[, 1])
```

```
## [1] "integer"
```

```r
class(tbl[, 1])
```

```
## [1] "tbl_df"     "tbl"        "data.frame"
```

---

# List-column

```r
starwars %>%
  select(name, height, mass, hair_color, films, vehicles)
```

```
## # A tibble: 87 x 6
##    name               height  mass hair_color    films     vehicles
##    <chr>               <int> <dbl> <chr>         <list>    <list>
##  1 Luke Skywalker        172   77. blond         <chr [5]> <chr [2]>
##  2 C-3PO                 167   75. <NA>          <chr [6]> <chr [0]>
##  3 R2-D2                  96   32. <NA>          <chr [7]> <chr [0]>
##  4 Darth Vader           202  136. none          <chr [4]> <chr [0]>
##  5 Leia Organa           150   49. brown         <chr [5]> <chr [1]>
##  6 Owen Lars             178  120. brown, grey   <chr [3]> <chr [0]>
##  7 Beru Whitesun lars    165   75. brown         <chr [3]> <chr [0]>
##  8 R5-D4                  97   32. <NA>          <chr [1]> <chr [0]>
##  9 Biggs Darklighter     183   84. black         <chr [1]> <chr [0]>
## 10 Obi-Wan Kenobi        182   77. auburn, white <chr [6]> <chr [1]>
## # ...
with 77 more rows
```

---

# List-column: put a vector in each cell

```r
starwars %>%
  pull(films) %>%
  head(4)
```

```
## [[1]]
## [1] "Revenge of the Sith"     "Return of the Jedi"
## [3] "The Empire Strikes Back" "A New Hope"
## [5] "The Force Awakens"
##
## [[2]]
## [1] "Attack of the Clones"    "The Phantom Menace"
## [3] "Revenge of the Sith"     "Return of the Jedi"
## [5] "The Empire Strikes Back" "A New Hope"
##
## [[3]]
## [1] "Attack of the Clones"    "The Phantom Menace"
## [3] "Revenge of the Sith"     "Return of the Jedi"
## [5] "The Empire Strikes Back" "A New Hope"
## [7] "The Force Awakens"
##
## [[4]]
## [1] "Revenge of the Sith"     "Return of the Jedi"
## [3] "The Empire Strikes Back" "A New Hope"
```

---

# List-column: put what you want in each cell

```r
iris %>%
  group_by(Species) %>%
  nest(.key = Data) %>%
  mutate(Model = purrr::map(Data, ~ lm(data = ., Sepal.Length ~ Petal.Length))) %>%
  mutate(Summary = purrr::map(Model, summary)) %>%
  mutate(`R squared` = purrr::map_dbl(Summary, ~ .$r.squared))
```

```
## # A tibble: 3 x 5
##   Species    Data              Model    Summary          `R squared`
##   <fct>      <list>            <list>   <list>                 <dbl>
## 1 setosa     <tibble [50 × 4]> <S3: lm> <S3: summary.lm>      0.0714
## 2 versicolor <tibble [50 × 4]> <S3: lm> <S3: summary.lm>      0.569
## 3 virginica  <tibble [50 × 4]> <S3: lm> <S3: summary.lm>      0.747
```

---

class: center, middle, inverse

<br>

# {tidyverse}

<img src="https://www.tidyverse.org/images/hex-tidyverse.png", width=270>

---

```r
library(tidyverse)
```

```
## ── Attaching packages ────────────────────────────────────────────────── tidyverse 1.2.1 ──
```

```
## ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
## ✔ tibble  1.4.2     ✔ dplyr   0.7.4
## ✔ tidyr   0.8.0     ✔ stringr 1.3.0
## ✔ readr   1.1.1     ✔ forcats 0.3.0
```

```
## ── Conflicts ───────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ tidyr::extract()   masks magrittr::extract()
## ✖ dplyr::filter()    masks stats::filter()
## ✖ dplyr::lag()       masks stats::lag()
## ✖ purrr::set_names() masks magrittr::set_names()
```

--
* {ggplot2}: build graphs
* {readr}: read rectangular data
* {purrr}: functional programming
* {stringr}: manipulate strings
* {forcats}: manipulate factors

--

Imports `%>%` from {magrittr}

---

class: center, middle, inverse

# Syracuse conjecture

## Application example

---

# Definition

`\begin{equation} \left\{ \begin{aligned} u_0 & = N \in \mathbb{N}, \\ u_{n+1} & = \left\{ \begin{aligned} \frac{u_n}{2} & \quad \text{ if } u_n\in 2\mathbb{N}, \\ 3u_n+1 & \quad \text{ else.} \end{aligned} \right. \end{aligned} \right. \end{equation}`

--

```r
syracuse <- function(x) {
  c(x, if (x > 1) Recall(if (x %% 2) x * 3 + 1 else x / 2))
}
```

--

```r
syracuse(6)
```

```
## [1]  6  3 10  5 16  8  4  2  1
```

```r
map(7:8, syracuse)
```

```
## [[1]]
##  [1]  7 22 11 34 17 52 26 13 40 20 10  5 16  8  4  2  1
##
## [[2]]
## [1] 8 4 2 1
```

---

# Creation of data

```r
data_syracuse <- 100 %>%
  seq_len() %>%
  tibble(N = .) %>%
  mutate(Sequence = map(N, syracuse))
```

--
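To check what the `Sequence` list-column actually holds, one can pull a single cell out of it. The sketch below rebuilds the data so that it runs on its own. It uses an iterative variant of the function, named `syracuse_iter` here (a hypothetical name, not part of the original slides), which is assumed to return the same vectors as `syracuse()` without relying on recursion.

```r
library(dplyr)
library(purrr)
library(tibble)

# Hypothetical iterative variant of syracuse(): builds the same trajectory
# with a loop instead of a recursive call
syracuse_iter <- function(x) {
  out <- x
  while (x > 1) {
    x <- if (x %% 2) x * 3 + 1 else x / 2
    out <- c(out, x)
  }
  out
}

data_syracuse <- tibble(N = seq_len(100)) %>%
  mutate(Sequence = map(N, syracuse_iter))

# Each cell of Sequence is a whole numeric vector: here, the trajectory of 6
data_syracuse %>%
  filter(N == 6) %>%
  pull(Sequence)
```

The loop grows the vector one term at a time, which is fine for trajectories of this size.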
---

```r
data_syracuse <- data_syracuse %>%
  mutate(Length = map_int(Sequence, length),
         Max = map_dbl(Sequence, max))
```

--
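Once `Length` and `Max` exist as ordinary columns, the usual {dplyr} verbs apply to them. The following sketch, which rebuilds the data so that it is self-contained, sorts the starting values by trajectory length. The recursion is the one from the slides, written with an explicit self-call rather than `Recall()`, which is an assumption made here for portability.

```r
library(dplyr)
library(purrr)
library(tibble)

# Same recursion as in the slides, with an explicit self-call instead of Recall()
syracuse <- function(x) {
  c(x, if (x > 1) syracuse(if (x %% 2) x * 3 + 1 else x / 2))
}

data_syracuse <- tibble(N = seq_len(100)) %>%
  mutate(
    Sequence = map(N, syracuse),
    Length   = map_int(Sequence, length),  # number of terms, starting value included
    Max      = map_dbl(Sequence, max)      # highest value reached
  )

# Starting values with the longest trajectories first
data_syracuse %>%
  select(N, Length, Max) %>%
  arrange(desc(Length)) %>%
  head(5)
```

`map_int()` and `map_dbl()` return atomic vectors rather than lists, which is why `arrange()` can sort on the new columns.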
---

# Shortest length?

```r
data_syracuse %>%
  select(-Max) %>%
  filter(Length != log2(N) + 1) %>%
  arrange(Length, desc(N))
```

--
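The condition `Length != log2(N) + 1` removes the powers of two: their trajectory only halves down to 1, so it contains exactly `log2(N) + 1` terms. A small base-R check of that claim, assuming the same recursion as in the slides (written with an explicit self-call instead of `Recall()`):

```r
# Same recursion as in the slides, with an explicit self-call instead of Recall()
syracuse <- function(x) {
  c(x, if (x > 1) syracuse(if (x %% 2) x * 3 + 1 else x / 2))
}

# Powers of two that are at most 100: 1, 2, 4, ..., 64
powers_of_two <- 2^(0:6)

# Number of terms in each trajectory
lengths_pow2 <- vapply(powers_of_two, function(n) length(syracuse(n)), integer(1))

# For powers of two the trajectory has exactly log2(N) + 1 terms
all(lengths_pow2 == log2(powers_of_two) + 1)
```

For any other `N`, `log2(N)` is not an integer, so the equality cannot hold and the row is kept by the filter.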
---

# Plots

```r
theme_set(theme_bw())

plot_syrac <- function(data, var){
  var <- enquo(var)
  ggplot(data) +
    aes(x = N, y = !!var) +
    geom_line()
}
```

--

Unfortunately, this syntax is not supported by {ggplot2} yet (version 2.2.1 at the time of this talk). It is still in development.

--

But one can use the old syntax, which will be deprecated:

```r
plot_syrac <- function(data, var){
  ggplot(data) +
    aes_(x = ~N, y = substitute(var)) +
    geom_line()
}
```

---

```r
p1 <- plot_syrac(data_syracuse, Length)
p2 <- plot_syrac(data_syracuse, log10(Max)) + ylab("log(Max)")
cowplot::plot_grid(p1, p2)
```

<img src="index_files/figure-html/plotssyrac-1.png" width="792" style="display: block; margin: auto;" />

---

class: middle, center, inverse

<br>

# References
---

### General

* <a href="http://r4ds.had.co.nz" target="_blank"><i>R for Data Science</i></a>, Garrett Grolemund & Hadley Wickham
* <a href="https://www.tidyverse.org" target="_blank"><i>www.tidyverse.org</i></a>

### Non-standard evaluation

* <a href="https://thinkr.fr/tidyeval/" target="_blank"><i>Tidyeval</i></a>, ThinkR (in French)

### For all your questions

* <a href="https://stackoverflow.com/" target="_blank"><i>Stack Overflow</i></a>

---

class: center, middle, inverse

<br>

# Thanks for your attention!

####
<a href="https://github.com/abichat" target="_blank">@abichat</a> <div style = "margin-top: -20px"></div> ####
<a href="https://twitter.com/_abichat" target="_blank">@_abichat</a> <div style = "margin-top: -20px"></div> ####
<a href="https://www.linkedin.com/in/antoinebichat" target="_blank">antoinebichat</a> <div style = "margin-top: -20px"></div> ####
<a href="https://abichat.github.io" target="_blank">abichat.github.io</a> <div style = "margin-top: -20px"></div> ####
<a href="mailto:antoine.bichat@mines-nancy.org?subject=Science%20Communication%20with%20R">antoine.bichat@mines-nancy.org</a> .footnote[Slides created via the R package <b><a href="https://github.com/yihui/xaringan" target="_blank">xaringan</a></b>.]