class: center, middle, inverse, title-slide # Manipulating objects in the Tidyverse ## Factors, dates and strings ### Antoine Bichat ### AgroParisTech ### December 13, 2019 --- # Configuration
```r library(tidyverse) library(lubridate) library(tidytuesdayR) set.seed(42) theme_set(theme_minimal()) Sys.setlocale("LC_TIME", "C") ``` ``` # A tibble: 9 x 2 Package Version <chr> <chr> 1 dplyr 0.8.3 2 forcats 0.4.0 3 ggexpanse 0.1.0 4 gghalves 0.0.1 5 ggplot2 3.2.1 6 lubridate 1.7.4 7 stringr 1.4.0 8 tidyr 1.0.0 9 tidytuesdayR 0.2.2 ``` --- # Disclaimer
Almost every function presented in these slides could be replaced by (ugly?) portions of code. As always, there is a trade-off between simplicity and readability (and consistency for R) on one side, and speed and dependencies on other packages on the other side.

<br>

- lubridate: <img src="https://tinyverse.netlify.com/badge/lubridate">
- stringr: <img src="https://tinyverse.netlify.com/badge/stringr">
- forcats: <img src="https://tinyverse.netlify.com/badge/forcats">
- ggplot2: <img src="https://tinyverse.netlify.com/badge/ggplot2">
- dplyr: <img src="https://tinyverse.netlify.com/badge/dplyr">
- tidyr: <img src="https://tinyverse.netlify.com/badge/tidyr">
- tidyverse: <img src="https://tinyverse.netlify.com/badge/tidyverse">

---
class: inverse, center, middle
background-image: url(img/hex_tidytuesday.png)
background-size: 15%
background-position: right 20px bottom 20px

.slide-in-right[
# TidyTuesday
]

---

# A weekly social data project in R

* Every Monday, a dataset is proposed on
[rfordatascience/tidytuesday](https://github.com/rfordatascience/tidytuesday).
* Every Tuesday (or after), everyone can post their visualizations on Twitter
with the hashtag `#TidyTuesday`. * There is a package to download proposed datasets: **tidytuesdayR**. * And a shiny app to see previous contributions: [tidytuesdayrocks](https://nsgrantham.shinyapps.io/tidytuesdayrocks/). * It's a great way to learn and discover new possibilities. <br> ### .center[.cursive[Practice makes perfect.]] --- class: noslidenumber # Some submissions .scroll-output[ .pull-left[ <blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr"><a href="https://twitter.com/hashtag/TidyTuesday?src=hash&ref_src=twsrc%5Etfw">#TidyTuesday</a> Week 2019-42 - Updates: More car racing, more fun!<br><br>Here is an animation showing energy efficiency on highways with a starting sequence and start line plus the suggested change of title and axis, thx <a href="https://twitter.com/JonTheGeek?ref_src=twsrc%5Etfw">@JonTheGeek</a>!<a href="https://twitter.com/R4DScommunity?ref_src=twsrc%5Etfw">@R4DScommunity</a> <a href="https://twitter.com/hashtag/ggplot2?src=hash&ref_src=twsrc%5Etfw">#ggplot2</a> <a href="https://twitter.com/hashtag/rstats?src=hash&ref_src=twsrc%5Etfw">#rstats</a> <a href="https://twitter.com/hashtag/dataviz?src=hash&ref_src=twsrc%5Etfw">#dataviz</a> <a href="https://t.co/Hooej9MXnF">pic.twitter.com/Hooej9MXnF</a></p>— Cédric Scherer (@CedScherer) <a href="https://twitter.com/CedScherer/status/1186335139925757952?ref_src=twsrc%5Etfw">October 21, 2019</a></blockquote> ] .pull-right[ <blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr"><a href="https://twitter.com/hashtag/TidyTuesday?src=hash&ref_src=twsrc%5Etfw">#TidyTuesday</a> contribution for this week required patience and perseverance but I did it 💪🏻🐦<a href="https://twitter.com/hashtag/rstats?src=hash&ref_src=twsrc%5Etfw">#rstats</a> <a href="https://twitter.com/hashtag/tidyverse?src=hash&ref_src=twsrc%5Etfw">#tidyverse</a> <a href="https://twitter.com/hashtag/dataviz?src=hash&ref_src=twsrc%5Etfw">#dataviz</a> <a 
href="https://twitter.com/hashtag/birds?src=hash&ref_src=twsrc%5Etfw">#birds</a> <a href="https://t.co/8P7Zjgvw0e">pic.twitter.com/8P7Zjgvw0e</a></p>— Antoine (@_abichat) <a href="https://twitter.com/_abichat/status/1123214724928241665?ref_src=twsrc%5Etfw">April 30, 2019</a></blockquote> ] ] --- # Nuclear explosions
```r df_nuclear <- tt_load("2019-08-20")$nuclear_explosions df_nuclear ``` ``` # A tibble: 2,051 x 16 date_long year id_no country region source latitude longitude magnitude_body <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> 1 19450716 1945 45001 USA ALAMO… DOE 32.5 -106. 0 2 19450805 1945 45002 USA HIROS… DOE 34.2 132. 0 3 19450809 1945 45003 USA NAGAS… DOE 32.4 130. 0 4 19460630 1946 46001 USA BIKINI DOE 11.4 165. 0 5 19460724 1946 46002 USA BIKINI DOE 11.4 165. 0 6 19480414 1948 48001 USA ENEWE… DOE 11.3 162. 0 7 19480430 1948 48002 USA ENEWE… DOE 11.3 162. 0 8 19480514 1948 48003 USA ENEWE… DOE 11.3 162. 0 9 19490829 1949 49001 USSR SEMI … DOE 48 76 0 10 19510127 1951 51001 USA NTS DOE 37 -116 0 # … with 2,041 more rows, and 7 more variables: magnitude_surface <dbl>, # depth <dbl>, yield_lower <dbl>, yield_upper <dbl>, purpose <chr>, # name <chr>, type <chr> ``` --- class: noslidenumber # Contributions .pull-left[ <blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">📊 My contribution to this week's <a href="https://twitter.com/hashtag/TidyTuesday?src=hash&ref_src=twsrc%5Etfw">#TidyTuesday</a> (this time actually on a Tuesday): nuclear explosions since 1945!💥<a href="https://twitter.com/R4DScommunity?ref_src=twsrc%5Etfw">@R4DScommunity</a><br> <a href="https://twitter.com/hashtag/dataviz?src=hash&ref_src=twsrc%5Etfw">#dataviz</a> <a href="https://twitter.com/hashtag/rstats?src=hash&ref_src=twsrc%5Etfw">#rstats</a> <a href="https://twitter.com/hashtag/ggplot2?src=hash&ref_src=twsrc%5Etfw">#ggplot2</a><br><br>(Code below) <a href="https://t.co/GjyUhgoogX">pic.twitter.com/GjyUhgoogX</a></p>— Gil Henriques 🌹 (@_Gil_Henriques) <a href="https://twitter.com/_Gil_Henriques/status/1163836007743025152?ref_src=twsrc%5Etfw">August 20, 2019</a></blockquote> ] .pull-right[ <blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr"><a href="https://twitter.com/hashtag/TidyTuesday?src=hash&ref_src=twsrc%5Etfw">#TidyTuesday</a> Nuclear 
explosions. Got some inspiration from the PDF of the original report. Hopefully I will be forgiven for the double axis graph, lol<a href="https://twitter.com/hashtag/r4ds?src=hash&ref_src=twsrc%5Etfw">#r4ds</a> <a href="https://twitter.com/hashtag/rstats?src=hash&ref_src=twsrc%5Etfw">#rstats</a> <a href="https://twitter.com/hashtag/dataviz?src=hash&ref_src=twsrc%5Etfw">#dataviz</a> <a href="https://twitter.com/hashtag/nuclearweapons?src=hash&ref_src=twsrc%5Etfw">#nuclearweapons</a> <a href="https://t.co/vAgydjrCHK">pic.twitter.com/vAgydjrCHK</a></p>— Harro Cyranka 🔎 (@harrocyranka) <a href="https://twitter.com/harrocyranka/status/1163805331929141248?ref_src=twsrc%5Etfw">August 20, 2019</a></blockquote> ] --- # Roman emperors
```r df_emperors <- tt_load("2019-08-13")$emperors df_emperors ``` ``` # A tibble: 68 x 16 index name name_full birth death birth_cty birth_prv rise <dbl> <chr> <chr> <date> <date> <chr> <chr> <chr> 1 1 Augu… IMPERATO… 0062-09-23 0014-08-19 Rome Italia Birt… 2 2 Tibe… TIBERIVS… 0041-11-16 0037-03-16 Rome Italia Birt… 3 3 Cali… GAIVS IV… 0012-08-31 0041-01-24 Antitum Italia Birt… 4 4 Clau… TIBERIVS… 0009-08-01 0054-10-13 Lugdunum Gallia L… Birt… 5 5 Nero NERO CLA… 0037-12-15 0068-06-09 Antitum Italia Birt… 6 6 Galba SERVIVS … 0002-12-24 0069-01-15 Terracina Italia Seiz… 7 7 Otho MARCVS S… 0032-04-28 0069-04-16 Terentin… Italia Appo… 8 8 Vite… AVLVS VI… 0015-09-24 0069-12-20 Rome Italia Seiz… 9 9 Vesp… TITVS FL… 0009-11-17 0079-06-24 Falacrine Italia Seiz… 10 10 Titus TITVS FL… 0039-12-30 0081-09-13 Rome Italia Birt… # … with 58 more rows, and 8 more variables: reign_start <date>, # reign_end <date>, cause <chr>, killer <chr>, dynasty <chr>, era <chr>, # notes <chr>, verif_who <chr> ``` --- class: noslidenumber # Contributions .scroll-output[ .pull-left[ <blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Inspired by the periodic table of elements, I present you the unperiodic table of the Roman emperors for this week’s <a href="https://twitter.com/hashtag/TidyTuesday?src=hash&ref_src=twsrc%5Etfw">#TidyTuesday</a>!<br><br>code: <a href="https://t.co/yYmzriAURg">https://t.co/yYmzriAURg</a><a href="https://twitter.com/hashtag/dataviz?src=hash&ref_src=twsrc%5Etfw">#dataviz</a> <a href="https://twitter.com/hashtag/rstats?src=hash&ref_src=twsrc%5Etfw">#rstats</a> <a href="https://twitter.com/hashtag/ggplot?src=hash&ref_src=twsrc%5Etfw">#ggplot</a> <a href="https://t.co/fNd21Xl4kl">pic.twitter.com/fNd21Xl4kl</a></p>— Georgios Karamanis (@geokaramanis) <a href="https://twitter.com/geokaramanis/status/1162035459884589057?ref_src=twsrc%5Etfw">August 15, 2019</a></blockquote> ] .pull-right[ <blockquote class="twitter-tweet" data-lang="en"><p lang="en" 
dir="ltr">My <a href="https://twitter.com/hashtag/TidyTuesday?src=hash&ref_src=twsrc%5Etfw">#TidyTuesday</a> contribution. Was fun to work with <a href="https://twitter.com/hashtag/ggforce?src=hash&ref_src=twsrc%5Etfw">#ggforce</a> annotations. <a href="https://t.co/QSZdt8bQMy">pic.twitter.com/QSZdt8bQMy</a></p>— Philippe Massicotte (@philmassicotte) <a href="https://twitter.com/philmassicotte/status/1161728575734722560?ref_src=twsrc%5Etfw">August 14, 2019</a></blockquote>
]
]

---
class: inverse, center, middle
background-image: url(img/hex_forcats.png)
background-size: 15%
background-position: right 20px bottom 20px

.slide-in-left[
# Dealing with factors
]

---

# What is a factor?

* Used to represent categorical variables.
* Fixed and known set of possible values (even those not present in the dataset).
* Can be ordered.
* Essential for modeling.
* Stored as integers in their underlying representation (but character vectors are now stored compactly too, so factors no longer offer a memory advantage).

.footnote[
[stringsAsFactors: An unauthorized biography](https://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/)
]

---

# Convert to factor

```r
fruits <- c("banana", "apple", "mango", "apple", "pear", "apple", "banana",
            "pitaya", "mango", "mango", "apple")
as.factor(fruits) # Use alphabetical order
```

```
 [1] banana apple  mango  apple  pear   apple  banana pitaya mango  mango 
[11] apple 
Levels: apple banana mango pear pitaya
```

--

```r
as_factor(fruits) # Use appearance order
```

```
 [1] banana apple  mango  apple  pear   apple  banana pitaya mango  mango 
[11] apple 
Levels: banana apple mango pear pitaya
```

Using appearance order increases reproducibility because it's independent of `locale()`.

.footnote[
This can also be done with base R: `factor(fruits, levels = unique(fruits))`.
]

---

# Change level names

```r
fct_recode(fruits, dragonfruit = "pitaya")
```

```
 [1] banana      apple       mango       apple       pear        apple      
 [7] banana      dragonfruit mango       mango       apple      
Levels: apple banana mango pear dragonfruit
```

--

```r
fct_relabel(fruits, str_to_title)
```

```
 [1] Banana Apple  Mango  Apple  Pear   Apple  Banana Pitaya Mango  Mango 
[11] Apple 
Levels: Apple Banana Mango Pear Pitaya
```

.footnote[
Note that when converting from strings to factors, `fct_recode()` and `fct_relabel()` use alphabetical order.
]

---

# Reorder levels

You can reorder levels:

* manually with `fct_relevel()`,
* by appearance with `fct_inorder()`,
* by frequency with `fct_infreq()`,
* according to another variable with `fct_reorder()`,
* according to the last value of another variable with `fct_reorder2()`,
* randomly with `fct_shuffle()`,
* by reversing order with `fct_rev()`...
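
A minimal sketch of a few of these functions, reusing the fruit values from the previous slides. The name `fruits_fct` is only illustrative and is not used elsewhere in the deck.

```r
library(forcats)

# Same values as `fruits`, stored as a factor with alphabetical levels
fruits_fct <- factor(c("banana", "apple", "mango", "apple", "pear", "apple",
                       "banana", "pitaya", "mango", "mango", "apple"))

levels(fruits_fct)                                    # apple banana mango pear pitaya
levels(fct_relevel(fruits_fct, "pear"))               # pear moved to the front
levels(fct_relevel(fruits_fct, "apple", after = Inf)) # apple moved to the end
levels(fct_inorder(fruits_fct))                       # banana apple mango pear pitaya
levels(fct_rev(fruits_fct))                           # pitaya pear mango banana apple
```

Only the levels change: the values stored in the factor stay in the same order.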
--- # By frequency ```r fct_count(fruits, sort = TRUE) ``` ``` # A tibble: 5 x 2 f n <fct> <int> 1 apple 4 2 mango 3 3 banana 2 4 pear 1 5 pitaya 1 ``` ```r fct_infreq(fruits) ``` ``` [1] banana apple mango apple pear apple banana pitaya mango mango [11] apple Levels: apple mango banana pear pitaya ``` --- # According to another variable .pull-left-60[ ```r levels(iris$Species) ``` ``` [1] "setosa" "versicolor" "virginica" ``` ```r iris %>% * mutate(Species = fct_reorder(Species, Sepal.Width)) %>% ggplot() + aes(x = Species, y = Sepal.Width, fill = Species) + geom_boxplot(notch = TRUE, show.legend = FALSE) ``` ] .pull-right-40[ <img src="index_files/figure-html/plot-irisreorder-1.png" width="504" style="display: block; margin: auto;" /> ] --- # Tidy WorldPhones
```r WorldPhones %>% as_tibble(rownames = "Year") ``` ``` # A tibble: 7 x 8 Year N.Amer Europe Asia S.Amer Oceania Africa Mid.Amer <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 1951 45939 21574 2876 1815 1646 89 555 2 1956 60423 29990 4708 2568 2366 1411 733 3 1957 64721 32510 5230 2695 2526 1546 773 4 1958 68484 35218 6662 2845 2691 1663 836 5 1959 71799 37598 6856 3000 2868 1769 911 6 1960 76036 40341 8220 3145 3054 1905 1008 7 1961 79831 43173 9053 3338 3224 2005 1076 ``` --- count: false # Tidy WorldPhones
```r WorldPhones %>% as_tibble(rownames = "Year") %>% * pivot_longer(-Year, names_to = "Region", values_to = "Count") ``` ``` # A tibble: 49 x 3 Year Region Count <chr> <chr> <dbl> 1 1951 N.Amer 45939 2 1951 Europe 21574 3 1951 Asia 2876 4 1951 S.Amer 1815 5 1951 Oceania 1646 6 1951 Africa 89 7 1951 Mid.Amer 555 8 1956 N.Amer 60423 9 1956 Europe 29990 10 1956 Asia 4708 # … with 39 more rows ``` --- # According to the last value
.pull-left-60[ ```r WorldPhones %>% as_tibble(rownames = "Year") %>% pivot_longer(-Year, names_to = "Region", values_to = "Count") %>% mutate(Year = as.numeric(Year), * Region = fct_reorder2(Region, Year, Count)) %>% ggplot() + aes(x = Year, y = Count, color = Region) + geom_line() + scale_y_log10() ``` ] .pull-right-40[ <img src="index_files/figure-html/plot-worldphonesreorder-1.png" width="504" style="display: block; margin: auto;" /> ] --- # {tidyr} digression .pull-left[ ```r anscombe ``` ``` x1 x2 x3 x4 y1 y2 y3 y4 1 10 10 10 8 8.04 9.14 7.46 6.58 2 8 8 8 8 6.95 8.14 6.77 5.76 3 13 13 13 8 7.58 8.74 12.74 7.71 4 9 9 9 8 8.81 8.77 7.11 8.84 5 11 11 11 8 8.33 9.26 7.81 8.47 6 14 14 14 8 9.96 8.10 8.84 7.04 7 6 6 6 8 7.24 6.13 6.08 5.25 8 4 4 4 19 4.26 3.10 5.39 12.50 9 12 12 12 8 10.84 9.13 8.15 5.56 10 7 7 7 8 4.82 7.26 6.42 7.91 11 5 5 5 8 5.68 4.74 5.73 6.89 ``` ] -- .pull-right[ ```r anscombe %>% pivot_longer( everything(), * names_to = c(".value", "group"), * names_pattern = "(.)(.)") ``` ``` # A tibble: 44 x 3 group x y <chr> <dbl> <dbl> 1 1 10 8.04 2 2 10 9.14 3 3 10 7.46 4 4 8 6.58 5 1 8 6.95 6 2 8 8.14 7 3 8 6.77 8 4 8 5.76 9 1 13 7.58 10 2 13 8.74 # … with 34 more rows ``` ] --- # Time to practice! <img src="index_files/figure-html/plot-col-nuclear-1.png" width="864" style="display: block; margin: auto;" /> --- count: false # Time to practice! .scroll-output[ ```r df_nuclear %>% count(country, sort = TRUE) %>% mutate(country = fct_inorder(country), country = fct_rev(country), country = fct_recode(country, France = "FRANCE", China = "CHINA", India = "INDIA", Pakistan = "PAKIST")) %>% ggplot() + aes(x = country, y = n, fill = country) + geom_col(show.legend = FALSE) + coord_flip() + labs(x = "Country", y = "Total number of nuclear explosions") + scale_fill_viridis_d(option = "E", direction = -1) ``` <img src="index_files/figure-html/col-nuclear-1.png" width="864" style="display: block; margin: auto;" /> ] --- # Too many levels? 
.scroll-box-16[ ```r df_nuclear$type <- str_to_title(df_nuclear$type) fct_count(df_nuclear$type, sort = TRUE) ``` ``` # A tibble: 20 x 2 f n <fct> <int> 1 Shaft 1015 2 Tunnel 310 3 Atmosph 185 4 Shaft/Gr 85 5 Airdrop 78 6 Tower 75 7 Balloon 62 8 Surface 62 9 Shaft/Lg 56 10 Barge 40 11 Ug 32 12 Gallery 13 13 Rocket 13 14 Crater 9 15 Uw 8 16 Space 4 17 Mine 1 18 Ship 1 19 Water Su 1 20 Watersur 1 ``` ] --- # Lump least common factors ```r df_nuclear$type %>% # Preserve the most common `n` values * fct_lump(n = 5) %>% table() ``` ``` . Airdrop Atmosph Shaft Shaft/Gr Tunnel Other 78 185 1015 85 310 378 ``` -- ```r df_nuclear$type %>% # Preserve the values that appear at least `min` number of times * fct_lump_min(min = 20) %>% table() ``` ``` . Airdrop Atmosph Balloon Barge Shaft Shaft/Gr Shaft/Lg Surface 78 185 62 40 1015 85 56 62 Tower Tunnel Ug Other 75 310 32 51 ``` --- # Manually collapse levels ```r df_nuclear$type %>% * fct_collapse(Air = c("Atmosph", "Airdrop", "Balloon", "Rocket"), Underground = c("Shaft", "Tunnel", "Shaft/Gr", "Shaft/Lg", "Ug", "Gallery"), Water = c("Barge", "Uw", "Ship", "Water Su", "Watersur"), * group_other = TRUE) %>% fct_count(sort = TRUE) ``` ``` # A tibble: 4 x 2 f n <fct> <int> 1 Underground 1136 2 Air 365 3 Other 352 4 Water 198 ``` --- # Factors are integers! ```r vegetables <- factor(c("carrot", "lettuce", "endive")) fruits <- as_factor(fruits) ``` -- ```r c(fruits, vegetables) ``` ``` [1] 1 2 3 2 4 2 1 5 3 3 2 1 3 2 ``` -- ```r fct_c(fruits, vegetables) ``` ``` [1] banana apple mango apple pear apple banana pitaya mango [10] mango apple carrot lettuce endive Levels: banana apple mango pear pitaya carrot endive lettuce ``` -- .footnote[ ```r as.numeric(factor(runif(4))) # Don't forget as.character() ``` ``` [1] 3 4 1 2 ``` ] --- # Time to practice! <img src="index_files/figure-html/plot-line-nuclear-1.png" width="864" style="display: block; margin: auto;" /> --- count: false # Time to practice! 
.scroll-output[ ```r df_nuclear %>% mutate(country = fct_collapse(country, `PAKISTAN\n& INDIA` = c("INDIA", "PAKIST"))) %>% count(year, country) %>% group_by(country) %>% mutate(cum = cumsum(n)) %>% ungroup() ``` ``` # A tibble: 165 x 4 year country n cum <dbl> <fct> <int> <int> 1 1945 USA 3 3 2 1946 USA 2 5 3 1948 USA 3 8 4 1949 USSR 1 1 5 1951 USA 16 24 6 1951 USSR 2 3 7 1952 UK 1 1 8 1952 USA 10 34 9 1953 UK 2 3 10 1953 USA 11 45 # … with 155 more rows ``` ] --- count: false # Time to practice! .scroll-output[ ```r df_nuclear %>% mutate(country = fct_collapse(country, `PAKISTAN\n& INDIA` = c("INDIA", "PAKIST"))) %>% count(year, country) %>% group_by(country) %>% mutate(cum = cumsum(n)) %>% ungroup() %>% mutate(country = fct_reorder2(country, year, cum)) %>% ggplot() + aes(x = year, y = cum, color = country) + geom_line(size = 1, key_glyph = "timeseries") + ggexpanse::scale_color_expanse() + labs(x = NULL, color = "Country", y = "Cumulative number of nuclear explosions") + ggexpanse::theme_expanse() ``` <img src="index_files/figure-html/line-nuclear-1.png" width="864" style="display: block; margin: auto;" /> ] --- class: inverse, center, middle background-image: url(img/hex_lubridate.png) background-size: 15% background-position: right 20px bottom 20px .slide-in-right[ # Dealing with dates ] .slide-in-left[ ## and hours ] --- # ISO 8601 .pull-left[ Convention for dates: .Large[.center[.content-box[YYYY-MM-DD]]] <br> Convention for time: .Large[.center[.content-box[HH:MM:SS]]] ] .pull-right[ .center[ <img src="img/xkcd_iso8601.png" height="450"> ] ] .footnote[
[XKCD](https://xkcd.com/1179/)
]

---

# Parse date

Six functions are available to parse dates from **y**ear, **m**onth, and **d**ay components: `ymd()`, `ydm()`, `mdy()`, `myd()`, `dmy()`, `dym()`.

```r
first_landing <- ymd("1969-07-20")
class(first_landing)
```

```
[1] "Date"
```

```r
first_landing
```

```
[1] "1969-07-20"
```

--

Input formats can vary widely, as long as the components appear in the specified order.

```r
mdy(c("7/20 69", "July 20, 1969",
      "First step was on July, the 20th (1969)"))
```

```
[1] "1969-07-20" "1969-07-20" "1969-07-20"
```

---

# Parse time

Most of these functions (`ymd()`, `ydm()`, `mdy()` and `dmy()`) can be suffixed with `_h`, `_hm` or `_hms` to take into account **h**our, **m**inute, and **s**econd components.

```r
first_step <- ymd_hm("1969-07-20 20:17")
class(first_step)
```

```
[1] "POSIXct" "POSIXt" 
```

```r
first_step
```

```
[1] "1969-07-20 20:17:00 UTC"
```

---

# Extract components

.pull-left[

```r
year(first_landing)
```

```
[1] 1969
```

```r
month(first_landing)
```

```
[1] 7
```

```r
day(first_landing)
```

```
[1] 20
```

```r
hour(first_step)
```

```
[1] 20
```
]

--

.pull-right[
.scroll-output[

```r
month(first_landing, label = TRUE)
```

```
[1] Jul
12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
```

```r
wday(first_landing, label = TRUE, abbr = FALSE)
```

```
[1] Sunday
7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday
```

```r
hour(first_landing) # returns 0
```

```
[1] 0
```
]

.footnote[
There is also `yday()`, `quarter()`, `semester()`, `dst()`, `am()`, `pm()`, `tz()`, `leap_year()`...
]
]

---

# Change and round components

You can change components with a simple assignment.

```r
second(first_step) <- 30
first_step
```

```
[1] "1969-07-20 20:17:30 UTC"
```

--

Rounding dates (to the nearest unit, down or up) is easy.
```r round_date(first_step, unit = "hours") # ceiling_date() / floor_date() ``` ``` [1] "1969-07-20 20:00:00 UTC" ``` ```r round_date(first_step, unit = "15mins") ``` ``` [1] "1969-07-20 20:15:00 UTC" ``` --- # Time to practice! <img src="index_files/figure-html/plot-calendar-nuclear-1.png" width="864" style="display: block; margin: auto;" /> --- count: false # Time to practice! .scroll-output[ ```r df_nuclear %>% select(date_long, country) %>% mutate(date_long = ymd(date_long), month = month(date_long, label = TRUE), wday = wday(date_long, label = TRUE, abbr = FALSE, week_start = 1), wday = fct_rev(wday)) ``` ``` # A tibble: 2,051 x 4 date_long country month wday <date> <chr> <ord> <ord> 1 1945-07-16 USA Jul Monday 2 1945-08-05 USA Aug Sunday 3 1945-08-09 USA Aug Thursday 4 1946-06-30 USA Jun Sunday 5 1946-07-24 USA Jul Wednesday 6 1948-04-14 USA Apr Wednesday 7 1948-04-30 USA Apr Friday 8 1948-05-14 USA May Friday 9 1949-08-29 USSR Aug Monday 10 1951-01-27 USA Jan Saturday # … with 2,041 more rows ``` ] --- count: false # Time to practice! 
.scroll-output[

```r
df_nuclear %>%
  select(date_long, country) %>%
  mutate(date_long = ymd(date_long),
         month = month(date_long, label = TRUE),
         wday = wday(date_long, label = TRUE, abbr = FALSE, week_start = 1),
         wday = fct_rev(wday)) %>%
  filter(country %in% c("USA", "USSR")) %>%
  count(month, wday, country) %>%
  ggplot() +
  aes(x = month, y = wday, fill = n) +
  geom_tile() +
  scale_fill_viridis_c(option = "E") +
  facet_grid(~ country) +
  labs(x = NULL, y = NULL, fill = "Number of\nexplosions") +
  theme_minimal() +
  theme(panel.grid = element_blank())
```

<img src="index_files/figure-html/calendar-nuclear-1.png" width="864" style="display: block; margin: auto;" />
]

---

# Current date and time

```r
today()
```

```
[1] "2019-12-13"
```

```r
now()
```

```
[1] "2019-12-13 10:17:05 CET"
```

--

```r
today() == Sys.Date()
```

```
[1] TRUE
```

```r
now() == Sys.time() # would be TRUE if computer processed both at the same instant
```

```
[1] FALSE
```

---

# Intervals

Intervals are objects defined by a start date and an end date.

An interval is created with `interval()` or `%--%`.

--

Several functions work on intervals:

* `time_length()` computes the length of a time span (a unit can be specified),
* `int_start()` and `int_end()` extract start and end dates,
* `int_overlaps()` checks if intervals overlap,
* `int_aligns()` checks if intervals share a boundary,
* `%within%` checks if a date-time falls within an interval,
* `int_diff()` computes the intervals between consecutive elements of a vector of dates...
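
A minimal sketch of these functions on small, made-up intervals. The names `july`, `late_july` and `august` are hypothetical and chosen only for illustration.

```r
library(lubridate)

# Hypothetical intervals, not taken from the datasets used in the slides
july      <- interval(ymd("1969-07-01"), ymd("1969-07-31"))
late_july <- ymd("1969-07-16") %--% ymd("1969-07-31")  # same as interval()
august    <- interval(ymd("1969-08-01"), ymd("1969-08-31"))

time_length(july, unit = "days")  # 30
int_start(late_july)              # start of the interval, as a date-time
int_end(late_july)                # end of the interval, as a date-time
int_overlaps(july, late_july)     # TRUE
int_overlaps(july, august)        # FALSE
int_aligns(july, late_july)       # TRUE: both end on 1969-07-31
ymd("1969-07-20") %within% july   # TRUE

# One interval per pair of consecutive dates
int_diff(ymd(c("1969-07-16", "1969-07-20", "1969-07-24")))
```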
--- # Practice intervals ```r df_nuclear %>% select(date_long, country) %>% mutate(date_long = ymd(date_long)) %>% group_by(country) %>% summarise(start = min(date_long), end = max(date_long)) ``` ``` # A tibble: 7 x 3 country start end <chr> <date> <date> 1 CHINA 1964-10-16 1996-07-29 2 FRANCE 1960-02-13 1996-01-27 3 INDIA 1974-05-18 1998-05-13 4 PAKIST 1998-05-28 1998-05-30 5 UK 1952-10-03 1991-11-26 6 USA 1945-07-16 1992-09-23 7 USSR 1949-08-29 1990-10-24 ``` --- count: false # Practice intervals ```r df_nuclear %>% select(date_long, country) %>% mutate(date_long = ymd(date_long)) %>% group_by(country) %>% summarise(start = min(date_long), end = max(date_long)) %>% mutate(interval = interval(start, end)) ``` ``` # A tibble: 7 x 4 country start end interval <chr> <date> <date> <Interval> 1 CHINA 1964-10-16 1996-07-29 1964-10-16 UTC--1996-07-29 UTC 2 FRANCE 1960-02-13 1996-01-27 1960-02-13 UTC--1996-01-27 UTC 3 INDIA 1974-05-18 1998-05-13 1974-05-18 UTC--1998-05-13 UTC 4 PAKIST 1998-05-28 1998-05-30 1998-05-28 UTC--1998-05-30 UTC 5 UK 1952-10-03 1991-11-26 1952-10-03 UTC--1991-11-26 UTC 6 USA 1945-07-16 1992-09-23 1945-07-16 UTC--1992-09-23 UTC 7 USSR 1949-08-29 1990-10-24 1949-08-29 UTC--1990-10-24 UTC ``` --- count: false # Practice intervals ```r df_nuclear %>% select(date_long, country) %>% mutate(date_long = ymd(date_long)) %>% group_by(country) %>% summarise(start = min(date_long), end = max(date_long)) %>% mutate(interval = interval(start, end), length = time_length(interval, unit = "years")) ``` ``` # A tibble: 7 x 5 country start end interval length <chr> <date> <date> <Interval> <dbl> 1 CHINA 1964-10-16 1996-07-29 1964-10-16 UTC--1996-07-29 UTC 31.8 2 FRANCE 1960-02-13 1996-01-27 1960-02-13 UTC--1996-01-27 UTC 36.0 3 INDIA 1974-05-18 1998-05-13 1974-05-18 UTC--1998-05-13 UTC 24.0 4 PAKIST 1998-05-28 1998-05-30 1998-05-28 UTC--1998-05-30 UTC 0.00548 5 UK 1952-10-03 1991-11-26 1952-10-03 UTC--1991-11-26 UTC 39.1 6 USA 1945-07-16 1992-09-23 1945-07-16 
UTC--1992-09-23 UTC 47.2
7 USSR    1949-08-29 1990-10-24 1949-08-29 UTC--1990-10-24 UTC 41.2
```

---
count: false

# Practice intervals

```r
df_nuclear %>%
  select(date_long, country) %>%
  mutate(date_long = ymd(date_long)) %>%
  group_by(country) %>%
  summarise(start = min(date_long), end = max(date_long)) %>%
  mutate(interval = interval(start, end),
         length = time_length(interval, unit = "years"),
         landing = first_landing %within% interval)
```

```
# A tibble: 7 x 6
  country start      end        interval                         length landing
  <chr>   <date>     <date>     <Interval>                        <dbl> <lgl>
1 CHINA   1964-10-16 1996-07-29 1964-10-16 UTC--1996-07-29 UTC 31.8     TRUE
2 FRANCE  1960-02-13 1996-01-27 1960-02-13 UTC--1996-01-27 UTC 36.0     TRUE
3 INDIA   1974-05-18 1998-05-13 1974-05-18 UTC--1998-05-13 UTC 24.0     FALSE
4 PAKIST  1998-05-28 1998-05-30 1998-05-28 UTC--1998-05-30 UTC  0.00548 FALSE
5 UK      1952-10-03 1991-11-26 1952-10-03 UTC--1991-11-26 UTC 39.1     TRUE
6 USA     1945-07-16 1992-09-23 1945-07-16 UTC--1992-09-23 UTC 47.2     TRUE
7 USSR    1949-08-29 1990-10-24 1949-08-29 UTC--1990-10-24 UTC 41.2     TRUE
```

---

# Periods

Periods are time spans counted in human-readable units that ignore timeline irregularities.

```r
days(1) # Periods use pluralized unit names
```

```
[1] "1d 0H 0M 0S"
```

```r
weeks(1)
```

```
[1] "7d 0H 0M 0S"
```

```r
months(2)
```

```
[1] "2m 0d 0H 0M 0S"
```

```r
time_length(years(1), unit = "days")
```

```
[1] 365.25
```

---

# Durations

Durations are time spans measured in seconds that track physical time.

```r
ddays(1) # Durations use pluralized unit names prefixed by d
```

```
[1] "86400s (~1 days)"
```

```r
dweeks(1)
```

```
[1] "604800s (~1 weeks)"
```

```r
time_length(dyears(1), unit = "days")
```

```
[1] 365
```

.footnote[`dmonths()` doesn't exist in lubridate 1.7.4.]
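
The difference between periods and durations shows up around a daylight saving time change. A minimal sketch, assuming the `Europe/Paris` time zone, where clocks moved forward one hour on 31 March 2019. The name `noon_before_dst` is only illustrative.

```r
library(lubridate)

# Noon on the day before the switch to summer time
noon_before_dst <- ymd_hms("2019-03-30 12:00:00", tz = "Europe/Paris")

# A period follows the clock: same clock time on the next day
noon_before_dst + days(1)   # "2019-03-31 12:00:00 CEST"

# A duration adds exactly 86400 seconds: one hour "later" on the clock
noon_before_dst + ddays(1)  # "2019-03-31 13:00:00 CEST"
```

Use periods for calendar reasoning ("same time tomorrow") and durations for elapsed physical time.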
---

# Date algebra

```r
2 * days(3) + hours(3) + minutes(65) - 15 * seconds()
```

```
[1] "6d 3H 65M -15S"
```

--

```r
seconds_to_period(2 * days(3) + hours(3) + minutes(65) - 15 * seconds())
```

```
[1] "6d 4H 4M 45S"
```

--

```r
now() + weeks(1) + hours(2)
```

```
[1] "2019-12-20 12:17:06 CET"
```

```r
now() + dweeks(1) + dhours(2)
```

```
[1] "2019-12-20 12:17:06 CET"
```

---

# Leap years

```r
now()
```

```
[1] "2019-12-13 10:17:06 CET"
```

```r
now() + years(6)
```

```
[1] "2025-12-13 10:17:06 CET"
```

```r
now() + dyears(6)
```

```
[1] "2025-12-11 10:17:06 CET"
```

```r
leap_year(now() + years(0:6))
```

```
[1] FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE
```

---

# February 31st

```r
ymd("2020-01-31")
```

```
[1] "2020-01-31"
```

```r
ymd("2020-01-31") + months(1)
```

```
[1] NA
```

```r
ymd("2020-01-31") %m+% months(1) # %m-% exists too
```

```
[1] "2020-02-29"
```

```r
add_with_rollback(ymd("2020-01-31"), months(1), roll_to_first = TRUE)
```

```
[1] "2020-03-01"
```

---

# Time zones

There are 593 different time zones.

```r
sample(OlsonNames(), 9)
```

```
[1] "America/Asuncion"     "Antarctica/Rothera"   "America/Lima"        
[4] "Africa/Porto-Novo"    "America/Indiana/Knox" "Asia/Qyzylorda"      
[7] "Africa/Gaborone"      "Australia/Lord_Howe"  "America/Montevideo"  
```

--

By default, the time zone is UTC (Coordinated Universal Time).

```r
first_step
```

```
[1] "1969-07-20 20:17:30 UTC"
```

```r
tz(first_step)
```

```
[1] "UTC"
```

---

# Manipulating time zones

Use `with_tz()` to display the same instant in another time zone.

```r
first_step
```

```
[1] "1969-07-20 20:17:30 UTC"
```

```r
with_tz(first_step, "US/Eastern")
```

```
[1] "1969-07-20 16:17:30 EDT"
```

--

`force_tz()` keeps the clock time but assigns it to a new time zone, which changes the instant.
```r force_tz(first_step, "US/Eastern") ``` ``` [1] "1969-07-20 20:17:30 EDT" ``` ```r first_step %>% force_tz("US/Eastern") %>% with_tz("UTC") ``` ``` [1] "1969-07-21 00:17:30 UTC" ``` --- # Nice print with template ```r st <- stamp("Created on Sunday 1 December 2019") st(first_landing) ``` ``` [1] "Created on Sunday 20 July 1969" ``` ```r st(today() + months(0:4)) ``` ``` [1] "Created on Friday 13 December 2019" [2] "Created on Monday 13 January 2020" [3] "Created on Thursday 13 February 2020" [4] "Created on Friday 13 March 2020" [5] "Created on Monday 13 April 2020" ``` --- # Time to practice! <img src="index_files/figure-html/plot-cause-death-1.png" width="864" style="display: block; margin: auto;" /> --- count: false # Time to practice! .scroll-output[ ```r df_emperors %>% select(birth, death, cause) %>% filter(birth <= death) %>% mutate(age = time_length(death - birth, unit = "year"), cause = fct_lump_min(cause, min = 5), cause = fct_reorder(cause, age)) ``` ``` # A tibble: 61 x 4 birth death cause age <date> <date> <fct> <dbl> 1 0012-08-31 0041-01-24 Assassination 28.4 2 0009-08-01 0054-10-13 Assassination 45.2 3 0037-12-15 0068-06-09 Suicide 30.5 4 0002-12-24 0069-01-15 Assassination 66.1 5 0032-04-28 0069-04-16 Suicide 37.0 6 0015-09-24 0069-12-20 Assassination 54.3 7 0009-11-17 0079-06-24 Natural Causes 69.6 8 0039-12-30 0081-09-13 Natural Causes 41.7 9 0051-10-24 0096-09-18 Assassination 44.9 10 0030-11-08 0098-01-27 Natural Causes 67.3 # … with 51 more rows ``` ] --- count: false # Time to practice! 
.scroll-output[

```r
df_emperors %>%
  select(birth, death, cause) %>%
  filter(birth <= death) %>%
  mutate(age = time_length(death - birth, unit = "year"),
         cause = fct_lump_min(cause, min = 5),
         cause = fct_reorder(cause, age)) %>%
  ggplot() +
  aes(x = cause, y = age, fill = cause, color = cause) +
  gghalves::geom_half_violin(alpha = 0.8) +
  gghalves::geom_half_dotplot(binwidth = 1.5, alpha = 0.8) +
  gghalves::geom_half_boxplot(color = "black", alpha = 0) +
  scale_fill_viridis_d() +
  scale_color_viridis_d() +
  labs(x = "Cause of death", y = "Age at death") +
  theme(legend.position = "none")
```

<img src="index_files/figure-html/cause-death-1.png" width="864" style="display: block; margin: auto;" />
]

---
class: inverse, center, middle
background-image: url(img/hex_stringr.png)
background-size: 15%
background-position: right 20px bottom 20px

.slide-in-left[
# Dealing with strings
]

.slide-in-right[
## and regular expressions
]

---

# Fruits and vegetables

```r
frvg <- head(sort(c(levels(fruits), levels(vegetables))))
frvg
```

```
[1] "apple"   "banana"  "carrot"  "endive"  "lettuce" "mango"  
```

--

<br>

To get the number of characters* in a string or a factor, use `str_length()`.

```r
str_length(frvg)
```

```
[1] 5 6 6 6 7 5
```

`nchar()` does not work on factors.

.footnote[[\*] Technically, it's the number of *code points*.]

---

# Convert case

```r
str_to_upper(frvg)
```

```
[1] "APPLE"   "BANANA"  "CARROT"  "ENDIVE"  "LETTUCE" "MANGO"  
```

```r
str_to_title(frvg)
```

```
[1] "Apple"   "Banana"  "Carrot"  "Endive"  "Lettuce" "Mango"  
```

<br>

`str_to_lower()` and `str_to_sentence()` are also available.

---

# Pattern matching

When you have a string and a pattern, you can do a lot of fun things:

--

* count the number of occurrences of the pattern,

--

* detect if the pattern is present,

--

* extract* the pattern,

.footnote[[*] You can do it on the first occurrence of the pattern or on all occurrences.]
--

* locate* the position of the pattern,

--

* remove\* or replace\* the pattern,

--

* split according to the pattern...

---

# Count

```r
frvg
```

```
[1] "apple"   "banana"  "carrot"  "endive"  "lettuce" "mango"
```

```r
str_count(string = frvg, pattern = "a")
```

```
[1] 1 3 1 0 0 1
```

--

<br>

This function and the next ones always take `string` and `pattern` as first arguments, and are vectorized over them.

```r
str_count(string = frvg, pattern = c("a", "b", "c", "d", "e", "f"))
```

```
[1] 1 1 1 1 2 0
```

---

# Detect

```r
str_detect(frvg, "e")
```

```
[1]  TRUE FALSE FALSE  TRUE  TRUE FALSE
```

```r
frvg[str_detect(frvg, "e")]
```

```
[1] "apple"   "endive"  "lettuce"
```

---

# Extract

.scroll-output[
```r
str_extract(frvg, "a")
```

```
[1] "a" "a" "a" NA  NA  "a"
```

```r
str_extract_all(frvg, "a")
```

```
[[1]]
[1] "a"

[[2]]
[1] "a" "a" "a"

[[3]]
[1] "a"

[[4]]
character(0)

[[5]]
character(0)

[[6]]
[1] "a"
```
]

---

# Locate

.scroll-output[
```r
str_locate(frvg, "a")
```

```
     start end
[1,]     1   1
[2,]     2   2
[3,]     2   2
[4,]    NA  NA
[5,]    NA  NA
[6,]     2   2
```

```r
str_locate_all(frvg, "a")
```

```
[[1]]
     start end
[1,]     1   1

[[2]]
     start end
[1,]     2   2
[2,]     4   4
[3,]     6   6

[[3]]
     start end
[1,]     2   2

[[4]]
     start end

[[5]]
     start end

[[6]]
     start end
[1,]     2   2
```
]

---

# Remove or replace

```r
str_remove(frvg, "a")
```

```
[1] "pple"    "bnana"   "crrot"   "endive"  "lettuce" "mngo"
```

```r
str_remove_all(frvg, "a")
```

```
[1] "pple"    "bnn"     "crrot"   "endive"  "lettuce" "mngo"
```

```r
str_replace(frvg, "a", replacement = "AAA")
```

```
[1] "AAApple"  "bAAAnana" "cAAArrot" "endive"   "lettuce"  "mAAAngo"
```

```r
str_replace_all(frvg, "a", replacement = "AAA")
```

```
[1] "AAApple"      "bAAAnAAAnAAA" "cAAArrot"     "endive"       "lettuce"
[6] "mAAAngo"
```

---

# Split

```r
str_split(frvg, "n")
```

```
[[1]]
[1] "apple"

[[2]]
[1] "ba" "a"  "a"

[[3]]
[1] "carrot"

[[4]]
[1] "e"    "dive"

[[5]]
[1] "lettuce"

[[6]]
[1] "ma" "go"
```

---

# Regular expressions

A regular expression, or regex, is a sequence of characters that defines a search pattern.

.center[
<img src="img/xkcd_regex.png" height="350">
]

.footnote[
[XKCD](https://xkcd.com/208/)
]

---

# Exact strings

.pull-left[
```r
str_view_all(frvg, "a")
```
]

--

.pull-right[
```r
str_view_all(frvg, "ana")
```
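The same check can be done as plain text. This is only a sketch: `"banana"` is typed in directly, and the lookahead `(?=ana)` is an extra regex feature not covered in these slides.

```r
library(stringr)

# Matches are consumed from left to right and never overlap:
# "banana" contains "ana" at positions 2-4 and 4-6, but only the first is reported
str_count("banana", "ana")
# -> 1
str_locate_all("banana", "ana")

# A zero-width lookahead counts every position where "ana" starts
str_count("banana", "(?=ana)")
# -> 2
```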
]

.footnote[
Matches cannot overlap: when two candidate matches overlap, only the first one is detected.
]

---

# Match any character

The dot `.` matches any character (except a newline).

.pull-left[
```r
str_view_all(frvg, "a.")
```
]

--

.pull-right[
```r
str_view_all(frvg, "e...")
```
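For a plain-text version of the same match, `str_extract()` can be used instead of the widget. A minimal sketch, assuming `frvg` holds the six names shown earlier (it is redefined here so that the chunk runs on its own):

```r
library(stringr)

frvg <- c("apple", "banana", "carrot", "endive", "lettuce", "mango")

# "e" followed by any three characters; NA when no such match exists
# ("apple" ends with "e", so nothing follows it)
str_extract(frvg, "e...")
# -> NA NA NA "endi" "ettu" NA
```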
]

---

# Repeat a match

.pull-left[
You can specify how many times a pattern must match:

* 0 or more times with `*`,
* 1 or more with `+`,
* 0 or 1 time with `?`,
* exactly n with `{n}`,
* n or more with `{n,}`,
* at most m with `{,m}`,
* between n and m with `{n,m}`.
]

--

.pull-right[
```r
str_view_all(frvg, "a.*n")
```
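A few of these quantifiers can be checked as plain text. This is a sketch rather than the original output; `frvg` is redefined with the six names shown earlier so that the chunk is self-contained.

```r
library(stringr)

frvg <- c("apple", "banana", "carrot", "endive", "lettuce", "mango")

# Quantifiers are greedy: `.*` extends to the last "n" it can reach
str_extract(frvg, "a.*n")
# -> NA "anan" NA NA NA "an"

# Exactly two "t" in a row
str_extract(frvg, "t{2}")
# -> NA NA NA NA "tt" NA

# One or more "r"
str_extract(frvg, "r+")
# -> NA NA "rr" NA NA NA
```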
]

---
count: false

# Repeat a match

.pull-left[
```r
str_view_all(frvg, "a.+n")
```
]

--

.pull-right[
```r
str_view_all(frvg, "(an){2}")
```
]

---

# Alternatives

You can use `(a|d)` to match `a` or `d`, and `[a-d]` to match any single character from `a` to `d`.

.pull-left[
```r
str_view_all(frvg, "(m|an)an")
```
]

--

.pull-right[
```r
str_view_all(frvg, "a[p-z]")
```
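Both patterns can also be inspected without the widget. A minimal sketch, assuming `frvg` is the vector defined earlier (redefined here so the chunk runs on its own):

```r
library(stringr)

frvg <- c("apple", "banana", "carrot", "endive", "lettuce", "mango")

# Either "m" or "an", followed by "an"
str_extract(frvg, "(m|an)an")
# -> NA "anan" NA NA NA "man"

# "a" followed by one letter from "p" to "z": "ap" in apple, "ar" in carrot
str_subset(frvg, "a[p-z]")
# -> "apple" "carrot"
```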
]

---

# Anchors

Anchors are useful to match the beginning (`^`) or the end (`$`) of a string.

.pull-left[
```r
str_view_all(frvg, "e$")
```
]

--

.pull-right[
```r
str_view_all(frvg, "^(a|e).*")
```
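To see what an anchor changes, compare the same letter with and without it. A sketch using `str_subset()`, with `frvg` redefined from the six names shown earlier:

```r
library(stringr)

frvg <- c("apple", "banana", "carrot", "endive", "lettuce", "mango")

# "a" anywhere in the string
str_subset(frvg, "a")
# -> "apple" "banana" "carrot" "mango"

# "a" only at the beginning, then only at the end
str_subset(frvg, "^a")
# -> "apple"
str_subset(frvg, "a$")
# -> "banana"
```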
]

---

# Except

Use `[^abc]` to match any character except `a`, `b` or `c`.

.pull-left[
```r
str_view_all(frvg, "^[^e].*")
```
]

--

.pull-right[
```r
str_view_all(frvg, "[^aeiouy]+")
```
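Note that `^` means "start of string" outside brackets and "not" as the first character inside them. A plain-text sketch, assuming `frvg` is the vector defined earlier (redefined here so the chunk is self-contained):

```r
library(stringr)

frvg <- c("apple", "banana", "carrot", "endive", "lettuce", "mango")

# Strings whose first character is not "e"
str_subset(frvg, "^[^e]")
# -> "apple" "banana" "carrot" "lettuce" "mango"

# Runs of characters that are not vowels
str_extract_all("banana", "[^aeiouy]+")[[1]]
# -> "b" "n" "n"

# Removing every non-vowel keeps only the vowels
str_remove_all(frvg, "[^aeiouy]")
# -> "ae" "aaa" "ao" "eie" "eue" "ao"
```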
]

---

# Escape special characters

To match a literal `.`, `$`, `(` or any other character with a special meaning in regex, you need to escape it with two* backslashes: `\\.`, `\\$`, `\\(`...

.footnote[
[*] You need two backslashes because `\` is an escape character both in R strings and in the regex engine to which you're ultimately passing your patterns.
[Replacing Backslashes](https://stackoverflow.com/a/27492072/8031980)
]

.pull-left[
```r
str_view_all(c("abc", "a.c"), "a.c")
```
]

--

.pull-right[
```r
str_view_all(c("abc", "a.c"), "a\\.c")
```
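The difference shows up clearly with `str_detect()`. In this sketch, `x` is a hypothetical name for the two-element vector used above, and `fixed()` is shown as an alternative when no regex feature is needed at all.

```r
library(stringr)

x <- c("abc", "a.c")

# Unescaped dot: matches any character, so both strings match
str_detect(x, "a.c")
# -> TRUE TRUE

# Escaped dot: only the literal "." matches
str_detect(x, "a\\.c")
# -> FALSE TRUE

# The regex engine only receives one backslash
writeLines("a\\.c")

# fixed() compares the pattern literally, without any escaping
str_detect(x, fixed("a.c"))
# -> FALSE TRUE
```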
]

---

# Character classes

Character classes are special patterns that match any character from a set.

* `\s` matches any whitespace,
* `\d` or `[:digit:]` matches any digit,
* `[:punct:]` matches any punctuation,
* `[:alpha:]` matches any letter,
* `[:lower:]` matches any lowercase letter,
* `[:upper:]` matches any uppercase letter.

--

You have already created your own character classes like `[a-d]` or `[^abc]`.

---

# Backreferences

Parentheses can be used to define groups of patterns that can be referred to with backreferences like `\\1`, `\\2`...

.pull-left[
```r
str_view_all(frvg, "^(.).*\\1$")
```
]

--

.pull-right[
```r
str_view_all(frvg, "(.).*(.)\\2.*\\1")
```
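Both patterns can be traced as plain text. A minimal sketch, assuming `frvg` holds the six names shown earlier (redefined here so that the chunk runs on its own):

```r
library(stringr)

frvg <- c("apple", "banana", "carrot", "endive", "lettuce", "mango")

# First and last characters are identical
str_subset(frvg, "^(.).*\\1$")
# -> "endive"

# Some character, later a doubled character, later the first character again:
# in "lettuce", group 1 is "e" and group 2 is "t"
str_extract(frvg, "(.).*(.)\\2.*\\1")
# -> NA NA NA NA "ettuce" NA
```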
]

---

# Regex crossword level 1

.footnote[
[Regex Crossword](https://regexcrossword.com)
]

.pull-left[
.center[<img src="img/rc_1.png" width="300">]
]

--

.pull-right[
.center[<img src="img/rc_1_full.png" width="300">]
]

---

# Regex crossword level 2

.pull-left[
.center[<img src="img/rc_2.png" width="350">]
]

--

.pull-right[
.center[<img src="img/rc_2_full.png" width="350">]
]

---

# Regex crossword level 3

.pull-left[
.center[<img src="img/rc_3.png" width="350">]
]

--

.pull-right[
.center[<img src="img/rc_3_full.png" width="350">]
]

---

# Regex crossword level 4

.pull-left[
.center[<img src="img/rc_4.png" width="350">]
]

--

.pull-right[
.center[<img src="img/rc_4_full.png" width="350">]
]

---

# Regex crossword level 5

.pull-left[
.center[<img src="img/rc_5.png" width="350">]
]

--

.pull-right[
.center[<img src="img/rc_5_full.png" width="350">]
]

---

# Regex crossword level over 9000

.center[<img src="img/rc_9000.png" width="450">]

---

# References

.pull-left[
<br>

* [forcats.tidyverse.org](https://forcats.tidyverse.org)

<br>

* [lubridate.tidyverse.org](https://lubridate.tidyverse.org)

<br>

* [stringr.tidyverse.org](https://stringr.tidyverse.org)
]

.pull-right[
.center[<img src="img/book_r4ds.png" height="350"> <img src="img/book_advr.png" height="350">]
]

---
class: end-slide

# Thanks!

## <a href="mailto:antoine.bichat@mines-nancy.org?subject=SOTR">antoine.bichat@mines-nancy.org</a>

## <a href="https://abichat.github.io" target="_blank">abichat.github.io</a>

## <a href="https://twitter.com/_abichat" target="_blank">@_abichat</a>

## <a href="https://github.com/abichat" target="_blank">@abichat</a>

.pull-right[.blue-logo[.pull-down[ ]]]