Building a complete machine learning pipeline with {tidymodels}

Julie Aubert & Antoine Bichat

MIA Paris Saclay X Servier

Rencontres R 2023 – Avignon

Outline

Introduction
Dataset
Building a model with {parsnip}
Preprocessing data with {recipes}
Evaluating a model with {rsample} and {yardstick}
Tuning model parameters with {tune} or {finetune}
Building one or more workflows with {workflows} and {workflowsets}

Julie Aubert

Research engineer in statistics

Development and application of statistical methods for environmental and life sciences
R, omics, microbial ecology

Antoine Bichat

Data scientist at Servier

Exploratory analyses
Oncology, pediatric cancers
R, packages, applications

A quick plug for our colleagues

21/06 at 14:40, B. Chassagnol, {DeCovarT}
21/06 at 17:00, J. Chiquet, Using Quarto for the Computo journal
22/06 at 09:50, T. Vanrenterghem, {ShinySBM}

Tutorial content

What this tutorial is not

A tutorial on R or the tidyverse
A course on machine learning or statistical inference

What this tutorial is

A tutorial on how to use ML methods within the {tidymodels} ecosystem

Machine learning

Credit: https://apreshill.github.io/tidymodels-it/

The `{tidymodels}` ecosystem

Navigating the ecosystem

Different ways of working

Fit a model only ({parsnip}).
Use a workflow that combines preprocessing and modeling steps ({workflows}).
Tune hyperparameters ({tune}).
Compare several workflows ({workflowsets}).

Advantages

A standardized format, notation, and workflow across algorithms and methods.
Encapsulates the different parts (notably estimation on the train and test sets) in a single object.
Easier preprocessing, model selection, and hyperparameter tuning.
Highly modular: each step corresponds to a package.

Packages and options

# install.packages(c("tidyverse", "tidymodels",      # metapackages
#                    "glmnet", "ranger", "xgboost",  # models
#                    "finetune", "corrr", "vip",     # helpers
#                    "ggforce", "ggrain"))           # dataviz

library(tidyverse) 
library(tidymodels)

theme_set(theme_light())
options(pillar.print_min = 6)

Data

Coffee-tasting dataset from the Coffee Quality Database, compiled by James LeDoux from review pages of the Coffee Quality Institute.

The data_coffee.csv file is available in the GitHub repository abichat/rr23-tuto-tidymodels.

Goal

Predict cupper_points (a score from 0 to 10) from:

aromatic and taste characteristics (aroma, flavor, aftertaste…)
bean characteristics (species, color…)
environmental characteristics (country, altitude…)

Importing the data

coffee_raw <- read_csv("data_coffee.csv")
coffee_raw

# A tibble: 1,339 × 13
  cupper_points aroma flavor aftertaste acidity sweetness species
          <dbl> <dbl>  <dbl>      <dbl>   <dbl>     <dbl> <chr>  
1          8.75  8.67   8.83       8.67    8.75        10 Arabica
2          8.58  8.75   8.67       8.5     8.58        10 Arabica
3          9.25  8.42   8.5        8.42    8.42        10 Arabica
4          8.67  8.17   8.58       8.42    8.42        10 Arabica
5          8.58  8.25   8.5        8.25    8.5         10 Arabica
6          8.33  8.58   8.42       8.42    8.5         10 Arabica
# ℹ 1,333 more rows
# ℹ 6 more variables: country_of_origin <chr>, variety <chr>,
#   processing_method <chr>, color <chr>, altitude <dbl>, unit <chr>

Your turn

Get familiar with the coffee_raw dataset. Are there any outlying observations, or variables that need adapting?

Time allotted: 10 minutes.

coffee_raw %>% 
  select(cupper_points:acidity) %>% 
  pivot_longer(everything()) %>% 
  ggplot() +
  aes(x = value, y = name, fill = name) +
  geom_violin() +
  geom_boxplot(alpha = 0) +
  ggforce::geom_sina(size = 0.5) +
  labs(x = "Score", y = NULL) +
  theme(legend.position = "none")

ggplot(coffee_raw) +
  aes(x = unit, y = altitude, color = unit) +
  ggrain::geom_rain() +
  scale_y_log10() +
  labs(x = "Unit", y = "Altitude") +
  theme(legend.position = "none")

library(corrr)
coffee_raw %>% 
  select(where(is.numeric)) %>% 
  correlate(method = "pearson", use = "complete.obs") %>%
  shave() %>% 
  rplot(print_cor = TRUE)

Cleaning the data

coffee <-
  coffee_raw %>% 
  filter(if_all(cupper_points:acidity, ~ . > 4)) %>% 
  mutate(across(where(is.character), as_factor),
         altitude = if_else(unit == "ft", altitude * 0.3048, altitude),
         altitude = if_else(altitude > 8000, NA, altitude))
coffee

# A tibble: 1,338 × 13
  cupper_points aroma flavor aftertaste acidity sweetness species
          <dbl> <dbl>  <dbl>      <dbl>   <dbl>     <dbl> <fct>  
1          8.75  8.67   8.83       8.67    8.75        10 Arabica
2          8.58  8.75   8.67       8.5     8.58        10 Arabica
3          9.25  8.42   8.5        8.42    8.42        10 Arabica
4          8.67  8.17   8.58       8.42    8.42        10 Arabica
5          8.58  8.25   8.5        8.25    8.5         10 Arabica
6          8.33  8.58   8.42       8.42    8.5         10 Arabica
# ℹ 1,332 more rows
# ℹ 6 more variables: country_of_origin <fct>, variety <fct>,
#   processing_method <fct>, color <fct>, altitude <dbl>, unit <fct>
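The two altitude fixes in the cleaning step can be checked on a tiny made-up table. This is only a sketch: the three rows below are hypothetical values, not taken from the dataset, and NA_real_ is used in place of NA so that the snippet also runs with older versions of {dplyr}.

```r
library(dplyr)

# Hypothetical rows: one in metres, one in feet, one implausibly high
toy <- tibble(altitude = c(1000, 5000, 190164),
              unit     = c("m", "ft", "m"))

toy_clean <- toy |>
  mutate(
    # convert feet to metres, leave metres untouched
    altitude = if_else(unit == "ft", altitude * 0.3048, altitude),
    # treat altitudes above 8000 m as missing
    altitude = if_else(altitude > 8000, NA_real_, altitude)
  )

toy_clean
```

The second row is converted from feet to metres and the third becomes missing, which mirrors what happens to the altitude column of coffee.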

Specifying a model with `{parsnip}`

Credit: Allison Horst

A model specification combines:

A model (rand_forest(), linear_reg()…)
An engine (ranger, randomForest…)
A mode (regression, classification…)
Hyperparameters (trees, penalty…)
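As a minimal sketch, the four components listed above map directly onto a {parsnip} specification. The second form, built with set_engine() and set_mode(), is an equivalent alternative that these slides do not use; the value trees = 500 is an arbitrary choice for illustration.

```r
library(parsnip)

# Everything passed to the constructor, as in the rest of the slides
spec_a <- rand_forest(mode = "regression", engine = "ranger", trees = 500)

# Same specification, built step by step
spec_b <- rand_forest(trees = 500) |>
  set_engine("ranger") |>
  set_mode("regression")

spec_a$engine  # engine stored in the specification
spec_b$mode    # mode stored in the specification
```

Nothing is fitted at this point: a specification only describes the model, and the engine package is needed once fit() is called.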

All models

https://www.tidymodels.org/find/parsnip/

What can you do with `{parsnip}`?

Creating the model

linear_reg(mode = "regression", engine = "lm")

Linear Regression Model Specification (regression)

Computational engine: lm

What can you do with `{parsnip}`?

Fitting the model

linear_reg(mode = "regression", engine = "lm") %>% 
  fit(cupper_points ~ aroma + flavor + species, data = coffee)

parsnip model object


Call:
stats::lm(formula = cupper_points ~ aroma + flavor + species, 
    data = data)

Coefficients:
   (Intercept)           aroma          flavor  speciesRobusta  
        0.2299          0.1467          0.8192          0.1508

What can you do with `{parsnip}`?

Prediction

linear_reg(mode = "regression", engine = "lm") %>% 
  fit(cupper_points ~ aroma + flavor + species, data = coffee) %>% 
  predict(coffee)

# A tibble: 1,338 × 1
  .pred
  <dbl>
1  8.74
2  8.62
3  8.43
4  8.46
5  8.40
6  8.39
# ℹ 1,332 more rows

What can you do with `{parsnip}`?

Summary statistics and coefficient tests

linear_reg(mode = "regression", engine = "lm") %>% 
  fit(cupper_points ~ aroma + flavor + species, data = coffee) %>% 
  extract_fit_engine() %>% # the underlying lm object must be extracted first
  summary()


Call:
stats::lm(formula = cupper_points ~ aroma + flavor + species, 
    data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.61088 -0.12361 -0.00840  0.09759  2.94352 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     0.22989    0.19437   1.183  0.23712    
aroma           0.14671    0.03684   3.982 7.19e-05 ***
flavor          0.81916    0.03406  24.048  < 2e-16 ***
speciesRobusta  0.15077    0.05469   2.757  0.00592 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2858 on 1334 degrees of freedom
Multiple R-squared:  0.5525,    Adjusted R-squared:  0.5515 
F-statistic: 549.1 on 3 and 1334 DF,  p-value: < 2.2e-16

What can you do with `{parsnip}`?

Coefficient table in tidy format

linear_reg(mode = "regression", engine = "lm") %>% 
  fit(cupper_points ~ aroma + flavor + species, data = coffee) %>% 
  # extract_fit_engine() %>% # not needed here
  tidy()

# A tibble: 4 × 5
  term           estimate std.error statistic   p.value
  <chr>             <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)       0.230    0.194       1.18 2.37e-  1
2 aroma             0.147    0.0368      3.98 7.19e-  5
3 flavor            0.819    0.0341     24.0  1.91e-106
4 speciesRobusta    0.151    0.0547      2.76 5.92e-  3

What can you do with `{parsnip}`?

Variable importance

linear_reg(mode = "regression", engine = "lm") %>% 
  fit(cupper_points ~ aroma + flavor + species, data = coffee) %>% 
  vip::vip()

Switching models

Linear regression
Random forest
XGBoost
Elastic net

linear_reg(mode = "regression", engine = "lm") %>% 
  fit(cupper_points ~ aroma + flavor + species, data = coffee) %>% 
  predict(coffee)

# A tibble: 1,338 × 1
  .pred
  <dbl>
1  8.74
2  8.62
3  8.43
4  8.46
5  8.40
6  8.39
# ℹ 1,332 more rows

rand_forest(mode = "regression", engine = "ranger") %>% 
  fit(cupper_points ~ aroma + flavor + species, data = coffee) %>% 
  predict(coffee)

# A tibble: 1,338 × 1
  .pred
  <dbl>
1  8.22
2  8.21
3  8.19
4  8.16
5  8.17
6  8.11
# ℹ 1,332 more rows

boost_tree(mode = "regression", engine = "xgboost") %>%
  fit(cupper_points ~ aroma + flavor + species, data = coffee) %>%
  predict(coffee)

# A tibble: 1,338 × 1
  .pred
  <dbl>
1  8.51
2  8.51
3  8.58
4  8.52
5  8.52
6  8.32
# ℹ 1,332 more rows

linear_reg(mode = "regression", engine = "glmnet", 
           penalty = 0.1, mixture = 0.5) %>% 
  fit(cupper_points ~ aroma + flavor + species, data = coffee) %>% 
  predict(coffee)

# A tibble: 1,338 × 1
  .pred
  <dbl>
1  8.46
2  8.37
3  8.22
4  8.23
5  8.20
6  8.20
# ℹ 1,332 more rows

Resampling with `{rsample}`

Main purpose: avoiding overfitting.

Used here to evaluate model performance on a hold-out set.

Different types of resampling and their associated object classes:

class rsplit for an individual resample
class rset for a collection of resamples

Typical scheme

Credit: Feature Engineering and Selection, Max Kuhn and Kjell Johnson

For an rset, we speak of analysis and assessment sets rather than training and testing sets.
No modified copies of the data are made.
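The vocabulary above can be illustrated with a minimal sketch. It uses the built-in mtcars data rather than the coffee data, and the seed, the proportion and the number of folds are arbitrary choices.

```r
library(rsample)

set.seed(1)
split <- initial_split(mtcars, prop = 3/4)  # a single rsplit object
train <- training(split)
test  <- testing(split)

folds <- vfold_cv(train, v = 4)             # an rset: one rsplit per fold
first <- folds$splits[[1]]

nrow(analysis(first))    # rows used for fitting within this fold
nrow(assessment(first))  # rows held out for evaluation within this fold
```

Within each fold, the analysis and assessment sets together make up the training set; the test set is left untouched until the final evaluation.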

Spending the data budget

set.seed(123)
cf_split <- initial_split(coffee, strata = "species", prop = 3/4)
cf_split

<Training/Testing/Total>
<1003/335/1338>

Training and test sets

Training
Test

cf_train <- training(cf_split)
cf_train

# A tibble: 1,003 × 13
  cupper_points aroma flavor aftertaste acidity sweetness species
          <dbl> <dbl>  <dbl>      <dbl>   <dbl>     <dbl> <fct>  
1          8.58  8.75   8.67       8.5     8.58     10    Arabica
2          9.25  8.42   8.5        8.42    8.42     10    Arabica
3          8.67  8.17   8.58       8.42    8.42     10    Arabica
4          8.58  8.25   8.5        8.25    8.5      10    Arabica
5          8.33  8.58   8.42       8.42    8.5      10    Arabica
6          9     8.25   8.33       8.5     8.42      9.33 Arabica
# ℹ 997 more rows
# ℹ 6 more variables: country_of_origin <fct>, variety <fct>,
#   processing_method <fct>, color <fct>, altitude <dbl>, unit <fct>

cf_test <- testing(cf_split)
cf_test

# A tibble: 335 × 13
  cupper_points aroma flavor aftertaste acidity sweetness species
          <dbl> <dbl>  <dbl>      <dbl>   <dbl>     <dbl> <fct>  
1          8.75  8.67   8.83       8.67    8.75     10    Arabica
2          8.5   8.42   8.5        8.33    8.5      10    Arabica
3          8.58  8.33   8.42       8.08    8.25     10    Arabica
4          8.5   8.25   8.33       8.5     8.25      9.33 Arabica
5          8.17  8      8.25       8.08    8.5      10    Arabica
6          8.33  8.08   8.25       8       8.17     10    Arabica
# ℹ 329 more rows
# ℹ 6 more variables: country_of_origin <fct>, variety <fct>,
#   processing_method <fct>, color <fct>, altitude <dbl>, unit <fct>

Cross-validation data

set.seed(234)
cf_cv <- vfold_cv(cf_train, v = 10, repeats = 1) 
cf_cv

#  10-fold cross-validation 
# A tibble: 10 × 2
   splits            id    
   <list>            <chr> 
 1 <split [902/101]> Fold01
 2 <split [902/101]> Fold02
 3 <split [902/101]> Fold03
 4 <split [903/100]> Fold04
 5 <split [903/100]> Fold05
 6 <split [903/100]> Fold06
 7 <split [903/100]> Fold07
 8 <split [903/100]> Fold08
 9 <split [903/100]> Fold09
10 <split [903/100]> Fold10

Cross-validation data

first_resample <- cf_cv$splits[[1]]
analysis(first_resample) # first analysis set, used for fitting

# A tibble: 902 × 13
  cupper_points aroma flavor aftertaste acidity sweetness species
          <dbl> <dbl>  <dbl>      <dbl>   <dbl>     <dbl> <fct>  
1          8.58  8.75   8.67       8.5     8.58     10    Arabica
2          8.67  8.17   8.58       8.42    8.42     10    Arabica
3          8.58  8.25   8.5        8.25    8.5      10    Arabica
4          9     8.25   8.33       8.5     8.42      9.33 Arabica
5          8.67  8.67   8.67       8.58    8.42      9.33 Arabica
6          8.5   8.08   8.58       8.5     8.5      10    Arabica
# ℹ 896 more rows
# ℹ 6 more variables: country_of_origin <fct>, variety <fct>,
#   processing_method <fct>, color <fct>, altitude <dbl>, unit <fct>

assessment(first_resample) # complementary assessment set, used for evaluation

# A tibble: 101 × 13
  cupper_points aroma flavor aftertaste acidity sweetness species
          <dbl> <dbl>  <dbl>      <dbl>   <dbl>     <dbl> <fct>  
1          9.25  8.42   8.5        8.42    8.42        10 Arabica
2          8.33  8.58   8.42       8.42    8.5         10 Arabica
3          8.42  8.17   7.83       8       8.08        10 Arabica
4          8     8      8          8       8.08        10 Arabica
5          7.92  7.83   8          8       7.75        10 Arabica
6          8.08  7.75   7.83       7.83    8.17        10 Arabica
# ℹ 95 more rows
# ℹ 6 more variables: country_of_origin <fct>, variety <fct>,
#   processing_method <fct>, color <fct>, altitude <dbl>, unit <fct>

Preprocessing with `{recipes}`

Credit: Allison Horst

Handle missing data, errors, and outliers.
Create new variables by transforming or combining existing ones.
Normalize or re-encode existing variables.
All in an order defined by step_*() functions.
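One point worth making explicit is that a recipe learns its parameters once, on the training data, and reuses them on any new data. The sketch below shows this with step_normalize() on the built-in mtcars data; splitting the rows into the first 24 and the last 8 is an arbitrary choice for illustration.

```r
library(recipes)

train <- mtcars[1:24, ]
new   <- mtcars[25:32, ]

rec <- recipe(mpg ~ ., data = train) |>
  step_normalize(all_numeric_predictors()) |>
  prep()                       # means and sds are estimated on `train`

baked_new <- bake(rec, new_data = new)

# The new rows are scaled with the *training* mean and sd, not their own
manual <- (new$hp - mean(train$hp)) / sd(train$hp)
all.equal(baked_new$hp, manual)
```

This is why the recipe is prepared on the training set only: information from the test set never leaks into the preprocessing.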

All recipe steps

https://www.tidymodels.org/find/recipes/

Preprocessing the data

Initializing the recipe: a formula and the training set.

recipe(cupper_points ~ ., data = cf_train)

── Recipe ──────────────────────────────────────────────────────────────────────

── Inputs

Number of variables by role

outcome:    1
predictor: 12

Preprocessing numeric data

Adding the steps.

recipe(cupper_points ~ ., data = cf_train) %>% 
  step_normalize(all_numeric_predictors()) # center and scale

── Recipe ──────────────────────────────────────────────────────────────────────

── Inputs

Number of variables by role

outcome:    1
predictor: 12

── Operations

• Centering and scaling for: all_numeric_predictors()

Preprocessing numeric data

Estimating the preprocessing parameters.

recipe(cupper_points ~ ., data = cf_train) %>% 
  step_normalize(all_numeric_predictors()) %>% 
  prep()

── Recipe ──────────────────────────────────────────────────────────────────────

── Inputs

Number of variables by role

outcome:    1
predictor: 12

── Training information

Training data contained 1003 data points and 326 incomplete rows.

── Operations

• Centering and scaling for: aroma, flavor, aftertaste, acidity, ... | Trained

Preprocessing numeric data

Applying the recipe to cf_train.

recipe(cupper_points ~ ., data = cf_train) %>% 
  step_normalize(all_numeric_predictors()) %>% 
  prep() %>% 
  bake(new_data = NULL)

# A tibble: 1,003 × 13
  aroma flavor aftertaste acidity sweetness species country_of_origin variety
  <dbl>  <dbl>      <dbl>   <dbl>     <dbl> <fct>   <fct>             <fct>  
1  3.74   3.44       3.17    3.26     0.259 Arabica Ethiopia          Other  
2  2.70   2.93       2.94    2.75     0.259 Arabica Guatemala         Bourbon
3  1.90   3.17       2.94    2.75     0.259 Arabica Ethiopia          <NA>   
4  2.16   2.93       2.44    3.00     0.259 Arabica Ethiopia          Other  
5  3.20   2.68       2.94    3.00     0.259 Arabica Brazil            <NA>   
6  2.16   2.41       3.17    2.75    -1.05  Arabica Ethiopia          <NA>   
# ℹ 997 more rows
# ℹ 5 more variables: processing_method <fct>, color <fct>, altitude <dbl>,
#   unit <fct>, cupper_points <dbl>

Preprocessing numeric data

Checking that the data are centered and scaled.

recipe(cupper_points ~ ., data = cf_train) %>% 
  step_normalize(all_numeric_predictors()) %>% 
  prep() %>% 
  bake(new_data = NULL) %>% 
  summarise(across(c(aroma, flavor, aftertaste), 
                   list(mean = mean, sd = sd)))

# A tibble: 1 × 6
  aroma_mean aroma_sd flavor_mean flavor_sd aftertaste_mean aftertaste_sd
       <dbl>    <dbl>       <dbl>     <dbl>           <dbl>         <dbl>
1  -5.94e-16        1   -1.08e-16         1       -6.80e-16             1

Preprocessing categorical data

recipe(cupper_points ~ ., data = cf_train) %>% 
  step_unknown(all_nominal_predictors()) %>% # replaces NA with an "unknown" level
  step_dummy(all_nominal_predictors()) %>% # mutually exclusive binary (dummy) variables
  prep()

── Recipe ──────────────────────────────────────────────────────────────────────

── Inputs

Number of variables by role

outcome:    1
predictor: 12

── Training information

Training data contained 1003 data points and 326 incomplete rows.

── Operations

• Unknown factor level assignment for: species, ... | Trained

• Dummy variables from: species, country_of_origin, variety, ... | Trained

Preprocessing categorical data

recipe(cupper_points ~ ., data = cf_train) %>% 
  step_unknown(all_nominal_predictors()) %>% 
  step_dummy(all_nominal_predictors()) %>% 
  prep() %>% 
  bake(new_data = NULL) %>% 
  select(starts_with(c("species", "color")))

# A tibble: 1,003 × 6
  species_Robusta species_unknown color_Bluish.Green color_None color_Blue.Green
            <dbl>           <dbl>              <dbl>      <dbl>            <dbl>
1               0               0                  0          0                0
2               0               0                  0          0                0
3               0               0                  0          0                0
4               0               0                  0          0                0
5               0               0                  1          0                0
6               0               0                  0          0                0
# ℹ 997 more rows
# ℹ 1 more variable: color_unknown <dbl>

Your turn

Using the steps available in {recipes} (https://recipes.tidymodels.org/reference), work out a suitable preprocessing for cf_train.

Time allotted: 7 minutes.

Solution

Definition
Overview
Estimation
Processing

cf_rec <-
  recipe(cupper_points ~ ., data = cf_train) %>% 
  update_role(unit, new_role = "notused") %>% 
  step_unknown(variety, processing_method, country_of_origin,
               color, new_level = "unknown") %>%
  step_other(country_of_origin, threshold = 0.01) %>%
  step_other(processing_method, variety, threshold = 0.1) %>%
  step_impute_linear(altitude, 
                     impute_with = imp_vars(country_of_origin)) %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_impute_median(all_numeric_predictors()) %>% 
  step_normalize(all_numeric_predictors())

cf_rec

── Recipe ──────────────────────────────────────────────────────────────────────

── Inputs

Number of variables by role

outcome:    1
predictor: 11
notused:    1

── Operations

• Unknown factor level assignment for: variety, processing_method, ...

• Collapsing factor levels for: country_of_origin

• Collapsing factor levels for: processing_method, variety

• Linear regression imputation for: altitude

• Dummy variables from: all_nominal_predictors()

• Median imputation for: all_numeric_predictors()

• Centering and scaling for: all_numeric_predictors()

prep(cf_rec)

── Recipe ──────────────────────────────────────────────────────────────────────

── Inputs

Number of variables by role

outcome:    1
predictor: 11
notused:    1

── Training information

Training data contained 1003 data points and 326 incomplete rows.

── Operations

• Unknown factor level assignment for: variety, ... | Trained

• Collapsing factor levels for: country_of_origin | Trained

• Collapsing factor levels for: processing_method, variety | Trained

• Linear regression imputation for: altitude | Trained

• Dummy variables from: species, country_of_origin, variety, ... | Trained

• Median imputation for: aroma, flavor, aftertaste, acidity, ... | Trained

• Centering and scaling for: aroma, flavor, aftertaste, acidity, ... | Trained

cf_rec %>% 
  prep() %>% 
  bake(new_data = NULL)

# A tibble: 1,003 × 37
  aroma flavor aftertaste acidity sweetness altitude unit  cupper_points
  <dbl>  <dbl>      <dbl>   <dbl>     <dbl>    <dbl> <fct>         <dbl>
1  3.74   3.44       3.17    3.26     0.259    1.58  m              8.58
2  2.70   2.93       2.94    2.75     0.259    0.924 m              9.25
3  1.90   3.17       2.94    2.75     0.259    1.45  m              8.67
4  2.16   2.93       2.44    3.00     0.259    1.58  m              8.58
5  3.20   2.68       2.94    3.00     0.259   -0.431 m              8.33
6  2.16   2.41       3.17    2.75    -1.05     0.811 m              9   
# ℹ 997 more rows
# ℹ 29 more variables: species_Robusta <dbl>,
#   country_of_origin_Guatemala <dbl>, country_of_origin_Brazil <dbl>,
#   country_of_origin_United.States..Hawaii. <dbl>,
#   country_of_origin_Indonesia <dbl>, country_of_origin_China <dbl>,
#   country_of_origin_Costa.Rica <dbl>, country_of_origin_Mexico <dbl>,
#   country_of_origin_Uganda <dbl>, country_of_origin_Honduras <dbl>, …

Assembling everything in a workflow

Simplify the process by bundling the model and the recipe together.

A single object to handle across the different steps:

estimating the preprocessing parameters on the training set,
estimating the model parameters on the training set,
applying the preprocessing to the test set,
predicting and evaluating the model on the test set,
and more when cross-validation is used.

Evaluating a workflow with `{yardstick}`

A set of functions for estimating model quality.

input: a data frame, the column of true values, and the column of predictions,
output: a data frame with the requested metrics.
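The input and output convention described above can be seen on a small example. This is a sketch on made-up numbers: the data frame toy and its columns obs and pred are hypothetical, and the choice of rmse, mae and rsq is only one possible set of regression metrics.

```r
library(yardstick)

# Hypothetical observed and predicted scores
toy <- data.frame(obs  = c(8.0, 7.5, 8.5, 7.0),
                  pred = c(8.2, 7.4, 8.1, 7.3))

# One metric: data frame in, one-row data frame out
rmse(toy, truth = obs, estimate = pred)

# Several metrics at once with metric_set()
reg_metrics <- metric_set(rmse, mae, rsq)
res <- reg_metrics(toy, truth = obs, estimate = pred)
res
```

Because the result is always a data frame with the columns .metric, .estimator and .estimate, it can be filtered, bound or plotted like any other table.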

https://yardstick.tidymodels.org/reference/

Using the workflow

workflow(preprocessor = cf_rec, 
         spec = linear_reg())

══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
7 Recipe Steps

• step_unknown()
• step_other()
• step_other()
• step_impute_linear()
• step_dummy()
• step_impute_median()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)

Computational engine: lm

Using the workflow

workflow(preprocessor = cf_rec, 
         spec = linear_reg()) %>% 
  fit(cf_train)

══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
7 Recipe Steps

• step_unknown()
• step_other()
• step_other()
• step_impute_linear()
• step_dummy()
• step_impute_median()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────

Call:
stats::lm(formula = ..y ~ ., data = data)

Coefficients:
                                   (Intercept)  
                                      7.508674  
                                         aroma  
                                      0.011873  
                                        flavor  
                                      0.163501  
                                    aftertaste  
                                      0.118719  
                                       acidity  
                                      0.039276  
                                     sweetness  
                                      0.010274  
                                      altitude  
                                     -0.008988  
                               species_Robusta  
                                     -0.006244  
                   country_of_origin_Guatemala  
                                     -0.034174  
                      country_of_origin_Brazil  
                                     -0.014097  
      country_of_origin_United.States..Hawaii.  
                                     -0.028764  
                   country_of_origin_Indonesia  
                                     -0.024927  
                       country_of_origin_China  
                                      0.003308  
                  country_of_origin_Costa.Rica  
                                     -0.004014  
                      country_of_origin_Mexico  
                                     -0.004706  
                      country_of_origin_Uganda  
                                     -0.006009  
                    country_of_origin_Honduras  
                                     -0.013790  
                      country_of_origin_Taiwan  
                                      0.002220  
                   country_of_origin_Nicaragua  
                                     -0.022671  
country_of_origin_Tanzania..United.Republic.Of  
                                     -0.001615  
                       country_of_origin_Kenya  
                                     -0.021102  
                    country_of_origin_Thailand  
                                     -0.013158  
                    country_of_origin_Colombia  

...
and 28 more lines.

Using the workflow

workflow(preprocessor = cf_rec, 
         spec = linear_reg()) %>% 
  fit(cf_train) %>% 
  predict(cf_train)

# A tibble: 1,003 × 1
  .pred
  <dbl>
1  8.63
2  8.52
3  8.55
4  8.43
5  8.53
6  8.56
# ℹ 997 more rows

Using the workflow

workflow(preprocessor = cf_rec, 
         spec = linear_reg()) %>% 
  fit(cf_train) %>% 
  predict(cf_test)

# A tibble: 335 × 1
  .pred
  <dbl>
1  8.78
2  8.55
3  8.35
4  8.54
5  8.29
6  8.21
# ℹ 329 more rows

Using the workflow

workflow(preprocessor = cf_rec, 
         spec = linear_reg()) %>% 
  fit(cf_train) %>% 
  predict(cf_test) %>% 
  bind_cols(cf_test)

# A tibble: 335 × 14
  .pred cupper_points aroma flavor aftertaste acidity sweetness species
  <dbl>         <dbl> <dbl>  <dbl>      <dbl>   <dbl>     <dbl> <fct>  
1  8.78          8.75  8.67   8.83       8.67    8.75     10    Arabica
2  8.55          8.5   8.42   8.5        8.33    8.5      10    Arabica
3  8.35          8.58  8.33   8.42       8.08    8.25     10    Arabica
4  8.54          8.5   8.25   8.33       8.5     8.25      9.33 Arabica
5  8.29          8.17  8      8.25       8.08    8.5      10    Arabica
6  8.21          8.33  8.08   8.25       8       8.17     10    Arabica
# ℹ 329 more rows
# ℹ 6 more variables: country_of_origin <fct>, variety <fct>,
#   processing_method <fct>, color <fct>, altitude <dbl>, unit <fct>

Using the workflow

workflow(preprocessor = cf_rec, 
         spec = linear_reg()) %>% 
  fit(cf_train) %>% 
  predict(cf_test) %>% 
  bind_cols(cf_test) %>% 
  rmse(truth = cupper_points, estimate = .pred)

# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       0.254

Your turn

Using the tune::last_fit() function, estimate the RMSE of a random forest model and visualize the correlation between the observed and predicted cupper_points on the test data.

Time allotted: 7 minutes.

Solution

Random forest
RMSE
Visualization

cf_lf_rf <-
  workflow(preprocessor = cf_rec, 
           spec = rand_forest(mode = "regression")) %>% 
  last_fit(cf_split)
cf_lf_rf

# Resampling results
# Manual resampling 
# A tibble: 1 × 6
  splits             id               .metrics .notes   .predictions .workflow 
  <list>             <chr>            <list>   <list>   <list>       <list>    
1 <split [1003/335]> train/test split <tibble> <tibble> <tibble>     <workflow>

collect_metrics(cf_lf_rf)

# A tibble: 2 × 4
  .metric .estimator .estimate .config             
  <chr>   <chr>          <dbl> <chr>               
1 rmse    standard       0.231 Preprocessor1_Model1
2 rsq     standard       0.698 Preprocessor1_Model1

cf_lf_rf %>% 
  collect_predictions() %>% 
  ggplot() +
  aes(x = cupper_points, y = .pred) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "grey") +
  geom_point()

Using a workflow for prediction

Build the model.
Create a preprocessing recipe.
Combine the model and the recipe in a workflow.
Train the workflow with a call to fit().
Use the trained workflow to predict on unseen data with predict().

Alternatively, the last two steps can be done in one call: last_fit() trains the workflow on the training set and predicts on the test set.
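The steps above can be sketched end to end on a small example. It uses the built-in mtcars data instead of the coffee data, and the formula mpg ~ wt + hp, the seed and the split proportion are arbitrary choices made for illustration.

```r
library(tidymodels)

set.seed(1)
split <- initial_split(mtcars, prop = 3/4)

# 1. model
spec <- linear_reg(mode = "regression", engine = "lm")

# 2. preprocessing recipe, declared on the training set
rec <- recipe(mpg ~ wt + hp, data = training(split)) |>
  step_normalize(all_numeric_predictors())

# 3. workflow bundling both
wkf <- workflow(preprocessor = rec, spec = spec)

# 4. and 5. train on the training set, then predict on unseen data
preds <- wkf |>
  fit(training(split)) |>
  predict(testing(split))

# Same thing in one call, with metrics computed on the test set
lf <- last_fit(wkf, split)
collect_metrics(lf)
```

The fit() and predict() route gives full control over each step, while last_fit() is convenient for the final evaluation once the workflow is settled.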

Tuning hyperparameters with `{tune}`

Some preprocessing steps and models require choosing hyperparameters:

penalty and mixture for linear_reg()
trees, mtry and min_n for rand_forest()
threshold for step_other()
…

How to choose hyperparameters?

rand_forest(mode = "regression", trees = 500, mtry = 5, min_n = 5)
rand_forest(mode = "regression", trees = 1000, mtry = 3, min_n = 10)
rand_forest(mode = "regression", trees = tune(), mtry = tune(), min_n = tune())

How to choose hyperparameters?

rf_tune <-
  rand_forest(mode = "regression", engine = "ranger",
              trees = 500, mtry = tune(), min_n = tune())

wkf_rf_tune <- workflow(preprocessor = cf_rec, spec = rf_tune) 
wkf_rf_tune

══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()

── Preprocessor ────────────────────────────────────────────────────────────────
7 Recipe Steps

• step_unknown()
• step_other()
• step_other()
• step_impute_linear()
• step_dummy()
• step_impute_median()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (regression)

Main Arguments:
  mtry = tune()
  trees = 500
  min_n = tune()

Computational engine: ranger

How to choose hyperparameters?

set.seed(345)
res_tune <- tune_grid(wkf_rf_tune, cf_cv, grid = 15, 
                      control = control_grid(verbose = FALSE))
res_tune

# Tuning results
# 10-fold cross-validation 
# A tibble: 10 × 4
   splits            id     .metrics          .notes          
   <list>            <chr>  <list>            <list>          
 1 <split [902/101]> Fold01 <tibble [30 × 6]> <tibble [0 × 3]>
 2 <split [902/101]> Fold02 <tibble [30 × 6]> <tibble [0 × 3]>
 3 <split [902/101]> Fold03 <tibble [30 × 6]> <tibble [0 × 3]>
 4 <split [903/100]> Fold04 <tibble [30 × 6]> <tibble [0 × 3]>
 5 <split [903/100]> Fold05 <tibble [30 × 6]> <tibble [0 × 3]>
 6 <split [903/100]> Fold06 <tibble [30 × 6]> <tibble [0 × 3]>
 7 <split [903/100]> Fold07 <tibble [30 × 6]> <tibble [0 × 3]>
 8 <split [903/100]> Fold08 <tibble [0 × 6]>  <tibble [1 × 3]>
 9 <split [903/100]> Fold09 <tibble [30 × 6]> <tibble [0 × 3]>
10 <split [903/100]> Fold10 <tibble [30 × 6]> <tibble [0 × 3]>

There were issues with some computations:

  - Error(s) x1: Error in `step_impute_linear()`: Caused by error in `model.frame....

Run `show_notes(.Last.tune.result)` for more information.
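Passing an integer such as grid = 15 lets tune_grid() generate the candidate combinations itself. When specific values need to be compared, a data frame of candidates can be supplied instead. The sketch below is a self-contained illustration on mtcars rather than on the coffee data; the object names (folds, rf_grid, res_grid) and the candidate values are assumptions chosen for the example.

```r
library(tidymodels)

set.seed(123)
folds <- vfold_cv(mtcars, v = 5)

rf_spec <- rand_forest(mode = "regression", engine = "ranger",
                       trees = 200, mtry = tune(), min_n = tune())
wkf <- workflow(preprocessor = recipe(mpg ~ ., data = mtcars), spec = rf_spec)

# One row per candidate: 3 values of mtry x 2 values of min_n = 6 combinations
rf_grid <- tidyr::crossing(mtry = c(2, 4, 6), min_n = c(2, 5))

res_grid <- tune_grid(wkf, resamples = folds, grid = rf_grid)

# One row per combination and per metric
collect_metrics(res_grid)
```

An explicit grid makes the comparison reproducible across runs, at the cost of choosing the candidate values by hand.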

How to choose hyperparameters?

autoplot(res_tune)

How to choose hyperparameters?

collect_metrics(res_tune) %>%
  filter(.metric == "rmse") %>%
  ggplot() +
  aes(x = mtry, y = min_n, color = mean, size = mean) +
  geom_point()

How to choose hyperparameters?

show_best(res_tune, metric = "rmse")

# A tibble: 5 × 8
   mtry min_n .metric .estimator  mean     n std_err .config              
  <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
1     9    11 rmse    standard   0.259     9  0.0327 Preprocessor1_Model12
2    11    12 rmse    standard   0.261     9  0.0333 Preprocessor1_Model13
3     6    24 rmse    standard   0.262     9  0.0334 Preprocessor1_Model07
4    16    19 rmse    standard   0.268     9  0.0339 Preprocessor1_Model08
5    21     5 rmse    standard   0.268     9  0.0321 Preprocessor1_Model15

How to choose hyperparameters?

param_rf <- select_best(res_tune, metric = "rmse")
param_rf

# A tibble: 1 × 3
   mtry min_n .config              
  <int> <int> <chr>                
1     9    11 Preprocessor1_Model12

How to choose hyperparameters?

wkf_rf_tune

══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()

── Preprocessor ────────────────────────────────────────────────────────────────
7 Recipe Steps

• step_unknown()
• step_other()
• step_other()
• step_impute_linear()
• step_dummy()
• step_impute_median()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (regression)

Main Arguments:
  mtry = tune()
  trees = 500
  min_n = tune()

Computational engine: ranger

How to choose hyperparameters?

wkf_rf_tune %>%
  finalize_workflow(param_rf)

══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()

── Preprocessor ────────────────────────────────────────────────────────────────
7 Recipe Steps

• step_unknown()
• step_other()
• step_other()
• step_impute_linear()
• step_dummy()
• step_impute_median()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (regression)

Main Arguments:
  mtry = 9
  trees = 500
  min_n = 11

Computational engine: ranger

How to choose hyperparameters?

wkf_rf_tune %>%
  finalize_workflow(param_rf) %>%
  last_fit(cf_split) %>% 
  collect_metrics()

# A tibble: 2 × 4
  .metric .estimator .estimate .config             
  <chr>   <chr>          <dbl> <chr>               
1 rmse    standard       0.235 Preprocessor1_Model1
2 rsq     standard       0.684 Preprocessor1_Model1

Your turn

Using finetune::tune_race_anova(), tune the hyperparameters of an “elastic net” regularized regression.

Solution


library(finetune)
wkf_en_tune <- 
  workflow(preprocessor = cf_rec, 
           spec = linear_reg(penalty = tune(), mixture = tune(),
                             engine = "glmnet")) 
set.seed(456)
res_race <- tune_race_anova(wkf_en_tune, resamples = cf_cv, grid = 10,
                            control = control_race(verbose = FALSE,
                                                   verbose_elim = FALSE))

res_race

# Tuning results
# 10-fold cross-validation 
# A tibble: 25 × 5
  splits            id     .order .metrics          .notes          
  <list>            <chr>   <int> <list>            <list>          
1 <split [902/101]> Fold01      2 <tibble [20 × 6]> <tibble [0 × 3]>
2 <split [902/101]> Fold02      3 <tibble [20 × 6]> <tibble [0 × 3]>
3 <split [903/100]> Fold07      1 <tibble [20 × 6]> <tibble [0 × 3]>
4 <split [903/100]> Fold09      4 <tibble [18 × 6]> <tibble [0 × 3]>
5 <split [902/101]> Fold03      8 <tibble [2 × 6]>  <tibble [1 × 3]>
6 <split [903/100]> Fold04     10 <tibble [2 × 6]>  <tibble [0 × 3]>
# ℹ 19 more rows

There were issues with some computations:

  - Error(s) x1: Error in `step_impute_linear()`: Caused by error in `model.frame....

Run `show_notes(.Last.tune.result)` for more information.

plot_race(res_race) # + facet_wrap(~ .config)

wkf_en_tune %>% 
  finalize_workflow(select_best(res_race, metric = "rmse")) %>% 
  last_fit(cf_split) %>% 
  collect_metrics()

# A tibble: 2 × 4
  .metric .estimator .estimate .config             
  <chr>   <chr>          <dbl> <chr>               
1 rmse    standard       0.249 Preprocessor1_Model1
2 rsq     standard       0.646 Preprocessor1_Model1

wkf_en_tune %>% 
  finalize_workflow(select_best(res_race, metric = "rmse")) %>% 
  last_fit(cf_split) %>% 
  extract_fit_engine() %>% 
  vip::vip(mapping = aes(fill = Sign))

Use your workflow to tune hyperparameters

Create a workflow with parameters to tune in the model and/or the recipe.
Train and evaluate the model on the cross-validation analysis/assessment sets with tune_grid() or an equivalent.
Retrieve the best combination of hyperparameters with select_best() or an equivalent.
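The steps above can be sketched end to end with a parameter tuned in the recipe rather than in the model. This is a minimal illustration under stated assumptions: mtcars replaces the coffee data, the number of PCA components is the tuned parameter, and every object name (car_split, pca_rec, wkf_pca, res_pca, best_pca, final_fit) is hypothetical.

```r
library(tidymodels)

set.seed(42)
car_split <- initial_split(mtcars)
car_train <- training(car_split)
car_cv    <- vfold_cv(car_train, v = 5)

# 1. Workflow with a parameter to tune in the recipe (number of PCA components)
pca_rec <- recipe(mpg ~ ., data = car_train) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_pca(all_numeric_predictors(), num_comp = tune())
wkf_pca <- workflow(preprocessor = pca_rec, spec = linear_reg())

# 2. Train and evaluate each candidate on the cross-validation folds
res_pca <- tune_grid(wkf_pca, resamples = car_cv,
                     grid = tibble(num_comp = 1:5))

# 3. Retrieve the best candidate, finalize the workflow,
#    then fit on the training set and evaluate on the test set
best_pca <- select_best(res_pca, metric = "rmse")
final_fit <- wkf_pca %>%
  finalize_workflow(best_pca) %>%
  last_fit(car_split)
collect_metrics(final_fit)
```

Because the recipe is part of the workflow, the PCA is re-estimated inside each fold, so the assessment sets never influence the preprocessing.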

Compare everything with `{workflowsets}`

Combine different recipes and models in a single object

all_models <- 
   workflow_set(
      preproc = list(normalized = cf_rec),
      models = list(lm = linear_reg(), 
                    rf = rand_forest(mode = "regression"), 
                    tuned_rf = rand_forest(mode = "regression", trees = 500,
                                           mtry = param_rf$mtry, min_n = param_rf$min_n), 
                    boost_tree = boost_tree(mode = "regression", engine = "xgboost")),
      cross = TRUE)
all_models

# A workflow set/tibble: 4 × 4
  wflow_id              info             option    result    
  <chr>                 <list>           <list>    <list>    
1 normalized_lm         <tibble [1 × 4]> <opts[0]> <list [0]>
2 normalized_rf         <tibble [1 × 4]> <opts[0]> <list [0]>
3 normalized_tuned_rf   <tibble [1 × 4]> <opts[0]> <list [0]>
4 normalized_boost_tree <tibble [1 × 4]> <opts[0]> <list [0]>

Compare everything with `{workflowsets}`

all_models %>% 
  extract_workflow(id = "normalized_rf")

══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()

── Preprocessor ────────────────────────────────────────────────────────────────
7 Recipe Steps

• step_unknown()
• step_other()
• step_other()
• step_impute_linear()
• step_dummy()
• step_impute_median()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (regression)

Computational engine: ranger

Compare everything with `{workflowsets}`

set.seed(567)
res_all_models <- 
   all_models %>% 
   workflow_map(fn = "fit_resamples", resamples = cf_cv)
res_all_models

# A workflow set/tibble: 4 × 4
  wflow_id              info             option    result   
  <chr>                 <list>           <list>    <list>   
1 normalized_lm         <tibble [1 × 4]> <opts[1]> <rsmp[+]>
2 normalized_rf         <tibble [1 × 4]> <opts[1]> <rsmp[+]>
3 normalized_tuned_rf   <tibble [1 × 4]> <opts[1]> <rsmp[+]>
4 normalized_boost_tree <tibble [1 × 4]> <opts[1]> <rsmp[+]>

Compare everything with `{workflowsets}`

autoplot(res_all_models)

Compare everything with `{workflowsets}`

rank_results(res_all_models, 
             rank_metric = "rmse", # <- how to order models
             select_best = TRUE   # <- one point per workflow
             ) %>% 
  select(rank, wflow_id, .metric, mean)

# A tibble: 8 × 4
   rank wflow_id              .metric  mean
  <int> <chr>                 <chr>   <dbl>
1     1 normalized_rf         rmse    0.258
2     1 normalized_rf         rsq     0.654
3     2 normalized_tuned_rf   rmse    0.261
4     2 normalized_tuned_rf   rsq     0.639
5     3 normalized_boost_tree rmse    0.269
6     3 normalized_boost_tree rsq     0.623
7     4 normalized_lm         rmse    0.273
8     4 normalized_lm         rsq     0.604
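Once the workflows are ranked, a natural next step is to pull out the top-ranked one and evaluate it on the test set. The sketch below shows one possible way to do this on mtcars, comparing two recipes with a single linear model; the data, the recipes and all object names (car_models, best_id, final_res) are assumptions made for the example, not the tutorial's objects.

```r
library(tidymodels)

set.seed(7)
car_split <- initial_split(mtcars)
car_cv    <- vfold_cv(training(car_split), v = 5)

base_rec <- recipe(mpg ~ ., data = training(car_split)) %>%
  step_normalize(all_numeric_predictors())
pca_rec  <- base_rec %>%
  step_pca(all_numeric_predictors(), num_comp = 3)

# Two recipes crossed with one model: workflows "base_lm" and "pca_lm"
car_models <- workflow_set(
  preproc = list(base = base_rec, pca = pca_rec),
  models  = list(lm = linear_reg()),
  cross   = TRUE
) %>%
  workflow_map(fn = "fit_resamples", resamples = car_cv)

# Identifier of the workflow ranked first on RMSE
best_id <- rank_results(car_models, rank_metric = "rmse", select_best = TRUE) %>%
  filter(.metric == "rmse", rank == 1) %>%
  pull(wflow_id)

# Refit that workflow on the training set and evaluate it on the test set
final_res <- car_models %>%
  extract_workflow(id = best_id) %>%
  last_fit(car_split)
collect_metrics(final_res)
```

The test set is used only once, after the ranking, so the reported metrics are not biased by the model selection step.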

Going further

24 packages to date:
- specific recipes ({embed}, {themis}…)
- specific models ({multilevelmod}, {modeltime}, {poissonreg}…)
- specific modes ({censored})
- working with specific data ({textrecipes}, {spatialsample}…)
- pipeline refinement ({desirability2}, {stacks}…)
You can also plug in:
- your own recipe
- your own model
- your own metric

References

Official documentation https://www.tidymodels.org

Blog posts https://www.tidyverse.org/tags/tidymodels

Book Tidy Modeling with R, Max Kuhn and Julia Silge https://www.tmwr.org (free online version)

Book Feature Engineering and Selection, Max Kuhn and Kjell Johnson https://bookdown.org/max/FES (free online version)

On coffee cupping https://nomadbarista.com/cupping-cafe-ou-la-degustation-du-cafe/

Traceability

sessionInfo()

R version 4.2.2 (2022-10-31)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur ... 10.16

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] xgboost_1.7.5.1    rlang_1.1.1        glmnet_4.1-7       Matrix_1.5-3      
 [5] finetune_1.0.1     ranger_0.15.1      corrr_0.4.4        yardstick_1.2.0   
 [9] workflowsets_1.0.1 workflows_1.1.3    tune_1.1.1         rsample_1.1.1     
[13] recipes_1.0.6      parsnip_1.1.0      modeldata_1.1.0    infer_1.0.4       
[17] dials_1.2.0        scales_1.2.1       broom_1.0.5        tidymodels_1.1.0  
[21] lubridate_1.9.2    forcats_1.0.0      stringr_1.5.0      dplyr_1.1.2       
[25] purrr_1.0.1        readr_2.1.4        tidyr_1.3.0        tibble_3.2.1      
[29] ggplot2_3.4.2      tidyverse_2.0.0   

loaded via a namespace (and not attached):
 [1] minqa_1.2.5         colorspace_2.0-3    ellipsis_0.3.2     
 [4] class_7.3-20        rstudioapi_0.14     listenv_0.9.0      
 [7] furrr_0.3.1         farver_2.1.1        bit64_4.0.5        
[10] prodlim_2019.11.13  fansi_1.0.4         countdown_0.4.0    
[13] codetools_0.2-18    splines_4.2.2       knitr_1.41         
[16] polyclip_1.10-4     polynom_1.4-1       jsonlite_1.8.4     
[19] nloptr_2.0.3        ggforce_0.4.1       compiler_4.2.2     
[22] backports_1.4.1     fastmap_1.1.0       cli_3.6.1          
[25] tweenr_2.0.2        htmltools_0.5.4     tools_4.2.2        
[28] gtable_0.3.1        glue_1.6.2          Rcpp_1.0.10        
[31] DiceDesign_1.9      vctrs_0.6.2         nlme_3.1-161       
[34] iterators_1.0.14    gghalves_0.1.4      timeDate_4021.107  
[37] gower_1.0.1         xfun_0.36           globals_0.16.2     
[40] lme4_1.1-33         timechange_0.2.0    lifecycle_1.0.3    
[43] future_1.30.0       ggrain_0.0.3        MASS_7.3-58.1      
[46] ipred_0.9-13        vroom_1.6.0         hms_1.1.3          
[49] parallel_4.2.2      prismatic_1.1.1     yaml_2.3.6         
[52] gridExtra_2.3       rpart_4.1.19        stringi_1.7.12     
[55] foreach_1.5.2       lhs_1.1.6           boot_1.3-28.1      
[58] hardhat_1.3.0       ggpp_0.5.2          shape_1.4.6        
[61] lava_1.7.0          pkgconfig_2.0.3     evaluate_0.19      
[64] lattice_0.20-45     labeling_0.4.2      bit_4.0.5          
[67] tidyselect_1.2.0    parallelly_1.33.0   magrittr_2.0.3     
[70] R6_2.5.1            generics_0.1.3      pillar_1.9.0       
[73] whisker_0.4.1       withr_2.5.0         survival_3.4-0     
[76] nnet_7.3-18         future.apply_1.10.0 crayon_1.5.2       
[79] vip_0.3.2           utf8_1.2.3          tzdb_0.3.0         
[82] rmarkdown_2.19      grid_4.2.2          data.table_1.14.6  
[85] digest_0.6.31       GPfit_1.0-8         munsell_0.5.0

Thank you for your attention!

Slides made with Quarto
