Feature selection step using the coefficient of variation

Select the variables with the highest coefficient of variation.

Usage

step_select_cv(
  recipe,
  ...,
  role = NA,
  trained = FALSE,
  n_kept = NULL,
  prop_kept = NULL,
  cutoff = NULL,
  res = NULL,
  skip = FALSE,
  id = rand_id("select_cv")
)

# S3 method for class 'step_select_cv'
tidy(x, ...)

Arguments

recipe: A recipe object. The step will be added to the sequence of operations for this recipe.
...: One or more selector functions to choose variables for this step. See recipes::selections() for more details.
role: Not used by this step since no new variables are created.
trained: A logical to indicate if the quantities for preprocessing have been estimated.
n_kept: Number of variables to keep. n_kept and prop_kept are mutually exclusive.
prop_kept: A numeric value between 0 and 1 representing the proportion of variables to keep. n_kept and prop_kept are mutually exclusive.
cutoff: Threshold beyond which (below or above) the variables are discarded.
res: This parameter is only produced after the recipe has been trained.
skip: A logical. Should the step be skipped when the recipe is baked by recipes::bake()? While all operations are baked when recipes::prep() is run, some operations cannot be applied to new data (e.g. processing the outcome variable(s)). Take care when using skip = TRUE, as it may affect the computations of subsequent operations.
id: A character string that is unique to this step to identify it.
x: A step_select_cv object.

Value

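An updated version of recipe with the new step added to the sequence of any existing operations. For the tidy method, a tibble with one row per candidate variable and the columns terms, cv, kept and id, as shown in the Examples.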

Author

Antoine Bichat

Examples

library(recipes)
library(scimo)

rec <-
  recipe(Species ~ ., data = iris) %>%
  step_select_cv(all_numeric_predictors(), n_kept = 2) %>%
  prep()
rec
#> 
#> ── Recipe ──────────────────────────────────────────────────────────────────────
#> 
#> ── Inputs 
#> Number of variables by role
#> outcome:   1
#> predictor: 4
#> 
#> ── Training information 
#> Training data contained 150 data points and no incomplete rows.
#> 
#> ── Operations 
#> • Top CV filtering on: Sepal.Length Sepal.Width, ... | Trained
tidy(rec, 1)
#> # A tibble: 4 × 4
#>   terms           cv kept  id             
#>   <chr>        <dbl> <lgl> <chr>          
#> 1 Sepal.Length 0.142 FALSE select_cv_QgF0f
#> 2 Sepal.Width  0.143 FALSE select_cv_QgF0f
#> 3 Petal.Length 0.470 TRUE  select_cv_QgF0f
#> 4 Petal.Width  0.636 TRUE  select_cv_QgF0f
bake(rec, new_data = NULL)
#> # A tibble: 150 × 3
#>    Petal.Length Petal.Width Species
#>           <dbl>       <dbl> <fct>  
#>  1          1.4         0.2 setosa 
#>  2          1.4         0.2 setosa 
#>  3          1.3         0.2 setosa 
#>  4          1.5         0.2 setosa 
#>  5          1.4         0.2 setosa 
#>  6          1.7         0.4 setosa 
#>  7          1.4         0.3 setosa 
#>  8          1.5         0.2 setosa 
#>  9          1.4         0.2 setosa 
#> 10          1.5         0.1 setosa 
#> # ℹ 140 more rows
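
The cv and kept columns returned by tidy() can be cross-checked by hand in base R. The sketch below assumes that the coefficient of variation is the standard deviation divided by the absolute value of the mean, and that n_kept = 2 keeps the two variables with the largest values. The helper cv_sketch is hypothetical and only serves this illustration; it is not the function used internally by the step.

```r
# Hypothetical helper: coefficient of variation, assumed to be sd / |mean|.
cv_sketch <- function(x) sd(x) / abs(mean(x))

# The four numeric predictors of iris, in the same order as in the recipe.
num_vars <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]

# One coefficient of variation per variable, as a named numeric vector.
cv_values <- vapply(num_vars, cv_sketch, numeric(1))

# Flag the n_kept variables with the largest coefficient of variation
# (assumed selection rule; ties are broken by column order).
n_kept <- 2
kept <- rank(-cv_values, ties.method = "first") <= n_kept

data.frame(
  terms = names(cv_values),
  cv = round(cv_values, 3),
  kept = unname(kept),
  row.names = NULL
)
```

Rounded to three decimals, the values are 0.142, 0.143, 0.470 and 0.636, and only Petal.Length and Petal.Width are flagged, which agrees with the tidy() output above.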