Feature selection step using Kruskal test — step_select

Select the variables with the lowest (adjusted) p-values from a Kruskal-Wallis test against an outcome.

Usage

step_select_kruskal(
  recipe,
  ...,
  role = NA,
  trained = FALSE,
  outcome = NULL,
  n_kept = NULL,
  prop_kept = NULL,
  cutoff = NULL,
  correction = "none",
  res = NULL,
  skip = FALSE,
  id = rand_id("select_kruskal")
)

# S3 method for class 'step_select_kruskal'
tidy(x, ...)

Arguments

recipe: A recipe object. The step will be added to the sequence of operations for this recipe.
...: One or more selector functions to choose variables for this step. See recipes::selections() for more details.
role: Not used by this step since no new variables are created.
trained: A logical to indicate if the quantities for preprocessing have been estimated.
outcome: Name of the variable to perform the test against.
n_kept: Number of variables to keep.
prop_kept: A numeric value between 0 and 1 representing the proportion of variables to keep. n_kept and prop_kept are mutually exclusive.
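cutoff: Threshold on the (adjusted) p-value. Variables on the wrong side of it are discarded; since this step keeps the lowest p-values, these are the variables whose value exceeds the threshold.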
correction: Multiple testing correction method. One of p.adjust.methods. Defaults to "none".
res: This parameter is only produced after the recipe has been trained.
skip: A logical. Should the step be skipped when the recipe is baked by recipes::bake()? While all operations are baked when recipes::prep() is run, some operations may not be able to be conducted on new data (e.g. processing the outcome variable(s)). Care should be taken when using skip = TRUE as it may affect the computations for subsequent operations.
id: A character string that is unique to this step to identify it.
x: A step_select_kruskal object.
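
To make the roles of outcome, correction and prop_kept more concrete, here is a minimal base R sketch of the kind of filtering this step describes. It is not the package's implementation: the variable names (pv, qv, n_to_keep, kept) and the use of ceiling() to turn prop_kept into a count are assumptions made for illustration only.

```r
# Base R sketch of Kruskal-Wallis filtering; NOT the package's own code.
data(iris)

outcome    <- "Species"
predictors <- setdiff(names(iris), outcome)

# One Kruskal-Wallis test per predictor, with the outcome as grouping factor
pv <- vapply(
  predictors,
  function(v) kruskal.test(iris[[v]], iris[[outcome]])$p.value,
  numeric(1)
)

# Multiple testing correction, mirroring correction = "fdr"
qv <- p.adjust(pv, method = "fdr")

# Keep the proportion of predictors with the smallest adjusted p-values.
# Assumption: prop_kept = 0.5 is converted to a count with ceiling().
prop_kept <- 0.5
n_to_keep <- ceiling(prop_kept * length(qv))
kept <- names(sort(qv))[seq_len(n_to_keep)]

kept
```

On iris this should retain Petal.Length and Petal.Width, in line with the kept column shown in the Examples section.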

Value

An updated version of recipe with the new step added to the sequence of any existing operations.

Author

Antoine Bichat

Examples

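library(recipes)  # recipe(), prep(), bake() and the %>% pipe
# step_select_kruskal() comes from the package documented here; attach it as well.

# Keep the half of the numeric predictors with the lowest FDR-adjusted p-values
rec <-
  iris %>%
  recipe(formula = Species ~ .) %>%
  step_select_kruskal(all_numeric_predictors(), outcome = "Species",
                      correction = "fdr", prop_kept = 0.5) %>%
  prep()
rec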
#> 
#> ── Recipe ──────────────────────────────────────────────────────────────────────
#> 
#> ── Inputs 
#> Number of variables by role
#> outcome:   1
#> predictor: 4
#> 
#> ── Training information 
#> Training data contained 150 data points and no incomplete rows.
#> 
#> ── Operations 
#> • Kruskal filtering against Species on: Sepal.Length, ... | Trained
tidy(rec, 1)
#> # A tibble: 4 × 5
#>   terms              pv       qv kept  id                  
#>   <chr>           <dbl>    <dbl> <lgl> <chr>               
#> 1 Sepal.Length 8.92e-22 1.19e-21 FALSE select_kruskal_jF0Hq
#> 2 Sepal.Width  1.57e-14 1.57e-14 FALSE select_kruskal_jF0Hq
#> 3 Petal.Length 4.80e-29 9.61e-29 TRUE  select_kruskal_jF0Hq
#> 4 Petal.Width  3.26e-29 9.61e-29 TRUE  select_kruskal_jF0Hq
bake(rec, new_data = NULL)
#> # A tibble: 150 × 3
#>    Petal.Length Petal.Width Species
#>           <dbl>       <dbl> <fct>  
#>  1          1.4         0.2 setosa 
#>  2          1.4         0.2 setosa 
#>  3          1.3         0.2 setosa 
#>  4          1.5         0.2 setosa 
#>  5          1.4         0.2 setosa 
#>  6          1.7         0.4 setosa 
#>  7          1.4         0.3 setosa 
#>  8          1.5         0.2 setosa 
#>  9          1.4         0.2 setosa 
#> 10          1.5         0.1 setosa 
#> # ℹ 140 more rows
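
The relation between the pv and qv columns above can be checked by hand with stats::p.adjust(). The sketch below retypes the rounded p-values from the tidy() output, so the last digit may differ slightly from the printed qv values; the vector names are only illustrative.

```r
# p-values copied (rounded) from the tidy() output above
pv <- c(
  Sepal.Length = 8.92e-22,
  Sepal.Width  = 1.57e-14,
  Petal.Length = 4.80e-29,
  Petal.Width  = 3.26e-29
)

# "fdr" is the Benjamini-Hochberg adjustment in p.adjust()
qv <- p.adjust(pv, method = "fdr")

signif(qv, 3)
```

Because of how the Benjamini-Hochberg adjustment works, the two smallest p-values end up with the same adjusted value here, which is why Petal.Length and Petal.Width share one qv in the table.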