In this post we’ll run through how to do feature selection by cross-validation in sparklyr. You can see previous posts for some background on cross-validation and sparklyr. Our aim will be to loop over a set of features refitting a model with each feature excluded. We can then compare the performance of these reduced models to a model containing all the features. This way, we’ll quantify the effect of removing a particular feature on performance. We’re doing this feature selection with cross-validation as it will give us more reliable estimates that simply splitting our data into a training and test set. If we did 5-fold cross-validation, then we would fit 5 models with our feature of interest excluded then average the performance.

While there are other ways to do feature selection, this method has an appealing simplicity and directness. The question we really want to answer with feature selection is ‘what would happen if I removed this feature’. Using cross-validation answers this question directly for us – we remove the feature and see what happens. This differs from metric like permutation feature importance, which are sometimes used for feature selection. Permutation feature importance measures for tree model tell us how much performance would drop if we randomised a particular feature (see here for more details). However, when this is done, the structure of the trees is determined using the pre-randomised feature. Thus, while the feature of interest is subsequently randomised, it still has an influence on the initial tree structure. As it’s put in the Elements of Statistics Learning (as quoted here).

This does not measure the effect on prediction were this variable not available, because if the model was refitted without the variable, other variables could be used as surrogates.


First we’ll do some setup by loading in our packages and creating a Spark connection.


sc <- spark_connect(master = 'local')

As in the previous post on cross-validation, we’re going to use the Titanic data set, provided by the titanic package. We need to clean up this data a bit by selecting a subset of columns, turning factor columns to integers, and removing NAs, before creating a Spark DataFrame for the data.

# get the training data
df_titanic <- titanic::titanic_train %>%
  # select some columns
  select(Survived, Pclass, Sex, Age, SibSp, Parch, Fare) %>%
  # make the col names lower case
  rename_all(tolower) %>%
  # turn sex to a interger
  mutate(sex = ifelse(sex == 'male', 1, 0)) %>%
  # remove NAs

# create a Spark DataFrame for the data
tbl_titanic <- copy_to(sc, df_titanic)

Feature selection by cross validation

We’ve looked previously at two ways to do cross-validation with sparklyr. Here we’re going to use the in-built function from sparklyr, as it’s quicker than a user-specified function. In order to do our models, we’re going to need to specify a set of features to use where our feature of interest is excluded.

Below is a function to give us the set of features for a data set, excluding a particular feature. The function takes some data, a response column that we want to predict and an optional set of columns to exclude. It will return a vector of column names for the data with response and excluded removed.

get_feature_cols <- function(tbl, response, exclude = NULL) {
  # column names of the data
  columns <- colnames(tbl)
  # exclude the response/outcome variable and 
  # exclude from the column names
  columns[!(columns %in% c(response, exclude))]

Lets use this on our data, excluding the pclass column. We also pass in our response column (survived), as we don’t want that included in our predictors!

feature_cols <- get_feature_cols(tbl_titanic, 'survived', 'pclass')

## [1] "sex"   "age"   "sibsp" "parch" "fare"

Next we’re going to create a single column containing our features. Each element of this column will be a vector containing the values on a set of features for that row. This is the alternative way to feed data into a Spark model, compared to the formula approach used last time.

Before creating this vector column our data might take the form.

y x1 x2
0 1.1 4.9
1 5.6 3.4

After the transformation it would look something like

y x
0 (1.1, 5.6)
1 (4.9, 3.4)

There’s a Spark function to apply this transformation to our data, which we call with ft_vector_assembler(). We’re using the feature_cols defined above, and saying that the output column should be called features.

tbl_titanic_va <-
                      input_cols = feature_cols, 
                      output_col = 'features')

The next thing we’ll need to do is specify the estimator we want to use. Here we’re going to keep things simple and use logistic regression.

One thing to note here is that we’re providing a Spark connection as the first argument to ml_logistic_regression, not a Spark DataFrame. This is crucial, because we don’t actually want to fit the model now, which is what would happen if we provided a DataFrame. Instead, we get back a “pointer to a Spark Predictor object”, which will be used for our cross-validation later. You can do ?ml_logistic_regression to get more information on the objects that are returned for different argument types.

As well as specifying our Spark connection, we indicate what our outcome and features columns are called.

estimator <-
                         label_col = 'survived', 
                         features_col = 'features')

As noted when we previously looked at the cross-validation function, we need to provide a parameter grid to ml_cross_validator(). We only want to fit a single model, so we’ll just provide a minimal parameter grid containing default values.

param_grid <- list(logistic_regression = list(elastic_net_param = 0))

Finally, we need to specify an evaluator to be used in our cross-validation. Here we also specifying the column name for our outcome, as well as our Spark connection.

evaluator <-
                                     label_col = 'survived')

Now we can actually do some cross-validation. Note, this is different to how we did it previously as we aren’t using a pipeline. This is just to show there are multiple ways to use the cross-validation function.

# do cross-validation with pclass excluded
cv_fit_pclass <- ml_cross_validator(
  estimator = estimator,
  estimator_param_maps = param_grid,
  evaluator = evaluator,
  seed = 2018

# look at the performance metrics
##   areaUnderROC elastic_net_param_1
## 1    0.8301738                   0

We could have carried out the whole process, included the vector assembling, in a single pipeline. However, this would mean we’d get a pipeline object back. We’d then have to pull the information about the cross-validation out of this object (e.g. see here). Above we, instead, get a cross-validation object straight back. The pipeline approach would be preferable in a setting where we want to save and reuse our modelling process. We could save the whole pipeline object to disk and reuse it on future data.

Now we’ve gone through the whole process of doing cross-validation with in-built functions, we can wrap it up into a function.

ml_fit_cv <-
  function(sc, # Spark connection
           tbl, # tbl_spark
           model_fun, # modelling function
           label_col, # label/response/outcome columns
           feature_cols_exclude = NULL, # vector of features to exclude
           param_grid, # parameter grid
           seed = sample(.Machine$integer.max, 1) # optional seed (following sdf_partition)
           ) {
    # columns for the feature
    feature_cols <-
      get_feature_cols(tbl, label_col, feature_cols_exclude)
    # vector assembler
    tbl_va <- ft_vector_assembler(tbl,
                                   input_cols = feature_cols,
                                   output_col = "features")
    # estimator
    estimator <- model_fun(sc, label_col = label_col)
    # an evaluator
    evaluator <-
      ml_binary_classification_evaluator(sc, label_col = label_col)
    # do the cv
      estimator = estimator,
      estimator_param_maps = param_grid,
      evaluator = evaluator,
      seed = seed

We can then apply the function to our data, this time excluding the age column.

cv_fit_age <- ml_fit_cv(
  sc = sc, 
  tbl = tbl_titanic, 
  model_fun = ml_logistic_regression, 
  label_col = 'survived', 
  feature_cols_exclude = 'age', 
  param_grid = param_grid, 
  seed = 2018

##   areaUnderROC elastic_net_param_1
## 1    0.8306967                   0

What we really want to do is apply this process to all our features. The combination of purrr and list columns is perfect for this sort of task. First off, we create a base data frame that just contains all our candidate features.

df_feature_selection <-
  data_frame(excluded_feature = get_feature_cols(tbl_titanic, 'survived'))

## # A tibble: 6 x 1
##   excluded_feature
##   <chr>           
## 1 pclass          
## 2 sex             
## 3 age             
## 4 sibsp           
## 5 parch           
## 6 fare

Next we’re using our ml_fit_cv() function to map over the excluded_feature column and do cross-validation with each feature excluded from the model. The cv_fit column will contain a cross-validation object where the model was fit without the excluded_feature for that row. As well as creating the cv_fit column, the average AUC for the cross-validation is being pulled out and put into the avg_metric column. We use map_dbl() here to indicate that we want a vector back.

df_feature_selection <- df_feature_selection %>%
    cv_fit = map(
      ~ ml_fit_cv(
        tbl = tbl_titanic,
        model_fun = ml_logistic_regression,
        label_col = 'survived',
        feature_cols_exclude = .x,
        param_grid = param_grid,
        seed = 2018
    avg_metric = map_dbl(cv_fit, ~ .x$avg_metrics)

The last thing we want to know is the performance of the full model to compare the reduced models with features excluded. We can get this full model by simple using our ml_fit_cv() function without specifying the feature_cols_exclude argument.

cv_fit_full <- ml_fit_cv(
  param_grid = param_grid, 
  seed = 2018

The average performance for the full model can then be added into df_feature_selection.

df_feature_selection <- df_feature_selection %>%
  mutate(full_avg_metric = cv_fit_full$avg_metrics)

Finally we can plot how the performance of each of the reduced models compares to the full model. We can see straight away that sex is an important predictor of survival, whereas parch and fare are not.

       aes(excluded_feature, avg_metric, fill = avg_metric)) +
  coord_cartesian(ylim = c(0.5, 1)) +
  geom_hline(aes(yintercept = full_avg_metric), linetype = 'dashed') +
    x = 'sex',
    y = 0.87,
    label = 'Full model',
    family = 'mono',
    alpha = .75,
    size = 3.5
  ) +
  geom_point(shape = 21, size = 2.5, show.legend = FALSE) +
  scale_fill_viridis_c(option = 'B', end = 0.9) +
    x = 'Excluded feature',
    y = 'Average Area \nunder the ROC curve',
    title = 'Feature selection by cross-validation',
    subtitle = 'Average drop in performance from removing each feature'
  ) +
  theme_minimal(base_size = 12, base_family = 'mono') +
  theme(panel.grid.major.x = element_blank())