Feature Selection — Wrapper Method with AIC/BIC in R and Python (with Code Examples)

SharkYun
Apr 7, 2023


In the last article, we went over the ideas behind the different feature selection approaches. In this article, we will focus on implementing the wrapper method in R and Python. The wrapper method is a greedy search algorithm that attempts to find the "optimal" feature subset by iteratively adding or removing features based on the performance of the fitted model.

Let's use the built-in R dataset 'swiss' to demo our R code for the wrapper method. In R, there are two common ways: step() and regsubsets().

First, let's introduce the step() function:

The step() function can perform different wrapper searches by specifying the direction, e.g. direction = "forward". By default the function uses AIC as the evaluation criterion; setting the penalty argument k = log(n), where n is the number of observations, makes it equivalent to BIC. In the example below I use direction = "both", which indicates stepwise selection. The model we start with is the full linear model.

```{r}
data(swiss)
lm1 <- lm(Fertility ~ ., data = swiss)
slm1 <- step(lm1, direction = "both")
summary(slm1)
```

From the step() output, we observe that the starting AIC is 190.69, and for each predictor there is an AIC score on the right showing what the AIC would be if that term were removed. The AIC for removing 'Examination' is lower than that of the full model, so the function suggests removing 'Examination'. In the second iteration, since this is stepwise selection, a '+ Examination' option appears as well, but none of the options reduce the AIC any further, so the function stops there. The final model simply drops 'Examination'.
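If you prefer the BIC penalty instead, a minimal sketch (same swiss data and lm1 model as above, everything else unchanged) is to pass k = log(n) to step():

```{r}
# Stepwise selection with a BIC penalty: k = log(n) instead of the default k = 2 (AIC)
n <- nrow(swiss)
slm1_bic <- step(lm1, direction = "both", k = log(n))
summary(slm1_bic)
```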

Here's an example of how to start from the null model with direction = 'forward' using step():

```{r}
min.model = lm(Fertility ~ 1, data = swiss)
biggest <- formula(lm(Fertility~.,swiss))
fwd.model = step(min.model, direction='forward', scope=biggest)
```

Besides linear regression, we can also run step() on other models. Here is an example using step() with a generalized linear model (GLM). (Tip: glm(family = binomial) gives logistic regression.)

```{r}
data(swiss)
glm1 <- glm(Fertility ~ ., data = swiss)
sglm1 <- step(glm1, direction = "both")
summary(sglm1)
```
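The GLM above uses the default gaussian family, so it behaves like the linear model. As a purely illustrative, hedged sketch of the logistic case, you could binarize the response (the median threshold below is made up for demonstration only) and run the same stepwise search:

```{r}
# Hypothetical binary outcome: fertility above the median (illustration only)
swiss$HighFert <- as.numeric(swiss$Fertility > median(swiss$Fertility))

# Logistic regression (glm with family = binomial) plus stepwise selection by AIC
logit_full <- glm(HighFert ~ . - Fertility, data = swiss, family = binomial)
logit_step <- step(logit_full, direction = "both")
summary(logit_step)
```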

Let's quickly talk about why we want to use AIC or BIC. AIC stands for Akaike Information Criterion, and BIC stands for Bayesian Information Criterion. They are mathematical methods for evaluating how well a model fits the data it was generated from. In statistics, AIC is used to compare different candidate models and determine which one fits the data best. AIC and BIC can be used for model selection when you have a set of candidate models and want the one that provides the best trade-off between goodness of fit and complexity. In general, models with lower AIC or BIC values are preferred, as they provide a better balance between fit and complexity. The formula of AIC:

AIC = -2*log(L) + 2*p

where p is the number of independent variables (predictors) used and L is the maximized value of the likelihood function (i.e. how likely it is that the model could have produced your observed y-values).

And the formula of BIC:

BIC = -2*log(L) + p*log(N)

where p is the number of independent variables (predictors) used, L is the maximized value of the likelihood function, and N is the number of examples in the training dataset.
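As a minimal sanity check in R (note that R's logLik() counts every estimated parameter, including the intercept and the error variance, in p, so p here is larger than the number of predictors alone):

```{r}
fit <- lm(Fertility ~ ., data = swiss)
ll <- as.numeric(logLik(fit))   # maximized log-likelihood
p  <- attr(logLik(fit), "df")   # number of estimated parameters
n  <- nrow(swiss)

c(manual_AIC = -2 * ll + 2 * p,      builtin_AIC = AIC(fit))
c(manual_BIC = -2 * ll + p * log(n), builtin_BIC = BIC(fit))
```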

We see that the penalty for AIC is less than for BIC. This causes AIC to pick more complex models. — Page 162, Machine Learning: A Probabilistic Perspective, 2012.

That's why we tend to use AIC, but if we have a large number of samples in our dataset, we may want to choose BIC, because…

…given a family of models, including the true model, the probability that BIC will select the correct model approaches one as the sample size N -> infinity.

— Page 235, The Elements of Statistical Learning, 2016.

The downside of BIC is that for smaller, less representative training datasets, it is more likely to choose models that are too simple. So when we have a small dataset, we prefer to go with AIC.

The second common way to perform the wrapper method in R is regsubsets(). Unlike step(), regsubsets() searches over subsets up to a fixed number of predictors, so it is used for finding the best subset of each size. The regsubsets() function is part of the "leaps" package and performs best subset selection for linear regression models, selecting the subset of predictor variables that best explains the variability in the response. You need to specify the option nvmax, which represents the maximum number of predictors to include in the model. For example, if nvmax = 5, the function returns up to the best 5-variable model: the best 1-variable model, the best 2-variable model, …, and the best 5-variable model.
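The call that produces the output discussed below is a minimal sketch on the swiss data; the object name reg1 matches the later summary(reg1) call, and nvmax = 5 is an assumption:

```{r}
library(leaps)
reg1 <- regsubsets(Fertility ~ ., data = swiss, nvmax = 5)
summary(reg1)
```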

The asterisks in the summary output show which predictors are included in the best model of each size. 'Examination' is the last one to enter, so if we only want the best 4 predictors, it is the one we would remove.

We can also use the following code to see how many predictors are best under different evaluation criteria.

```{r}
res.sum <- summary(reg1)
data.frame(
  Adj.R2 = which.max(res.sum$adjr2),
  CP     = which.min(res.sum$cp),
  BIC    = which.min(res.sum$bic)
)
```
output: 5, 4, 4

Remark: there is no single correct solution to model selection; each of these criteria will lead to slightly different models. Remember, "All models are wrong, but some are useful." For a more rigorous evaluation, you can bring cross-validation into the process as well, as sketched below.
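A hedged sketch of that idea, assuming the caret package is available: caret's "leapSeq" method wraps regsubsets() and uses cross-validation to pick the subset size nvmax.

```{r}
library(caret)

set.seed(123)
cv_fit <- train(
  Fertility ~ ., data = swiss,
  method    = "leapSeq",                                # stepwise subset selection via leaps
  tuneGrid  = data.frame(nvmax = 1:5),                  # candidate subset sizes
  trControl = trainControl(method = "cv", number = 10)  # 10-fold cross-validation
)
cv_fit$bestTune   # subset size with the best cross-validated RMSE
```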

Link for more examples with regsubsets

For Python users, there are also plenty of packages that can perform feature selection with the wrapper method. The three main packages I want to introduce are mlxtend, scikit-learn, and statsmodels.

Mlxtend

The SequentialFeatureSelector class in mlxtend provides forward and backward selection algorithms, which are named SFS (Sequential Forward Selection) and SBS (Sequential Backward Selection), respectively. Both algorithms take a machine learning model and a scoring function as inputs, and use these to evaluate the performance of the model on different subsets of features.

```python
from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

# Load the Boston Housing dataset
# (note: load_boston was removed in scikit-learn 1.2; on newer versions,
#  substitute another regression dataset such as fetch_california_housing)
boston = load_boston()
X, y = boston.data, boston.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a linear regression model
lr = LinearRegression()

# Create a Sequential Forward Selector object
sfs = SequentialFeatureSelector(lr, k_features=5, forward=True,
                                scoring='neg_mean_squared_error', cv=5)

# Fit the object to the training data
sfs = sfs.fit(X_train, y_train)

# Print the selected features
print(sfs.k_feature_idx_)

# Reduce the data to the selected features
X_train_sfs = sfs.transform(X_train)
X_test_sfs = sfs.transform(X_test)
```

We create a SequentialFeatureSelector object called sfs with the linear regression model, specifying that we want to select 5 features using k_features=5, that we want to use forward selection with forward=True, that we want to evaluate feature subsets using negative mean squared error with scoring='neg_mean_squared_error', and that we want to use 5-fold cross-validation with cv=5.

We fit the sfs object to the training data using fit() method. Then we print the selected features using k_feature_idx_. Finally, we transform the training and testing data using transform() to get the final feature set.

Note that SequentialFeatureSelector supports various performance evaluation criteria, such as accuracy, AUC, f1-score, etc., and can be used with any machine learning model that has fit() and predict() methods.
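As a hedged illustration of that flexibility, here is a sketch of Sequential Backward Selection (forward=False) with a classifier and accuracy scoring; the dataset, the KNN model, and k_features=10 are arbitrary choices for the example:

```python
from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_breast_cancer

# Binary classification data (30 features)
X, y = load_breast_cancer(return_X_y=True)

# Sequential Backward Selection: start from all features and drop one at a time
knn = KNeighborsClassifier(n_neighbors=5)
sbs = SequentialFeatureSelector(knn, k_features=10, forward=False,
                                scoring='accuracy', cv=5)
sbs = sbs.fit(X, y)

print(sbs.k_feature_idx_)   # indices of the 10 retained features
print(sbs.k_score_)         # cross-validated accuracy of that subset
```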

Sklearn

sklearn.feature_selection.RFE: Recursive Feature Elimination (RFE) is a method for selecting features by recursively removing the least important features until a desired number of features is reached. The RFE class in scikit-learn (a.k.a. sklearn) library implements this method. It can work with any model that has a coef_ or feature_importances_ attribute after fitting. It returns a mask of the selected features.

```python
# Import your necessary dependencies
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Feature extraction (X is the feature matrix and Y the target vector,
# assumed to be loaded already from your own dataset)
model = LogisticRegression()
rfe = RFE(model, n_features_to_select=3)
fit = rfe.fit(X, Y)
print("Num Features: %s" % (fit.n_features_))
print("Selected Features: %s" % (fit.support_))
print("Feature Ranking: %s" % (fit.ranking_))
```

output:

Num Features: 3

Selected Features: [ True False False False False True True False]

Feature Ranking: [1 2 3 5 6 1 1 4]

The selected features are marked True in the support array and ranked "1" in the ranking array; the ranking indicates the relative strength of each feature.

Note: since scikit-learn does not build AIC into its generic scoring, we can construct an AIC scoring function ourselves. RFE itself does not accept a scoring argument, but its cross-validated variant RFECV does, so we can plug the custom scorer in there, like:

```python
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

# Define a custom scorer function that calculates AIC (Gaussian form, up to a constant);
# it returns -AIC because scikit-learn scorers assume higher is better
def aic_scorer(estimator, X, y):
    n_samples, n_features = X.shape
    y_pred = estimator.predict(X)
    rss = np.sum((y - y_pred) ** 2)
    k = n_features
    aic = n_samples * np.log(rss / n_samples) + 2 * k
    return -aic

# Recursive feature elimination with cross-validation, scored by the custom AIC
# (X, y: feature matrix and target, e.g. from the Boston example above)
estimator = LinearRegression()
rfecv = RFECV(estimator, step=1, min_features_to_select=5, scoring=aic_scorer)
rfecv.fit(X, y)
print(rfecv.support_)
```

statsmodels

With statsmodels you can implement the wrapper method yourself, adding or dropping features in a loop and comparing the AIC of the fitted models.

statsmodels.api.OLS: OLS stands for Ordinary Least Squares, a linear regression method that estimates the coefficients of the linear equation by minimizing the sum of the squared residuals. The OLS class in the statsmodels.api package can be used to fit a linear regression model, and every fitted model exposes its AIC (and the p-values of its coefficients), so you can add or drop features iteratively until you find the best subset of features.

```python
import statsmodels.api as sm
from sklearn.datasets import load_boston

# Load the Boston Housing dataset
# (note: load_boston was removed in scikit-learn 1.2; on newer versions,
#  substitute another regression dataset such as fetch_california_housing)
boston = load_boston()
X = boston.data
y = boston.target

# Define a function to perform forward wrapper feature selection based on AIC
def wrapper_aic(X, y):
    k = X.shape[1]
    included = set()
    best_aic = None
    while True:
        aic_values = []
        # Fit one candidate model per feature that is not yet included
        for i in range(k):
            if i not in included:
                model = sm.OLS(y, sm.add_constant(X[:, list(included) + [i]])).fit()
                aic_values.append((model.aic, i))
        if not aic_values:
            break
        # Pick the candidate with the lowest AIC
        aic_values.sort()
        new_aic, new_feature = aic_values[0]
        if best_aic is None or new_aic < best_aic:
            included.add(new_feature)
            best_aic = new_aic
        else:
            break
    model = sm.OLS(y, sm.add_constant(X[:, list(included)])).fit()
    return model, included

# Perform wrapper feature selection based on AIC
best_model, selected_features = wrapper_aic(X, y)

# Print the selected features
print(selected_features)
```

We start with an empty set of included features and iterate until no further improvement can be achieved. Inside the loop, we fit one candidate model for each feature not already included, each time on the current set of included features plus that candidate. We then sort the candidates by their AIC values and pick the one with the best (lowest) AIC. If the new AIC is better than the current best AIC, we add the corresponding feature to the set of included features using the add() method; if not, we exit the loop.

Finally, we fit the OLS model on the selected features and return the best model and the list of selected feature indices using selected_features variable.
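The loop above only adds features. Here is a hedged sketch of the complementary backward direction, dropping the feature whose removal improves AIC the most; this is not a statsmodels API, just another plain loop over OLS fits mirroring the forward version:

```python
import statsmodels.api as sm

def backward_aic(X, y):
    # Backward elimination by AIC: start from all features and drop one at a time
    included = list(range(X.shape[1]))
    best_aic = sm.OLS(y, sm.add_constant(X[:, included])).fit().aic
    while len(included) > 1:
        # AIC of every model with exactly one feature removed
        candidates = []
        for i in included:
            rest = [j for j in included if j != i]
            aic = sm.OLS(y, sm.add_constant(X[:, rest])).fit().aic
            candidates.append((aic, i))
        candidates.sort()
        new_aic, worst_feature = candidates[0]
        if new_aic < best_aic:
            included.remove(worst_feature)
            best_aic = new_aic
        else:
            break
    return included

# Example (assumes X, y from the forward-selection snippet above)
print(backward_aic(X, y))
```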

Follow and upvote would help~

References:

https://machinelearningmastery.com/probabilistic-model-selection-measures/

https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/step
