# Feature Selection — Wrapper Method with AIC/BIC in R and Python(with Code example)

In the last article, we know the idea of the different data selection. In this article, we will be focusing on the implementation with Wrapper Method in R and python. Wrapper method is a greedy search algorithm that attempts to find the “optimal” feature subset by iteratively selecting features based on the classifier performance

Let‘s use the dataset ‘swiss’ from R to demo our R code for the wrapper method. In R, there are two common ways: step( ) and regsubsets( ).

First, lets introduce the step( ) function:

step() function can perform different wrapper function by specific the direction like `direction = "forward".`

The function is using AIC to evaluate by default, and you may specific `criterion = "BIC"`

to change the evaluation critierion. The example below I use `direction = "both"`

which indicates to stepwise selection. The model we started with is the full linear model.

`data(swiss)`

lm1 <- lm(Fertility ~ ., data = swiss)

slm1 <- step(lm1, direction = "both")

summary(slm1)

From summary(slm1), we observe the started AIC is 190.69, and the for each predicator, there is an AIC score on the right. We can see the AIC socre for removing ‘Examination’ is lower than full model, so it suggest to remove ‘Examination’. Then second time when it run the step( ), since it is stepwise, we see there is + ‘Examination’ option as well, but all the options will not reduce AIC anymore, so the function stopped here. The final model just simply removed ‘Examination’.

Here’s the example how we started with null model with ‘forward’ direction using step( ).

````{r}`

min.model = lm(Fertility ~ 1, data = swiss)

biggest <- formula(lm(Fertility~.,swiss))

fwd.model = step(min.model, direction='forward', scope=biggest)

```

Besides Linear regression, we also can perform other model to run with step( ), here is the example using step( ) with generalized linear model (GLM). (tips: glm(family=binomial) →logistic regression)

`data(swiss)`

lm1 <- glm(Fertility ~ ., data = swiss)

slm1 <- step(lm1, direction = "both")

summary(slm1)

Let’s quick talk why we want to use AIC or , AIC stands for **Akaike information criterion, **and** **BIC stands for **Bayesian Information Criterion.** They are the mathematical methods for evaluating how well a model fits the data it was generated from. In statistics, AIC is used to compare different possible models and determine which one is the best fit for the data. AIC and BIC can be used to perform data selection when you have a set of candidate models and you want to select the model that provides the best trade-off between goodness of fit and complexity. In general, models with lower AIC or BIC values are preferred, as they provide a better balance between fit and complexity. The formula of AIC:

AIC = -2*log(L) + 2*p

**p** is the number of independent variables used (predicators) and **L** is the log-likelihood estimate (a.k.a. the likelihood that the model could have produced your observed y-values).

Also the formula of BIC:

BIC = -2*log(L) + p*log(N)

**p** is the number of independent variables used (predicators), **L** is the log-likelihood estimate (a.k.a. the likelihood that the model could have produced your observed y-values), and **N **s the number of examples in the training dataset.

We see that the penalty for AIC is less than for BIC. This causes AIC to pick more complex models.— Page 162, Machine Learning: A Probabilistic Perspective, 2012.

Thats why we tend to use AIC, but if we have large number of sample in our dataset, we would wnat to choose BIC, becuase…

…given a family of models, including the true model, the probability that BIC will select the correct model approaches one as the sample size N -> infinity.

— Page 235, The Elements of Statistical Learning, 2016.

The downside of BIC is that for smaller, less representative training datasets, it is more likely to choose models that are too simple. So when we have small dataset, we prefer go with AIC.

The Second common way to perform wrapper method in R is `regsubsets().`

Different than step( ), regsubsets is also using wrapper function but limited with number of predicator, which its used for finding the best subset. The `regsubsets()`

function in R is used to perform best subset selection in linear regression models. It is part of the "leaps" package in R and is used to select the best subset of predictor variables that explain the variability in the response variable. You need to specify the option `nvmax`

, which represents the maximum number of predictors to incorporate in the model. For example, if `nvmax = 5`

, the function will return up to the best 5-variables model, that is, it returns the best 1-variable model, the best 2-variables model, …, the best 5-variables models.

The star sign at the bottom shows the significant level between the predicators, and we can tell the ‘Examination’ is less significant than the others, which if we only want the best 4 predicators then we will remove it.

Also, we can use the following code to see what number of predicators are best for different evaluation criteria.

````{r}`

res.sum <- summary(reg1)

data.frame(

Adj.R2 = which.max(res.sum$adjr2),

CP = which.min(res.sum$cp),

BIC = which.min(res.sum$bic)

)

```

output: 5, 4, 4

**Remark:** There is no single correct solution to model selection, each of these criteria will lead to slightly different models. Remember that, “All models are wrong, some models are useful”. For more precise evaluation terminology, you may involve cross validation as well.

link for More example with regsubsets

For Python Users, there are also plenty of package that can perform feature selection with wrapper method. The three main packages I want to introduce is **mlxtend, sklearn **and **statsmodels**.

# Mlxtend

The `SequentialFeatureSelector`

class in `mlxtend`

provides forward and backward selection algorithms, which are named `SFS`

(Sequential Forward Selection) and `SBS`

(Sequential Backward Selection), respectively. Both algorithms take a machine learning model and a scoring function as inputs, and use these to evaluate the performance of the model on different subsets of features.

`from mlxtend.feature_selection import SequentialFeatureSelector`

from sklearn.linear_model import LinearRegression

from sklearn.datasets import load_boston

from sklearn.model_selection import train_test_split

# Load the Boston Housing dataset

boston = load_boston()

X, y = boston.data, boston.target

# Split the dataset into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a linear regression model

lr = LinearRegression()

# Create a Sequential Forward Selector object

sfs = SequentialFeatureSelector(lr, k_features=5, forward=True, scoring='neg_mean_squared_error', cv=5)

# Fit the object to the training data

sfs = sfs.fit(X_train, y_train)

# Print the selected features

print(sfs.k_feature_idx_)

# Print the final feature set

X_train_sfs = sfs.transform(X_train)

X_test_sfs = sfs.transform(X_test)

We create a `SequentialFeatureSelector`

object called `sfs`

with the linear regression model, specifying that we want to select 5 features using `k_features=5`

, that we want to use forward selection with `forward=True`

, that we want to evaluate feature subsets using negative mean squared error with `scoring='neg_mean_squared_error'`

, and that we want to use 5-fold cross-validation with `cv=5`

.

We fit the `sfs`

object to the training data using `fit()`

method. Then we print the selected features using `k_feature_idx_`

. Finally, we transform the training and testing data using `transform()`

to get the final feature set.

Note that `SequentialFeatureSelector`

supports various performance evaluation criteria, such as accuracy, AUC, f1-score, etc., and can be used with any machine learning model that has `fit()`

and `predict()`

methods.

# Sklearn

`sklearn.feature_selection.RFE`

: Recursive Feature Elimination (RFE) is a method for selecting features by recursively removing the least important features until a desired number of features is reached. The `RFE`

class in scikit-learn (a.k.a. `sklearn`

) library implements this method. It can work with any model that has a `coef_`

or `feature_importances_`

attribute after fitting. It returns a mask of the selected features.

`# Import your necessary dependencies`

from sklearn.feature_selection import RFE

from sklearn.linear_model import LogisticRegression

# Feature extraction

model = LogisticRegression()

rfe = RFE(model, 3)

fit = rfe.fit(X, Y)

print("Num Features: %s" % (fit.n_features_))

print("Selected Features: %s" % (fit.support_))

print("Feature Ranking: %s" % (fit.ranking_))

output:

Num Features: 3

Selected Features: [ True False False False False True True False]

Feature Ranking: [1 2 3 5 6 1 1 4]

These are marked True in the *support* array and marked with a choice “1” in the *ranking* array. This, in turn, indicates the strength of these features.

Note: since sklearn does not built in AIC, we can construct a AIC score function and implement into RFE, like

`# Define a custom scorer function that calculates AIC`

def aic_scorer(estimator, X, y):

n_samples, n_features = X.shape

y_pred = estimator.predict(X)

rss = np.sum((y - y_pred) ** 2)

k = n_features

aic = n_samples * np.log(rss / n_samples) + 2 * k

return -aic

# Create a recursive feature elimination object and fit it to the data

rfe = RFE(estimator, n_features_to_select=5, step=1, scoring=aic_scorer)

rfe.fit(X, y)

**statsmodels**

From **statsmodels** you can perform by using `add()`

and `drop().`

`statsmodels.api.OLS`

: OLS stands for Ordinary Least Squares, which is a linear regression method that estimates the coefficients of the linear equation by minimizing the sum of the squared residuals. The `OLS`

class in the `statsmodels.api`

package can be used to fit a linear regression model and select the features based on the p-values of their coefficients. You can use the `add()`

and `drop()`

methods to add or drop features iteratively until you find the best subset of features.

`import statsmodels.api as sm`

from sklearn.datasets import load_boston

# Load the Boston Housing dataset

boston = load_boston()

X = boston.data

y = boston.target

# Define a function to perform wrapper feature selection based on AIC

def wrapper_aic(X, y):

k = X.shape[1]

included = set()

best_aic = None

while True:

aic_values = []

for i in range(k):

if i not in included:

model = sm.OLS(y, sm.add_constant(X[:, list(included) + [i]])).fit()

aic_values.append((model.aic, i))

if not aic_values:

break

aic_values.sort()

new_aic, new_feature = aic_values[0]

if best_aic is None or new_aic < best_aic:

included.add(new_feature)

best_aic = new_aic

else:

break

model = sm.OLS(y, sm.add_constant(X[:, list(included)])).fit()

return model, included

# Perform wrapper feature selection based on AIC

best_model, selected_features = wrapper_aic(X, y)

# Print the selected features

print(selected_features)

We start with an empty set of included features and iterate until no further improvement can be achieved. Inside the loop, we create a list of models, where each model is fitted on the current set of included features and a feature not already included. We then sort the models based on their AIC value and select the model that results in the best AIC value. If the new AIC value is better than the current best AIC value, we add the corresponding feature to the set of included features using `add()`

method. If the new AIC value is not better, we exit the loop.

Finally, we fit the OLS model on the selected features and return the best model and the list of selected feature indices using `selected_features`

variable.

# Follow and upvote would help~

References:

https://machinelearningmastery.com/probabilistic-model-selection-measures/

https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/step