Recursive Feature Elimination

Published on: 05/08/2025

Recursive Feature Elimination (RFE) in Machine Learning

Recursive Feature Elimination (RFE) is a feature selection technique that iteratively removes the least important features, refitting the model on the remaining features at each step. The aim is to arrive at the subset of features that is most relevant for predicting the target variable. This article provides a comprehensive guide to understanding and implementing RFE using scikit-learn in Python.
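
To make the mechanics concrete, here is a minimal, hand-rolled sketch of the elimination loop (not scikit-learn's implementation; the helper `manual_rfe` is illustrative only, and the assumed ranking criterion is the absolute coefficients of a logistic regression fit):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def manual_rfe(X, y, n_features_to_select):
    """Toy RFE loop: fit, rank by |coef_|, drop the weakest feature, repeat."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_features_to_select:
        model = LogisticRegression(solver='liblinear', random_state=42)
        model.fit(X[:, remaining], y)
        importances = np.abs(model.coef_).ravel()
        weakest = remaining[int(np.argmin(importances))]  # least useful remaining feature
        remaining.remove(weakest)
    return remaining

X, y = make_classification(n_samples=200, n_features=8, n_informative=3, random_state=0)
print("Kept feature indices:", manual_rfe(X, y, n_features_to_select=3))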

Fundamental Concepts / Prerequisites

To understand Recursive Feature Elimination (RFE), a basic understanding of the following concepts is helpful:

  • Feature Selection: The process of selecting a subset of relevant features for use in model construction.
  • Supervised Learning: Algorithms that learn from labeled data, where the goal is to map input features to output targets.
  • Model Training: The process of fitting a machine learning model to a dataset.
  • Model Evaluation: Assessing the performance of a trained model.
  • Scikit-learn: A Python library for machine learning.
  • Cross-Validation: A technique for evaluating a model's performance on unseen data.

We'll be using `sklearn`'s `RFE` class, which works with estimators that expose a feature ranking through attributes such as `coef_` or `feature_importances_`. Common estimators used with RFE include linear models (e.g., Logistic Regression, Linear Regression) and tree-based models (e.g., Random Forests, Gradient Boosting), as shown in the sketch below.
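
A minimal sketch of the tree-based case, assuming a RandomForestClassifier as the estimator (it exposes `feature_importances_`, which RFE uses for ranking):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

X, y = make_classification(n_samples=500, n_features=10, n_informative=4, random_state=0)

# RFE ranks features with the forest's feature_importances_ attribute
rfe = RFE(estimator=RandomForestClassifier(n_estimators=100, random_state=0),
          n_features_to_select=4)
rfe.fit(X, y)
print("Selected feature mask:", rfe.support_)
print("Feature ranking:", rfe.ranking_)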

Core Implementation/Solution

Here's an implementation of Recursive Feature Elimination using scikit-learn in Python:


from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# 1. Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=0, random_state=42)

# 2. Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 3. Initialize a Logistic Regression model
model = LogisticRegression(solver='liblinear', random_state=42)

# 4. Initialize RFE with the model and desired number of features to select
rfe = RFE(estimator=model, n_features_to_select=3)

# 5. Fit RFE to the training data
rfe.fit(X_train, y_train)

# 6. Get the selected features
selected_features = X_train[:, rfe.support_]

# 7. Train the model with the selected features
model.fit(selected_features, y_train)

# 8. Transform the test data to include only the selected features
X_test_selected = X_test[:, rfe.support_]

# 9. Make predictions on the test data
y_pred = model.predict(X_test_selected)

# 10. Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy with RFE: {accuracy}")

# Print the ranking of features
print("Feature Ranking:", rfe.ranking_)

Code Explanation

1. **Generate synthetic dataset:** `make_classification` creates a synthetic classification dataset with 1000 samples and 10 features, 5 of which are informative and none redundant, using a fixed random state for reproducibility.

2. **Split data into training and testing sets:** We split the data into training (70%) and testing (30%) sets using `train_test_split`.

3. **Initialize a Logistic Regression model:** We initialize a `LogisticRegression` model. The `solver` is set to 'liblinear' because it's suitable for smaller datasets, and we set a `random_state` for reproducibility.

4. **Initialize RFE:** `RFE` is initialized with the Logistic Regression model and the desired number of features to select (in this case, 3).

5. **Fit RFE:** `rfe.fit` fits the RFE object to the training data. This involves iteratively training the model and removing features until the desired number of features is reached.

6. **Get the selected features:** `rfe.support_` is a boolean array indicating which features were selected. We use it to select the corresponding columns from the original feature matrix `X_train`.

7. **Train the model with the selected features:** We refit the Logistic Regression model using only the selected features from the training data.

8. **Transform the test data to include only the selected features:** Similar to the training data, we transform the test data to include only the selected features using `rfe.support_`.

9. **Make predictions:** We make predictions on the transformed test data using the trained model.

10. **Evaluate the model:** We calculate the accuracy of the model on the test set using `accuracy_score`. We also print the ranking of the features using `rfe.ranking_`; a lower rank indicates a more important feature, and all selected features are assigned rank 1.
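
Steps 6-8 index columns manually with `rfe.support_`. The fitted `RFE` object also exposes the standard transformer and estimator interface, so the same result can be obtained more directly (a minimal sketch reusing the `rfe` object from the listing above):

# Equivalent to X_train[:, rfe.support_] and X_test[:, rfe.support_]
X_train_selected = rfe.transform(X_train)
X_test_selected = rfe.transform(X_test)

# The fitted RFE object also keeps a copy of the estimator refit on the
# selected features, so it can predict on the full-width test matrix directly.
y_pred = rfe.predict(X_test)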

Complexity Analysis

Time Complexity: The time complexity of RFE depends heavily on the underlying model and the number of features. If a single model fit takes O(m) time and there are n features, RFE with the default `step` of 1 removes one feature per iteration and therefore performs on the order of n fits, for roughly O(n*m) overall. This is a simplified view; the exact cost varies with the estimator and with how its fitting time scales as features are removed.

Space Complexity: The space complexity is dominated by the dataset itself and by the underlying estimator. The extra bookkeeping that RFE adds, the support mask and the feature rankings, is only O(n), where n is the number of features.
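
The number of refits can be reduced with the `step` parameter, which removes several features per iteration. A minimal sketch:

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=100, n_informative=10, random_state=42)

# step=10 removes 10 features per iteration instead of 1,
# so roughly ten times fewer model fits are needed
rfe = RFE(estimator=LogisticRegression(solver='liblinear', random_state=42),
          n_features_to_select=10, step=10)
rfe.fit(X, y)
print("Number of selected features:", rfe.n_features_)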

Alternative Approaches

An alternative to RFE is **SelectFromModel**. This approach fits the estimator once and selects features whose importance (e.g., feature importances in tree-based models or coefficients in linear models) exceeds a threshold. While `SelectFromModel` is generally faster because it does not retrain the model iteratively, RFE can sometimes yield better results because it re-ranks the remaining features after each elimination rather than relying on a single fit's importances.
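
A minimal sketch of the `SelectFromModel` approach, assuming a RandomForestClassifier as the importance source:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, random_state=42)

# One fit of the estimator; features whose importance falls below the
# median importance are discarded
selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=42),
                           threshold='median')
X_selected = selector.fit_transform(X, y)
print("Selected feature mask:", selector.get_support())
print("Shape after selection:", X_selected.shape)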

Conclusion

Recursive Feature Elimination (RFE) is a valuable technique for feature selection, especially when dealing with high-dimensional datasets. It systematically eliminates less important features, leading to simpler and potentially more accurate models. While computationally more intensive than some other feature selection methods, it can often result in improved performance by identifying the most relevant features for a given task. Understanding its implementation and trade-offs is crucial for effective machine learning model building.