L1 and L2 Regularization Methods in Machine Learning
Published on: 04/08/2025
Regularization is a crucial technique in machine learning to prevent overfitting, a common problem where a model learns the training data too well, including its noise, and performs poorly on unseen data. L1 and L2 regularization are two popular methods for achieving this by adding a penalty term to the cost function, discouraging overly complex models. This article will explain the concepts behind L1 and L2 regularization, provide code examples, and discuss their complexities and alternatives.
Fundamental Concepts / Prerequisites
Before diving into L1 and L2 regularization, it's helpful to understand the following concepts:
- Overfitting: When a model learns the training data too well and performs poorly on unseen data.
- Cost Function: A function that quantifies the error of a model's predictions. The goal of training is to minimize this function.
- Model Complexity: The capacity of a model to fit a wide range of patterns in data. More complex models typically have more parameters or larger parameter values.
- Parameters: The coefficients (weights) a model learns during training; regularization works by penalizing their magnitudes.
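With these pieces in place, the regularized objective can be stated directly. If J(w) is the unregularized cost (for example, the mean squared error) and λ ≥ 0 is the regularization strength, L1 (Lasso) regularization minimizes J(w) + λ·Σ|wᵢ|, while L2 (Ridge) regularization minimizes J(w) + (λ/2)·Σwᵢ². The bias term is normally excluded from the penalty, as in the implementation below.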
Core Implementation: L1 and L2 Regularization in Python
import numpy as np


def l1_regularization(weights, lambda_val):
    """
    Calculates the L1 regularization term.

    Args:
        weights (numpy.ndarray): The model's weights.
        lambda_val (float): The regularization strength (lambda).

    Returns:
        float: The L1 regularization term.
    """
    return lambda_val * np.sum(np.abs(weights))


def l2_regularization(weights, lambda_val):
    """
    Calculates the L2 regularization term.

    Args:
        weights (numpy.ndarray): The model's weights.
        lambda_val (float): The regularization strength (lambda).

    Returns:
        float: The L2 regularization term.
    """
    return (lambda_val / 2) * np.sum(weights ** 2)


def linear_regression_with_regularization(X, y, lambda_val, regularization_type='l2'):
    """
    Performs linear regression with either L1 or L2 regularization.

    Args:
        X (numpy.ndarray): The feature matrix.
        y (numpy.ndarray): The target vector.
        lambda_val (float): The regularization strength (lambda).
        regularization_type (str): 'l1' for L1 regularization, 'l2' for L2 regularization.

    Returns:
        numpy.ndarray: The learned weights.
    """
    # Add a column of ones to X for the bias term
    X = np.concatenate((np.ones((X.shape[0], 1)), X), axis=1)

    # Calculate the optimal weights using the normal equation with regularization
    if regularization_type == 'l2':
        identity_matrix = np.identity(X.shape[1])
        identity_matrix[0, 0] = 0  # Don't regularize the bias term
        weights = np.linalg.inv(X.T @ X + lambda_val * identity_matrix) @ X.T @ y
    elif regularization_type == 'l1':
        # L1 regularization is typically implemented using gradient descent or
        # other iterative optimization methods.
        raise NotImplementedError("L1 regularization with normal equation is not directly solvable. Gradient descent is needed.")
    else:
        raise ValueError("Invalid regularization type. Must be 'l1' or 'l2'.")
    return weights


# Example usage:
if __name__ == '__main__':
    # Generate some sample data
    np.random.seed(0)
    X = 2 * np.random.rand(100, 1)
    y = 4 + 3 * X + np.random.randn(100, 1)

    # L2 Regularization
    lambda_l2 = 0.1
    weights_l2 = linear_regression_with_regularization(X, y, lambda_l2, regularization_type='l2')
    print(f"L2 Regularized Weights: {weights_l2}")

    # Attempting L1 regularization using the normal equation will raise an error.
    # Lambda for L1, for demonstration. Not used in the implemented calculation.
    lambda_l1 = 0.1
    try:
        weights_l1 = linear_regression_with_regularization(X, y, lambda_l1, regularization_type='l1')
        print(f"L1 Regularized Weights: {weights_l1}")
    except NotImplementedError as e:
        print(f"Error: {e}")
Code Explanation
The code demonstrates how to implement L1 and L2 regularization. The `l1_regularization` and `l2_regularization` functions compute the respective penalty terms from the model's weights and the regularization strength (lambda). The `linear_regression_with_regularization` function performs linear regression with either an L1 or an L2 penalty. L2 regularization has an analytical solution in this setting, implemented via the normal equation. The L1 case raises a `NotImplementedError`, because its solution must be found through iterative optimization such as gradient descent rather than a closed-form normal equation.
The example usage generates sample data, applies L2 regularization, and prints the learned weights. It also demonstrates the error raised when attempting to directly solve for L1 with the normal equation.
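In the L2 case, the weights computed above are the solution of the ridge normal equation w = (XᵀX + λI)⁻¹Xᵀy, with the entry of I corresponding to the bias set to zero. As a rough sanity check, this closed-form result can be compared against an off-the-shelf implementation. The sketch below assumes scikit-learn is available; its `Ridge` estimator's `alpha` plays the role of lambda, and its fitted coefficients should come out close to `weights_l2`:

```python
import numpy as np
from sklearn.linear_model import Ridge  # assumes scikit-learn is installed

# Recreate the same sample data used in the example above.
np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# Ridge penalizes the coefficients but not the intercept, matching the
# zeroed bias entry of the identity matrix in the normal-equation solution.
ridge = Ridge(alpha=0.1)
ridge.fit(X, y)

print("Ridge intercept:", ridge.intercept_)  # should be close to weights_l2[0]
print("Ridge coefficient:", ridge.coef_)     # should be close to weights_l2[1]
```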
Complexity Analysis
The time complexity of the `l1_regularization` and `l2_regularization` functions is O(n), where n is the number of weights in the model, because each must visit every weight to sum its absolute value (L1) or its square (L2). As written, the space complexity is also O(n), since `np.abs(weights)` and `weights ** 2` allocate temporary arrays; a loop that accumulates the sum directly would need only O(1) extra space.
The time complexity of the `linear_regression_with_regularization` function, using the normal equation with L2 regularization, is O(m·p²) to form XᵀX (where m is the number of samples and p the number of features, including the bias term) plus O(p³) for the matrix inversion. The space complexity is O(p²) due to storing XᵀX and its inverse.
Alternative Approaches
Gradient Descent: Both L1 and L2 regularization can be implemented using gradient descent or other iterative optimization algorithms. This is a more common approach, especially for L1 regularization where a closed-form solution like the normal equation is not readily available due to the non-differentiability of the absolute value function at zero. Gradient descent iteratively adjusts the model's weights based on the gradient of the cost function (including the regularization term). The trade-off is that gradient descent requires careful tuning of the learning rate and can be more computationally expensive than the normal equation for L2 when the number of features is relatively small.
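To make this concrete, below is a minimal sketch of batch gradient descent for regularized linear regression on the same kind of data (function and variable names here are illustrative, not part of the implementation above). It uses the gradient `lambda_val * w` for the L2 penalty and the subgradient `lambda_val * sign(w)` for the L1 penalty, leaving the bias unpenalized:

```python
import numpy as np

def gradient_descent_regularized(X, y, lambda_val, reg='l2', lr=0.05, n_iters=5000):
    """Batch gradient descent for linear regression with an L1 or L2 penalty.

    A minimal sketch: the L1 penalty uses a subgradient (the sign of the
    weights), which works for illustration but converges more slowly than
    dedicated methods such as coordinate descent or proximal gradient descent.
    """
    X = np.concatenate((np.ones((X.shape[0], 1)), X), axis=1)  # bias column
    m, p = X.shape
    w = np.zeros((p, 1))
    for _ in range(n_iters):
        grad = (2 / m) * X.T @ (X @ w - y)        # gradient of the MSE term
        penalty = np.zeros_like(w)
        if reg == 'l2':
            penalty = lambda_val * w              # derivative of (lambda/2) * sum(w^2)
        elif reg == 'l1':
            penalty = lambda_val * np.sign(w)     # subgradient of lambda * sum(|w|)
        penalty[0] = 0                            # do not regularize the bias term
        w -= lr * (grad + penalty)
    return w

np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
print(gradient_descent_regularized(X, y, 0.1, reg='l1'))
```

For L1 in particular, dedicated solvers such as coordinate descent or proximal gradient methods converge faster and produce exact zeros more reliably than this plain subgradient update.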
Conclusion
L1 and L2 regularization are powerful techniques for preventing overfitting in machine learning models. L2 regularization shrinks the weights towards zero, while L1 regularization can lead to sparse models by setting some weights exactly to zero, effectively performing feature selection. Choosing the right regularization technique and strength (lambda) often involves experimentation and validation on a holdout dataset to achieve optimal performance on unseen data.
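As a simple illustration of that tuning process, the sketch below reuses `linear_regression_with_regularization` from the implementation above (and is only one of many possible validation schemes): it evaluates a small grid of lambda values on a held-out split and keeps the one with the lowest validation error.

```python
import numpy as np

# Assumes linear_regression_with_regularization from the implementation above
# is defined in the same module.

np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# Hold out 20% of the data for validation.
split = int(0.8 * len(X))
X_train, X_val = X[:split], X[split:]
y_train, y_val = y[:split], y[split:]

def add_bias(features):
    """Prepend a column of ones, matching the training function's convention."""
    return np.concatenate((np.ones((features.shape[0], 1)), features), axis=1)

best_lambda, best_mse = None, float('inf')
for lam in [0.001, 0.01, 0.1, 1.0, 10.0]:
    w = linear_regression_with_regularization(X_train, y_train, lam, 'l2')
    mse = np.mean((add_bias(X_val) @ w - y_val) ** 2)
    if mse < best_mse:
        best_lambda, best_mse = lam, mse

print(f"Best lambda: {best_lambda} (validation MSE: {best_mse:.4f})")
```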