
Adam Optimizer


Published on: 03/08/2025

Understanding and Implementing the Adam Optimizer

The Adam (Adaptive Moment Estimation) optimizer is a popular algorithm for training deep learning models. It combines the benefits of AdaGrad and RMSProp, resulting in an adaptive learning rate method that is efficient and requires relatively little tuning. This article aims to provide a comprehensive understanding of the Adam optimizer and to guide you through its implementation.

Fundamental Concepts / Prerequisites

To effectively understand the Adam optimizer, you should have a basic understanding of the following concepts:

* **Gradient Descent:** The fundamental optimization algorithm that iteratively updates model parameters in the direction of the negative gradient of the loss function.
* **Learning Rate:** A hyperparameter that controls the step size during parameter updates.
* **Momentum:** A technique that accelerates gradient descent by accumulating an exponentially decaying average of past gradients.
* **RMSProp:** An adaptive learning rate method that uses an exponentially decaying average of squared gradients to normalize the learning rate.
* **Bias Correction:** A technique used in Adam to correct the initial bias toward zero introduced by the exponential moving averages; the update rule that ties these pieces together is shown below.
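
For reference, these ideas combine into the standard Adam update rule from Kingma and Ba's original paper. The notation below mirrors the implementation in the next section: $g_t$ is the gradient at timestep $t$, $\theta$ denotes the parameters, and $\alpha$ is the learning rate.

\[
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \\
\theta_t &= \theta_{t-1} - \alpha\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{aligned}
\]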

Core Implementation

This section provides a Python implementation of the Adam optimizer.


import numpy as np

class Adam:
    def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.learning_rate = learning_rate
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.m = None  # First moment vector
        self.v = None  # Second moment vector
        self.t = 0  # Timestep

    def update(self, params, grads):
        """
        Updates the parameters using the Adam optimization algorithm.

        Args:
            params (list): A list of parameter arrays to be updated.
            grads (list): A list of gradient arrays corresponding to the parameters.
        """
        if self.m is None:
            self.m = [np.zeros_like(param) for param in params]
            self.v = [np.zeros_like(param) for param in params]

        self.t += 1

        updated_params = []
        for i, (param, grad) in enumerate(zip(params, grads)):
            # Update biased first moment estimate
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * grad

            # Update biased second raw moment estimate
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * (grad ** 2)

            # Compute bias-corrected first moment estimate
            m_hat = self.m[i] / (1 - self.beta1 ** self.t)

            # Compute bias-corrected second raw moment estimate
            v_hat = self.v[i] / (1 - self.beta2 ** self.t)

            # Update the parameter in place; "-=" mutates the NumPy array passed in by the caller
            param -= self.learning_rate * m_hat / (np.sqrt(v_hat) + self.epsilon)
            updated_params.append(param)

        return updated_params

Code Explanation

The code defines an `Adam` class that implements the Adam optimization algorithm.

`__init__` method: Initializes the optimizer with the learning rate, `beta1`, `beta2`, and `epsilon`. `beta1` and `beta2` are the exponential decay rates for the first and second moment estimates, and `epsilon` is a small constant added to the denominator to prevent division by zero. It also sets the first moment vector `m` and the second moment vector `v` to None (they are allocated on the first update) and the timestep `t` to 0.
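
As a quick illustration (not from the original article), the constructor can be used with its defaults or with explicitly chosen hyperparameters; the non-default values below are purely illustrative:

adam = Adam()  # defaults: learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8
adam_tuned = Adam(learning_rate=0.01, beta1=0.8)  # illustrative non-default settings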

`update` method: Updates the parameters based on the calculated gradients. It takes two lists as input: `params` (the model's parameters) and `grads` (the gradients of the loss function with respect to those parameters). On the first call, it allocates the moment vectors `m` and `v` as zero arrays matching the parameter shapes; it then increments the timestep `t`. For each parameter and gradient pair, it updates the biased first moment estimate (`self.m[i]`), updates the biased second raw moment estimate (`self.v[i]`), computes the bias-corrected estimates (`m_hat` and `v_hat`), and finally applies the Adam update rule. Because `-=` operates on NumPy arrays, the parameters are modified in place; they are also collected in the `updated_params` list and returned for convenience.
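
To see the class in action, here is a minimal usage sketch (not part of the original article) that minimizes the one-dimensional quadratic f(w) = (w - 3)^2; the loss function, learning rate, and number of steps are illustrative choices, and the `Adam` class defined above is assumed to be in scope.

import numpy as np

# Start far from the minimum at w = 3.
params = [np.array([0.0])]
adam = Adam(learning_rate=0.1)

for step in range(200):
    w = params[0]
    grads = [2 * (w - 3.0)]  # gradient of f(w) = (w - 3)^2
    params = adam.update(params, grads)

print(params[0])  # expected to be close to [3.]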

Complexity Analysis

The time and space complexity of the Adam optimizer are primarily determined by the size of the model parameters.

Time Complexity: The `update` method iterates over each parameter array and performs a constant number of element-wise operations per value. Therefore, the time complexity of one update step is O(n), where n is the total number of scalar parameters in the model.

Space Complexity: The Adam optimizer stores the first and second moment vectors (`m` and `v`), each with the same shape as the parameters, so it keeps roughly 2n additional values. The space complexity is therefore O(n), where n is the total number of scalar parameters in the model.

Alternative Approaches

One alternative to the Adam optimizer is Stochastic Gradient Descent (SGD) with momentum, which updates the parameters by accumulating a velocity vector from past gradients. While SGD with momentum can converge to better solutions than Adam in some cases, it typically requires more careful tuning of the learning rate and momentum hyperparameters. Adam, being adaptive, often performs well with less hand-tuning. A sketch of this alternative is shown below.
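
For comparison, here is a minimal sketch of SGD with momentum (not part of the original article) that follows the same list-of-arrays interface as the `Adam` class above; the class name and default hyperparameters are illustrative.

import numpy as np

class SGDMomentum:
    def __init__(self, learning_rate=0.01, momentum=0.9):
        self.learning_rate = learning_rate
        self.momentum = momentum
        self.velocity = None  # One velocity array per parameter, allocated lazily

    def update(self, params, grads):
        if self.velocity is None:
            self.velocity = [np.zeros_like(param) for param in params]

        updated_params = []
        for i, (param, grad) in enumerate(zip(params, grads)):
            # Accumulate an exponentially decaying velocity of past gradients
            self.velocity[i] = self.momentum * self.velocity[i] - self.learning_rate * grad
            # Move the parameter along the accumulated velocity (in place)
            param += self.velocity[i]
            updated_params.append(param)
        return updated_params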

Conclusion

The Adam optimizer is a powerful and widely used algorithm for training deep learning models. It combines the advantages of momentum and adaptive learning rates, making it efficient and relatively easy to use. Understanding its implementation and complexities can help you fine-tune its performance for specific tasks and troubleshoot any issues that may arise during training.