What is Gradient Descent? — AI Glossary

Gradient descent diagram showing iterative steps descending toward a loss minimum

Gradient descent is the optimization algorithm that trains most AI models. It works by repeatedly adjusting a model’s parameters in the direction that most reduces the prediction error, taking small steps “downhill” on the loss surface until reaching a minimum. It is the engine that makes machine learning learn.

Imagine a hiker lost in fog on a hilly landscape, trying to find the lowest valley. Without being able to see far, the hiker’s best strategy is to feel the slope beneath their feet and take a step in the steepest downhill direction. Repeat thousands of times, and they will likely reach a valley. That is gradient descent in a nutshell.

How Gradient Descent Works

The “gradient” is a vector that points in the direction of steepest increase in the loss function. To minimize loss, gradient descent moves in the opposite direction: it subtracts the gradient, scaled by the learning rate, from the current parameters. The formula for each parameter update is:

parameter = parameter - learning_rate × gradient
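To make the update rule concrete, here is a minimal sketch in Python. The toy loss (w - 3)^2, the starting point, and the function names are assumptions chosen for illustration; real models have many parameters and a far more complicated loss.

```python
def loss(w):
    """Toy loss with its minimum at w = 3 (chosen for illustration)."""
    return (w - 3.0) ** 2


def gradient(w):
    """Derivative of the toy loss: d/dw (w - 3)^2 = 2 * (w - 3)."""
    return 2.0 * (w - 3.0)


def gradient_descent(w_start, learning_rate, num_steps):
    """Apply parameter = parameter - learning_rate * gradient repeatedly."""
    w = w_start
    history = [w]
    for _ in range(num_steps):
        w = w - learning_rate * gradient(w)
        history.append(w)
    return w, history


final_w, history = gradient_descent(w_start=0.0, learning_rate=0.1, num_steps=50)
print(f"final w = {final_w:.4f}, loss = {loss(final_w):.6f}")
```

Each pass through the loop is one step of the hiker: measure the slope, then move a little in the opposite direction.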

The learning rate controls step size. Too large, and the loss bounces around or even grows instead of converging. Too small, and training still heads in the right direction but is glacially slow. Finding the right learning rate (or using schedulers that adjust it automatically) is one of the most important hyperparameter decisions in training.
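The effect of step size can be seen on a toy problem. This sketch assumes the simple loss w^2 (gradient 2w) and three hand-picked learning rates; the specific numbers are illustrative only and do not transfer to real models.

```python
def run(learning_rate, num_steps=100, w_start=1.0):
    """Gradient descent on the toy loss w**2, whose gradient is 2 * w."""
    w = w_start
    for _ in range(num_steps):
        w = w - learning_rate * 2.0 * w
    return w


# Too small, reasonable, and too large for this particular toy loss.
for lr in (0.001, 0.1, 1.1):
    print(f"learning_rate={lr}: |w| after 100 steps = {abs(run(lr)):.3e}")
```

With the smallest rate the parameter has barely moved after 100 steps, with the middle rate it is very close to the minimum at 0, and with the largest rate every step overshoots so the value grows instead of shrinking.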

Gradients are computed by backpropagation — the algorithm that propagates error signals backwards through the network to compute each parameter’s contribution to the loss. Gradient descent and backpropagation work together: backpropagation computes the gradients; gradient descent uses them to update parameters.
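The division of labor can be sketched on a tiny network with just two weights. Everything here is a hypothetical example: the network shape (prediction = w2 * tanh(w1 * x)), the input, the target, and the learning rate are all assumptions. Real frameworks compute the backward pass automatically.

```python
import math


def forward(w1, w2, x, y):
    """Tiny two-parameter network: prediction = w2 * tanh(w1 * x)."""
    h = math.tanh(w1 * x)
    pred = w2 * h
    loss = (pred - y) ** 2
    return h, pred, loss


def backward(w1, w2, x, y):
    """Backpropagation: apply the chain rule from the loss back to each weight."""
    h, pred, _ = forward(w1, w2, x, y)
    d_pred = 2.0 * (pred - y)          # dLoss/dPred
    d_w2 = d_pred * h                  # dLoss/dw2
    d_h = d_pred * w2                  # dLoss/dh
    d_w1 = d_h * (1.0 - h ** 2) * x    # tanh'(z) = 1 - tanh(z)^2, and dz/dw1 = x
    return d_w1, d_w2


# Hypothetical example values.
w1, w2, x, y = 0.5, -0.3, 1.5, 1.0
learning_rate = 0.1

loss_before = forward(w1, w2, x, y)[2]
for _ in range(200):
    g1, g2 = backward(w1, w2, x, y)    # backpropagation computes the gradients
    w1 -= learning_rate * g1           # gradient descent applies them
    w2 -= learning_rate * g2
loss_after = forward(w1, w2, x, y)[2]
print(f"loss before = {loss_before:.4f}, loss after = {loss_after:.6f}")
```

The `backward` function only measures; the two lines that subtract `learning_rate * gradient` are the gradient descent part.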

Variants of Gradient Descent

Three main variants differ in how many training examples they use per update:

  • Batch gradient descent — computes gradient over the entire dataset before updating. Most stable, but prohibitively slow for large datasets.
  • Stochastic gradient descent (SGD) — updates after every single example. Noisy but fast; the noise can actually help escape local minima.
  • Mini-batch gradient descent — computes gradient over a small batch (32–512 examples) before updating. The standard in practice — balances speed and stability.
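The three variants above differ only in batch size, which a short sketch can show. This example assumes a synthetic dataset drawn from y = 2x + 1 with a little noise and fits a straight line; the dataset, learning rate, and epoch count are made up for the demo.

```python
import random


def make_data(n=200, seed=0):
    """Synthetic data from y = 2x + 1 plus small noise (an assumption for the demo)."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, 2.0 * x + 1.0 + rng.gauss(0.0, 0.05)) for x in xs]


def train(data, batch_size, learning_rate=0.1, epochs=200, seed=0):
    """Fit y = w * x + b with mean-squared error; batch_size selects the variant."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Average the gradient of (w*x + b - y)^2 over the batch.
            grad_w = sum(2.0 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2.0 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


data = make_data()
for name, size in (("batch", len(data)), ("stochastic", 1), ("mini-batch", 32)):
    w, b = train(data, batch_size=size)
    print(f"{name:>10}: w = {w:.3f}, b = {b:.3f}")
```

All three runs should land near w = 2 and b = 1. The batch version makes one update per pass over the data, while the stochastic version makes 200 noisier ones.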

Modern optimizers build on SGD with adaptive learning rates and momentum:

  • Adam — maintains per-parameter learning rates and momentum estimates. The default choice for most deep learning.
  • AdamW — Adam with weight decay regularization. Standard for training transformers.
  • RMSProp — adapts learning rates based on recent gradient magnitudes.
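For readers curious what "per-parameter learning rates and momentum estimates" looks like in code, here is a from-scratch sketch of the Adam update as described in the Adam paper listed in the sources. The toy loss, the learning rate of 0.05, and the function names are assumptions for illustration; in practice you would use the optimizer built into your framework.

```python
import math


def adam_minimize(grad_fn, params, learning_rate=0.05, beta1=0.9, beta2=0.999,
                  eps=1e-8, num_steps=1000):
    """Minimal Adam loop over a plain list of float parameters."""
    params = list(params)
    m = [0.0] * len(params)  # running average of gradients (momentum)
    v = [0.0] * len(params)  # running average of squared gradients
    for t in range(1, num_steps + 1):
        grads = grad_fn(params)
        for i, g in enumerate(grads):
            m[i] = beta1 * m[i] + (1.0 - beta1) * g
            v[i] = beta2 * v[i] + (1.0 - beta2) * g * g
            m_hat = m[i] / (1.0 - beta1 ** t)  # correct the bias toward zero
            v_hat = v[i] / (1.0 - beta2 ** t)
            # Each parameter gets its own effective step size via v_hat.
            params[i] -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return params


def toy_grad(p):
    """Gradient of (a - 1)^2 + 10 * (b + 2)^2, a made-up test loss."""
    a, b = p
    return [2.0 * (a - 1.0), 20.0 * (b + 2.0)]


a, b = adam_minimize(toy_grad, [0.0, 0.0])
print(f"a = {a:.4f}, b = {b:.4f}")
```

The two parameters see gradients of very different sizes, yet dividing by the square root of `v_hat` gives them similarly sized steps early on. That is the adaptive part.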

Gradient Descent in Practice

For a large language model with 70 billion parameters, gradient descent must compute and apply 70 billion gradient values at every training step, and training runs for many thousands of steps with the work split across large numbers of GPUs running in parallel. This is the core computational challenge of AI training, and it drives demand for specialized hardware like NVIDIA H100s and Google TPUs.
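A rough back-of-the-envelope sketch of what 70 billion gradient values means in memory terms. The byte widths are assumptions (16-bit and 32-bit floating point are common choices), and the figures cover the gradients only, not the weights or any optimizer state.

```python
NUM_PARAMETERS = 70_000_000_000  # 70 billion, the figure used in the text

# Bytes per stored value for two common floating-point widths (assumed formats).
BYTES_PER_VALUE = {"16-bit": 2, "32-bit": 4}


def gradient_memory_gb(num_parameters, bytes_per_value):
    """Memory for one gradient value per parameter, in gigabytes (10**9 bytes)."""
    return num_parameters * bytes_per_value / 1e9


for name, width in BYTES_PER_VALUE.items():
    print(f"{name} gradients: {gradient_memory_gb(NUM_PARAMETERS, width):.0f} GB")
```

Under these assumptions the gradients alone come to 140 GB or 280 GB, which helps explain why the work has to be spread across many devices.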

A key risk is getting stuck in local minima — points where the loss is lower than nearby points but not the global minimum. In practice, large neural networks seem to have many good local minima, and modern training techniques (learning rate schedules, noise, batch size) help navigate the loss landscape effectively.

Common Misconceptions

Misconception: Gradient descent always finds the best possible solution. It finds a local minimum, which may not be the global minimum. For complex models on large datasets, this is typically “good enough” — but the solution found depends on initialization, learning rate, and other factors.
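The dependence on initialization can be shown with a made-up one-dimensional loss that has two valleys of different depth. The function x^4 - 3x^2 + x, the starting points, and the learning rate are all assumptions chosen to make the effect visible.

```python
def loss(x):
    """A made-up non-convex loss with two valleys of different depth."""
    return x ** 4 - 3.0 * x ** 2 + x


def gradient(x):
    """Derivative of the loss above: 4x^3 - 6x + 1."""
    return 4.0 * x ** 3 - 6.0 * x + 1.0


def descend(x_start, learning_rate=0.01, num_steps=500):
    """Plain gradient descent from a given starting point."""
    x = x_start
    for _ in range(num_steps):
        x -= learning_rate * gradient(x)
    return x


x_left = descend(-2.0)
x_right = descend(2.0)
print(f"start -2.0 -> x = {x_left:.3f}, loss = {loss(x_left):.3f}")
print(f"start +2.0 -> x = {x_right:.3f}, loss = {loss(x_right):.3f}")
```

Same algorithm, same learning rate, different starting points: one run settles in the deeper valley and the other in the shallower one.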

Misconception: A larger learning rate always trains faster. A learning rate that is too large will cause the loss to oscillate or diverge. The right learning rate depends on the model, dataset, and optimizer. Learning rate warmup (starting small and increasing gradually) is common practice for training transformers.
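Warmup is simple to express in code. This is a minimal sketch of linear warmup, assuming a hypothetical peak rate and warmup length; real training setups usually follow the warmup phase with some form of decay.

```python
def warmup_learning_rate(step, peak_learning_rate, warmup_steps):
    """Linearly ramp from a small value to the peak rate, then hold it constant."""
    if step < warmup_steps:
        return peak_learning_rate * (step + 1) / warmup_steps
    return peak_learning_rate


# Hypothetical settings: peak rate 0.001 reached after 100 steps.
for step in (0, 49, 99, 500):
    rate = warmup_learning_rate(step, peak_learning_rate=0.001, warmup_steps=100)
    print(f"step {step}: learning rate = {rate:.6f}")
```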


Key Takeaways

  • Gradient descent adjusts model parameters iteratively to minimize the loss function.
  • The gradient points uphill; gradient descent moves in the opposite direction.
  • Learning rate controls step size — a critical hyperparameter.
  • Mini-batch SGD is the standard practice; Adam and AdamW are the dominant optimizers.
  • Backpropagation computes the gradients that gradient descent uses to update parameters.

Frequently Asked Questions

What is the learning rate?

The learning rate is a hyperparameter that controls how large each parameter update step is. Typical values range from 0.0001 to 0.01, though the best value depends on the model and optimizer. Too high causes instability; too low causes slow convergence. Learning rate schedulers adjust the rate automatically during training, usually lowering it over time for better final performance.
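As one example of a scheduler, here is a sketch of cosine decay, a commonly used schedule that lowers the rate smoothly from a maximum to a minimum. The function name and the example values are assumptions; other schedules (step decay, linear decay) follow the same pattern of mapping a step number to a rate.

```python
import math


def cosine_decay(step, total_steps, max_learning_rate, min_learning_rate=0.0):
    """Smoothly lower the learning rate from max to min over total_steps."""
    progress = min(step, total_steps) / total_steps
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_learning_rate + (max_learning_rate - min_learning_rate) * cosine


# Hypothetical run of 1,000 steps starting at a rate of 0.01.
for step in (0, 250, 500, 750, 1000):
    print(f"step {step}: learning rate = {cosine_decay(step, 1000, 0.01):.5f}")
```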

What is the difference between gradient descent and backpropagation?

Backpropagation is the algorithm that computes gradients — how much each parameter contributed to the prediction error. Gradient descent uses those gradients to update the parameters. Backpropagation answers “what direction to move?”; gradient descent does the moving.

What is a local vs. global minimum?

A local minimum is a point where loss is lower than all nearby points but not the lowest possible. A global minimum is the absolute lowest loss achievable. Gradient descent can get “stuck” in local minima. For large neural networks, research suggests most local minima in practice are nearly as good as the global minimum.

Why is Adam the most popular optimizer?

Adam (Adaptive Moment Estimation) maintains per-parameter learning rates that adapt based on the history of gradients. This makes it robust to different loss landscape geometries and less sensitive to the exact learning rate than vanilla SGD, although the learning rate still needs some tuning.

What is gradient clipping?

Gradient clipping caps the magnitude of gradients before applying them. Without it, occasionally very large gradients (the “exploding gradient” problem) can cause catastrophically large parameter updates that destabilize training. Clipping is standard practice for training transformers and recurrent neural networks.
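One common form is clipping by global norm: if the gradient vector as a whole is longer than a chosen threshold, it is scaled down to that length while keeping its direction. The sketch below assumes gradients stored as a plain list of floats and a hypothetical threshold of 1.0.

```python
import math


def clip_by_global_norm(gradients, max_norm):
    """Rescale the whole gradient vector if its L2 norm exceeds max_norm."""
    total_norm = math.sqrt(sum(g * g for g in gradients))
    if total_norm <= max_norm:
        return list(gradients), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in gradients], total_norm


# A gradient of length 5 gets scaled down to length 1; direction is unchanged.
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(f"norm before = {norm_before:.2f}, norm after = {norm_after:.2f}")
```

Gradients already under the threshold pass through untouched, so clipping only intervenes on the rare oversized update.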


Sources: Grokipedia — Gradient Descent · Google ML Crash Course: Gradient Descent · arXiv: Adam — A Method for Stochastic Optimization

Last reviewed: April 2026
