The mathematician Augustin-Louis Cauchy devised gradient descent in 1847 to solve calculations in astronomy and estimate the orbits of celestial bodies. Learn about the role it plays today in optimizing machine learning algorithms.
- Gradient descent is an algorithm you can use to train models in both neural networks and machine learning.
- The “gradient” in the gradient descent formula represents the slope of the cost function at a specific point.
- Convergence in gradient descent is the point where the algorithm has minimized the cost function enough that parameter updates are almost negligible.
- You can strengthen your math skills to better understand and implement gradient descent.
Find out what gradient descent is, why it’s vital for machine learning, its main types, and potential limitations. Afterward, if you’re ready to build foundational skills in machine learning, consider enrolling in DeepLearning.AI’s Mathematics for Machine Learning and Data Science Specialization. Along with machine learning methods, this program offers guidance on data preprocessing, statistical analysis, applied mathematics, A/B testing, and more.
Gradient descent is an algorithm you can use to train models in both neural networks and machine learning. It optimizes a model’s parameters by minimizing a cost function, which measures how accurate the model’s predictions are at each parameter value during training. Gradient descent existed as a mathematical concept before the emergence of machine learning.
A gradient in vector calculus is similar to the slope but applies when you have two or more variables. It is the vector of partial derivatives with respect to all the independent variables, denoted ∇f; ∇f points in the direction of the function’s maximum increase, and −∇f points in the direction of its maximum decrease.
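As a quick illustration (an example added here for concreteness): for f(x, y) = x² + y², the gradient is ∇f = (∂f/∂x, ∂f/∂y) = (2x, 2y). At the point (1, 2), ∇f = (2, 4) points in the direction of steepest increase, while −∇f = (−2, −4) points in the direction of steepest decrease, toward the function’s minimum at the origin.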
Gradient descent requires knowledge of calculus for its implementation in machine learning.
Gradient descent works best with convex functions, stepping toward the lowest point of the curve in as few and as accurate steps as possible. Let’s go over a few terms that inform gradient descent before examining how it works:
- Parameters: The coefficients of the function that minimize the cost
- Cost function: Also called the “loss function” in machine learning, this is the difference between the actual and predicted values at the current position. A model stops learning once this function gets as close as possible to 0.0.
- Learning rate: Sometimes referred to as the step size or alpha, this is the magnitude of the steps the function takes as it minimizes the cost.
The primary function of gradient descent is to find the parameter values that minimize the cost, driving it to 0.0 or as close to 0.0 as possible.
In the gradient descent formula, the “gradient” is the slope of the cost function at a specific point. It shows the direction and rate of steepest change, and the algorithm steps in the opposite direction of the gradient to move toward the minimum of the function.
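Written out as a single update rule (a standard formulation, using the terms defined above):

coefficient_new = coefficient_old − (learning rate × gradient)

Subtracting, rather than adding, the scaled gradient is what makes each step move downhill toward the minimum.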
The goal is to reach cost = 0.0, or the closest acceptable minimum. To calculate this, work through the following steps (a worked Python sketch follows the list):

1. Write the cost function as cost = f(x), with x as the coefficient.
2. Use a starting coefficient of 0.0 or any small number.
3. Take the derivative, or partial derivatives if multiple variables are present, to find the gradient and determine which direction to move along the curve.
4. Multiply the gradient by the learning rate to determine how much the coefficient changes with each update, then subtract that amount from the coefficient.
5. Repeat until the cost is zero or as close to zero as it can get.
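Here is a minimal Python sketch of these steps for a one-coefficient linear model y = w·x with a mean squared error cost. The data, learning rate, and iteration count are illustrative assumptions, not part of the original article:

```python
# Gradient descent on a one-coefficient model y_hat = w * x,
# with mean squared error as the cost function.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated by y = 2x, so the best coefficient is w = 2.0

w = 0.0               # step 2: start the coefficient at 0.0
learning_rate = 0.05  # alpha: the size of each step
n = len(xs)

for step in range(200):
    # step 3: derivative of cost = (1/n) * sum((w*x - y)^2) with respect to w
    gradient = (2.0 / n) * sum((w * x - y) * x for x, y in zip(xs, ys))
    # step 4: move against the gradient, scaled by the learning rate
    w -= learning_rate * gradient
    cost = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / n
    # step 5: stop once the cost is effectively zero
    if cost < 1e-12:
        break

print(f"w = {w:.4f} after {step + 1} updates (cost = {cost:.2e})")
```

Running this converges to w ≈ 2.0 in about a dozen updates. Raising the learning rate too far makes the steps overshoot the minimum, while lowering it makes convergence slow.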
Gradient descent involves knowledge of calculus, but its implementation is always the same series of steps.
Convergence in gradient descent is the point where the algorithm has minimized the cost function enough that parameter updates are almost negligible. This is essentially when the model reaches the “best” solution, and more iterations are unlikely to improve findings.
Machine learning uses two main types of gradient descent:
- Batch gradient descent (BGD): Updates the model once per training epoch, averaging the errors between predictions and actual outcomes across the entire training set before adjusting the coefficients.
- Stochastic gradient descent (SGD): Updates the model after every individual sample in the data set, making a prediction and recalculating each coefficient for every training instance.
Batch gradient descent and stochastic gradient descent have unique advantages and disadvantages when calculating gradient descent in machine learning. Let’s take a look at each:
| Batch gradient descent | Stochastic gradient descent |
|---|---|
| More computationally efficient per update | Takes more computing power overall |
| Lower update frequency, producing a more stable error gradient as the cost approaches 0.0 | Higher update frequency, producing faster learning and quicker insight into model performance |
| Processing the entire data set before each update is slower, and the algorithm can converge before the coefficients are fully optimized | Making a prediction at every step gives SGD more chances to refine the coefficients before the cost reaches 0.0 |
| Requires more memory, because the entire data set must fit at once | Handles large data sets more easily, since it processes one training example at a time |
Batch gradient descent is a common approach to machine learning, but stochastic gradient descent performs better on larger data sets.
If you need aspects of both batch gradient descent and SGD, consider a method called mini-batch gradient descent (MBGD) that combines them. It splits the data set into small batches and performs an SGD-style update after each batch, keeping learning fast for each batch while remaining computationally efficient. This method is standard in machine learning, the training of neural networks, and deep learning applications.
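To see how this differs from the two earlier variants, here is a sketch reusing the same toy one-coefficient model from above; the batch size, learning rate, and data are assumptions for illustration:

```python
import random

# Mini-batch gradient descent on the toy model y_hat = w * x.
# Each update uses the average gradient over one small batch.
data = [(x, 2.0 * x) for x in [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]]
w = 0.0
learning_rate = 0.01
batch_size = 2

for epoch in range(100):
    random.shuffle(data)  # visit the samples in a new order each epoch
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        # Average the gradient of the squared error over just this batch.
        gradient = (2.0 / len(batch)) * sum((w * x - y) * x for x, y in batch)
        w -= learning_rate * gradient

print(f"w = {w:.4f}")  # approaches 2.0, the coefficient that generated the data
```

Setting batch_size to 1 recovers stochastic gradient descent, and setting it to the full data set recovers batch gradient descent, which is why mini-batch sits between the two.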
While gradient descent is an efficient way to optimize machine learning algorithms, it runs into some common problems that can leave you with models that aren’t fully optimized. On cost curves that are not perfectly convex parabolas, points other than the global minimum can make the slope of the cost function equal to 0.0. These two types of points, illustrated in the sketch after this list, are:
- Local minima: These give a slope of 0.0 and look like the global minimum to the algorithm, but they are only local low points; the cost function rises again after them before descending to the true global minimum.
- Saddle points: These give a slope of 0.0 at points where the cost function stops steadily decreasing before it continues its descent toward the global minimum.
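A toy demonstration of the local-minimum trap, using an assumed non-convex function f(x) = x⁴ − 3x² + x, which has a shallow local minimum near x ≈ 1.1 and a deeper global minimum near x ≈ −1.3:

```python
def grad(x):
    return 4 * x**3 - 6 * x + 1  # derivative of f(x) = x^4 - 3x^2 + x

x = 2.0  # start to the right of the shallow local minimum
for _ in range(1000):
    x -= 0.01 * grad(x)  # standard gradient descent update

# The slope here is ~0.0, so the algorithm stops improving,
# even though the global minimum lies near x = -1.3.
print(f"converged to x = {x:.3f}")  # lands near 1.1, the local minimum
```

Starting from x = −2.0 instead would reach the global minimum, which is why the initial coefficient matters on non-convex cost surfaces.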
In applying gradient descent to deep learning neural networks, two further issues arise (see the sketch after this list):
- Vanishing gradients: Occur during backpropagation in neural networks when the gradient becomes too small, shrinking the weight updates toward zero until the network stops learning.
- Exploding gradients: Occur when the gradient grows too large, making the model unstable and producing coefficients too large to calculate reliably.
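A toy illustration of both effects (the numbers are assumptions, not from the article): backpropagation multiplies one local derivative per layer, so a chain of factors below or above 1 shrinks or grows the gradient exponentially with depth.

```python
depth = 50  # number of layers the gradient must pass through

vanishing = 1.0
exploding = 1.0
for _ in range(depth):
    vanishing *= 0.5  # each layer contributes a derivative below 1
    exploding *= 1.5  # each layer contributes a derivative above 1

print(f"50 factors of 0.5: {vanishing:.1e}")  # ~8.9e-16, effectively zero
print(f"50 factors of 1.5: {exploding:.1e}")  # ~6.4e+08, numerically unstable
```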
Read more: Neural Network Weights: A Comprehensive Guide
Subscribe to Career Chat on LinkedIn to stay current with the latest trends in machine learning. Continue your learning journey with our other free digital resources:
Explore certifications: 6 machine learning certificates + how to choose the right one for you
Watch on YouTube: Machine Learning Classification | Python Diabetes Prediction Model
Plan your career trajectory: Machine Learning Career Paths: Explore Roles & Specializations
Accelerate your career growth with a Coursera Plus subscription. When you enroll in either the monthly or annual option, you’ll get access to over 10,000 courses.
Editorial Team
This content has been made available for informational purposes only. Learners are advised to conduct additional research to ensure that courses and other credentials pursued meet their personal, professional, and financial goals.