Gradient Descent Algorithm 101, by Pol Marin

Beginner-friendly guide

Understand the optimization algorithm widely used in machine learning and deep learning

The Mountainside — Photo by Ralph (Ravi) Kayden on Unsplash

Imagine you are a drop of water on top of a mountain and your goal is to reach the lake at the base of the mountain. This high mountain has different slopes and obstacles, so going down in a straight line might not be the best solution. How would you approach this problem? Without a doubt, the best solution would be to take small steps, one at a time, always in the direction that brings you closer to your ultimate goal.

Gradient Descent (GD) is the algorithm that does exactly that, and is essential for any data scientist to understand. It’s basic and fairly simple but crucial, and anyone willing to enter the field should be able to explain what it is.

In this post, my aim is to write a complete, beginner-friendly guide so that everyone can understand what GD is, what it is for, how it works, and what its main variations are.

As always, you will find a resources section at the end of the post.

But first things first.


Using Wikipedia’s definition[1], Gradient descent is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function. Although it is certainly not the most effective method, it is commonly used in machine learning and deep learning, especially in neural networks.

It is basically used to minimize the value of a function by updating a set of parameters at each iteration. Mathematically speaking, it uses the derivative (gradient) to gradually decrease the function's value.

But there is a catch: not all functions are optimizable. We need a function, either univariate or multivariate, that is differentiable, which means that derivatives exist at each point of the function's domain, and convex (U-shaped or similar).
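To make "convex" a bit more concrete, here's a quick numeric sketch of my own (not from the original article): a twice-differentiable univariate function is convex when its second derivative is non-negative everywhere, which we can approximate with finite differences.

```python
def f(x):
    return x ** 2  # a simple convex, differentiable function

def second_derivative(f, x, h=1e-5):
    # Central finite-difference approximation of f''(x)
    return (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2

# f''(x) = 2 >= 0 everywhere, so f is convex
for x in [-3.0, 0.0, 2.5]:
    assert second_derivative(f, x) >= 0
```

For f(x) = x², the second derivative is the constant 2, so the check passes at every sampled point.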

Now, after this simple introduction, we can start to dig a little deeper into the mathematics behind it.

Case study

Because everything becomes clearer when we go beyond theory, let's use real numbers and values to understand what it does.

Let’s use a common data science case where we want to develop a regression model.

Disclaimer: I totally made this up and there is no logical reasoning behind using these features, it just came out of nowhere. The goal is to show the process itself.

The cost function or loss function in any data science problem is the function we want to optimize. As we use regression, we will use this:

f(x, y) = (x² − a)² + y²

Random Regression Function — Image by author

The goal is to find the optimal minimum of f(x,y). Let me draw what it looks like:

f(x,y) represented with a=1 — Image by the author

Now our goal is to get the right values for x and y that will allow us to find the optimal values for this cost function. We can already see them graphically:

  • y=0
  • x being -1 or 1
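We can sanity-check those two candidates numerically. Assuming a = 1, the cost function is f(x, y) = (x² − 1)² + y² (a quick sketch of my own):

```python
def f(x, y):
    # The cost function with a = 1
    return (x ** 2 - 1) ** 2 + y ** 2

# Both candidate minima reach the smallest possible value, 0
print(f(1, 0))    # 0
print(f(-1, 0))   # 0
# Any nearby point is strictly worse
print(f(0.5, 0.2))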

This is where GD itself comes in, because we want our machine to learn to do the same.

The algorithm

As mentioned, gradient descent is an iterative process where we calculate the gradient and move in the opposite direction. The reasoning behind this is that the gradient of a function is used to determine the slope of that function. As we want to go down, not up, then we move in the opposite direction.

It’s a simple process where we update x and y at each iteration, following this approach:

x_{k+1} = x_k − λ · ∂f/∂x (x_k, y_k)
y_{k+1} = y_k − λ · ∂f/∂y (x_k, y_k)

Updating parameters in gradient descent — Image by author

Explained in words, at iteration k:

  1. Compute the gradient using the values of x and y at this iteration.
  2. For each of these variables, x and y, multiply its partial derivative by lambda (𝜆), which is a float called the learning rate.
  3. Subtract from x and y respectively the values calculated in step 2.
  4. Use these new values of x and y in the next iteration.
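The steps above can be sketched for a single iteration. This is my own worked example, assuming f(x, y) = (x² − 1)² + y², a learning rate of 0.01 and a starting point of (2, 1):

```python
# One gradient-descent step on f(x, y) = (x**2 - 1)**2 + y**2
lr = 0.01
x, y = 2.0, 1.0

# Step 1: compute the gradient at the current point
grad_x = 4 * x * (x ** 2 - 1)   # d f / d x = 24 at x = 2
grad_y = 2 * y                  # d f / d y = 2 at y = 1

# Steps 2-4: scale by the learning rate, subtract, and update
x = x - lr * grad_x   # 2.0 - 0.24 = 1.76
y = y - lr * grad_y   # 1.0 - 0.02 = 0.98
print(x, y)
```

Both coordinates move slightly toward the minimum at (1, 0); repeating this step many times is exactly what the algorithm does.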

This process is repeated until a certain stopping condition is met (it is not important today). Once this happens, the training is over and so is the optimization: we are (or should be) at a minimum (local or global).

Now, let’s put this theory into practice.

The first thing we need to do is calculate the gradient of f(x,y). The gradient corresponds to a vector of partial derivatives:

∇f(x, y) = ( 4x(x² − 1), 2y )

Gradient of f(x,y) — Image by the author
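A handy way to verify partial derivatives like these is a finite-difference check, comparing the analytic formula against a numeric approximation at a sample point (my own sketch, not part of the original article):

```python
def f(x, y):
    return (x ** 2 - 1) ** 2 + y ** 2

def df_x(x):
    # Analytic partial derivative: d f / d x = 4x(x**2 - 1)
    return 4 * x * (x ** 2 - 1)

# Central-difference approximation of d f / d x at (0.7, -0.3)
h = 1e-6
x0, y0 = 0.7, -0.3
numeric = (f(x0 + h, y0) - f(x0 - h, y0)) / (2 * h)

# The two values should agree up to the discretization error
print(numeric, df_x(x0))
```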

Now, using Python, all I’m going to do is create a loop that iteratively calculates the gradient, using the current x and y, and updates those parameters as specified above.

Before that, I’ll define two more values:

  • The learning rate (𝜆) can be fixed or variable. For this simple tutorial, it will be 0.01.
  • I’ll also use a value called eps (epsilon) to determine when to stop iterating. Once both partial derivatives are below this threshold, gradient descent will stop. I set it to 0.0001.

Now, let’s do some code:

import random

# Define constants
eps = 0.0001  # stopping threshold
lr = 0.01     # learning rate (lambda)

# Initialize x and y with random values
x = random.uniform(-2, 4)
y = random.uniform(-1, 1)

def f(x, y):
    return (x**2 - 1)**2 + y**2

def df_x(x):
    # Partial derivative of f with respect to x
    return 4*x*(x**2 - 1)

def df_y(y):
    # Partial derivative of f with respect to y
    return 2*y

# Perform gradient descent until both partial derivatives
# are below the threshold (in absolute value)
while max(abs(df_x(x)), abs(df_y(y))) >= eps:
    x = x - lr * df_x(x)
    y = y - lr * df_y(y)

# Print optimal values found
print(f'x = {x}, y = {y}')

The output of a random iteration was:

GD output sample — Image by author

We can see that these values are quite close to x = 1 and y = 0, which were in fact minima of the function.

One thing I forgot to mention is the initialization of x and y. I chose to randomly generate a number within arbitrary ranges. In real-world problems, you should spend more time thinking about it. The same goes for the learning rate, the stopping condition and many other hyperparameters.

But for our case, this was more than enough.

Gradient descent variations

I am sure you now understand the basic algorithm. However, several versions are in use and I think a few are worth mentioning.

  • Stochastic Gradient Descent (SGD). SGD is the variation that randomly chooses a data point from the entire data set at each iteration. This reduces the number of calculations, but obviously has its drawbacks, such as not being able to converge to the global minimum.
  • Batch Gradient Descent (BGD). BGD uses the entire data set in each iteration. This is not entirely desired for large data sets, as it can be computationally expensive and slow, but on the other hand, convergence to the global minimum is theoretically guaranteed.
  • Mini Batch Gradient Descent (MBGD). This can be considered an intermediate point between SGD and BGD. It does not use one data point at a time or the entire data set, but a subset of it. At each iteration, we choose a random number of samples (predefined) and perform gradient descent using only these.
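To illustrate the mini-batch idea, here's a minimal sketch of my own (not from the article): fitting y = 3x with a mean-squared-error loss, updating a single weight w using random batches of 16 samples at a time.

```python
import random

# Synthetic, noiseless data: y = 3 * x
random.seed(0)
data = [(x, 3.0 * x) for x in [random.uniform(-1, 1) for _ in range(200)]]

w = 0.0          # the single parameter we are learning
lr = 0.1         # learning rate
batch_size = 16  # mini-batch size

for epoch in range(50):
    random.shuffle(data)
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        # Gradient of the MSE loss (1/n) * sum((w*x - y)**2) w.r.t. w
        grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        w -= lr * grad

print(round(w, 3))  # w should be very close to 3.0
```

Each update only touches one batch, so iterations are cheap like SGD's, while averaging over 16 samples keeps the gradient less noisy than a single data point would.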


The Gradient Descent algorithm is widely used in machine learning and deep learning, but also in other areas. That’s why understanding it is imperative for anyone who wants to become a data scientist.

I hope this post clarifies what it is, what it does and how it does it.

Thanks for reading the post! I really hope you enjoyed it and found it insightful.

Follow me for more content like this one, it helps a lot!

If you want to support me further, you can sign up for Medium’s Membership via the link below – it won’t cost you an extra cent, but it will help me in this process. Thank you so much!


[1] Gradient descent – Wikipedia



