
Tumbling down the SGD rabbit hole — part 1

10 min read · Mar 14, 2018

As was to be feared, my Deep Learning excursions have led me to increasingly strange places. Lately, I’ve become quite obsessed with optimizers, maybe the single most important factor in successfully training a Deep Learning model.

This post assumes a basic knowledge of Deep Learning (if you’re new to the topic, this previous post is probably a better one to read).

We shall start with a basic explanation of Stochastic Gradient Descent, the most commonly used optimization algorithm in Deep Learning. And then, we’ll fall into darkness :D

It’s a long way down

What Deep Learning really is

Deep Learning boils down to:

  • building a data set,
  • designing a complicated function with a large number of parameters (the neural network),
  • using the data set to find a set of parameters yielding a level of accuracy in line with our expectations.

That’s it. Zooming in on the last item, what we’re really trying to do is find a set of parameters for which the loss function (which measures the difference between ground truth and predicted values) reaches a small enough value.

This set of parameters cannot be computed algebraically: it has to be approximated iteratively using an optimization function.

As Sebastian Ruder puts it: “Deep Learning ultimately is about finding a minimum that generalizes well, with bonus points for finding one fast and reliably”. Indeed, we’re not necessarily looking for the absolute minimum, just for one that would generalize “well enough” to solve our business problem.

The SGD optimizer

Stochastic Gradient Descent was invented in 1951 by Robbins and Monro. It’s still very widely used today. All it requires is the ability to differentiate a function, plus a single hyper-parameter called the learning rate (noted ‘lr’ below: 0.1 is a typical value).

Here’s the SGD update rule when the function has a single parameter called ‘w’: w = w-(lr*f’(w))

In plain-speak: we compute the function derivative at our present location. This tells us what the slope is and which way is down. Accordingly, we then take a small step down and repeat until we get to the minimum value for f().

An example is the best way to explain it :)

SGD with a single parameter

Let’s apply this to a very simple function: f(x) = x². I hope that you’ll remember from high school that its derivative is 2x ;)

At x=1 (our starting point for this example), the derivative is equal to 2. The slope is positive, which means that we have to decrease x to get closer to the minimum. (If the slope were negative, we’d increase x instead.)

Remember, SGD updates parameters in the direction opposite to the slope.

Let’s apply SGD with a learning rate of 0.25: x = x-(0.25*2) = 0.5.

For x=0.5, the derivative is equal to 1. Let’s run another round of SGD:
x = x-(0.25*1) = 0.25.
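
Here’s the same loop as a minimal Python sketch (my own illustration, not code from the original post):

```python
# SGD on f(x) = x^2, whose derivative is f'(x) = 2*x
x, lr = 1.0, 0.25
for step in range(1, 6):
    x = x - lr * (2 * x)        # the SGD update rule
    print(f"step {step}: x = {x}")
# prints 0.5, 0.25, 0.125, 0.0625, 0.03125: halving towards the minimum at x = 0
```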

After a few iterations, this is what we get.

As you can see, we gradually get closer and closer to the function’s minimum, i.e. x=0. SGD works, woohoo.

…unless we pick a very large learning rate (causing divergence) or a very small one (causing intolerably slow convergence). “Neither too small nor too large” seems to be the mantra of Deep Learning practitioners :)
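
To make the failure modes concrete: on f(x) = x² with lr = 1.1, the update becomes x = x-(1.1*2x) = -1.2x, so each step multiplies x by -1.2 and the iterates oscillate around 0 with ever-growing magnitude. Conversely, with lr = 0.001, each step only shrinks x by a factor of 0.998, and it takes about 350 steps to achieve what a single step did with lr = 0.25.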

SGD with many parameters

In a deep neural network, each weight is a parameter. As a consequence, the network’s function has as many dimensions as the network has weights… which could be millions or even tens of millions. Mind-boggling.

Still, SGD works the same. However, for additional performance, we don’t want to update weights one at a time: during the training phase, backpropagation instead updates all the weights in a given layer at once, starting from the last layer and moving backwards until it reaches the input layer.

To do so, let’s define a vector storing the weights for this layer:
L = [w1, w2, …]

Similarly, let’s define a vector storing the partial derivatives of our function. This vector is called the gradient (yes, that’s the ‘G’ in ‘SGD’): it tells us the slope in each dimension.

grad(f)(L) = [df/dw1, df/dw2, …]

Fortunately, Deep Learning libraries like Apache MXNet or TensorFlow know how to compute partial derivatives automatically, so we’ll never have to do that ourselves. Phew.
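
For instance, here’s what automatic differentiation looks like with TensorFlow’s GradientTape API (a minimal sketch against a recent TensorFlow 2 release; MXNet’s autograd module works along the same lines):

```python
import tensorflow as tf

# Automatic differentiation of f(x) = x^2 at x = 1.0
x = tf.Variable(1.0)
with tf.GradientTape() as tape:
    y = x * x
print(tape.gradient(y, x))  # 2.0, i.e. f'(1) = 2
```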

The SGD update now becomes: L = L-(lr*grad(f)(L))
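
In NumPy terms, that vectorized update is a one-liner (a sketch, where grad_f stands in for whatever backpropagation hands us):

```python
import numpy as np

# One vectorized SGD step: all weights in the layer move at once.
def sgd_step(L, grad_f, lr=0.1):
    return L - lr * grad_f(L)

# Toy example: f(w1, w2) = w1^2 + w2^2, whose gradient is [2*w1, 2*w2]
L = np.array([1.0, -2.0])
print(sgd_step(L, lambda w: 2 * w, lr=0.25))  # [ 0.5 -1. ]
```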

Here’s a simulated example with two parameters, where we literally “walk down the mountain” step by step to reach a low point in the “valley”.

z = x² - xy. Generated with love at academo.org.

See? It’s not that complicated: you’ve now conquered SGD. No matter the number of dimensions, we have a simple iterative algorithm that lets us find the smallest value possible for our function.

Does this really work every single time? Well… no :)

Problem #1: local minima

A function may exhibit many local minima. Look at the weird beast below.

z=(cos(x)/x)*(sin(y)/y). Generated with love at academo.org.

Lots of very shallow local minima, some deeper ones, and a global one. As we iterate with SGD, we’d definitely want to avoid getting stuck in any of the shallow ones; this could happen if we used a really small learning rate, preventing us from crawling out. Ideally, we’d like to end up in the global minimum, but we could fall into a deep local one instead.

Is this a theoretical problem? Do these situations actually take place with deep neural networks? If so, would there be any way to catapult ourselves out of these local minima?

Problem #2: slow convergence

Imagine a part of the parameter space where the slope is close to zero in every dimension. Let’s call this a plateau, even if the word doesn’t really make sense in high-dimensional spaces.

There, all components of the gradient would be close to zero, right? Hardly any slope. The consequence would be near-zero updates for all weights, which in turn would mean that we would hardly move towards the minimum. We’d be stuck on that plateau for a long time and training would be extremely slow, no matter how much hardware we’d throw at it. Definitely an undesirable situation (unless we’d reached a nice minimum, of course).

Could we speed up, i.e. increase the learning rate, when the slope is minimal?

Problem #3: different slopes

Should we really expect all dimensions to have roughly the same slope? Or could there be steep dimensions — where we’d make good progress quickly — and flatter dimensions — where we’d move much slower?

That’s surely a reasonable hypothesis. As SGD uses the same learning rate for all parameters, this would cause uneven progress and slow down the training process.
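
For example, take f(x, y) = 100x² + y²: the gradient is [200x, 2y]. Around the point (1, 1), the slope along x is 100 times steeper than along y, so a learning rate small enough to avoid overshooting along x makes painfully slow progress along y.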

Could we adapt the learning rate to the slope of each dimension?

Problem #4: saddle points

Now, imagine we’d reach a specific point in the parameter space where all components of the gradient are actually equal to zero (yes, these points actually exist, more on this in a minute). What would happen then? No more weight updates! We’d be stuck there for all eternity. Doom on us.

“There are more things in heaven and earth, Horatio, than are dreamt of in your philosophy”

Defeating the gradient

Let’s look at an example: f(x,y) = x² - y². This does look like a horse saddle, doesn’t it?

The horse saddle. Generated with love at academo.org.

Now, let’s compute the partial derivatives: df/dx = 2x, df/dy = -2y. Hence, the gradient of this function is [2x, -2y]. Obviously, at the point (0,0), both components of the gradient are equal to zero.

Looking at the graph above, we can indeed see that this point is a minimum along the x axis and a maximum along the y axis. Such points are called saddle points: a minimum along some dimensions and a maximum along others.

Should SGD lead us to this exact place, our training process would be toast as weights would not be updated any longer. Our model would probably be no good either since we’d definitely not have reached a minimum value.

Saddle points have been proven to be very common in Deep Learning optimization problems (we’ll see in the next post how to try and limit their occurrence). For now, let’s see if we can break out!

Enter the Hessian

Our problem here is that we’re only looking at unidirectional slopes. Instead, if we looked at the curvature around the saddle point (i.e. how much the surface deviates from a plane), then maybe we’d figure out that there’s actually a way down along the y axis.

Studying the curvature of a function requires the computation of second-order partial derivatives (ah come on, don’t quit on me now. Hang on!). They’re stored in a square matrix called the Hessian.

Let’s do this for our previous example:

  • d²f/dx² = 2, d²f/dxdy = 0
  • d²f/dydx = 0, d²f/dy² = -2

Thus, the Hessian for our function is: H = [[2, 0], [0, -2]].

By multiplying this matrix with unit vectors along the x and y axes, we’re going to find out what the curvature looks like:

  • H * [0, 1] = [0, -2] = -2 * [0, 1]
  • H * [0, -1] = [0, 2] = -2 * [0, -1]
  • H * [1, 0] = [2, 0] = 2 * [1, 0]
  • H * [-1, 0] = [-2, 0] = 2 * [-1, 0]

What do we see here? Multiplying H by a unit vector along the x axis yields a positive multiple of the vector: curvature is positive, we can only go up. Indeed, (0,0) is a minimum along the x axis.

On the contrary, multiplying H by a unit vector along the y axis yields a negative multiple of the vector. This indicates a negative curvature, which means that there is a way down. Victory!

A little more algebra (hate me, I don’t care): when a non-zero vector v and a square matrix M are such that M*v is a multiple of v, the vector v is called an eigenvector of M and the multiple is called an eigenvalue. So, whenever we encounter a saddle point, computing the Hessian and looking for a negative eigenvalue is how we can figure out which way is down. Ain’t life grand?
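
To make this concrete, here’s a quick NumPy check on the horse saddle’s Hessian (my own illustration):

```python
import numpy as np

# Hessian of f(x, y) = x^2 - y^2 (it happens to be constant everywhere)
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])

eigenvalues, eigenvectors = np.linalg.eigh(H)
print(eigenvalues)         # [-2.  2.]
print(eigenvectors[:, 0])  # eigenvector for -2: the y axis direction (up to sign)
# The negative eigenvalue says curvature is negative along y:
# that's our way down and out of the saddle.
```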

Now we know how to break out of saddle points. The only problem is that computing the Hessian is quite expensive. There are other ways to avoid saddle points, albeit less rigorous ones (more on this in the next post).

Are we out of the woods when it comes to saddle points? Far from it, I’m afraid.

“Hell is empty and all the devils are here.”

Defeating the Hessian

Here’s an example, called the monkey saddle: f(x,y) = x³ - 3xy².

The monkey saddle. Generated with love at academo.org.

Let’s compute the gradient and the Hessian again.

The gradient is [3x² - 3y², -6xy], which is equal to [0, 0] at the point (0,0). This is a saddle point again.

Accordingly, the Hessian is [[6x, -6y], [-6y, -6x]].

H(0,0) is a zero matrix! Multiplying it by a unit vector (or any vector, for that matter) will result in a zero vector: we’re unable to figure out curvature. Neither the gradient nor the Hessian provide any information on which way is down: such points are degenerate saddle points. Oh, the agony…
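
We can verify all of this symbolically, for instance with SymPy (a quick sketch):

```python
import sympy as sp

x, y = sp.symbols('x y')
f = x**3 - 3 * x * y**2        # the monkey saddle

H = sp.hessian(f, (x, y))
print(H)                        # Matrix([[6*x, -6*y], [-6*y, -6*x]])
print(H.subs({x: 0, y: 0}))     # Matrix([[0, 0], [0, 0]]): no curvature information
```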

Should we lose all hope? Are there more advanced techniques to detect these conditions?

Problem #5: gradient size & distributed training

This last problem is different from the previous ones. Imagine that we’re working with a very large data set: distributed training (splitting computation across multiple instances) would surely deliver a nice speedup.

The way this typically works is quite straightforward. Each instance in the training cluster grabs a batch of samples and processes it (forward propagation to compute the loss for the batch, then backpropagation to compute gradients for all layers). Then, each instance sends the gradients computed for this batch to all other instances (or to a central location in charge of aggregating and redistributing them). Finally, each instance updates its weights accordingly.
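
Here’s a toy NumPy simulation of one synchronous step (my own sketch; real clusters use optimized all-reduce implementations rather than averaging in a single process):

```python
import numpy as np

# Toy simulation of one synchronous data-parallel SGD step with 4 instances.
n_instances, lr = 4, 0.1
weights = np.zeros(3)                 # 3 toy parameters, identical on every instance

rng = np.random.default_rng(0)
# Stand-ins for the gradients each instance computed on its own batch
local_grads = [rng.standard_normal(3) for _ in range(n_instances)]
avg_grad = np.mean(local_grads, axis=0)  # the gradient exchange (all-reduce) step
weights -= lr * avg_grad                 # every instance applies the same update
print(weights)
```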

All good, right? Not quite. For large models, gradients are actually very heavy: close to 100MB for ResNet-50, a popular image classification model. This means that each instance would have to send 100MB to all other instances after each round of backpropagation! Even with a limited number of instances, this could quickly become a performance bottleneck, stalling our computing efforts and slowing down the training process.
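
The back-of-the-envelope arithmetic checks out: ResNet-50 has about 25.6 million parameters, and at 4 bytes per 32-bit float, that’s 25.6M * 4 ≈ 102MB of gradients to ship after every batch.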

Could we reduce the gradient size? Compress this data, maybe?

Now what?

In this post, we looked at the mathematical foundation of SGD. We learned about mathematical tools that help us find which way is down in the quest for a nice minimum. This was a bit math-heavy, but hopefully you made it to the end. Remember, you can do this! Stephen Hawking passed away yesterday and he once said:

Equations are just the boring part of mathematics. I attempt to see things in terms of geometry.

I feel that this is great advice when trying to understand Deep Learning.

In the next post, we’ll look at solutions for the issues discussed above…and how to apply them in practice.

Thank you for reading.

In search of the global minimum, we might have to break out from the bottom of the well indeed \m/

Julien Simon