Transcription of: L11.6 Xavier Glorot and Kaiming He Initialization




Unknown: Yeah, let's now take a look at two of the most common weight initialization schemes for deep neural networks. The first one we start with is called Xavier Glorot initialization. Sometimes people just say Xavier initialization, sometimes people say Glorot initialization; the name comes from the first author of the paper that proposed this method. Usually, this type of initialization is used in connection with the tanh function, the hyperbolic tangent activation function. Recall that this is also a sigmoidal activation function, similar to the logistic sigmoid, except that the output is centered at zero. So where the logistic sigmoid is centered at 0.5, the tanh output ranges between minus one and one and is centered at zero. And if you recall, the partial derivative of the tanh function with respect to its input is one at the center. So instead of 0.25 for the logistic sigmoid, the derivative of tanh at its highest point, in the center, is one. In this case, we have less of a vanishing gradient problem compared to the logistic sigmoid. But of course, it still has the problem of saturation near extreme values: at those points we still have a very small or zero gradient, which is still a problem. And yeah, Xavier initialization can act as a small improvement to prevent these extreme values.
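Just to make the derivative comparison concrete, here is a short worked sketch of the standard formulas (these are textbook identities, not something specific to the paper or the slides):

```latex
\[
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad
\sigma'(z) = \sigma(z)\bigl(1 - \sigma(z)\bigr), \qquad
\sigma'(0) = 0.5 \cdot 0.5 = 0.25
\]
\[
\tanh'(z) = 1 - \tanh^{2}(z), \qquad
\tanh'(0) = 1 - 0^{2} = 1
\]
```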

essentially a two step procedure, the first step is to

initialize the weights from Gaussian or random uniform

distribution. And then the weights in the second step are

scaled proportional to the number of inputs to that given

layer. And therefore, the first hidden layer, the number of

inputs would be the number of features in the data set. And

then yet at the second layer, that would be the number of

units in the first hidden layer and so forth. So here's how it

looks like. So the assume that the weight is initialized from a

Gaussian distribution with mean zero and a small variance. And

it's let's say, this is our or weight matrix from this Gaussian

distribution, or random could also be a random normal, uniform

distribution, sorry. And then you scale it by a factor of the

square root of one over m l minus one. So what is M, M is

the feature, the more features, and l minus one is the layer

index plus or minus one means the number of features in the

previous layer. So if I have set up like this, and then

everything is of course, connected to each other. So if I

initialize the weights here, um, I want to initialize these

weights here. They are initialized based on the number

of features here. Yeah, and if you didn't initialize the bias

units to all zeros, you can also include those in the scaling.

But yeah, it's fine to initialize the bias units to all

zeros. I'm just saying, If you don't, then I recommend also,

including those in that scaling arm. Yeah, what is the rationale
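Here is a minimal sketch of this two-step procedure in PyTorch, assuming a hypothetical layer with 50 inputs and 25 outputs (the tensor shapes and the 0.01 standard deviation are just illustrative choices, not values from the lecture):

```python
import math
import torch

def xavier_scale_(weight: torch.Tensor, fan_in: int) -> torch.Tensor:
    """Step 2 of the procedure: rescale already-initialized weights
    by sqrt(1 / m^(l-1)), where m^(l-1) is the number of inputs
    (units in the previous layer)."""
    with torch.no_grad():
        weight *= math.sqrt(1.0 / fan_in)
    return weight

# Step 1: draw weights from a zero-mean Gaussian with a small variance.
# Shape convention here: (units in this layer, units in previous layer).
w = torch.randn(25, 50) * 0.01

# Step 2: scale proportionally to the number of inputs of this layer.
w = xavier_scale_(w, fan_in=50)

# Bias units are typically initialized to zeros, so they need no scaling.
b = torch.zeros(25)
```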

So what is the rationale behind applying this scaling factor? It goes back to an assumption about the net input: when we compute the net input, we have a multiplication between the weights and the activations from the previous layer, and you can think of these terms as independent. Now, if the number of units in the previous layer increases (think of it like an increasing sample size), the variance of the sum grows, because for independent variables you are just adding up the variances; the variance of a sum of independent variables is the sum of the variances. So we compensate with a scaling by one over m, where m is the number of units in the previous layer (the number of input features to the layer, to be more precise), and the square root is there because we are scaling the standard deviation rather than the variance.

Alright, here's just a very brief sketch of what I meant. If we look at the variance of the net input, focusing on one unit, the net input is the sum over the weights times the activations from the previous layer; that's nothing new. Since the variance of a sum of independent variables is the sum of the variances, we can rewrite the variance of the net input as a sum of variance terms. And because these terms are identically distributed across positions, instead of summing over them I can just say it's m times this product; that's where the m comes from. We then scale back by one over m, and the square root appears because we apply the scaling to the standard deviation. But don't worry about it too much: you can think of it more broadly as a scaling factor that accounts for the number of features that go into a given layer. That's the main message here, that we take the number of input features into consideration.
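Written out, the brief sketch above looks roughly like this (assuming zero-mean, independent weights and activations; this is my reconstruction of the argument, not a verbatim copy of the slide):

```latex
\[
\mathrm{Var}\bigl(z^{(l)}\bigr)
  = \mathrm{Var}\Bigl(\sum_{j=1}^{m^{(l-1)}} w_j a_j\Bigr)
  = \sum_{j=1}^{m^{(l-1)}} \mathrm{Var}(w_j a_j)
  = m^{(l-1)} \,\mathrm{Var}(w)\,\mathrm{Var}(a)
\]
% Scaling the weights by \sqrt{1/m^{(l-1)}} divides Var(w) by m^{(l-1)},
% which keeps the variance of the net input roughly constant across layers.
```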

This number of features from the previous layer is sometimes also called fan-in, and I honestly have no idea why it's called that. It's another term that is commonly used; there's a term fan-in and there's also the term fan-out. Everyone has been using these terms for something like ten years now, but I never understood where the names come from. Alright, so in practice, if you look at some initialization schemes, people sometimes also include the number of output units; for example, they would write it with m^(l), the number of features that goes out of that given layer. So that's also something people sometimes do, and in practice you will see variants that use the fan-in alone as well as variants that use both, as shown in the sketch after this paragraph.
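As a hypothetical illustration of the fan-in / fan-out terminology (the helper function and the averaged variant shown here are my own sketch, not something defined in the lecture):

```python
import math
import torch

def fan_in_fan_out(weight: torch.Tensor) -> tuple[int, int]:
    """For a weight matrix of shape (units_out, units_in):
    fan-in  = number of inputs feeding into this layer,
    fan-out = number of outputs leaving this layer."""
    units_out, units_in = weight.shape
    return units_in, units_out

w = torch.randn(25, 50) * 0.01
fan_in, fan_out = fan_in_fan_out(w)   # fan_in = 50, fan_out = 25

# Variant 1: scale by the fan-in only, as described above.
scale_in = math.sqrt(1.0 / fan_in)

# Variant 2: some schemes also include the fan-out,
# for example by averaging the two counts.
scale_both = math.sqrt(2.0 / (fan_in + fan_out))
```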

Yes, so here's a visualization from the Xavier initialization paper. What they are showing here is a network with tanh activations and without Xavier initialization, just for reference now. At the top there's a normalized histogram showing the activation values for the different layers in the network. You can see that in the early layers, like one or two, the activations are spread out more uniformly, whereas in the later layers they are largely zero; it just happens, given how the signal propagates through the network, that in later layers the activations are concentrated around zero. Consequently, when we do backpropagation, we apply the multivariable chain rule, and we usually go from right to left. So if I have a network, the forward pass goes from left to right, and in the backward pass I start with layer five and then multiply things together until I reach layer one, where I do the updates. What you can see here is that from layer five we get backpropagated gradients that look reasonable, ranging between minus 0.1 and 0.1. But the further back I go, the more I run into this vanishing gradient problem: the early layers get almost zero gradients most of the time, which can be a problem. In this sense, the network will mostly update only the later layers and almost ignore the earlier layers, which is not good if you want to train a neural network well. Then here's the visualization showing what I showed you on the previous slide, together with a version that uses Xavier initialization. You can see at the bottom that this looks much better: with Xavier initialization, the gradients are in a reasonable range for all the different layers. That's actually pretty nice, because it fixes the issue that some layers learn better than others.

Alright, so there's another initialization scheme. The Xavier initialization scheme assumed that you use the tanh activation function; there's also something called He initialization. The name comes from the fact that the first author of that paper is named Kaiming He. And like I said previously, we assumed that the activations have zero mean, which is reasonable when we use the tanh activation, because it's centered at zero. But for ReLU it's different, because the activations are not centered at zero anymore. If I use a ReLU function, I will only have non-negative values, because, unlike with something like a leaky ReLU, everything on the left side, left of zero, is set to zero. That way, we don't have the activations centered at zero anymore. So this paper proposes a method that works better with ReLU units, where the activations are not centered at zero. There's some complicated math in that paper; if you're interested, you can check it out. But the bottom line is that we just add a scaling factor of two inside the square root. Sometimes people call that a gain: we add a gain of square root of two, and we can just put it inside the scaling factor. So essentially it's just an additional scaling factor, and it addresses the issue of the activations not being centered at zero in the case of ReLU.
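A minimal sketch of how this gain changes the scaling from before (again assuming the same hypothetical layer shape as above; the factor of sqrt(2) is the only difference):

```python
import math
import torch

# Same hypothetical layer as before: 50 inputs, 25 outputs.
fan_in, fan_out = 50, 25

# Step 1: small zero-mean Gaussian weights.
w = torch.randn(fan_out, fan_in) * 0.01

# Xavier/Glorot scaling (tanh): sqrt(1 / fan_in)
w_xavier = w * math.sqrt(1.0 / fan_in)

# He/Kaiming scaling (ReLU): add a gain of sqrt(2), i.e., sqrt(2 / fan_in)
w_he = w * math.sqrt(2.0 / fan_in)
```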

Okay, so yeah, you don't have to worry about these types of things too much if you just use regular neural networks, because there are reasonable defaults in PyTorch. This video was more about illustrating that different weight initialization schemes exist. In practice, it is a good idea to choose a good initialization scheme, but these days the frameworks mostly handle this automatically pretty well. So in the next video, I will show you how this is done in PyTorch, and how we can change the initialization scheme in PyTorch.
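For reference, both schemes are available as built-in PyTorch initializers; here is a small hedged example (the layer sizes are arbitrary, and the next video covers this in more detail):

```python
import torch

# A hypothetical fully connected layer with 50 inputs and 25 outputs.
layer = torch.nn.Linear(50, 25)

# Xavier/Glorot initialization (commonly paired with tanh):
torch.nn.init.xavier_uniform_(layer.weight)

# He/Kaiming initialization (commonly paired with ReLU):
torch.nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')

# Biases are often simply set to zero.
torch.nn.init.zeros_(layer.bias)
```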