Ms Aerin
1 min read · Aug 2, 2020


If you look at the softmax function, it divides each element by the sum over the whole vector, so every output depends on every input. That's why the shape of the gradient cannot be the same as the input's: you need to take the derivative of each output with respect to every element, so a single item (x_i) expands into a full row of derivatives. Read about the Jacobian matrix and you will get a better idea.
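
A minimal NumPy sketch (not from the original post, just an illustration) that makes the shape difference concrete: for an input of shape (n,), the softmax derivative is the Jacobian of shape (n, n), because every output s_i depends on every input x_j.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability; the output is unchanged.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def softmax_jacobian(x):
    # J[i, j] = d s_i / d x_j = s_i * (delta_ij - s_j)
    s = softmax(x)
    return np.diag(s) - np.outer(s, s)

x = np.array([1.0, 2.0, 3.0])
J = softmax_jacobian(x)
print(x.shape)  # (3,)   -- the input is a vector
print(J.shape)  # (3, 3) -- its derivative is a matrix (the Jacobian)
```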

The dimensions of an input and its derivatives don't have to match to perform backprop (it's basically the chain rule). Think about where backprop starts: the loss is a scalar, but the gradients flowing back are vectors/matrices.
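
A hedged sketch of that idea, using an illustrative cross-entropy loss for a single target class k (my choice, not the original post's): the loss is a scalar, the gradient w.r.t. the softmax output is a vector, and the chain rule multiplies it by the Jacobian (a matrix) to get the gradient w.r.t. the input.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([1.0, 2.0, 3.0])
s = softmax(x)

# Toy scalar loss: cross-entropy for one target class k.
k = 2
L = -np.log(s[k])                # scalar, shape ()

# Chain rule: dL/dx = J^T @ dL/ds, where J is the softmax Jacobian.
dL_ds = np.zeros_like(s)
dL_ds[k] = -1.0 / s[k]           # vector, shape (3,)
J = np.diag(s) - np.outer(s, s)  # matrix, shape (3, 3)
dL_dx = J.T @ dL_ds              # vector, shape (3,)

# Sanity check: matches the familiar softmax + cross-entropy gradient, s - onehot(k).
print(np.allclose(dL_dx, s - np.eye(len(x))[k]))  # True
```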
