I have a tensor X of size M x D. We can interpret each row of X as a training sample and each column as a feature.
X is used to compute a tensor u of size M x 1 (in other words, u depends on X in the computational graph). We can interpret this as a vector of predictions, one for each sample. In particular, the m-th row of u is computed using only the m-th row of X.
Now, if I run tf.gradients(u, X), I obtain an M x D tensor corresponding to the "per-sample" gradient of u with respect to X.
How can I similarly compute the "per-sample" Hessian tensor? (i.e., an M x D x D quantity)
Addendum: Peter's answer below is correct. I also found a different approach using stacking and unstacking (using Peter's notation):
hess2 = tf.stack([tf.gradients(tmp, a)[0] for tmp in tf.unstack(grad, num=5, axis=1)], axis=2)
In Peter's example, D=5 is the number of features.
I suspect (but I have not checked) that the above is faster for large M, as it skips over the zero entries mentioned in Peter's answer.
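For reference, in TensorFlow 2 the same M x D x D quantity can be obtained with nested tf.GradientTape contexts and batch_jacobian, which computes the Jacobian of each row of the per-sample gradient with respect to the corresponding row of X only. A minimal sketch, where the prediction function u_m = (x_m . w)^2 and the sizes M=4, D=5 are purely illustrative:

```python
import tensorflow as tf

M, D = 4, 5
X = tf.random.normal([M, D])
w = tf.random.normal([D, 1])

with tf.GradientTape() as outer:
    outer.watch(X)
    with tf.GradientTape() as inner:
        inner.watch(X)
        # Illustrative prediction: u_m = (x_m . w)^2, one scalar per sample
        u = tf.square(tf.matmul(X, w))   # shape (M, 1)
    # Per-sample gradient: row m depends only on row m of X
    grad = inner.gradient(u, X)          # shape (M, D)

# Per-sample Hessian: hess[m] is the D x D Hessian of u_m w.r.t. x_m
hess = outer.batch_jacobian(grad, X)     # shape (M, D, D)
print(hess.shape)
```

For this particular u, every per-sample Hessian equals 2 w w^T, which gives an easy sanity check. Like the stack/unstack approach, batch_jacobian avoids materializing the full (M*D) x (M*D) Jacobian with its off-diagonal zero blocks.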