A quick speculation.
I speculated earlier that you need damping to stabilize a GAN, judging from its error graph: the loss oscillates up and down rather than settling.
If you apply the same logic to the layers of a deep network, there could exist models that experience internal oscillation between layers.
Then perhaps you need a smart damping function applied to the gradient updates.
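To make the damping idea concrete, here is a minimal toy sketch (my own illustration, not a reference to any specific GAN technique): smoothing the gradient with an exponential moving average, as momentum-style optimizers do, damps step-to-step oscillations while letting a consistent gradient direction through.

```python
def damped_update(param, grad, velocity, lr=0.01, beta=0.9):
    """One parameter update with EMA-damped gradients.

    The moving average `velocity` acts as the damper: when `grad`
    flips sign every step, consecutive contributions largely cancel,
    so the effective step size shrinks.
    """
    velocity = beta * velocity + (1.0 - beta) * grad
    param = param - lr * velocity
    return param, velocity

# Toy demo: a gradient that oscillates between +2 and -2 each step.
param, velocity = 1.0, 0.0
for step in range(50):
    grad = (-1.0) ** step * 2.0
    param, velocity = damped_update(param, grad, velocity)
# The raw gradient has magnitude 2, but the damped velocity stays
# far smaller, so the parameter barely moves despite the oscillation.
```

This is just ordinary momentum smoothing; a "smart" damper, as speculated above, might instead adapt `beta` per layer based on how oscillatory each layer's gradients look.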
I guess you get into trouble if the first or last layer is slow to learn during backpropagation. I suspect you get oscillations when the learning rates of different layers are not working well together.