A quick idea.
Why not check whether there is some probability function for this, or whether you can step back an iteration and select a different activation function, one that achieves a much higher loss reduction?
That is: you have many activation functions available to the model, but you can't use all of them at once, so you have to select one. I guess you could test a few candidates over some small number of iterations. Once you have found the one with the biggest loss reduction, go one big iteration further with it and repeat the process.
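To make the loop concrete, here is a rough sketch in plain Python. Everything in it is an assumption for illustration, not a tested recipe: the tiny one-neuron model, the candidate set (tanh, ReLU, sigmoid), the names `train` and `greedy_activation_search`, and the step counts. It ranks candidates by the loss they reach after the short trial, which matches "biggest loss reduction" as long as the reduction is measured from one common baseline.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


# Candidate activations as (function, derivative) pairs. This set is hypothetical.
ACTIVATIONS = {
    "tanh": (math.tanh, lambda z: 1.0 - math.tanh(z) ** 2),
    "relu": (lambda z: max(0.0, z), lambda z: 1.0 if z > 0 else 0.0),
    "sigmoid": (sigmoid, lambda z: sigmoid(z) * (1.0 - sigmoid(z))),
}


def loss(params, data, act_name):
    """Mean squared error of the toy model w2 * act(w1 * x + b1) + b2."""
    act, _ = ACTIVATIONS[act_name]
    total = 0.0
    for x, y in data:
        pred = params["w2"] * act(params["w1"] * x + params["b1"]) + params["b2"]
        total += (pred - y) ** 2
    return total / len(data)


def train(params, data, act_name, steps, lr=0.05):
    """Run full-batch gradient descent on a copy of params and return the copy."""
    act, dact = ACTIVATIONS[act_name]
    p = dict(params)
    n = len(data)
    for _ in range(steps):
        grad = {k: 0.0 for k in p}
        for x, y in data:
            z = p["w1"] * x + p["b1"]
            h = act(z)
            err = 2.0 * (p["w2"] * h + p["b2"] - y) / n
            grad["w2"] += err * h
            grad["b2"] += err
            dz = err * p["w2"] * dact(z)
            grad["w1"] += dz * x
            grad["b1"] += dz
        for k in p:
            p[k] -= lr * grad[k]
    return p


def greedy_activation_search(data, params, rounds=5, trial_steps=10, commit_steps=100):
    """Each round: short trial per candidate, then a long run with the winner."""
    history = []
    for _ in range(rounds):
        # Every candidate starts its trial from the same weights.
        trial_losses = {
            name: loss(train(params, data, name, trial_steps), data, name)
            for name in ACTIVATIONS
        }
        best = min(trial_losses, key=trial_losses.get)
        # The "big iteration": discard the trials and train longer with the winner.
        params = train(params, data, best, commit_steps)
        history.append((best, loss(params, data, best)))
    return params, history


if __name__ == "__main__":
    # Toy regression target, chosen only for the demo.
    data = [(x / 5.0, math.tanh(2 * x / 5.0)) for x in range(-10, 11)]
    start = {"w1": 0.5, "b1": 0.1, "w2": 0.5, "b2": 0.0}
    _, history = greedy_activation_search(data, start)
    for name, value in history:
        print(name, round(value, 4))
```

One design choice worth noting: the trial runs are thrown away and the long run restarts from the pre-trial weights, so no candidate gets a head start. Keeping the winner's trial weights instead would save a few steps. The sketch also swaps the activation for the whole model at once; picking per layer would multiply the number of trials.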