Data is sometimes sorted in chronological (or some other systematic) order. If you take the gradient of a minibatch whose samples are highly correlated, you get a biased estimate of the true (full-batch) gradient. By shuffling the data, you make each minibatch a more representative sample of the whole dataset, so the gradient estimate is more accurate and the updates move more reliably towards the true minimum.
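Here's a minimal sketch of that effect on toy data (linear regression with a squared-error loss; the data, model, and batch size are made up for illustration). A contiguous minibatch from sorted data gives a gradient far from the full-batch gradient, while a shuffled minibatch lands much closer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data "sorted chronologically": later samples have larger targets,
# so any contiguous minibatch is highly correlated.
X = np.arange(100, dtype=float).reshape(-1, 1)
y = 3.0 * X.ravel()

def minibatch_grad(Xb, yb, w):
    """Gradient of the mean squared error 0.5*mean((Xb@w - yb)**2) w.r.t. w."""
    return Xb.T @ (Xb @ w - yb) / len(yb)

w = np.zeros(1)  # initial weights

# Without shuffling: the first minibatch only sees the smallest samples.
g_sorted = minibatch_grad(X[:10], y[:10], w)

# With shuffling: a random minibatch is a far less biased sample.
perm = rng.permutation(len(X))
Xs, ys = X[perm], y[perm]
g_shuffled = minibatch_grad(Xs[:10], ys[:10], w)

# The "true" gradient uses the entire dataset.
g_full = minibatch_grad(X, y, w)

print("error (sorted):  ", abs(g_sorted - g_full)[0])
print("error (shuffled):", abs(g_shuffled - g_full)[0])
```

The shuffled minibatch's gradient error should be orders of magnitude smaller than the sorted one's, which is exactly the bias that shuffling removes.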
Your question actually made me look up the Deep Learning book (Goodfellow et al.) to see if there are any other reasons. I've posted an excerpt below; hope this helps.
- Larger batches provide a more accurate estimate of the gradient, but with less than linear returns.
- Small batches can offer a regularizing effect (Wilson and Martinez, 2003), perhaps due to the noise they add to the learning process.