Google App Engine – TransientError in Cloud
We were engulfed by a blizzard of Google task queue ‘TransientErrors’ over the weekend. The ‘TransientError’ is one of the most elusive Google App Engine errors. This is what the official documentation has to say about it:
exception TransientError(Error)
There was a transient error while accessing the queue. Please try again later.
This very detailed description has left many scratching their heads, wondering (a) why they are getting the errors and (b) what they should do about them!
So far, the best description I could find of the error is here:
http://osdir.com/ml/GoogleAppEngine/2009-06/msg01337.html
A TransientError is an unexpected but transient failure. Typically it
is a deadlined or otherwise missed connection between two backends in
our system. We distinguish TransientError from InternalError to say
that transient errors failed but were expected to succeed (and
retrying will probably work), whereas InternalError is quite
unexpected and retries will have no effect.
This seems to suggest that you shouldn’t worry too much about them, as they happen rarely and the automatic retry should work – no harm done…
In our experience, however, the error is far from transient: it often occurs in batches, with the retries hitting the same error. The whole thing tends to snowball until our app grinds to a halt. It is as if something fairly serious goes wrong inside the Google cloud and things just stop working.
Does anybody know if there is any reliable way of handling these errors when they occur?
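One mitigation that is commonly suggested for this kind of failure is to retry with exponential backoff and random jitter, so that a batch of failed calls does not all hit the service again at the same moment. I can't promise it cures the batch failures described above; it is only a minimal sketch. The `call_with_backoff` helper, its parameters, and the local `TransientError` class are hypothetical names of mine, with the class standing in for the SDK's `taskqueue.TransientError` so that the snippet runs outside App Engine.

```python
import random
import time


class TransientError(Exception):
    """Local stand-in for the SDK's taskqueue.TransientError (assumption:
    defined here only so the sketch runs outside App Engine)."""


def call_with_backoff(func, retry_on=(TransientError,), max_attempts=5,
                      base_delay=0.5, max_delay=30.0, sleep=time.sleep):
    """Call func(), retrying on the given exceptions with exponential backoff.

    All parameter names and defaults are illustrative, not SDK settings.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                # Out of attempts: let the caller see the original error.
                raise
            # The delay ceiling doubles on each attempt, up to max_delay.
            cap = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: sleep a random time below the ceiling so that
            # many failing callers do not retry in lockstep.
            sleep(random.uniform(0, cap))
```

On App Engine you would presumably pass `retry_on=(taskqueue.TransientError,)` and wrap the enqueue call, for example `call_with_backoff(lambda: taskqueue.add(url='/worker'))`, where `/worker` is a made-up handler path. Keep `max_attempts` small inside a request handler, since the sleeps count against the request deadline.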
And more seriously, can an application be made reliable on the Google cloud with nasty undocumented errors like this cropping up in batches every few months? I suppose software development for cloud environments is still in its infancy, and developers and platform suppliers have a lot to learn, but unexplained stealth errors like the TransientError really don’t help the situation!