For as long as we can remember, our load balancer has thrown the occasional 502. Roughly one in every 20,000 requests would fail — a tiny blip in the grand scheme of things.
Given the bigger infrastructure challenges we were dealing with, this wasn’t worth chasing – until it was.
In DevOps, even the smallest cracks eventually demand attention.
That moment came yesterday: one of our critical APIs failed – the first time in three years (that we know of).
We started investigating.
A quick Google search – “ELB gunicorn 502” – led us straight to the fix.
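Gunicorn was closing idle keep-alive connections well before ELB was ready to let go, so every now and then ELB would send a request down a connection that was already gone and answer with a 502.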
All we had to do was set a longer keep-alive value when starting Gunicorn:
gunicorn --keep-alive 65 <other-args>
The Gunicorn docs even recommend a longer value when running behind a load balancer!
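If you would rather keep the setting in a config file than on the command line, something like the following should do the same job. This is only a sketch: the 60-second ELB idle timeout is an assumption (check what your load balancer is actually set to), and ELB_IDLE_TIMEOUT_SECONDS is just a helper name made up for illustration. Only keepalive is a real Gunicorn setting here.

```python
# gunicorn.conf.py

# Assumption: the ELB idle timeout is 60 seconds.
# Check your load balancer's actual value before copying this.
ELB_IDLE_TIMEOUT_SECONDS = 60

# Seconds Gunicorn waits for the next request on a keep-alive connection.
# Keeping it above the ELB idle timeout means ELB closes idle connections
# first, instead of reusing one that Gunicorn has already dropped.
keepalive = ELB_IDLE_TIMEOUT_SECONDS + 5  # 65, same as the flag above
```

Point Gunicorn at it with gunicorn -c gunicorn.conf.py <other-args>.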
Five minutes later, the pull request was merged and deployed, and the 502s vanished immediately.
Sometimes, the fix really is on the first page of Google – if you just google for the right thing.