The One 502 in 20,000

Published on June 26, 2025

For as long as we can remember, our load balancer has thrown the occasional 502. Roughly one in every 20,000 requests would fail — a tiny blip in the grand scheme of things.

Given the bigger infrastructure challenges we were dealing with, this wasn’t worth chasing – until it was.

In DevOps, even the smallest cracks eventually demand attention.

That moment came yesterday: one of our critical APIs failed for the first time in three years (that we know of).

We started investigating.

A quick Google search – “ELB gunicorn 502” – led us straight to the fix.

The root cause?

  • ELB’s idle connection timeout is 60 seconds by default.
  • Gunicorn’s keep-alive timeout is just 2 seconds by default.

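Gunicorn was closing idle connections well before ELB was ready to let go. Whenever ELB forwarded a request over a connection that gunicorn had just closed, the request failed and ELB answered with a 502.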
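To make the mismatch concrete, here is a rough sketch in Python. The function name `backend_closes_first` is ours and purely illustrative. The model only compares the two timeouts and ignores network timing, so treat it as a sanity check rather than a description of how ELB behaves internally.

```python
# Simplified model: which side gives up on an idle keep-alive connection first?
ELB_IDLE_TIMEOUT_S = 60    # ELB default, as listed above
GUNICORN_KEEPALIVE_S = 2   # gunicorn default, as listed above


def backend_closes_first(lb_idle_timeout_s: float, backend_keepalive_s: float) -> bool:
    """Return True when the backend drops idle connections before the load balancer does."""
    return backend_keepalive_s <= lb_idle_timeout_s


# With the defaults, gunicorn hangs up first, which is the risky setup.
print(backend_closes_first(ELB_IDLE_TIMEOUT_S, GUNICORN_KEEPALIVE_S))  # True

# With a keep-alive above the ELB timeout, ELB is the one that closes.
print(backend_closes_first(ELB_IDLE_TIMEOUT_S, 65))  # False
```

If the function returns True for your settings, the load balancer can still try to reuse a connection the backend has already closed.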

All we had to do was set a longer keep-alive value when starting gunicorn:

gunicorn --keep-alive 65 <other-args>
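If you would rather keep the value in a config file than on the command line, something like the following should be equivalent. This is a sketch, not our actual config: the file name and the 60-second ELB figure are assumptions you should check against your own setup.

```python
# gunicorn.conf.py (hypothetical file name; load it with: gunicorn -c gunicorn.conf.py <other-args>)

# Seconds gunicorn waits for the next request on a keep-alive connection.
# Assumption: the ELB idle timeout is still at its 60-second default,
# so 65 leaves a small margin and lets ELB close idle connections first.
keepalive = 65
```

The `keepalive` setting is the config-file counterpart of the `--keep-alive` flag, so use one or the other rather than both.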

The gunicorn docs even recommend a higher value when it runs behind a load balancer!

Five minutes later, the pull request was merged and deployed, and the problem vanished immediately.

Sometimes, the fix really is on the first page of Google – if you just google for the right thing.