The Latency Trap: Why Tips and Tricks Aren't Enough
Stop guessing why your requests are slow. Learn the fundamental formula: Latency = Propagation + Queueing + Service. Discover why p99 metrics matter and how to optimize your system without over-engineering it.
Up until now, through all my learning, thinking, and working, the concept of latency (request delay) has always been… kinda vague to me. Whenever I heard about it, in my head it was always:
“Ah okay, latency is the time from when the client sends a request until it receives the response. Done.”
Yeah… not wrong. Completely correct. But also… completely useless 😄. It’s one of those definitions that sounds obvious, like “yeah yeah everyone knows that”.
But when you actually start working with it, suddenly it explains… nothing. For example:
When a manager asks:
“Why is this request so slow? It just fetches a list, why does it take 2 seconds?”
Or:
“Why is the same request sometimes 200ms, sometimes 1s, sometimes 2s?”
And now you’re stuck. Because that “definition” doesn’t help you answer anything.
So… what actually causes latency?
Why is it fast sometimes and slow other times?
What exactly is happening inside a request?
Tips, Tricks, and Fancy Diagrams
If you search online for advice about system design or how to optimize a request’s latency, you’ll find tons of methods — hundreds of architecture diagrams and all sorts of fancy technical explanations and buzzwords. I used to do that too and thought:
“Wow, this makes perfect sense, let’s apply it!”
I also picked up quite a few latency-optimization skills, haha — if someone asked me about them in an interview, I’m ready 😄. For example:
If too many messages hit the system at once and it gets overloaded, just push them all into a queue (like Kafka or SQS), then process them slowly, one by one — that way, no message gets lost.
When the system’s overloaded, scale it horizontally — spin up more instances, pods, or nodes to handle millions of requests. That’s what autoscaling with load balancers is for, right?
Bring the server closer to users with a CDN. With servers all around the world, European users reach the European server, Americans go to the US one — plus, that also helps reduce load on the main server.
Cache data with Redis — reading from RAM is so much faster than pulling from disk I/O, and it also takes some pressure off the database.
For databases, you can look into techniques like replication (to improve read performance), adding indexes, or using partitioning and sharding to make queries faster.
And many more… And honestly? I agree with all of them.
But…
All those pieces of advice — to me, they’re just little tips and tricks. Maybe I remember them today and forget them tomorrow. There’s just too much going on in life to keep everything in my head.
I didn’t really understand the essence or the actual components of latency — what a request has to go through, what it faces, why it’s sometimes fast and sometimes slow. When it’s fast, why is it fast? When it’s slow, why is it slow?
I realized I understood nothing if I only relied on random tips and guesses.
The Formula That Changed Everything
Thanks
I have to thank the System Design Handbook that Quang Hoang shared recently. After reading it, I learned a lot of new things about latency — not just tips I’ll forget later. The most valuable thing for me was this formula:
Latency = Propagation + Queueing + Service
Simple. Clean. Almost too simple. But this thing changed everything for me.
Latency has 3 components:
Propagation → time for the request to travel
Queueing → time waiting (thread pool, DB connection pool, etc.)
Service → time the server actually processes
When I look at this formula… All the “tips & tricks” suddenly become easy to understand. Because now:
Optimizing latency = optimizing these 3 things.
Examples:
If the connection setup and transit between the two sides (TCP handshake, TLS handshake, network hops, etc.) take too long, bring them closer together; if you can merge them or co-locate them, even better.
If the queue is too long, you need to find a way to shorten it by reducing incoming load or increasing processing throughput (scaling servers) so the queue gets smaller.
If the request handler itself is too slow, then you’d better optimize the algorithm, tune the database, or use a programming language that’s more suitable for the business problem, or find ways to process things in parallel and asynchronously.
The more of these parts you can optimize, the better.
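To make the decomposition concrete, here’s a minimal Java sketch (my own toy example, not from the handbook): a fixed pool with 2 workers, where the wait before a worker picks a task up is queueing and the time spent running it is service. Propagation is whatever the client measures on top of this, so it doesn’t show up here.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Minimal sketch: split a request's server-side latency into
// "queueing" (waiting for a free worker thread) and "service"
// (time actually spent processing). Propagation is the network part
// the client sees on top of this, so it is not visible here.
public class LatencyBreakdown {

    public static void main(String[] args) throws Exception {
        // Two workers, so the 3rd+ concurrent request has to wait in the queue.
        ExecutorService pool = Executors.newFixedThreadPool(2);

        for (int i = 0; i < 6; i++) {
            final int requestId = i;
            final long enqueuedAt = System.nanoTime();   // request arrives

            pool.submit(() -> {
                long startedAt = System.nanoTime();      // a worker picked it up
                try {
                    Thread.sleep(200);                   // pretend this is the real work
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                long finishedAt = System.nanoTime();

                long queueingMs = TimeUnit.NANOSECONDS.toMillis(startedAt - enqueuedAt);
                long serviceMs  = TimeUnit.NANOSECONDS.toMillis(finishedAt - startedAt);
                System.out.printf("request %d: queueing=%dms service=%dms%n",
                        requestId, queueingMs, serviceMs);
            });
        }

        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```

Run it and the first two requests report near-zero queueing while the later ones wait longer and longer, even though the service time is identical. Same work, different latency: the queue is the difference.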
And if it’s still too complicated, then just hide that latency away: immediately return an acknowledgment like “request accepted, processing” so the client feels at ease, even though behind the scenes the system is still working its butt off.
For example, when you create an AMI from an EC2 instance, AWS responds right away that the image creation is in progress, so the user can move on and do other things instead of staring at a loading spinner and waiting for the page to unblock.
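Here’s a minimal sketch of that “acknowledge now, work later” pattern, assuming Spring Boot; the endpoint, class names, and job-id shape are made up for illustration, not a real API:

```java
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.scheduling.annotation.Async;
import org.springframework.stereotype.Service;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

import java.util.Map;
import java.util.UUID;

// "Hide the latency": accept the request, hand the heavy work to a
// background thread, and answer immediately with a job id the client
// can poll later. Requires @EnableAsync on a configuration class.
@RestController
class ImageController {

    private final ImageService imageService;

    ImageController(ImageService imageService) {
        this.imageService = imageService;
    }

    @PostMapping("/images")
    ResponseEntity<Map<String, String>> createImage(@RequestBody Map<String, Object> request) {
        String jobId = UUID.randomUUID().toString();
        imageService.buildImageAsync(jobId, request);          // returns right away
        return ResponseEntity.status(HttpStatus.ACCEPTED)      // 202: "we're on it"
                .body(Map.of("jobId", jobId, "status", "IN_PROGRESS"));
    }
}

@Service
class ImageService {

    @Async   // runs on Spring's task executor, not the request thread
    public void buildImageAsync(String jobId, Map<String, Object> request) {
        // ... the slow part happens here, long after the client already got its 202 ...
    }
}
```

The client gets a 202 and a job id in milliseconds and can poll (or get a callback) later, while the slow part runs in the background. The latency didn’t disappear, it just stopped blocking the user.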
Now I’ve developed a new habit: whenever I come across some tip or trick to optimize latency, I ask myself which of those three components it actually improves, what the trade-offs are, or whether it helps optimize all of them — instead of doing what I used to do: reading through long-winded explanations that I probably wouldn’t remember even for a day.
Maybe optimizing latency is still a huge topic, with plenty more going on behind the scenes, but for me, this formula already captures a big part of:
The definition of latency
The components that make it up
The strategies to optimize it
There are already tons of articles online about system design and techniques for optimization (caching, load balancers, scaling, CDNs, etc.), so I probably won’t add to that pile here — everyone can dig into those on their own and compare them against this formula.
Real-world application
Staying on the topic of latency, I actually shared a post before about building a monitoring system, and there were 2 things I mentioned.
1. Why p99 Matters More Than Average
First, I emphasized the importance of percentile metrics (p50, p95, p99), especially this guy p99. And why is average latency useless here? Because in practice it tells you almost nothing: it does NOT represent the latency that real users actually experience.
If you tell your boss, “Our average latency is 200ms,” you aren’t telling the truth. You’re telling a statistic.
The “Average” is a mathematical trap. It assumes every user has a similar experience. But in a distributed system, a single request doesn’t just “happen”—it hits a load balancer, a thread pool, a database, and maybe three external APIs.
Very roughly:
p50: This is the experience of your "typical" user. Half are faster, half are slower.
p95: 1 out of every 20 requests is slower than this.
p99: The "1% experience." 1 out of every 100 requests hits this delay.
The Danger of Scale: If your landing page makes 50 different network calls to load (images, CSS, tracking, API), the chance that a user hits a p99 delay at least once is 1 − 0.99^50 ≈ 39.5%, which is nearly 40%.
Suddenly, the “1% problem” becomes a “40% of my users are annoyed” problem. This is why we optimize for the outliers, not the average.
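If you want to convince yourself of those numbers, here’s a toy Java sketch with made-up latency samples: it computes p50/p95/p99 with the simple nearest-rank method and then checks how the 1% compounds across 50 calls.

```java
import java.util.Arrays;
import java.util.Random;

// Toy numbers only: compute p50/p95/p99 from a batch of latency samples,
// then check how the "1% problem" compounds when one page load makes
// many independent calls.
public class PercentileDemo {

    // Nearest-rank percentile over a sorted copy of the samples.
    static long percentile(long[] samplesMs, double p) {
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        long[] samplesMs = new long[10_000];
        for (int i = 0; i < samplesMs.length; i++) {
            // Mostly fast, occasionally very slow: a typical long-tail shape.
            samplesMs[i] = 150 + rng.nextInt(100) + (rng.nextInt(100) == 0 ? 1_800 : 0);
        }

        System.out.println("p50 = " + percentile(samplesMs, 50) + "ms");
        System.out.println("p95 = " + percentile(samplesMs, 95) + "ms");
        System.out.println("p99 = " + percentile(samplesMs, 99) + "ms");

        // One page load = 50 calls. Chance that at least one is a p99 outlier:
        double atLeastOneSlow = 1 - Math.pow(0.99, 50);
        System.out.printf("P(hit a p99 delay in 50 calls) = %.1f%%%n", atLeastOneSlow * 100);
        // ~= 39.5%: the "1% problem" becomes a "40% of page loads" problem.
    }
}
```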
2. A Debug Story From Production
Second is a story where I debugged an issue:
Hundreds of third-party payment webhook requests all hit the system at exactly 12 PM, causing database congestion.
At peak, the database wanted 20 vCPUs… while the system only had 2 😂
Meanwhile, the server looked perfectly fine — RAM good, CPU good.
But every day at that time, the whole system became slow like a turtle for ~30 minutes. Everything slowed down. Boss complained. Teammates complained. Customers complained. Reputation and service quality took a hit.
Now I’ll analyze it again, but this time using the latency formula above.
At that time, I don’t know if I was just too inexperienced, or I read too many tips & tricks and started overthinking. In my head, I thought:
“If too many requests come in and we can’t handle them, just throw them into a queue and process gradually. Easy. (damn I’m a genius 😂)”
So I jumped in and designed a beautiful architecture with SQS + Lambda + reserved concurrency (this thing ensures a certain number of Lambdas are always available, and also limits how many run in parallel).
Now all webhook payment requests would be processed gradually. Let’s see how the database dares to max out CPU again 😏
Well… life is not a dream. My teammate and I spent 2 weeks implementing this solution. Result?
Nothing improved.
System still slow.
People still unhappy.
And we wasted time.
If I had known the formula earlier, things would’ve been much simpler instead of chasing fancy stuff.
Applying the Formula: Propagation, Queueing, Service
Propagation
This one is hard to optimize. Third-party systems (like payment providers) connect to us — hard to control. In my case, maybe just vertically scale the database to 20 vCPUs and call it a day =]]]
Queueing
This is where requests wait before being processed.
Network/router queues, CPU queues → too advanced for me =]]].
But thread pool queue & connection pool → these I can control.
So I tuned the default configs in Spring Boot to better fit my system. From now on:
Requests are processed more sequentially.
Less fighting, less contention.
No more trying to do too many things in parallel while resources are limited.
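Roughly the shape of that tuning, assuming Spring Boot with embedded Tomcat and HikariCP. The numbers below are placeholders, not my production values, and the same knobs can also be set through properties like server.tomcat.threads.max and spring.datasource.hikari.maximum-pool-size:

```java
import com.zaxxer.hikari.HikariDataSource;
import org.apache.coyote.AbstractProtocol;
import org.springframework.boot.autoconfigure.jdbc.DataSourceProperties;
import org.springframework.boot.web.embedded.tomcat.TomcatServletWebServerFactory;
import org.springframework.boot.web.server.WebServerFactoryCustomizer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

// Placeholder numbers: the point is to stop admitting more concurrent work
// than 2 vCPUs and one database can actually serve, so requests wait in a
// short, predictable queue instead of all hammering the DB at once.
@Configuration
public class PoolTuningConfig {

    // Tomcat request threads. Spring Boot's default is 200, which on a small
    // box just means 200 requests fighting over the same CPU and connections.
    @Bean
    WebServerFactoryCustomizer<TomcatServletWebServerFactory> tomcatThreadCap() {
        return factory -> factory.addConnectorCustomizers(connector -> {
            if (connector.getProtocolHandler() instanceof AbstractProtocol<?> protocol) {
                protocol.setMaxThreads(32);   // placeholder: size to your CPU + workload
            }
        });
    }

    // HikariCP connection pool. The default of 10 was already more than the
    // database could serve in parallel; a smaller pool means less contention.
    @Bean
    HikariDataSource dataSource(DataSourceProperties properties) {
        HikariDataSource ds = properties.initializeDataSourceBuilder()
                .type(HikariDataSource.class)
                .build();
        ds.setMaximumPoolSize(8);          // placeholder
        ds.setConnectionTimeout(3_000);    // fail fast instead of waiting forever
        return ds;
    }
}
```

Counter-intuitive but true here: smaller pools helped. With only 2 vCPUs, letting fewer requests run at a time means each one finishes sooner, instead of everything thrashing together.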
Service
Honestly, I don’t know why I didn’t think about this earlier, and kept chasing fancy architectures. The webhook processing method had MANY issues:
Bad async chain design (if it’s already a chain, why make it async??)
Same request fetching transaction, invoice, payment again and again
→ direct pressure on database
→ why not cache it?
→ in-memory cache worked perfectly here (see the sketch after this list)
Non-critical tasks (audit, tracking) executed directly
→ more DB overload
→ I moved them to the end of the webhook
→ maybe later I’ll push them into a queue, we’ll see 😄
The database was missing too many important indexes AND had too many useless ones
I just tracked trace_id in monitoring → immediately saw which requests were slow → reran the SQL → found full table scans
As for unused indexes, every database has tools for that — just Google it (right now I forgot already… classic “learned via tips & tricks” 😂)
And I didn’t even touch any “bit-level optimization” yet.
→ At this point, the system was already good enough. (Know your limit, be happy with what you have — going deeper is just complex and time-consuming.)
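For the “why not cache it” part, here’s a sketch of the idea using Caffeine; the domain names (Transaction, TransactionRepository, findByExternalId) are hypothetical, and any small, short-TTL in-memory cache would do the same job:

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import org.springframework.stereotype.Service;

import java.time.Duration;

// Hypothetical domain types, stubbed so the sketch compiles.
record Transaction(String externalId, long amountCents) {}

interface TransactionRepository {
    Transaction findByExternalId(String externalId);   // e.g. a Spring Data query method
}

// A tiny short-TTL in-memory cache in front of the repository, so that one
// webhook (or a burst of them at 12 PM) doesn't fetch the same transaction
// from the database five times.
@Service
class TransactionReader {

    private final TransactionRepository repository;

    // Small and short-lived on purpose: it only needs to absorb the repeated
    // reads inside a single burst, not be a "real" distributed cache.
    private final Cache<String, Transaction> cache = Caffeine.newBuilder()
            .maximumSize(10_000)
            .expireAfterWrite(Duration.ofSeconds(30))
            .build();

    TransactionReader(TransactionRepository repository) {
        this.repository = repository;
    }

    Transaction findByExternalId(String externalId) {
        // First caller hits the DB; everyone else within 30s gets the cached copy.
        return cache.get(externalId, repository::findByExternalId);
    }
}
```

In the formula’s terms, this attacks Service time directly (fewer queries per request) and Queueing indirectly (fewer queries fighting for the same connection pool).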
After applying all these:
No surprise — the things that should’ve been done from the beginning worked best.
Now the system only needs ~0.5 vCPU (max was 2 before).
Maybe now I should increase concurrency for thread pool and connection pool 😂
Conclusion
Understanding latency instead of memorizing tricks helped me:
Think more clearly
Debug more effectively
Avoid unnecessary complexity
When things are clear (like a formula), remembering tips is no longer the problem.
The real question becomes:
Does this tip actually solve my problem… or just make it more complex?
(If you enjoy these kinds of engineering stories, you can subscribe to receive the next ones.)

