<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Quang Chien's Blog | Software Engineer | France: Technology]]></title><description><![CDATA[Section Technology]]></description><link>https://quangchientran.substack.com/s/technologie</link><image><url>https://substackcdn.com/image/fetch/$s_!LqqI!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa23b0fe9-c371-44b8-a1de-93bf83e76d7a_800x800.png</url><title>Quang Chien&apos;s Blog | Software Engineer | France: Technology</title><link>https://quangchientran.substack.com/s/technologie</link></image><generator>Substack</generator><lastBuildDate>Thu, 14 May 2026 11:03:31 GMT</lastBuildDate><atom:link href="https://quangchientran.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Quang Chien TRAN]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[quangchientran@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[quangchientran@substack.com]]></itunes:email><itunes:name><![CDATA[Quang Chien TRAN]]></itunes:name></itunes:owner><itunes:author><![CDATA[Quang Chien TRAN]]></itunes:author><googleplay:owner><![CDATA[quangchientran@substack.com]]></googleplay:owner><googleplay:email><![CDATA[quangchientran@substack.com]]></googleplay:email><googleplay:author><![CDATA[Quang Chien TRAN]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[EventBridge + Lambda: My Go-To Duo for AWS Automation]]></title><description><![CDATA[From cost optimization and ETL to cron jobs, staging sync, and security 
remediation.]]></description><link>https://quangchientran.substack.com/p/eventbridge-lambda-aws-automation</link><guid isPermaLink="false">https://quangchientran.substack.com/p/eventbridge-lambda-aws-automation</guid><dc:creator><![CDATA[Quang Chien TRAN]]></dc:creator><pubDate>Fri, 08 May 2026 20:32:55 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!IMss!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8594ca61-86bb-4bc6-b6b2-f434068ac377_1983x793.png" length="0" type="image/png"/><content:encoded><![CDATA[<p>Today I want to share a little bit about <strong>AWS</strong> &#8212; a cloud platform that I use quite a lot both at work and in my studies. Not only does it help me deploy systems, but it has also shaped the way I think about <strong>architecture design</strong> and <strong>real-world technical solutions</strong>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IMss!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8594ca61-86bb-4bc6-b6b2-f434068ac377_1983x793.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IMss!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8594ca61-86bb-4bc6-b6b2-f434068ac377_1983x793.png 424w, https://substackcdn.com/image/fetch/$s_!IMss!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8594ca61-86bb-4bc6-b6b2-f434068ac377_1983x793.png 848w, 
https://substackcdn.com/image/fetch/$s_!IMss!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8594ca61-86bb-4bc6-b6b2-f434068ac377_1983x793.png 1272w, https://substackcdn.com/image/fetch/$s_!IMss!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8594ca61-86bb-4bc6-b6b2-f434068ac377_1983x793.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IMss!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8594ca61-86bb-4bc6-b6b2-f434068ac377_1983x793.png" width="1456" height="582" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8594ca61-86bb-4bc6-b6b2-f434068ac377_1983x793.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:582,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1122430,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://quangchientran.substack.com/i/196934594?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8594ca61-86bb-4bc6-b6b2-f434068ac377_1983x793.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!IMss!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8594ca61-86bb-4bc6-b6b2-f434068ac377_1983x793.png 424w, 
https://substackcdn.com/image/fetch/$s_!IMss!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8594ca61-86bb-4bc6-b6b2-f434068ac377_1983x793.png 848w, https://substackcdn.com/image/fetch/$s_!IMss!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8594ca61-86bb-4bc6-b6b2-f434068ac377_1983x793.png 1272w, https://substackcdn.com/image/fetch/$s_!IMss!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8594ca61-86bb-4bc6-b6b2-f434068ac377_1983x793.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"></div></div></a></figure></div><p>Before, I had written a post about how to <strong>optimize AWS infrastructure costs</strong> in a company, with some ways of thinking that helped reduce the <strong><a href="https://open.substack.com/pub/quangchientran/p/3-how-i-reduced-aws-costs-by-50?r=5zk2y9&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">cloud bill by around 40%</a></strong> while the system still ran stably. In this post, I&#8217;ll talk about a <strong>&#8220;duo&#8221;</strong> that I use very often for automation: <strong>EventBridge and Lambda</strong>. These two services work together in a pretty lightweight way, but they solve a lot of <strong>real-world problems</strong>.</p><div><hr></div><h2><strong>0. Definition</strong></h2><h3><strong>What is Lambda?</strong></h3><p>Simply put, <strong>AWS Lambda</strong> is a service that lets you run code <strong>without managing servers</strong>. 
You only need to write the code to handle <strong>business logic</strong>, and AWS takes care of the rest, such as <strong>provisioning</strong>, <strong>scaling</strong>, and <strong>infrastructure operations</strong>.</p><p>However, &#8220;not managing servers&#8221; does not mean you don&#8217;t need to care about anything. In practice, you still need to configure a few things, such as <strong>memory</strong>, <strong>timeout</strong>, and <strong>IAM permissions</strong>, and optimize to reduce <strong>cold starts</strong>.</p><p>Usually, you write code as a <strong>function</strong> (the entry point is called a <strong>handler</strong>), and Lambda executes that function when an <strong>event</strong> is triggered. Lambda supports many runtimes such as <strong>Node.js</strong>, <strong>Python</strong>, <strong>Java</strong>&#8230;, and for me, <strong>Node.js</strong> is the preferred choice because I&#8217;m most comfortable with it and it has a fairly fast cold start in many cases.</p><p>Lambda has a lot of use cases, but in this post I&#8217;ll focus only on <strong>automation</strong> combined with <strong>EventBridge</strong>.</p><h3><strong>What is EventBridge?</strong></h3><p><strong>EventBridge</strong> is an <strong>event processing</strong> service in AWS. It allows you to receive events from many different sources, such as <strong>AWS services</strong>, <strong>CloudTrail</strong>, or <strong>custom events</strong> that you send yourself.</p><p>One important point is that EventBridge does not directly &#8220;interfere&#8221; with the system. 
Instead, it works as an <strong>event router</strong>: when an event happens, you define <strong>rules</strong> to catch that event and trigger the corresponding <strong>targets</strong> (for example <strong>Lambda</strong>, <strong>SQS</strong>, <strong>Step Functions</strong>&#8230;).</p><p>In this post, I&#8217;ll focus only on EventBridge&#8217;s <strong>Schedule</strong> feature &#8212; that is, running jobs on a time basis (<strong>cron</strong> or <strong>rate</strong>). This is a very convenient way to build <strong>automated periodic tasks</strong> without having to set up a separate server.</p><h3><strong>Why do I often use this duo?</strong></h3><p>For me, <strong>EventBridge + Lambda</strong> is a very <strong>lightweight</strong> combo for automation:</p><ul><li><p>No need to build a separate <strong>cron server</strong>.</p></li><li><p><strong>Easy to scale</strong>.</p></li><li><p><strong>Low cost</strong> if the workload is not large.</p></li><li><p>Built right into the <strong>AWS ecosystem</strong>.</p></li></ul><p>In many real-world cases, just one <strong>schedule rule</strong> + one <strong>Lambda function</strong> is enough to solve an entire <strong>operational problem</strong>.</p><div><hr></div><h2><strong>1. 
Infrastructure cost management and optimization</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!I7LG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb88377a4-f24e-46e4-abf7-fddb84cbde6e_1693x929.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!I7LG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb88377a4-f24e-46e4-abf7-fddb84cbde6e_1693x929.png 424w, https://substackcdn.com/image/fetch/$s_!I7LG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb88377a4-f24e-46e4-abf7-fddb84cbde6e_1693x929.png 848w, https://substackcdn.com/image/fetch/$s_!I7LG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb88377a4-f24e-46e4-abf7-fddb84cbde6e_1693x929.png 1272w, https://substackcdn.com/image/fetch/$s_!I7LG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb88377a4-f24e-46e4-abf7-fddb84cbde6e_1693x929.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!I7LG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb88377a4-f24e-46e4-abf7-fddb84cbde6e_1693x929.png" width="1456" height="799" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b88377a4-f24e-46e4-abf7-fddb84cbde6e_1693x929.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:799,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1032341,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://quangchientran.substack.com/i/196934594?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb88377a4-f24e-46e4-abf7-fddb84cbde6e_1693x929.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!I7LG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb88377a4-f24e-46e4-abf7-fddb84cbde6e_1693x929.png 424w, https://substackcdn.com/image/fetch/$s_!I7LG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb88377a4-f24e-46e4-abf7-fddb84cbde6e_1693x929.png 848w, https://substackcdn.com/image/fetch/$s_!I7LG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb88377a4-f24e-46e4-abf7-fddb84cbde6e_1693x929.png 1272w, https://substackcdn.com/image/fetch/$s_!I7LG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb88377a4-f24e-46e4-abf7-fddb84cbde6e_1693x929.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"></div></div></a></figure></div><p>This is probably one of the most common use cases in Cloud work, especially if you are running multiple environments 
(<strong>dev</strong>, <strong>staging</strong>, <strong>sandbox</strong>&#8230;). Doing this well not only saves quite a bit of money, but also means the operations team does not have to &#8220;watch the budget&#8221; at the end of every month.</p><p>One of the approaches I use the most is <strong><a href="https://open.substack.com/pub/quangchientran/p/3-how-i-reduced-aws-costs-by-50?r=5zk2y9&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">automatically turning resources on and off by time slot</a></strong>. Specifically, I use <strong>EventBridge Scheduler</strong> to trigger Lambda on a schedule (<strong>cron</strong>): for example, at <strong>6 PM</strong>, when everyone is done for the day, it turns off all unnecessary resources, then turns them back on at <strong>8 AM</strong> the next morning.</p><p>This mainly applies to <strong>Development</strong> or <strong>Staging</strong> environments. Resources suitable for this approach:</p><ul><li><p><strong>EC2</strong> &#8594; the easiest: <strong>stop/start directly</strong>.</p></li><li><p><strong>RDS</strong> &#8594; can be <strong>stopped/started</strong>, but note that AWS only allows stopping for a maximum of <strong>7 days</strong>, after which it starts back up automatically.</p></li><li><p><strong>ASG</strong> &#8594; there is no concept of &#8220;turning it off&#8221;; instead you set <strong>desired capacity to 0</strong> (or scale down on schedule).</p></li><li><p><strong>ECS</strong> &#8594; usually I will:</p><ul><li><p>set the service&#8217;s <strong>desired count to 0</strong> to &#8220;turn it off&#8221;.</p></li><li><p>when turning it back on, <strong>scale it up again</strong> as before.</p></li><li><p>if using <strong>Fargate</strong>, this approach is very effective because <strong>scale = 0</strong> means almost no compute cost.</p></li></ul></li><li><p><strong>EKS</strong> &#8594; waiting for you guys to add more here; for now I&#8217;m only working with 
<strong>ECS</strong>.</p></li></ul><p>Besides turning resources on and off by schedule, another problem that is very often forgotten is cleaning up <strong>&#8220;trash&#8221; resources</strong>. I usually set up a job that runs periodically (for example daily or weekly) to:</p><ul><li><p>Scan <strong>EBS Volumes</strong> that are no longer attached to EC2.</p></li><li><p>Delete <strong>snapshots</strong> that are too old (for example <strong>&gt; 30 days</strong>).</p></li><li><p>Clean up <strong>Elastic IPs</strong> that are no longer attached to any resource.</p></li></ul><p>It sounds simple, but if you do this carelessly, it can blow up very easily &#128517;. My experience is always to filter by <strong>tag</strong> (e.g. <code>env=dev</code>, <code>auto-clean=true</code>) or by clear rules (<strong>age</strong>, <strong>owner</strong>, <strong>project</strong>), and to avoid deleting resources based on status alone, because in reality many resources look like they are &#8220;not in use&#8221; but:</p><ul><li><p>are waiting to be attached,</p></li><li><p>are backups for rollback,</p></li><li><p>or belong to <strong>compliance</strong> processes.</p></li></ul><p>The whole flow is usually:</p><blockquote><p><strong>EventBridge (schedule) &#8594; Lambda &#8594; AWS SDK (scan + action)</strong></p></blockquote><p>Once set up, it runs almost completely automatically, saving cost and reducing quite a lot of manual work for the operations team.</p><div><hr></div><h2><strong>2. 
Data and file processing</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vUe8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F958c25c9-8452-48e6-a0bd-be429707bc3a_1774x887.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vUe8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F958c25c9-8452-48e6-a0bd-be429707bc3a_1774x887.png 424w, https://substackcdn.com/image/fetch/$s_!vUe8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F958c25c9-8452-48e6-a0bd-be429707bc3a_1774x887.png 848w, https://substackcdn.com/image/fetch/$s_!vUe8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F958c25c9-8452-48e6-a0bd-be429707bc3a_1774x887.png 1272w, https://substackcdn.com/image/fetch/$s_!vUe8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F958c25c9-8452-48e6-a0bd-be429707bc3a_1774x887.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vUe8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F958c25c9-8452-48e6-a0bd-be429707bc3a_1774x887.png" width="1456" height="728" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/958c25c9-8452-48e6-a0bd-be429707bc3a_1774x887.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:728,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1105170,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://quangchientran.substack.com/i/196934594?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F958c25c9-8452-48e6-a0bd-be429707bc3a_1774x887.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!vUe8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F958c25c9-8452-48e6-a0bd-be429707bc3a_1774x887.png 424w, https://substackcdn.com/image/fetch/$s_!vUe8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F958c25c9-8452-48e6-a0bd-be429707bc3a_1774x887.png 848w, https://substackcdn.com/image/fetch/$s_!vUe8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F958c25c9-8452-48e6-a0bd-be429707bc3a_1774x887.png 1272w, https://substackcdn.com/image/fetch/$s_!vUe8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F958c25c9-8452-48e6-a0bd-be429707bc3a_1774x887.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"></div></div></a></figure></div><p>In practice, I quite often run into problems involving data from third parties. 
Usually, they provide a pretty large data file (<strong>CSV</strong>, <strong>JSON dump</strong>&#8230;), but on my side I only need a small portion of the data inside it for business purposes.</p><p>If you load the entire file and process it directly on the server, it both wastes resources and is not optimal &#8212; especially when that data is updated daily. At that point, I need to make sure of two things: processing is fast enough and the data is always the latest version.</p><p>The way I usually do it is by setting up a simple <strong>ETL</strong> flow with <strong>EventBridge</strong> and <strong>Lambda</strong>. Specifically, I use <strong>EventBridge Scheduler</strong> to run on a schedule (for example once per day). Then it triggers Lambda to perform <strong>ETL</strong> (<strong>Extract &#8211; Transform &#8211; Load</strong>).</p><p>Inside Lambda, I will:</p><ul><li><p><strong>Extract:</strong> get data from the source (<strong>API</strong> or file).</p></li><li><p>If it is a large file, I prefer to process it in a <strong>streaming</strong> way or split it up instead of loading everything into memory.</p></li><li><p><strong>Transform:</strong> filter out exactly the data I need, normalize the format if necessary.</p></li><li><p><strong>Load:</strong> write the processed data into <strong>S3</strong> (as a <strong>data lake</strong>) or into a database such as <strong>RDS / DynamoDB</strong> to serve other services.</p></li></ul><p>This approach helps reduce the load on the main services (no heavy data processing at runtime). The data is always prepared in advance, so you just query and use it right away, and it is easy to scale and almost does not require server maintenance.</p><p>However, Lambda is not always the right choice. For <strong>small or medium ETL jobs</strong> (data not too large, processed within a few minutes), Lambda works very well. 
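</p>

<p>For a concrete picture, here is a sketch of the Transform step of such a job. Everything in it is illustrative: I am assuming a hypothetical third-party CSV with <code>id,country,price</code> columns, from which we keep only the rows and fields we actually need.</p>

```javascript
// Transform step sketch: filter a (hypothetical) third-party CSV down to
// the rows and columns we need. In the real Lambda, Extract would stream
// the file from S3 or an API, and Load would write the result to S3 or a DB.
function transform(csvText, wantedCountry) {
  const [header, ...rows] = csvText.trim().split("\n");
  const cols = header.split(",");
  const iId = cols.indexOf("id");
  const iCountry = cols.indexOf("country");
  const iPrice = cols.indexOf("price");
  return rows
    .map((line) => line.split(","))
    .filter((cells) => cells[iCountry] === wantedCountry)
    .map((cells) => ({ id: cells[iId], price: Number(cells[iPrice]) }));
}
```

<p>For genuinely large files, you would replace the <code>split("\n")</code> with a line-by-line stream, which is exactly the streaming approach mentioned above.</p><p>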
But if the file is too large (several <strong>GB</strong> or more) or the transformation logic is complicated, then you should consider moving to <strong>AWS Glue</strong> (specialized ETL) or an <strong>ECS/Fargate job</strong> to process in batch.</p><p>One point I find very important but easy to overlook is the <strong>&#8220;safety&#8221; of the pipeline</strong>:</p><ul><li><p>There should be <strong>retry</strong> or <strong>DLQ</strong> if the job fails.</p></li><li><p>Avoid blindly <strong>overwriting data</strong> (you can partition by date or version).</p></li><li><p>Make sure the job can run again without producing duplicate data (<strong>idempotency</strong>).</p></li></ul><p>Overall, for simple to medium ETL problems, the combo of <strong>EventBridge + Lambda + S3/DB</strong> is a very neat solution, easy to deploy, and the cost is also quite reasonable.</p><div><hr></div><h2><strong>3. Cron Job</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QhnS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cfe70ad-f954-4cb2-8544-8797f0538fe1_1693x929.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QhnS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cfe70ad-f954-4cb2-8544-8797f0538fe1_1693x929.png 424w, https://substackcdn.com/image/fetch/$s_!QhnS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cfe70ad-f954-4cb2-8544-8797f0538fe1_1693x929.png 848w, 
https://substackcdn.com/image/fetch/$s_!QhnS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cfe70ad-f954-4cb2-8544-8797f0538fe1_1693x929.png 1272w, https://substackcdn.com/image/fetch/$s_!QhnS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cfe70ad-f954-4cb2-8544-8797f0538fe1_1693x929.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QhnS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cfe70ad-f954-4cb2-8544-8797f0538fe1_1693x929.png" width="1456" height="799" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6cfe70ad-f954-4cb2-8544-8797f0538fe1_1693x929.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:799,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1136386,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://quangchientran.substack.com/i/196934594?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cfe70ad-f954-4cb2-8544-8797f0538fe1_1693x929.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!QhnS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cfe70ad-f954-4cb2-8544-8797f0538fe1_1693x929.png 424w, 
https://substackcdn.com/image/fetch/$s_!QhnS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cfe70ad-f954-4cb2-8544-8797f0538fe1_1693x929.png 848w, https://substackcdn.com/image/fetch/$s_!QhnS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cfe70ad-f954-4cb2-8544-8797f0538fe1_1693x929.png 1272w, https://substackcdn.com/image/fetch/$s_!QhnS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cfe70ad-f954-4cb2-8544-8797f0538fe1_1693x929.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"></div></div></a></figure></div><p><strong>Cron jobs</strong> are one of the things that almost every company has to use, especially for scheduled tasks such as <strong>batch processing</strong>, <strong>data synchronization</strong>, <strong>periodic reports</strong>, or performing <strong>billing steps</strong> on a monthly subscription cycle for customers.</p><p>Before, I also used scheduling directly inside Spring Boot applications with <code>@Scheduled</code>. That works, but when running in an environment with multiple <strong>instances</strong>, <strong>pods</strong>, or <strong>containers</strong>, problems start to appear. If there is no mechanism to prevent duplicate execution, the same job can easily be triggered multiple times by different instances.</p><p>Besides that, debugging and operations are also quite inconvenient. A job running inside the app means logs, retries, timeout, or failure are all tightly coupled to the application runtime, so when you need to scale or separate responsibilities, it no longer feels that smooth.</p><p>So for scheduled jobs, I usually separate scheduling from the main application. 
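</p>

<p>As a side note, for jobs that absolutely must not run twice, I still like an explicit guard in the job itself. Here is a sketch of the idea; the in-memory <code>Map</code> is just a stand-in for a real store (for example a DynamoDB conditional write), so treat the code as illustrative.</p>

```javascript
// Duplicate-execution guard sketch: each run must claim a lock key
// (job name + date) before doing any work. A Map stands in for a real
// store such as a DynamoDB table written with a conditional put.
const locks = new Map();

function tryClaimRun(jobName, isoDate) {
  const key = `${jobName}#${isoDate}`;
  if (locks.has(key)) return false; // this job already ran for this date
  locks.set(key, Date.now());
  return true;
}
```

<p>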
Specifically, I let <strong>EventBridge</strong> handle the schedule, and then <strong>Lambda</strong> becomes the place where the job is executed.</p><p>This flow is pretty neat:</p><ul><li><p><strong>EventBridge</strong> runs on the configured schedule.</p></li><li><p><strong>Lambda</strong> is invoked to perform a specific task.</p></li><li><p>Lambda can call the application&#8217;s API behind an <strong>ALB</strong>, or directly process the required logic.</p></li></ul><p>If the API is protected by <strong>OAuth</strong>, that&#8217;s even better, because Lambda can act as an <strong>internal client</strong>, get a token, and call the endpoint securely. This approach has a few points that I find very valuable:</p><ul><li><p>No need to embed <strong>cron logic</strong> into the main application.</p></li><li><p>Easier to scale because the <strong>schedule</strong> is separated from the app runtime.</p></li><li><p>Reduces the risk of <strong>duplicate job execution</strong> when the system has multiple instances.</p></li><li><p>Operations and monitoring are also clearer.</p></li></ul><p>In short, for periodic jobs that require high stability, I think letting <strong>EventBridge</strong> handle scheduling and <strong>Lambda</strong> handle execution is a pretty clean, neat, and maintainable direction.</p><div><hr></div><h2><strong>4. 
Syncing the staging database from production</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!D-W0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54bdbd6a-fad5-42f5-af57-5d54de8e6c6c_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!D-W0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54bdbd6a-fad5-42f5-af57-5d54de8e6c6c_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!D-W0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54bdbd6a-fad5-42f5-af57-5d54de8e6c6c_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!D-W0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54bdbd6a-fad5-42f5-af57-5d54de8e6c6c_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!D-W0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54bdbd6a-fad5-42f5-af57-5d54de8e6c6c_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!D-W0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54bdbd6a-fad5-42f5-af57-5d54de8e6c6c_1536x1024.png" width="1456" height="971" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/54bdbd6a-fad5-42f5-af57-5d54de8e6c6c_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1335123,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://quangchientran.substack.com/i/196934594?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54bdbd6a-fad5-42f5-af57-5d54de8e6c6c_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!D-W0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54bdbd6a-fad5-42f5-af57-5d54de8e6c6c_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!D-W0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54bdbd6a-fad5-42f5-af57-5d54de8e6c6c_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!D-W0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54bdbd6a-fad5-42f5-af57-5d54de8e6c6c_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!D-W0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54bdbd6a-fad5-42f5-af57-5d54de8e6c6c_1536x1024.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"></div></div></a></figure></div><p>Recently I had a pretty practical task: how to make the <strong>staging database</strong> refresh from 
<strong>production</strong> every day, so that the team can <strong>debug</strong>, <strong>test</strong>, and handle incidents on data that is as close to real-world as possible.</p><p>This problem sounds simple, but in reality it has a few quite <strong>&#8220;demanding&#8221;</strong> requirements:</p><ul><li><p>The staging data must be updated daily.</p></li><li><p>The staging database endpoint must stay fixed.</p></li><li><p>The staging user and password must remain the same so the dev team does not have to keep changing their connections.</p></li><li><p>The refresh process must be automatic, safe, and require as little manual work as possible.</p></li></ul><p>For this problem, I chose <strong>EventBridge + Lambda</strong>.</p><p>The first thing I needed to solve was a <strong>fixed endpoint</strong>. Every time you restore from a snapshot, AWS creates a new <strong>RDS instance</strong> with a new endpoint, so I do not let the app connect directly to the real RDS endpoint. Instead, I create a <strong>CNAME record</strong> in Route 53 to act as the fixed endpoint for staging. The app only needs to point to this hostname, and behind the scenes the record resolves to the newest staging instance.</p><p>I split the flow into two steps.</p><p><strong>Flow 1: Restore new staging</strong></p><p>On day N, Lambda will restore the latest snapshot from production to create a brand-new staging RDS instance.
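</p><p>As a quick illustration of Flow 1, here is a minimal sketch of the snapshot-selection step. It assumes the shape of the list returned by the RDS <code>DescribeDBSnapshots</code> API; the identifiers are invented for the example, and this is a sketch rather than the exact code I run.</p><div class="highlighted_code_block" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-javascript">// Pick the most recent completed production snapshot to restore from.
// Snapshots still in the "creating" state are skipped.
function pickLatestSnapshot(snapshots) {
  return snapshots
    .filter((s) => s.Status === "available")
    .sort((a, b) => b.SnapshotCreateTime - a.SnapshotCreateTime)[0];
}</code></pre></div><p>In the Lambda itself, the chosen <code>DBSnapshotIdentifier</code> would then be passed to <code>RestoreDBInstanceFromDBSnapshot</code> (for example via <code>@aws-sdk/client-rds</code>), with the new instance named after the current date so yesterday's instance stays easy to find and delete.</p><p>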
This instance will become the staging version for that day, with data refreshed from the latest production state.</p><p><strong>Flow 2: Normalize and redirect</strong></p><p>After about <strong>30 minutes</strong>, when the new instance has finished restoring and moved to the <code>available</code> state, Lambda is triggered again and does three things:</p><ul><li><p>Reset the staging database password to a predefined value.</p></li><li><p>Update the CNAME in Route 53 to point to the endpoint of the newly restored instance.</p></li><li><p>Delete the staging database from day N-1 to avoid unnecessary cost.</p></li></ul><p>All required information, such as the password, the current staging database identifier, and the old staging database identifier, is centrally managed in <strong>Secrets Manager</strong> and <strong>Parameter Store</strong>. This makes operations safer and also easier to trace when you need to track the state of the system.</p><p>Finally, the whole workflow is scheduled by <strong>EventBridge</strong> at the time I choose, so there is almost no manual intervention needed every day.</p><div><hr></div><h2><strong>5. 
Security and compliance</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7dzw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b56c503-4ca6-44de-a327-ef1492641a02_1739x904.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7dzw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b56c503-4ca6-44de-a327-ef1492641a02_1739x904.png 424w, https://substackcdn.com/image/fetch/$s_!7dzw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b56c503-4ca6-44de-a327-ef1492641a02_1739x904.png 848w, https://substackcdn.com/image/fetch/$s_!7dzw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b56c503-4ca6-44de-a327-ef1492641a02_1739x904.png 1272w, https://substackcdn.com/image/fetch/$s_!7dzw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b56c503-4ca6-44de-a327-ef1492641a02_1739x904.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7dzw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b56c503-4ca6-44de-a327-ef1492641a02_1739x904.png" width="1456" height="757" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6b56c503-4ca6-44de-a327-ef1492641a02_1739x904.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:757,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1189463,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://quangchientran.substack.com/i/196934594?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b56c503-4ca6-44de-a327-ef1492641a02_1739x904.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!7dzw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b56c503-4ca6-44de-a327-ef1492641a02_1739x904.png 424w, https://substackcdn.com/image/fetch/$s_!7dzw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b56c503-4ca6-44de-a327-ef1492641a02_1739x904.png 848w, https://substackcdn.com/image/fetch/$s_!7dzw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b56c503-4ca6-44de-a327-ef1492641a02_1739x904.png 1272w, https://substackcdn.com/image/fetch/$s_!7dzw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b56c503-4ca6-44de-a327-ef1492641a02_1739x904.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"></div></div></a></figure></div><p>One of the big values of automation on AWS is helping the business respond faster to <strong>security risks</strong>, instead of 
waiting for operators to detect and handle them manually. Some situations I often think about are:</p><ul><li><p>A new <strong>IAM user</strong> is created outside the established process.</p></li><li><p>An <strong>IAM role</strong> is granted overly broad permissions, such as <strong>AdministratorAccess</strong>.</p></li><li><p>An <strong>S3 bucket</strong> is switched to public mode by mistake.</p></li></ul><p>For these cases, I can use <strong>EventBridge</strong> to receive security events recorded through <strong>CloudTrail</strong>, then trigger <strong>Lambda</strong> to handle them automatically according to predefined rules. The flow is usually:</p><ul><li><p><strong>CloudTrail</strong> records the change event.</p></li><li><p><strong>EventBridge</strong> catches that event and triggers <strong>Lambda</strong>.</p></li><li><p>Lambda performs <strong>remediation</strong>, for example revoking permissions, returning the configuration to a safe state, or sending alerts to the operations team.</p></li></ul><p>In many cases, I do not necessarily auto-fix everything immediately. There are situations where it is better to first alert via <strong>Slack</strong>, email, or <strong>SMS</strong> so the team can confirm, especially when an automatic fix could affect a system that is currently running.</p><p>This approach helps security and CloudOps teams respond faster, reduces exposure time to risk, and keeps the system aligned with security policy more effectively.</p><div><hr></div><h2><strong>6. Extra use cases</strong></h2><p>Besides the cases I&#8217;ve already mentioned above, <strong>EventBridge + Lambda</strong> also has quite a few small but very practical uses in system operations. 
This is the kind of automation that is not too &#8220;grand,&#8221; but it reduces a lot of manual work for the team.</p><p>Some cases I find quite useful:</p><ul><li><p><strong>Automatically sending periodic reports</strong><br>For example, every morning Lambda runs to aggregate metrics from a database, S3, or an internal API, then sends the report through email or Slack to the team.</p></li><li><p><strong>Scheduled health checks</strong><br>EventBridge triggers Lambda to call important APIs or endpoints. If an endpoint has a problem, an alert is sent immediately to the operations team.</p></li><li><p><strong>Cleaning up temporary data and old artifacts</strong><br>It can be used to delete temp files, old logs, old backups, or unnecessary artifacts to reduce storage cost.</p></li><li><p><strong>Handling lightweight reconciliation jobs</strong><br>For example, syncing status between systems, checking data mismatches, or updating records that are in the wrong state.</p></li><li><p><strong>Automatic secret rotation checks and reminders</strong><br>If a secret or API key is about to expire, Lambda can send a warning so the team can handle it before it affects production.</p></li><li><p><strong>Orchestrating small scheduled tasks</strong><br>For simple workflows, I can let EventBridge trigger Lambda step by step instead of building a heavier workflow engine.</p></li></ul><p>What I like about this combo is that it is very neat. 
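</p><p>To make the scheduled health check above concrete, here is a minimal sketch. The endpoint list and the notifier are placeholders, and the fetch implementation is injected so the logic can be exercised without real network calls:</p><div class="highlighted_code_block" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-javascript">// Check each endpoint and collect failures; alert only when something broke.
async function runHealthCheck(endpoints, fetchImpl, notify) {
  const failures = [];
  for (const url of endpoints) {
    try {
      // Any non-2xx response or thrown error counts as a failure.
      const res = await fetchImpl(url);
      if (!res.ok) failures.push(`${url} returned ${res.status}`);
    } catch (err) {
      failures.push(`${url} unreachable: ${err.message}`);
    }
  }
  if (failures.length > 0) await notify(failures.join("\n"));
  return failures;
}

// An EventBridge-scheduled Lambda would simply expose this as its handler:
// exports.handler = () => runHealthCheck(ENDPOINTS, fetch, sendToSlack);</code></pre></div><p>Wired up like this, the whole service is one function and a schedule; there is nothing to keep running between checks.</p><p>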
No need to build another server, no need to maintain cron on a separate machine, and yet it still solves a lot of real-world operational problems.</p><p>If a use case starts becoming more complex, with more branches, or needs clearer orchestration, then that is when I would consider moving to <strong>Step Functions</strong> or another workflow solution.</p><div><hr></div><h2><strong>Conclusion</strong></h2><p>Overall, I think <strong>EventBridge + Lambda</strong> is a very worthwhile duo if you want to automate operational tasks, data processing, cron jobs, or even some simple security flows in AWS. Its strengths are that it is neat, requires little infrastructure management, is easy to scale, and fits a lot of real-world problems.</p><p>Of course, not every case should use Lambda. If a job is too heavy, runs too long, or the workflow is too complex, I would consider moving to <strong>Glue</strong>, <strong>ECS</strong>, <strong>Step Functions</strong>, or another more suitable solution. But for small and medium problems, especially scheduled tasks or event-triggered tasks, this combo is really worth it.</p><p>I wrote this post not to say this is the only right way, but rather one way I have used quite a lot in practice and found effective. 
If you guys have other good ways, more optimized approaches, or real-world experience with EventBridge and Lambda, I&#8217;d really love to learn more.</p><p><em>(If you enjoy these kinds of engineering stories, you can subscribe to receive the next ones.)</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://quangchientran.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://quangchientran.substack.com/subscribe?"><span>Subscribe now</span></a></p><p></p>]]></content:encoded></item><item><title><![CDATA[Node.js and Java: 5 Easy-to-Mix-Up Questions About Runtime, Event Loop, and I/O]]></title><description><![CDATA[How are Node.js and Java different? This article analyzes 5 real-world questions about event loop, Cluster, cold start, WebFlux, Virtual Threads, and frontend/backend runtimes.]]></description><link>https://quangchientran.substack.com/p/nodejs-vs-java-runtime-event-loop-virtual-threads</link><guid isPermaLink="false">https://quangchientran.substack.com/p/nodejs-vs-java-runtime-event-loop-virtual-threads</guid><dc:creator><![CDATA[Quang Chien TRAN]]></dc:creator><pubDate>Sun, 03 May 2026 15:18:21 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!6qdN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c35fc54-eed6-410a-9818-afd8154b43fc_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Continuing from my earlier posts about <strong><a href="https://open.substack.com/pub/quangchientran/p/under-the-hood-a-deep-dive-into-processes?r=5zk2y9&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Process</a></strong><a 
href="https://open.substack.com/pub/quangchientran/p/under-the-hood-a-deep-dive-into-processes?r=5zk2y9&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">, </a><strong><a href="https://open.substack.com/pub/quangchientran/p/under-the-hood-a-deep-dive-into-processes?r=5zk2y9&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Thread</a></strong>, and <strong><a href="https://open.substack.com/pub/quangchientran/p/java-virtual-threads-the-end-of-the?r=5zk2y9&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Virtual Threads</a></strong> in Java, I&#8217;ve come to appreciate how powerful Java has become, especially since <strong>Java 21</strong>. If we look back more than 10 years ago, Java was almost the default choice for many backend systems. But over the last few years, the landscape has changed quite a bit, and a number of new languages and runtimes have emerged that make development faster, lighter, and more flexible.</p><p>One of the most prominent names is <strong>Node.js</strong> &#8212; the JavaScript runtime. Even today, Node.js remains one of the most popular choices for backend development, especially when you need strong I/O performance, fast startup, or want to leverage the same JavaScript ecosystem from frontend to backend. And on the frontend side, JavaScript is still basically the king.</p><p>This article is not meant to introduce Node.js from scratch. Instead, it focuses on the questions I think many of us have asked at some point: How does Node.js actually work? What is it good at? Where does it struggle? And what are the mechanisms hidden underneath those things that seem so simple at first glance? Let&#8217;s get started.</p><div><hr></div><h2><strong>1. 
How can Node.js handle millions of concurrent requests?</strong></h2><p>The short answer is: <strong>it depends</strong>.</p><p>Node.js can handle a very large number of concurrent requests extremely well, but that is mainly true when your workload is mostly <strong>I/O-bound</strong> &#8212; meaning the system spends most of its time waiting rather than computing. Typical examples include:</p><ul><li><p>Waiting for the database to return results.</p></li><li><p>Waiting to read or write files.</p></li><li><p>Waiting to call a third-party API.</p></li><li><p>Waiting for network responses.</p></li></ul><p>On the other hand, if your application is mostly <strong>CPU-bound</strong> &#8212; meaning it has to do heavy computation, video processing, data compression, encryption, or complex algorithms &#8212; then Node.js is not the ideal choice if you let everything run directly on the event loop.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6qdN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c35fc54-eed6-410a-9818-afd8154b43fc_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6qdN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c35fc54-eed6-410a-9818-afd8154b43fc_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!6qdN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c35fc54-eed6-410a-9818-afd8154b43fc_1536x1024.png 848w, 
https://substackcdn.com/image/fetch/$s_!6qdN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c35fc54-eed6-410a-9818-afd8154b43fc_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!6qdN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c35fc54-eed6-410a-9818-afd8154b43fc_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6qdN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c35fc54-eed6-410a-9818-afd8154b43fc_1536x1024.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1c35fc54-eed6-410a-9818-afd8154b43fc_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1424218,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://quangchientran.substack.com/i/196315396?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c35fc54-eed6-410a-9818-afd8154b43fc_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6qdN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c35fc54-eed6-410a-9818-afd8154b43fc_1536x1024.png 424w, 
https://substackcdn.com/image/fetch/$s_!6qdN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c35fc54-eed6-410a-9818-afd8154b43fc_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!6qdN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c35fc54-eed6-410a-9818-afd8154b43fc_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!6qdN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c35fc54-eed6-410a-9818-afd8154b43fc_1536x1024.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"></div></div></a></figure></div><h3><strong>Single Thread and Event Loop</strong></h3><p>Most traditional frameworks, such as Java-based ones, use a <strong>multi-threaded model</strong>. In other words, when a request comes in, the framework creates a separate thread to handle that request. <strong>If you have one million requests, that could mean one million threads at the same time</strong>, which quickly leads to massive memory usage and eventually RAM exhaustion. On top of that, the <strong>cost of <a href="https://quangchientran.substack.com/i/194205408/the-cost-of-switching">context switching</a> between those threads can make the application slower instead of doing useful work</strong>.</p><p>By default, a Node.js process runs JavaScript on a <strong>main thread</strong>, and that is what makes the event loop so efficient for I/O-bound workloads. At the same time, Node.js still has internal mechanisms and can be scaled across multiple cores when needed.</p><p>That <strong>single main thread handles incoming work without the overhead of creating a new thread for every request</strong>, so memory usage and context switching costs are much lower. The event loop keeps moving tasks into that main thread for execution. 
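</p><p>You can see this hand-off in a few lines of plain Node.js. In this toy example a timer stands in for a database or network call:</p><div class="highlighted_code_block" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-javascript">const order = [];

// Each simulated request records when it starts, then "waits on I/O"
// (a 10 ms timer) without ever blocking the JavaScript thread.
function handleRequest(id) {
  order.push(`start ${id}`);
  return new Promise((resolve) =>
    setTimeout(() => {
      order.push(`finish ${id}`);
      resolve();
    }, 10)
  );
}

const done = Promise.all([1, 2, 3].map(handleRequest));
// Logs: start 1, start 2, start 3, finish 1, finish 2, finish 3
done.then(() => console.log(order.join(", ")));</code></pre></div><p>All three simulated requests start before any of them finishes, because the single thread is never blocked while the timers are pending.</p><p>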
Even if your server has many CPU cores, a single Node.js process still primarily runs JavaScript on one main thread.</p><h3><strong>Non-Blocking I/O</strong></h3><p>Most of a web server&#8217;s time is spent waiting: waiting for a DB result, waiting for file reads, waiting for third-party APIs. With <strong>Java before version 21, that usually meant the thread would sit there and block, doing nothing until the result returned</strong>.</p><p>With Node.js, when an I/O request is made, the <strong>event loop forwards that work down to the runtime&#8217;s async I/O layer</strong> instead of running it directly on the JavaScript thread.</p><p>For operations like file access, network requests, or database calls, Node.js relies on the runtime&#8217;s asynchronous I/O mechanisms and the kernel. On Linux, this is commonly associated with <strong>epoll</strong>; on macOS, <strong>kqueue</strong>; and on Windows, <strong>IOCP</strong>. These mechanisms let Node register interest in file descriptors or sockets, then wait for the kernel to signal when they&#8217;re ready instead of blocking the JavaScript thread.</p><p>The important thing is this: the OS does not magically &#8220;<strong><s>push data directly into the event loop</s></strong>.&#8221; In reality, Node/libuv registers interest in the I/O event, the kernel tracks the I/O state, and once the socket or file is ready, the kernel notifies the event loop that it can continue processing.</p><h3><strong>How does the OS notify Node.js?</strong></h3><p>A more accurate way to describe it is:</p><ul><li><p>Node/libuv submits the I/O request to the operating system layer.</p></li><li><p>The operating system tracks that I/O state.</p></li><li><p>When the I/O completes or data becomes available, the kernel returns a &#8220;ready&#8221; signal.</p></li><li><p>The event loop receives that signal and pulls the corresponding callback or continuation from the queue to run on the main thread.</p></li></ul><p>So it&#8217;s not that 
the file or database directly talks to JavaScript. It&#8217;s the kernel plus the event notification mechanism informing the runtime that the resource is ready.</p><p>A good way to picture this is with <strong>Promise</strong>, <code>async</code>, and <code>await</code>. When a Promise is resolved, the code waiting on <code>await</code> does not start executing immediately in the middle of the event loop. It is usually placed into the Promise queue or microtask queue, and it gets priority after the current callback finishes, before the event loop moves to the next phase.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;javascript&quot;,&quot;nodeId&quot;:&quot;42ec341c-f4cf-42d4-8f2c-018c9cea36ef&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-javascript">async function demo() {
  console.log("A")                                 // runs synchronously
  const data = await fetch("https://example.com")  // hand the request to the async I/O layer; demo() pauses here
  console.log("B", data)                           // queued as a microtask once the Promise resolves
}</code></pre></div><p>What really happens is:</p><ul><li><p>Print <code>A</code></p></li><li><p>Send the network request down to the async I/O layer</p></li><li><p><code>demo()</code> pauses at <code>await</code></p></li><li><p>The event loop continues doing other work</p></li><li><p>When the response comes back, the Promise is resolved</p></li><li><p><code>console.log("B", data)</code> is queued to run next</p></li></ul><p>Think of it like a restaurant with only one waiter. After the waiter takes an order, he sends it to the kitchen to be prepared. While the kitchen is working, he doesn&#8217;t stand there waiting for just one table. Instead, he goes to take more orders or serve other tables. Once the food is ready, the kitchen rings a bell and places the meal in the pickup area. One waiter like that can serve hundreds of tables if he is fast enough.</p><h3><strong>The bottleneck problem</strong></h3><p>Because Node.js has only one main thread, if that thread is busy doing a computation that takes 5 seconds, then during those 5 seconds:</p><ul><li><p>It cannot accept new requests.</p></li><li><p>It cannot respond to tasks that have already been forwarded to the OS and are now ready to return to the event loop.</p></li><li><p>The whole application can appear to freeze.</p></li></ul><p>In contrast, multi-threaded languages like Go or Java can move that work onto another thread on another CPU core, allowing the main thread to keep accepting requests.</p><div><hr></div><h2><strong>2. Can Node Cluster solve CPU-bound problems?</strong></h2><p>By default, Node.js runs JavaScript on a single main thread per process, so a single process cannot automatically use all CPU cores. If your server has 10 cores, then the other 9 cores will mostly sit idle.</p><p>The <strong>Cluster</strong> module is a built-in feature that allows you to create multiple worker processes running in parallel. 
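</p><p>A minimal sketch of that idea, assuming a recent Node.js version (the <code>isPrimary</code> flag replaced <code>isMaster</code> from Node 16 onward). Each worker here just reports its pid and exits; a real server would instead call <code>http.createServer().listen(PORT)</code> in the worker branch so all workers accept connections on the same port:</p><div class="highlighted_code_block" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-javascript">const cluster = require("node:cluster");
const os = require("node:os");

if (cluster.isPrimary) {
  // Fork one worker per CPU core.
  for (const _ of os.cpus()) cluster.fork();
  cluster.on("exit", (worker, code) => {
    console.log(`worker ${worker.process.pid} exited with code ${code}`);
    // A production setup would usually call cluster.fork() again here
    // so a crashed worker is replaced automatically.
  });
} else {
  console.log(`worker ${process.pid} has its own event loop`);
  process.exit(0);
}</code></pre></div><p>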
These processes share the same network port, which makes it possible to distribute the load across multiple CPU cores.</p><ul><li><p><strong>Master Process</strong>: acts as the manager, monitoring and coordinating workers.</p></li><li><p><strong>Worker Process</strong>: individual copies of the application that handle requests directly.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PPCM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3975ff13-ad4d-43ad-94b6-6a589a4becf7_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PPCM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3975ff13-ad4d-43ad-94b6-6a589a4becf7_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!PPCM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3975ff13-ad4d-43ad-94b6-6a589a4becf7_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!PPCM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3975ff13-ad4d-43ad-94b6-6a589a4becf7_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!PPCM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3975ff13-ad4d-43ad-94b6-6a589a4becf7_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PPCM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3975ff13-ad4d-43ad-94b6-6a589a4becf7_1536x1024.png" width="1456" height="971" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3975ff13-ad4d-43ad-94b6-6a589a4becf7_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1701355,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://quangchientran.substack.com/i/196315396?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3975ff13-ad4d-43ad-94b6-6a589a4becf7_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!PPCM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3975ff13-ad4d-43ad-94b6-6a589a4becf7_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!PPCM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3975ff13-ad4d-43ad-94b6-6a589a4becf7_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!PPCM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3975ff13-ad4d-43ad-94b6-6a589a4becf7_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!PPCM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3975ff13-ad4d-43ad-94b6-6a589a4becf7_1536x1024.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"></div></div></a></figure></div><h3><strong>What problem does Cluster solve?</strong></h3><p>Instead of wasting resources, Cluster lets you run a number of processes roughly equal 
to the number of CPU cores. Overall system throughput can increase significantly.</p><p>If one worker crashes because of a bug, the other workers can keep serving traffic. The master process can also be configured to automatically respawn a new worker to replace it.</p><p>When there are multiple workers, if one worker is busy struggling with a heavy CPU-bound task, the operating system and Cluster master can route new requests to other workers that are still free.</p><p>Important note: Cluster does <strong>not</strong> make the heavy computation itself faster, but it prevents one expensive task from bringing down the entire server.</p><h3><strong>Limitations</strong></h3><p>Even though Cluster is powerful, it still has some important limitations:</p><ul><li><p><strong>No shared memory</strong>: each worker is a separate process with its own memory space. You cannot store a global variable in Worker A and expect Worker B to read it. To share data, you need external tools such as Redis or a database.</p></li><li><p><strong>More complex management</strong>: session management becomes harder because a user&#8217;s requests may land on different workers. 
This is usually handled with sticky sessions or a centralized session store.</p></li></ul><h3><strong>Alternative: Worker Threads</strong></h3><p>If Cluster creates multiple independent processes, then <strong>Worker Threads</strong> &#8212; introduced in Node.js v10.5.0 &#8212; let you create multiple threads within the same process.</p><ul><li><p><strong>Cluster</strong> is best for scaling HTTP servers and I/O-bound workloads.</p></li><li><p><strong>Worker Threads</strong> are better for heavy CPU-bound tasks inside a single instance because they can share memory through <code>SharedArrayBuffer</code>, which makes data exchange very fast.</p></li></ul><p>So if you want your Node.js application to perform CPU-bound tasks without blocking the server, you should either use Worker Threads or move that work into a separate service written in a language that is stronger for heavy computation.</p><div><hr></div><h2><strong>3. Why does Node.js start faster than Java?</strong></h2><p>From a practical point of view, Node.js usually starts faster than Java, especially compared to Java applications that use heavyweight frameworks like Spring Boot. That does not mean Java is &#8220;slow.&#8221; It simply means Java and Node.js have different startup models, different initialization costs, and different design philosophies.</p><p>A simple analogy:</p><ul><li><p>Node.js is like a motorcycle: you turn it on and it&#8217;s ready to go.</p></li><li><p>Java is like a bus: before it starts moving, it needs to go through several preparation steps.</p></li></ul><p>This analogy is not meant to say one technology is absolutely better than the other. 
It just highlights that Node.js usually has a lower startup latency.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GYNh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffedeb1ce-3edd-4d22-b1ad-d73b5f8de6bc_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GYNh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffedeb1ce-3edd-4d22-b1ad-d73b5f8de6bc_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!GYNh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffedeb1ce-3edd-4d22-b1ad-d73b5f8de6bc_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!GYNh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffedeb1ce-3edd-4d22-b1ad-d73b5f8de6bc_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!GYNh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffedeb1ce-3edd-4d22-b1ad-d73b5f8de6bc_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GYNh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffedeb1ce-3edd-4d22-b1ad-d73b5f8de6bc_1536x1024.png" width="1456" height="971" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fedeb1ce-3edd-4d22-b1ad-d73b5f8de6bc_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1402732,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://quangchientran.substack.com/i/196315396?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffedeb1ce-3edd-4d22-b1ad-d73b5f8de6bc_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!GYNh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffedeb1ce-3edd-4d22-b1ad-d73b5f8de6bc_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!GYNh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffedeb1ce-3edd-4d22-b1ad-d73b5f8de6bc_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!GYNh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffedeb1ce-3edd-4d22-b1ad-d73b5f8de6bc_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!GYNh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffedeb1ce-3edd-4d22-b1ad-d73b5f8de6bc_1536x1024.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"></div></div></a></figure></div><h3><strong>Interpreted vs Compiled</strong></h3><p>Node.js runs JavaScript on the <strong>V8 engine</strong>. 
V8 does more than just &#8220;read code and run it.&#8221; It uses <strong>JIT (Just-In-Time) compilation</strong>. When the application starts, V8 can parse and execute code very early, then gradually optimize the parts that are used frequently.</p><p>In other words, Node.js does not need much preparation before handling its first request. It can start working quickly and optimize over time.</p><p>Java, on the other hand, runs on the <strong>JVM</strong>, and the JVM typically has to do more work during startup:</p><ul><li><p>Load classes into memory.</p></li><li><p>Verify bytecode.</p></li><li><p>Link required components.</p></li><li><p>Initialize runtime structures such as heap, stack, metaspace, and other internal components.</p></li></ul><p>These steps give Java a very strong foundation for long-running systems, but they also make startup slower than Node.js in many cases.</p><h3><strong>Why does Spring Boot start slower?</strong></h3><p>When comparing Express in Node.js with Spring Boot in Java, the difference becomes even more obvious:</p><ul><li><p><strong>Node.js</strong> follows a minimalistic philosophy. You require only what you need. Things remain relatively lightweight and isolated, and unused components do not need to be initialized early.</p></li><li><p><strong>Java</strong>, especially with frameworks like Spring Boot, often uses annotation scanning and dependency injection. During startup, it has to scan the project for things like <code>@Service</code>, <code>@Controller</code>, and <code>@Component</code>, build the dependency injection container, create and wire beans, apply auto-configuration, and set up multiple layers of framework abstraction.</p></li></ul><p>That is why Spring Boot startup tends to feel heavier. But this is not a flaw &#8212; it is the cost of a powerful and convenient enterprise framework. 
Java was built for systems that need to run for a long time, stay stable, scale well, and handle sustained traffic.</p><p>So the JVM accepts a higher startup cost in exchange for stronger optimization later. Once the system warms up, the JVM can become very fast, especially for workloads that run for a long time and have repetitive patterns.</p><h3><strong>Memory management</strong></h3><p>Another reason Java often feels heavier at startup is memory management. The JVM typically initializes memory areas such as:</p><ul><li><p>Heap</p></li><li><p>Stack</p></li><li><p>Metaspace</p></li><li><p>Garbage Collection structures</p></li></ul><p>In many production systems, Java is also configured with parameters such as <code>-Xms</code> and <code>-Xmx</code> to define the initial and maximum memory size. This helps the system remain stable during long-running execution, but it also increases startup time and initial resource usage.</p><p>Node.js usually has a lighter startup footprint, especially for small or medium applications. However, actual memory usage still depends on code, dependencies, caching, and workload. Saying &#8220;Node.js is always light&#8221; is too absolute, but saying &#8220;<strong>Node.js is usually lighter at startup</strong>&#8221; is reasonable.</p><h3><strong>Meaning for serverless</strong></h3><p>This is where Node.js often has a very clear advantage.</p><p>In serverless environments such as AWS Lambda, startup latency &#8212; or <strong>cold start</strong> &#8212; directly affects user experience. Because Node.js typically starts faster, it is often chosen for:</p><ul><li><p>Short APIs</p></li><li><p>Webhooks</p></li><li><p>Small jobs</p></li><li><p>Simple logic that needs to respond quickly</p></li></ul><p>That said, it is also not correct to say Node.js is &#8220;the winner in all cases&#8221; or &#8220;always the best.&#8221; Java can absolutely be used in serverless, especially when optimized correctly. 
There are also now techniques that significantly reduce Java cold starts (such as GraalVM native images and AWS Lambda SnapStart), so Java is no longer the &#8220;too slow&#8221; option it used to be.</p><div><hr></div><h2><strong>4. What improvements has Java made to compete with Node.js for I/O-bound workloads?</strong></h2><p>Node.js was once considered the strongest choice for I/O-bound workloads, but Java has since made major improvements that close the gap &#8212; and in some cases even pull ahead. The two most important directions are <strong><a href="https://quangchientran.substack.com/i/195542055/reactive-a-solution-that-is-not-easy-to-swallow">Reactive Programming</a> with WebFlux</strong> and <strong><a href="https://open.substack.com/pub/quangchientran/p/java-virtual-threads-the-end-of-the?r=5zk2y9&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Virtual Threads</a></strong> in Java 21.</p><h3><strong>Reactive Programming</strong></h3><p>WebFlux is a <strong>non-blocking reactive</strong> model in the Spring ecosystem. Its goal is to handle many concurrent requests without keeping one blocking thread per request, as in the traditional model.</p><p>The strengths of WebFlux include:</p><ul><li><p>It fits systems with a lot of I/O.</p></li><li><p>It uses system resources very efficiently.</p></li><li><p>It increases throughput when there are many concurrent connections.</p></li><li><p>It is a great fit for streaming workflows or services that call each other frequently.</p></li></ul><p>The way WebFlux works can remind us of Node.js because both follow an event-driven, non-blocking style. 
However, WebFlux is not &#8220;<s>Node.js in Java</s>.&#8221; It is a reactive approach built on the Spring ecosystem, often running on a non-blocking runtime such as Netty.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NKN4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F525d7191-387b-40df-927c-f80f31334c9f_1024x559.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NKN4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F525d7191-387b-40df-927c-f80f31334c9f_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!NKN4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F525d7191-387b-40df-927c-f80f31334c9f_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!NKN4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F525d7191-387b-40df-927c-f80f31334c9f_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!NKN4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F525d7191-387b-40df-927c-f80f31334c9f_1024x559.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NKN4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F525d7191-387b-40df-927c-f80f31334c9f_1024x559.png" width="1024" height="559" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/525d7191-387b-40df-927c-f80f31334c9f_1024x559.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:559,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NKN4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F525d7191-387b-40df-927c-f80f31334c9f_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!NKN4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F525d7191-387b-40df-927c-f80f31334c9f_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!NKN4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F525d7191-387b-40df-927c-f80f31334c9f_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!NKN4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F525d7191-387b-40df-927c-f80f31334c9f_1024x559.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"></div></div></a></figure></div><p>The important thing is that <strong>WebFlux does not automatically solve CPU-bound problems</strong>. If you put heavy computation on the same event loop or structure your reactive pipeline poorly, you can still block the system. 
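The same caveat is familiar to Node.js developers: synchronous CPU-heavy code stalls everything else sharing the event loop. A minimal Node.js demonstration:

```javascript
const scheduledAt = Date.now();
let firedAt = 0;

// This timer is due in 10 ms, but it can only run once the call stack is empty.
setTimeout(() => {
  firedAt = Date.now();
  console.log(`timer fired ${firedAt - scheduledAt} ms after scheduling`);
}, 10);

// Synchronous CPU-bound work: nothing else on the event loop runs meanwhile.
let sum = 0;
for (let i = 0; i < 1e8; i++) sum += i;
const loopDoneAt = Date.now();
```

The timer cannot fire until the loop releases the event loop, so every pending request would wait just as long. Event-loop and reactive models share this weakness; only moving the work off the main thread or process removes it.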
So WebFlux is strong for I/O-bound workloads, but it is not the answer for every kind of workload.</p><p>In short, WebFlux is a good fit when you need:</p><ul><li><p>A high number of concurrent requests.</p></li><li><p>More I/O than computation.</p></li><li><p>End-to-end non-blocking behavior.</p></li><li><p>Efficient thread usage.</p></li></ul><p>The trade-off is that it is more complex. Reactive code is often harder to read, harder to debug, and requires the team to be comfortable with data flow, backpressure, and asynchronous thinking.</p><h3><strong>Virtual Threads</strong></h3><p>Starting with Java 21, Java introduced <strong><a href="https://quangchientran.substack.com/i/195542055/virtual-threads">Virtual Threads</a></strong>, one of the most important features since Java 8, especially with its deep integration into Spring Boot 3.2.</p><p>Virtual Threads let you create <strong>a huge number of lightweight threads, but these threads do not map 1:1 to OS threads</strong>. Instead, they are managed by the JVM scheduler and shared across a smaller number of underlying OS threads.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YNbR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03000895-f04a-4367-bd3b-63f579bb5fc6_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YNbR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03000895-f04a-4367-bd3b-63f579bb5fc6_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!YNbR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03000895-f04a-4367-bd3b-63f579bb5fc6_1536x1024.png 848w, 
https://substackcdn.com/image/fetch/$s_!YNbR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03000895-f04a-4367-bd3b-63f579bb5fc6_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!YNbR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03000895-f04a-4367-bd3b-63f579bb5fc6_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YNbR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03000895-f04a-4367-bd3b-63f579bb5fc6_1536x1024.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/03000895-f04a-4367-bd3b-63f579bb5fc6_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!YNbR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03000895-f04a-4367-bd3b-63f579bb5fc6_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!YNbR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03000895-f04a-4367-bd3b-63f579bb5fc6_1536x1024.png 848w, 
https://substackcdn.com/image/fetch/$s_!YNbR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03000895-f04a-4367-bd3b-63f579bb5fc6_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!YNbR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03000895-f04a-4367-bd3b-63f579bb5fc6_1536x1024.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"></div></div></a></figure></div><p>This brings a major benefit:</p><ul><li><p>You can still write code in the familiar blocking style.</p></li><li><p>But the resource cost is much lower than traditional threads.</p></li><li><p>When I/O happens, a virtual thread can be paused so the carrier thread can be used by another task.</p></li></ul><p>The best thing about Virtual Threads is that they allow Java to <strong>keep code simple while still scaling well for I/O-heavy workloads</strong>. That is why many developers &#8212; myself included &#8212; consider it one of the most important Java improvements in years.</p><p>Compared with reactive programming, Virtual Threads are usually easier for most developers to adopt because:</p><ul><li><p>You do not need to fully switch to reactive thinking.</p></li><li><p>You do not need to chain callbacks or pipelines everywhere.</p></li><li><p>The code looks close to traditional synchronous style.</p></li></ul><p>That said, Virtual Threads are still not a silver bullet for everything. 
They help a lot with I/O-bound workloads, but for heavy CPU-bound tasks, you still need proper architecture, task splitting, or parallel execution where appropriate.</p><h3><strong>Compared with Node.js</strong></h3><p>If we compare modern Java with Node.js, the discussion is no longer as simple as &#8220;<s>Node.js is faster than Java for I/O-bound workloads</s>.&#8221;</p><p>Node.js is still very strong in:</p><ul><li><p>Fast startup</p></li><li><p>A simple event loop model</p></li><li><p>A unified JavaScript ecosystem from backend to frontend</p></li><li><p>Small, lightweight services that need fast responses</p></li></ul><p>Meanwhile, Java now offers:</p><ul><li><p>WebFlux for reactive, non-blocking programming</p></li><li><p>Virtual Threads for simple code that still scales well</p></li><li><p>A highly optimized JVM for long-running systems</p></li></ul><p>So Java has become a very competitive option for I/O-bound workloads, especially when the team wants readable code, maintainability, and the strength of the Spring ecosystem.</p><div><hr></div><h2><strong>5. Why is JavaScript different on the frontend and backend?</strong></h2><p>I&#8217;m not the only one who has asked this question. It is a very natural one, and it reveals a common misunderstanding: many people assume that if it is all JavaScript, it should behave the same everywhere. In reality, JavaScript itself is not the whole story. 
The <strong>runtime</strong> and the <strong>execution environment</strong> matter just as much.</p><p>JavaScript on the frontend and backend uses the same language, but they run in two different worlds:</p><ul><li><p>The frontend runs in the <strong>browser runtime</strong>.</p></li><li><p>The backend runs in the <strong>Node.js runtime</strong>.</p></li></ul><p>That difference is what gives them very different capabilities and limitations.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Xanj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35e0f557-8826-4828-9cd5-54648123fcfa_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Xanj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35e0f557-8826-4828-9cd5-54648123fcfa_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!Xanj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35e0f557-8826-4828-9cd5-54648123fcfa_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!Xanj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35e0f557-8826-4828-9cd5-54648123fcfa_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!Xanj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35e0f557-8826-4828-9cd5-54648123fcfa_1536x1024.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Xanj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35e0f557-8826-4828-9cd5-54648123fcfa_1536x1024.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/35e0f557-8826-4828-9cd5-54648123fcfa_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Minimal cover showing frontend and backend JavaScript worlds&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Minimal cover showing frontend and backend JavaScript worlds" title="Minimal cover showing frontend and backend JavaScript worlds" srcset="https://substackcdn.com/image/fetch/$s_!Xanj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35e0f557-8826-4828-9cd5-54648123fcfa_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!Xanj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35e0f557-8826-4828-9cd5-54648123fcfa_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!Xanj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35e0f557-8826-4828-9cd5-54648123fcfa_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!Xanj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35e0f557-8826-4828-9cd5-54648123fcfa_1536x1024.png 
1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"></div></div></a></figure></div><h3><strong>Frontend and backend use different runtimes</strong></h3><p>When you write frontend code with React, Vue, Angular, or plain JavaScript, your code runs in the browser. The browser provides many APIs for user interaction and UI work, such as:</p><ul><li><p><code>alert()</code></p></li><li><p><code>window</code></p></li><li><p><code>document</code></p></li><li><p>DOM manipulation</p></li><li><p>Mouse, keyboard, and scroll events</p></li></ul><p>That is why <code>alert()</code> works in the frontend. It is part of the <strong>browser API</strong>, not part of JavaScript itself.</p><p>By contrast, when JavaScript runs in Node.js, it does not have objects like <code>window</code> or <code>document</code>. Node.js is designed for server environments, so it provides different APIs, such as:</p><ul><li><p>Working with the filesystem</p></li><li><p>Handling network operations</p></li><li><p>Creating servers</p></li><li><p>Reading environment variables</p></li><li><p>Accessing process information</p></li><li><p>Using other server-side libraries</p></li></ul><p>That is why you cannot call <code>alert()</code> in a Node.js backend. 
That API simply does not exist in that runtime.</p><h3><strong>Node.js is not &#8220;just JavaScript&#8221;</strong></h3><p>A common misconception is that Node.js is simply &#8220;<strong>JavaScript running somewhere else</strong>.&#8221; In reality, Node.js is a <strong>runtime environment</strong> for JavaScript.</p><p>Besides the JavaScript engine itself, Node.js also comes with:</p><ul><li><p>A <strong>runtime</strong> for executing code</p></li><li><p>A <strong>standard library</strong> for system-level tasks</p></li><li><p><code>npm</code>, the package manager that ships with the ecosystem</p></li></ul><p>Because of that, backend JavaScript can do things the browser cannot, such as reading files, creating a TCP server, or connecting to a database.</p><h3><strong>Why can&#8217;t the browser connect directly to a database?</strong></h3><p>Frontend applications cannot &#8212; and should not &#8212; connect directly to a production database for several reasons.</p><p>The first reason is <strong>security</strong>. If the browser could connect directly to the database, you would have to expose database credentials on the client side. That would be extremely dangerous, because users could inspect, modify, or abuse those credentials.</p><p>The second reason is <strong>system architecture</strong>. In modern web applications, the frontend and the database should not talk directly. 
Instead, the flow usually looks like this:</p><blockquote><p><strong>Frontend &#8594; Backend API &#8594; Database &#8594; Backend API &#8594; Frontend</strong></p></blockquote><p>This approach helps:</p><ul><li><p>Protect sensitive credentials.</p></li><li><p>Control access permissions.</p></li><li><p>Handle validation.</p></li><li><p>Improve logging and auditing.</p></li><li><p>Keep client and server responsibilities separate.</p></li></ul><p>So the most accurate thing to remember is this:</p><blockquote><p><strong>It is not JavaScript itself that decides what you can do &#8212; it is the runtime and the execution environment.</strong></p></blockquote><div><hr></div><h2><strong>Conclusion</strong></h2><p>Looking back at these five questions, the most important thing is not deciding whether Node.js or Java is &#8220;better.&#8221; <strong>The real point is that each platform is strong in a different kind of workload</strong>. Node.js shines with the event loop, non-blocking I/O, fast startup, and a unified frontend-backend JavaScript ecosystem. Java, on the other hand, has come a long way with WebFlux and Virtual Threads, making it much more competitive for I/O-bound workloads than it used to be.</p><p>If you understand the real nature of the runtime, event loop, cluster, virtual threads, and the limitations of the browser compared with Node.js, it becomes much easier to choose the right technology for each situation. 
And instead of asking which is &#8220;better,&#8221; the more valuable question is: <strong>for this workload, this team, and this set of requirements, which choice is the least risky and the easiest to operate?</strong></p><p><em>(If you enjoy these kinds of engineering stories, you can subscribe to receive the next ones.)</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://quangchientran.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://quangchientran.substack.com/subscribe?"><span>Subscribe now</span></a></p><p></p>]]></content:encoded></item><item><title><![CDATA[Under the Hood: A Deep Dive into Processes, Threads, and CPU Architecture]]></title><description><![CDATA[A Comprehensive Guide to Processes, Threads, Multitasking, and CPU Cache Architecture.]]></description><link>https://quangchientran.substack.com/p/under-the-hood-a-deep-dive-into-processes</link><guid isPermaLink="false">https://quangchientran.substack.com/p/under-the-hood-a-deep-dive-into-processes</guid><dc:creator><![CDATA[Quang Chien TRAN]]></dc:creator><pubDate>Tue, 28 Apr 2026 05:16:44 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!UtWo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653dee67-898e-4f75-9d62-fe8f0abda64f_1024x572.png" length="0" type="image/png"/><content:encoded><![CDATA[<p><strong>Foundational knowledge</strong> has always been an <strong>important</strong> part of the programming world, even now that <strong>AI</strong> has crept into almost every part of a programmer&#8217;s job. 
By understanding the <strong>fundamentals</strong>, you will find it easier to develop, debug, and use AI more effectively.</p><p>Today, I will take a <strong>deep dive</strong> into <strong>Process</strong> and <strong>Thread</strong>, two concepts that live at the operating system layer. They are foundational knowledge you must master when working with computers, because they explain what really happens underneath when an application runs. Let&#8217;s get started.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UtWo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653dee67-898e-4f75-9d62-fe8f0abda64f_1024x572.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UtWo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653dee67-898e-4f75-9d62-fe8f0abda64f_1024x572.png 424w, https://substackcdn.com/image/fetch/$s_!UtWo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653dee67-898e-4f75-9d62-fe8f0abda64f_1024x572.png 848w, https://substackcdn.com/image/fetch/$s_!UtWo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653dee67-898e-4f75-9d62-fe8f0abda64f_1024x572.png 1272w, https://substackcdn.com/image/fetch/$s_!UtWo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653dee67-898e-4f75-9d62-fe8f0abda64f_1024x572.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!UtWo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653dee67-898e-4f75-9d62-fe8f0abda64f_1024x572.png" width="1024" height="572" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/653dee67-898e-4f75-9d62-fe8f0abda64f_1024x572.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:572,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:708919,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://quangchientran.substack.com/i/194205408?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653dee67-898e-4f75-9d62-fe8f0abda64f_1024x572.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UtWo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653dee67-898e-4f75-9d62-fe8f0abda64f_1024x572.png 424w, https://substackcdn.com/image/fetch/$s_!UtWo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653dee67-898e-4f75-9d62-fe8f0abda64f_1024x572.png 848w, https://substackcdn.com/image/fetch/$s_!UtWo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653dee67-898e-4f75-9d62-fe8f0abda64f_1024x572.png 1272w, https://substackcdn.com/image/fetch/$s_!UtWo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653dee67-898e-4f75-9d62-fe8f0abda64f_1024x572.png 1456w" 
sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"></div></div></a></figure></div><h2><strong>What is a Process?</strong></h2><p>Simply put, a <strong>process</strong> is a program that is being executed on a computer. A single program can create many different <strong>processes</strong>, and many processes from different programs can coexist on a computer at the same time.</p><p>To make this concrete: on Windows, when you run a web browser like Chrome, it creates a separate <strong>process</strong> for each tab, extension, and system component, and when you launch a game, it runs in its own process.</p><p>Each <strong>process</strong> has its own <strong>process ID</strong> and its own data and state. Each process works in its own <strong>memory space</strong>, and cannot directly access the data of another process unless there is a sharing mechanism allowed by the operating system (<strong>IPC - Inter-Process Communication</strong>).</p><p>To store the data of a process, the OS uses a data structure called the <strong>Process Control Block (PCB)</strong>. Each PCB is associated with a separate <strong>PID</strong>. The PCB includes the following information:</p><ul><li><p><strong>Process ID (PID)</strong>: An integer that identifies the process.</p></li><li><p><strong>State</strong>: The current state of the process. A process isn't always "Running." It might be <strong>Ready</strong> (waiting for its turn), <strong>Waiting</strong> (waiting for you to click something or a file to load), or <strong>Terminated</strong>.</p></li><li><p><strong>Pointer</strong>: Information linking to related processes.</p></li><li><p><strong>Priority</strong>: The priority of the process, which helps the scheduler determine the execution order.</p></li><li><p><strong>Program Counter</strong>: A pointer storing the address of the next instruction to be executed by the process. 
This is vital for <strong>Context Switching</strong>. Since the CPU jumps between processes thousands of times per second, the PC acts like a "bookmark" so the CPU knows exactly where it left off when it returns to that process.</p></li><li><p><strong>CPU Registers</strong>: The registers the process needs to use for execution. These store the temporary data (the "math" being done at that exact microsecond).</p></li><li><p><strong>I/O Information</strong>: Information about the read/write devices the process needs to use.</p></li><li><p><strong>Accounting Information</strong>: Statistics about CPU usage, such as the CPU time consumed and identifiers used for accounting.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0NeI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94bc2df9-b22d-4c6e-b967-31bf4fe4d50b_1024x559.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0NeI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94bc2df9-b22d-4c6e-b967-31bf4fe4d50b_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!0NeI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94bc2df9-b22d-4c6e-b967-31bf4fe4d50b_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!0NeI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94bc2df9-b22d-4c6e-b967-31bf4fe4d50b_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!0NeI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94bc2df9-b22d-4c6e-b967-31bf4fe4d50b_1024x559.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0NeI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94bc2df9-b22d-4c6e-b967-31bf4fe4d50b_1024x559.png" width="1024" height="559" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/94bc2df9-b22d-4c6e-b967-31bf4fe4d50b_1024x559.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:559,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:645687,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://quangchientran.substack.com/i/194205408?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94bc2df9-b22d-4c6e-b967-31bf4fe4d50b_1024x559.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!0NeI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94bc2df9-b22d-4c6e-b967-31bf4fe4d50b_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!0NeI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94bc2df9-b22d-4c6e-b967-31bf4fe4d50b_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!0NeI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94bc2df9-b22d-4c6e-b967-31bf4fe4d50b_1024x559.png 1272w, 
https://substackcdn.com/image/fetch/$s_!0NeI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94bc2df9-b22d-4c6e-b967-31bf4fe4d50b_1024x559.png 1456w" sizes="100vw"></picture><div class="image-link-expand"></div></div></a></figure></div></li></ul><h2><strong>What Is a Thread?</strong></h2><p>A thread is a lightweight unit of execution within a process. If a process is a house, threads are the people living inside&#8212;they share common spaces like the <strong>heap</strong> and the process <strong>address space</strong>, but each thread has its own private <strong>stack</strong> and <strong>register state</strong>.</p><ul><li><p><strong>Smallest unit</strong>: It is the smallest sequence of programmed instructions that a scheduler can manage independently.</p></li><li><p><strong>Shared resources</strong>: Threads share code, data, and OS resources such as open files. This makes communication fast but also creates <strong>race conditions</strong> when two threads access and modify the same data without proper synchronization.</p></li><li><p><strong>Efficiency</strong>: Context switching between threads is usually cheaper than switching between processes because the address space often stays the same.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lIQF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12e3cd7c-3f23-4263-9d03-085f7a9af18a_1024x559.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lIQF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12e3cd7c-3f23-4263-9d03-085f7a9af18a_1024x559.png 424w, 
https://substackcdn.com/image/fetch/$s_!lIQF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12e3cd7c-3f23-4263-9d03-085f7a9af18a_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!lIQF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12e3cd7c-3f23-4263-9d03-085f7a9af18a_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!lIQF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12e3cd7c-3f23-4263-9d03-085f7a9af18a_1024x559.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lIQF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12e3cd7c-3f23-4263-9d03-085f7a9af18a_1024x559.png" width="1024" height="559" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/12e3cd7c-3f23-4263-9d03-085f7a9af18a_1024x559.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:559,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:711295,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://quangchientran.substack.com/i/194205408?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12e3cd7c-3f23-4263-9d03-085f7a9af18a_1024x559.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!lIQF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12e3cd7c-3f23-4263-9d03-085f7a9af18a_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!lIQF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12e3cd7c-3f23-4263-9d03-085f7a9af18a_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!lIQF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12e3cd7c-3f23-4263-9d03-085f7a9af18a_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!lIQF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12e3cd7c-3f23-4263-9d03-085f7a9af18a_1024x559.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"></div></div></a></figure></div><h4>Hardware vs. Software Threads</h4><ul><li><p><strong>Hardware threads</strong>: These are execution contexts exposed by the CPU. For example, an Apple M4 with 10 cores provides 10 hardware threads. This determines how many threads can run in true parallelism. A &#8220;Hardware Thread&#8221; is essentially a set of registers on the CPU core that allows it to hold the state of a software thread.</p></li><li><p><strong>OS/software threads</strong>: These are managed by the kernel. You can create many software threads, and the OS time-slices them across the available hardware threads.</p></li></ul><p><strong>The Java Example</strong></p><p>Historically, one Java thread usually mapped to one OS thread (platform threads). 
Modern <strong>virtual threads</strong> (Project Loom) let many Java threads run on a smaller number of OS threads, which helps high-scale applications become more efficient.</p><h4>How Is a Thread Created?</h4><p>When a thread is created, it gets its own execution state, including an <strong>Instruction Pointer (IP)</strong>, which determines the location of the next instruction the thread will execute.</p><p>Thanks to this separate execution state, when the CPU performs a <strong>Context Switch</strong> between threads, each thread can continue right from where it left off instead of starting over from the beginning.</p><h4>The Thread Control Block (TCB)</h4><p>Just as a process has a PCB, a thread typically has a <strong>TCB</strong> or an equivalent thread-specific structure. It is usually smaller and lighter than a PCB.</p><p>A TCB may contain:</p><ul><li><p>Thread ID.</p></li><li><p>Stack Pointer.</p></li><li><p>Instruction Pointer.</p></li><li><p>State.</p></li><li><p>Register values.</p></li></ul><p><strong>Why It Is &#8220;Cheaper&#8221;</strong></p><p>Switching between threads in the same process is usually faster than switching between processes because the OS does not need to switch to a different address space. Threads in the same process share the same memory map, so the overhead is lower.</p><h2><strong>Multi-threading</strong></h2><p><strong>Multi-threading</strong> is the ability of a CPU or a single process to provide multiple threads of execution concurrently. 
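For instance, here is a minimal Python sketch (the names and counts are illustrative, not from any particular project) of one process running several threads that all update the same heap data, with a lock to keep the updates safe:

```python
import threading

counter = 0              # shared data on the process heap
lock = threading.Lock()  # prevents a race condition on the update

def worker(iterations):
    global counter
    for _ in range(iterations):
        with lock:       # only one thread increments at a time
            counter += 1

# Four software threads inside a single process, all sharing `counter`.
threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000; without the lock, lost updates could make it smaller
```

Each thread still has its own stack and register state; only the heap object `counter` is shared between them.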
Instead of following just one line of instructions, the process can split into multiple execution paths.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!297p!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F889812cf-1885-482a-9fff-600fb9447e5d_1024x559.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!297p!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F889812cf-1885-482a-9fff-600fb9447e5d_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!297p!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F889812cf-1885-482a-9fff-600fb9447e5d_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!297p!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F889812cf-1885-482a-9fff-600fb9447e5d_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!297p!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F889812cf-1885-482a-9fff-600fb9447e5d_1024x559.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!297p!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F889812cf-1885-482a-9fff-600fb9447e5d_1024x559.png" width="1024" height="559" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/889812cf-1885-482a-9fff-600fb9447e5d_1024x559.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:559,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:783921,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://quangchientran.substack.com/i/194205408?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F889812cf-1885-482a-9fff-600fb9447e5d_1024x559.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!297p!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F889812cf-1885-482a-9fff-600fb9447e5d_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!297p!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F889812cf-1885-482a-9fff-600fb9447e5d_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!297p!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F889812cf-1885-482a-9fff-600fb9447e5d_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!297p!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F889812cf-1885-482a-9fff-600fb9447e5d_1024x559.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"></div></div></a></figure></div><h4>Advantages</h4><ul><li><p><strong>Non-blocking UI</strong>: Essential for modern applications. 
The main thread handles user input such as clicks and scrolling, while worker threads handle heavy tasks such as API calls or database queries.</p></li><li><p><strong>Better resource utilization</strong>: On a multi-core processor, multi-threading allows the OS to run multiple tasks in parallel when possible.</p></li><li><p><strong>Economy</strong>: Threads are cheaper to create than processes because they do not require a completely new memory space. They share the existing heap within the same process.</p></li></ul><h4>Disadvantages</h4><ul><li><p><strong>Race conditions</strong>: These happen when two threads read or write the same shared variable at the same time.</p><ul><li><p>Example: both threads see a balance of $100, both add $10, and instead of getting $120, the final result might be $110 because one update overwrites the other.</p></li></ul></li><li><p><strong>Deadlock</strong>: Thread A holds Resource 1 and waits for Resource 2, while Thread B holds Resource 2 and waits for Resource 1. Both wait forever.</p></li><li><p><strong>Starvation</strong>: Low-priority threads may never get CPU time if higher-priority threads keep taking the processor.</p></li><li><p><strong>Testing complexity</strong>: Multi-threaded bugs are often <strong>Heisenbugs</strong> &#8212; they disappear when you try to observe them because timing changes during debugging.</p></li><li><p><strong>The invisible cost</strong>: Context switching and memory synchronization, such as <code>volatile</code> or <code>synchronized</code> in Java, add overhead that can make a simple program slower than a single-threaded one.</p></li></ul><h4>Why It Does Not Scale Forever</h4><p>Multi-threading does not always make programs faster. This is formally described by <strong>Amdahl&#8217;s Law</strong>, which says that the speedup of a program is limited by its serial part &#8212; the part that cannot be parallelized. 
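The law is easy to check numerically; here is a small sketch (the function name is just for illustration):

```python
def amdahl_speedup(parallel_fraction, workers):
    """Amdahl's Law: overall speedup when `workers` execute the
    parallelizable fraction of a program concurrently."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / workers)

# With 20% serial code (80% parallel), adding workers approaches a 5x ceiling:
print(amdahl_speedup(0.8, 4))     # 2.5
print(amdahl_speedup(0.8, 1000))  # about 4.98, never reaching 5
```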
If 20% of your code must still run sequentially, then your program can never be more than 5x faster, no matter how many threads you add. You can find more information about this law online or by asking an AI; it&#8217;s quite easy to understand.</p><p>Another hidden cost is <strong>context switching overhead</strong>. When the CPU moves from Thread A to Thread B, it must save the current state, such as registers and the stack pointer, and then load the new one. If you have too many threads, the CPU may spend more time switching than doing useful work.</p><h2><strong>Multi-process Model</strong></h2><p>Multi-process is a model in which a program or system uses multiple independent processes to handle work. Each process lives in its own isolated virtual address space, has its own state, and does not directly share data with other processes the way threads do. Because of this isolation, when one process crashes (for example with a segmentation fault), the other processes can often continue running normally.</p><p>Simply put, multi-process is like splitting a large application into multiple separate &#8220;work rooms&#8221;. Each room has its own task, its own people, its own documents, and is less dependent on the others. 
This makes the system safer and easier to isolate.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5Voh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92647f9b-cb50-45b5-a987-753426483da2_1024x559.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5Voh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92647f9b-cb50-45b5-a987-753426483da2_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!5Voh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92647f9b-cb50-45b5-a987-753426483da2_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!5Voh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92647f9b-cb50-45b5-a987-753426483da2_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!5Voh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92647f9b-cb50-45b5-a987-753426483da2_1024x559.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5Voh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92647f9b-cb50-45b5-a987-753426483da2_1024x559.png" width="1024" height="559" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/92647f9b-cb50-45b5-a987-753426483da2_1024x559.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:559,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:783713,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://quangchientran.substack.com/i/194205408?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92647f9b-cb50-45b5-a987-753426483da2_1024x559.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5Voh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92647f9b-cb50-45b5-a987-753426483da2_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!5Voh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92647f9b-cb50-45b5-a987-753426483da2_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!5Voh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92647f9b-cb50-45b5-a987-753426483da2_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!5Voh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92647f9b-cb50-45b5-a987-753426483da2_1024x559.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"></div></div></a></figure></div><p>Examples:</p><ul><li><p>A browser can use multiple processes for tabs, extensions, and network.</p></li><li><p>A server like Nginx or PostgreSQL (<a 
href="https://open.substack.com/pub/quangchientran/p/deep-inside-postgres-processes-forking?r=5zk2y9&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Process by Connection mechanism</a>) can also use multiple processes to handle different tasks.</p></li><li><p>A Python program can create multiple processes to handle heavy tasks on multiple CPU cores.</p></li></ul><h4>Why use multi-process?</h4><p>Use multi-process when:</p><ul><li><p>The work can be broken down into many independent parts.</p></li><li><p>You want to take advantage of multiple CPU cores to improve performance.</p></li><li><p>You want fault isolation, so that a failure in one part does not affect the entire application.</p></li><li><p>You want to avoid the limitations of threads in some runtimes or languages.</p></li></ul><h4>The GIL (Global Interpreter Lock):</h4><p>In languages like Python or Ruby, a global interpreter lock (GIL/GVL) prevents multiple threads from executing bytecode at the same time. To get true parallelism on a multi-core CPU, developers often use multi-processing instead of multi-threading.</p><h4>Advantages:</h4><ul><li><p>Better resource isolation than multi-threading.</p></li><li><p>One process failure does not bring down the entire system.</p></li><li><p>Can truly utilize multiple CPU cores.</p></li><li><p>Good fit for CPU-bound tasks.</p></li></ul><h4>Copy-on-Write (CoW):</h4><p>Although each process has its own memory, modern operating systems use a trick called Copy-on-Write. When a process forks, the OS does not immediately copy all of its memory. Both processes share the same physical memory pages until one of them writes to a page. Only then is that page actually copied. 
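A tiny Python sketch of this isolation (the variable and function names are illustrative): the child process modifies its own view of the data, and the parent never observes the write. On Linux the child starts with copy-on-write pages from a fork; on platforms that spawn a fresh interpreter instead, the isolation is the same.

```python
import multiprocessing

data = {"balance": 100}  # lives in the parent's address space

def drain():
    # Runs in the child process: under fork, this write triggers
    # copy-on-write, so only the child's page is copied and changed.
    data["balance"] = 0

if __name__ == "__main__":
    p = multiprocessing.Process(target=drain)
    p.start()
    p.join()
    print(data["balance"])  # still 100: the parent's memory is untouched
```

Contrast this with the threading case, where a worker writing to a shared dictionary would be visible to everyone immediately.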
This makes multi-processing more efficient than it might sound at first.</p><h4>Disadvantages:</h4><ul><li><p>Creating and managing processes is more expensive than creating and managing threads, because spawning a process requires heavier system calls to the kernel.</p></li><li><p>Communication between processes is more complex, because their memory spaces are separate.</p></li><li><p>It consumes more memory, because each process still has its own stack, heap, and loaded libraries.</p></li></ul><h4>IPC complexity:</h4><p>Because processes cannot directly see each other&#8217;s memory, they must use Inter-Process Communication (IPC) mechanisms:</p><ul><li><p>Pipes &amp; sockets: sending data like a stream or a phone call.</p></li><li><p>Shared memory: setting up a common memory region that both processes can access.</p></li><li><p>Message queues: leaving messages in a mailbox for other processes to read later.</p></li></ul><h2><strong>Multitasking</strong></h2><p>Multitasking is the ability of an operating system to manage multiple tasks, such as processes or threads, concurrently. Even on a machine with multiple cores, the operating system still relies on multitasking to make many programs appear to run at the same time. In practice, the OS divides CPU time into small slices and alternates between tasks so quickly that it creates the illusion of full simultaneity.</p><p>At the center of multitasking is <strong>Context Switching</strong>. A context switch happens when the CPU stops running one process or thread and switches to another. 
Before switching, the operating system must save the current execution state, and when the task runs again later, it restores that state so execution can continue from exactly where it left off.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QJk-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4de1779-0c62-407d-a895-b268e7c81b43_1024x559.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QJk-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4de1779-0c62-407d-a895-b268e7c81b43_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!QJk-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4de1779-0c62-407d-a895-b268e7c81b43_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!QJk-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4de1779-0c62-407d-a895-b268e7c81b43_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!QJk-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4de1779-0c62-407d-a895-b268e7c81b43_1024x559.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QJk-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4de1779-0c62-407d-a895-b268e7c81b43_1024x559.png" width="1024" height="559" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a4de1779-0c62-407d-a895-b268e7c81b43_1024x559.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:559,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:807589,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://quangchientran.substack.com/i/194205408?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4de1779-0c62-407d-a895-b268e7c81b43_1024x559.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QJk-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4de1779-0c62-407d-a895-b268e7c81b43_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!QJk-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4de1779-0c62-407d-a895-b268e7c81b43_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!QJk-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4de1779-0c62-407d-a895-b268e7c81b43_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!QJk-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4de1779-0c62-407d-a895-b268e7c81b43_1024x559.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"></div></div></a></figure></div><p>You can think of the CPU like a chef cooking many meals at once. 
Before moving away from one dish, the chef remembers the temperature, timer, and cooking status. When returning later, the chef checks those notes and continues from the right point instead of starting over.</p><p>What gets saved during a context switch usually includes:</p><ul><li><p><strong>Program Counter</strong>: the next instruction to execute.</p></li><li><p><strong>CPU Registers</strong>: temporary values currently being used by the CPU.</p></li><li><p><strong>State information</strong>: whether the task is Ready, Running, Waiting, or another state.</p></li><li><p><strong>I/O information</strong>: any relevant input/output details.</p></li><li><p><strong>Accounting information</strong>: usage statistics such as CPU time.</p></li></ul><h4>The Cost of Switching</h4><p>Context switching is essential, but it is not free. It adds overhead, which means the CPU spends time on housekeeping instead of useful work. Saving and restoring state takes time, and frequent switches can reduce overall performance.</p><p>There is also a cache effect. When the CPU switches from one task to another, the cache may still contain data from the previous task. The new task may suffer cache misses and need to fetch data from the slower main memory, which adds more delay. This is one reason why too many context switches can hurt performance.</p><h4>Preemptive vs. Cooperative</h4><p>Modern operating systems usually use <strong>preemptive multitasking</strong>. In this model, the OS is in control and can forcibly stop a task when its time slice expires. This helps keep the system responsive and prevents one task from monopolizing the CPU.</p><p>Older systems sometimes used <strong>cooperative multitasking</strong>. In that model, tasks had to voluntarily give up control. If one task froze or misbehaved, the whole system could become unresponsive. 
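The cooperative model can be sketched with Python generators (a toy model, not how a real kernel schedules): each task runs only until it voluntarily yields, and the "scheduler" simply round-robins over whatever is left.

```python
def task(name, steps):
    for i in range(steps):
        yield f"{name} step {i}"  # voluntarily give up control

def cooperative_scheduler(tasks):
    """Round-robin over tasks; each runs until its next `yield`.
    A task that never yields would block everything, which is the
    classic failure mode of cooperative multitasking."""
    log, queue = [], list(tasks)
    while queue:
        current = queue.pop(0)
        try:
            log.append(next(current))
        except StopIteration:
            continue           # task finished; drop it
        queue.append(current)  # otherwise, back of the queue
    return log

log = cooperative_scheduler([task("A", 2), task("B", 2)])
print(log)  # ['A step 0', 'B step 0', 'A step 1', 'B step 1']
```

If `task` contained an infinite loop with no `yield`, the other generator would never run again; a preemptive OS avoids exactly this by interrupting tasks on a timer.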
That is why preemptive multitasking became the standard in modern operating systems.</p><h4>Hardware Support</h4><p>Modern CPUs also provide hardware features that help context switching happen more efficiently. The CPU can save and restore register state quickly, which reduces some of the cost of switching. Even so, context switching still has a real performance price, especially when it happens too often.</p><p>In short, multitasking is the big picture, while context switching is the mechanism that makes it work. Multitasking allows many tasks to share one CPU over time, while context switching is the actual process of moving from one task to another and back again.</p><h2><strong>Scheduler</strong></h2><p>The <strong>scheduler</strong> is a core component of the operating system kernel that decides which process or thread gets to run on the CPU at any given moment. In simple terms, it is like the director of a stage performance: it decides who gets on stage first, who has to wait, and how long each actor gets to perform.</p><p>The scheduler does not just choose <em>what</em> runs, but also <em>for how long</em>. 
When a task&#8217;s time slice ends, or when it needs to wait for I/O, the scheduler moves the CPU to another task so the system stays responsive and does not waste CPU time.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oeMX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfdd86ba-4f18-4b6c-9c28-14fe9abbd22d_1024x559.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oeMX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfdd86ba-4f18-4b6c-9c28-14fe9abbd22d_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!oeMX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfdd86ba-4f18-4b6c-9c28-14fe9abbd22d_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!oeMX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfdd86ba-4f18-4b6c-9c28-14fe9abbd22d_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!oeMX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfdd86ba-4f18-4b6c-9c28-14fe9abbd22d_1024x559.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oeMX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfdd86ba-4f18-4b6c-9c28-14fe9abbd22d_1024x559.png" width="1024" height="559" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dfdd86ba-4f18-4b6c-9c28-14fe9abbd22d_1024x559.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:559,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:823975,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://quangchientran.substack.com/i/194205408?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfdd86ba-4f18-4b6c-9c28-14fe9abbd22d_1024x559.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!oeMX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfdd86ba-4f18-4b6c-9c28-14fe9abbd22d_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!oeMX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfdd86ba-4f18-4b6c-9c28-14fe9abbd22d_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!oeMX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfdd86ba-4f18-4b6c-9c28-14fe9abbd22d_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!oeMX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfdd86ba-4f18-4b6c-9c28-14fe9abbd22d_1024x559.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"></div></div></a></figure></div><h4>The Three Levels of Scheduling</h4><p>In modern operating systems, scheduling is usually split into three levels:</p><ul><li><p><strong>Long-term 
scheduler</strong>: decides which jobs are admitted into the system from disk into memory. It controls the degree of multiprogramming.</p></li><li><p><strong>Short-term scheduler</strong>: also called the <strong>CPU scheduler</strong>, this is the one that picks a task from the Ready Queue and gives it CPU time. It runs very frequently and must be extremely fast.</p></li><li><p><strong>Medium-term scheduler</strong>: handles <strong>swapping</strong>. When memory is under pressure, it can temporarily move a process out of RAM and bring it back later.</p></li></ul><p>These three roles work together to balance performance, memory usage, and responsiveness.</p><h4>Process States and Queues</h4><p>A process usually moves through a small number of states:</p><ul><li><p><strong>New</strong>: the process is being created.</p></li><li><p><strong>Ready</strong>: the process is ready to run and is waiting in the Ready Queue.</p></li><li><p><strong>Running</strong>: the process is currently using a CPU core.</p></li><li><p><strong>Waiting</strong> or <strong>Blocked</strong>: the process cannot run because it is waiting for a slow event, such as disk I/O or user input.</p></li><li><p><strong>Terminated</strong>: the process has finished and is being cleaned up.</p></li></ul><p>A common point of confusion is that a process usually does <strong>not</strong> go directly from Waiting to Running. 
It must first return to the Ready Queue and wait for the short-term scheduler to pick it again.</p><h4>Scheduling Queues</h4><p>The operating system uses different queues to organize processes and threads:</p><ul><li><p><strong>Job Queue</strong>: contains jobs that have not yet been admitted into memory.</p></li><li><p><strong>Ready Queue</strong>: contains processes or threads that are ready to run on the CPU.</p></li><li><p><strong>Device Queue</strong>: contains processes or threads waiting for I/O devices such as disk, network, or other peripherals.</p></li></ul><p>When a process changes state, it moves to the appropriate queue. For example, if a running process needs to read from disk, it leaves the CPU and moves to the Device Queue. When the I/O completes, it returns to the Ready Queue and waits for CPU time again.</p><h4>Scheduling Criteria</h4><p>When choosing a scheduling algorithm, the operating system usually balances several goals:</p><ul><li><p><strong>CPU utilization</strong>: keep the CPU busy as much as possible.</p></li><li><p><strong>Throughput</strong>: finish as many jobs as possible in a given time.</p></li><li><p><strong>Turnaround time</strong>: reduce the time from New to Terminated.</p></li><li><p><strong>Response time</strong>: reduce the delay between a user action and the first visible reaction.</p></li><li><p><strong>Fairness</strong>: ensure that no task is starved of CPU for too long.</p></li><li><p><strong>Waiting time</strong>: reduce the time a task spends waiting before it gets CPU time.</p></li></ul><p>The best choice depends on the system&#8217;s goal. 
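</p><p>Two of these criteria can be computed by hand. A small sketch, assuming all tasks arrive at time 0 and run to completion in arrival order (the First Come, First Serve policy); the burst times are illustrative round numbers, not real measurements:</p>

```java
// Waiting time and turnaround time under First Come, First Serve,
// assuming every task arrives at time 0 and runs without preemption.
public class FcfsMetrics {
    // Waiting time of a task = how long it sits in the ready queue
    // before it first gets the CPU.
    static double averageWaiting(int[] burst) {
        double total = 0, clock = 0;
        for (int b : burst) { total += clock; clock += b; }
        return total / burst.length;
    }

    // Turnaround time of a task = waiting time + its own burst time,
    // i.e. the moment it finishes, measured from time 0.
    static double averageTurnaround(int[] burst) {
        double total = 0, clock = 0;
        for (int b : burst) { clock += b; total += clock; }
        return total / burst.length;
    }

    public static void main(String[] args) {
        int[] burst = {24, 3, 3};
        System.out.println(averageWaiting(burst));    // (0 + 24 + 27) / 3 = 17.0
        System.out.println(averageTurnaround(burst)); // (24 + 27 + 30) / 3 = 27.0
    }
}
```

<p>Note how one long task at the front of the queue inflates everyone&#8217;s waiting time; running the two short tasks first would drop the average waiting time from 17 to 3. This is the intuition behind Shortest Job First.</p><p>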
A server may care more about throughput and CPU utilization, while a desktop OS cares more about response time and fairness.</p><h4>Scheduling Algorithms</h4><p>Several scheduling algorithms are commonly used:</p><ul><li><p><strong>First Come, First Serve (FCFS)</strong>: the first task to arrive is the first task to run.</p></li><li><p><strong>Round Robin (RR)</strong>: each task gets a fixed time slice, then the CPU moves to the next task.</p></li><li><p><strong>Priority Scheduling</strong>: higher-priority tasks run first.</p></li><li><p><strong>Shortest Job First (SJF)</strong>: tasks with less work are prioritized.</p></li><li><p><strong>Shortest Remaining Time</strong>: the task with the least remaining processing time is selected.</p></li><li><p><strong>Multi-level Queue</strong>: the system is divided into multiple queues, each with its own scheduling policy.</p></li></ul><p>Each algorithm has trade-offs between responsiveness, fairness, and efficiency. Round Robin is simple and fair, but may create more context switching. SJF can reduce average waiting time, but it is harder to use in practice because the OS must estimate job length.</p><h4>Multi-level Feedback Queue</h4><p>Most modern operating systems do not rely on a simple scheduling model alone. A common real-world approach is the <strong>Multi-level Feedback Queue (MLFQ)</strong>.</p><p>In MLFQ, a new task typically starts in a high-priority queue with a short time slice. If it behaves like a quick interactive task, it stays near the top. If it keeps using too much CPU time, the OS gradually moves it to lower-priority queues with longer time slices. 
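</p><p>The demotion rule can be sketched in a few lines. This is an illustrative model only: real MLFQ implementations add arrival boosts, aging, and per-queue policies, the promotion-on-yield rule below is just one common variant, and the slice lengths are made up.</p>

```java
// MLFQ sketch: level 0 is the highest priority. A task that burns its
// whole time slice sinks one level; a task that yields early (typical
// interactive behavior) rises one level, in this simplified variant.
public class MlfqSketch {
    // usedFullSlice[k] = true if the task consumed its entire k-th slice.
    static int finalLevel(boolean[] usedFullSlice, int lowestLevel) {
        int level = 0;
        for (boolean full : usedFullSlice) {
            if (full && level < lowestLevel) level++;   // CPU hog: demote
            else if (!full && level > 0) level--;       // interactive: promote
        }
        return level;
    }

    // Lower-priority levels compensate with longer slices: 10, 20, 40 ms...
    static int sliceMillisForLevel(int level) {
        return 10 << level;
    }

    public static void main(String[] args) {
        System.out.println(finalLevel(new boolean[]{true, true, true, true}, 2)); // batch job ends at level 2
        System.out.println(finalLevel(new boolean[]{true, false}, 2));            // interactive task back at level 0
        System.out.println(sliceMillisForLevel(2));                               // 40
    }
}
```

<p>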
This keeps the user interface responsive while still allowing large jobs to complete.</p><h4>CPU-bound and I/O-bound Workloads</h4><p>Different workloads need different scheduling behavior:</p><ul><li><p><strong>CPU-bound tasks</strong> spend most of their time doing computation.</p></li><li><p><strong>I/O-bound tasks</strong> spend most of their time waiting for disk, network, or other devices.</p></li></ul><p>Schedulers often try to favor short interactive or I/O-bound tasks so the system feels responsive, while still making sure CPU-bound tasks eventually get enough processing time.</p><h2><strong>Shared Memory</strong></h2><p>Shared Memory is one of the fastest methods for <strong>Inter-Process Communication (IPC)</strong>. Instead of sending data back and forth through the kernel, the operating system maps the same physical memory region into the virtual address spaces of multiple processes. That means the same data can be accessed directly by more than one process.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UwWv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cadb4bd-f9ee-4bac-96fe-e227693caccc_1024x559.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UwWv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cadb4bd-f9ee-4bac-96fe-e227693caccc_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!UwWv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cadb4bd-f9ee-4bac-96fe-e227693caccc_1024x559.png 848w, 
https://substackcdn.com/image/fetch/$s_!UwWv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cadb4bd-f9ee-4bac-96fe-e227693caccc_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!UwWv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cadb4bd-f9ee-4bac-96fe-e227693caccc_1024x559.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UwWv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cadb4bd-f9ee-4bac-96fe-e227693caccc_1024x559.png" width="1024" height="559" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6cadb4bd-f9ee-4bac-96fe-e227693caccc_1024x559.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:559,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:837886,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://quangchientran.substack.com/i/194205408?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cadb4bd-f9ee-4bac-96fe-e227693caccc_1024x559.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UwWv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cadb4bd-f9ee-4bac-96fe-e227693caccc_1024x559.png 424w, 
https://substackcdn.com/image/fetch/$s_!UwWv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cadb4bd-f9ee-4bac-96fe-e227693caccc_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!UwWv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cadb4bd-f9ee-4bac-96fe-e227693caccc_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!UwWv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cadb4bd-f9ee-4bac-96fe-e227693caccc_1024x559.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"></div></div></a></figure></div><h4>Shared Memory vs. Message Passing</h4><p>There are two main ways for processes to communicate:</p><ul><li><p><strong>Message passing</strong>: the OS acts like a mailman. Process A sends data to the kernel, and the kernel delivers it to Process B. This is safer, but it usually involves copying data more than once.</p></li><li><p><strong>Shared memory</strong>: the OS provides a shared region of memory that both processes can read and write directly. 
This avoids copying and is much faster, especially for large data.</p></li></ul><p>This is why shared memory is often the preferred choice when speed matters most.</p><h4>How It Works</h4><p>The operating system takes a physical block of RAM and maps it into the virtual address spaces of two or more processes.</p><p>For example:</p><ul><li><p>To Process A, the shared data might appear at address <code>0x1000</code>.</p></li><li><p>To Process B, the same physical memory might appear at address <code>0x5000</code>.</p></li></ul><p>Even though the virtual addresses are different, both processes are looking at the same physical memory underneath.</p><h4>Advantages</h4><ul><li><p><strong>Zero-copy communication</strong>: once the memory is mapped, data moves at memory speed instead of being copied through the kernel.</p></li><li><p><strong>Good for large data</strong>: shared memory is ideal for video frames, database buffers, and other large datasets.</p></li><li><p><strong>Low latency</strong>: it is often faster than pipes or sockets for frequent data exchange.</p></li></ul><h4>Disadvantages and Risks</h4><ul><li><p><strong>Synchronization burden</strong>: the OS does not manage access automatically, so developers must protect shared data using atomic operations, mutexes, semaphores, or locks.</p></li><li><p><strong>Complexity</strong>: if one process crashes while holding a lock, other processes may get stuck, or the shared data may become corrupted.</p></li><li><p><strong>Security</strong>: if permissions are not configured carefully, shared memory can expose data to unauthorized access.</p></li></ul><h4>Mutex vs. Semaphore</h4><p>It is helpful to distinguish between the common synchronization tools:</p><ul><li><p><strong>Mutex</strong>: like a key to a bathroom. Only one thread or process can hold it at a time.</p></li><li><p><strong>Semaphore</strong>: like a parking lot counter. 
It allows a fixed number of threads or processes to access a resource at the same time.</p></li></ul><p>Shared memory is powerful because it gives you speed, but it also gives you responsibility. The OS provides the shared space, but the application must make sure it is used safely and correctly.</p><h2><strong>CPU Caches</strong></h2><p>Modern CPUs use a <strong>cache hierarchy</strong> to bridge the huge speed gap between the CPU and main memory (DRAM). The closer a memory level is to the CPU core, the faster it is, but also the smaller it tends to be.</p><p>Typically, CPUs have three main cache levels:</p><ul><li><p><strong>L1</strong>: the smallest and fastest cache. It is usually split into:</p><ul><li><p><strong>L1i</strong> for instructions.</p></li><li><p><strong>L1d</strong> for data.</p></li></ul></li><li><p><strong>L2</strong>: larger than L1, but slightly slower. In many modern designs, it acts as a buffer for L1 and may be shared by a small cluster of cores.</p></li><li><p><strong>L3</strong>: also called the <strong>Last Level Cache (LLC)</strong>. 
It is much larger, usually measured in megabytes, and can be shared across multiple cores.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EfTq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92ae29eb-313f-4196-a1a8-31edb484af1d_675x256.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EfTq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92ae29eb-313f-4196-a1a8-31edb484af1d_675x256.png 424w, https://substackcdn.com/image/fetch/$s_!EfTq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92ae29eb-313f-4196-a1a8-31edb484af1d_675x256.png 848w, https://substackcdn.com/image/fetch/$s_!EfTq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92ae29eb-313f-4196-a1a8-31edb484af1d_675x256.png 1272w, https://substackcdn.com/image/fetch/$s_!EfTq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92ae29eb-313f-4196-a1a8-31edb484af1d_675x256.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EfTq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92ae29eb-313f-4196-a1a8-31edb484af1d_675x256.png" width="675" height="256" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/92ae29eb-313f-4196-a1a8-31edb484af1d_675x256.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:256,&quot;width&quot;:675,&quot;resizeWidth&quot;:675,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!EfTq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92ae29eb-313f-4196-a1a8-31edb484af1d_675x256.png 424w, https://substackcdn.com/image/fetch/$s_!EfTq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92ae29eb-313f-4196-a1a8-31edb484af1d_675x256.png 848w, https://substackcdn.com/image/fetch/$s_!EfTq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92ae29eb-313f-4196-a1a8-31edb484af1d_675x256.png 1272w, https://substackcdn.com/image/fetch/$s_!EfTq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92ae29eb-313f-4196-a1a8-31edb484af1d_675x256.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"></div></div></a></figure></div><p>When the CPU needs data, it checks <strong>L1 first</strong>, then <strong>L2</strong>, then <strong>L3</strong>, and finally DRAM if the data is not found in cache. 
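</p><p>The value of this lookup order can be quantified as an average memory access time. The latencies and hit rates below are illustrative round numbers, not measurements of any particular CPU:</p>

```java
// Average memory access time (AMAT) for a lookup chain L1 -> L2 -> L3 -> DRAM.
public class Amat {
    // latency[i] = cost in cycles of checking level i;
    // hitRate[i] = fraction of accesses reaching level i that hit there.
    // The last level (DRAM here) always "hits".
    static double amat(double[] latency, double[] hitRate) {
        double time = 0, reach = 1.0;   // reach = probability an access gets this far
        for (int i = 0; i < latency.length; i++) {
            time += reach * latency[i];
            reach *= (i < hitRate.length) ? (1 - hitRate[i]) : 0;
        }
        return time;
    }

    public static void main(String[] args) {
        // L1: 4 cycles, 95% hit; L2: 12 cycles, 80% hit; L3: 40 cycles, 50% hit; DRAM: 200 cycles.
        double[] latency = {4, 12, 40, 200};
        double[] hitRate = {0.95, 0.80, 0.50};
        System.out.printf("AMAT = %.2f cycles%n", amat(latency, hitRate)); // 6.00
    }
}
```

<p>Even though DRAM costs 200 cycles in this example, the average access costs about 6 cycles, because the caches absorb the overwhelming majority of requests.</p><p>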
This layered design helps keep frequently used data close to the processor, which dramatically reduces average memory access time.</p><h4>Cache Lines: The Unit of Transfer</h4><p>Data is transferred between memory levels in fixed-size blocks called <strong>cache lines</strong>. On many modern CPUs, a cache line is typically <strong>64 bytes</strong>.</p><p>The reason is that <strong>when a CPU reads a variable from RAM, it doesn&#8217;t just fetch that single variable (byte by byte). Instead, it loads an entire 64-byte chunk surrounding that variable (this is the cache line) into the CPU cache (L1)</strong>. It does this because it predicts that, in most cases, nearby data will soon be used. The next time, those neighboring values are already in L1, so there&#8217;s no need to access RAM again.</p><p>This mechanism is based on the principle of <strong>spatial locality</strong>. If the CPU accesses a value in memory, it assumes that nearby values are likely to be accessed soon as well.</p><p>That&#8217;s why arrays are typically cache-friendly. Their elements are stored next to each other in memory, allowing the CPU to use cache lines efficiently. 
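</p><p>You can feel this effect with two loops that compute the same sum. In Java a 2D array is an array of row arrays, so walking row by row visits contiguous <code>int</code>s that share cache lines, while walking column by column jumps to a different row array on every step. The matrix size is arbitrary:</p>

```java
import java.util.Arrays;

// Same work, different memory access pattern: row-major order follows
// cache lines, column-major order fights them.
public class Locality {
    static long sumRowMajor(int[][] m) {
        long s = 0;
        for (int i = 0; i < m.length; i++)
            for (int j = 0; j < m[i].length; j++) s += m[i][j]; // consecutive ints: mostly cache hits
        return s;
    }

    static long sumColMajor(int[][] m) {
        long s = 0;
        for (int j = 0; j < m[0].length; j++)
            for (int i = 0; i < m.length; i++) s += m[i][j];    // new row array each step: new cache line
        return s;
    }

    public static void main(String[] args) {
        int n = 2048;
        int[][] m = new int[n][n];
        for (int[] row : m) Arrays.fill(row, 1);

        long t0 = System.nanoTime(); long a = sumRowMajor(m);
        long t1 = System.nanoTime(); long b = sumColMajor(m);
        long t2 = System.nanoTime();
        System.out.println("row-major: " + (t1 - t0) / 1_000_000 + " ms, sum=" + a);
        System.out.println("col-major: " + (t2 - t1) / 1_000_000 + " ms, sum=" + b);
        // Both sums are equal; the column-major walk is typically several times slower.
    }
}
```

<p>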
In contrast, linked lists often have nodes scattered throughout memory, making cache usage much less efficient.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!F1iB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81285da8-f6fe-4194-bbba-97da8531c3f8_1024x559.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!F1iB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81285da8-f6fe-4194-bbba-97da8531c3f8_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!F1iB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81285da8-f6fe-4194-bbba-97da8531c3f8_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!F1iB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81285da8-f6fe-4194-bbba-97da8531c3f8_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!F1iB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81285da8-f6fe-4194-bbba-97da8531c3f8_1024x559.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!F1iB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81285da8-f6fe-4194-bbba-97da8531c3f8_1024x559.png" width="1024" height="559" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/81285da8-f6fe-4194-bbba-97da8531c3f8_1024x559.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:559,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:964343,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://quangchientran.substack.com/i/194205408?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81285da8-f6fe-4194-bbba-97da8531c3f8_1024x559.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!F1iB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81285da8-f6fe-4194-bbba-97da8531c3f8_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!F1iB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81285da8-f6fe-4194-bbba-97da8531c3f8_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!F1iB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81285da8-f6fe-4194-bbba-97da8531c3f8_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!F1iB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81285da8-f6fe-4194-bbba-97da8531c3f8_1024x559.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"></div></div></a></figure></div><h4>Cache Hits and Misses</h4><p>When the CPU looks for data, one of two things happens:</p><ul><li><p><strong>Cache hit</strong>: the 
data is found in cache, so execution continues quickly.</p></li><li><p><strong>Cache miss</strong>: the data is not found, so the CPU must fetch it from a lower level, usually a slower cache or DRAM.</p></li></ul><p>A cache miss is expensive. The CPU may have to wait dozens or even hundreds of cycles while the data is loaded. That is why good cache behavior can have a huge impact on performance.</p><h4>Set-Associativity and Tags</h4><p>Cache is not managed like a simple hash table. Instead, it uses a <strong>set-associative</strong> structure.</p><p>The CPU uses specific bits from the memory address to choose a cache <strong>set</strong>, then compares the <strong>tag</strong> to see whether the data is actually there. This design lets the hardware find data very quickly without scanning the entire cache.</p><p>Each cache line usually contains:</p><ul><li><p>The actual data.</p></li><li><p>A <strong>tag</strong> that identifies which memory block it belongs to.</p></li><li><p>Metadata such as validity and dirty state.</p></li></ul><h4>Write-Back and Dirty Cache Lines</h4><p>To stay fast, CPUs often use a <strong>write-back</strong> policy. When data is modified, the change is written to cache first, and the cache line is marked as <strong>dirty</strong>.</p><p>The main memory is not updated immediately. Instead, the data is written back later, usually when:</p><ul><li><p>the cache line is evicted, or</p></li><li><p>synchronization is forced by something like a memory barrier, <code>volatile</code>, or atomic operations.</p></li></ul><p>This approach reduces traffic to DRAM and improves performance, but it also means that cache and memory can temporarily hold different versions of the same data.</p><h4>Cache Coherency and False Sharing</h4><p>On multi-core CPUs, each core can have its own cache (L1, L2). 
This creates a data <strong>synchronization</strong> problem between CPU cores <strong>during computation</strong>: if Core 1 modifies data in its cache (within a cache line), how does Core 2 know that its copy is now outdated and needs updating?</p><p>When Core 1 modifies a variable <code>x</code>, it marks that cache line as &#8220;<strong>dirty</strong>&#8221; (modified). If Core 2 wants to modify another variable <code>y</code> that happens to reside on the <strong>same cache line</strong>, it can&#8217;t use its current copy anymore. Instead, it has to <strong>fetch the updated cache line again from L3 or RAM</strong>. This issue is commonly known as <strong>false sharing</strong>.</p><p>When two cores modify values on the same cache line, that line <strong>bounces between their caches and has to be repeatedly reloaded</strong>, which is costly. This constant back-and-forth movement is managed by a cache <strong>coherency protocol</strong>, most commonly the <strong>MESI protocol</strong>. This protocol ensures that cores coordinate with each other so they don&#8217;t keep using stale data. In practice, it guarantees that all cores eventually see the most up-to-date version of shared data.</p><p><strong>False sharing</strong> is particularly dangerous because the code may look completely correct and free of obvious contention, yet still perform poorly due to how data is laid out in cache lines.</p><h4>How to Solve It</h4><p>To solve false sharing, the most effective approach is to ensure that variables written independently by each core or thread do not end up on the same cache line.</p><p><strong>Padding:</strong> Separate data into its own cache line by inserting extra &#8220;<strong>junk</strong>&#8221; data, making sure important variables do not sit within the same 64 bytes (so they fall on different cache lines). 
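</p><p>A manual-padding sketch: two counters, each written by its own thread, separated by filler fields so the hot <code>value</code> fields are unlikely to share a 64-byte line. Note that the JVM does not guarantee field layout, so hand padding is a heuristic; the field names and iteration count here are illustrative.</p>

```java
// Two independently written counters, padded so each thread's hot field
// sits on its own cache line (assuming 64-byte lines and 8-byte longs).
public class PaddedCounters {
    static class PaddedLong {
        volatile long value;
        // Filler: seven longs push any following hot field past this line.
        long p1, p2, p3, p4, p5, p6, p7;
    }

    static long runTwoWriters(long iterations) throws InterruptedException {
        PaddedLong a = new PaddedLong(), b = new PaddedLong();
        Thread t1 = new Thread(() -> { for (long i = 0; i < iterations; i++) a.value++; });
        Thread t2 = new Thread(() -> { for (long i = 0; i < iterations; i++) b.value++; });
        t1.start(); t2.start();
        t1.join(); t2.join();
        return a.value + b.value;   // each counter is written by exactly one thread
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runTwoWriters(10_000_000));
    }
}
```

<p>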
In Java, you can use the <code>@Contended</code> annotation to apply this padding.</p><p><strong>Redesign data per thread:</strong> Give each thread/core its own local variable, do the computation independently, and then merge the results at the end instead of continuously writing to a shared memory region.</p><h4>Why Caches Matter</h4><p>Caches are one of the biggest reasons modern CPUs are fast. They reduce the average cost of memory access, help keep the CPU busy, and make repeated or nearby data access much cheaper.</p><p>But caches also introduce complexity. To write high-performance code, you need to think not only about algorithms, but also about memory access patterns, cache locality, cache coherency, and false sharing.</p><h2><strong>Conclusion</strong></h2><p>Once you understand <strong>process</strong>, <strong>thread</strong>, <strong>scheduler</strong>, and <strong>cache</strong>, you will see more clearly why some code runs fast, why some code runs slowly, why <strong>synchronization bugs</strong> happen, and why <strong>AI</strong> cannot completely replace <strong>system thinking</strong>. 
Foundational knowledge is not about writing code yourself instead of using AI; it is what lets you know what to ask AI, and how to verify AI&#8217;s results.</p><p><em>(If you enjoy these kinds of engineering stories, you can subscribe to receive the next ones.)</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://quangchientran.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://quangchientran.substack.com/subscribe?"><span>Subscribe now</span></a></p><p></p>]]></content:encoded></item><item><title><![CDATA[Java Virtual Threads in Java 21: From Platform Threads to Scalable Concurrency]]></title><description><![CDATA[Discover how virtual threads in Java 21 work, why they scale better than platform threads, and when to use them instead of reactive code.]]></description><link>https://quangchientran.substack.com/p/java-virtual-threads-java-virtual-threads-explained</link><guid isPermaLink="false">https://quangchientran.substack.com/p/java-virtual-threads-java-virtual-threads-explained</guid><dc:creator><![CDATA[Quang Chien TRAN]]></dc:creator><pubDate>Sun, 26 Apr 2026 22:00:12 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!bj0V!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F708e1167-3a79-4902-ab50-1839b4fb62fd_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Concurrency</strong> in Java has long been a painful trade-off: either choose <strong>Platform Threads</strong>, which are easy to write but consume a lot of RAM and are hard to scale, or choose powerful <strong>Reactive</strong> code that makes your brain twist into knots.</p><p><strong>Virtual Threads</strong> were created to end that trade-off. They let you handle millions of requests with the simplest synchronous coding style. 
Let&#8217;s explore everything from the traditional thread model to the real power of virtual threads to see why this is such a turning point for Java. Let&#8217;s begin :) :)</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bj0V!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F708e1167-3a79-4902-ab50-1839b4fb62fd_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bj0V!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F708e1167-3a79-4902-ab50-1839b4fb62fd_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!bj0V!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F708e1167-3a79-4902-ab50-1839b4fb62fd_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!bj0V!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F708e1167-3a79-4902-ab50-1839b4fb62fd_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!bj0V!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F708e1167-3a79-4902-ab50-1839b4fb62fd_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bj0V!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F708e1167-3a79-4902-ab50-1839b4fb62fd_1536x1024.png" width="1456" height="971" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/708e1167-3a79-4902-ab50-1839b4fb62fd_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2349564,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://quangchientran.substack.com/i/195542055?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F708e1167-3a79-4902-ab50-1839b4fb62fd_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!bj0V!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F708e1167-3a79-4902-ab50-1839b4fb62fd_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!bj0V!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F708e1167-3a79-4902-ab50-1839b4fb62fd_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!bj0V!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F708e1167-3a79-4902-ab50-1839b4fb62fd_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!bj0V!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F708e1167-3a79-4902-ab50-1839b4fb62fd_1536x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"></div></div></a></figure></div><h2>Platform Thread</h2><h3>Mechanism</h3><p>Before talking about <strong>Virtual Threads</strong>, I need to understand how Java has handled 
threads up to now.</p><p>In Java, each thread you create (through <code>java.lang.Thread</code>) is mapped almost 1-to-1 to an operating system thread (OS thread). These threads are usually called <strong>platform threads</strong>. It sounds simple, but the cost is far from small.</p><h4>Why are platform threads called &#8220;heavyweight&#8221;?</h4><p>Each platform thread is not just a concept inside the JVM &#8212; it is tightly bound to operating system resources. When you create a thread, the operating system allocates a separate memory area for it called the <strong>thread stack</strong>. This is where the entire execution state of the thread is stored.</p><p>Inside this stack are <strong>stack frames</strong> &#8212; each frame corresponds to one function call. Each frame contains:</p><ul><li><p>Return address</p></li><li><p>Local variables</p></li><li><p>Parameters</p></li><li><p>Intermediate data used for execution</p></li></ul><p>The stack follows the familiar <strong>LIFO (Last In, First Out)</strong> principle.</p><ul><li><p>Call a function &#8594; push a frame onto the stack</p></li><li><p>End the function &#8594; pop the frame off the stack</p></li></ul><p>The CPU uses a register called the <strong>stack pointer</strong> to track the &#8220;top&#8221; of the stack, and moves it up or down accordingly during push/pop operations.</p><h4>Where is the problem?</h4><p>The important point is: the stack size of each thread is fixed as soon as the thread is created (for example through the JVM <code>-Xss</code> parameter). 
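</p><p>This fixed-at-creation behavior is easy to probe. The sketch below (a hypothetical demo class; the thread name <code>probe</code> and the two sizes are arbitrary choices) uses the <code>Thread</code> constructor that accepts a <code>stackSize</code> argument and counts how deep each thread can recurse before the stack runs out. Note that the JVM treats <code>stackSize</code> only as a hint, so the exact numbers are highly platform dependent.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;java&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-java">public class StackSizeDemo {

    static int depth = 0;

    // Recursion with no stopping condition: keeps pushing frames until the stack is exhausted
    static void recurse() {
        depth++;
        recurse();
    }

    // Run the recursion on a thread created with the requested stack size (a hint to the JVM)
    static int depthWithStack(long stackSize) throws InterruptedException {
        depth = 0;
        Thread t = new Thread(null, () -&gt; {
            try {
                recurse();
            } catch (StackOverflowError expected) {
                // the stack limit was reached; depth holds the number of frames that fit
            }
        }, "probe", stackSize);
        t.start();
        t.join();
        return depth;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("256 KB stack: " + depthWithStack(256 * 1024) + " frames");
        System.out.println("  4 MB stack: " + depthWithStack(4L * 1024 * 1024) + " frames");
    }
}</code></pre></div><p>On most platforms the larger reservation allows a noticeably deeper call chain, which is exactly why every thread must reserve its maximum stack up front.</p><p>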
Even if some operating systems can allocate stack in a &#8220;use as you go&#8221; way (<strong>lazy allocation</strong>), each thread still has to reserve a maximum stack area in advance.</p><p>This leads to two major consequences:</p><ul><li><p>Each thread consumes a significant amount of memory</p></li><li><p>The thread is managed directly by the OS scheduler &#8594; <strong>high context-switching cost</strong></p></li></ul><p>When you scale to tens of thousands or hundreds of thousands of <strong>concurrent tasks</strong>, this model starts to struggle</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zZMV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42ba1049-2625-48a7-a792-06b6c5af9232_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zZMV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42ba1049-2625-48a7-a792-06b6c5af9232_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!zZMV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42ba1049-2625-48a7-a792-06b6c5af9232_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!zZMV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42ba1049-2625-48a7-a792-06b6c5af9232_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!zZMV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42ba1049-2625-48a7-a792-06b6c5af9232_1536x1024.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!zZMV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42ba1049-2625-48a7-a792-06b6c5af9232_1536x1024.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/42ba1049-2625-48a7-a792-06b6c5af9232_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1496417,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://quangchientran.substack.com/i/195542055?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42ba1049-2625-48a7-a792-06b6c5af9232_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zZMV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42ba1049-2625-48a7-a792-06b6c5af9232_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!zZMV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42ba1049-2625-48a7-a792-06b6c5af9232_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!zZMV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42ba1049-2625-48a7-a792-06b6c5af9232_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!zZMV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42ba1049-2625-48a7-a792-06b6c5af9232_1536x1024.png 
1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"></div></div></a></figure></div><h3>Problems with threads &#8212; when everything starts getting overloaded</h3><p>The thread stack mechanism sounds neat, but in reality it hides quite a few problems.</p><h4>StackOverflowError &#8211; a familiar error</h4><p>One direct consequence is: if a program uses recursion without a proper stopping point, or has too many nested function calls, the number of stack frames keeps increasing. Eventually, when the number of frames exceeds the thread stack limit, the JVM throws:</p><blockquote><p><strong>StackOverflowError</strong></p></blockquote><p>The important thing to understand is that this error happens only on a specific thread, not because of the total number of threads in the system. It simply means that thread has <strong>run out of room</strong> to store more function calls.</p><h4>OutOfMemoryError &#8211; when too many threads are created</h4><p>On the other hand, the problem is no longer inside one thread, but in the number of threads. Each thread needs its own stack area, which by default is often around a few hundred KB to 1 MB (depending on the JVM and OS). When you create thousands or tens of thousands of threads, the total memory usage rises very quickly.</p><p>At some point, the system can no longer allocate a new thread, and you will get the error:</p><blockquote><p><strong>OutOfMemoryError: unable to create new native thread</strong></p></blockquote><p>This reminds us of something very practical: threads are not <strong>free</strong>.</p><h4>Thread-per-request &#8211; a popular model with limits</h4><p>Because of how platform threads work, traditional servers often use this model: one request = one thread. 
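</p><p>A minimal sketch of this model is a blocking echo server (the class name, port 8080, and the echo protocol are arbitrary choices for illustration, not a production design):</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;java&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-java">import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ThreadPerRequestServer {

    // Accept connections forever; each one is served by its own platform thread
    static void serve(ServerSocket server, ExecutorService pool) throws IOException {
        while (true) {
            Socket socket = server.accept();
            pool.submit(() -&gt; handle(socket)); // this thread is occupied for the whole connection
        }
    }

    // Echo everything back until the client closes its side
    static void handle(Socket socket) {
        try (socket) {
            socket.getInputStream().transferTo(socket.getOutputStream());
        } catch (IOException ignored) {
        }
    }

    public static void main(String[] args) throws IOException {
        // Unbounded pool: effectively one platform thread per connection
        serve(new ServerSocket(8080), Executors.newCachedThreadPool());
    }
}</code></pre></div><p>Each connection holds one platform thread for its entire lifetime, including all the time spent idly waiting on the socket.</p><p>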
This is very intuitive, easy to code, easy to debug, and works well at moderate scale (a few hundred to a few thousand concurrent requests).</p><p>But when you scale to hundreds of thousands or millions of simultaneous requests, the problems become obvious:</p><ul><li><p>Not enough memory to hold that many threads</p></li><li><p>Context switching between threads becomes extremely expensive</p></li></ul><p>At that point, the system slows down, then eventually gets overwhelmed.</p><h4>Reactive &#8211; a solution that is not easy to swallow</h4><p>To get past those limits, a common path is <strong>Reactive Programming</strong>. Instead of &#8220;holding&#8221; a thread for the whole lifetime of a request, the system will:</p><ul><li><p>Use <strong>non-blocking I/O</strong></p></li><li><p>Release the thread while waiting (for example: waiting for a database or API)</p></li><li><p>Continue processing when data becomes available (<strong>event-driven</strong>)</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NKN4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F525d7191-387b-40df-927c-f80f31334c9f_1024x559.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NKN4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F525d7191-387b-40df-927c-f80f31334c9f_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!NKN4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F525d7191-387b-40df-927c-f80f31334c9f_1024x559.png 848w, 
https://substackcdn.com/image/fetch/$s_!NKN4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F525d7191-387b-40df-927c-f80f31334c9f_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!NKN4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F525d7191-387b-40df-927c-f80f31334c9f_1024x559.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NKN4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F525d7191-387b-40df-927c-f80f31334c9f_1024x559.png" width="1024" height="559" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/525d7191-387b-40df-927c-f80f31334c9f_1024x559.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:559,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:521574,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://quangchientran.substack.com/i/195542055?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F525d7191-387b-40df-927c-f80f31334c9f_1024x559.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NKN4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F525d7191-387b-40df-927c-f80f31334c9f_1024x559.png 424w, 
https://substackcdn.com/image/fetch/$s_!NKN4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F525d7191-387b-40df-927c-f80f31334c9f_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!NKN4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F525d7191-387b-40df-927c-f80f31334c9f_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!NKN4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F525d7191-387b-40df-927c-f80f31334c9f_1024x559.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"></div></div></a></figure></div><p>Thanks to that, a small number of threads can handle a very large number of requests at the same time. It sounds great, but the price is complexity. You no longer write code in a sequential (<strong>synchronous</strong>) style &#8212; you must switch to asynchronous thinking:</p><ul><li><p>Code becomes harder to read (callback, chain, reactive stream&#8230;)</p></li><li><p>Debugging becomes harder (the flow is no longer linear)</p></li><li><p>Maintenance becomes difficult if the design is not tight</p></li></ul><p>And this is exactly the point where many teams struggle when applying reactive programming in large systems.</p><h2>Virtual threads</h2><p>After all the limitations of platform threads, <strong>Virtual Threads</strong> appear as a very different approach. At the API level, everything still looks familiar: you still work with <code>java.lang.Thread</code>. But underneath, the operating model has changed completely.</p><h3>No longer &#8220;one thread = one OS thread&#8221;</h3><p>The biggest difference is: a virtual thread is no longer permanently tied to a single OS thread throughout its lifetime. Instead, the JVM acts like a <strong>smart scheduler</strong>. 
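</p><p>Creating one is deliberately unremarkable at the API level. A minimal sketch (requires Java 21 or later; the thread name is an arbitrary example):</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;java&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-java">public class VirtualThreadCreation {

    public static void main(String[] args) throws InterruptedException {
        // Builder style: still a java.lang.Thread, but scheduled by the JVM
        Thread vt = Thread.ofVirtual()
                .name("my-virtual-thread")
                .start(() -&gt; System.out.println("virtual? " + Thread.currentThread().isVirtual()));
        vt.join();

        // Shorthand for the common create-and-start case
        Thread.startVirtualThread(() -&gt; System.out.println("running on " + Thread.currentThread())).join();
    }
}</code></pre></div><p>The familiar <code>Thread</code> API is unchanged; only the scheduling underneath is different.</p><p>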
When a virtual thread runs, it is temporarily assigned to an OS thread (often called a <strong>carrier thread</strong>).</p><p>But when the virtual thread encounters a <strong>blocking operation</strong> (for example: calling a database, reading a file, calling an API&#8230;), the JVM can:</p><ul><li><p>Pause that virtual thread</p></li><li><p>Release the OS thread</p></li><li><p>Use that OS thread to run another virtual thread</p></li></ul><p>When the data becomes available, the original virtual thread will be resumed on an OS thread (which may not be the same one as before)</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YNbR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03000895-f04a-4367-bd3b-63f579bb5fc6_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YNbR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03000895-f04a-4367-bd3b-63f579bb5fc6_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!YNbR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03000895-f04a-4367-bd3b-63f579bb5fc6_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!YNbR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03000895-f04a-4367-bd3b-63f579bb5fc6_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!YNbR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03000895-f04a-4367-bd3b-63f579bb5fc6_1536x1024.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!YNbR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03000895-f04a-4367-bd3b-63f579bb5fc6_1536x1024.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/03000895-f04a-4367-bd3b-63f579bb5fc6_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1431455,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://quangchientran.substack.com/i/195542055?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03000895-f04a-4367-bd3b-63f579bb5fc6_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!YNbR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03000895-f04a-4367-bd3b-63f579bb5fc6_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!YNbR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03000895-f04a-4367-bd3b-63f579bb5fc6_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!YNbR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03000895-f04a-4367-bd3b-63f579bb5fc6_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!YNbR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03000895-f04a-4367-bd3b-63f579bb5fc6_1536x1024.png 
1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"></div></div></a></figure></div><h3>Fewer real threads, more work</h3><p>With this mechanism, you no longer need 10,000 OS threads to handle 10,000 requests: a small number of OS threads is enough to run a very large number of virtual threads.</p><p>In other words, the JVM is doing <strong>multiplexing</strong>: many virtual threads &#8594; few OS threads.</p><p><strong>Cheap enough to use freely</strong><br>Because it no longer needs a fixed native stack like a platform thread, a virtual thread has extremely low creation cost. You can create hundreds of thousands or even millions of virtual threads while still staying within acceptable resource limits &#8212; something almost impossible with platform threads. This is especially useful for systems with:</p><ul><li><p>Lots of I/O</p></li><li><p>Lots of waiting time (waiting time greater than CPU time)</p></li></ul><h3>Similar to virtual memory</h3><p>If this concept feels a bit &#8220;magic,&#8221; think of it like this: Virtual Threads are similar to how <strong>virtual memory</strong> works. Instead of forcing the program to deal directly with limited physical resources (RAM or OS threads), the JVM creates an abstraction layer:</p><ul><li><p>Hides the real limits</p></li><li><p>Distributes resources more flexibly</p></li><li><p>Makes the most of what is available</p></li></ul><p>The result is that you feel like <strong>resources are almost infinite</strong>, while underneath everything is still carefully optimized.</p><h3>How virtual threads work</h3><p>A virtual thread is not a standalone &#8220;real&#8221; thread, but is placed into an internal scheduling mechanism by the JVM.</p><p>You can picture it simply: the JVM keeps a queue of virtual threads ready to run. 
When an OS thread (<strong>carrier thread</strong>) becomes free, the JVM <strong>mounts</strong> a virtual thread onto it for execution.</p><h4>Mount / Unmount</h4><p>During execution, a virtual thread continuously goes through two states:</p><ul><li><p><strong>Mount</strong>: attached to a carrier thread to run</p></li><li><p><strong>Unmount</strong>: detached when it hits a point where it must wait (blocking I/O, sleep, &#8230;)</p></li></ul><p>The interesting part is: when a virtual thread is unmounted, the carrier thread is not kept waiting. Instead, that OS thread is immediately returned to the JVM to run another virtual thread. The original virtual thread simply waits until it is ready again, then gets mounted later.</p><h4>Using resources efficiently</h4><p>Thanks to this mechanism, a small number of OS threads can rotate to serve many virtual threads. Compared to the thread-per-request model:</p><ul><li><p>No thread is occupied doing nothing while waiting for I/O</p></li><li><p>CPU is used more continuously and efficiently</p></li><li><p>The number of required OS threads drops sharply, saving operating system resources</p></li></ul><p>In other words, the system no longer wastes resources just by waiting.</p><h4>Code stays sync, runtime is very async</h4><p>One extremely valuable point is: developers do not need to change how they write code. You can still write code in this style:</p><ul><li><p>Call functions sequentially</p></li><li><p>Use blocking I/O normally</p></li></ul><p>But underneath, the JVM is silently:</p><ul><li><p>Pausing the thread when needed</p></li><li><p>Switching to another task</p></li><li><p>Returning to the exact spot to continue</p></li></ul><p>That means: <strong>the experience feels synchronous, but the performance is close to async.</strong></p><p><strong>A common misunderstanding is that each virtual thread is assigned to the &#8220;least busy&#8221; OS thread</strong>. That is not actually how it works. 
The JVM does not fix this mapping. Instead, it continuously:</p><ul><li><p>Schedules</p></li><li><p>Pauses</p></li><li><p>Resumes virtual threads</p></li></ul><p>depending on each thread&#8217;s execution state. This flexibility is the key that helps virtual threads work well in systems with:</p><ul><li><p>Lots of I/O</p></li><li><p>Many concurrent requests</p></li><li><p>Waiting time taking up most of the workload</p></li></ul><h3>Virtual thread context switching is lighter</h3><p>When people hear &#8220;<strong>context switching</strong>&#8221;, they often think it simply means changing the currently running thread. But with OS threads, the story is much heavier.</p><h4>OS thread context switch &#8211; where is the cost?</h4><p>With traditional threads, every context switch requires the operating system to <strong>jump into kernel space</strong>. It is not just a matter of stopping thread A and running thread B &#8212; it also includes:</p><ul><li><p>Saving the entire CPU state of the current thread</p></li><li><p>Restoring the state of the next thread</p></li><li><p>Updating kernel management structures</p></li><li><p>Affecting CPU cache performance (the cache gets &#8220;cold&#8221;)</p></li></ul><p>If this happens repeatedly, the accumulated cost becomes very large and drags down system throughput.</p><h4>Virtual threads &#8211; moving the work into the JVM</h4><p>Virtual threads work differently. <strong>Instead of pushing the context-switching burden down into the kernel, the JVM handles most of this logic in user space</strong>. When a virtual thread hits blocking work (I/O, sleep,&#8230;), the JVM will:</p><ul><li><p>&#8220;<strong>Freeze</strong>&#8221; (<strong>park</strong>) the virtual thread</p></li><li><p>Store the necessary state in <strong>JVM memory</strong></p></li><li><p><strong>Unmount</strong> it from the carrier thread</p></li></ul><p>Most importantly: all of this happens without kernel intervention. 
The carrier thread is immediately <strong>reused to run another virtual thread</strong>, without going through a heavy OS-thread-style context switch.</p><h4>No need to carry the whole &#8220;machine&#8221; like an OS thread</h4><p>An OS thread always comes with a fixed native stack, and its state is always tied to the <strong>kernel</strong>. In contrast, a virtual thread does not need to keep a native stack for its entire lifetime and only stores what is necessary to resume execution.</p><p>Simply put:</p><ul><li><p>OS thread = carries the whole <strong>machine</strong></p></li><li><p>Virtual thread = carries only the <strong>enough-to-resume</strong> state</p></li></ul><p>Therefore, pausing and resuming a virtual thread is much lighter. One important thing to understand correctly: <strong>virtual threads do not eliminate context switching</strong>. They only:</p><ul><li><p>Reduce the number of times kernel-level switching is needed</p></li><li><p>Move most scheduling work into the JVM</p></li><li><p>Optimize for the &#8220;<em><strong>run &#8594; wait &#8594; run again</strong></em>&#8221; pattern</p></li></ul><p>In I/O-heavy systems, this is a huge difference. 
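</p><p>A classic way to feel this difference is to park a very large number of virtual threads at once. The sketch below (Java 21+; the class name and the count of 100,000 are arbitrary) starts 100,000 tasks that each sleep for one second. Because sleeping only parks the virtual thread, the whole batch finishes in roughly one second instead of requiring 100,000 OS threads:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;java&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-java">import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ManySleepers {

    public static void main(String[] args) {
        Instant start = Instant.now();
        // One virtual thread per task; close() at the end of the try block waits for all tasks
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i &lt; 100_000; i++) {
                executor.submit(() -&gt; {
                    Thread.sleep(Duration.ofSeconds(1)); // parks the virtual thread, frees the carrier
                    return null;
                });
            }
        }
        System.out.println("100,000 sleepers done in "
                + Duration.between(start, Instant.now()).toMillis() + " ms");
    }
}</code></pre></div><p>Trying the same with platform threads would most likely fail with an <code>OutOfMemoryError</code> or slow the machine to a crawl.</p><p>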
You can think of it like this:</p><ul><li><p>OS thread: every time you switch tasks, you have to &#8220;<strong>ask the operating system for permission</strong>,&#8221; go through all the procedures &#8594; expensive</p></li><li><p>Virtual thread: the JVM handles internal scheduling itself, <strong>using OS threads only as temporary workers &#8594; faster and more flexible</strong></p></li></ul><p>So virtual threads are not &#8220;more magical,&#8221; but simply avoid unnecessary, expensive work.</p><h3>Examples</h3><h4><strong>Platform Thread</strong></h4><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;java&quot;,&quot;nodeId&quot;:&quot;16ed2fc4-ccdb-417c-b84f-dcea3b607a97&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-java">import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class PlatformThreadExample {

    private final HttpClient client = HttpClient.newHttpClient();

    public String getCombinedData() throws Exception {
        String user = call("https://api.example.com/user/1");
        String orders = call("https://api.example.com/orders/1");
        return user + " | " + orders;
    }

    private String call(String url) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(url))
                .GET()
                .build();

        HttpResponse&lt;String&gt; response = client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        PlatformThreadExample app = new PlatformThreadExample();
        System.out.println(app.getCombinedData());
    }
}</code></pre></div><p>Here, <code>client.send(...)</code> is a <strong>blocking call</strong>. The running thread is held until the HTTP response returns, so if there are many concurrent requests, you will need many platform threads.</p><h4>Reactive programming</h4><p>The example below uses <strong>Spring WebFlux / Reactor.</strong> The goal is not to block a thread while waiting for I/O, but to chain processing steps using <code>Mono</code>.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;java&quot;,&quot;nodeId&quot;:&quot;8d518ce0-2bb4-44d9-9c72-6c431230244b&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-java">import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.publisher.Mono;

public class ReactiveExample {

    private final WebClient webClient = WebClient.create();

    public Mono&lt;String&gt; getCombinedData() {
        Mono&lt;String&gt; userMono = webClient.get()
                .uri("https://api.example.com/user/1")
                .retrieve()
                .bodyToMono(String.class);

        Mono&lt;String&gt; ordersMono = webClient.get()
                .uri("https://api.example.com/orders/1")
                .retrieve()
                .bodyToMono(String.class);

        return userMono.zipWith(ordersMono, (user, orders) -&gt; user + " | " + orders);
    }

    public static void main(String[] args) {
        ReactiveExample app = new ReactiveExample();

        app.getCombinedData()
                .subscribe(System.out::println);

        // Crude wait so the JVM does not exit before the async pipeline completes
        try {
            Thread.sleep(3000);
        } catch (InterruptedException ignored) {
        }
    }
}</code></pre></div><p>Here, <code>Mono</code> does not represent a value that is immediately available, but a <strong>pipeline</strong> that will complete later. The code does not block a thread while waiting for the HTTP response, and the reactive framework will coordinate the continuation when data becomes available.</p><h4>Virtual Thread</h4><p>This example <strong>keeps the synchronous coding style</strong> like platform threads, but runs on virtual threads.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;java&quot;,&quot;nodeId&quot;:&quot;92c5d1a8-942c-4b3b-b2fa-c4fb9ab98bb2&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-java">import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class VirtualThreadExample {

    private final HttpClient client = HttpClient.newHttpClient();

    public String getCombinedData() throws Exception {
        String user = call("https://api.example.com/user/1");
        String orders = call("https://api.example.com/orders/1");
        return user + " | " + orders;
    }

    private String call(String url) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(url))
                .GET()
                .build();

        HttpResponse&lt;String&gt; response = client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) {
        VirtualThreadExample app = new VirtualThreadExample();

        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            executor.submit(() -&gt; {
                try {
                    System.out.println(app.getCombinedData());
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
        }
    }
}</code></pre></div><p>The important point is that the business logic is almost unchanged compared with platform threads, but instead of being &#8220;<strong>stuck</strong>&#8221; on an OS thread while waiting for I/O, the virtual thread can be temporarily parked by the JVM so the carrier thread can do other work.</p><h2>Notes and Best practices</h2><p>Even though virtual threads are very powerful, there are a few important points to understand correctly so you do not use them the wrong way.</p><h3>Still a Thread, but don&#8217;t treat it like the old kind</h3><p>A virtual thread is still a <code>Thread</code>, so you can use <code>start()</code>, <code>join()</code>, and so on as usual. However, that does not mean it behaves like a platform thread. Older APIs like <code>stop()</code> and <code>suspend()</code>, which have long been <strong>dangerous</strong> and <strong>deprecated</strong>, should be avoided even more with virtual threads.</p><h3>Concurrency problems do not disappear</h3><p>Virtual threads do <strong>not change the nature of concurrent programming</strong>. The usual problems still remain:</p><ul><li><p>Race condition</p></li><li><p>Deadlock</p></li><li><p>Visibility</p></li><li><p>Atomicity</p></li></ul><p>In other words: virtual threads help you run more work at once. But they do not make your code <strong>automatically correct</strong>. You still have to:</p><ul><li><p>Use locks when needed</p></li><li><p>Ensure safe publication</p></li><li><p>Design shared state carefully</p></li></ul><p><strong>The JVM will not &#8220;keep waiting&#8221; for virtual threads</strong><br>One thing that can be surprising: a virtual thread <strong>does not keep the JVM alive the way a non-daemon platform thread</strong> does. 
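</p><p>A small sketch of this surprise (the class name and the five-second sleep are arbitrary):</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;java&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-java">public class DaemonSurprise {

    public static void main(String[] args) {
        // Virtual threads are always daemon threads, so they do not keep the JVM alive
        Thread.startVirtualThread(() -&gt; {
            try {
                Thread.sleep(5_000);
            } catch (InterruptedException ignored) {
            }
            System.out.println("you will probably never see this line");
        });
        // main returns immediately with no join(): the JVM is free to exit right away
    }
}</code></pre></div><p>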
That means if the main thread ends, the JVM may shut down even while virtual threads are still running.</p><p>So for important tasks (writing data, processing transactions, sending events&#8230;), you need to:</p><ul><li><p>Join explicitly</p></li><li><p>Or manage lifecycle clearly (for example with executors, structured concurrency, &#8230;)</p></li></ul><p>Do not rely on threads the way you used to.</p><h3>Pinning &#8211; when a virtual thread gets &#8220;stuck&#8221; to an OS thread</h3><p>In some situations, a virtual thread cannot unmount from its carrier thread. When that happens, it is called <strong>pinned</strong>. Typical cases include:</p><ul><li><p>Running inside a <code>synchronized</code> block/method and then hitting blocking code</p></li><li><p>Calling a native method or foreign function</p></li></ul><p>When <strong>pinned</strong>, the <strong>carrier thread is kept occupied and cannot be returned to the JVM scheduler</strong>, which reduces the ability to scale significantly. Simply put: you end up close to the traditional thread model again, but without noticing it. For that reason, if your code uses a lot of <code>synchronized</code>, you should re-check:</p><blockquote><p><em>Do you really need the lock? Are you holding the lock while calling I/O?</em></p></blockquote><p>Some ways to improve:</p><ul><li><p>Use <code>ReentrantLock</code> (more flexible)</p></li><li><p>Avoid holding locks while waiting (I/O, sleep, &#8230;)</p></li><li><p>Redesign to reduce shared state</p></li></ul><p>It is not that <code>synchronized</code> is forbidden, but you should not let the virtual thread remain tightly held while waiting.</p><h3>Do not pool virtual threads</h3><p>Another important principle is: <strong>do not pool virtual threads</strong>. 
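</p><p>A quick contrast between the two styles (the pool size of 200 is an arbitrary number for illustration):</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;java&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-java">import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolingContrast {

    public static void main(String[] args) {
        // Anti-pattern: a fixed pool of 200 reused virtual threads
        // reintroduces an artificial cap on concurrency for no benefit
        ExecutorService pooled =
                Executors.newFixedThreadPool(200, Thread.ofVirtual().factory());
        pooled.shutdown();

        // Preferred: a brand-new virtual thread for every task, no reuse
        try (ExecutorService perTask = Executors.newVirtualThreadPerTaskExecutor()) {
            perTask.submit(() -&gt; System.out.println("one cheap thread per task"));
        }
    }
}</code></pre></div><p>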
This is a habit that is very easy to bring over from traditional threads &#8212; and it is wrong.</p><p>Pooling makes sense when threads are <strong>expensive resources</strong>, but <strong>virtual threads are not expensive in that way</strong>. The better model is to create one virtual thread per task, then let the JVM handle their execution.</p><p>In other words, if your task is the unit of work, the virtual thread should be treated as an <strong>abstraction</strong> for that task, not as a precious worker that must be kept for reuse.</p><h3>Be careful with <code>ThreadLocal</code></h3><p>With platform threads, <code>ThreadLocal</code> is sometimes a convenient way to attach data to a thread. But with virtual threads, the number of threads can be very large, so if you abuse <code>ThreadLocal</code>, you can accidentally increase memory usage quickly and make the code harder to control.</p><p>If your goal is to limit access to a finite resource such as a database connection, a semaphore is often clearer and more direct.</p><h3>Do not combine with parallel streams</h3><p>A common misunderstanding is:</p><blockquote><p><strong><s>Virtual thread + parallel stream = faster</s></strong></p></blockquote><p>Parallel streams are mainly designed for <strong>CPU-bound</strong> <strong>workloads</strong> and usually rely on <code>ForkJoinPool</code>, while virtual threads are strongest for <strong>I/O-bound</strong> <strong>workloads</strong> where most of the time is spent waiting.</p><p>The two do not conflict, but they do not naturally amplify each other either. Unless there is a clear reason, combining them usually does not improve performance and can even make the system less predictable.</p><h2>Conclusion</h2><p>In short, virtual threads do not replace every concurrency model, but they are a major step forward for Java in keeping code readable while still scaling well. 
Used in the right place, they reduce complexity without forcing developers into a heavy asynchronous architecture.</p><blockquote><p><strong>If you&#8217;re using Java 21 or later, virtual threads are officially ready to use. If you&#8217;re still using an older version, are you ready to upgrade your Java version? :D :D</strong></p></blockquote><p><em>(If you enjoy these kinds of engineering stories, you can subscribe to receive the next ones.)</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://quangchientran.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://quangchientran.substack.com/subscribe?"><span>Subscribe now</span></a></p><p></p>]]></content:encoded></item><item><title><![CDATA[Building an "All-in-One" Monitoring System with OpenTelemetry and SigNoz]]></title><description><![CDATA[From manual SSH debugging to total system visibility&#8212;how to implement a unified monitoring stack that scales with your code]]></description><link>https://quangchientran.substack.com/p/building-monitoring-system-opentelemetry-signoz</link><guid isPermaLink="false">https://quangchientran.substack.com/p/building-monitoring-system-opentelemetry-signoz</guid><dc:creator><![CDATA[Quang Chien TRAN]]></dc:creator><pubDate>Sun, 19 Apr 2026 15:13:16 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!2bEL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f103078-79c5-4c53-9339-09ea72a31628_2400x1194.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>After my first year as a backend engineer, the thing I&#8217;m most proud of isn&#8217;t a difficult feature, but successfully <strong>building a monitoring system from scratch</strong>.</p><p>It helped me realize a harsh truth: <strong>understanding the code is not 
enough</strong>. Only when I could see a request moving through each service did I truly understand how the system actually works. For the first time, I learned that debugging is not about guessing, but about <strong>observing</strong>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2bEL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f103078-79c5-4c53-9339-09ea72a31628_2400x1194.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2bEL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f103078-79c5-4c53-9339-09ea72a31628_2400x1194.webp 424w, https://substackcdn.com/image/fetch/$s_!2bEL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f103078-79c5-4c53-9339-09ea72a31628_2400x1194.webp 848w, https://substackcdn.com/image/fetch/$s_!2bEL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f103078-79c5-4c53-9339-09ea72a31628_2400x1194.webp 1272w, https://substackcdn.com/image/fetch/$s_!2bEL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f103078-79c5-4c53-9339-09ea72a31628_2400x1194.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2bEL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f103078-79c5-4c53-9339-09ea72a31628_2400x1194.webp" width="1456" height="724" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2f103078-79c5-4c53-9339-09ea72a31628_2400x1194.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:724,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;SigNoz dashboard with application performance metrics - APM&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="SigNoz dashboard with application performance metrics - APM" title="SigNoz dashboard with application performance metrics - APM" srcset="https://substackcdn.com/image/fetch/$s_!2bEL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f103078-79c5-4c53-9339-09ea72a31628_2400x1194.webp 424w, https://substackcdn.com/image/fetch/$s_!2bEL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f103078-79c5-4c53-9339-09ea72a31628_2400x1194.webp 848w, https://substackcdn.com/image/fetch/$s_!2bEL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f103078-79c5-4c53-9339-09ea72a31628_2400x1194.webp 1272w, https://substackcdn.com/image/fetch/$s_!2bEL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f103078-79c5-4c53-9339-09ea72a31628_2400x1194.webp 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"></div></div></a></figure></div><h2>Why I had to build a Monitoring platform</h2><p>At that time, our system barely had a proper Monitoring platform. 
Whenever an incident happened, my process was roughly: open Terminal, SSH into each server, then use <code>grep</code> and <code>tail -f</code> to search through endless log files.</p><p>Debugging back then was basically like searching for a needle in a haystack, moving through a sea of logs and switching back and forth between a bunch of services.</p><ul><li><p><strong>Lost direction:</strong> I didn&#8217;t know which service the error started from in a whole forest of microservices.</p></li><li><p><strong>The trace was broken</strong>: A request might pass through 4 or 5 services, but I had no way to connect them together. It was like trying to find a person in a crowd without a photo.</p></li><li><p><strong>Deadline pressure:</strong> each debugging session could take hours, even days, while the whole team stayed anxious, and the bug was still there.</p></li></ul><p>What made me think was: &#8220;<em><strong>I wonder how many companies out there are still running like this?</strong></em>&#8221;</p><p>At my previous company, I was lucky to have access to Datadog. I have to admit, it was amazing. Everything felt modern, the UI was intuitive, and logs, metrics, and traces were all delivered in one place.</p><p>But ironically, at that time I only used Datadog at a surface level. I knew it was convenient for checking logs faster than SSH, but I hadn&#8217;t truly understood the core value of observability. Only after I no longer had it did I realize how much I was missing.</p><p>And that made me think: &#8220;<em><strong>Why don&#8217;t I build a system like this myself? 
A solution that gives real visibility into my system?</strong></em>&#8221;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0WMl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee99640a-8ab4-4366-a5db-121a02f3a667_1200x673.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0WMl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee99640a-8ab4-4366-a5db-121a02f3a667_1200x673.jpeg 424w, https://substackcdn.com/image/fetch/$s_!0WMl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee99640a-8ab4-4366-a5db-121a02f3a667_1200x673.jpeg 848w, https://substackcdn.com/image/fetch/$s_!0WMl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee99640a-8ab4-4366-a5db-121a02f3a667_1200x673.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!0WMl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee99640a-8ab4-4366-a5db-121a02f3a667_1200x673.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0WMl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee99640a-8ab4-4366-a5db-121a02f3a667_1200x673.jpeg" width="1200" height="673" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ee99640a-8ab4-4366-a5db-121a02f3a667_1200x673.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:673,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0WMl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee99640a-8ab4-4366-a5db-121a02f3a667_1200x673.jpeg 424w, https://substackcdn.com/image/fetch/$s_!0WMl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee99640a-8ab4-4366-a5db-121a02f3a667_1200x673.jpeg 848w, https://substackcdn.com/image/fetch/$s_!0WMl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee99640a-8ab4-4366-a5db-121a02f3a667_1200x673.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!0WMl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee99640a-8ab4-4366-a5db-121a02f3a667_1200x673.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"></div></div></a></figure></div><h2>It wasn&#8217;t as easy as I imagined</h2><p>In my head, I started looking for ways to build a monitoring system like that. My criteria at the beginning were simple.</p><p><strong>Simplicity</strong>: as someone who wasn&#8217;t originally deep into DevOps, I needed something that would work after installation. 
Making it complicated from the start was the fastest way to quit.</p><p><strong>Minimal code changes</strong>: the system was already running stably. Touching the application logic just to add Monitoring was a huge risk. I wanted to &#8220;<strong>add on,</strong>&#8221; not &#8220;<strong>rewrite</strong>.&#8221;</p><p><strong>Minimal tooling</strong>: the more tools you have, the more failure points you create. I didn&#8217;t want to spend my whole day maintaining Monitoring tools.</p><p><strong>Open source</strong>: not just because I wanted something &#8220;<strong>cheaper&#8221;</strong>, but because I wanted a strong community and full control over the data without depending on a vendor&#8217;s pricing.</p><p>But in reality, no solution is perfect from the beginning.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UqwB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb82702e-1732-4d19-ba14-929be4047383_1024x559.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UqwB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb82702e-1732-4d19-ba14-929be4047383_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!UqwB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb82702e-1732-4d19-ba14-929be4047383_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!UqwB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb82702e-1732-4d19-ba14-929be4047383_1024x559.png 1272w, 
https://substackcdn.com/image/fetch/$s_!UqwB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb82702e-1732-4d19-ba14-929be4047383_1024x559.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UqwB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb82702e-1732-4d19-ba14-929be4047383_1024x559.png" width="1024" height="559" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bb82702e-1732-4d19-ba14-929be4047383_1024x559.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:559,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UqwB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb82702e-1732-4d19-ba14-929be4047383_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!UqwB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb82702e-1732-4d19-ba14-929be4047383_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!UqwB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb82702e-1732-4d19-ba14-929be4047383_1024x559.png 1272w, 
https://substackcdn.com/image/fetch/$s_!UqwB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb82702e-1732-4d19-ba14-929be4047383_1024x559.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"></div></div></a></figure></div><h3>SaaS: Datadog, New Relic</h3><p>I have to admit, these are nearly perfect solutions. But the price to pay is&#8230; too <strong>expensive</strong>.</p><p>For a small startup, <strong>spending several thousand USD per month just to view logs feels like an unnecessary luxury</strong>. I wanted something that could give me the Datadog experience, but at the cost of self-hosting.</p><h3>Grafana Tempo, Grafana Loki, Prometheus, InfluxDB, Kibana, Grafana</h3><p>This is a truly killer combo in the open-source world. It&#8217;s complete and powerful, <strong>but extremely fragmented.</strong></p><p>The <strong>configuration nightmare</strong> is real: you have to learn Prometheus for metrics, Loki for logs, and Tempo for tracing.</p><p>The pieces are disconnected, and making them &#8220;<strong>understand</strong>&#8221; <strong>each other is a nightmare</strong>. I tried it and quickly realized that I wanted to be a developer building products, not a <strong>full-time Monitoring engineer</strong> just to maintain this stack.</p><h3>AWS CloudWatch, X-Ray, Trace</h3><p>Even though the system was running on AWS infrastructure, honestly I just couldn&#8217;t get used to CloudWatch&#8217;s interface. It felt old, fragmented, and the user experience wasn&#8217;t smooth.</p><blockquote><p><em><strong>Sorry AWS. I still love you.</strong></em></p></blockquote><h2>The solution appeared unexpectedly</h2><p>This time, I didn&#8217;t find it by reading blogs. I attended a tech conference called <strong>Devoxx</strong>, where I discovered a lot of interesting topics.</p><p>By chance, I joined a talk from a movie streaming company. 
They introduced exactly the combination I had been looking for: <strong>OpenTelemetry + SigNoz</strong>.</p><p>My reaction at the time was basically: &#8220;<em><strong>Wow, this is good</strong></em>.&#8221;</p><h3>OpenTelemetry: the common language of distributed systems</h3><p>If every service in your system speaks a different Monitoring language, then understanding the full picture is impossible. OpenTelemetry was created to solve that problem.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DRw7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07ec8355-9c0e-49f6-817e-e81337c7a95c_1024x559.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DRw7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07ec8355-9c0e-49f6-817e-e81337c7a95c_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!DRw7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07ec8355-9c0e-49f6-817e-e81337c7a95c_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!DRw7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07ec8355-9c0e-49f6-817e-e81337c7a95c_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!DRw7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07ec8355-9c0e-49f6-817e-e81337c7a95c_1024x559.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!DRw7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07ec8355-9c0e-49f6-817e-e81337c7a95c_1024x559.png" width="1024" height="559" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/07ec8355-9c0e-49f6-817e-e81337c7a95c_1024x559.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:559,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DRw7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07ec8355-9c0e-49f6-817e-e81337c7a95c_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!DRw7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07ec8355-9c0e-49f6-817e-e81337c7a95c_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!DRw7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07ec8355-9c0e-49f6-817e-e81337c7a95c_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!DRw7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07ec8355-9c0e-49f6-817e-e81337c7a95c_1024x559.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"></div></div></a></figure></div><p>It provides a powerful abstraction: <strong>OTel is not a place where data is stored, but a 
standard framework for generating and collecting observability data</strong>. It acts like a <strong>translator</strong>, turning signals from <strong>logs, metrics, and traces into a common format</strong>.</p><p>It is also highly flexible. This is its core value. With OTel, you are no longer locked into Datadog or New Relic. You can <strong>switch backends simply by changing the data routing configuration, without changing a single line of business logic</strong>.</p><p>It also gives you end-to-end tracing. OTel lets you attach a <strong>Trace ID</strong> to each request from the moment it enters the system. From there, you can follow its journey <strong>across dozens of microservices</strong> and record every part of that path.</p><h3>SigNoz: all in one</h3><p>If OpenTelemetry is the data collector, then SigNoz is the place where that data gets turned into <strong>valuable insights</strong>. Everything you need for a Monitoring system is in SigNoz already.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GMtl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6dd7779-f583-46d4-beca-05afd6df1aa6_2400x1391.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GMtl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6dd7779-f583-46d4-beca-05afd6df1aa6_2400x1391.webp 424w, https://substackcdn.com/image/fetch/$s_!GMtl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6dd7779-f583-46d4-beca-05afd6df1aa6_2400x1391.webp 848w, 
https://substackcdn.com/image/fetch/$s_!GMtl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6dd7779-f583-46d4-beca-05afd6df1aa6_2400x1391.webp 1272w, https://substackcdn.com/image/fetch/$s_!GMtl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6dd7779-f583-46d4-beca-05afd6df1aa6_2400x1391.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GMtl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6dd7779-f583-46d4-beca-05afd6df1aa6_2400x1391.webp" width="1456" height="844" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b6dd7779-f583-46d4-beca-05afd6df1aa6_2400x1391.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:844,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Enterprise observability hero&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Enterprise observability hero" title="Enterprise observability hero" srcset="https://substackcdn.com/image/fetch/$s_!GMtl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6dd7779-f583-46d4-beca-05afd6df1aa6_2400x1391.webp 424w, https://substackcdn.com/image/fetch/$s_!GMtl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6dd7779-f583-46d4-beca-05afd6df1aa6_2400x1391.webp 848w, 
https://substackcdn.com/image/fetch/$s_!GMtl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6dd7779-f583-46d4-beca-05afd6df1aa6_2400x1391.webp 1272w, https://substackcdn.com/image/fetch/$s_!GMtl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6dd7779-f583-46d4-beca-05afd6df1aa6_2400x1391.webp 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"></div></div></a></figure></div><p>It centralizes everything. Instead of jumping between Prometheus for metrics, Loki for logs, and Jaeger for traces, SigNoz brings everything into a single interface. That correlation is extremely important: <strong>when you see a metric spike, you can immediately click into it and inspect the related logs and traces at the exact same time</strong>.</p><p>It also integrates deeply with OpenTelemetry. SigNoz is built on OTel from day one. It doesn&#8217;t just display data, but also provides advanced features such as <strong>filtering traces by latency, analyzing exceptions, and setting up intelligent alerts</strong>.</p><p>In terms of performance and cost, SigNoz is written in <strong>Go</strong> and uses <strong>ClickHouse</strong>, a very fast columnar database, which allows it to process billions of records with much lower operating cost than traditional SaaS solutions. You also have full control over your data, which is a key factor for businesses that care about security.</p><h3>The OpenTelemetry + SigNoz combo</h3><p>Imagine your microservices system as a skyscraper. OpenTelemetry is the international-standard wiring and power sockets installed in every room. 
SigNoz is the giant 8K TV plugged into that system to display all the security camera feeds.</p><p>What&#8217;s great about this is that if you later want to switch to another &#8220;<strong>TV</strong>&#8221; like Datadog or Jaeger, you just unplug the old one and plug in the new one. You don&#8217;t need to drill through walls and redo all the wiring, meaning you don&#8217;t need to rewrite the code from scratch. That freedom is the biggest value OpenTelemetry brings.</p><p>Without the connection between logs and metrics, you can end up in a situation where metrics show CPU jumping to 95%, but when you open the logs you see thousands of lines flowing every second. You start panicking: &#8220;<strong>Where is the error in this mess?</strong>&#8221;</p><p>With SigNoz, logs are no longer isolated. A log line showing a <code>500 Internal Server Error</code> now comes with full context:</p><ul><li><p>Which <strong>Trace ID</strong> does it belong to? That is, the request journey.</p></li><li><p>What was the <strong>CPU and RAM usage</strong> of that service at the time?</p></li><li><p>What was the <strong>latency at the database</strong> call step?</p></li></ul><p>When all the data can talk to each other, debugging is no longer guesswork. It becomes an investigation based on <strong>real evidence</strong>.</p><h2>Getting started right away</h2><p>After coming back, I jumped straight into it, just to keep the momentum going, haha. Contrary to my early concerns that the system would be too complex, the real integration process turned out to be surprisingly smooth. 
I split the roadmap into 3 steps.</p><ul><li><p><strong>Normalize Logs:</strong> I standardized logs by converting Spring Boot logs into JSON using Logback, which is a built-in feature in Spring Boot.</p></li><li><p><strong>Deploy SigNoz:</strong> through Docker &#8212; fast, clean, and without needing much configuration effort.</p></li><li><p><strong>Activate OpenTelemetry:</strong> by running the OpenTelemetry Agent together with the Java application. I added it to the Dockerfile like this. This was the most magical step: without touching a single line of business logic, data started flowing into SigNoz immediately.</p></li></ul><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;dockerfile&quot;,&quot;nodeId&quot;:&quot;20139157-2c48-46cf-af27-959827697804&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-dockerfile">ENTRYPOINT ["java", "-javaagent:/opentelemetry-javaagent.jar", "-Dotel.exporter.otlp.protocol=grpc", "-jar", "/myapp.jar"]</code></pre></div><p>Everything started working almost immediately. 
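As a side note on how the pieces correlate: the agent propagates request identity between services via the W3C Trace Context `traceparent` header, and that trace id is what ties a log line to its request. Here is a toy, JDK-only sketch of the header's shape (an illustration of the format, not the OpenTelemetry API; the class and method names are made up):

```java
import java.util.concurrent.ThreadLocalRandom;

public class TraceParentSketch {
    // W3C "traceparent": version "00", a 16-byte trace id, an 8-byte span id,
    // and the flags byte "01" (sampled).
    static String newTraceparent() {
        return "00-" + randomHex(16) + "-" + randomHex(8) + "-01";
    }

    static String randomHex(int numBytes) {
        byte[] bytes = new byte[numBytes];
        ThreadLocalRandom.current().nextBytes(bytes);
        StringBuilder sb = new StringBuilder(numBytes * 2);
        for (byte b : bytes) sb.append(String.format("%02x", b & 0xff));
        return sb.toString();
    }

    // The trace id is the second field; a log line carrying it can be joined to its trace.
    static String traceIdOf(String traceparent) {
        return traceparent.split("-")[1];
    }
}
```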
It only took me <strong>2 weeks</strong> to get the first version into production.</p><p>For the first time in many years of operating the system, the whole team could:</p><ul><li><p><strong>See traces</strong> and visually understand how a request moved through services.</p></li><li><p><strong>Identify bottlenecks</strong> and track exactly which service was slowing down the whole system, without guessing p99 from intuition anymore.</p></li><li><p><strong>Debug</strong> faster, cutting the time to find an issue from hours down to just a few minutes.</p></li></ul><h2>What I learned from Monitoring</h2><p>While working with Monitoring, I discovered a lot of concepts that I hadn&#8217;t truly understood before.</p><h3>Observability has three pillars</h3><ul><li><p><strong>Logs:</strong> are timestamped event records that capture individual events such as errors, warnings, and informational messages. They are used to debug specific incidents and understand what happened and why.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Q-Ra!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9532d35e-d9b5-43d6-8f43-9b3e3bd7a4f1_1200x597.svg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Q-Ra!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9532d35e-d9b5-43d6-8f43-9b3e3bd7a4f1_1200x597.svg 424w, https://substackcdn.com/image/fetch/$s_!Q-Ra!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9532d35e-d9b5-43d6-8f43-9b3e3bd7a4f1_1200x597.svg 848w, 
https://substackcdn.com/image/fetch/$s_!Q-Ra!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9532d35e-d9b5-43d6-8f43-9b3e3bd7a4f1_1200x597.svg 1272w, https://substackcdn.com/image/fetch/$s_!Q-Ra!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9532d35e-d9b5-43d6-8f43-9b3e3bd7a4f1_1200x597.svg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Q-Ra!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9532d35e-d9b5-43d6-8f43-9b3e3bd7a4f1_1200x597.svg" width="1200" height="597" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9532d35e-d9b5-43d6-8f43-9b3e3bd7a4f1_1200x597.svg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:597,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Log management hero&quot;,&quot;title&quot;:&quot;Log management hero&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Log management hero" title="Log management hero" srcset="https://substackcdn.com/image/fetch/$s_!Q-Ra!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9532d35e-d9b5-43d6-8f43-9b3e3bd7a4f1_1200x597.svg 424w, https://substackcdn.com/image/fetch/$s_!Q-Ra!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9532d35e-d9b5-43d6-8f43-9b3e3bd7a4f1_1200x597.svg 848w, 
https://substackcdn.com/image/fetch/$s_!Q-Ra!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9532d35e-d9b5-43d6-8f43-9b3e3bd7a4f1_1200x597.svg 1272w, https://substackcdn.com/image/fetch/$s_!Q-Ra!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9532d35e-d9b5-43d6-8f43-9b3e3bd7a4f1_1200x597.svg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"></div></div></a></figure></div></li><li><p><strong>Metrics: </strong> are numeric time-series data such as counters, gauges, and histograms. They show trends and resource usage over time, such as CPU, memory, request rate, error rate, and latency percentiles. They are used for dashboards, alerting, and long-term trend analysis.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PylJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2656370f-986c-44ea-a651-39a429ade6f1_2730x1610.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PylJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2656370f-986c-44ea-a651-39a429ade6f1_2730x1610.webp 424w, https://substackcdn.com/image/fetch/$s_!PylJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2656370f-986c-44ea-a651-39a429ade6f1_2730x1610.webp 848w, https://substackcdn.com/image/fetch/$s_!PylJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2656370f-986c-44ea-a651-39a429ade6f1_2730x1610.webp 1272w, 
https://substackcdn.com/image/fetch/$s_!PylJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2656370f-986c-44ea-a651-39a429ade6f1_2730x1610.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PylJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2656370f-986c-44ea-a651-39a429ade6f1_2730x1610.webp" width="1456" height="859" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2656370f-986c-44ea-a651-39a429ade6f1_2730x1610.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:859,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Metrics Explorer overview interface showing comprehensive metrics visibility&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Metrics Explorer overview interface showing comprehensive metrics visibility" title="Metrics Explorer overview interface showing comprehensive metrics visibility" srcset="https://substackcdn.com/image/fetch/$s_!PylJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2656370f-986c-44ea-a651-39a429ade6f1_2730x1610.webp 424w, https://substackcdn.com/image/fetch/$s_!PylJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2656370f-986c-44ea-a651-39a429ade6f1_2730x1610.webp 848w, 
https://substackcdn.com/image/fetch/$s_!PylJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2656370f-986c-44ea-a651-39a429ade6f1_2730x1610.webp 1272w, https://substackcdn.com/image/fetch/$s_!PylJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2656370f-986c-44ea-a651-39a429ade6f1_2730x1610.webp 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"></div></div></a></figure></div></li><li><p><strong>Traces:</strong> are an end-to-end view of a request&#8217;s journey as it passes through multiple services, including spans and a trace ID. They are extremely useful for identifying where latency or errors are introduced in microservice systems.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SgOt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d2b5c25-b875-481c-b912-f9d2e77befa9_2880x1786.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SgOt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d2b5c25-b875-481c-b912-f9d2e77befa9_2880x1786.png 424w, https://substackcdn.com/image/fetch/$s_!SgOt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d2b5c25-b875-481c-b912-f9d2e77befa9_2880x1786.png 848w, https://substackcdn.com/image/fetch/$s_!SgOt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d2b5c25-b875-481c-b912-f9d2e77befa9_2880x1786.png 1272w, 
https://substackcdn.com/image/fetch/$s_!SgOt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d2b5c25-b875-481c-b912-f9d2e77befa9_2880x1786.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SgOt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d2b5c25-b875-481c-b912-f9d2e77befa9_2880x1786.png" width="1456" height="903" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7d2b5c25-b875-481c-b912-f9d2e77befa9_2880x1786.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:903,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Traces Explorer&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Traces Explorer" title="Traces Explorer" srcset="https://substackcdn.com/image/fetch/$s_!SgOt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d2b5c25-b875-481c-b912-f9d2e77befa9_2880x1786.png 424w, https://substackcdn.com/image/fetch/$s_!SgOt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d2b5c25-b875-481c-b912-f9d2e77befa9_2880x1786.png 848w, https://substackcdn.com/image/fetch/$s_!SgOt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d2b5c25-b875-481c-b912-f9d2e77befa9_2880x1786.png 1272w, 
https://substackcdn.com/image/fetch/$s_!SgOt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d2b5c25-b875-481c-b912-f9d2e77befa9_2880x1786.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"></div></div></a></figure></div></li></ul><h3>There are also related concepts.</h3><ul><li><p><strong>SLA (Service Level Agreement)</strong>: is like a contract with the customer. If it is violated, there may be penalties or compensation.</p></li><li><p><strong>SLO (Service Level Objective)</strong>: is the internal reliability target set by the team, and it is often stricter than the SLA.</p></li><li><p><strong>SLI (Service Level Indicator)</strong>: is the actual measured signal used to determine whether the SLO is being met.</p></li></ul><h3>Latency percentiles are also important.</h3><ul><li><p><strong>p50</strong>: half of users experience this speed or faster. It is the median value.</p></li><li><p><strong>p95</strong>: only 5% of users are slower than this. 
If your p95 is too high, it means 1 out of every 20 users is experiencing pain.</p></li><li><p><strong>p99</strong>: 99% of requests are faster than this, and only 1% are slower.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dbqn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03aac140-8efe-4029-bd88-c855984f035c_1430x789.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dbqn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03aac140-8efe-4029-bd88-c855984f035c_1430x789.webp 424w, https://substackcdn.com/image/fetch/$s_!dbqn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03aac140-8efe-4029-bd88-c855984f035c_1430x789.webp 848w, https://substackcdn.com/image/fetch/$s_!dbqn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03aac140-8efe-4029-bd88-c855984f035c_1430x789.webp 1272w, https://substackcdn.com/image/fetch/$s_!dbqn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03aac140-8efe-4029-bd88-c855984f035c_1430x789.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dbqn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03aac140-8efe-4029-bd88-c855984f035c_1430x789.webp" width="1430" height="789" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/03aac140-8efe-4029-bd88-c855984f035c_1430x789.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:789,&quot;width&quot;:1430,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;SigNoz UI showing the Services section&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="SigNoz UI showing the Services section" title="SigNoz UI showing the Services section" srcset="https://substackcdn.com/image/fetch/$s_!dbqn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03aac140-8efe-4029-bd88-c855984f035c_1430x789.webp 424w, https://substackcdn.com/image/fetch/$s_!dbqn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03aac140-8efe-4029-bd88-c855984f035c_1430x789.webp 848w, https://substackcdn.com/image/fetch/$s_!dbqn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03aac140-8efe-4029-bd88-c855984f035c_1430x789.webp 1272w, https://substackcdn.com/image/fetch/$s_!dbqn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03aac140-8efe-4029-bd88-c855984f035c_1430x789.webp 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"></div></div></a></figure></div><p>Why should we pay special attention to <strong>p99</strong>? Imagine a modern website has to call 100 microservices to finish rendering the homepage. 
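</p><p>To make this concrete, here is a quick back-of-the-envelope calculation (a sketch, assuming the 100 downstream calls are independent): even though each individual service misses its p99 only 1% of the time, roughly 63% of page loads will hit at least one slow call.</p>

```python
# Chance that at least one of n independent service calls exceeds its p99.
# n = 100 services; by definition of p99, each call is "slow" 1% of the time.
n = 100
p_slow = 0.01
p_all_fast = (1 - p_slow) ** n        # probability that every call stays fast
p_at_least_one_slow = 1 - p_all_fast
print(f"{p_at_least_one_slow:.1%}")   # ~63.4%
```

<p>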
If each service has a p99 of 1 second, the probability that at least one service becomes slow, and therefore slows down the whole page, becomes extremely high. In distributed systems, <strong>p99 is not the exception &#8212; it is the future of your system if you do not control it well</strong>. I explained this in depth in a dedicated article about latency, so feel free to read more there.</p><h3>Tracing: the connecting thread</h3><p>If metrics tell you that the system has high latency, <strong>tracing</strong> tells you exactly where the bottleneck is.</p><p>When looking at a trace in SigNoz, the request journey is no longer a black box:</p><div class="callout-block" data-callout="true"><p><em>Client &#8594; API Gateway &#8594; Order Service &#8594; Payment Service &#8594; Database</em></p></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KNtl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eb390b0-e5d4-4872-87f4-d63411d82149_1430x743.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KNtl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eb390b0-e5d4-4872-87f4-d63411d82149_1430x743.webp 424w, https://substackcdn.com/image/fetch/$s_!KNtl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eb390b0-e5d4-4872-87f4-d63411d82149_1430x743.webp 848w, https://substackcdn.com/image/fetch/$s_!KNtl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eb390b0-e5d4-4872-87f4-d63411d82149_1430x743.webp 1272w, 
https://substackcdn.com/image/fetch/$s_!KNtl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eb390b0-e5d4-4872-87f4-d63411d82149_1430x743.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KNtl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eb390b0-e5d4-4872-87f4-d63411d82149_1430x743.webp" width="1430" height="743" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4eb390b0-e5d4-4872-87f4-d63411d82149_1430x743.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:743,&quot;width&quot;:1430,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Trace Details Interface&quot;,&quot;title&quot;:&quot;Trace Details Interface&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Trace Details Interface" title="Trace Details Interface" srcset="https://substackcdn.com/image/fetch/$s_!KNtl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eb390b0-e5d4-4872-87f4-d63411d82149_1430x743.webp 424w, https://substackcdn.com/image/fetch/$s_!KNtl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eb390b0-e5d4-4872-87f4-d63411d82149_1430x743.webp 848w, https://substackcdn.com/image/fetch/$s_!KNtl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eb390b0-e5d4-4872-87f4-d63411d82149_1430x743.webp 1272w, 
https://substackcdn.com/image/fetch/$s_!KNtl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eb390b0-e5d4-4872-87f4-d63411d82149_1430x743.webp 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"></div></div></a></figure></div><p>Everything is shown in a clear Gantt-style visualization.</p><ul><li><p><strong>Service involvement</strong>: Does Service A calling Service B result in an error right at the gateway?</p></li><li><p><strong>Where is the bottleneck?</strong> Does the database call take 800ms while the entire request takes only 1 second? You immediately know you need to optimize the query instead of fixing the Java code.</p></li><li><p><strong>Asymmetry</strong>: Some requests take 10 steps but are extremely fast, while others take only 2 steps but are extremely slow. Tracing helps you identify these bottlenecks.</p></li></ul><p>Even now, I still haven&#8217;t explored every corner of SigNoz, but one thing is certain: <strong>my mindset has changed.</strong></p><p>Instead of SSH-ing into servers and digging through lines of logs, I can now look directly at charts and observe the full flow of data. This tool didn&#8217;t just save me from bugs that came from nowhere, it also made me more confident when designing larger systems, because I know I have the ability to control them.</p><h2>End</h2><p>Looking back on that journey, I realized that Monitoring is not just about tools, but about understanding.</p><p>To be honest, I&#8217;m not yet a true monitoring expert, but SigNoz and OpenTelemetry helped me understand the system a hundred times better than just sitting there grepping logs like I used to.</p><blockquote><p><em><strong>Don&#8217;t wait until your system crashes to build Monitoring. 
Build it while the system is still running well, so you know what &#8220;good&#8221; looks like before you learn what &#8220;bad&#8221; feels like.</strong></em></p></blockquote><p><em>(If you enjoy these kinds of engineering stories, you can subscribe to receive the next ones.)</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://quangchientran.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://quangchientran.substack.com/subscribe?"><span>Subscribe now</span></a></p>]]></content:encoded></item><item><title><![CDATA[Deep Inside PostgreSQL: Processes, Forking, and the Memory Trade-off]]></title><description><![CDATA[Understanding PostgreSQL Architecture: Process-per-Connection, Resource Management, and Scaling with Connection Pooling]]></description><link>https://quangchientran.substack.com/p/deep-inside-postgres-processes-forking</link><guid isPermaLink="false">https://quangchientran.substack.com/p/deep-inside-postgres-processes-forking</guid><dc:creator><![CDATA[Quang Chien TRAN]]></dc:creator><pubDate>Tue, 31 Mar 2026 23:36:39 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Mts_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77d3d150-9374-454e-8704-fceb198e799c_1408x768.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>PostgreSQL is one of the most widely used relational databases today. I&#8217;ve used it myself and have grown to really like it. It is an open-source platform that is continuously updated, and the latest version, as of now, is version 18, I believe. 
PostgreSQL was created to handle <strong>high concurrency across many read, write, and update workloads</strong>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Mts_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77d3d150-9374-454e-8704-fceb198e799c_1408x768.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Mts_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77d3d150-9374-454e-8704-fceb198e799c_1408x768.png 424w, https://substackcdn.com/image/fetch/$s_!Mts_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77d3d150-9374-454e-8704-fceb198e799c_1408x768.png 848w, https://substackcdn.com/image/fetch/$s_!Mts_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77d3d150-9374-454e-8704-fceb198e799c_1408x768.png 1272w, https://substackcdn.com/image/fetch/$s_!Mts_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77d3d150-9374-454e-8704-fceb198e799c_1408x768.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Mts_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77d3d150-9374-454e-8704-fceb198e799c_1408x768.png" width="1408" height="768" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/77d3d150-9374-454e-8704-fceb198e799c_1408x768.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1408,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2291041,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://quangchientran.substack.com/i/192788246?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77d3d150-9374-454e-8704-fceb198e799c_1408x768.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Mts_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77d3d150-9374-454e-8704-fceb198e799c_1408x768.png 424w, https://substackcdn.com/image/fetch/$s_!Mts_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77d3d150-9374-454e-8704-fceb198e799c_1408x768.png 848w, https://substackcdn.com/image/fetch/$s_!Mts_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77d3d150-9374-454e-8704-fceb198e799c_1408x768.png 1272w, https://substackcdn.com/image/fetch/$s_!Mts_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77d3d150-9374-454e-8704-fceb198e799c_1408x768.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"></div></div></a></figure></div><p>In this article, I won&#8217;t go over PostgreSQL&#8217;s old or new features, since those are easy to find with a quick search. 
Instead, I want to go deeper into the principles and internal architecture behind PostgreSQL &#8212; the things that affect performance and the challenges PostgreSQL has to deal with and solve.</p><p>There are two basic principles in PostgreSQL: <strong>process per connection</strong> and <strong>copy-on-write (MVCC &#8212; Multi-Version Concurrency Control)</strong>. We already covered <a href="https://open.substack.com/pub/quangchientran/p/10-years-as-a-developer-but-only?r=5zk2y9&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">MVCC in a previous article</a>. In this one, we&#8217;ll focus on the <strong>process per connection</strong> principle, which anyone learning PostgreSQL should understand clearly. Let&#8217;s get started.</p><h2><strong>Process per connection</strong></h2><p>This is the core principle of PostgreSQL: each connection to the database becomes a new process. That is different from MySQL, which is commonly described as using a thread-based model for connections.</p><p>The main components are:</p><ul><li><p><strong>Postmaster</strong>: the main server process that listens for new client connections and starts backend processes.</p></li><li><p><strong>Backend process</strong>: a dedicated process forked by the postmaster for each connection, which executes that client&#8217;s queries.</p></li><li><p><strong>Shared memory</strong>: stores the buffer cache and locks shared by all processes.</p></li><li><p><strong>Background processes</strong>: run in the background for automatic tasks such as autovacuum, checkpointer, and others.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Lt7M!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dcba101-a7e6-41ec-84b1-0ae146f0fae9_1024x559.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!Lt7M!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dcba101-a7e6-41ec-84b1-0ae146f0fae9_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!Lt7M!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dcba101-a7e6-41ec-84b1-0ae146f0fae9_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!Lt7M!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dcba101-a7e6-41ec-84b1-0ae146f0fae9_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!Lt7M!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dcba101-a7e6-41ec-84b1-0ae146f0fae9_1024x559.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Lt7M!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dcba101-a7e6-41ec-84b1-0ae146f0fae9_1024x559.png" width="1024" height="559" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2dcba101-a7e6-41ec-84b1-0ae146f0fae9_1024x559.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:559,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:557957,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://quangchientran.substack.com/i/192788246?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dcba101-a7e6-41ec-84b1-0ae146f0fae9_1024x559.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Lt7M!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dcba101-a7e6-41ec-84b1-0ae146f0fae9_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!Lt7M!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dcba101-a7e6-41ec-84b1-0ae146f0fae9_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!Lt7M!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dcba101-a7e6-41ec-84b1-0ae146f0fae9_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!Lt7M!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dcba101-a7e6-41ec-84b1-0ae146f0fae9_1024x559.png 1456w" sizes="100vw"></picture><div class="image-link-expand"></div></div></a></figure></div><h2><strong>The lifecycle of a request</strong></h2><p>When we run this query:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;sql&quot;,&quot;nodeId&quot;:&quot;251db20f-1c4d-410e-afc8-c06d0d3b4914&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-sql">SELECT * FROM users WHERE id = 1;</code></pre></div><p>The client sends a TCP/IP request to the PostgreSQL server to execute that statement through the standard 3-way handshake:</p><ul><li><p><strong>SYN</strong>: the client sends a packet &#8212; &#8220;Hello, server.&#8221;</p></li><li><p><strong>SYN-ACK</strong>: the server responds &#8212; &#8220;Hello, client.&#8221;</p></li><li><p><strong>ACK</strong>: the client confirms the connection.</p></li></ul><p>After receiving the connection request, the postmaster forks a separate backend process that has nothing to do with other connections.</p><p>That backend process 
handles the statement in three steps:</p><ul><li><p><strong>Parse</strong>: checks the query syntax and builds a parse tree.</p></li><li><p><strong>Planner</strong>: creates the execution plan and decides things like index scan, full table scan, and join strategy.</p></li><li><p><strong>Executor</strong>: runs the query using the execution plan and accesses data through PostgreSQL&#8217;s buffer manager. If the needed data is not in memory, it may trigger disk I/O.</p></li></ul><p>Finally, the backend process sends the result back to the client through the socket.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eJ3f!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce36c554-dd3e-4c61-8025-4cf21f984611_1024x559.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eJ3f!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce36c554-dd3e-4c61-8025-4cf21f984611_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!eJ3f!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce36c554-dd3e-4c61-8025-4cf21f984611_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!eJ3f!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce36c554-dd3e-4c61-8025-4cf21f984611_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!eJ3f!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce36c554-dd3e-4c61-8025-4cf21f984611_1024x559.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!eJ3f!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce36c554-dd3e-4c61-8025-4cf21f984611_1024x559.png" width="1024" height="559" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ce36c554-dd3e-4c61-8025-4cf21f984611_1024x559.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:559,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:392789,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://quangchientran.substack.com/i/192788246?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce36c554-dd3e-4c61-8025-4cf21f984611_1024x559.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!eJ3f!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce36c554-dd3e-4c61-8025-4cf21f984611_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!eJ3f!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce36c554-dd3e-4c61-8025-4cf21f984611_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!eJ3f!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce36c554-dd3e-4c61-8025-4cf21f984611_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!eJ3f!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce36c554-dd3e-4c61-8025-4cf21f984611_1024x559.png 1456w" 
sizes="100vw" loading="lazy"></picture><div class="image-link-expand"></div></div></a></figure></div><h2><strong>Advantages of process per connection</strong></h2><h2><strong>Isolation</strong></h2><p>Each PostgreSQL connection is handled by a separate process, so one slow connection or one connection consuming a lot of CPU will not directly affect the others. A simple way to think about it is that each customer has their own support agent. If one customer is difficult or takes a lot of time, only that agent is affected, not the other customers.</p><p>Each backend process has its own workspace, its own memory, and its own execution flow at the operating-system level. <strong>So if one heavy query consumes 80% of the CPU, it mainly slows down that process itself instead of interfering with other connections</strong>.</p><p>In thread-based database systems such as MySQL, many connections <strong>share one larger process and the resources</strong> inside it. When one thread becomes overloaded, CPU scheduling, lock contention, or shared resource bottlenecks can make the other threads wait longer. 
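The process-side isolation can be sketched with stand-in worker processes (plain Python, not real PostgreSQL backends): a crash in one worker leaves its sibling completely untouched.

```python
import multiprocessing as mp

def worker(q, should_crash):
    # Stand-in for a backend process handling one connection.
    if should_crash:
        raise RuntimeError("simulated broken connection")
    q.put("query result")

# POSIX fork context, so this sketch runs as a plain script.
ctx = mp.get_context("fork")
q = ctx.Queue()
bad = ctx.Process(target=worker, args=(q, True))
good = ctx.Process(target=worker, args=(q, False))
bad.start(); good.start()
bad.join(); good.join()

result = q.get()   # the healthy worker still delivered its result
print(result, bad.exitcode, good.exitcode)
```

The broken worker exits with a non-zero code, while the healthy one finishes normally; nothing the failing process did leaked into its sibling.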
That means average latency can rise under heavy load, especially when many connections are active at the same time.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Yxuk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13b43abd-e86c-4d70-be36-37ada30f1fcf_1024x559.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Yxuk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13b43abd-e86c-4d70-be36-37ada30f1fcf_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!Yxuk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13b43abd-e86c-4d70-be36-37ada30f1fcf_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!Yxuk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13b43abd-e86c-4d70-be36-37ada30f1fcf_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!Yxuk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13b43abd-e86c-4d70-be36-37ada30f1fcf_1024x559.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Yxuk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13b43abd-e86c-4d70-be36-37ada30f1fcf_1024x559.png" width="1024" height="559" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/13b43abd-e86c-4d70-be36-37ada30f1fcf_1024x559.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:559,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:482487,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://quangchientran.substack.com/i/192788246?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13b43abd-e86c-4d70-be36-37ada30f1fcf_1024x559.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Yxuk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13b43abd-e86c-4d70-be36-37ada30f1fcf_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!Yxuk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13b43abd-e86c-4d70-be36-37ada30f1fcf_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!Yxuk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13b43abd-e86c-4d70-be36-37ada30f1fcf_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!Yxuk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13b43abd-e86c-4d70-be36-37ada30f1fcf_1024x559.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"></div></div></a></figure></div><h2><strong>Security</strong></h2><p>With PostgreSQL, each connection is handled separately. 
That means when a user logs in, the system checks authentication, access rights, and session state for each connection individually instead of putting everything into one shared place.</p><p>This improves security because if one connection has a problem, it only affects that session and does not easily spread to other connections.</p><p>A simple analogy is an apartment building where each person has their own room and their own lock. If one room has a problem, the others are fine.</p><p>In thread-based systems, many connections may share more common resources. When too many requests arrive at once, that sharing turns into resource contention. If something goes wrong in the shared part, it can affect more connections.</p><h2><strong>Scalability</strong></h2><p>PostgreSQL can scale to serve many concurrent users while still maintaining fairly stable performance if it is configured properly. Because it uses process per connection:</p><ul><li><p>Each connection has its own processing space.</p></li><li><p>One heavy connection does not immediately block the whole system.</p></li><li><p>Each session can be isolated more easily.</p></li></ul><p>However, <strong>each process also consumes resources such as CPU and memory</strong>. As the number of connections grows too large, resource cost grows with it. So while PostgreSQL scales well, that does not mean performance keeps improving forever just because more connections are added.</p><h2><strong>Flexibility</strong></h2><p>PostgreSQL supports several types of connections:</p><ul><li><p><strong>TCP/IP</strong>: usually used when the application and database are on different machines or networks. This is the most common connection type because it is flexible and easy to use in production, cloud, and microservices environments. The downside is network overhead, so it is usually slower than local connections.</p></li><li><p><strong>Unix socket</strong>: usually used when the application and database are on the same server. 
Because it does not go through the TCP/IP stack, it has lower latency and uses fewer resources.</p></li><li><p><strong>Shared memory</strong>: an <strong>internal IPC (Inter-Process Communication)</strong> mechanism that PostgreSQL uses to share the buffer cache, locks, and other internal state across backend processes. This is extremely fast because data is exchanged directly through memory without going over the network.</p></li></ul><h2><strong>The trade-off</strong></h2><p>Because PostgreSQL uses process per connection, each connection creates a separate process. That means each connection is not just a session &#8212; it also requires operating-system resources to run independently.</p><p>The downside is that when the <strong>number of connections grows, resource usage grows quickly</strong> too.</p><h2><strong>What is </strong><code>fork()</code><strong>?</strong></h2><p><code>fork()</code> is an operating-system call that creates a child process from a parent process. At first, the child process is almost identical to the parent, and then PostgreSQL separates it to serve one specific connection.</p><p>When <strong>PostgreSQL</strong> forks a process, the OS doesn't actually copy all the memory. It uses <strong>Copy-on-Write (CoW)</strong>. The new process "points" to the parent's memory. Only when the new process tries to <em>change</em> something does the OS actually copy that specific page of memory. This is why <strong>PostgreSQL</strong> can start processes relatively quickly, though still slower than threads.</p><p>When a new process is created, the operating system must prepare:</p><ul><li><p>memory for the process</p></li><li><p>file descriptors</p></li><li><p>related system state</p></li><li><p>process management information</p></li></ul><p>Even though PostgreSQL uses optimizations like copy-on-write, creating a separate process still costs much more than creating a thread. 
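A tiny <code>fork()</code> sketch (plain Python on a Unix-like system, not PostgreSQL code) shows the isolation that copy-on-write provides: the child's writes trigger a private copy of the page, so they never reach the parent's memory.

```python
import os

data = ["original"]

pid = os.fork()
if pid == 0:
    # Child: this write copies the page privately,
    # so the change is invisible to the parent.
    data[0] = "changed in child"
    os._exit(0)

os.waitpid(pid, 0)  # parent waits for the child to exit
print(data[0])      # still "original"
```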
That is why each PostgreSQL connection has its own overhead.</p><p><strong>Overhead</strong> is the cost of operating a process that is not directly part of handling the client&#8217;s request.</p><p>The amount of memory each process uses depends on:</p><ul><li><p>PostgreSQL version</p></li><li><p>server configuration</p></li><li><p>workload</p></li><li><p>extensions</p></li><li><p>operating system</p></li></ul><h2><strong>CPU context switching</strong></h2><p>When the CPU switches from one process or thread to another, it must save the current state and load the new one. That state includes registers, the instruction pointer, and part of the current execution information.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qbGa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd5bb64b-9367-44ab-beb5-abfbf07d424d_1024x559.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qbGa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd5bb64b-9367-44ab-beb5-abfbf07d424d_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!qbGa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd5bb64b-9367-44ab-beb5-abfbf07d424d_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!qbGa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd5bb64b-9367-44ab-beb5-abfbf07d424d_1024x559.png 1272w, 
https://substackcdn.com/image/fetch/$s_!qbGa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd5bb64b-9367-44ab-beb5-abfbf07d424d_1024x559.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qbGa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd5bb64b-9367-44ab-beb5-abfbf07d424d_1024x559.png" width="1024" height="559" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dd5bb64b-9367-44ab-beb5-abfbf07d424d_1024x559.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:559,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:640125,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://quangchientran.substack.com/i/192788246?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd5bb64b-9367-44ab-beb5-abfbf07d424d_1024x559.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qbGa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd5bb64b-9367-44ab-beb5-abfbf07d424d_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!qbGa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd5bb64b-9367-44ab-beb5-abfbf07d424d_1024x559.png 848w, 
https://substackcdn.com/image/fetch/$s_!qbGa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd5bb64b-9367-44ab-beb5-abfbf07d424d_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!qbGa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd5bb64b-9367-44ab-beb5-abfbf07d424d_1024x559.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"></div></div></a></figure></div><p>This switching is called <strong>context switching</strong>. For example, with two processes A and B, it happens when process A becomes <strong>blocked</strong> &#8212; for example, <strong>waiting for I/O</strong> &#8212; or when it has used up its <strong>time slice</strong>. There are three common situations:</p><ul><li><p>Process A is running but must wait for I/O, such as disk access, network access, or incoming data. In that case, Process A moves to the waiting/blocked state.</p></li><li><p>Process A uses up its allocated CPU time, and the operating system stops it to give CPU time to another process.</p></li><li><p>A process with higher priority appears, so the scheduler decides to switch context.</p></li></ul><p>If a PostgreSQL server has only 8 CPU cores but thousands of database connections arrive at the same time, the operating system cannot run them all simultaneously. It must divide CPU time into tiny slices for each connection. So the pattern becomes:</p><ul><li><p>run process A for a short time</p></li><li><p>stop</p></li><li><p>save A&#8217;s state</p></li><li><p>switch to process B</p></li><li><p>repeat for all other connections</p></li></ul><p>Each switch takes time. The CPU is not only doing useful work &#8212; it is also spending time moving between tasks. 
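On Unix-like systems you can actually observe these switches for your own process; a minimal sketch using the standard library:

```python
import resource

# ru_nvcsw:  voluntary switches (the process blocked, e.g. waiting on I/O)
# ru_nivcsw: involuntary switches (the scheduler preempted the process)
usage = resource.getrusage(resource.RUSAGE_SELF)
print("voluntary:", usage.ru_nvcsw, "involuntary:", usage.ru_nivcsw)
```

On a loaded database host, a rapidly climbing involuntary counter is one signal that processes are fighting for CPU time.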
When the number of connections is too large, <strong>the CPU spends too much time switching instead of processing queries</strong>, and performance drops.</p><p>A simple analogy is a chef with only one pan who has to cook for 100 customers. If the chef keeps jumping back and forth between dishes, a lot of time is wasted switching instead of finishing one dish at a time. The CPU behaves in a similar way.</p><h2><strong>Disk contention</strong></h2><p>Disk contention means many requests competing to read and write disk at the same time. Databases depend heavily on disk speed because data is not always in RAM. When cache is not enough, or the data is not already in memory, the system has to go to disk.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CZIY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f2c995e-08e5-463b-a8c0-a69bfa6a68a0_1024x559.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CZIY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f2c995e-08e5-463b-a8c0-a69bfa6a68a0_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!CZIY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f2c995e-08e5-463b-a8c0-a69bfa6a68a0_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!CZIY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f2c995e-08e5-463b-a8c0-a69bfa6a68a0_1024x559.png 1272w, 
https://substackcdn.com/image/fetch/$s_!CZIY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f2c995e-08e5-463b-a8c0-a69bfa6a68a0_1024x559.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CZIY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f2c995e-08e5-463b-a8c0-a69bfa6a68a0_1024x559.png" width="1024" height="559" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0f2c995e-08e5-463b-a8c0-a69bfa6a68a0_1024x559.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:559,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:643491,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://quangchientran.substack.com/i/192788246?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f2c995e-08e5-463b-a8c0-a69bfa6a68a0_1024x559.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CZIY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f2c995e-08e5-463b-a8c0-a69bfa6a68a0_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!CZIY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f2c995e-08e5-463b-a8c0-a69bfa6a68a0_1024x559.png 848w, 
https://substackcdn.com/image/fetch/$s_!CZIY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f2c995e-08e5-463b-a8c0-a69bfa6a68a0_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!CZIY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f2c995e-08e5-463b-a8c0-a69bfa6a68a0_1024x559.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"></div></div></a></figure></div><p>When too many requests arrive at once, the disk gets congested because it must continuously serve many different requests.</p><p>If the system only handles a few requests, it can read data in a mostly continuous way, called <strong>sequential read</strong>. This is fast because the disk reads nearby blocks in one pass.</p><p>But when hundreds of requests arrive at the same time, each one may need data from a different location. Then the disk has to jump around a lot, which becomes <strong>random I/O</strong>, which is much slower than sequential access.</p><p>Think of a librarian: if only a few people ask for books near each other, the librarian only needs to walk to one shelf area. But if each person asks for books in different parts of the library, the librarian has to move all over the place, which takes much more time.</p><h2><strong>Connection pooling</strong></h2><p>With PostgreSQL, connection pooling is often <strong>essential</strong> when a system has many requests or many application instances. 
The core reason is that PostgreSQL uses the process-per-connection model, so if your app opens and closes connections continuously, the database spends a lot of effort <strong>creating, managing, and cleaning up connections instead of doing useful work like executing queries</strong>.</p><p>The problem is that every time the app opens a new connection, PostgreSQL has to perform the handshake, create a backend process, allocate resources, and later release everything when the connection closes. This creates <strong>connection churn</strong> &#8212; connections coming and going all the time &#8212; which is expensive in CPU, RAM, and latency.</p><p>If the number of connections gets too high, you also run into more context switching, higher memory pressure, and longer processing queues. This is especially bad when traffic spikes suddenly or when there are many <strong>short-lived requests</strong>, because the database ends up spending a lot of effort on &#8220;<strong>connections</strong>&#8221; instead of &#8220;<strong>queries</strong>.&#8221;</p><p>So connection pooling makes a smart trade-off: &#8220;<em><strong>why keep creating and destroying expensive connections when we can create some in advance and reuse them?</strong></em>&#8221;</p><p>Connection pooling keeps a set of &#8220;<strong>warm</strong>&#8221; connections ready for reuse. When a request comes in, the app borrows a connection from the pool, runs the query, and returns it to the pool instead of creating a new one from scratch. This reduces handshake cost and the cost of creating new processes.</p><p>In other words, connection pooling <strong>changes the problem from &#8220;one request, one new connection&#8221; into &#8220;many requests sharing a smaller group of connections.&#8221;</strong> That improves throughput and reduces latency.</p><p>Because PostgreSQL does not use thread-per-connection, it gains isolation and stability, but the cost is that each connection is heavier. 
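The borrow-and-return cycle described above can be sketched with a generic pool. This uses stand-in connection objects rather than a real driver (in practice you would use your driver's or framework's built-in pool), but the accounting is the point: many requests, few real connections.

```python
import queue

class ConnectionPool:
    """Keeps a fixed set of warm connections for reuse."""

    def __init__(self, size, connect):
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(connect())  # pay the connection cost once, up front

    def acquire(self):
        # Borrow a warm connection; blocks if all are currently in use.
        return self._idle.get()

    def release(self, conn):
        # Hand the connection back for the next request.
        self._idle.put(conn)

opened = []

def connect():
    conn = object()        # stand-in for an expensive real connection
    opened.append(conn)
    return conn

pool = ConnectionPool(size=2, connect=connect)

for _ in range(100):       # 100 requests...
    conn = pool.acquire()
    # ... run the query on `conn` here ...
    pool.release(conn)

print(len(opened))         # ...but only 2 real connections were ever opened
```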
So when many requests arrive, resources can grow very quickly.</p><p>For PostgreSQL, <strong>connection pooling is not just an extra optimization</strong> &#8212; it is often what keeps the system healthy when there are many concurrent users or many application instances.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ps72!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3e57f4d-8bc6-45bb-8291-4c008b684ccf_1024x559.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ps72!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3e57f4d-8bc6-45bb-8291-4c008b684ccf_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!Ps72!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3e57f4d-8bc6-45bb-8291-4c008b684ccf_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!Ps72!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3e57f4d-8bc6-45bb-8291-4c008b684ccf_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!Ps72!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3e57f4d-8bc6-45bb-8291-4c008b684ccf_1024x559.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ps72!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3e57f4d-8bc6-45bb-8291-4c008b684ccf_1024x559.png" width="1024" height="559" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d3e57f4d-8bc6-45bb-8291-4c008b684ccf_1024x559.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:559,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:783258,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://quangchientran.substack.com/i/192788246?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3e57f4d-8bc6-45bb-8291-4c008b684ccf_1024x559.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Ps72!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3e57f4d-8bc6-45bb-8291-4c008b684ccf_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!Ps72!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3e57f4d-8bc6-45bb-8291-4c008b684ccf_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!Ps72!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3e57f4d-8bc6-45bb-8291-4c008b684ccf_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!Ps72!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3e57f4d-8bc6-45bb-8291-4c008b684ccf_1024x559.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"></div></div></a></figure></div><h2><strong>When pooling matters most</strong></h2><p>Pooling is especially useful when the system has these 
characteristics:</p><ul><li><p>Many short requests, such as web APIs.</p></li><li><p>The application is scaled to many instances connecting to one shared database.</p></li><li><p>Traffic arrives in bursts.</p></li><li><p>The application is serverless or built with microservices, where connect/disconnect happens frequently.</p></li></ul><p>In those cases, PostgreSQL can become overwhelmed without pooling, even if the queries themselves are not very heavy.</p><p>One important thing: <strong>pooling does not make your SQL queries faster</strong>. If a query scans an entire table, lacks an index, or is blocked by a lock, pooling only <strong>helps you avoid connection overhead</strong> &#8212; it does not replace SQL optimization.</p><p>In other words, pooling solves connection cost, while indexes, query plans, and schema design solve query cost.</p><h2><strong>How to estimate pool size</strong></h2><p>For PostgreSQL, the pool size should satisfy three conditions:</p><ul><li><p>It should not exceed the number of processes the DB and CPU can handle.</p></li><li><p>It should not exhaust RAM, because each process/connection uses its own memory.</p></li><li><p>It should not make queries wait too long in line.</p></li></ul><p>There is no perfect formula, but a practical starting point is:</p><pre><code><strong>pool_size</strong> = min(
<strong>cores</strong> <strong>* 2</strong>, 
<strong>floor</strong>(<strong>RAM_for_pool_MB</strong> / <strong>memory_per_connection_MB</strong>), 
<strong>max_db_connections</strong> - <strong>reserve
</strong>)</code></pre><p>Where:</p><ul><li><p><strong>RAM_for_pool_MB</strong> is the memory reserved specifically for connections, not the entire server RAM.</p></li><li><p><strong>memory_per_connection_MB</strong> is the memory used by one connection process, and it should ideally be measured in practice rather than guessed.</p></li><li><p><strong>reserve</strong> should keep about 5 to 20 connections available for admin, monitoring, and maintenance.</p></li></ul><p>For example, if your database has:</p><ul><li><p>4 cores</p></li><li><p>8 GB RAM, but only 1 GB reserved for connections</p></li><li><p>each connection process uses 10 MB</p></li><li><p>10 connections reserved for admin</p></li></ul><p>then, assuming <strong>max_db_connections</strong> = 100:</p><pre><code><strong>pool_size = min(4 * 2, floor(1024 / 10), 100 - 10) = 8</strong></code></pre><p>This is only a starting configuration. You should still run load tests and monitor real behavior, because the best pool size depends on the workload.</p><h2><strong>Two kinds of pooling</strong></h2><p>There are two common types of pooling: <strong>application-side pooling</strong> and <strong>external pooling</strong> with a proxy such as PgBouncer. Application-side pooling keeps connections inside each app instance, while PgBouncer sits between the app and PostgreSQL to collect many clients and limit the number of real backend connections reaching the database.</p><p>In simple terms:</p><ul><li><p>Application-side pool helps the application avoid reconnecting all the time.</p></li><li><p>PgBouncer helps the whole system avoid flooding the database with too many connections.</p></li></ul><h2><strong>Why application-side pooling alone is not enough</strong></h2><p>If you only have one application connected to the database, application-side pooling is often enough. But when you have many apps, multiple instances, autoscaling, or microservices, each instance creates its own pool. 
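The sizing rule from the previous section can be written as a small helper to sanity-check each instance's pool (the numbers below are the article's illustrative values, not recommendations):

```python
import math

def pool_size(cores, ram_for_pool_mb, memory_per_connection_mb,
              max_db_connections, reserve):
    # Take the tightest of the three limits: CPU, RAM, and DB connection cap.
    return min(
        cores * 2,
        math.floor(ram_for_pool_mb / memory_per_connection_mb),
        max_db_connections - reserve,
    )

size = pool_size(cores=4, ram_for_pool_mb=1024, memory_per_connection_mb=10,
                 max_db_connections=100, reserve=10)
print(size)  # min(8, 102, 90) = 8
```

Remember that this limit is per pool: with many application instances, each running its own pool, the totals still add up at the database.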
The total number of connections reaching PostgreSQL can add up very quickly and exceed what the database can comfortably handle.</p><p>For example:</p><ul><li><p>20 instances</p></li><li><p>10 connections per instance</p></li></ul><p>That already means 200 real connections to the database.</p><p>And <strong>PostgreSQL</strong> does not see 20 apps. It only sees 200 backend connections.</p><h2><strong>Why PgBouncer is useful</strong></h2><p><strong>PgBouncer</strong> acts like a lightweight connection proxy in front of PostgreSQL. It keeps a small number of real database connections and allows many clients to share them, especially in <strong>transaction pooling</strong> mode. This reduces process creation overhead, reduces memory usage, and helps prevent connection storms when traffic spikes.</p><p>A very practical benefit is that if your app has 1,000 logical clients but only needs 20 to 25 real backend connections, PostgreSQL will stay much healthier.</p><h2><strong>PgBouncer pooling modes</strong></h2><p>PgBouncer has 3 main modes:</p><ul><li><p><strong>Session pooling</strong>: a database connection is assigned to the client for the entire lifetime of the session. 
This is the simplest mode, but also the least efficient.</p></li><li><p><strong>Transaction pooling</strong>: a connection is assigned only during a transaction.</p></li><li><p><strong>Statement pooling</strong>: each query may use a different connection.</p></li></ul><p>For web workloads, transaction pooling is often the best choice because the connection is held only while the transaction is running, and then it is returned to the pool.</p><p>The key thing to remember is that statement pooling is powerful but does not work well with multi-statement transactions, while session pooling is safer but less efficient in terms of connection usage.</p><h2><strong>Conclusion</strong></h2><p>While PostgreSQL handles concurrency very well, performance can still degrade when the number of active connections becomes too high, especially because of context switching, memory pressure, and disk contention. That is why connection pooling is often essential in production systems.</p><p>Understanding PostgreSQL helps you debug better, understand the system more deeply, configure it correctly, and build applications that scale more safely and efficiently.</p><p><em>(If you enjoy these kinds of engineering stories, you can subscribe to receive the next ones.)</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://quangchientran.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://quangchientran.substack.com/subscribe?"><span>Subscribe now</span></a></p>]]></content:encoded></item><item><title><![CDATA[The Latency Trap: Why Tips and Tricks Aren't Enough]]></title><description><![CDATA[Stop guessing why your requests are slow. Learn the fundamental formula: Latency = Propagation + Queueing + Service. 
Discover why p99 metrics matter and how to optimize your system without over-engineering.]]></description><link>https://quangchientran.substack.com/p/understanding-latency-a-simple-formula</link><guid isPermaLink="false">https://quangchientran.substack.com/p/understanding-latency-a-simple-formula</guid><pubDate>Tue, 24 Mar 2026 14:16:36 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!R4Dm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb84c1736-1deb-4265-8840-99fba7e44441_1376x768.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Up until now, through all my learning, thinking, and working, the concept of latency (request delay) has always been&#8230; kinda vague to me. Whenever I heard about it, in my head it was always:</p><blockquote><p><em><strong>&#8220;Ah okay, latency is the time from when the client sends a request until it receives the response. Done.&#8221;</strong></em></p></blockquote><p>Yeah&#8230; not wrong. Completely correct. But also&#8230; completely useless &#128516;. It&#8217;s one of those definitions that sounds obvious, like &#8220;yeah yeah everyone knows that&#8221;.<br>But when you actually start working with it, suddenly it explains&#8230; nothing. For example:</p><p>When a manager asks:</p><blockquote><p><em><strong>&#8220;Why is this request so slow? It just fetches a list, why does it take 2 seconds?&#8221;</strong></em></p></blockquote><p>Or:</p><blockquote><p><em><strong>&#8220;Why is the same request sometimes 200ms, sometimes 1s, sometimes 2s?&#8221;</strong></em></p></blockquote><p>And now you&#8217;re stuck. 
Because that &#8220;definition&#8221; doesn&#8217;t help you answer anything.</p><p>So&#8230; what actually causes latency?<br>Why is it fast sometimes and slow other times?<br>What exactly is happening inside a request?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!R4Dm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb84c1736-1deb-4265-8840-99fba7e44441_1376x768.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!R4Dm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb84c1736-1deb-4265-8840-99fba7e44441_1376x768.png 424w, https://substackcdn.com/image/fetch/$s_!R4Dm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb84c1736-1deb-4265-8840-99fba7e44441_1376x768.png 848w, https://substackcdn.com/image/fetch/$s_!R4Dm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb84c1736-1deb-4265-8840-99fba7e44441_1376x768.png 1272w, https://substackcdn.com/image/fetch/$s_!R4Dm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb84c1736-1deb-4265-8840-99fba7e44441_1376x768.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!R4Dm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb84c1736-1deb-4265-8840-99fba7e44441_1376x768.png" width="1376" height="768" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b84c1736-1deb-4265-8840-99fba7e44441_1376x768.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1376,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1571034,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://quangchientran.substack.com/i/191889458?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb84c1736-1deb-4265-8840-99fba7e44441_1376x768.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!R4Dm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb84c1736-1deb-4265-8840-99fba7e44441_1376x768.png 424w, https://substackcdn.com/image/fetch/$s_!R4Dm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb84c1736-1deb-4265-8840-99fba7e44441_1376x768.png 848w, https://substackcdn.com/image/fetch/$s_!R4Dm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb84c1736-1deb-4265-8840-99fba7e44441_1376x768.png 1272w, https://substackcdn.com/image/fetch/$s_!R4Dm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb84c1736-1deb-4265-8840-99fba7e44441_1376x768.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"></div></div></a></figure></div><h2><strong>Tips, Tricks, and Fancy Diagrams</strong></h2><p>If you search online for advice about system design or how to optimize a request&#8217;s 
latency, you&#8217;ll find tons of methods &#8212; hundreds of architecture diagrams, and all sorts of fancy technical explanations and buzzwords. I used to do that too and thought:</p><blockquote><p><em><strong>&#8220;Wow, this makes perfect sense, let&#8217;s apply it!&#8221;</strong></em></p></blockquote><p>I also picked up quite a few latency-optimization skills, haha &#8212; if someone asked me about them in an interview, I&#8217;m ready &#128516;. For example:</p><ul><li><p>If too many messages hit the system at once and it gets overloaded, just push them all into a queue (like Kafka or SQS), then process them slowly, one by one &#8212; that way, no message gets lost.</p></li><li><p>When the system&#8217;s overloaded, scale it horizontally &#8212; spin up more instances, pods, or nodes to handle millions of requests. That&#8217;s what autoscaling with load balancers is for, right?</p></li><li><p>Bring the server closer to users with a CDN. With servers all around the world, European users reach the European server, Americans go to the US one &#8212; plus, that also helps reduce load on the main server.</p></li><li><p>Cache data with Redis &#8212; reading from RAM is so much faster than pulling from disk I/O, and it also takes some pressure off the database.</p></li><li><p>For databases, you can look into techniques like replication (to improve read performance), adding indexes, or using partitioning and sharding to make queries faster.</p></li></ul><p>And many more&#8230; And honestly? I agree with all of them.</p><p>But&#8230;</p><p>All those pieces of advice &#8212; to me, they&#8217;re just little tips and tricks. Maybe I remember them today and forget them tomorrow. There&#8217;s just too much going on in life to keep everything in my head.</p><p>I didn&#8217;t really understand the essence or the actual components of latency &#8212; what a request has to go through, what it faces, why it&#8217;s sometimes fast and sometimes slow. 
When it&#8217;s fast, <em>why</em> is it fast? When it&#8217;s slow, <em>why</em> is it slow?</p><p>I realized I understood nothing if I only relied on random tips and guesses.</p><h2><strong>The Formula That Changed Everything</strong></h2><p><strong>Thanks</strong></p><p>I want to thank Quang Hoang for the System Design Handbook he shared recently. After reading it, I learned a lot of new things about latency. Not just tips I&#8217;ll forget later. The most valuable thing for me was this formula:</p><blockquote><p><strong>Latency = Propagation + Queueing + Service</strong></p></blockquote><p>Simple. Clean. Almost too simple. But this thing changed everything for me.</p><p>Latency has 3 components:</p><ul><li><p><strong>Propagation</strong> &#8594; time for the request to travel</p></li><li><p><strong>Queueing</strong> &#8594; time waiting (thread pool, DB connection pool, etc.)</p></li><li><p><strong>Service</strong> &#8594; time the server actually processes</p></li></ul><p>When I look at this formula&#8230; all the &#8220;tips &amp; tricks&#8221; suddenly become easy to understand. Because now:</p><blockquote><p><strong>Optimizing latency = optimizing these 3 things.</strong></p></blockquote><p>Examples:</p><ul><li><p>If the connection between the two sides (TCP handshakes, TLS handshakes, network, etc.) 
takes too long, then you should bring them closer together; if you can merge them or put them next to each other, even better.</p></li><li><p>If the queue is too long, you need to find a way to shorten it by reducing incoming load or increasing processing throughput (scaling servers) so the queue gets smaller.</p></li><li><p>If the request handler itself is too slow, then you&#8217;d better optimize the algorithm, tune the database, use a programming language more suitable for the business problem, or find ways to process things in parallel and asynchronously.</p></li></ul><p>The more of these parts you can optimize, the better.</p><p>And if it&#8217;s still too complicated, then just hide that latency away and immediately return an early acknowledgment like &#8220;accepted, processing&#8221; so the client feels at ease, even though behind the scenes the system is still working its butt off.</p><p>For example, when you create an AMI from an EC2 instance, AWS responds right away that the image creation is in progress, so the user can move on and do other things instead of staring at a loading spinner and waiting for the page to unblock.</p><p>Now I&#8217;ve developed a new habit: whenever I come across some tip or trick to optimize latency, I ask myself which of those three components it actually improves, what the trade-offs are, or whether it helps optimize all of them &#8212; instead of doing what I used to do: reading through long-winded explanations that I probably wouldn&#8217;t remember even for a day.</p><p>Maybe optimizing latency is still a huge topic, with plenty more going on behind the scenes, but for me, this formula already captures a big part of:</p><ul><li><p>The definition of latency</p></li><li><p>The components that make it up</p></li><li><p>The strategies to optimize it</p></li></ul><p>There are already tons of articles online about system design and techniques for optimization (caching, load balancers, scaling, CDNs, 
etc.), so I probably won&#8217;t add to that pile here &#8212; everyone can dig into those on their own and compare them against this formula.</p><h2><strong>Real-world application</strong></h2><p>Staying on the topic of latency, I actually shared a post before about <a href="https://open.substack.com/pub/quangchientran/p/5-building-a-monitoring-platform?r=5zk2y9&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">building a monitoring system</a>, and there were 2 things I mentioned.</p><h3>1. Why p99 Matters More Than Average</h3><p>First, I emphasized the importance of percentile metrics (p50, p95, p99), especially this guy p99. And why is average latency useless here? Because in practice it tells you almost nothing: it does NOT represent the latency that real users actually experience.</p><p>If you tell your boss, <em>&#8220;Our average latency is 200ms,&#8221;</em> you aren&#8217;t telling the truth. You&#8217;re telling a <strong>statistic</strong>.</p><p>The &#8220;Average&#8221; is a mathematical trap. It assumes every user has a similar experience. But in a distributed system, a single request doesn&#8217;t just &#8220;happen&#8221;&#8212;it hits a load balancer, a thread pool, a database, and maybe three external APIs.</p><p>Very roughly:</p><ul><li><p><strong>p50</strong>: This is the experience of your "typical" user. Half are faster, half are slower.</p></li><li><p><strong>p95</strong>: 1 out of every 20 requests is slower than this.</p></li><li><p><strong>p99</strong>: The "1% experience." 1 out of every 100 requests hits this delay.</p></li></ul><p><strong>The Danger of Scale:</strong> If your landing page makes <strong>50 different network calls</strong> to load (images, CSS, tracking, API), the chance that a user hits a <strong>p99 delay</strong> at least once is nearly <strong>40%</strong> (1 &#8722; 0.99<sup>50</sup> &#8776; 0.395).</p><p>Suddenly, the &#8220;1% problem&#8221; becomes a &#8220;40% of my users are annoyed&#8221; problem. 
This is why we optimize for the outliers, not the average.</p><h3>2. A Debug Story From Production</h3><p>Second, a story about an issue I debugged:</p><ul><li><p>Hundreds of third-party payment webhook requests all hit the system at exactly 12 PM, causing database congestion.</p></li><li><p>At peak, the database wanted 20 vCPUs&#8230; while the system only had 2 &#128514;</p></li><li><p>Meanwhile, the server looked perfectly fine &#8212; RAM good, CPU good.</p></li></ul><p>But every day at that time, the whole system became slow like a turtle for ~30 minutes. Everything slowed down. Boss complained. Teammates complained. Customers complained. Reputation and service quality took a hit.</p><p>Now I&#8217;ll analyze it again, but this time using the latency formula above.</p><p>At that time, I don&#8217;t know if I was just too inexperienced, or I read too many tips &amp; tricks and started overthinking. In my head, I thought:</p><blockquote><p><em><strong>&#8220;If too many requests come in and we can&#8217;t handle them, just throw them into a queue and process gradually. Easy. (damn I&#8217;m a genius &#128514;)&#8221;</strong></em></p></blockquote><p>So I jumped in and designed a beautiful architecture with SQS + Lambda + reserved concurrency (this thing ensures a certain number of Lambdas are always available, and also limits how many run in parallel).</p><p>Now all webhook payment requests would be processed gradually. Let&#8217;s see how the database dares to max out CPU again &#128527;</p><p>Well&#8230; life is not a dream. My teammate and I spent two weeks implementing this solution. 
Result?</p><ul><li><p>Nothing improved.</p></li><li><p>System still slow.</p></li><li><p>People still unhappy.</p></li><li><p>And we wasted time.</p></li></ul><p>If I had known the formula earlier, things would&#8217;ve been much simpler instead of chasing fancy stuff.</p><h4><strong>Applying the Formula: Propagation, Queueing, Service</strong></h4><p><strong>Propagation</strong></p><p>This one is hard to optimize. Third-party systems (like payment providers) connect to us &#8212; hard to control. In my case, maybe just vertically scale the database to 20 vCPUs and call it a day =]]]</p><p><strong>Queueing</strong></p><p>This is where requests wait before being processed.</p><ul><li><p>Network/router queues, CPU queues &#8594; too advanced for me =]]].</p></li><li><p>But thread pool queue &amp; connection pool &#8594; these I can control.</p></li></ul><p>So I tuned the default configs in Spring Boot to better fit my system. From now on:</p><ul><li><p>Requests are processed more sequentially.</p></li><li><p>Less fighting, less contention.</p></li><li><p>No more trying to do too many things in parallel while resources are limited.</p></li></ul><p><strong>Service</strong></p><p>Honestly, I don&#8217;t know why I didn&#8217;t think about this earlier, and kept chasing fancy architectures. 
The webhook processing method had MANY issues:</p><ul><li><p>Bad async chain design (if it&#8217;s already a chain, why make it async??)</p></li><li><p>Same request fetching transaction, invoice, payment again and again</p><ul><li><p>&#8594; direct pressure on database</p></li><li><p>&#8594; why not cache it?</p></li><li><p>&#8594; in-memory cache worked perfectly here</p></li></ul></li><li><p>Non-critical tasks (audit, tracking) executed directly</p><ul><li><p>&#8594; more DB overload</p></li><li><p>&#8594; I moved them to the end of the webhook</p></li><li><p>&#8594; maybe later I&#8217;ll push them into a queue, we&#8217;ll see &#128516;</p></li></ul></li><li><p>The database was missing many important indexes AND carrying too many useless ones</p><ul><li><p>I just tracked <code>trace_id</code> in monitoring &#8594; immediately saw which requests were slow</p></li><li><p>&#8594; reran SQL &#8594; found full table scans</p></li><li><p>As for unused indexes, every database has tools for that &#8212; just Google it (right now I forgot already&#8230; classic &#8220;learned via tips &amp; tricks&#8221; &#128514;)</p></li></ul></li></ul><p>And I didn&#8217;t even touch any &#8220;bit-level optimization&#8221; yet.</p><p>&#8594; At this point, the system was already good enough. 
(Know your limit, be happy with what you have &#8212; going deeper is just complex and time-consuming.)</p><p>After applying all these:</p><ul><li><p>No surprise &#8212; the things that should&#8217;ve been done from the beginning worked best.</p></li><li><p>Now the system only needs <strong>~0.5 vCPU</strong> (max was 2 before).</p></li><li><p>Maybe now I should increase concurrency for thread pool and connection pool &#128514;</p></li></ul><h2><strong>Conclusion</strong></h2><p>Understanding latency instead of memorizing tricks helped me:</p><ul><li><p>Think more clearly</p></li><li><p>Debug more effectively</p></li><li><p>Avoid unnecessary complexity</p></li></ul><p>When things are clear (like a formula), remembering tips is no longer the problem.</p><p>The real question becomes:</p><blockquote><p><strong>Does this tip actually solve my problem&#8230; or just make it more complex?</strong></p></blockquote><p><em>(If you enjoy these kinds of engineering stories, you can subscribe to receive the next ones.)</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://quangchientran.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://quangchientran.substack.com/subscribe?"><span>Subscribe now</span></a></p>]]></content:encoded></item><item><title><![CDATA[I Spent 6 Months Overthinking Environment Variables (The Solution Was Simple)]]></title><description><![CDATA[Environment variables are something every application needs today.]]></description><link>https://quangchientran.substack.com/p/environment-variables-my-6-month</link><guid isPermaLink="false">https://quangchientran.substack.com/p/environment-variables-my-6-month</guid><dc:creator><![CDATA[Quang Chien TRAN]]></dc:creator><pubDate>Mon, 09 Mar 2026 22:12:55 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!Ecao!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc5bd867-b620-4f27-9d1f-6c56eae16eff_736x736.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Environment variables are something every application needs today. Big or small, sooner or later everyone has to deal with them.</p><p>People often say: <em>&#8220;This is easy. Just store them in a secure place.&#8221;</em><br>And yes, everyone knows the best practices by heart.</p><p>But for almost <strong>six months</strong>, I was completely lost trying to find the right solution.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ecao!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc5bd867-b620-4f27-9d1f-6c56eae16eff_736x736.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ecao!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc5bd867-b620-4f27-9d1f-6c56eae16eff_736x736.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Ecao!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc5bd867-b620-4f27-9d1f-6c56eae16eff_736x736.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Ecao!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc5bd867-b620-4f27-9d1f-6c56eae16eff_736x736.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Ecao!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc5bd867-b620-4f27-9d1f-6c56eae16eff_736x736.jpeg 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ecao!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc5bd867-b620-4f27-9d1f-6c56eae16eff_736x736.jpeg" width="736" height="736" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cc5bd867-b620-4f27-9d1f-6c56eae16eff_736x736.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:736,&quot;width&quot;:736,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:61845,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://quangchientran.substack.com/i/190362238?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc5bd867-b620-4f27-9d1f-6c56eae16eff_736x736.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Ecao!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc5bd867-b620-4f27-9d1f-6c56eae16eff_736x736.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Ecao!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc5bd867-b620-4f27-9d1f-6c56eae16eff_736x736.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Ecao!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc5bd867-b620-4f27-9d1f-6c56eae16eff_736x736.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!Ecao!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc5bd867-b620-4f27-9d1f-6c56eae16eff_736x736.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h2>My Problem</h2><p>From my previous article, you may know that I deploy microservices using <strong>AWS ECS</strong>.</p><p>Each service has a <strong>task definition</strong>, where environment variables are mapped to <strong>Parameter Store</strong> and <strong>Secrets Manager</strong>, two AWS services designed to store configuration and 
secrets securely.</p><p>For <strong>storing</strong> environment variables, everything works quite well.</p><p>But for <strong>managing them in a clear and simple way</strong>, so that team members can easily <strong>add, remove, or modify variables</strong>, things become much harder.</p><p>To do that properly, you need some knowledge about:</p><ul><li><p>cloud infrastructure</p></li><li><p>IAM permissions</p></li><li><p>rollback responsibilities when something goes wrong</p></li></ul><p>So I had an idea.</p><p>I created a <strong>separate project dedicated to environment variables for all services</strong>, including both normal variables and secrets. The idea was simple: if someone wants to change something, they only need to update it there.</p><p>I also set up <strong>Infrastructure as Code with Terraform</strong> for the project. Deploying variables to Parameter Store and Secrets Manager became very easy.</p><p>But then another problem appeared.</p><div><hr></div><h2>The Headache</h2><p>My team uses <strong>GitHub</strong> (not self-hosted). All repositories are private.</p><p>But as everyone knows:</p><p><strong>Pushing environment variables directly to GitHub is almost like committing suicide.</strong></p><p>Every second, there are automated tools scanning GitHub repositories for exposed secrets&#8212;even private ones.</p><p>So the project stayed on my local machine.</p><p>But that obviously wasn&#8217;t a real solution.<br>What happens if someone else wants to add or modify variables? 
Should they just copy the project from my computer?</p><p>At a company where I previously worked, things were much simpler.</p><p>They hosted their own <strong>GitLab instance</strong>, everything was private, and they simply pushed the environment variable project there and connected it to a pipeline.</p><p>Very straightforward.</p><div><hr></div><h2>The Solution (Not Very Surprising)</h2><p>You might not believe it, but I spent <strong>almost half a year</strong> thinking about this problem.</p><p>I considered many ideas.</p><p>For example:</p><ul><li><p>encrypting environment variables before pushing them to GitHub</p></li><li><p>decrypting them during deployment</p></li></ul><p>But honestly, those solutions were too complex and inefficient.</p><p>My goal is always the same:<br><strong>solutions should be as simple as possible.</strong></p><p>Developers already suffer enough. There&#8217;s no need to make a simple problem complicated&#8212;because once the system becomes complex, maintenance becomes harder and mistakes become more likely.</p><p>And usually, no one understands that logic except the person who designed it.</p><p>The solution I finally used was actually something I had thought about before, but the conditions at the time didn&#8217;t allow it.</p><p><strong>AWS CodeCommit.</strong></p><p>In many ways, CodeCommit works just like GitHub. All standard Git operations work the same way.</p><p>However, when I first considered using it, AWS had temporarily <strong>disabled the creation of new repositories</strong> for CodeCommit (around 2024, if I remember correctly). So that idea had to be abandoned.</p><p>Time passed, and I still hadn&#8217;t found a good solution.</p><p>Then one day, while scrolling Facebook during <strong>AWS re:Invent 2025</strong>, I saw the news that AWS had reopened the ability to create new repositories in CodeCommit.</p><p>So I started implementing the solution immediately.<br><em>(What if they disabled it again? 
Haha.)</em></p><div><hr></div><h2>The Implementation</h2><p>CodeCommit repositories live inside the AWS account, protected by IAM policies and encrypted both in transit and at rest.</p><p>I also set up a <strong>CI/CD pipeline</strong> for this project using:</p><ul><li><p><strong>AWS CodeBuild</strong></p></li><li><p><strong>AWS CodePipeline</strong></p></li></ul><p>And sorry AWS&#8230; but honestly, I&#8217;m not a big fan of these CI/CD services.</p><p>They feel <strong>too complicated</strong> for solving problems that other tools handle much more simply.</p><p>But that&#8217;s okay.</p><p><strong>If it works, it works.</strong> &#128516;</p><div><hr></div><h2>Conclusion</h2><p>Everything now runs the way I wanted.</p><p>But finding this solution definitely took <strong>too much time and sweat</strong>.</p><p>And honestly, the architecture still isn&#8217;t perfect because the source code is now split across different places (<strong>GitHub and CodeCommit</strong>).</p><p>Hopefully there will be better solutions for managing environment variables in the future.</p><p>If you have a better approach, please share it with me. I&#8217;d love to learn. 
&#128578;</p><p><em>(If you enjoy these kinds of engineering stories, you can subscribe to receive the next ones.)</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://quangchientran.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://quangchientran.substack.com/subscribe?"><span>Subscribe now</span></a></p>]]></content:encoded></item><item><title><![CDATA[10 Years Using SQL… and I Finally Learned How Databases Actually Work]]></title><description><![CDATA[I&#8217;m a developer with almost 10 years of experience, and I&#8217;ve spent those same years working with databases.]]></description><link>https://quangchientran.substack.com/p/10-years-as-a-developer-but-only</link><guid isPermaLink="false">https://quangchientran.substack.com/p/10-years-as-a-developer-but-only</guid><dc:creator><![CDATA[Quang Chien TRAN]]></dc:creator><pubDate>Sun, 08 Mar 2026 23:52:40 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!DenP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b6f84c5-c46e-46fe-9cec-b8515f3ba38c_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I&#8217;m a developer with almost 10 years of experience, and I&#8217;ve spent those same years working with databases. Most of my work has been with relational databases (RDBMS) &#8212; from PostgreSQL, Oracle, and SQL Server to MySQL. 
That said, for most of that time, my relationship with databases was still just a &#8220;casual acquaintance&#8221;: a few basic <code>SELECT</code>s, then <code>INSERT</code>, <code>UPDATE</code>, <code>DELETE</code>, without really understanding the deeper nature of the system or the components underneath it.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DenP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b6f84c5-c46e-46fe-9cec-b8515f3ba38c_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DenP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b6f84c5-c46e-46fe-9cec-b8515f3ba38c_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!DenP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b6f84c5-c46e-46fe-9cec-b8515f3ba38c_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!DenP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b6f84c5-c46e-46fe-9cec-b8515f3ba38c_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!DenP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b6f84c5-c46e-46fe-9cec-b8515f3ba38c_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DenP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b6f84c5-c46e-46fe-9cec-b8515f3ba38c_1536x1024.png" width="1456" height="971" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3b6f84c5-c46e-46fe-9cec-b8515f3ba38c_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2918451,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://quangchientran.substack.com/i/190336149?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b6f84c5-c46e-46fe-9cec-b8515f3ba38c_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DenP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b6f84c5-c46e-46fe-9cec-b8515f3ba38c_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!DenP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b6f84c5-c46e-46fe-9cec-b8515f3ba38c_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!DenP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b6f84c5-c46e-46fe-9cec-b8515f3ba38c_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!DenP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b6f84c5-c46e-46fe-9cec-b8515f3ba38c_1536x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2><strong>Familiar advice</strong></h2><p>If you search online for ways to optimize an SQL query, it&#8217;s not hard to find the classic &#8220;<strong>golden rules</strong>&#8221; like these:</p><blockquote><p><em>Don&#8217;t </em><code>SELECT *</code><em>; select fewer columns and the query will be much faster.</em></p></blockquote><blockquote><p><em>If a table has only a few rows, the query will be fast.</em></p></blockquote><blockquote><p><em>If a query has </em><code>ORDER BY</code><em> on a column, add an index on that column.</em></p></blockquote><blockquote><p><em>If a query is slow, think immediately about partitioning or sharding the database.</em></p></blockquote><blockquote><p><em>Changing the order of clauses can improve SQL performance.</em></p></blockquote><blockquote><p><em>A fragmented table is always bad 
for performance, so you need to defragment it.</em></p></blockquote><p>To be honest, I&#8217;m not here to judge whether those tips are right or wrong. But they&#8217;re only the &#8220;surface.&#8221; I usually applied them like a machine: do it, forget it, and a few months later I&#8217;d have to look them up again.</p><p>If that sounds familiar, you&#8217;re not alone: that was how I worked for many years, and I&#8217;m sure a lot of people can relate. It was basically a way of working based on &#8220;tips &amp; tricks&#8221; rather than systems thinking.</p><h2><strong>Then one day</strong></h2><p>Then one day, I watched some YouTube videos about database optimization by Tr&#7847;n Qu&#7889;c Huy, and the sentence that struck me most was this:</p><blockquote><p><strong>In databases, once you understand the execution plan, most of those tips and tricks on the internet become nonsense.</strong></p></blockquote><p>That was when I finally understood why, before that, I always felt a certain fear when working with databases, during interviews, or while handling database-related incidents. It came from the fact that I knew the <strong>syntax</strong>, but not the <strong>mechanism</strong>. I always felt like I had a blind spot, one I hoped nobody would poke at, because I only really knew a few <code>SELECT</code>s.</p><p>When I run a query, what happens underneath? What takes time? Why does it take time? Why does Oracle cost so much when open-source databases like PostgreSQL and MySQL are already so good? Why bother paying for proprietary software? Why does PostgreSQL get fragmented, and should we use <code>VACUUM</code>? Why, when I <code>DELETE</code> a record, doesn&#8217;t the table size shrink? 
Once you understand what is really happening underneath, all of those questions become much clearer.</p><h2><strong>A little fundamentals</strong></h2><p>This is something you can easily search online, but I&#8217;ll repeat it here because it&#8217;s very important. In many database systems, the smallest physical unit of storage and processing is <strong>not an individual row</strong>, but a <strong>page</strong> or <strong>block</strong>. Each page/block is a fixed-size or near-fixed-size chunk of data, usually a few KB to a few tens of KB depending on the system. You can think of it like a sheet of paper, and each row is written on it.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qnol!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febbb6cda-365b-4008-a346-dc552deac4d6_1380x752.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qnol!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febbb6cda-365b-4008-a346-dc552deac4d6_1380x752.png 424w, https://substackcdn.com/image/fetch/$s_!qnol!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febbb6cda-365b-4008-a346-dc552deac4d6_1380x752.png 848w, https://substackcdn.com/image/fetch/$s_!qnol!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febbb6cda-365b-4008-a346-dc552deac4d6_1380x752.png 1272w, https://substackcdn.com/image/fetch/$s_!qnol!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febbb6cda-365b-4008-a346-dc552deac4d6_1380x752.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qnol!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febbb6cda-365b-4008-a346-dc552deac4d6_1380x752.png" width="1380" height="752" class="sizing-normal" alt="" loading="lazy"></picture></div></a></figure></div><p>When the database needs to read a row, it usually works with the entire page/block that contains that row, rather than reading just one line directly from disk. 
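</p><p>You can see this page-level organization directly in PostgreSQL. As a quick sketch (reusing the example table <code>my_table</code> from below; <code>relpages</code>/<code>reltuples</code> are planner estimates refreshed by <code>ANALYZE</code>):</p><pre class="shiki"><code class="language-sql">-- Page ("block") size, 8192 bytes by default
SHOW block_size;

-- How many pages the table occupies, and the estimated row count
SELECT relpages, reltuples FROM pg_class WHERE relname = 'my_table';</code></pre><p>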
So query performance depends <strong>not only on the number of rows</strong>, but also on the <strong>number of pages/blocks that must be scanned</strong>, how the data is organized, and whether the data can be served from cache or must be loaded from disk into RAM first.</p><p>A query can be fast or slow largely because of its <strong>execution plan</strong> &#8212; that is, the way the database chooses to execute the statement. Different database systems use different algorithms to generate execution plans. The same SQL can be slow if the system chooses a full table scan, but much faster if it uses an index properly.</p><p>An <strong>index</strong> is usually a separate data structure, often a B-tree, that helps speed up access by storing sorted lookup values together with a <strong>pointer to the location of the page/block containing the data</strong>. In simple terms, an index is like the table of contents in a book: it doesn&#8217;t contain the full content, but it helps you find the right page/block much faster.</p><p>An index is not perfect, and it has a downside: the more indexes you add, the slower write operations (<code>INSERT</code>, <code>DELETE</code>, <code>UPDATE</code>) become, because the database has to update the corresponding index trees as well.</p><p>In PostgreSQL, if you run the following command, it will show the physical location of the row version within its table:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;sql&quot;,&quot;nodeId&quot;:&quot;65f03be2-4794-4b4a-8f27-8bcd2d684280&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-sql">SELECT ctid, * FROM my_table;</code></pre></div><p>In practice, databases often have caching mechanisms to <strong>avoid re-analyzing everything from scratch</strong> if a query or execution plan has already been used before. 
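</p><p>In PostgreSQL, the clearest example is a prepared statement: the statement is parsed and planned once, and later executions can reuse a cached plan. A minimal sketch, assuming a hypothetical <code>users</code> table:</p><pre class="shiki"><code class="language-sql">-- Parse and plan once
PREPARE get_user (int) AS
  SELECT * FROM users WHERE id = $1;

-- Later executions can reuse the cached plan
EXECUTE get_user(123);
EXECUTE get_user(456);</code></pre><p>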
However, an old plan is not always the best one for every run, because the data and query conditions can change.</p><p>A query execution flow can be summarized like this:</p><ol><li><p>Check the SQL syntax.</p></li><li><p>Check whether the table names, column names, and constraints are valid.</p></li><li><p><strong>Check whether the execution plan already exists in cache.</strong></p><ul><li><p><strong>If yes, reuse it and move to step 4.</strong></p></li><li><p><strong>If not, analyze all possible execution plans.</strong></p></li><li><p><strong>Build the detailed execution steps.</strong></p></li></ul></li><li><p><strong>Execute the SQL statement.</strong></p></li><li><p>Return the result.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ifIf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ad6c1a9-c46e-46be-bf9a-b456ecfaf47b_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ifIf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ad6c1a9-c46e-46be-bf9a-b456ecfaf47b_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!ifIf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ad6c1a9-c46e-46be-bf9a-b456ecfaf47b_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!ifIf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ad6c1a9-c46e-46be-bf9a-b456ecfaf47b_1536x1024.png 1272w, 
https://substackcdn.com/image/fetch/$s_!ifIf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ad6c1a9-c46e-46be-bf9a-b456ecfaf47b_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ifIf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ad6c1a9-c46e-46be-bf9a-b456ecfaf47b_1536x1024.png" width="1456" height="971" class="sizing-normal" alt="" loading="lazy"></picture></div></a></figure></div><p>Among these steps, steps 3 and 4 are often the most time-consuming. 
In PostgreSQL, you can prefix the query with the following command to inspect the execution plan, including the detailed steps and cost:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;sql&quot;,&quot;nodeId&quot;:&quot;fa238445-8909-47c4-84c0-2fe5b8366e45&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-sql">-- Prepend to the query you want to profile:
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM my_table WHERE id = 123;</code></pre></div><p>This command will show you:</p><ul><li><p><strong>Cost</strong>: the planner&#8217;s estimated cost.</p></li><li><p><strong>Actual time</strong>: the real execution time.</p></li><li><p><strong>Buffers / Shared hit / Shared read</strong>: how many pages were served from cache and how many had to be read from disk.</p></li></ul><p>If you want to optimize a query, you should:</p><ul><li><p>look at the execution plan,</p></li><li><p>check how many pages/blocks are being read,</p></li><li><p>see whether the indexes are reasonable,</p></li><li><p>and run the query for real instead of guessing whether it is slow or fast.</p></li></ul><h2><strong>A bit deeper into bloat in PostgreSQL</strong></h2><h4>MVCC mechanism</h4><p>PostgreSQL uses <strong>MVCC (Multi-Version Concurrency Control)</strong> when handling <code>UPDATE</code> or <code>DELETE</code>, allowing multiple transactions to proceed concurrently without forcing readers and writers to wait for each other in most cases. Instead of modifying a row in place, when an <code>UPDATE</code> happens, PostgreSQL usually <strong>creates a new version of the row</strong> and <strong>keeps the old version for transactions that can still see it</strong>.</p><p>Thanks to that, each transaction sees a consistent snapshot of the data at the right point in time, so &#8220;<strong>reads are usually not blocked by writes</strong>&#8221;, and &#8220;<strong>writes are usually not blocked by reads</strong>&#8221;. 
This is one of the reasons PostgreSQL handles <strong>concurrent workloads</strong> quite well.</p><p>That said, saying &#8220;<strong>the reader never blocks the writer</strong>&#8221; and &#8220;<strong>the writer never blocks the reader</strong>&#8221; is too absolute. In reality, PostgreSQL still uses locking in some cases, such as conflicting updates on the same row, schema locks during operations like <code>VACUUM FULL</code>, or other special operations.</p><p>When you update a row, for example the user row with id <code>123</code>:</p><ul><li><p>the old row <code>123</code> is copied and a new row version is created in another page at another location with the updated data,</p></li><li><p>the old row is marked as &#8220;<strong>expired</strong>&#8221;, and <em><strong>new transactions will no longer see it after the update is committed, but the database does not delete it immediately,</strong></em></p></li><li><p>as a result, on disk there can still be two row versions for the same ID <code>123</code>; one of them is simply marked as expired.</p></li></ul><p>When you run <code>SELECT</code>, the execution plan still has to deal with those pages containing dead rows, which can make things slower than necessary.</p><h4>Why deleting doesn&#8217;t free disk space immediately</h4><p>When <code>DELETE</code> happens, it works in a similar way: the row is not physically removed right away, but marked as <strong>dead</strong>. It&#8217;s like a house being labeled with a warning sign saying &#8220;<strong>do not live here, risk of collapse</strong>&#8221; while the house still remains in place and the land is not cleared for a new house.</p><p>This makes the table grow larger over time. 
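</p><p>You can watch this happening in PostgreSQL: every row carries hidden system columns recording which transactions created and expired that version, and the statistics views count dead row versions per table. A quick sketch, again for the example table <code>my_table</code>:</p><pre class="shiki"><code class="language-sql">-- xmin: transaction that created this row version; xmax: the one that expired it (0 if live)
SELECT ctid, xmin, xmax, * FROM my_table;

-- Live vs. dead row versions, and when autovacuum last cleaned up
SELECT n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'my_table';</code></pre><p>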
The table may be 10 GB in size, while the live data you actually work with is only 2 GB, and the remaining 8 GB are rows marked as expired or dead, not doing anything useful.</p><h4>What VACUUM is in PostgreSQL</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JnyF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a1d7c3a-d1ad-49d7-a857-f8987f34177d_1482x972.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JnyF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a1d7c3a-d1ad-49d7-a857-f8987f34177d_1482x972.png 424w, https://substackcdn.com/image/fetch/$s_!JnyF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a1d7c3a-d1ad-49d7-a857-f8987f34177d_1482x972.png 848w, https://substackcdn.com/image/fetch/$s_!JnyF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a1d7c3a-d1ad-49d7-a857-f8987f34177d_1482x972.png 1272w, https://substackcdn.com/image/fetch/$s_!JnyF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a1d7c3a-d1ad-49d7-a857-f8987f34177d_1482x972.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JnyF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a1d7c3a-d1ad-49d7-a857-f8987f34177d_1482x972.png" width="1456" height="955" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4a1d7c3a-d1ad-49d7-a857-f8987f34177d_1482x972.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:955,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:952825,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://quangchientran.substack.com/i/190336149?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a1d7c3a-d1ad-49d7-a857-f8987f34177d_1482x972.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JnyF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a1d7c3a-d1ad-49d7-a857-f8987f34177d_1482x972.png 424w, https://substackcdn.com/image/fetch/$s_!JnyF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a1d7c3a-d1ad-49d7-a857-f8987f34177d_1482x972.png 848w, https://substackcdn.com/image/fetch/$s_!JnyF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a1d7c3a-d1ad-49d7-a857-f8987f34177d_1482x972.png 1272w, https://substackcdn.com/image/fetch/$s_!JnyF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a1d7c3a-d1ad-49d7-a857-f8987f34177d_1482x972.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><code>VACUUM</code> is a command used to deal with table bloat in PostgreSQL. There are two approaches to it:</p><ul><li><p><strong>Standard VACUUM</strong>: PostgreSQL usually handles this automatically through autovacuum, but you can also adjust settings or run it manually if needed. <strong>Its job is to scan dead tuples and mark those positions as available for future writes</strong>. It also cleans dead tuple references in the index and updates related maps. The next time you <code>INSERT</code>, PostgreSQL can place data there instead of using new locations or adding new pages that would make the table bloat further. 
Note that <strong>standard </strong><code>VACUUM</code><strong> does not return disk space to the operating system</strong>; it only marks space for future PostgreSQL use.</p></li><li><p><strong>VACUUM FULL</strong>: With this command, PostgreSQL creates a new, smaller, more compact version of the table. It can <strong>return disk space to the operating system</strong>, but it also takes an exclusive lock on the table, so the table cannot be read or written while <code>VACUUM FULL</code> is running.</p></li></ul><h4>When should you use VACUUM FULL?</h4><p>Because it locks the table, <code>VACUUM FULL</code> should only be used in a few critical situations, such as:</p><ul><li><p>when you have deleted around 60&#8211;90% of a table and want to reclaim disk space and shrink the table size,</p></li><li><p>when a table has become excessively bloated, for example 100 GB in size while the actual data is only 10 GB,</p></li><li><p>when the table is mostly read-only or used for reporting and no longer changes much, so removing bloat can improve query speed.</p></li></ul><p>You should not use it on tables that are heavily read, written, or updated, because it will block the table for some amount of time, which may freeze the application and hurt performance.</p><h4>What happens if 100 transactions try to update the same row at the same time?</h4><p>Of course, there is no way for 100 transactions to all &#8220;<strong>successfully</strong>&#8221; update the same row by freely stacking on top of each other. What usually happens is a <strong>write-write conflict</strong>: the first transaction to update <strong>creates a new version of the row (tuple)</strong>, and <strong>the other transactions have to wait</strong>. 
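</p><p>You can reproduce this waiting yourself with an explicit row lock. A minimal sketch, assuming a hypothetical <code>accounts</code> table: run it in two sessions, and the second session blocks on the same row until the first commits:</p><pre class="shiki"><code class="language-sql">BEGIN;
-- Lock the row; a concurrent UPDATE or FOR UPDATE on id = 1 now waits
SELECT balance FROM accounts WHERE id = 1 FOR UPDATE;
UPDATE accounts SET balance = balance - 10 WHERE id = 1;
COMMIT;  -- waiting transactions proceed against the new row version</code></pre><p>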
When it&#8217;s their turn, they either read the new row again to <strong>re-check</strong> the condition, or they get aborted/fail depending on the isolation level and how the database is implemented.</p><p>Let&#8217;s say 100 transactions all want to &#8220;<strong>update</strong>&#8221; one row:</p><ul><li><p>The first transaction that wins the right to update will create a <strong>new version of the row (tuple)</strong>.</p></li><li><p>The other transactions do <strong>not overwrite that version directly</strong>. They will be <strong>serialized in execution order</strong>, or they have to wait until the transaction currently updating the row finishes before they can continue.</p></li><li><p>If two transactions update the exact same row at the same time, the later transaction often has to <strong>re-read the new row and run the update condition again</strong>. If it no longer matches, it may no longer update anything, or it may fail with a serialization error depending on the isolation level.</p></li></ul><p><strong>Which version is kept?</strong></p><ul><li><p>The newest committed version is the version that is valid for new transactions.</p></li><li><p>Older versions only remain to serve running transactions or old snapshots.</p></li><li><p>After that, they will be cleaned up by garbage collection / vacuum / purge depending on the system.</p></li></ul><h2><strong>What about the mechanism in other types of databases?</strong></h2><p>In general, many databases that do not use PostgreSQL-style MVCC will usually handle <code>UPDATE</code> and <code>DELETE</code> by <strong>modifying data in place</strong> or by <strong>locking rows/pages/tables</strong> while the change is happening.</p><h3>Common mechanism</h3><h4>In-place update</h4><p>Simpler systems will let <code>UPDATE</code> overwrite the existing row directly, instead of creating a new version like MVCC. 
This saves space, but it usually needs locks to prevent multiple transactions from modifying the same data at the same time and corrupting it.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vqeK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6b122dd-3113-4c1e-b8e3-c6afcc19ada3_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vqeK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6b122dd-3113-4c1e-b8e3-c6afcc19ada3_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!vqeK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6b122dd-3113-4c1e-b8e3-c6afcc19ada3_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!vqeK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6b122dd-3113-4c1e-b8e3-c6afcc19ada3_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!vqeK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6b122dd-3113-4c1e-b8e3-c6afcc19ada3_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vqeK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6b122dd-3113-4c1e-b8e3-c6afcc19ada3_1536x1024.png" width="1456" height="971" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e6b122dd-3113-4c1e-b8e3-c6afcc19ada3_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2737375,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://quangchientran.substack.com/i/190336149?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6b122dd-3113-4c1e-b8e3-c6afcc19ada3_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vqeK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6b122dd-3113-4c1e-b8e3-c6afcc19ada3_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!vqeK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6b122dd-3113-4c1e-b8e3-c6afcc19ada3_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!vqeK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6b122dd-3113-4c1e-b8e3-c6afcc19ada3_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!vqeK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6b122dd-3113-4c1e-b8e3-c6afcc19ada3_1536x1024.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>When a transaction wants to <code>UPDATE</code> a row, the database will:</p><ul><li><p><strong>take a lock on the row, page</strong>, or sometimes the whole table depending on the engine and isolation level,</p></li><li><p>make sure other transactions do not read/write that part of the data in a way that breaks consistency,</p></li><li><p>then write the new value into the existing data location, or update the corresponding internal storage structure.</p></li></ul><h4>Real delete</h4><p>With <code>DELETE</code>, the system may remove the row from the storage structure immediately, or mark it first and clean it up later depending on the engine. 
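</p><p>The &#8220;mark it first, clean it up later&#8221; idea can be sketched as a toy model in Python (this is an illustration only, not how any specific engine lays out its storage):</p>

```python
# Toy storage model: DELETE marks a tombstone; a later vacuum pass
# physically reclaims the space.
class Table:
    def __init__(self, rows):
        self.rows = [{"data": r, "deleted": False} for r in rows]

    def delete(self, predicate):
        # Logical delete: just flip a flag; the bytes stay on "disk".
        for row in self.rows:
            if not row["deleted"] and predicate(row["data"]):
                row["deleted"] = True

    def scan(self):
        # Readers skip tombstoned rows.
        return [r["data"] for r in self.rows if not r["deleted"]]

    def vacuum(self):
        # Physical cleanup: compact the storage, dropping tombstones.
        self.rows = [r for r in self.rows if not r["deleted"]]

t = Table([1, 2, 3, 4])
t.delete(lambda x: x % 2 == 0)
print(t.scan())     # [1, 3]  (logically gone already)
print(len(t.rows))  # 4       (still physically present)
t.vacuum()
print(len(t.rows))  # 2       (space reclaimed)
```

<p>Real engines attach much more bookkeeping to the tombstone, but the two-phase shape of a logical delete followed by physical cleanup is the same.</p><p>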
The common point is that this mechanism usually depends much more on locking and physical cleanup inside the storage engine.</p><h4>Locking-centric concurrency</h4><p>In locking-oriented systems, readers or writers may have to wait for each other more often, depending on the isolation level and the lock type. This makes the model easier to understand, but at the same time it makes concurrency less &#8220;smooth&#8221; when the workload has many mixed reads and writes.</p><h3>Advantages</h3><ul><li><p>Easier to understand and implement than MVCC in many cases.</p></li><li><p>Less need to keep multiple versions of the same row, so storage overhead is lower.</p></li><li><p><code>UPDATE</code> and <code>DELETE</code> can be physically simpler if the system supports direct overwrite well.</p></li></ul><h3>Disadvantages</h3><ul><li><p>Concurrency is usually worse than MVCC, because readers and writers block each other more easily.</p></li><li><p><strong>Bottlenecks</strong> appear more easily when many transactions touch the same area of data.</p></li><li><p>If the system has to lock too much, latency can increase under high load.</p></li></ul><h3>What happens if a transaction locks a row/page and forgets to commit or rollback?</h3><p>Then that lock is usually held until the session ends or the connection is closed/killed. 
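</p><p>Because a forgotten transaction can hold its lock indefinitely, applications usually bound how long they are willing to wait. A sketch of the idea using sqlite3&#8217;s busy timeout (other systems expose the same knob under names like <code>lock_timeout</code> in PostgreSQL or <code>innodb_lock_wait_timeout</code> in MySQL):</p>

```python
import os
import sqlite3
import tempfile
import time

path = os.path.join(tempfile.mkdtemp(), "demo.db")

# This session "forgets" to COMMIT and keeps its write lock.
forgetful = sqlite3.connect(path, isolation_level=None)
forgetful.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, v INTEGER)")
forgetful.execute("INSERT INTO t VALUES (1, 0)")
forgetful.execute("BEGIN IMMEDIATE")
forgetful.execute("UPDATE t SET v = 1 WHERE id = 1")

# The application session bounds its wait to 0.2 s instead of hanging.
app = sqlite3.connect(path, timeout=0.2)
start = time.monotonic()
try:
    app.execute("UPDATE t SET v = 2 WHERE id = 1")
    timed_out = False
except sqlite3.OperationalError:  # "database is locked"
    timed_out = True
waited = time.monotonic() - start

print(timed_out)     # True: we failed fast with a clear error
print(waited < 2.0)  # True: bounded by our timeout, not by the other session
```

<p>Failing fast turns a mysterious freeze into an explicit error the application can log and retry.</p><p>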
During that time, other transactions that want to touch the same resource may have to wait, and if they wait too long, the application looks like it is &#8220;frozen.&#8221;</p><p>The consequences are:</p><ul><li><p>The transaction that has not ended keeps control over the resource it is modifying, so other transactions may get <strong>blocked</strong>.</p></li><li><p>It can cause deadlocks or timeouts: if many transactions are waiting on each other, the database system may have to choose one transaction to kill.</p></li><li><p>It can pollute the application state: the app reports what looks like a &#8220;<strong>data error</strong>&#8221; when in reality <strong>the lock has simply not been released</strong> yet.</p></li></ul><p>A transaction left hanging like this is only resolved when the connection is closed, the process is killed, or the DBMS detects that the session is dead; at that point the transaction is rolled back and the lock is released.</p><h2><strong>I&#8217;m doing better now</strong></h2><p>Anything that starts from the fundamentals is always wonderful. It helps me understand what I really want, what I&#8217;m doing, and why things work the way they do. Ten years of experience does not mean you know everything. Sometimes, taking a step back to learn the most basic things again is the fastest way to move forward. 
Databases are fascinating &#8212; don&#8217;t let them become a blind spot in your career :D</p><p><em>(If you enjoy these kinds of engineering stories, you can subscribe to receive the next ones.)</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://quangchientran.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://quangchientran.substack.com/subscribe?"><span>Subscribe now</span></a></p><p></p>]]></content:encoded></item><item><title><![CDATA[Beyond the Bill: How I Matured the Cloud Infrastructure I Manage]]></title><description><![CDATA[In the previous article, I shared how I optimized the cloud bill. But in reality, there were still many things that needed improvement. Engineering is a journey, right?]]></description><link>https://quangchientran.substack.com/p/4-how-i-optimized-our-cloud-git-workflow</link><guid isPermaLink="false">https://quangchientran.substack.com/p/4-how-i-optimized-our-cloud-git-workflow</guid><dc:creator><![CDATA[Quang Chien TRAN]]></dc:creator><pubDate>Fri, 06 Mar 2026 19:02:33 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!62gP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95a8381a-6067-4045-ad0d-bc1a10414ad0_1376x768.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the previous article, I shared how <a href="https://open.substack.com/pub/quangchientran/p/3-how-i-reduced-aws-costs-by-50?utm_campaign=post-expanded-share&amp;utm_medium=web">I optimized the cloud bill</a>. But in reality, there were still many things that needed improvement. 
Engineering is a journey, right?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!62gP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95a8381a-6067-4045-ad0d-bc1a10414ad0_1376x768.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!62gP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95a8381a-6067-4045-ad0d-bc1a10414ad0_1376x768.png 424w, https://substackcdn.com/image/fetch/$s_!62gP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95a8381a-6067-4045-ad0d-bc1a10414ad0_1376x768.png 848w, https://substackcdn.com/image/fetch/$s_!62gP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95a8381a-6067-4045-ad0d-bc1a10414ad0_1376x768.png 1272w, https://substackcdn.com/image/fetch/$s_!62gP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95a8381a-6067-4045-ad0d-bc1a10414ad0_1376x768.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!62gP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95a8381a-6067-4045-ad0d-bc1a10414ad0_1376x768.png" width="1376" height="768" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/95a8381a-6067-4045-ad0d-bc1a10414ad0_1376x768.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1376,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1767144,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://quangchientran.substack.com/i/190135009?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95a8381a-6067-4045-ad0d-bc1a10414ad0_1376x768.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!62gP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95a8381a-6067-4045-ad0d-bc1a10414ad0_1376x768.png 424w, https://substackcdn.com/image/fetch/$s_!62gP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95a8381a-6067-4045-ad0d-bc1a10414ad0_1376x768.png 848w, https://substackcdn.com/image/fetch/$s_!62gP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95a8381a-6067-4045-ad0d-bc1a10414ad0_1376x768.png 1272w, https://substackcdn.com/image/fetch/$s_!62gP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95a8381a-6067-4045-ad0d-bc1a10414ad0_1376x768.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><div><hr></div><h2>Pipeline to Deploy Code</h2><p>The setup I work with uses Jenkins to run pipelines. Of course, I have worked with GitLab CI/CD before, but for me, they are all just tools. Argo CD, GitHub Actions, or AWS CodePipeline &#8212; in the end, they are simply tools to build and deploy code.</p><p>At the beginning, the pipeline I managed only had a build stage for backend projects. Deployments were still done mostly by hand. So I added new steps to automatically deploy to staging and production. Now everything is fully automated after a simple <code>git push</code>. <strong>Pretty nice</strong> &#128516;. 
My job now is simply to change the code and push it.</p><div><hr></div><h2>Working with Git</h2><p>I have to admit something: Even though I had used Git for a long time, I never really had an <strong>effective workflow</strong>. In a team with many developers, things could easily become messy. Every time we wanted to test a new feature, we had to ask:</p><ul><li><p><em>Can we deploy now?</em></p></li><li><p><em>Will it overwrite someone else&#8217;s code?</em></p></li><li><p><em>Is another feature still being tested?</em></p></li><li><p><em>Is it my turn to test?</em></p></li></ul><p>The main problem was simple: <strong>we only had one dev environment.</strong></p><p>At that time, my understanding of an effective git workflow was still very vague.</p><p>Then one day, while walking home from work along the river&#8212;without looking at my phone, just enjoying the fresh air and thinking randomly about git workflows&#8230; suddenly an idea came to me. Haha.</p><p>Later I found a blog post online that described almost exactly the same workflow. It wasn&#8217;t a new idea &#8212; I just hadn&#8217;t been exposed to it before! In the end, everything is about what works best for your team.</p><p><strong>My principles are very simple:</strong></p><ul><li><p>The <strong>master branch</strong> is always the most stable and correct version of the system.<br>All new features and bug fixes must start from this branch.</p></li><li><p>The <strong>develop branch</strong> is used for testing.<br>Feature branches and bug-fix branches must merge into <strong>develop</strong> to be deployed to the testing environment.</p></li><li><p>After testing is successful on <strong>develop</strong>, the feature branch can then be merged into <strong>master</strong> and deployed to <strong>production</strong>.</p></li></ul><p>With this workflow, the team has been working very smoothly so far. If problems appear later&#8230; maybe I will just go walk along the river again to think about it. 
&#128516;</p><div><hr></div><h2>Infrastructure (Infrastructure as Code)</h2><p>The benefits of <strong>Infrastructure as Code</strong> are well known, so I probably don&#8217;t need to explain them much. You can easily find plenty of information online or just ask AI.</p><p>After finishing the deployment of the backend infrastructure on <strong>Amazon Web Services</strong>, the next thing I did was write <strong>IaC</strong> for all the projects I worked on.</p><p>I use <strong>OpenTofu</strong> (a fork of Terraform). The idea was simple: One day, if I am no longer managing this system, at least the engineers joining will have something to help them understand what was built. And if something goes wrong, they can quickly rebuild it.</p><blockquote><p><strong>Side note:</strong> Sorry AWS, but I&#8217;m not a big fan of AWS CloudFormation. Terraform code just looks much nicer to me &#128516;.</p></blockquote><div><hr></div><h2>Monitoring</h2><p>After one year of managing this infra, the thing I&#8217;m most proud of is building a monitoring platform. At my previous job, I worked with Datadog, but my understanding was basic. I mostly just used it to read logs.</p><p>When I started my current role, I was surprised to see no monitoring platform at all. If developers wanted to read logs, they had to SSH into the server. At that moment, I felt monitoring was absolutely necessary. Without it, debugging production systems feels like <strong>fighting enemies with bare hands.</strong></p><div><hr></div><h2>Backup</h2><p>Do you know what the most valuable asset of a company is? For me, it&#8217;s <strong>data</strong>. If the data disappears, the company may disappear too.</p><p>I heard a story about a company that lost all its data after hackers gained root access and deleted everything. They asked AWS for help, but it wasn&#8217;t possible. The company shut down. That made me think. </p><p>I use <strong>AWS Backup</strong> now. 
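</p><p>Concretely, the property I rely on is <strong>Vault Lock</strong>. A hedged sketch of enabling it with boto3 (the vault name and retention numbers below are made up for illustration):</p>

```python
# Sketch: enable AWS Backup Vault Lock on an existing vault.
# The vault name and retention values are illustrative only.

def vault_lock_params(vault_name: str,
                      min_retention_days: int,
                      max_retention_days: int,
                      changeable_for_days: int) -> dict:
    """Build kwargs for backup.put_backup_vault_lock_configuration."""
    return {
        "BackupVaultName": vault_name,
        "MinRetentionDays": min_retention_days,    # recovery points can't be deleted earlier
        "MaxRetentionDays": max_retention_days,    # ...or kept longer than this
        "ChangeableForDays": changeable_for_days,  # grace period before the lock is immutable
    }

params = vault_lock_params("prod-backup-vault", 30, 365, 3)
print(sorted(params))

# Wiring it up would look like this (requires AWS credentials, so not run here):
# import boto3
# boto3.client("backup").put_backup_vault_lock_configuration(**params)
```

<p>Once the <code>ChangeableForDays</code> grace period has passed, the lock is in compliance mode and can no longer be loosened or removed, by anyone.</p><p>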
With this service, backups are protected and cannot easily be deleted &#8212; even with root access. At least, within my current understanding, this feels safer. </p><p>Unless the entire AWS infrastructure collapses&#8230; which hopefully is very unlikely.</p><div><hr></div><h2>Rethinking Cron Jobs</h2><p>Almost every company needs to process large datasets periodically. Originally, we used the Spring Boot <code>@Scheduled</code> annotation. It worked, but had two problems:</p><ol><li><p>Debugging was difficult.</p></li><li><p><strong>Horizontal scaling:</strong> If multiple servers run at once, the same job might execute twice!</p></li></ol><p>My solution was simple. I moved the scheduling to <strong>AWS Lambda + Amazon EventBridge</strong>. EventBridge triggers the Lambda, which then calls an HTTP API endpoint in our service. </p><p>Everything became much easier to manage.</p><div><hr></div><h2>Conclusion</h2><p>I&#8217;m always thinking about ways to improve the systems I work on. Maybe some of these solutions look simple to others. But for me, every time I find a solution, it brings a small sense of joy. And honestly, I&#8217;m always a little proud of that. 
&#128521;</p><p><em>(And yes, I still take walks by the river to brainstorm!)</em> &#128522;</p><p><em>(If you enjoy these kinds of engineering stories, you can subscribe to receive the next ones.)</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://quangchientran.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://quangchientran.substack.com/subscribe?"><span>Subscribe now</span></a></p>]]></content:encoded></item><item><title><![CDATA[How I Reduced the Cloud Bill by 40%: 5 Essential Mindsets]]></title><description><![CDATA[How I Slashed Our AWS Bill by 40%: A Practical Guide to Infrastructure Optimization]]></description><link>https://quangchientran.substack.com/p/3-how-i-reduced-aws-costs-by-50</link><guid isPermaLink="false">https://quangchientran.substack.com/p/3-how-i-reduced-aws-costs-by-50</guid><dc:creator><![CDATA[Quang Chien TRAN]]></dc:creator><pubDate>Fri, 06 Mar 2026 16:14:37 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!c_co!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5bd6307-6a48-4ea8-8e21-cea4713beea5_1376x768.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Actually, there have been tons of posts about how to optimize AWS costs. I&#8217;ve read them, analyzed them, and applied what makes sense for the cloud infrastructure I manage.</p><p>At some point, your cloud infra is stable, services run smoothly, no errors, the team is happy&#8230; but then at the end of the month, you look at the bill and&#8212;well&#8230; why are you paying a few thousand dollars for just a handful of services?</p><p>Whether it&#8217;s a big company or a small one, optimizing costs for any expense always needs careful consideration. Cloud cost is no exception. 
If you control it well, cloud is an amazing tool. If not&#8230; your wallet slowly bleeds every month and you don&#8217;t even know why.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!c_co!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5bd6307-6a48-4ea8-8e21-cea4713beea5_1376x768.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!c_co!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5bd6307-6a48-4ea8-8e21-cea4713beea5_1376x768.png 424w, https://substackcdn.com/image/fetch/$s_!c_co!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5bd6307-6a48-4ea8-8e21-cea4713beea5_1376x768.png 848w, https://substackcdn.com/image/fetch/$s_!c_co!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5bd6307-6a48-4ea8-8e21-cea4713beea5_1376x768.png 1272w, https://substackcdn.com/image/fetch/$s_!c_co!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5bd6307-6a48-4ea8-8e21-cea4713beea5_1376x768.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!c_co!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5bd6307-6a48-4ea8-8e21-cea4713beea5_1376x768.png" width="1376" height="768" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d5bd6307-6a48-4ea8-8e21-cea4713beea5_1376x768.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1376,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;AWS cost optimization cover art&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="AWS cost optimization cover art" title="AWS cost optimization cover art" srcset="https://substackcdn.com/image/fetch/$s_!c_co!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5bd6307-6a48-4ea8-8e21-cea4713beea5_1376x768.png 424w, https://substackcdn.com/image/fetch/$s_!c_co!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5bd6307-6a48-4ea8-8e21-cea4713beea5_1376x768.png 848w, https://substackcdn.com/image/fetch/$s_!c_co!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5bd6307-6a48-4ea8-8e21-cea4713beea5_1376x768.png 1272w, https://substackcdn.com/image/fetch/$s_!c_co!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5bd6307-6a48-4ea8-8e21-cea4713beea5_1376x768.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h2><strong>How I optimize my infrastructure</strong></h2><h2><strong>Clean up the garbage</strong></h2><p>Yep, you read that right. In almost every infrastructure, if you don&#8217;t clean regularly, there will be a bunch of unused stuff still sitting around&#8212;and you&#8217;re paying for things that bring zero value.</p><p><strong>My approach</strong>:</p><ul><li><p><strong>S3 buckets:</strong> Delete buckets or objects from non-production environments that are no longer in use.</p></li><li><p><strong>ECR (Docker images):</strong> Implement lifecycle policies to prune old versions. 
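</p><p>A sketch of such a policy built in Python (the repository name in the comment is a placeholder; the JSON shape is the one <code>put_lifecycle_policy</code> expects):</p>

```python
import json

def keep_last_n_images_policy(n: int) -> str:
    """ECR lifecycle policy JSON: expire everything beyond the newest n images."""
    policy = {
        "rules": [
            {
                "rulePriority": 1,
                "description": f"Keep only the {n} most recent images",
                "selection": {
                    "tagStatus": "any",
                    "countType": "imageCountMoreThan",
                    "countNumber": n,
                },
                "action": {"type": "expire"},
            }
        ]
    }
    return json.dumps(policy)

policy_text = keep_last_n_images_policy(10)
print(policy_text)

# Attaching it would look like this (needs AWS credentials, so not run here):
# import boto3
# boto3.client("ecr").put_lifecycle_policy(
#     repositoryName="my-backend",  # placeholder name
#     lifecyclePolicyText=policy_text,
# )
```

<p>Once attached, ECR evaluates the rule automatically, so no cleanup Lambda is needed for this one.</p><p>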
By keeping only the latest 10&#8211;12 images, I stopped storage costs from ballooning and cut &#8220;zombie&#8221; storage waste by over 90%.</p></li><li><p><strong>Networking &amp; Storage:</strong> Periodically audit for Elastic IPs and EBS volumes that exist but are not attached to any instance.</p></li></ul><h2><strong>If you don&#8217;t use it, turn it off</strong></h2><p>Simple logic: you work 8 hours a day. After that, you don&#8217;t use the system&#8212;but if services are still running, you&#8217;re still paying :D</p><p>For example, non-production environments can be <strong>stopped from 8 PM to 7 AM</strong>.</p><ul><li><p>ECS &#8594; set desired count = 0</p></li><li><p>RDS &#8594; stop at night, start in the morning</p></li><li><p>EC2 &#8594; stop/start or scale auto scaling down to 0</p></li></ul><p>All of this is super easy. Just combine Lambda + EventBridge to schedule it. Fully automated, no need to click manually.</p><h2><strong>Use less, pay less</strong></h2><p>This one is obvious for S3. Setting lifecycle policies can save you quite a bit.</p><ul><li><p><strong>Hot data</strong> &#8594; keep in standard storage</p></li><li><p><strong>Cold data</strong> / logs backup &#8594; move to Glacier</p></li></ul><p>You can also set lifecycle rules for ECR to auto-delete old images instead of doing it manually.</p><h2><strong>Ask: "Can this be optimized further?"</strong></h2><p>This mindset applies to almost everything I do, not just AWS.</p><p>I&#8217;m a backend dev. 
Sometimes I finish a task but still feel the code isn&#8217;t clean enough, naming isn&#8217;t right, or it doesn&#8217;t follow SOLID / reusable principles &#8594; I refactor.</p><p>Same with AWS cost optimization, but even more frequently.</p><p>When deploying systems (<strong>EC2, ECS, EKS, RDS</strong>), we often over-provision resources &#8220;<strong>just to be safe.</strong>&#8221; But you still pay for all of it.</p><p>Example:</p><p>With ECS Fargate, I&#8217;ve seen payment services set to 4 vCPU, 8GB RAM. After running, CPU and memory usage were only ~10&#8211;20%. So I cut the config in half, then monitored again. Around 50&#8211;70% utilization is a good balance.</p><p>So the question I always keep in mind is:</p><blockquote><p><strong>&#8220;Can this system (or task) be optimized further?&#8221;</strong></p></blockquote><p>And optimization isn&#8217;t just about cost&#8212;it&#8217;s also performance, scalability, and clean code.</p><h2><strong>If AWS recommends it, just follow</strong></h2><p>Honestly, AWS engineers know their stuff. They&#8217;ve laid out best practices in the Well-Architected Framework, and tools like Amazon Q can guide you. Here&#8217;s what I implemented:</p><ul><li><p><strong>Use S3 Gateway Endpoints:</strong> This is a total &#8220;cheat code.&#8221; By adding a <strong>Gateway VPC Endpoint</strong> to your route table, traffic to S3 stays within the AWS internal network. It doesn&#8217;t go over the public internet, it&#8217;s more secure, and most importantly&#8212;<strong>S3 Gateway Endpoints are free.</strong> No data processing charges, no hourly fees. (Just be careful not to choose the &#8220;Interface&#8221; type unless you specifically need it, as those do have a cost!)</p></li><li><p><strong>Switch ECS Fargate to ARM:</strong> I converted our Fargate tasks from x86_64 to ARM (Graviton). 
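</p><p>In the task definition, the switch is a single field. A hedged sketch (family, sizes, and image URI are placeholders; the image itself must also be built for arm64):</p>

```python
# Sketch: the part of an ECS task definition that selects the CPU architecture.
# Family, image, and sizes are placeholder values.

def arm_task_definition(family: str, image: str) -> dict:
    """Kwargs for ecs.register_task_definition targeting Graviton (ARM64)."""
    return {
        "family": family,
        "requiresCompatibilities": ["FARGATE"],
        "networkMode": "awsvpc",
        "cpu": "512",
        "memory": "1024",
        "runtimePlatform": {
            "cpuArchitecture": "ARM64",  # the one-line change from X86_64
            "operatingSystemFamily": "LINUX",
        },
        "containerDefinitions": [
            {"name": "app", "image": image, "essential": True},
        ],
    }

task_def = arm_task_definition(
    "backend-api",
    "123456789012.dkr.ecr.eu-west-3.amazonaws.com/backend:latest",  # placeholder URI
)
print(task_def["runtimePlatform"]["cpuArchitecture"])

# Registering it would look like this (needs AWS credentials, so not run here):
# import boto3
# boto3.client("ecs").register_task_definition(**task_def)
```

<p>The container image must be published for <code>linux/arm64</code> (or as a multi-arch manifest), otherwise the task will fail to start.</p><p>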
It&#8217;s the &#8220;good-cheap-better&#8221; standard: it usually performs better for backend workloads and is roughly <strong>20% cheaper</strong> right out of the box.</p></li><li><p><strong>Reserved Instances &amp; Savings Plans:</strong> For stable workloads like RDS or baseline EC2, this is a no-brainer. If you know you&#8217;ll be running it for a year, commit to it and take the 30&#8211;60% discount.</p></li><li><p><strong>Ditch Public IPs:</strong> Since 2024, AWS charges for every public IPv4 address (~$3.60/month per IP). By keeping resources in private subnets and using internal communication, you save money and harden your security at the same time.</p></li><li><p><strong>Free tier</strong> is great. A lot of my Lambda, SNS, SQS, and CloudWatch usage stays within the Free Tier, so I barely pay anything.</p></li><li><p>Keep a close eye on the <strong>AWS bill</strong> and set up an <strong>AWS Budget</strong> to avoid catching issues too late. Once you notice a service suddenly getting expensive, you need to understand exactly why. In general, checking <strong>Cost Explorer</strong> every 2 or 3 days is a safe habit.</p></li><li><p>If your application uses an <strong>RDS Cluster</strong> with heavy read/write activity, leading to high I/O costs (more than 25% of the total RDS bill), it may be worth considering AWS I/O Optimized storage. In that case, storage is about 30% more expensive, but I/O cost becomes 0.</p></li></ul><p>In short, there are still many ways to save AWS costs. The most important thing is to understand the service and its pricing. Before using a service, you should first understand how it&#8217;s priced and analyze it carefully, because sometimes you jump in first, and only when the bill arrives do you ask why it&#8217;s so expensive.</p><h2><strong>Conclusion</strong></h2><p>With what I&#8217;ve done above, the AWS bill has dropped by 40% compared to before. 
I&#8217;m happy that everything is still running well. For small and medium businesses especially, cost should be a top priority: running well and cheap is still better than running well and expensive, right? &#128517;</p><p><em>&#8220;The views and optimizations shared here are my own personal engineering perspectives and do not represent the specific data or policies of any employer.&#8221;</em></p><p><em>(If you enjoy these kinds of engineering stories, you can subscribe to receive the next ones.)</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://quangchientran.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://quangchientran.substack.com/subscribe?"><span>Subscribe now</span></a></p><p></p>]]></content:encoded></item></channel></rss>