Under the Hood: A Deep Dive into Processes, Threads, and CPU Architecture
A Comprehensive Guide to Processes, Threads, Multitasking, and CPU Cache Architecture.
Foundational knowledge has always been an important part of programming, and it matters even more now that AI has crept into almost every part of a programmer's job. Once you understand the fundamentals, you will find it easier to develop, debug, and use AI more effectively.
Today, I will take a deep dive into processes and threads, the machinery that sits beneath the operating system layer. This is foundational knowledge you need to master when working with computers, because it explains what really happens underneath when an application runs. Let’s get started.
What Is a Process?
Simply put, a process is a program that is being executed on a computer. A single program can create many processes, and processes from many different programs coexist on a computer at the same time.
For example, on Windows, when you run a web browser like Chrome, it creates separate processes for tabs, extensions, and system components. Likewise, when you launch a game, it runs in its own process.
Each process has its own process ID, its own data and state. Each process works in its own memory space, and cannot directly access the data of another process unless there is a sharing mechanism allowed by the operating system (IPC - Inter-Process Communication).
To store the data of a process, the OS uses a data structure called the Process Control Block (PCB). Each PCB is associated with exactly one PID. The PCB includes the following information:
Process ID (PID): An integer that identifies the process.
State: The current state of the process. A process isn't always "Running." It might be Ready (waiting for its turn), Waiting (waiting for you to click something or a file to load), or Terminated.
Pointer: Links to related structures, such as the parent process or the next PCB in a scheduling queue.
Priority: The priority of the process, helping the processor determine the execution order.
Program Counter (PC): A pointer storing the address of the next instruction to be executed by the process. This is vital for Context Switching. Since the CPU jumps between processes many times per second, the PC acts like a "bookmark" so the CPU knows exactly where it left off when it returns to that process.
CPU Registers: The saved contents of the registers the process was using. These hold the temporary data (the "math" being done at that exact microsecond) so that it can be restored when the process runs again.
I/O Information: Information about the read/write devices the process needs to use.
Accounting Information: Usage statistics such as CPU time consumed, time limits, and the owning user or account.
What Is a Thread?
A thread is a lightweight unit of execution within a process. If a process is a house, threads are the people living inside—they share common spaces like the heap and the process address space, but each thread has its own private stack and register state.
Smallest unit: It is the smallest sequence of programmed instructions that a scheduler can schedule independently.
Shared resources: Threads share code, data, and OS resources such as open files. This makes communication fast but also creates race conditions when two threads access and modify the same data without proper synchronization.
Efficiency: Context switching between threads is usually cheaper than switching between processes because the address space often stays the same.
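The split between shared and private state can be sketched in a few lines of Python. This is only an illustration: the names `shared_results`, `results_lock`, and `worker` are made up for the example, and CPython threads stand in for OS threads in general. The list lives on the heap and is visible to both threads, while `local_total` exists separately in each thread's own stack frame.

```python
import threading

shared_results = []                 # heap object: every thread sees the same list
results_lock = threading.Lock()     # protects the shared list from concurrent appends

def worker(n):
    local_total = 0                 # local variable: private to this thread's stack
    for i in range(n + 1):
        local_total += i
    with results_lock:
        shared_results.append((threading.current_thread().name, local_total))

threads = [
    threading.Thread(target=worker, args=(n,), name=f"worker-{n}")
    for n in (10, 100)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(shared_results))
```

Each thread computes its own sum without interference, and only the final append touches shared data, which is why that single step is the one guarded by a lock.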
Hardware vs. Software Threads
Hardware threads: These are execution contexts exposed by the CPU. A hardware thread is essentially a set of registers on a CPU core that holds the state of one software thread. For example, an Apple M4 with 10 cores and no simultaneous multithreading (SMT) provides 10 hardware threads, while a core with SMT (such as Intel Hyper-Threading) typically exposes two. The number of hardware threads determines how many software threads can run in true parallelism.
OS/software threads: These are managed by the kernel. You can create many software threads, and the OS time-slices them across the available hardware threads.
The Java Example
Historically, one Java thread usually mapped to one OS thread (platform threads). Modern virtual threads (Project Loom) let many Java threads run on a smaller number of OS threads, which helps high-scale applications become more efficient.
How Is a Thread Created?
When each thread is created, it has its own execution state, including an Instruction Pointer (IP), which determines the location of the next instruction the thread will execute.
Thanks to this separate execution state, when the CPU performs a Context Switch between threads, each thread can continue right from where it left off instead of starting over from the beginning.
The Thread Control Block (TCB)
Just as a process has a PCB, a thread typically has a TCB or an equivalent thread-specific structure. It is usually smaller and lighter than a PCB.
A TCB may contain:
Thread ID.
Stack Pointer.
Instruction Pointer.
State.
Register values.
Why it is “Cheaper”
Switching between threads in the same process is usually faster than switching between processes because the OS does not need to switch to a different address space. Threads in the same process share the same memory map, so the overhead is lower.
Multi-threading
Multi-threading is the ability of a CPU or a single process to provide multiple threads of execution concurrently. Instead of following just one line of instructions, the process can split into multiple execution paths.
Advantages
Non-blocking UI: Essential for modern applications. The main thread handles user input such as clicks and scrolling, while worker threads handle heavy tasks such as API calls or database queries.
Better resource utilization: On a multi-core processor, multi-threading allows the OS to run multiple tasks in parallel when possible.
Economy: Threads are cheaper to create than processes because they do not require a completely new memory space. They share the existing heap within the same process.
Disadvantages
Race conditions: These happen when two or more threads access the same shared variable at the same time, at least one of them writes to it, and nothing synchronizes the accesses.
Example: both threads see a balance of $100, both add $10, and instead of getting $120, the final result might be $110 because one update overwrites the other.
Deadlock: Thread A holds Resource 1 and waits for Resource 2, while Thread B holds Resource 2 and waits for Resource 1. Both wait forever.
Starvation: Low-priority threads may never get CPU time if higher-priority threads keep taking the processor.
Testing complexity: Multi-threaded bugs are often Heisenbugs — they disappear when you try to observe them because timing changes during debugging.
The invisible cost: Context switching and memory synchronization, such as volatile or synchronized in Java, add overhead that can make a simple multi-threaded program slower than a single-threaded one.
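The bank balance race described in the list above can be sketched as follows. Because real races depend on timing, the first part replays the unlucky interleaving by hand so the result is deterministic; the second part shows one common fix, a lock around the read-modify-write. The `Account` class is a hypothetical example, not taken from any library.

```python
import threading

# Part 1: replay the unlucky interleaving step by step (no real threads needed).
balance = 100
seen_by_a = balance          # Thread A reads 100
seen_by_b = balance          # Thread B reads 100 before A has written back
balance = seen_by_a + 10     # A writes 110
balance = seen_by_b + 10     # B also writes 110, so A's update is lost
lost_update_result = balance

# Part 2: the same two deposits, with the read-modify-write protected by a lock.
class Account:
    def __init__(self, balance):
        self.balance = balance
        self._lock = threading.Lock()

    def deposit(self, amount):
        with self._lock:     # only one thread at a time may update the balance
            self.balance += amount

account = Account(100)
threads = [threading.Thread(target=account.deposit, args=(10,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(lost_update_result, account.balance)
```

The lock makes the update atomic from the point of view of other threads, at the price of the synchronization overhead mentioned in the last item of the list.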
Why It Does Not Scale Forever
Multi-threading does not always make programs faster. This is formally described by Amdahl’s Law, which says that the speedup of a program is limited by its serial part, the part that cannot be parallelized. If 20% of your code must still run sequentially, then your program can never be more than 5x faster, no matter how many threads you add, because the serial 20% alone still takes one fifth of the original run time. You can find more about this law online or by asking AI; it is quite easy to understand.
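Amdahl’s Law is usually written as speedup = 1 / (s + (1 - s) / n), where s is the serial fraction and n is the number of workers. A small sketch makes the ceiling visible; the function name `amdahl_speedup` is chosen for this example only.

```python
def amdahl_speedup(serial_fraction, workers):
    """Theoretical speedup when `serial_fraction` of the work cannot be parallelized."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)

# With 20% serial work, the speedup approaches 5x but never reaches it.
for n in (1, 2, 4, 8, 64, 1024):
    print(n, round(amdahl_speedup(0.2, n), 2))
```

With four workers the speedup is 2.5x, and going from 64 to 1024 workers adds almost nothing, which is the practical message of the law.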
Another hidden cost is context switching overhead. When the CPU moves from Thread A to Thread B, it must save the current state, such as registers and the stack pointer, and then load the new one. If you have too many threads, the CPU may spend more time switching than doing useful work.
Multi-process Model
Multi-process is a model in which a program or system uses multiple independent processes to handle work. Each process lives in its own isolated virtual address space, has its own state, and does not directly share data with other processes like threads do. Because of this isolation, when one process crashes (for example with a segmentation fault), the other processes can often continue running normally.
Simply put, multi-process is like splitting a large application into multiple separate “work rooms”. Each room has its own task, its own people, its own documents, and is less dependent on the others. This makes the system safer and easier to isolate.
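Fault isolation can be sketched with Python's standard `subprocess` module. The child below is a separate interpreter process with its own PID and address space; it deliberately fails with an uncaught exception (a stand-in for a real crash such as a segmentation fault), and the parent simply observes the non-zero exit code and carries on. The `child_code` string is made up for this example.

```python
import os
import subprocess
import sys

# The child prints its PID, then dies with an uncaught exception.
child_code = "import os; print(os.getpid()); raise RuntimeError('simulated crash')"

result = subprocess.run(
    [sys.executable, "-c", child_code],
    capture_output=True,
    text=True,
)

child_pid = int(result.stdout.strip())
parent_pid = os.getpid()

# The parent is unaffected: it only sees the child's exit status.
print(f"parent {parent_pid} is still running; child {child_pid} exited with {result.returncode}")
```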
Examples:
A browser can use multiple processes for tabs, extensions, and network.
A server can also use multiple processes: Nginx runs a master process with several worker processes, and PostgreSQL starts a separate backend process for each client connection.
A Python program can create multiple processes to handle heavy tasks on multiple CPU cores.
Why use multi-process?
Use multi-process when:
The work can be broken down into many independent parts.
You want to take advantage of multiple CPU cores to improve performance.
You want fault isolation, to avoid one part of a failure affecting the entire application.
You want to avoid the limitations of threads in some runtimes or languages.
The GIL (Global Interpreter Lock):
In the reference implementations of languages like Python (CPython) and Ruby (CRuby), a global interpreter lock (GIL/GVL) prevents multiple threads from executing bytecode at the same time. To get true parallelism for CPU-bound work on a multi-core CPU, developers often use multi-processing instead of multi-threading.
Advantages:
Better resource isolation than multi-thread.
One process failure does not bring down the entire system.
Can truly utilize multiple CPU cores.
Good fit for CPU-bound tasks.
Copy-on-Write (CoW):
Although each process has its own memory, modern operating systems use a trick called Copy-on-Write. When a process forks, the OS does not immediately copy all of its memory. Both processes share the same physical memory pages until one of them writes to a page. Only then is that page actually copied. This makes multi-processing more efficient than it might sound at first.
Disadvantages:
Creating and managing processes is more expensive than threads, because spawning a process requires heavier system calls to the kernel.
Communication between processes is more complex, because their memory spaces are separate.
It consumes more memory because each process still has its own stacks, heaps, and libraries.
IPC complexity:
Because processes cannot directly see each other’s memory, they must use Inter-Process Communication (IPC) mechanisms:
Pipes & sockets: sending data like a stream or a phone call.
Shared memory: setting up a common memory region that both processes can access.
Message queues: leaving messages in a mailbox for other processes to read later.
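As a minimal sketch of pipe-based IPC, the parent below starts a child process and talks to it through two pipes, one connected to the child's standard input and one to its standard output. The child program (`child_code`) is a made-up example that upper-cases whatever it receives; the two processes never see each other's memory, only the byte stream.

```python
import subprocess
import sys

# Child: read everything from stdin, send the upper-cased text back on stdout.
child_code = "import sys; sys.stdout.write(sys.stdin.read().upper())"

proc = subprocess.Popen(
    [sys.executable, "-c", child_code],
    stdin=subprocess.PIPE,    # parent -> child pipe
    stdout=subprocess.PIPE,   # child -> parent pipe
    text=True,
)

# communicate() writes the message, closes the pipe, and waits for the reply.
reply, _ = proc.communicate("hello from the parent")
print(reply)
```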
Multitasking
Multitasking is the ability of an operating system to manage multiple tasks, such as processes or threads, concurrently. Even on a machine with multiple cores, the operating system still relies on multitasking to make many programs appear to run at the same time. In practice, the OS divides CPU time into small slices and alternates between tasks so quickly that it creates the illusion of full simultaneity.
At the center of multitasking is Context Switching. A context switch happens when the CPU stops running one process or thread and switches to another. Before switching, the operating system must save the current execution state, and when the task runs again later, it restores that state so execution can continue from exactly where it left off.
You can think of the CPU like a chef cooking many meals at once. Before moving away from one dish, the chef remembers the temperature, timer, and cooking status. When returning later, the chef checks those notes and continues from the right point instead of starting over.
What gets saved during a context switch usually includes:
Program Counter: the next instruction to execute.
CPU Registers: temporary values currently being used by the CPU.
State information: whether the task is Ready, Running, Waiting, or another state.
I/O information: any relevant input/output details.
Accounting information: usage statistics such as CPU time.
The Cost of Switching
Context switching is essential, but it is not free. It adds overhead, which means the CPU spends time on housekeeping instead of useful work. Saving and restoring state takes time, and frequent switches can reduce overall performance.
There is also a cache effect. When the CPU switches from one task to another, the cache may still contain data from the previous task. The new task may suffer cache misses and need to fetch data from the slower main memory, which adds more delay. This is one reason why too many context switches can hurt performance.
Preemptive vs. Cooperative
Modern operating systems usually use preemptive multitasking. In this model, the OS is in control and can forcibly stop a task when its time slice expires. This helps keep the system responsive and prevents one task from monopolizing the CPU.
Older systems sometimes used cooperative multitasking. In that model, tasks had to voluntarily give up control. If one task froze or misbehaved, the whole system could become unresponsive. That is why preemptive multitasking became the standard in modern operating systems.
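Cooperative multitasking can be sketched with Python generators. Each task runs until it reaches `yield`, which is its voluntary hand-off back to the scheduler; the generator object remembers where it stopped, much like a saved program counter. The names `task` and `run_cooperative` are invented for this illustration, and a real kernel is of course far more involved.

```python
from collections import deque

def task(name, steps, log):
    for i in range(1, steps + 1):
        log.append(f"{name}:{i}")
        yield                      # voluntarily give control back to the scheduler

def run_cooperative(tasks):
    ready = deque(tasks)           # a very small "ready queue"
    while ready:
        current = ready.popleft()
        try:
            next(current)          # resume the task exactly where it left off
        except StopIteration:
            continue               # task finished: do not requeue it
        ready.append(current)      # task yielded: back of the line

log = []
run_cooperative([task("A", 2, log), task("B", 3, log)])
print(log)
```

If a task looped forever without reaching `yield`, nothing else would ever run. That is exactly the weakness that preemptive multitasking removes by letting the OS interrupt a task when its time slice expires.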
Hardware Support
Modern CPUs also provide hardware features that help context switching happen more efficiently. The CPU can save and restore register state quickly, which reduces some of the cost of switching. Even so, context switching still has a real performance price, especially when it happens too often.
In short, multitasking is the big picture, while context switching is the mechanism that makes it work. Multitasking allows many tasks to share one CPU over time, while context switching is the actual process of moving from one task to another and back again.
Scheduler
The scheduler is a core component of the operating system kernel that decides which process or thread gets to run on the CPU at any given moment. In simple terms, it is like the director of a stage performance: it decides who gets on stage first, who has to wait, and how long each actor gets to perform.
The scheduler does not just choose what runs, but also for how long. When a task’s time slice ends, or when it needs to wait for I/O, the scheduler moves the CPU to another task so the system stays responsive and does not waste CPU time.
The Three Levels of Scheduling
In classic operating system design, scheduling is usually described at three levels:
Long-term scheduler: decides which jobs are admitted into the system from disk into memory. It controls the degree of multiprogramming.
Short-term scheduler: also called the CPU scheduler, this is the one that picks a task from the Ready Queue and gives it CPU time. It runs very frequently and must be extremely fast.
Medium-term scheduler: handles swapping. When memory is under pressure, it can temporarily move a process out of RAM and bring it back later.
These three roles work together to balance performance, memory usage, and responsiveness.
Process States and Queues
A process usually moves through a small number of states:
New: the process is being created.
Ready: the process is ready to run and is waiting in the Ready Queue.
Running: the process is currently using a CPU core.
Waiting or Blocked: the process cannot run because it is waiting for a slow event, such as disk I/O or user input.
Terminated: the process has finished and is being cleaned up.
A common point of confusion is that a process usually does not go directly from Waiting to Running. It must first return to the Ready Queue and wait for the short-term scheduler to pick it again.
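The state rules above can be written down as a small transition table. This is a simplified textbook model, not the exact state set of any particular kernel, and the names `State`, `ALLOWED`, and `transition` are chosen for the example.

```python
from enum import Enum

class State(Enum):
    NEW = "new"
    READY = "ready"
    RUNNING = "running"
    WAITING = "waiting"
    TERMINATED = "terminated"

# Which states can be reached directly from each state.
ALLOWED = {
    State.NEW: {State.READY},
    State.READY: {State.RUNNING},
    State.RUNNING: {State.READY, State.WAITING, State.TERMINATED},
    State.WAITING: {State.READY},      # never straight back to RUNNING
    State.TERMINATED: set(),
}

def transition(current, target):
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.name} -> {target.name}")
    return target

state = State.NEW
for target in (State.READY, State.RUNNING, State.WAITING, State.READY, State.RUNNING):
    state = transition(state, target)
print(state.name)
```

Asking for `transition(State.WAITING, State.RUNNING)` raises an error, which mirrors the rule that a blocked process must pass through the Ready Queue first.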
Scheduling Queues
The operating system uses different queues to organize processes and threads:
Job Queue: contains jobs that have not yet been admitted into memory.
Ready Queue: contains processes or threads that are ready to run on the CPU.
Device Queue: contains processes or threads waiting for I/O devices such as disk, network, or other peripherals.
When a process changes state, it moves to the appropriate queue. For example, if a running process needs to read from disk, it leaves the CPU and moves to the Device Queue. When the I/O completes, it returns to the Ready Queue and waits for CPU time again.
Scheduling Criteria
When choosing a scheduling algorithm, the operating system usually balances several goals:
CPU utilization: keep the CPU busy as much as possible.
Throughput: finish as many jobs as possible in a given time.
Turnaround time: reduce the time from New to Terminated.
Response time: reduce the delay between a user action and the first visible reaction.
Fairness: ensure that no task is starved of CPU for too long.
Waiting time: reduce the time a task spends waiting before it gets CPU time.
The best choice depends on the system’s goal. A server may care more about throughput and CPU utilization, while a desktop OS cares more about response time and fairness.
Scheduling Algorithms
Several scheduling algorithms are commonly used:
First Come, First Serve (FCFS): the first task to arrive is the first task to run.
Round Robin (RR): each task gets a fixed time slice, then the CPU moves to the next task.
Priority Scheduling: higher-priority tasks run first.
Shortest Job First (SJF): tasks with less work are prioritized.
Shortest Remaining Time: the task with the least remaining processing time is selected.
Multi-level Queue: the system is divided into multiple queues, each with its own scheduling policy.
Each algorithm has trade-offs between responsiveness, fairness, and efficiency. Round Robin is simple and fair, but may create more context switching. SJF can reduce average waiting time, but it is harder to use in practice because the OS must estimate job length.
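The waiting-time trade-off between FCFS and SJF can be checked with a short calculation. This sketch assumes that all jobs arrive at time 0, that burst lengths are known in advance, and that jobs run without preemption; the burst values are illustrative.

```python
def average_waiting_time(bursts):
    """Average time jobs wait before starting, when run back to back in the given order."""
    waited = 0
    elapsed = 0
    for burst in bursts:
        waited += elapsed      # this job waited for everything scheduled before it
        elapsed += burst
    return waited / len(bursts)

bursts = [24, 3, 3]            # CPU time needed by each job, in arrival order

fcfs = average_waiting_time(bursts)           # run in arrival order
sjf = average_waiting_time(sorted(bursts))    # run shortest job first

print(fcfs, sjf)
```

Under FCFS the two short jobs sit behind the long one (waits of 0, 24, and 27), while SJF lets them finish first (waits of 0, 3, and 6), which is why its average is so much lower.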
Multi-level Feedback Queue
Most modern operating systems do not rely on a simple scheduling model alone. A common real-world approach is the Multi-level Feedback Queue (MLFQ).
In MLFQ, a new task typically starts in a high-priority queue with a short time slice. If it behaves like a quick interactive task, it stays near the top. If it keeps using too much CPU time, the OS gradually moves it to lower-priority queues with longer time slices. This keeps the user interface responsive while still allowing large jobs to complete.
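A toy version of this demotion rule is sketched below. It is heavily simplified: all tasks arrive at time 0, there is no I/O and no periodic priority boost, and the three time slices (2, 4, and 8 units) are arbitrary values picked for the example. The function name `run_mlfq` and the task names are likewise illustrative.

```python
from collections import deque

def run_mlfq(tasks, quanta=(2, 4, 8)):
    """tasks maps a name to the CPU time it needs; returns (name, level, ran_for) records."""
    queues = [deque() for _ in quanta]     # index 0 is the highest priority
    remaining = dict(tasks)
    for name in tasks:
        queues[0].append(name)             # every new task starts at the top
    timeline = []
    while any(queues):
        level = next(i for i, q in enumerate(queues) if q)   # highest non-empty queue
        name = queues[level].popleft()
        ran = min(quanta[level], remaining[name])
        remaining[name] -= ran
        timeline.append((name, level, ran))
        if remaining[name] > 0:
            # The task used its whole slice without finishing: demote it one level.
            queues[min(level + 1, len(queues) - 1)].append(name)
    return timeline

timeline = run_mlfq({"editor": 1, "encoder": 10})
for record in timeline:
    print(record)
```

The short "editor" task finishes within its first slice at the top level, while the long "encoder" task drifts down to the lower queues and receives longer but lower-priority slices.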
CPU-bound and I/O-bound Workloads
Different workloads need different scheduling behavior:
CPU-bound tasks spend most of their time doing computation.
I/O-bound tasks spend most of their time waiting for disk, network, or other devices.
Schedulers often try to favor short interactive or I/O-bound tasks so the system feels responsive, while still making sure CPU-bound tasks eventually get enough processing time.
Shared Memory
Shared Memory is one of the fastest methods for Inter-Process Communication (IPC). Instead of sending data back and forth through the kernel, the operating system maps the same physical memory region into the virtual address spaces of multiple processes. That means the same data can be accessed directly by more than one process.
Shared Memory vs. Message Passing
There are two main ways for processes to communicate:
Message passing: the OS acts like a mailman. Process A sends data to the kernel, and the kernel delivers it to Process B. This is safer, but it usually involves copying data more than once.
Shared memory: the OS provides a shared region of memory that both processes can read and write directly. This avoids copying and is much faster, especially for large data.
This is why shared memory is often the preferred choice when speed matters most.
How It Works
The operating system takes a physical block of RAM and maps it into the virtual address spaces of two or more processes.
For example:
To Process A, the shared data might appear at address 0x1000.
To Process B, the same physical memory might appear at address 0x5000.
Even though the virtual addresses are different, both processes are looking at the same physical memory underneath.
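Python exposes this mechanism through the standard `multiprocessing.shared_memory` module (Python 3.8 and later). In the sketch below, both handles are opened in the same process only to keep the example short and deterministic; in practice the second handle would be opened by another process that knows the block's name. The variable names `writer` and `reader` are chosen for the example.

```python
from multiprocessing import shared_memory

# Create a named 16-byte shared block (what "Process A" would do).
writer = shared_memory.SharedMemory(create=True, size=16)
try:
    # Attach to the same block by name (what "Process B" would do).
    reader = shared_memory.SharedMemory(name=writer.name)
    try:
        writer.buf[:5] = b"hello"        # write through one mapping
        seen = bytes(reader.buf[:5])     # read through the other mapping
    finally:
        reader.close()                   # drop this handle's mapping
finally:
    writer.close()
    writer.unlink()                      # release the block itself

print(seen)
```

No data is sent anywhere: the bytes written through one handle are immediately visible through the other because both map the same underlying memory.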
Advantages
Zero-copy communication: once the memory is mapped, data moves at memory speed instead of being copied through the kernel.
Good for large data: shared memory is ideal for video frames, database buffers, and other large datasets.
Low latency: it is often faster than pipes or sockets for frequent data exchange.
Disadvantages and Risks
Synchronization burden: the OS does not manage access automatically, so developers must protect shared data using atomic operations, mutexes, semaphores, or locks.
Complexity: if one process crashes while holding a lock, other processes may get stuck, or the shared data may become corrupted.
Security: if permissions are not configured carefully, shared memory can become a risk for unauthorized data access.
Mutex vs. Semaphore
It is helpful to distinguish between the common synchronization tools:
Mutex: like a key to a bathroom. Only one thread or process can hold it at a time.
Semaphore: like a parking lot counter. It allows a fixed number of threads or processes to access a resource at the same time.
Shared memory is powerful because it gives you speed, but it also gives you responsibility. The OS provides the shared space, but the application must make sure it is used safely and correctly.
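The difference between the two tools can be sketched with Python's `threading` module. Here a semaphore with two permits plays the parking lot, and a separate lock (a mutex) protects the counters that record how many threads are inside at once. The names `slots`, `stats`, and `park` are invented for this example, and the short sleep only simulates work.

```python
import threading
import time

slots = threading.Semaphore(2)       # parking lot with two spaces
stats_lock = threading.Lock()        # mutex guarding the counters below
stats = {"inside": 0, "max_inside": 0}

def park():
    with slots:                      # blocks while both spaces are taken
        with stats_lock:             # one thread at a time updates the counters
            stats["inside"] += 1
            stats["max_inside"] = max(stats["max_inside"], stats["inside"])
        time.sleep(0.01)             # simulate using the resource
        with stats_lock:
            stats["inside"] -= 1

cars = [threading.Thread(target=park) for _ in range(6)]
for car in cars:
    car.start()
for car in cars:
    car.join()

print(stats["max_inside"])
```

However the six threads are scheduled, the recorded maximum never exceeds the semaphore's two permits, while the mutex keeps the counter updates themselves consistent.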
CPU Caches
Modern CPUs use a cache hierarchy to bridge the huge speed gap between the CPU and main memory (DRAM). The closer a memory level is to the CPU core, the faster it is, but also the smaller it tends to be.
Typically, CPUs have three main cache levels:
L1: the smallest and fastest cache. It is usually split into:
L1i for instructions.
L1d for data.
L2: larger than L1, but slightly slower. In many modern designs, it acts as a buffer for L1 and may be shared by a small cluster of cores.
L3: also called the Last Level Cache (LLC). It is much larger, usually measured in megabytes, and can be shared across multiple cores.
When the CPU needs data, it checks L1 first, then L2, then L3, and finally DRAM if the data is not found in cache. This layered design helps keep frequently used data close to the processor, which dramatically reduces average memory access time.
Cache Lines: The Unit of Transfer
Data is transferred between memory levels in fixed-size blocks called cache lines. On many modern CPUs, a cache line is typically 64 bytes.
The reason is that when a CPU reads a variable from RAM, it doesn’t just fetch that single variable (byte by byte). Instead, it loads an entire 64-byte chunk surrounding that variable (this is the cache line) into the CPU cache (L1). It does this because it predicts that, in most cases, nearby data will soon be used. The next time, those neighboring values are already in L1, so there’s no need to access RAM again.
This mechanism is based on the principle of spatial locality. If the CPU accesses a value in memory, it assumes that nearby values are likely to be accessed soon as well.
That’s why arrays are typically cache-friendly. Their elements are stored next to each other in memory, allowing the CPU to use cache lines efficiently. In contrast, linked lists often have nodes scattered throughout memory, making cache usage much less efficient.
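Spatial locality can be made concrete by counting how many distinct cache lines an access pattern touches. The sketch below is pure address arithmetic, not a measurement: it assumes 64-byte cache lines, 4-byte elements, and an array that starts at address 0, and the helper `cache_lines_touched` is made up for the example.

```python
CACHE_LINE_BYTES = 64

def cache_lines_touched(base_address, element_size, indices):
    """Number of distinct cache lines covered by accessing the given element indices."""
    return len({
        (base_address + i * element_size) // CACHE_LINE_BYTES
        for i in indices
    })

# 1024 accesses to neighbouring 4-byte elements: 16 elements share each line.
sequential = cache_lines_touched(0, 4, range(1024))

# 1024 accesses that are each 16 elements (64 bytes) apart: every access hits a new line.
strided = cache_lines_touched(0, 4, range(0, 1024 * 16, 16))

print(sequential, strided)
```

Both patterns perform the same number of reads, but the sequential one needs far fewer line fetches, which is the advantage a contiguous array has over scattered linked-list nodes.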
Cache Hits and Misses
When the CPU looks for data, one of two things happens:
Cache hit: the data is found in cache, so execution continues quickly.
Cache miss: the data is not found, so the CPU must fetch it from a lower level, usually a slower cache or DRAM.
A cache miss is expensive. The CPU may have to wait dozens or even hundreds of cycles while the data is loaded. That is why good cache behavior can have a huge impact on performance.
Set-Associativity and Tags
Cache is not managed like a simple hash table. Instead, it uses a set-associative structure.
The CPU uses specific bits from the memory address to choose a cache set, then compares the tag to see whether the data is actually there. This design lets the hardware find data very quickly without scanning the entire cache.
Each cache line usually contains:
The actual data.
A tag that identifies which memory block it belongs to.
Metadata such as validity and dirty state.
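The set lookup described above amounts to slicing the address into three bit fields. The sketch assumes a hypothetical 32 KiB, 8-way set-associative cache with 64-byte lines, which gives 64 sets; real CPUs differ in size and associativity, and the function name `split_address` is chosen for the example.

```python
LINE_BYTES = 64
WAYS = 8
CACHE_BYTES = 32 * 1024
NUM_SETS = CACHE_BYTES // (LINE_BYTES * WAYS)      # 64 sets

OFFSET_BITS = LINE_BYTES.bit_length() - 1          # 6 bits select a byte within the line
INDEX_BITS = NUM_SETS.bit_length() - 1             # 6 bits select the set

def split_address(address):
    """Split an address into (tag, set_index, offset) for the cache described above."""
    offset = address & (LINE_BYTES - 1)
    set_index = (address >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = address >> (OFFSET_BITS + INDEX_BITS)
    return tag, set_index, offset

print(split_address(0x12345))
print(split_address(0x12345 + NUM_SETS * LINE_BYTES))
```

The two addresses are exactly 4096 bytes apart, so they map to the same set and differ only in their tag. The hardware compares the tag against the lines stored in that one set to decide whether the access is a hit.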
Write-Back and Dirty Cache Lines
To stay fast, CPUs often use a write-back policy. When data is modified, the change is written to cache first, and the cache line is marked as dirty.
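The main memory is not updated immediately. Instead, the data is written back later, usually when:
the cache line is evicted to make room for other data, or
another core or device needs the line and the coherency protocol forces the modified copy to be shared or written back.
Memory barriers, volatile, and atomic operations control the ordering and visibility of these writes between cores; they do not necessarily flush data all the way to DRAM.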
This approach reduces traffic to DRAM and improves performance, but it also means that cache and memory can temporarily hold different versions of the same data.
Cache Coherency and False Sharing
On multi-core CPUs, each core can have its own cache (L1, L2). This creates a data synchronization problem between CPU cores during computation: if Core 1 modifies data in its cache (within a cache line), how does Core 2 know that its copy is now outdated and needs updating?
When Core 1 modifies a variable x, the cache line holding x becomes “dirty” (modified) in Core 1’s cache, and every other core’s copy of that line is invalidated. If Core 2 then wants to modify another variable y that happens to reside on the same cache line, it cannot use its own copy anymore. It has to fetch the up-to-date line again, from Core 1’s cache, from L3, or from RAM, even though the two cores never touch the same variable. This issue is commonly known as false sharing.
When two cores keep modifying values on the same cache line, the line bounces between their caches and has to be reloaded repeatedly, which is costly in terms of time. This constant back-and-forth movement is managed by a cache coherency protocol such as MESI. The protocol ensures that cores coordinate with each other so they do not keep using stale data. In practice, it guarantees that all cores eventually see the most up-to-date version of shared data.
False sharing is particularly dangerous because the code may look completely correct and free of obvious contention, yet still perform poorly due to how data is laid out in cache lines.
How to solve it
To solve false sharing, the most effective approach is to ensure that variables written independently by each core/thread do not end up on the same cache line.
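Padding: Give each variable its own cache line by inserting extra “junk” data, making sure important variables do not sit within the same 64 bytes (so they fall on different cache lines). In Java, the @Contended annotation asks the JVM to apply this padding; note that it lives in an internal JDK package, and using it in application code typically requires the -XX:-RestrictContended JVM flag.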
Redesign data per thread: Give each thread/core its own local variable, do the computation independently, and then merge the results at the end instead of continuously writing to a shared memory region.
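The effect of padding on memory layout can be sketched with Python's `ctypes`, which lays out structures the way C does. This only demonstrates the layout arithmetic, not a performance measurement, and it assumes 64-byte cache lines and a structure that starts on a cache-line boundary. The `Packed` and `Padded` types and the helper `same_cache_line` are made up for the example.

```python
import ctypes

CACHE_LINE_BYTES = 64

class Packed(ctypes.Structure):
    # Two counters side by side: offsets 0 and 8, both within the first 64 bytes.
    _fields_ = [("a", ctypes.c_int64), ("b", ctypes.c_int64)]

class Padded(ctypes.Structure):
    # Filler bytes push the second counter to the start of the next cache line.
    _fields_ = [
        ("a", ctypes.c_int64),
        ("_pad", ctypes.c_char * (CACHE_LINE_BYTES - ctypes.sizeof(ctypes.c_int64))),
        ("b", ctypes.c_int64),
    ]

def same_cache_line(struct_type):
    """True if fields a and b fall on the same line (structure assumed line-aligned)."""
    line_of_a = struct_type.a.offset // CACHE_LINE_BYTES
    line_of_b = struct_type.b.offset // CACHE_LINE_BYTES
    return line_of_a == line_of_b

print(Packed.b.offset, same_cache_line(Packed))
print(Padded.b.offset, same_cache_line(Padded))
```

The padded layout spends 56 extra bytes per pair of counters, which is the usual trade-off: more memory in exchange for each writer having a cache line to itself.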
Why Caches Matter
Caches are one of the biggest reasons modern CPUs are fast. They reduce the average cost of memory access, help keep the CPU busy, and make repeated or nearby data access much cheaper.
But caches also introduce complexity. To write high-performance code, you need to think not only about algorithms, but also about memory access patterns, cache locality, cache coherency, and false sharing.
Conclusion
Once you understand processes, threads, schedulers, and caches, you will see more clearly why some code runs fast, why some code runs slowly, why synchronization bugs happen, and why AI cannot completely replace system thinking. Foundational knowledge does not mean you write the code in place of AI, but it helps you know what to ask AI and how to verify its results.
(If you enjoy these kinds of engineering stories, you can subscribe to receive the next ones.)











