Beyond the Bill: How I Matured the Cloud Infrastructure I Manage

Mar 06, 2026

In the previous article, I shared how I optimized the cloud bill. But in reality, there were still many things that needed improvement. Engineering is a journey, right?

Pipeline to Deploy Code

The setup I work with uses Jenkins to run pipelines. I have also worked with GitLab CI/CD before, but to me they are all just tools. Argo CD, GitHub Actions, AWS CodePipeline: in the end, they are simply tools to build and deploy code.

At the beginning, the pipeline I managed only had a build stage for backend projects. Deployments were still done mostly by hand. So I added new steps to automatically deploy to staging and production. Now everything is fully automated after a simple git push. Pretty nice 😄. My job now is simply to change the code and push it.

Working with Git

I have to admit something: Even though I had used Git for a long time, I never really had an effective workflow. In a team with many developers, things could easily become messy. Every time we wanted to test a new feature, we had to ask:

Can we deploy now?
Will it overwrite someone else’s code?
Is another feature still being tested?
Is it my turn to test?

The main problem was simple: we only had one dev environment.

At that time, my understanding of an effective git workflow was still very vague.

Then one day, I was walking home from work along the river, not looking at my phone, just enjoying the fresh air and thinking idly about git workflows, when an idea suddenly came to me. Haha.

Later I found a blog post online that described almost exactly the same workflow. It wasn’t a new idea — I just hadn’t been exposed to it before! In the end, everything is about what works best for your team.

My principles are very simple:

The master branch is always the most stable and correct version of the system.
All new features and bug fixes must start from this branch.
The develop branch is used for testing.
Feature branches and bug-fix branches must merge into develop to be deployed to the testing environment.
After testing is successful on develop, the feature branch can then be merged into master and deployed to production.
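
As a rough illustration, the principles above could be written down as a small Python check. This is only a sketch: the `feature/` and `bugfix/` branch prefixes, the environment names, and the function names are assumptions made for the example, not what the real pipeline uses.

```python
from typing import Optional

# Hypothetical environment names: the workflow only distinguishes a
# testing environment (develop) and production (master).
DEPLOY_TARGETS = {
    "develop": "testing",
    "master": "production",
}


def deploy_target(branch: str) -> Optional[str]:
    """Return the environment a push to `branch` deploys to, or None."""
    return DEPLOY_TARGETS.get(branch)


def allowed_merge(source: str, target: str, tested_on_develop: bool = False) -> bool:
    """Check a merge against the workflow principles."""
    # Assumed naming convention for feature and bug-fix branches.
    if not source.startswith(("feature/", "bugfix/")):
        return False
    if target == "develop":
        # Any work branch may go to develop to be tested.
        return True
    if target == "master":
        # Only work that passed testing on develop may reach production.
        return tested_on_develop
    return False
```

Note that develop itself is never merged into master here: the feature branch is, once it has been tested, so untested work from other branches sitting on develop does not leak into production.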

With this workflow, the team has been working very smoothly so far. If problems appear later… maybe I will just go walk along the river again to think about it. 😄

Infrastructure (Infrastructure as Code)

The benefits of Infrastructure as Code are well known, so I probably don’t need to explain them much. You can easily find plenty of information online or just ask AI.

After finishing the deployment of the backend infrastructure on Amazon Web Services, the next thing I did was write IaC for all the projects I worked on.

I use OpenTofu (a fork of Terraform). The idea was simple: one day, if I am no longer managing this system, the engineers who join will at least have something to help them understand what was built. And if something goes wrong, they can quickly rebuild it.

Side note: Sorry AWS, but I’m not a big fan of AWS CloudFormation. Terraform-style code (which is what OpenTofu uses too) just looks much nicer to me 😄.

Monitoring

After one year of managing this infra, the thing I’m most proud of is building a monitoring platform. At my previous job, I worked with Datadog, but my understanding was basic. I mostly just used it to read logs.

When I started my current role, I was surprised to see no monitoring platform at all. If developers wanted to read logs, they had to SSH into the server. At that moment, I felt monitoring was absolutely necessary. Without it, debugging production systems feels like fighting enemies with bare hands.

Backup

Do you know what the most valuable asset of a company is? For me, it’s data. If the data disappears, the company may disappear too.

I heard a story about a company that lost all its data after hackers gained root access and deleted everything. They asked AWS for help, but the data could not be recovered. The company shut down. That made me think.

I use AWS Backup now. With this service, backups can be protected so that they cannot easily be deleted, even with root access (as far as I understand, this relies on a feature such as AWS Backup Vault Lock). At least, within my current understanding, this feels safer.

Unless the entire AWS infrastructure collapses… which hopefully is very unlikely.

Rethinking Cron Jobs

Almost every company needs to process large datasets periodically. Originally, we used the Spring Boot @Scheduled annotation. It worked, but had two problems:

Debugging was difficult.
Horizontal scaling: when multiple instances of the service run at once, each one runs its own scheduler, so the same job might execute twice!

My solution was simple. I moved the scheduling to AWS Lambda + Amazon EventBridge. EventBridge triggers the Lambda, which then calls an HTTP API endpoint in our service.

Everything became much easier to manage.
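
For illustration, here is a minimal sketch of what such a Lambda could look like in Python. The endpoint URL, the `JOB_URL` environment variable, and the payload shape are all assumptions made for the example. The real function depends on how the service exposes its job endpoint and how it is secured.

```python
import json
import os
import urllib.request

# Hypothetical endpoint, used only when JOB_URL is not configured.
DEFAULT_JOB_URL = "https://api.example.com/internal/jobs/run"


def build_request(event: dict, job_url: str) -> urllib.request.Request:
    """Build the POST request that asks the backend to run the job."""
    # EventBridge scheduled events carry a "time" field; forwarding it
    # helps when reading the backend logs later.
    payload = {"triggered_at": event.get("time")}
    return urllib.request.Request(
        job_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def handler(event, context, opener=urllib.request.urlopen):
    """Lambda entry point, invoked by an EventBridge schedule."""
    job_url = os.environ.get("JOB_URL", DEFAULT_JOB_URL)
    request = build_request(event, job_url)
    # `opener` is injectable so the handler can be tested without a network.
    with opener(request, timeout=10) as response:
        return {"status": response.status}
```

Because one schedule replaces the timers inside each backend instance, the job is triggered once per scheduled tick no matter how many instances sit behind the endpoint. The request lands on a single instance, and its logs show exactly when and why the job ran.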

Conclusion

I’m always thinking about ways to improve the systems I work on. Maybe some of these solutions look simple to others. But for me, every time I find a solution, it brings a small sense of joy. And honestly, I’m always a little proud of that. 😉

(And yes, I still take walks by the river to brainstorm!) 😊

(If you enjoy these kinds of engineering stories, you can subscribe to receive the next ones.)

Quang Chien's Blog | Software Engineer | France
