Infrastructure as Code (IaC) is fantastic. It’s light-years ahead of manual infrastructure configuration and modification.
To give a simple example, at my previous company, it would take me a week to manually set up a new stack (all infrastructure pieces, software that runs on it, and so on). Over time, we automated much of the setup process, but there were still a substantial number of manual steps. The process would take at least half a day and require someone who knew what they were doing. This time, we started from the beginning with IaC; now, it takes us about 30 minutes to get a new stack up and running, and only a little training is required to teach anybody how to do it.
We can spin up new stacks on a whim. Do you want to do performance testing without affecting others? Just run this one command, and in 30 minutes, the new stack will be ready to go. Do you want to test your feature branch? Do it on your own stack. Do you want to experiment with infrastructure changes? You can do that too. There are many reasons to want an independent stack, and getting one is fast (and cheap).
All hail Infrastructure as Code!
All of that being said, there are several pitfalls with IaC that I want to emphasize and discuss.
One interesting observation: This time around, I am second-in-command on the (Dev)Ops side. I am WAY less knowledgeable than the lead engineer who handles infrastructure; I spend less time on it. And being second in command brings many issues into focus.
The high density of knowledge required (per line of code)
If you work on something manually, you can stumble your way through configuring services that you don’t yet know well. The UI shows you which controls are there. And if the UX is decent, it will even guide you toward the right decisions and show you what is critical.
It’s very different for IaC. Lines of code and even values can impact half a dozen elements of the overall stack. That requires significantly more diligence and consideration than most engineers or even just general IT folks are used to.
To give you a simplistic example: in the UI, if you try to make your database publicly accessible, there will be a big flashing red sign asking, “Are you sure?” In Terraform, you will have a plain, easy-to-miss line:
publicly_accessible = true
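One way to get some of that “Are you sure?” warning back is to scan the plan before applying it. Below is a minimal sketch, assuming the plan has been exported with `terraform show -json`, which lists planned changes under `resource_changes`. The `RISKY_FLAGS` table, the `find_risky_changes` function, and the sample plan are hypothetical and made up for illustration; they are not part of our setup or of any tool.

```python
import json

# Hypothetical rule table: resource type -> boolean attributes that deserve a
# second look when set to true. Extend it to match your own stack.
RISKY_FLAGS = {
    "aws_db_instance": ["publicly_accessible"],
}


def find_risky_changes(plan):
    """Return "address.attribute" for every planned resource that enables a risky flag."""
    findings = []
    for rc in plan.get("resource_changes", []):
        # "after" is null when a resource is being destroyed, hence the fallbacks.
        after = (rc.get("change") or {}).get("after") or {}
        for attr in RISKY_FLAGS.get(rc.get("type"), []):
            if after.get(attr) is True:
                findings.append(f"{rc['address']}.{attr}")
    return findings


# Trimmed, made-up plan document in the shape of `terraform show -json` output.
SAMPLE_PLAN = json.loads("""
{
  "resource_changes": [
    {
      "address": "aws_db_instance.main",
      "type": "aws_db_instance",
      "change": {"actions": ["create"], "after": {"publicly_accessible": true}}
    },
    {
      "address": "aws_s3_bucket.logs",
      "type": "aws_s3_bucket",
      "change": {"actions": ["create"], "after": {"bucket": "example-logs"}}
    }
  ]
}
""")

if __name__ == "__main__":
    for finding in find_risky_changes(SAMPLE_PLAN):
        print("REVIEW BEFORE APPLY:", finding)
```

A check like this does not replace knowing what the flag does, but it makes the dangerous line harder to miss in review.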
There are a lot of similarities between IaC code and a product’s core code. Most non-core code for products is reasonably straightforward, touching only one specific feature or area. However, there is always core code on which everything else depends heavily, and changing any line of this core code has enormous repercussions. And you need to understand it damn well before attempting to make changes.
Most IaC code is like that. There are some less critical lines of code (maybe monitoring, etc.). However, the rest of the code is usually configuration for the underlying infrastructure. And your whole product (and all its features) depends entirely on it. As a result, the margin for error is tiny, and the amount of knowledge it takes to work on it is high.
You will be forced to repeat steps if you are manually working on infrastructure. For example, you will go to the AWS UI, click “Create EKS,” go to S3 and create the necessary S3 buckets, and so on. Also, most likely, you will make a couple of missteps and have to retrace your steps, learn what caused the error, and so on. It’s messy and time-consuming. But you learn how things are wired in your system.
Engineers hate repetition. Thankfully, IaC automates this messiness. Random errors are less likely, and there is almost no need to use the UI to make changes manually. However, there is one downside: you don’t repeat the steps. As a result, you may not remember or learn them.
Ultimately, the person who wrote the code learns a lot, and everybody else learns little (because of a lack of repetition).
No debug; No tests; Declarative code
Most of us are used to working with imperative code. We know how to reason, debug, comment, read, and maintain it. All this experience goes out of the window for Terraform and Helm’s declarative style. Again, don’t get me wrong; hands down, the declarative definition of IaC is a better approach. It’s hard to handle an almost infinite infrastructure state via imperative code.
However, declarative code still requires tooling and approaches. You can’t just set a breakpoint and see that we are creating this role for this role mapping for this resource. You can’t read the tests to see the intent of what we are trying to solve. Most information and relations are not visible in the code. Rather, they require extensive knowledge of external platforms.
This all makes it hard to read and understand IaC. You must be heavily involved in the project to understand what’s happening.
It reminds me of how I felt when working with Ruby on Rails. Writing code was pleasant, elegant, and highly productive. However, dropping into an existing project was challenging. It required extensive knowledge to understand how some obscure flag, combined with some gem (library), affected the whole application.
You can run simple code experiments in an imperative language trivially. You change the code and run it or write a test and run it; often, within seconds, you can assess whether you are moving in the right direction.
This is different for infrastructure code, especially if dependencies are involved. Simple things still may take a matter of seconds. However, complex and heavy things can take minutes, and if you need to test something from scratch (to build out the whole thing), it can easily take 30 minutes, as I mentioned before.
Let me share a prime example of my experience here. We were (and still are) working with Vault. It would have been a reasonably straightforward task if I had installed it and configured it manually. Many articles cover what needs to be done during a manual installation well. The experimentation would probably be shorter and concentrated on getting things working.
However, we had to spend substantial time experimenting on its automation, which wasn’t straightforward. There were several dependencies in our stack, and it often required me to bring the whole stack down and up (or at least do numerous steps to bring Vault down, clean it up, and bring it up again).
The bar for building stacks manually is usually lower, and it requires less experimentation. The bar for IaC is higher. It requires more tests (building a stack from scratch, upgrading it, and destroying it). Often, these things are time-consuming.
Generally, you can get a rather high reproducibility rate for IaC. However, it is not 100%, and things often go wrong.
It all reminds me of problems with UI/integration tests. They are a great idea on paper, but you have to fight flakiness when you have many of them. The situation is similar for IaC (but on a smaller scale). You run automation with a lot of steps. Timing, environmental problems, and so on may affect how well it runs.
As a result, you end up with these pesky problems that happen so often that they’re hard to ignore but are so rare that they’re not easy to diagnose and address.
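One common mitigation for this kind of timing-related flakiness (not something specific to our setup) is to wrap individual automation steps in a retry with exponential backoff. Here is a minimal sketch; `run_with_retries` and the `flaky_step` stand-in are hypothetical names used only for illustration.

```python
import time


def run_with_retries(step, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call step() until it succeeds or attempts run out; re-raise the last error."""
    for attempt in range(1, attempts + 1):
        try:
            return step()
        except Exception as exc:
            if attempt == attempts:
                raise  # Out of attempts: surface the real failure.
            # Wait base_delay, then 2x, then 4x, ... between attempts.
            delay = base_delay * 2 ** (attempt - 1)
            print(f"attempt {attempt} failed ({exc}); retrying in {delay}s")
            sleep(delay)


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_step():
        # Stand-in for a provisioning step that fails twice on timing, then works.
        state["calls"] += 1
        if state["calls"] < 3:
            raise RuntimeError("resource not ready yet")
        return "ready"

    print(run_with_retries(flaky_step, attempts=4, base_delay=0.1))
```

Retries hide symptoms rather than causes, so it is worth logging every retry; otherwise the rare failures get even harder to diagnose.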
This one is interesting. We invested a lot of time in getting everything from the idea stage to deploying a full stack in 30 minutes. Even a person well-versed in all the tooling would have taken time to put things together and ensure they worked. We could have spent just half of that time if we had manually configured several stacks and stopped there.
This investment will pay off over time as we modify the infrastructure. It will also pay off when we need to perform an infrastructure audit, and as we launch more stacks and use them to make developers more productive.
Automation tools are always behind.
This one is secondary for me. However, it makes sense to mention it for the completeness of the overview. IaaS providers constantly innovate and introduce new functionality. Most of it may not be interesting to you. However, sometimes, you eagerly anticipate new functionality. For example, in the previous start-up, we were fortunate that AWS released new network load balancers because the application load balancer didn’t work well for us. Automation tools are always one step behind. As a result, you may have to wait until new functionality is folded into tools like Terraform and Helm.
First of all, IaC is not a silver bullet. It solves tons of issues but introduces its own complexities.
Secondly, it’s less about pure time savings on the operations side and more about unlocking possibilities and higher productivity for engineers. (Hooray for the DevOps movement!)
And finally, it’s incredibly critical to figure out how to share information and train DevOps people. The combination of doing it part-time and having a very high density of required knowledge (per line of code) makes it non-trivial to get up to speed.