09 Aug 2024

Best Practices for Building Scalable Operations Systems and Automation

Operations is about solving problems. You’re given an end goal and need to figure out how to get there, often using automation.

I think ops people are great at solving the problem in front of them fairly easily. However, building something scalable that can be iterated on takes a lot of thought.

Does any of this sound familiar?

  • You’ve spent hours upon hours building a process only for requirements to change shortly afterward. Now when you try to edit anything, everything breaks.
  • You’re struggling to build a process because there are so many requirements, potential future use cases, and improvements you already want to make.
  • A third-party app like Zapier changes and it breaks everything.
  • Someone leaves the company, you take over the system they built, and can’t figure out how it works or how to improve it. Ultimately, you end up rebuilding the process from scratch.

If any of this sounds familiar, you’re in luck because software engineers have already written books on these subjects and have put a lot of thought into best practices.

Think Like an Engineer

If you’ve spent any time in software engineering, you’ll notice that these operational problems look a lot like software problems. Some overlap:

  • You’ve spent weeks building a feature, only to have the requirements change and now you’re struggling to make changes without breaking everything.
  • There is a ton of scope creep causing the feature to be incredibly complicated and difficult to build.
  • Something breaks or there’s a bug in your code that takes your whole system down.
  • You have so much tech debt that you can’t build new features.
  • You start at a company, look at the codebase, and realize that there is spaghetti code everywhere, making it difficult to iterate on any features.

Some software engineering tools, such as automated testing and version control, would be useful for ops processes but aren’t fully available to most ops teams (someone take this business idea). Since they aren’t in our toolbox, I’m going to skip them in this overview.

Don’t Repeat Yourself (DRY)

What It Means

“Don’t repeat yourself” means not duplicating text or logic across your system. If you’re copying and pasting something into several places, that’s a major code smell.

Why It’s Important

A system with a lot of the same text or logic in multiple places is hard to maintain. If a bug is found or a change is needed, you need to remember to update it in several places, which leads to errors.

Additionally, it’s hard for someone you’re collaborating with to know all the places where something needs to be changed even if you’re using documentation. Instead, you should centralize everything into one place and direct a co-worker there.

Example

Say a form has a label value “X” that triggers an Airtable automation when someone enters it. If that value is hard-coded in a bunch of different places, renaming the label means renaming it in 10 different places.

Centralize these labels in a lookup table and use it across all your systems. You should use an index or ID that stays the same in one column and match it to the text in the other. That way your index never changes, but your text can.
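
As a rough sketch of that lookup-table idea in Python: the IDs (`lbl_001`, `lbl_002`), the label text, and the `should_trigger` check are all hypothetical stand-ins for whatever your form and automation tool actually use.

```python
# Hypothetical lookup table: the ID column never changes, the label text can.
LABELS = {
    "lbl_001": "X",
    "lbl_002": "Needs review",
}

# Automations reference the stable ID, never the display text.
TRIGGER_LABEL_ID = "lbl_001"


def label_text(label_id: str) -> str:
    """Return the current display text for a label ID."""
    return LABELS[label_id]


def rename_label(label_id: str, new_text: str) -> None:
    """Rename a label in exactly one place."""
    LABELS[label_id] = new_text


def should_trigger(submitted_label_id: str) -> bool:
    """Stand-in for the automation condition: compare IDs, not text."""
    return submitted_label_id == TRIGGER_LABEL_ID


# Renaming the label does not touch the trigger logic.
rename_label("lbl_001", "Priority")
```

After the rename, `should_trigger("lbl_001")` still returns `True`, because nothing downstream depends on the text.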

Encapsulate Logic and Single Responsibility Principle (SRP)

What It Means

Group related tasks and logic together in reusable modules, scripts, or workflows. Think about each of your systems in terms of its inputs and outputs, and hide as much of the internal logic as you can behind them.

You should build each script, tool, or process to do one thing and do it well. Avoid combining unrelated tasks into a single automation as this leads to a lot of dependencies and makes it harder to extend functionality later.

Why It’s Important

Encapsulating logic and focusing on SRP makes it easier to understand systems, troubleshoot, and maintain them. It reduces side effects so that if you change one thing in your process, it won’t take the entire system down.

Example

Continuing with my Airtable example, I’ll often separate logic for collecting, transforming, and automating processes.

That means I’ll have a base for collecting information from users. In that base, I’ll have a table for raw data collection, several tables for transforming and standardizing data, and then a cleaned-up export table that feeds another base. This way, the first base only has to collect and standardize data, and your automated systems only have to care about its inputs and outputs.
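
The same separation can be sketched in Python, with one function per responsibility. The field names (`name`, `email`) and the cleanup rules are assumptions made up for illustration, not part of any Airtable setup.

```python
def collect(raw_rows: list[dict]) -> list[dict]:
    """Collection step: keep submissions exactly as entered."""
    return [dict(row) for row in raw_rows]


def standardize(rows: list[dict]) -> list[dict]:
    """Transformation step: trim whitespace and normalize email casing."""
    return [
        {"name": row["name"].strip(), "email": row["email"].strip().lower()}
        for row in rows
    ]


def export(rows: list[dict]) -> list[dict]:
    """Export step: expose only the fields downstream automations rely on."""
    return [{"email": row["email"], "name": row["name"]} for row in rows]


# Each step only knows about its own input and output.
raw = [{"name": "  Ada Lovelace ", "email": "ADA@Example.com "}]
cleaned = export(standardize(collect(raw)))
```

If the cleanup rules change, only `standardize` changes; anything consuming the export keeps working as long as the output fields stay the same.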

Only Build What You Need (YAGNI - You Aren't Gonna Need It) and Keep It Simple Stupid (KISS)

What It Means

Don’t over-engineer automation processes and keep them as simple as possible. Ignore all of those extra bells and whistles until they are absolutely necessary.

Why It’s Important

Building unnecessary automation adds complexity, adds more time to ship, introduces more dependencies, and makes it difficult to augment processes in the future.

Example

You’re building a system that collects dates. When you’re building this system, you think it’d be nice to reformat this date into an MM/DD/YYYY format for your end users.

Although this seems like an innocent improvement, you don’t know which third-party apps you may integrate with in the future or what date format they’ll require, so the reformatting could cause breaking changes. In addition, if you end up working with international users, the MM/DD/YYYY format may cause confusion and become another thing you need to change.
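
One way to picture this in Python: leave the stored value alone and add formatting only at the point where a specific consumer needs it. Storing the date as an ISO 8601 string and the `for_us_display` helper are my assumptions for this sketch, not a rule.

```python
from datetime import date

# Store the date as collected, in one unambiguous form (ISO 8601 here).
stored = date(2024, 8, 9).isoformat()


def for_us_display(iso_date: str) -> str:
    """Format only at the edge, once a consumer actually asks for it."""
    return date.fromisoformat(iso_date).strftime("%m/%d/%Y")
```

The stored value never changes, so a future integration that wants a different format gets its own small helper instead of forcing a migration.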

Focus on Readability

What It Means

You’re building processes and automation for humans, not computers. Make sure that when you’re building your process, you’re building it for other people to understand.

Why It’s Important

Showing your work and providing more details will make it easier for others to collaborate on the systems you build. It’ll also be easier for you to remember the logic you used when you revisit the system 6 months later.

Example

In Excel, instead of writing out a really complicated formula in one cell, break it into multiple steps and columns. Don’t have a bunch of nested if statements. Instead, break it into column-by-column transformations, so someone doesn’t need to decipher your formula’s logic and can follow each step easily.
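
The same idea carries over to scripts. Here is a small Python sketch where each condition gets its own name, like a helper column, instead of living inside one nested expression. The shipping rules and the 50 threshold are hypothetical.

```python
def shipping_tier(order_total: float, is_member: bool, is_international: bool) -> str:
    """Hypothetical rule set, written as named steps instead of nested ifs."""
    # Step 1: name each condition, like a helper column in a spreadsheet.
    meets_minimum = order_total >= 50
    qualifies_for_free = is_member and meets_minimum and not is_international

    # Step 2: the final decision reads top to bottom.
    if qualifies_for_free:
        return "free"
    if is_international:
        return "international"
    return "standard"
```

Someone debugging this can check `meets_minimum` or `qualifies_for_free` on their own, the same way they would inspect an intermediate column.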

Document Your Processes

What It Means

I have a bit of an opinionated viewpoint here on documentation. If you do all of the above well, then your process should be living and breathing documentation.

I think the part of the documentation that is useful is the initial scoping. That includes the business reasons for building the process, the decisions you made around what to address with the automation and what not to, the business metrics to measure the outcomes of your process, etc.

Why It’s Important

Documentation ensures that others can understand, maintain, and troubleshoot your systems. It’s essential for knowledge sharing and continuity. It’s especially useful when revisiting a project in 6 months and trying to iterate on what you’ve built.

Example

If you still want to include documentation, my recommendation would be:

  • Build the documentation INTO your systems. Don’t have a separate Jira system that references the project. Having documentation in the context in which someone is working will make it easier to understand and use.
  • Revisit your documentation regularly to ensure it’s up to date. Updating documentation should be part of shipping a feature, and the documentation as a whole should be reviewed at regular intervals.

Final Thoughts

I think of this process as the “software engineering” side of operations. That said, there’s an entire bucket of “product” that is important for an operations role.

What I mean is that many of these problems can be solved upfront, before you ever start building anything:

  • Business Need: Does the business really need this? Is this really a problem that we need to solve? Remember that the fastest way to ship something is to not ship it at all because you don’t need it.
  • Better Scoping: By spending more time upfront, you can better scope a problem and process to avoid scope creep.
  • Minimum Viable Product: You don’t always need to build scalable processes. Depending on the business need, sometimes you just need to build something quickly that you know is a band-aid and plan to throw away later.