Infra as Code · Engineering Blog

Designing Terraform modules that don't rot

A pragmatic rubric for module boundaries, versioning, and CI validation — based on three years of running a shared Terraform module library that 40+ engineers ship into every day.

6 min read

Most Terraform module libraries start with the best intentions. Someone writes a clean vpc module. Someone else writes rds. Six months later the vpc module has 47 input variables, three of them are deprecated-but-still-required-don't-ask, and nobody bumps the version because every upgrade is a risk. This is module rot — and it's almost entirely preventable.

These are the patterns we've landed on after running shared module libraries across multiple client engagements: the ones that have held up under dozens of consumers and years of drift.

The wrong-sized module problem

The two most common failure modes are too big and too small.

Too big is "the platform module" that provisions a VPC, an EKS cluster, an RDS instance, a Redis cluster, and an ALB in one apply. Every consumer needs only some of it but takes all of it. Every change is a high-blast-radius change. Plans take 4 minutes.

Too small is the aws-security-group module that's just a wrapper around the resource with the same inputs renamed. It adds no value, just an extra layer of indirection — and a place for bugs to hide between the wrapper and the underlying resource.
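To make the "too small" case concrete, here is a minimal sketch of what such a wrapper tends to look like. The variable names are hypothetical and chosen only to show the renaming; nothing here comes from a real module.

```hcl
# Hypothetical "too small" module: every input maps 1:1 onto a resource argument.
variable "sg_name" {
  type = string
}

variable "sg_description" {
  type = string
}

variable "network_id" {
  type = string
}

resource "aws_security_group" "this" {
  name        = var.sg_name        # just "name", renamed
  description = var.sg_description # just "description", renamed
  vpc_id      = var.network_id     # just "vpc_id", renamed
}
```

A consumer of this module writes as many lines as they would against the raw resource, and now has to read two files to learn what vpc_id is called.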

★ The right-size rubric

A module is the right size if (a) replacing it with raw resources would meaningfully increase boilerplate at the call site, AND (b) you can describe what it does in one sentence without using the word "and".

Inputs: the "sensible default, escape hatch" pattern

The single most useful module pattern we've adopted is what we call "sensible default, escape hatch". Every input variable ships with a sane default. Complex configuration accepts an optional override input (a map or a list of objects) for the cases the abstraction can't anticipate.

variable "tags" {
  type        = map(string)
  default     = {}
  description = "Tags applied to all resources. Module adds standard tags automatically."
}

variable "extra_security_group_rules" {
  type = list(object({
    type        = string
    from_port   = number
    to_port     = number
    protocol    = string
    cidr_blocks = list(string)
  }))
  default     = []
  description = "Escape hatch for rules the module doesn't model natively."
}

The escape hatch is the difference between "consumers fork the module" and "consumers stay on the shared version." Forks are how libraries die. Every time you make a consumer choose between forking and waiting for a feature, you're slowly killing your own library.
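Inside the module, the two variables above might be consumed roughly like this. This is a sketch under assumptions, not the actual library code: the vpc_id input, the resource names, and the standard tag keys are all hypothetical, and it relies on the tags and extra_security_group_rules variables declared earlier.

```hcl
# Hypothetical module internals; names and tag keys are assumptions.
variable "vpc_id" {
  type        = string
  description = "VPC the security group is created in."
}

locals {
  # Tags the module always applies. Caller-supplied tags win on key
  # conflicts because they come last in merge().
  standard_tags = {
    ManagedBy = "terraform"
    Module    = "example"
  }
  tags = merge(local.standard_tags, var.tags)
}

resource "aws_security_group" "this" {
  name_prefix = "example-"
  vpc_id      = var.vpc_id
  tags        = local.tags
}

# One rule per escape-hatch entry. count keeps the sketch short; note that
# reordering the list changes indexes and makes Terraform recreate rules.
resource "aws_security_group_rule" "extra" {
  count = length(var.extra_security_group_rules)

  security_group_id = aws_security_group.this.id
  type              = var.extra_security_group_rules[count.index].type
  from_port         = var.extra_security_group_rules[count.index].from_port
  to_port           = var.extra_security_group_rules[count.index].to_port
  protocol          = var.extra_security_group_rules[count.index].protocol
  cidr_blocks       = var.extra_security_group_rules[count.index].cidr_blocks
}
```

With both defaults left empty, the module behaves exactly as designed. A consumer with an unusual requirement adds one list entry instead of forking.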

Versioning: semver, enforced

Pin every module reference to a tag, never a branch. Use semantic versioning religiously, and treat any change to an existing input or output (a rename, a type change, a new default) as a major version bump, even if it "shouldn't" break anyone. Consumers will surprise you.

Three pieces of tooling enforce this:

  1. Conventional commits in the module repo (feat:, fix:, feat!:)
  2. release-please to auto-generate version bumps and changelogs from commit history
  3. A monthly "stale module" CI job that opens PRs against every consumer repo on minor/patch bumps

module "vpc" {
  source  = "git::ssh://git@github.com/org/tf-modules.git//vpc?ref=v3.4.1"
  # ^ pinned tag, not main, not v3
  cidr_block = "10.42.0.0/16"
}

The auto-PR job is the unsung hero of this setup. Without it, consumers will sit on v1.2.0 for two years. With it, the cost of a minor bump is one approval click, and your library stays alive.

Validation in CI is non-negotiable

The fastest way to kill a module library's credibility is to ship a broken version. Every module repo in our library runs four checks on every PR:

The terratest piece is the one most teams skip. Don't. The 12-minute round trip of "apply in a sandbox, assert, destroy" has caught more real bugs for us than every other check combined.
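For a feel of what that apply, assert, destroy loop checks, here is a minimal sketch written with Terraform's native test framework (Terraform 1.6+) rather than terratest, purely so the example stays in HCL. It is not our terratest suite. The file path and the aws_vpc.this resource name are assumptions about how a vpc module might be laid out; cidr_block is the input used in the module call earlier.

```hcl
# tests/basic.tftest.hcl (hypothetical path inside the vpc module)
# Assumes provider credentials point at a sandbox account.

variables {
  cidr_block = "10.42.0.0/16"
}

run "vpc_uses_requested_cidr" {
  # Really create the resources, then evaluate the assertions below.
  command = apply

  assert {
    # aws_vpc.this is an assumed resource name inside the module.
    condition     = aws_vpc.this.cidr_block == var.cidr_block
    error_message = "VPC CIDR block does not match the cidr_block input."
  }
}
```

Running terraform test from the module directory applies the configuration, evaluates the assertions, and then attempts to destroy whatever it created. terratest gives you the same shape with the full power of Go for assertions that need to call the cloud API directly.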

"Modules that aren't continuously applied in CI are documentation, not infrastructure code."

Docs as a contract

Every module repo has a README.md generated by terraform-docs in a pre-commit hook. The README is the public API. If an input isn't in the README's Inputs table, consumers can't rely on it. If you change an input description, you've changed the contract, and the changelog should say so.

# .pre-commit-config.yaml
repos:
  - repo: https://github.com/terraform-docs/terraform-docs
    rev: v0.17.0
    hooks:
      # Regenerates the Inputs/Outputs tables in README.md on every commit
      - id: terraform-docs-go
        args: ["markdown", "table", "--output-file", "README.md", "./"]

Bonus: tools like Backstage's TechDocs can ingest these READMEs to give you a searchable module catalogue with very little extra work.

The smell test

After three years, these are the signals that tell us a module is starting to rot:

When you see two or more of those, it's time to design a v2 — not patch around the v1. Ship the v2 in parallel, give consumers a deprecation window (we use 90 days), then delete the v1.

// the production checklist

★ Tl;dr

Right-sized scope, sensible defaults with escape hatches, strict semver, terratest in CI, and a README treated as a public contract. Do those five things and your module library will scale past 30 modules and 40 engineers without becoming the next thing you have to rewrite.


NodeOps360 Engineering

// Platform & SRE Practice

We build, run, and write about cloud-native platforms in production. Every post is grounded in real engagements — no theory-only takes.
