NodeOps360 — Designing Terraform Modules That Don't Rot

Most Terraform module libraries start with the best intentions. Someone writes a clean vpc module. Someone else writes rds. Six months later the vpc module has 47 input variables, three of them are deprecated-but-still-required-don't-ask, and nobody bumps the version because every upgrade is a risk. This is module rot — and it's almost entirely preventable.

These are the patterns we've landed on after running shared module libraries across multiple client engagements: the ones that have held up under dozens of consumers and years of drift.

The wrong-sized module problem

The two most common failure modes are too big and too small.

Too big is "the platform module" that provisions a VPC, an EKS cluster, an RDS instance, a Redis cluster, and an ALB in one apply. Every consumer needs only some of it but takes all of it. Every change is a high-blast-radius change. Plans take 4 minutes.

Too small is the aws-security-group module that's just a wrapper around the resource with the same inputs renamed. It adds no value, just an extra layer of indirection — and a place for bugs to hide between the wrapper and the underlying resource.
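To make the "too small" case concrete, here is a minimal sketch of what such a wrapper tends to look like. The file path and the variable names (sg_name, sg_description, network_id) are hypothetical, chosen only to illustrate the shape: every input maps one-to-one onto a resource argument under a different name.

```hcl
# modules/aws-security-group/main.tf (hypothetical path)
# Anti-pattern sketch: the module only renames arguments of the resource it wraps.

variable "sg_name" {
  type        = string
  description = "Passed straight through to aws_security_group.name."
}

variable "sg_description" {
  type        = string
  description = "Passed straight through to aws_security_group.description."
}

variable "network_id" {
  type        = string
  description = "Passed straight through to aws_security_group.vpc_id."
}

resource "aws_security_group" "this" {
  name        = var.sg_name
  description = var.sg_description
  vpc_id      = var.network_id
}

output "id" {
  value       = aws_security_group.this.id
  description = "ID of the wrapped security group."
}
```

A consumer of this module writes the same three lines they would have written against the raw resource, but now has to look up which wrapper name corresponds to which argument.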

★ The right-size rubric

A module is the right size if (a) replacing it with raw resources would meaningfully increase boilerplate at the call site, AND (b) you can describe what it does in one sentence without using the word "and".

Inputs: the "sensible default, escape hatch" pattern

The single most useful module pattern we've adopted is what we call "sensible default, escape hatch". Every input variable ships with a sane default. Where the abstraction can't anticipate every case, the module also accepts an escape-hatch input (a map or list of raw settings) that is passed through to the underlying resources.

variable "tags" {
  type        = map(string)
  default     = {}
  description = "Tags applied to all resources. Module adds standard tags automatically."
}

variable "extra_security_group_rules" {
  type = list(object({
    type        = string # "ingress" or "egress"
    from_port   = number
    to_port     = number
    protocol    = string
    cidr_blocks = list(string)
  }))
  default     = []
  description = "Escape hatch for rules the module doesn't model natively."
}

The escape hatch is the difference between "consumers fork the module" and "consumers stay on the shared version." Forks are how libraries die. Every time you make a consumer choose between forking and waiting for a feature, you're slowly killing your own library.
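Inside the module, the two variables above might be consumed roughly as follows. This is a sketch under assumptions, not a reference implementation: the vpc_id input, the standard tag keys, the "example" names, and the choice to let module-owned tags win over caller-supplied ones are all hypothetical.

```hcl
variable "vpc_id" {
  type        = string
  description = "VPC to create the security group in (hypothetical input for this sketch)."
}

locals {
  # Hypothetical standard tags. Later arguments to merge() win, so
  # module-owned keys override caller-supplied ones on collision.
  standard_tags = {
    ManagedBy = "terraform"
    Module    = "example"
  }
  tags = merge(var.tags, local.standard_tags)

  # Content-derived keys, so reordering the input list doesn't churn resources.
  extra_rules = {
    for rule in var.extra_security_group_rules :
    "${rule.type}-${rule.protocol}-${rule.from_port}-${rule.to_port}-${join(",", rule.cidr_blocks)}" => rule
  }
}

resource "aws_security_group" "this" {
  name_prefix = "example-"
  vpc_id      = var.vpc_id
  tags        = local.tags
}

# Escape hatch: one rule resource per caller-supplied entry.
resource "aws_security_group_rule" "extra" {
  for_each = local.extra_rules

  security_group_id = aws_security_group.this.id
  type              = each.value.type
  from_port         = each.value.from_port
  to_port           = each.value.to_port
  protocol          = each.value.protocol
  cidr_blocks       = each.value.cidr_blocks
}
```

With the default of an empty list, the for_each map is empty and no extra rules are created, so consumers who don't need the escape hatch never see it.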

Versioning: semver, enforced

Pin every module reference to a tag, never a branch. Use semantic versioning religiously, and treat any change to module inputs or outputs as a major version bump — even if it "shouldn't" break anyone. Consumers will surprise you.

The tooling we use to enforce this:

Conventional commits in the module repo (feat:, fix:, feat!:)
release-please to auto-generate version bumps and changelogs from commit history
A monthly "stale module" CI job that opens PRs against every consumer repo on minor/patch bumps

module "vpc" {
  # Pinned to a full tag: not main, not v3
  source = "git::ssh://git@github.com/org/tf-modules.git//vpc?ref=v3.4.1"

  cidr_block = "10.42.0.0/16"
}

The auto-PR job is the unsung hero of this setup. Without it, consumers will sit on v1.2.0 for two years. With it, the cost of a minor bump is one approval click, and your library stays alive.

Validation in CI is non-negotiable

The fastest way to kill a module library's credibility is to ship a broken version. Every module repo in our library runs four checks on every PR:

terraform fmt -check + terraform validate — table stakes
tflint with the AWS ruleset — catches deprecated arguments before they hit a plan
tfsec or checkov — catches "you forgot to enable encryption" before it ships
terratest — actually provisions the module in a sandbox account and asserts on real resources

The terratest piece is the one most teams skip. Don't. The 12-minute round trip of "apply in a sandbox, assert, destroy" has caught more real bugs for us than every other check combined.
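Terratest itself is written in Go, but what it applies is plain Terraform. A minimal sketch of the fixture such a test might point at is shown here, assuming a hypothetical vpc/examples/basic directory in the module repo, a name_suffix variable supplied by the test, a vpc module that accepts tags, and a vpc_id output on the module. None of those names come from our actual library; adjust them to whatever your module exposes.

```hcl
# vpc/examples/basic/main.tf (hypothetical fixture applied by the test)

terraform {
  required_version = ">= 1.3"
}

provider "aws" {
  region = "eu-west-1" # assumption: region of the sandbox account
}

variable "name_suffix" {
  type        = string
  description = "Unique suffix supplied by the test run so parallel runs don't collide."
}

module "vpc" {
  # Local path, so the test exercises the code in the PR rather than a released tag.
  source = "../../"

  cidr_block = "10.42.0.0/16"
  tags = {
    Name    = "tt-${var.name_suffix}"
    Purpose = "terratest"
  }
}

output "vpc_id" {
  value       = module.vpc.vpc_id # assumes the module exposes a vpc_id output
  description = "Read by the test after apply and asserted against the real resource."
}
```

Keeping the fixture this small is deliberate: the test then checks the module's defaults, which is what most consumers will actually run.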

"Modules that aren't continuously applied in CI are documentation, not infrastructure code."

Docs as a contract

Every module repo has a generated README.md via terraform-docs on pre-commit. The README is the public API. If an input isn't in the README's Inputs table, consumers can't rely on it. If you change an input description, you've changed the contract — and the changelog should say so.

# .pre-commit-config.yaml
repos:
  - repo: https://github.com/terraform-docs/terraform-docs
    rev: v0.17.0
    hooks:
      - id: terraform-docs-go
        # Regenerate the Inputs/Outputs tables in README.md from the module source
        args: ["markdown", "table", "--output-file", "README.md", "./"]
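terraform-docs builds the Inputs and Outputs tables from the type, default, and description fields in the module source, so those fields are where the contract actually gets written. A small sketch of what that looks like in practice, using a hypothetical vpc module with a cidr_block input and a vpc_id output:

```hcl
variable "cidr_block" {
  type        = string
  description = "IPv4 CIDR range for the VPC, for example 10.42.0.0/16."
}

resource "aws_vpc" "this" {
  cidr_block = var.cidr_block
}

output "vpc_id" {
  value       = aws_vpc.this.id
  description = "ID of the VPC created by this module."
}
```

Because cidr_block has no default, it should show up as a required input in the generated table. Editing either description string changes the README on the next commit, which is exactly the kind of change the changelog should mention.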

Bonus: tools like Backstage's TechDocs can ingest these READMEs to give you a searchable module catalogue with very little extra work.

The smell test

After three years, these are the signals that tell us a module is starting to rot:

The README has a "deprecated but still required" note
Consumers have started writing wrapper modules around it
The last 3 PRs have been "increment patch version, nothing breaking" cleanup commits
Nobody on the platform team can confidently explain what enable_legacy_mode_v2 actually does
You find yourself saying "yeah but don't pass that input, just leave it default"

When you see two or more of those, it's time to design a v2 — not patch around the v1. Ship the v2 in parallel, give consumers a deprecation window (we use 90 days), then delete the v1.

// the production checklist

Module does one thing, describable in one sentence
Every input has a default; complex inputs have an escape-hatch override
Versioned with semver; every reference pinned to a tag
fmt, validate, tflint, tfsec, terratest run on every PR
README auto-generated via terraform-docs pre-commit
Conventional commits + release-please wired up
Monthly auto-PR job opens version-bump PRs in consumer repos
A 90-day deprecation policy you actually enforce

★ TL;DR

Right-sized scope, sensible defaults with escape hatches, strict semver, terratest in CI, and a README treated as a public contract. Do those five things and your module library can scale past 30 modules and 40 engineers without becoming the next thing you have to rewrite.

NodeOps360 Engineering

// Platform & SRE Practice

We build, run, and write about cloud-native platforms in production. Every post is grounded in real engagements — no theory-only takes.
