TLS looks simple when everything lives on one machine. The architecture changes when you have multiple nodes, multiple domains, and multiple Traefik instances all trying to be helpful at the same time.

What I learned is that certificate management and traffic routing should usually not be the same responsibility in a distributed setup.

Instead of letting every Traefik instance talk to Let’s Encrypt directly, a cleaner pattern is to separate the system into three layers:

  1. One certificate service that handles ACME and renewals.
  2. One shared storage location for certificate state and generated PEM files.
  3. Multiple Traefik instances that only load certificates and route traffic.

That separation makes TLS more predictable, easier to debug, and much safer to scale.

The Core Problem

The default Traefik ACME flow is fine for a single host. It starts to break down when:

  • more than one instance can request certificates
  • ACME state is not shared safely
  • certificate renewal is hidden inside the edge layer
  • DNS validation or provider rate limits become part of normal operations

In a multi-node environment, that creates avoidable risk: duplicate certificate requests, renewal conflicts, a shared acme.json that was not designed for concurrent writers, and a routing layer that carries more operational responsibility than it should.

The deeper lesson is not just “use a different tool.” The lesson is architectural:

Keep certificate lifecycle management in a control plane, and keep the ingress layer focused on serving traffic.

The Pattern That Works Better

The model I like now is:

1. ACME control plane

Run one dedicated certificate service using acme.sh.

Its job is to:

  • register with Let’s Encrypt
  • create DNS challenge records
  • issue and renew certificates
  • write certificate files to shared storage

This service should be the only thing in the environment that talks to ACME for these domains.

2. Shared state layer

Store both ACME state and issued certificates on shared storage.

That might be:

  • CephFS
  • NFS
  • EFS
  • another replicated filesystem you already trust

What matters is that every edge node sees the same certificate files and the same renewal outputs.

3. Stateless Traefik edge

Traefik should consume certificates through the file provider, not generate them itself.

That means Traefik:

  • reads PEM files from shared storage
  • selects the right certificate via SNI
  • terminates TLS
  • routes requests to the right backend

This keeps ingress simpler and easier to scale horizontally.

Why DNS-01 Matters

If you want wildcard certificates from Let’s Encrypt, DNS-01 is the validation method to use. It is the only challenge type Let’s Encrypt accepts for wildcard names.

In my case, Route53 was the DNS provider, but the pattern is broader than AWS. The point is to use a DNS provider whose API your ACME client can drive, so challenge records are created and removed without manual steps.

Wildcard certificates let you group domains in a way that matches the structure of your environment, for example:

  • *.apps.example.com
  • *.internal.example.com
  • *.lab.example.com

That reduces certificate sprawl and makes the system easier to reason about.
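
One detail worth checking when grouping domains: a wildcard covers exactly one leftmost label. The sketch below is a small helper for sanity-checking hostnames against a wildcard; the function name wildcard_covers is my own, and the comparison is case-sensitive, so lowercase the inputs first.

```shell
# Returns 0 if a wildcard such as '*.apps.example.com' covers the given host.
# The wildcard stands for exactly one leftmost label, so it covers neither the
# bare parent domain nor names nested two labels deep.
wildcard_covers() {
  wc_pattern=$1
  wc_host=$2
  case $wc_pattern in
    '*.'?*) wc_suffix=${wc_pattern#??} ;;   # strip the leading '*.'
    *) return 1 ;;                          # not a wildcard pattern
  esac
  case $wc_host in
    *.*) ;;
    *) return 1 ;;                          # host has no dot at all
  esac
  wc_label=${wc_host%%.*}                   # leftmost label of the host
  wc_rest=${wc_host#*.}                     # everything after the first dot
  [ -n "$wc_label" ] && [ "$wc_rest" = "$wc_suffix" ]
}
```

For example, wildcard_covers '*.apps.example.com' 'grafana.apps.example.com' succeeds, while the same pattern fails for apps.example.com and for a.b.apps.example.com. That is why the issuance commands later request both the bare name and the wildcard.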

What I Learned From The Design Choices

A few design decisions matter more than they first appear to.

Separation of concerns is the real win

The biggest improvement is not that acme.sh is doing the work instead of Traefik. The improvement is that certificate issuance, certificate storage, and traffic routing are handled by different layers.

That gives you clearer failure domains:

  • if ACME breaks, routing can still work with existing certs
  • if Traefik breaks, issuance and renewal keep running independently of it
  • if DNS validation fails, the problem is easier to isolate

One ACME client is better than many

Distributed systems do not get better because more nodes are allowed to request the same certificate.

A single ACME client avoids:

  • duplicate requests
  • rate limiting
  • race conditions around renewal
  • hidden coupling between edge nodes

Shared files are more reusable than edge-specific ACME state

Writing fullchain.pem and privkey.pem files to shared storage is a very portable approach.

Those files can be reused by:

  • Traefik
  • internal APIs
  • other reverse proxies
  • Kubernetes ingress patterns later

That portability matters because it keeps your certificate layer useful even if your ingress layer changes later.

How I Would Set This Up For Others

If I were implementing this for another homelab, a small platform team, or a multi-node Docker setup, I would do it in this order.

Step 1. Create shared directories

Create a shared location for ACME state and generated certificates.

An example layout:

/shared/certs/
  acme/
  live/
    apps.example.com/
      fullchain.pem
      privkey.pem
    internal.example.com/
      fullchain.pem
      privkey.pem

The acme/ directory stores account and renewal state. The live/ directory stores the certificates Traefik will read.
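
A minimal sketch for creating that layout, assuming the shared mount is already in place. The function name create_cert_layout and the permission modes are my own choices, not requirements of acme.sh or Traefik; adjust them to whichever users your containers run as.

```shell
# Create the acme/ and live/ tree under a root directory, with one
# subdirectory under live/ per domain group passed as an argument.
create_cert_layout() {
  layout_root=$1
  shift
  mkdir -p "$layout_root/acme" "$layout_root/live" || return 1
  # Account keys and renewal state: only the ACME service needs access
  chmod 700 "$layout_root/acme" || return 1
  for layout_group in "$@"; do
    mkdir -p "$layout_root/live/$layout_group" || return 1
    # Private keys live here, so keep the directory closed to other users
    chmod 750 "$layout_root/live/$layout_group" || return 1
  done
}
```

Usage for the layout above would be: create_cert_layout /shared/certs apps.example.com internal.example.com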

Step 2. Run a single ACME service

A small acme.sh container is enough. The important part is that it has:

  • DNS provider credentials
  • access to shared storage
  • a renewal schedule

Conceptually, it looks like this:

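services:
  acme:
    image: neilpang/acme.sh:latest
    environment:
      # Route53 credentials read by the dns_aws hook
      AWS_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID}
      AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY}
      AWS_DEFAULT_REGION: us-east-1
    volumes:
      # /acme.sh is the image's config home: account keys and renewal state
      - /shared/certs/acme:/acme.sh
      # Installed PEM files that Traefik will read
      - /shared/certs/live:/certs
    # Keep the container running so its built-in cron handles renewals
    command: daemon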

Then issue certificates with DNS validation and install them into predictable file paths:

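# acme.sh 3.x defaults to ZeroSSL, so select Let's Encrypt explicitly
acme.sh --issue \
  --server letsencrypt \
  --dns dns_aws \
  -d apps.example.com \
  -d '*.apps.example.com'

# Create the target directory before installing into it
mkdir -p /certs/apps.example.com

acme.sh --install-cert -d apps.example.com \
  --fullchain-file /certs/apps.example.com/fullchain.pem \
  --key-file /certs/apps.example.com/privkey.pem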

Repeat that pattern for each domain group you want to isolate.
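
The repetition can be wrapped in a small loop. This is a sketch, not a drop-in script: ACME_BIN and CERT_DIR are hypothetical variables I introduced so the loop can be dry-run with echo, and it assumes every group wants the bare name plus its wildcard. It treats exit code 2 from acme.sh --issue as "not due for renewal yet" and still runs the install step.

```shell
# Hypothetical overrides: set ACME_BIN=echo to print commands instead of running them
ACME_BIN=${ACME_BIN:-acme.sh}
CERT_DIR=${CERT_DIR:-/certs}

# Issue (or skip, if not yet due) and install the certificate for one domain group
issue_group() {
  ig_name=$1
  mkdir -p "$CERT_DIR/$ig_name" || return 1
  ig_rc=0
  "$ACME_BIN" --issue --server letsencrypt --dns dns_aws \
    -d "$ig_name" -d "*.$ig_name" || ig_rc=$?
  # 0 = issued, 2 = skipped because the existing certificate is still fresh
  if [ "$ig_rc" -ne 0 ] && [ "$ig_rc" -ne 2 ]; then
    return 1
  fi
  "$ACME_BIN" --install-cert -d "$ig_name" \
    --fullchain-file "$CERT_DIR/$ig_name/fullchain.pem" \
    --key-file "$CERT_DIR/$ig_name/privkey.pem"
}

# Process every group, reporting failures without aborting the rest
issue_all() {
  ia_failed=0
  for ia_group in "$@"; do
    if ! issue_group "$ia_group"; then
      echo "issuance failed for $ia_group" >&2
      ia_failed=1
    fi
  done
  return "$ia_failed"
}
```

Inside the ACME container this would be called as: issue_all apps.example.com internal.example.com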

Step 3. Mount the shared certificate path into Traefik

Each Traefik instance should mount the same shared certificate directory read-only.

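services:
  traefik:
    image: traefik:v3.3
    command:
      # Static config: load dynamic files from this directory and watch them
      - --providers.file.directory=/etc/traefik/dynamic
      - --providers.file.watch=true
    volumes:
      - /shared/certs/live:/certs:ro
      - /shared/traefik/tls.yml:/etc/traefik/dynamic/tls.yml:ro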

Step 4. Point Traefik at the certificates with the file provider

Your dynamic TLS config can stay straightforward:

tls:
  certificates:
    - certFile: /certs/apps.example.com/fullchain.pem
      keyFile: /certs/apps.example.com/privkey.pem
    - certFile: /certs/internal.example.com/fullchain.pem
      keyFile: /certs/internal.example.com/privkey.pem

Now Traefik just loads certs and serves them. It no longer needs to be the ACME client.
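
If the number of domain groups grows, the dynamic file can be generated from whatever is present under live/ instead of being edited by hand. A rough sketch under my own assumptions: the function name generate_tls_yaml is hypothetical, the first argument is the directory as the generator sees it, and the second is the path where Traefik mounts the same directory.

```shell
# Print a Traefik dynamic TLS config listing every group directory that
# contains both a non-empty fullchain.pem and a non-empty privkey.pem.
generate_tls_yaml() {
  gen_live_dir=$1     # where this script sees the certificates
  gen_mount_dir=$2    # where Traefik sees them, e.g. /certs
  printf 'tls:\n  certificates:\n'
  for gen_dir in "$gen_live_dir"/*/; do
    gen_name=$(basename "$gen_dir")
    # Skip groups that are incomplete (or the unexpanded glob if none exist)
    if [ ! -s "$gen_dir/fullchain.pem" ] || [ ! -s "$gen_dir/privkey.pem" ]; then
      continue
    fi
    printf '    - certFile: %s/%s/fullchain.pem\n' "$gen_mount_dir" "$gen_name"
    printf '      keyFile: %s/%s/privkey.pem\n' "$gen_mount_dir" "$gen_name"
  done
}
```

A call such as generate_tls_yaml /shared/certs/live /certs > /shared/traefik/tls.yml reproduces the hand-written file above for the two example groups. Groups without both files are left out, so a half-issued certificate does not end up referenced in the config.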

Step 5. Add operational checks

Do not stop at issuance. Add visibility around renewal and expiry.

At minimum I would check:

  • ACME container logs
  • DNS validation success
  • presence of expected PEM files
  • certificate expiration windows
  • whether Traefik is actually serving the expected certificate
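
The file-presence and expiry checks are easy to script. The sketch below assumes openssl is installed wherever the check runs; the function name check_cert and the 21-day default threshold are my own choices, so pick a window that fits your renewal schedule.

```shell
# Check one certificate directory.
# Exit codes: 0 = fine, 1 = a PEM file is missing or empty,
#             2 = certificate expires within the window (or cannot be parsed)
check_cert() {
  cc_dir=$1
  cc_min_days=${2:-21}
  for cc_file in fullchain.pem privkey.pem; do
    if [ ! -s "$cc_dir/$cc_file" ]; then
      echo "MISSING $cc_dir/$cc_file"
      return 1
    fi
  done
  # -checkend N exits non-zero if the certificate expires within N seconds
  if ! openssl x509 -checkend $((cc_min_days * 86400)) -noout \
      -in "$cc_dir/fullchain.pem" >/dev/null 2>&1; then
    echo "EXPIRING $cc_dir/fullchain.pem (within $cc_min_days days, or unreadable)"
    return 2
  fi
  echo "OK $cc_dir"
}
```

Run from cron or a monitoring agent on any node that mounts the shared path, for example: check_cert /shared/certs/live/apps.example.com 21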

A simple smoke test is still valuable:

openssl s_client -connect app.example.com:443 -servername app.example.com </dev/null 2>/dev/null \
  | openssl x509 -noout -subject -issuer -dates

How I Would Explain The Benefits To Someone Else

This setup is worth it when you want HTTPS to remain boring as the environment grows.

The benefits are practical:

  • Traefik becomes stateless with respect to certificate issuance.
  • Certificate renewals happen in one place.
  • Wildcard certs reduce operational clutter.
  • Shared storage makes certificates available everywhere they are needed.
  • Failures are easier to locate by layer.

It also gives you a better upgrade path. You can replace the ingress layer later without rebuilding your entire certificate strategy.

Where I Would Be Careful

This pattern is strong, but it still needs discipline.

I would pay attention to:

  • protecting DNS provider credentials
  • backing up ACME account state
  • ensuring shared storage is reliable
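  • reloading dependent services after renewal (Traefik's file provider watches its dynamic configuration files, not the PEM files they reference, and change notifications are often unreliable on network filesystems)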
  • avoiding one oversized wildcard certificate when separate domain groups would be safer

The goal is not centralization for its own sake. The goal is to centralize the part that should be singular, while keeping the edge layer simple and replaceable.
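
For the reload concern in particular, one option is a small per-node check that notices when a certificate file has changed and then triggers whatever reload you prefer. This is a sketch under my own assumptions: cert_changed is a hypothetical helper, the stamp file is node-local state I introduced, and sha256sum is assumed to be available.

```shell
# Returns 0 if the certificate differs from the digest recorded in the stamp
# file (and records the new digest), 1 if unchanged, 2 if it cannot be read.
cert_changed() {
  chg_cert=$1
  chg_stamp=$2     # node-local file remembering the last digest seen
  chg_new=$(sha256sum "$chg_cert" 2>/dev/null | cut -d' ' -f1)
  [ -n "$chg_new" ] || return 2
  chg_old=$(cat "$chg_stamp" 2>/dev/null)
  if [ "$chg_new" = "$chg_old" ]; then
    return 1
  fi
  printf '%s\n' "$chg_new" > "$chg_stamp"
  return 0
}
```

On each edge node, a cron entry could then run something like: cert_changed /shared/certs/live/apps.example.com/fullchain.pem /var/lib/cert-stamps/apps && docker restart traefik (the stamp path and container name are placeholders). The first run counts as a change because no digest has been recorded yet, and the stamp is written before the reload runs, so a failed reload needs its own alerting.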

The Main Takeaway

The most important thing I learned is that certificate automation becomes much easier when it is treated as infrastructure, not as a convenience feature tucked inside a proxy.

If you separate ACME control, shared certificate state, and ingress routing, you get a system that is more production-aligned, more observable, and easier to hand to other people with confidence.