
The Certificate That Expires at 3 AM: A Practical Guide to SSL/TLS Automation

SSL/TLS certificate expiry is one of the most embarrassing and most preventable causes of production outages. LinkedIn, Spotify, Microsoft Teams, and countless smaller organizations have all experienced certificate-related outages, not because certificate management is technically hard, but because manual processes eventually fail. This guide covers the full certificate lifecycle: how ACME and Let’s Encrypt work under the hood, how to automate issuance and renewal, how to manage certificates across multiple servers, and how to set up monitoring that catches problems before users do.

How ACME Works: The Protocol Behind Let’s Encrypt

Let’s Encrypt issues free, publicly-trusted TLS certificates using the ACME protocol (Automatic Certificate Management Environment, RFC 8555). Understanding ACME’s mechanics helps you debug failures and choose the right challenge type for your infrastructure.

The ACME flow has four steps:

  1. Account registration: Your ACME client generates a keypair and registers with the ACME server (Let’s Encrypt’s Boulder CA). The public key identifies your account.
  2. Order creation: You request a certificate for one or more domain names. The ACME server creates an order and returns a set of authorization challenges — proofs you must complete to demonstrate control of each domain.
  3. Challenge completion: You complete one of the offered challenge types (HTTP-01, DNS-01, or TLS-ALPN-01). The ACME server verifies your completion and marks the authorization as valid.
  4. Certificate issuance: You submit a Certificate Signing Request (CSR) containing your domain names and public key. The CA signs it and returns the certificate chain.
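
When debugging a stuck issuance, it can help to picture steps 2 through 4 as the order's status moving through a small state machine. The sketch below models the order statuses and transitions described in RFC 8555 section 7.1.6; the `next_action` wording and the helper names are illustrative assumptions, not part of any ACME client's API.

```python
from enum import Enum


class OrderStatus(Enum):
    PENDING = "pending"        # order created, authorizations outstanding
    READY = "ready"            # all authorizations valid, CSR not yet submitted
    PROCESSING = "processing"  # CSR submitted, CA is issuing
    VALID = "valid"            # certificate available for download
    INVALID = "invalid"        # a challenge or finalization failed


# Transitions described in RFC 8555 section 7.1.6.
ALLOWED = {
    OrderStatus.PENDING: {OrderStatus.READY, OrderStatus.INVALID},
    OrderStatus.READY: {OrderStatus.PROCESSING, OrderStatus.INVALID},
    OrderStatus.PROCESSING: {OrderStatus.VALID, OrderStatus.INVALID},
    OrderStatus.VALID: set(),
    OrderStatus.INVALID: set(),
}


def can_transition(current: OrderStatus, new: OrderStatus) -> bool:
    """True if the order may move directly from `current` to `new`."""
    return new in ALLOWED[current]


def next_action(status: OrderStatus) -> str:
    """Map an order status to what a client would do next (illustrative wording)."""
    return {
        OrderStatus.PENDING: "complete a challenge for each pending authorization",
        OrderStatus.READY: "submit the CSR to the order's finalize URL",
        OrderStatus.PROCESSING: "poll the order until issuance finishes",
        OrderStatus.VALID: "download the certificate chain",
        OrderStatus.INVALID: "inspect the error and create a new order",
    }[status]
```

An order that stays in "pending" usually points at a challenge that has not validated, while one that reaches "invalid" cannot be revived: the client has to start a new order.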

Challenge Types and When to Use Each

HTTP-01: The ACME server expects to find a specific file at http://yourdomain.com/.well-known/acme-challenge/{token}. Your server must be publicly reachable on port 80. This is the simplest challenge for web servers, but it fails for internal services, wildcard certificates, and servers behind strict firewalls.
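
The file's contents are not arbitrary: RFC 8555 defines them as the challenge token joined by a dot to a thumbprint of your account key (the "key authorization"), with the thumbprint computed as in RFC 7638. A minimal Python sketch using only the standard library follows; the token and JWK values are placeholders rather than real key material, and the function names are made up for illustration.

```python
import base64
import hashlib
import json
from typing import Tuple


def b64url(data: bytes) -> str:
    """Base64url without padding, the encoding ACME uses throughout."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def jwk_thumbprint(jwk: dict) -> str:
    """RFC 7638 thumbprint: SHA-256 over the required JWK members,
    serialized with sorted keys and no whitespace."""
    required = {"EC": ("crv", "kty", "x", "y"), "RSA": ("e", "kty", "n")}[jwk["kty"]]
    canonical = json.dumps({k: jwk[k] for k in required},
                           sort_keys=True, separators=(",", ":"))
    return b64url(hashlib.sha256(canonical.encode("utf-8")).digest())


def http01_response(token: str, account_jwk: dict) -> Tuple[str, str]:
    """Return (URL path, file body) the server must serve for this token."""
    path = f"/.well-known/acme-challenge/{token}"
    body = f"{token}.{jwk_thumbprint(account_jwk)}"
    return path, body


# Hypothetical token and key material, for illustration only.
example_jwk = {"kty": "EC", "crv": "P-256", "x": "placeholder-x", "y": "placeholder-y"}
path, body = http01_response("example-token", example_jwk)
print(path)
print(body)
```

Because the body depends on the account key, a challenge file copied from another machine with a different ACME account will not validate, which is a common cause of confusing HTTP-01 failures.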

DNS-01: You create a TXT record at _acme-challenge.yourdomain.com with a specific value. The ACME server verifies it via DNS lookup. This challenge works for wildcard certificates and internal services. The tradeoff is that it requires API access to your DNS provider — a meaningful security consideration since DNS credentials carry significant blast radius.
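
The "specific value" is derived rather than chosen: per RFC 8555 it is the base64url-encoded SHA-256 digest of the key authorization, which is the challenge token joined by a dot to the account key thumbprint. The sketch below shows the record name and value computation; the key authorization string is a made-up placeholder and the function names are illustrative.

```python
import base64
import hashlib


def dns01_record_name(domain: str) -> str:
    """Wildcard and base names validate at the same _acme-challenge label."""
    base = domain[2:] if domain.startswith("*.") else domain
    return f"_acme-challenge.{base}"


def dns01_txt_value(key_authorization: str) -> str:
    """TXT value = base64url(SHA-256(key authorization)), without padding."""
    digest = hashlib.sha256(key_authorization.encode("utf-8")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


# Hypothetical key authorization: "<token>.<account key thumbprint>".
key_auth = "example-token.example-thumbprint"
print(dns01_record_name("*.example.com"), "TXT", dns01_txt_value(key_auth))
```

Note that a certificate covering both *.example.com and example.com produces two authorizations with different tokens, so two TXT values have to coexist at the same record name during validation.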

TLS-ALPN-01: Less commonly used, this challenge works entirely over TLS on port 443. Useful when you cannot modify DNS records and cannot serve HTTP traffic, but you can accept TLS connections.

Certbot: The Reference Implementation

Certbot is the EFF’s ACME client and the most widely documented option. For a standalone Nginx or Apache server, it handles the full certificate lifecycle.

# Install certbot with Nginx plugin on Ubuntu/Debian
apt install certbot python3-certbot-nginx

# Obtain and install certificate (Nginx plugin modifies nginx.conf automatically)
certbot --nginx -d example.com -d www.example.com

# Dry run to test renewal without actually renewing
certbot renew --dry-run

# Certbot installs a systemd timer for automatic renewal
systemctl status certbot.timer
# certbot.timer runs twice daily and renews certs expiring within 30 days

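# Manual renewal with a deploy hook (runs only after a certificate is actually renewed)
certbot renew --deploy-hook "systemctl reload nginx"

# If Certbot itself must bind port 80 (standalone authenticator),
# stop and start the web server around the renewal attempt instead
certbot renew \
  --pre-hook "systemctl stop nginx" \
  --post-hook "systemctl start nginx"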
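
The renewal rule in the timer comment above (renew once fewer than 30 days remain) is simple enough to sketch. This is an illustration of the decision only, not Certbot's actual code; the function name, the 90-day lifetime, and the dates are assumptions chosen for the example.

```python
from datetime import datetime, timedelta, timezone


def should_renew(not_after: datetime, now: datetime,
                 window: timedelta = timedelta(days=30)) -> bool:
    """True once less than `window` remains (including already-expired certs)."""
    return not_after - now < window


issued = datetime(2025, 1, 1, tzinfo=timezone.utc)
not_after = issued + timedelta(days=90)  # assumed 90-day certificate lifetime
for day in (0, 59, 61, 89):
    print(day, should_renew(not_after, issued + timedelta(days=day)))
```

Checking twice a day against a 30-day window leaves room for roughly 60 renewal attempts before the certificate expires, which is why a single failed run is rarely a problem but a persistently failing one must be caught by monitoring.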

For DNS-01 challenges with Certbot, you need a DNS plugin matching your provider. Certbot maintains plugins for most major DNS providers:

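# Wildcard certificate via DNS-01 with Cloudflare
# Install the plugin the same way Certbot itself was installed (apt here, matching the install above)
apt install python3-certbot-dns-cloudflare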

# Create credentials file (restrict permissions carefully)
cat > /etc/letsencrypt/cloudflare.ini << 'EOF'
dns_cloudflare_api_token = your_api_token_here
EOF
chmod 600 /etc/letsencrypt/cloudflare.ini

# Obtain wildcard certificate
certbot certonly \
  --dns-cloudflare \
  --dns-cloudflare-credentials /etc/letsencrypt/cloudflare.ini \
  -d "*.example.com" \
  -d example.com \
  --agree-tos \
  --email admin@example.com

Caddy: Automatic HTTPS Without Configuration

Caddy takes a different approach: HTTPS is automatic by default. Every domain in your Caddyfile gets a certificate from Let's Encrypt (or ZeroSSL) without any explicit certificate configuration. Caddy handles issuance, storage, renewal, and reload automatically.

# Caddyfile — HTTPS is automatic for all these sites
example.com {
    reverse_proxy localhost:8080
}

api.example.com {
    reverse_proxy localhost:3000
    
    # Rate limiting requires a third-party module (not included in stock Caddy builds)
    rate_limit {
        zone dynamic {
            key {remote_host}
            events 100
            window 1m
        }
    }
}

# Internal service with a certificate from Caddy's built-in local CA (for non-public domains)
internal.example.com {
    tls internal
    reverse_proxy localhost:9090
}

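Caddy stores certificates under its data directory ($HOME/.local/share/caddy on Linux, which is /var/lib/caddy/.local/share/caddy when it runs as the packaged caddy service user) and renews them automatically once about one third of their lifetime remains, roughly 30 days for a 90-day Let’s Encrypt certificate. For multi-server deployments, Caddy instances that share a storage backend coordinate issuance and renewal, so each server does not independently request certificates for the same domains. A shared filesystem works with the default storage module; Redis and similar backends require a third-party storage plugin.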

cert-manager: Certificate Automation in Kubernetes

cert-manager is the de facto standard for TLS certificate management in Kubernetes. It introduces Issuer/ClusterIssuer resources to represent certificate authorities and Certificate resources to request specific certificates.

# Install cert-manager via Helm
helm repo add jetstack https://charts.jetstack.io
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --set crds.enabled=true

# ClusterIssuer using Let's Encrypt production with DNS-01 challenge (Cloudflare)
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    solvers:
    - dns01:
        cloudflare:
          apiTokenSecretRef:
            name: cloudflare-api-token
            key: api-token
      # Apply this solver only to example.com and subdomains
      selector:
        dnsZones:
        - "example.com"

---
# Request a wildcard certificate
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: wildcard-example-com
  namespace: production
spec:
  secretName: wildcard-example-com-tls
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
  - "*.example.com"
  - "example.com"
  # Renew 30 days before expiry
  renewBefore: 720h

cert-manager also integrates with Ingress and Gateway API resources. Add the annotation and cert-manager handles the rest:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-ingress
  namespace: production
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  tls:
  - hosts:
    - api.example.com
    secretName: api-example-com-tls
  rules:
  - host: api.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: api-service
            port:
              number: 80

Certificate Monitoring: Catching Expiry Before Users Do

Automated renewal handles the happy path. Monitoring handles everything else: renewal failures, certificates outside your automation that someone added manually, and certificates on third-party services you do not control.

#!/bin/bash
# Shell script: check certificate expiry for a list of domains
DOMAINS=(
  "example.com"
  "api.example.com"
  "admin.example.com"
)
WARNING_DAYS=30
CRITICAL_DAYS=7

for domain in "${DOMAINS[@]}"; do
  expiry=$(echo | openssl s_client -servername "$domain" \
    -connect "$domain:443" 2>/dev/null | \
    openssl x509 -noout -enddate 2>/dev/null | \
    cut -d= -f2)
  
  if [ -z "$expiry" ]; then
    echo "CRITICAL: Cannot connect to $domain"
    continue
  fi
  
  expiry_epoch=$(date -d "$expiry" +%s 2>/dev/null || \
                 date -jf "%b %d %T %Y %Z" "$expiry" +%s)
  now_epoch=$(date +%s)
  days_remaining=$(( (expiry_epoch - now_epoch) / 86400 ))
  
  if [ "$days_remaining" -lt "$CRITICAL_DAYS" ]; then
    echo "CRITICAL: $domain expires in $days_remaining days ($expiry)"
  elif [ "$days_remaining" -lt "$WARNING_DAYS" ]; then
    echo "WARNING: $domain expires in $days_remaining days ($expiry)"
  else
    echo "OK: $domain expires in $days_remaining days"
  fi
done
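
The same check can be written without depending on the openssl binary or on GNU versus BSD date parsing. The following is a sketch using Python's standard ssl module, not a drop-in replacement for the script above; the function names are illustrative. One behavioral difference is assumed deliberately: because the connection is fully verified, an already-expired or untrusted certificate is reported as a verification failure instead of a negative day count.

```python
import socket
import ssl
import time


def days_remaining(not_after: str, now: float) -> int:
    """Whole days until expiry, given the notAfter string from getpeercert()."""
    return int((ssl.cert_time_to_seconds(not_after) - now) // 86400)


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Connect with full verification and return the leaf certificate's notAfter."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def check(host: str) -> str:
    """Return a one-line status for `host`, never raising on network errors."""
    try:
        left = days_remaining(fetch_not_after(host), time.time())
    except ssl.SSLCertVerificationError as exc:
        return f"CRITICAL: {host} failed verification ({exc.verify_message})"
    except OSError as exc:
        return f"CRITICAL: cannot connect to {host} ({exc})"
    return f"{host}: {left} days remaining"
```

Calling check("example.com") in a loop over the same domain list gives output comparable to the shell version, with connection failures and verification failures reported separately.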

For production monitoring, use dedicated tools rather than cron-based scripts. Checkly and UptimeRobot both offer certificate expiry monitoring with Slack/PagerDuty integration. Prometheus with the blackbox_exporter can monitor certificate expiry as a metric:

# prometheus/blackbox.yml — TLS probe configuration
modules:
  https_cert_check:
    prober: http
    timeout: 10s
    http:
      valid_status_codes: [200]
      tls_config:
        insecure_skip_verify: false
      preferred_ip_protocol: ip4

# Grafana alert rule: fire when certificate expires within 14 days
# Metric: probe_ssl_earliest_cert_expiry - time()
# Condition: < 14 * 24 * 60 * 60 (seconds)

Multi-Server Certificate Distribution

When you run multiple web servers behind a load balancer, you need a strategy for distributing certificates. Three approaches are common:

Terminate TLS at the load balancer: The cleanest approach. AWS ACM, Cloudflare, or a dedicated load balancer handles the certificate. Your backend servers receive plain HTTP. Certificates are managed in one place. The downside is that traffic between the load balancer and backends is unencrypted — acceptable for VPC-internal traffic, problematic for compliance-sensitive environments.

Shared filesystem mount: Let Certbot run on one node, store certificates on a shared NFS/EFS mount, configure all web servers to read from that path. Simple but creates a single point of failure in the NFS mount.

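cert-manager with Kubernetes Secrets replication: If your servers run in Kubernetes, cert-manager writes each certificate to a Secret in the namespace of its Certificate resource. To use the same certificate in other namespaces or clusters, replicate that Secret with a tool such as the external-secrets-operator rather than requesting a duplicate certificate in each place.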
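
Whichever approach you choose, it is worth verifying that every backend actually presents the same certificate after a renewal. The sketch below compares SHA-256 fingerprints across backends using Python's standard library. The function names and backend labels are hypothetical, and fetch_fingerprint assumes each backend can be reached directly by address; backends that select certificates by SNI would need the hostname passed through a different connection path.

```python
import hashlib
import ssl
from typing import Dict, List


def fingerprint(der: bytes) -> str:
    """SHA-256 fingerprint of a DER-encoded certificate, as lowercase hex."""
    return hashlib.sha256(der).hexdigest()


def fetch_fingerprint(address: str, port: int = 443) -> str:
    """Fetch the certificate one backend presents (no trust validation is done)."""
    pem = ssl.get_server_certificate((address, port))
    return fingerprint(ssl.PEM_cert_to_DER_cert(pem))


def find_drift(fingerprints: Dict[str, str]) -> Dict[str, List[str]]:
    """Group backends by fingerprint; an empty result means no drift."""
    groups: Dict[str, List[str]] = {}
    for backend, fp in fingerprints.items():
        groups.setdefault(fp, []).append(backend)
    return groups if len(groups) > 1 else {}
```

A non-empty result from find_drift typically means one node missed a reload or is reading a stale copy of the certificate, which is exactly the failure mode a shared mount or replicated Secret can hide.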

The Operational Checklist

  • Inventory every certificate your organization uses, including those on third-party services, internal services, and client certificates
  • Set up monitoring with alerts at 30 days, 14 days, and 7 days before expiry — three separate alert thresholds, escalating severity
  • Test renewal in staging before relying on automation in production: run certbot renew --dry-run, or point a cert-manager Issuer at the Let’s Encrypt staging endpoint (https://acme-staging-v02.api.letsencrypt.org/directory)
  • Store ACME account private keys in a secrets manager (Vault, AWS Secrets Manager), not on the filesystem
  • Document the manual renewal procedure for every automated process — automation fails, and someone needs to know what to do at 2 AM
  • Use certificate transparency log monitoring (crt.sh or Facebook's CT monitor) to detect unauthorized certificates issued for your domains
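
The three escalating thresholds from the checklist can be expressed as a small lookup. This is a sketch only; the severity labels are assumptions, so map them to whatever levels your alerting tool uses.

```python
from typing import Optional

# (days threshold, label), ordered from most to least severe; labels are illustrative.
THRESHOLDS = [(7, "critical"), (14, "warning"), (30, "notice")]


def severity(days_remaining: int) -> Optional[str]:
    """Return the most severe tier whose threshold has been crossed, else None."""
    for limit, label in THRESHOLDS:
        if days_remaining <= limit:
            return label
    return None
```

With automated renewal starting around 30 days before expiry, the first tier firing at all is already a signal that a renewal attempt has been failing.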

Key Takeaways

  • Let’s Encrypt requires the DNS-01 challenge for wildcard certificates, and it is the only challenge type that works for services not reachable from the public internet. Use it with a narrowly scoped DNS API token, not your root account credentials.
  • Caddy automates HTTPS entirely, making it ideal for new deployments where simplicity outweighs customization needs.
  • cert-manager is the standard for Kubernetes certificate management. Integrate with Ingress annotations for the lowest-friction workflow.
  • Monitoring must cover certificates outside your automation. A certificate manually installed three years ago on a forgotten subdomain will not renew itself.
  • Always have a documented manual renewal procedure. Automation reduces the frequency of manual intervention, not the need to understand how.

By Michael Sun

Founder and Editor-in-Chief of NovVista. Software engineer with hands-on experience in cloud infrastructure, full-stack development, and DevOps. Writes about AI tools, developer workflows, server architecture, and the practical side of technology. Based in China.
