Launching Platform One in July with 6 applications across 3 VMs immediately broke my centralized backup strategy: Stackback couldn’t discover volumes across separate Docker Compose contexts.

The Problem

My initial architecture:

  • Single Stackback (Restic wrapper) container
  • Two shared S3 buckets
  • Centralized backup configuration via Ansible

The catch: Stackback relies on Docker labels (stack-back.volumes=true) to auto-discover backup targets, and that discovery is scoped to the backup container’s own Compose project. When your backup container runs in a separate docker-compose.yml from your application stacks, it can’t see what it’s supposed to back up.

Real-world impact:

  • ❌ Inconsistent backup coverage (some volumes discovered, others missed)
  • ❌ No visibility into PostgreSQL backups
  • ❌ Single point of failure for credentials
  • ❌ Resource contention when all backups ran simultaneously

The Solution

Each application stack gets its own dedicated backup container.

Architecture Evolution

Before:

# centralized-stackback/docker-compose.yml
services:
  stackback:
    image: ghcr.io/lawndoc/stack-back:latest
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      RESTIC_REPOSITORY: s3:minio.internal/shared-bucket

After:

# mattermost/docker-compose.yml
services:
  mattermost:
    labels:
      stack-back.volumes: "true"
      stack-back.postgres: "true"
  
  stackback:
    image: ghcr.io/lawndoc/stack-back:latest
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      RESTIC_REPOSITORY: s3:minio.internal/mattermost-backup-bucket
      RESTIC_PASSWORD: ${RESTIC_PASSWORD_MATTERMOST}
      BACKUP_CRON: "0 2 * * *"

Key Design Decisions

1. Dedicated S3 Buckets Per Application

Using Terraform’s MinIO provider:

# terraform/modules/minio/main.tf
resource "minio_s3_bucket" "stackback_per_app" {
  for_each = var.applications
  bucket   = "restic-stackback-${each.key}-bucket"
  acl      = "private"
}

resource "minio_ilm_policy" "stackback_lifecycle" {
  for_each = minio_s3_bucket.stackback_per_app
  bucket   = each.value.bucket
  
  rule {
    id         = "delete-old-backups"
    expiration = "30d"
  }
}
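
For reference, a minimal sketch of the applications variable that the for_each expressions above assume; the list of names is illustrative:

# terraform/modules/minio/variables.tf (illustrative)
variable "applications" {
  description = "Application names that each get a dedicated backup bucket"
  type        = set(string)
  default     = ["mattermost", "n8n", "vault", "linkwarden", "solidtime"]
}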

Why 30 days? GitLab backups alone were consuming storage at an unsustainable rate. Automated lifecycle policies prevent the “set-and-forget-until-disk-full” trap.

2. IAM Credential Isolation

Each application receives unique S3 credentials:

resource "minio_iam_user" "stackback_per_app" {
  for_each = var.applications
  name     = "restic-${each.key}-user"
}

resource "minio_iam_policy" "stackback_per_app" {
  policy = jsonencode({
    Statement = [{
      Effect   = "Allow"
      Action   = ["s3:*"]
      Resource = [
        "arn:aws:s3:::restic-stackback-${each.key}-bucket/*"
      ]
    }]
  })
}
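
A policy grants nothing until it is attached to its user. A minimal sketch using the MinIO provider’s attachment resource, assuming the naming above:

resource "minio_iam_user_policy_attachment" "stackback_per_app" {
  for_each    = var.applications
  user_name   = minio_iam_user.stackback_per_app[each.key].name
  policy_name = minio_iam_policy.stackback_per_app[each.key].name
}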

Security win: A compromised application can only access its own backup bucket.

3. Staggered Backup Schedules

Running all backups simultaneously caused I/O storms on the shared NFS storage. The solution: offset schedules. Times only need to be unique within a resource group, so Green and Blue stacks can overlap; the snippet after the table shows the matching cron expressions.

Application   Schedule   Resource Group
-----------   --------   --------------
Mattermost    2:00 AM    Green
N8N           2:20 AM    Green
Vault         2:40 AM    Green
Linkwarden    2:00 AM    Blue
Solidtime     2:20 AM    Blue
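
In Compose terms, staggering is just a different BACKUP_CRON per stack. A sketch for the Green group, matching the table:

# mattermost/docker-compose.yml (Green, 2:00 AM)
  stackback:
    environment:
      BACKUP_CRON: "0 2 * * *"

# n8n/docker-compose.yml (Green, 2:20 AM)
  stackback:
    environment:
      BACKUP_CRON: "20 2 * * *"

# vault/docker-compose.yml (Green, 2:40 AM)
  stackback:
    environment:
      BACKUP_CRON: "40 2 * * *"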

4. Vault Integration for Secrets

Backup credentials stored in HashiCorp Vault:

# ansible/roles/platform_one/templates/stackback.env.j2
RESTIC_REPOSITORY=s3:https://{{ minio_endpoint }}/{{ backup_bucket }}
RESTIC_PASSWORD={{ lookup('community.hashi_vault.hashi_vault',
  'secret=ansible/data/stackback_' ~ app_name ~ ':password') }}
AWS_ACCESS_KEY_ID={{ lookup('community.hashi_vault.hashi_vault',
  'secret=ansible/data/stackback_' ~ app_name ~ ':access_key') }}
# secret_key field name assumed to sit next to access_key in the same entry
AWS_SECRET_ACCESS_KEY={{ lookup('community.hashi_vault.hashi_vault',
  'secret=ansible/data/stackback_' ~ app_name ~ ':secret_key') }}
BACKUP_CRON="{{ backup_schedule | default('0 2 * * *') }}"
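
The entries themselves are written to Vault once, ahead of deployment. A hypothetical seeding task (variable names invented for illustration), assuming the ansible mount is KV v2:

# one-time secret seeding (hypothetical variable names)
- name: Store stackback credentials for {{ app_name }} in Vault
  community.hashi_vault.vault_kv2_write:
    url: "{{ vault_addr }}"
    engine_mount_point: ansible
    path: "stackback_{{ app_name }}"
    data:
      password: "{{ restic_repo_password }}"
      access_key: "{{ minio_access_key }}"
      secret_key: "{{ minio_secret_key }}"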

Implementation with Ansible

Dynamic template generation per application:

# roles/platform_one/tasks/deploy_application.yml
- name: Generate stackback environment file
  template:
    src: stackback.env.j2
    dest: "{{ container_data }}/{{ app_name }}/stackback.env"
    mode: '0600'
  vars:
    backup_bucket: "restic-stackback-{{ vm_name }}-{{ app_name }}-bucket"
    backup_schedule: "{{ applications[app_name].backup_schedule | default('0 2 * * *') }}"
  when: applications[app_name].backup_enabled | default(false)
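
The applications map in group_vars carries the per-app flags the task reads; a hypothetical excerpt:

# group_vars/platform_one.yml (hypothetical excerpt)
applications:
  mattermost:
    backup_enabled: true
    backup_schedule: "0 2 * * *"
  solidtime:
    backup_enabled: true
    backup_schedule: "20 2 * * *"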

Docker Compose integration:

# templates/docker-compose.yml.j2
{% if app.backup_enabled | default(false) %}
  stackback:
    image: ghcr.io/mittbachweg/stack-back:2024.11.1
    container_name: {{ app_name }}_stackback
    env_file: ./stackback.env
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
    restart: unless-stopped
{% endif %}

Lessons Learned

What worked:

  • ✅ Application-level isolation caught backup failures early
  • ✅ Staggered schedules eliminated I/O contention
  • ✅ Lifecycle policies prevented storage exhaustion
  • ✅ Vault integration centralized credential management

What didn’t:

  • ❌ Initial 7-day retention was too short (extended to 30 days)
  • ❌ Forgot to monitor backup success (added Prometheus metrics)
  • ❌ Manual Restic repository initialization (automated via Ansible; sketch below)
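
For that last point, a minimal sketch of the pre-initialization tasks, assuming restic is installed on the target VM and reusing the same Vault-backed credentials:

# roles/platform_one/tasks/init_restic_repo.yml (sketch)
- name: Check whether the Restic repository already exists
  ansible.builtin.command: restic cat config
  environment: &restic_env
    RESTIC_REPOSITORY: "s3:https://{{ minio_endpoint }}/{{ backup_bucket }}"
    RESTIC_PASSWORD: "{{ lookup('community.hashi_vault.hashi_vault', 'secret=ansible/data/stackback_' ~ app_name ~ ':password') }}"
    AWS_ACCESS_KEY_ID: "{{ lookup('community.hashi_vault.hashi_vault', 'secret=ansible/data/stackback_' ~ app_name ~ ':access_key') }}"
    AWS_SECRET_ACCESS_KEY: "{{ lookup('community.hashi_vault.hashi_vault', 'secret=ansible/data/stackback_' ~ app_name ~ ':secret_key') }}"
  register: restic_repo
  failed_when: false
  changed_when: false

- name: Initialize the Restic repository on first deploy
  ansible.builtin.command: restic init
  environment: *restic_env
  when: restic_repo.rc != 0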

The modular approach trades simplicity for reliability. Worth it.