Docker has become the standard for packaging and deploying applications, but there's a significant gap between "it works on my machine" Docker usage and production-ready containerisation. After running hundreds of containers in production across multiple e-commerce platforms, here are the practices that have made the biggest difference.
Multi-Stage Builds
If you're still using single-stage Dockerfiles, you're shipping unnecessary build tools, source code, and dependencies to production. Multi-stage builds are the single most impactful optimisation you can make.
Here's a real-world example from one of our Node.js services:
# Stage 1: Build
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
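# Install production-only dependencies and set them aside,
# then install everything (including devDependencies) for the build
RUN npm ci --omit=dev && \
    cp -R node_modules prod_modules && \
    npm ci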
COPY . .
RUN npm run build
# Stage 2: Production
FROM node:20-alpine
WORKDIR /app
RUN addgroup -g 1001 appgroup && \
adduser -u 1001 -G appgroup -s /bin/sh -D appuser
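# Copy only the build artefacts out of the builder stage
COPY --from=builder /app/prod_modules ./node_modules
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/package.json ./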
USER appuser
EXPOSE 3000
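# Timings are illustrative; size --start-period to the app's real startup time
HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \
    CMD wget --no-verbose --tries=1 --spider http://localhost:3000/health || exit 1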
CMD ["node", "dist/server.js"]
This pattern reduced our image size from 1.2GB to 180MB -- an 85% reduction. Smaller images mean faster pulls, faster scaling, and a smaller attack surface.
Security Hardening
Running containers as root is one of the most common security mistakes. Here's a checklist we follow for every production image:
1. Never Run as Root
Always create a dedicated user and switch to it with the USER directive. If your app needs to bind to port 80, use a reverse proxy or setcap instead of running as root.
2. Use Minimal Base Images
- alpine variants are great for most workloads (5MB base)
- distroless images from Google for maximum security (no shell, no package manager)
- Avoid the :latest tag -- always pin to a specific version
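As a rough sketch of the distroless option, the final stage of the earlier Node.js example could be swapped for Google's Node.js distroless image. The image name gcr.io/distroless/nodejs20-debian12 and its built-in nonroot user are assumptions about what suits your service, and the paths simply mirror the earlier example. A production build would also pin the base image to a specific digest rather than rely on the default tag.

```dockerfile
# Stage 1: Build (same idea as the earlier example)
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev && \
    cp -R node_modules prod_modules && \
    npm ci
COPY . .
RUN npm run build

# Stage 2: Distroless runtime (assumed image; no shell, no package manager)
FROM gcr.io/distroless/nodejs20-debian12
WORKDIR /app
COPY --from=builder /app/prod_modules ./node_modules
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/package.json ./
# The distroless images ship a "nonroot" user, so no adduser step is needed
USER nonroot
EXPOSE 3000
# The image's entrypoint is already node, so CMD is just the script path
CMD ["dist/server.js"]
```

Because the image contains no shell or wget, a shell-form HEALTHCHECK like the one in the Alpine example will not run here; the check has to use a binary that exists in the image or be defined in the orchestrator instead.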
3. Scan for Vulnerabilities
We run trivy in our CI pipeline on every build:
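security-scan:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - name: Build image
      # Tagging with the commit SHA is an assumption; any unique tag works
      run: docker build -t myapp:${{ github.sha }} .
    - name: Run Trivy scan
      uses: aquasecurity/trivy-action@master
      with:
        image-ref: myapp:${{ github.sha }}
        format: 'sarif'
        # Fail the job when any CRITICAL or HIGH finding is reported
        exit-code: '1'
        severity: 'CRITICAL,HIGH'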
A container vulnerability scanner in CI is not optional for production workloads. It's the bare minimum. The earlier you catch vulnerabilities, the cheaper they are to fix.
Layer Optimisation
Docker image layers are cached, so the order of your Dockerfile instructions matters significantly for build performance:
- Base image and system packages -- changes rarely, cache forever
- Dependency files (package.json, requirements.txt) -- changes occasionally
- Install dependencies -- only re-runs when dependency files change
- Application source code -- changes frequently, should be last
The key insight: put things that change least frequently at the top of your Dockerfile. Every instruction after a cache-busting change has to re-run.
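The ordering above might look like this in a single-stage Node.js Dockerfile. This is a minimal sketch: curl stands in for whatever system packages your image needs, and server.js is a hypothetical entrypoint.

```dockerfile
# 1. Base image and system packages: change rarely
FROM node:20-alpine
RUN apk add --no-cache curl

WORKDIR /app

# 2. Dependency manifests only: change occasionally
COPY package*.json ./

# 3. Install dependencies: stays cached until the manifests above change
RUN npm ci --omit=dev

# 4. Application source: changes often, so it comes last
COPY . .

# Hypothetical entrypoint for illustration
CMD ["node", "server.js"]
```

With this layout, editing a source file only invalidates the final COPY layer. If COPY . . came before npm ci, every code change would trigger a full dependency reinstall.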
Health Checks
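Always include a HEALTHCHECK in your Dockerfile. Without a health check, your orchestrator has no way to know if your application is actually working -- only that the process is running. Docker and Docker Swarm read the HEALTHCHECK instruction directly. Kubernetes ignores it and relies on its own liveness and readiness probes, and ECS only acts on health checks declared in the task definition, so define the equivalent check there as well.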
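# Timings are illustrative; size --start-period to the app's real startup time
HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \
    CMD curl -f http://localhost:3000/health || exit 1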
The --start-period flag is often overlooked but crucial -- it gives your application time to start up before health checks begin. Without it, slow-starting applications will be killed before they're ready.
Resource Limits
Always set memory and CPU limits in your container orchestration config. A runaway process shouldn't be able to take down the entire host:
# docker-compose.yml
services:
  api:
    image: myapp:latest
    deploy:
      resources:
        limits:
          memory: 512M
          cpus: '0.5'
        reservations:
          memory: 256M
          cpus: '0.25'
Summary
Production Docker isn't complicated, but it requires intentionality. The practices that give you the most return:
- Multi-stage builds for smaller, cleaner images
- Non-root users and minimal base images for security
- Layer ordering optimisation for faster builds
- Health checks for reliable orchestration
- Resource limits to prevent noisy neighbours
- Vulnerability scanning in CI as a gate
Next up: how we use Terraform to manage our AWS infrastructure across multiple environments with a single codebase.