Last month I found myself staring at a 1.4 GB Docker image for our Go API that took our GitLab pipeline eleven minutes to build on every push. That is not a typo: the image carried the full Go toolchain, half of apt, and some 200 MB of cache files nobody asked for. I have dealt with bloated virtual machines (VMs) before, but the waste from rebuilding containers in CI felt like a whole new level of failure. So I ripped out the original single-stage Dockerfile and built a gitlab ci pipeline multi stage docker build from scratch. After a weekend of troubleshooting broken runners and hanging jobs, I ended up with a setup that ships a production-ready 32 MB image in under three minutes. This is what worked for me and what you should avoid.
Quick Summary
- What DinD actually requires from your runner config.
- How to build multi-stage targets from your .gitlab-ci.yml and pass them through the pipeline.
- Caching techniques that survive ephemeral executors.
- A cleanup routine for orphaned containers that eat runner disk space.
- Straight answers to the auth, caching, and Kaniko questions you will get.
Understanding the multi-stage dockerfile ci integration
The Architectural Shift from Single to Multi-Stage Builds
With a single-stage build, the compiler, debug symbols, and every apt package you install all end up in one final layer. You can squash the layer afterwards or manually delete files at the end of each build, but both approaches are unreliable. With a multi-stage build, you do the heavy lifting in a builder stage and copy only the runtime artifacts you need into a slim production stage. The Continuous Integration (CI) pipeline does not care what is inside the image; it only has to build the Dockerfile for the correct target and tag it correctly.
I have often seen teams cram everything into a single massive RUN command and pray that docker-slim can clean up the mess afterwards. That does not work. A proper multi-stage setup is not a cleanup hack; it is a design choice. Multiple stages let you control the total image size and keep the attack surface of the production image small.
How GitLab Runners Process Independent Container Stages
When a job uses the Docker executor, the GitLab runner creates a fresh container from the specified image. Every build command you run inside that container, including docker build, is executed against the Docker daemon provided by the DinD (Docker-in-Docker) service. Each docker build reads the entire Dockerfile, but if you pass --target, the resulting image contains only the layers up to and including that stage. This is how you can build the builder stage, push it as a cache image, and then build the production stage with --cache-from pointing at those builder layers. The runner keeps no memory from one job to the next unless you explicitly push the cache images to the GitLab registry.
This matters because many people assume the runner's local layer cache survives after a job finishes. It does not. Ephemeral runners start from scratch for every job, so you have to treat the registry as your persistence layer.
Prerequisites for Docker-in-Docker Execution
Configuring the dind setup gitlab ci
If your runner configuration does not grant sufficient privileges for DinD, every job that uses the DinD service will fail with cryptic “cannot connect to the Docker daemon” errors. You need a runner with the docker executor and privileged = true. The docker:20.10-dind image, and every newer version, needs privileged mode to manage the storage and network resources of the inner daemon.
```toml
concurrent = 4
check_interval = 0

[[runners]]
  name = "docker-runner"
  url = "https://gitlab.example.com/"
  token = "REDACTED"
  executor = "docker"
  [runners.docker]
    tls_verify = false
    image = "docker:20.10-dind"
    privileged = true  # <-- this is mandatory
    disable_entrypoint_overwrite = false
    oom_kill_disable = false
    disable_cache = false
    volumes = ["/certs/client", "/cache"]
    shm_size = 0
```
The snippet above shows the relevant portion of config.toml, including the [runners.docker] section. Without privileged = true, the inner Docker daemon cannot access kernel features such as storage backends or network namespaces, and your pipeline fails silently.
Defining the gitlab-ci.yml docker executor Environment
Your .gitlab-ci.yml must define both the image that provides the docker CLI and the DinD service. I set DOCKER_TLS_CERTDIR to "" (an empty string) to use the non-TLS socket, since the extra latency of TLS inside CI buys us nothing here.
```yaml
variables:
  DOCKER_HOST: tcp://docker:2375
  DOCKER_TLS_CERTDIR: ""
  DOCKER_DRIVER: overlay2

default:
  image: docker:20.10
  services:
    - docker:20.10-dind
```
This is the basic runtime environment for every job. Job containers talk to the DinD service over TCP port 2375, Docker's default non-TLS port. Skipping TLS avoids the overhead of certificate generation and the occasional handshake failure, a trade-off I accept because the traffic never leaves the job's internal service network.
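If your policy requires TLS instead, the dind image can generate certificates itself when DOCKER_TLS_CERTDIR points at a shared volume. A sketch of the variables involved (this relies on the runner's volumes including /certs/client, as in the config.toml shown earlier):

```yaml
# Alternative sketch: keep TLS on. The dind service writes certs
# into /certs and the job reads the client certs from /certs/client,
# which must be a shared volume in the runner's config.toml.
variables:
  DOCKER_HOST: tcp://docker:2376   # 2376 is Docker's TLS port
  DOCKER_TLS_CERTDIR: "/certs"
  DOCKER_TLS_VERIFY: "1"
  DOCKER_CERT_PATH: "/certs/client"
```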
Architecting the gitlab ci pipeline multi stage docker build
Drafting the Target Multi-Stage Dockerfile
The app is a small Go binary. The build stage uses golang:1.20-alpine, and the final stage uses scratch, which contains nothing but the binary and the CA certificate bundle.
```dockerfile
# Stage 1: builder
FROM golang:1.20-alpine AS builder
RUN apk add --no-cache git ca-certificates
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -a -installsuffix cgo -o /app .

# Stage 2: production
FROM scratch
COPY --from=builder /etc/ssl/certs/ca-certificates.crt /etc/ssl/certs/
COPY --from=builder /app /app
EXPOSE 8080
ENTRYPOINT ["/app"]
```
The result is a final image of only a few megabytes, and no compiler files ever reach production. The CI job builds it with docker build --target production.
Why I ultimately chose this route
Before settling on this approach, I tried bind-mounting the host's /var/run/docker.sock into jobs. It did not go well: the intermediate images my pipeline relied on for caching were frequently and randomly pruned by the host's Docker daemon. Socket access also means CI users effectively join the docker group, which is not ideal. A dedicated daemon per job via Docker-in-Docker (DinD) gives each build an isolated environment, and that isolation is what makes me comfortable with DinD for CI builds, even though it consumes more disk space than a single shared daemon. The caching strategy should offset that extra disk usage.
Handling gitlab container registry authentication
Before you can push cache images or final production images to the GitLab container registry, you must authenticate against it. The GitLab registry authentication docs describe the required environment variables. You need to be logged in before every push, so the easiest place to do it is a before_script.
```yaml
before_script:
  - echo "$CI_REGISTRY_PASSWORD" | docker login $CI_REGISTRY -u $CI_REGISTRY_USER --password-stdin
```
This one-liner authenticates with the job's CI_REGISTRY_PASSWORD variable. Piping it via --password-stdin keeps the secret out of the process list and the job log. GitLab populates CI_REGISTRY automatically, whether that is registry.gitlab.com or your self-hosted instance.
Executing the Build and Push Stages
The build job does the following steps:
- Pull the previously created builder cache image
- Build the production image
- Push the cache and production images
```yaml
build_production:
  stage: build
  script:
    - docker pull $CI_REGISTRY_IMAGE/cache:builder-latest || true
    # Rebuild and retag the builder stage so after_script can push it
    # as the cache image for the next pipeline:
    - >
      docker build
      --target builder
      --cache-from $CI_REGISTRY_IMAGE/cache:builder-latest
      -t $CI_REGISTRY_IMAGE/cache:builder-latest
      .
    - >
      docker build
      --cache-from $CI_REGISTRY_IMAGE/cache:builder-latest
      --target production
      -t $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA
      -t $CI_REGISTRY_IMAGE:latest
      .
    - docker push $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA
    - docker push $CI_REGISTRY_IMAGE:latest
  after_script:
    - docker push $CI_REGISTRY_IMAGE/cache:builder-latest
```
Pushing the refreshed builder cache image after each build makes it available as a cache source for the next run. The || true on the pull prevents the very first pipeline, which has no cache yet, from failing.
Implementing a Robust ci/cd docker caching strategy
Pulling Previous Images as Cache Sources
The first runs took around six minutes because with a cold cache, every layer has to be built from scratch and every dependency downloaded. The Docker build cache docs describe the --cache-from option, but to benefit from it you must have pushed a builder cache image first. I use a dedicated tag for the builder layers: the pipeline starts by pulling cache:builder-latest, which brings the previously downloaded Go modules back into the local layer store without downloading them again.
The build job in the previous section does exactly this. Note that while the COPY . . layer usually misses the cache, the expensive go mod download layer hits it as long as go.mod and go.sum have not changed.
Utilizing Inline Build Caching via BuildKit
BuildKit can embed inline cache metadata, which lets the Docker daemon work out which layers match a cache image without pulling the intermediate layers themselves. It is not a cure-all, but it helps in cases such as monorepos where file timestamps shift between checkouts.
```shell
export DOCKER_BUILDKIT=1
docker build --cache-from $CI_REGISTRY_IMAGE/cache:builder-latest --build-arg BUILDKIT_INLINE_CACHE=1 -t $CI_REGISTRY_IMAGE:latest .
```
The BUILDKIT_INLINE_CACHE=1 build argument embeds cache metadata into the final image, so future builds can resolve cache hits more accurately. Make sure DOCKER_BUILDKIT=1 is set in the job's environment; some runner defaults still use the legacy builder.
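The same inline-cache build can be expressed as a CI job; a sketch, where the job name is illustrative and DOCKER_BUILDKIT is set via the job's variables block:

```yaml
# Sketch: inline-cache build as a GitLab CI job. Setting DOCKER_BUILDKIT
# in the variables block forces BuildKit even on runners that default
# to the legacy builder.
build_with_inline_cache:
  stage: build
  variables:
    DOCKER_BUILDKIT: "1"
  script:
    - docker pull $CI_REGISTRY_IMAGE:latest || true
    - >
      docker build
      --cache-from $CI_REGISTRY_IMAGE:latest
      --build-arg BUILDKIT_INLINE_CACHE=1
      -t $CI_REGISTRY_IMAGE:latest
      .
    - docker push $CI_REGISTRY_IMAGE:latest
```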
Strategies to optimize docker image size gitlab
Beyond the multi-stage separation itself, I slimmed the builder further by installing apk packages with --no-cache and removing any package caches inside the same build step. For Node.js projects I also prune dev dependencies after the build stage. During cleanup I occasionally run docker build --squash (still experimental) followed immediately by docker image prune -f. The biggest win, though, is choosing the right base image rather than any single flag: every distro ships a full filesystem's worth of libraries and tooling. If you are pulling in an entire Ubuntu rootfs just to execute one compiled binary, it is time to stop.
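To illustrate the apk point, these two patterns both avoid baking the package index into a layer (a sketch; the package list is arbitrary):

```dockerfile
# Option 1: never write the apk index to disk.
RUN apk add --no-cache git ca-certificates

# Option 2: clean up in the SAME RUN step, before the layer is
# committed. A separate "RUN rm -rf ..." afterwards would not
# shrink the earlier layer.
RUN apk add git ca-certificates && rm -rf /var/cache/apk/*
```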
Resolving Silent Failures with Orphaned Build Containers
Identifying Stalled Jobs in the Docker Daemon
I once let a job run for an entire day, stuck on step 4/12 : RUN go mod download with no failure reported, just silence. The DinD daemon was holding open a dead build container that never released its lock. Here is the job log:
```text
$ docker build --target production ...
Sending build context to Docker daemon  45.06kB
Step 1/12 : FROM golang:1.20-alpine AS builder
 ---> 9e7c1c8f2a3b
Step 2/12 : RUN apk add --no-cache git ca-certificates
 ---> Using cache
 ---> a5f3d76a14b5
...
Step 4/12 : RUN go mod download
 ---> Running in 12ab34cd56ef
# ← Hangs here forever
```
The arrow marks where the log stopped producing output. The job never hit a timeout or an OOM kill; the hang was caused by an orphaned container inside the DinD daemon. Running docker logs against the DinD service itself showed container 12ab34… still listed as active.
Injecting Manual Cleanup Routines into the Pipeline
To clean up the DinD build environment, I added an after_script that prunes aggressively. After every job it deletes all stopped containers, unused networks, and unreferenced images.
```yaml
after_script:
  - docker system prune -af --volumes
```
Aggressive cleanup is appropriate on ephemeral runners, which have no reason to retain anything. Persistent runners that keep a local cache are a different story; there you would reach for --filter "until=1h", though I have found that orphaned containers often slip through time-based filters. The full system prune guarantees the runner never fills its disk and stalls future jobs.
Frequently Asked Questions
Why is my GitLab CI pipeline ignoring my Dockerfile stage cache?
If you are not passing --cache-from with an image from the same registry, the runner has nothing cached locally, so every job starts from an empty layer store. The cache is also invalidated by any byte-level change to a copied file: edit a single line or comment in go.mod and every layer from that COPY onward rebuilds. Push a dedicated cache tag for your builder stage after every build, and pull it before the next one.
How do I pass CI variables securely into a multi-stage Docker build?
Never bake secrets into your Dockerfile. The straightforward route is docker build --build-arg DB_PASS=$DB_PASS and referencing the ARG in the relevant stage, but be aware that build arguments can leak into the image history. The better option for sensitive files is BuildKit's --secret flag, which mounts the file into a single build step and then discards it. The BuildKit secrets documentation outlines the syntax.
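A minimal sketch of the --secret flow, assuming a hypothetical secret id netrc:

```dockerfile
# syntax=docker/dockerfile:1
FROM golang:1.20-alpine AS builder
WORKDIR /src
COPY go.mod go.sum ./
# The file is mounted only for this RUN step and is never written
# into any layer. "netrc" is an illustrative secret id.
RUN --mount=type=secret,id=netrc,target=/root/.netrc \
    go mod download
```

You would then build with something like DOCKER_BUILDKIT=1 docker build --secret id=netrc,src=$NETRC_FILE ., where NETRC_FILE is a file-type CI variable (the name is hypothetical).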
Can I use Kaniko instead of DinD for unprivileged rootless builds?
Yes. Kaniko runs entirely in user space without a privileged daemon, which makes it safer on shared clusters. The trade-off is that you lose the native daemon-side layer caching DinD provides: Kaniko caches by pushing intermediate layers to a registry and pulling them back. For smaller teams, the simplicity of DinD plus BuildKit usually outweighs that overhead. If your security policy forbids privileged containers, use Kaniko and accept longer rebuilds unless you configure the remote cache carefully.
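For reference, a hedged sketch of what the equivalent Kaniko job might look like (the job name is illustrative, and registry auth still has to be wired up, typically by writing /kaniko/.docker/config.json in the script):

```yaml
# Sketch: the same multi-stage build via Kaniko; no privileged runner needed.
build_kaniko:
  stage: build
  image:
    name: gcr.io/kaniko-project/executor:debug
    entrypoint: [""]
  script:
    - >
      /kaniko/executor
      --context "$CI_PROJECT_DIR"
      --dockerfile "$CI_PROJECT_DIR/Dockerfile"
      --target production
      --destination "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA"
      --cache=true
      --cache-repo "$CI_REGISTRY_IMAGE/cache"
```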