In November 2023, Vercel quietly shipped an internal platform that cut its build provisioning time from 90 seconds to 5. That sounds like a story about making things faster. It is, but only on the surface. The real story is that Vercel got faster by accepting a harder constraint, building a more complicated foundation, and then layering three separate optimizations on top of it. The 18x improvement is the result.

Vercel is a deployment platform for web applications. When a developer pushes code to a connected repository, Vercel pulls that code, runs the build process (compiling, bundling assets, packaging the output) on its own servers, and then deploys the result to a global edge network of geographically distributed servers that deliver the site to end users. The build step happens on Vercel’s infrastructure, which means thousands of customers run their build scripts on machines that Vercel manages. Every push has to feel instant to the developer, has to run safely on shared hardware, and has to scale through traffic spikes without degrading.

The platform that handles all of this is internally codenamed Hive, and it has been powering Vercel’s builds since late 2023. Hive is the reason behind the 90-to-5 transformation. In this article, we examine the constraints Vercel faced, the choices they made in response, and the optimizations that produced the speedup.
Disclaimer: This post is based on publicly shared details from the Vercel Engineering Team. Please comment if you notice any inaccuracies.

The Trust Problem

The architecture rests on a single foundational assumption. Hive operates as if every piece of code it executes might be malicious, running on machines shared by many tenants at once. That assumption influences everything that follows.

It matters because the trust calculation flips entirely between two situations. When a team runs its own code on its own server, the goal is performance and convenience. The code trusts the machine, and the machine trusts itself. When the code comes from someone else and runs on shared hardware, the calculation changes. The platform has to assume the code might try to break out of its sandbox, read another customer’s secrets, or interfere with builds running on the same machine. This is hostile multi-tenancy, and it is a different infrastructure problem from running cooperative workloads.

Vercel sits squarely in this harder category. Every customer push is, from Vercel’s perspective, code written by someone the team has never met, running on a machine that is also running other customers’ code at the same time. The build script could be a normal Next.js compilation, or it could be a deliberately crafted exploit designed to escape the sandbox. Vercel has to handle both cases identically, since the platform cannot tell the difference in advance.

The obvious answer is to run each build inside a Docker container. Containers are how modern infrastructure runs isolated workloads, and most engineers reach for them by reflex. The problem is that containers were designed primarily as a packaging tool, with isolation as a useful side effect. Multiple containers on the same machine all share the same Linux kernel, which is the part of the operating system with direct access to the hardware. Anything that breaks through the kernel can reach other parts of the machine.
For most workloads, this risk is acceptable, since most workloads are cooperative. A team’s own microservices have no incentive to attack each other. However, for running strangers’ build scripts at scale, the risk profile is different. A single kernel exploit in one customer’s build could reach every other customer’s build on the same machine, and the blast radius would be enormous.

This is why standard container orchestration was a poor fit. Tools like Kubernetes assume cooperative tenants and provide good isolation by default, but not adversarial isolation. Adding hardening on top of Kubernetes was an option, but for a constraint as foundational as tenant isolation, building from primitives gave Vercel more leverage.

Containers leave a gap that Vercel could not afford to leave open. The question was how to close that gap without giving up the speed that containers provide. See the diagram below that offers some insight:

MicroVMs and Firecracker

The traditional alternative to containers is the virtual machine. A virtual machine runs a complete operating system on top of a virtualization layer, which means two VMs on the same physical machine each have their own kernel. A kernel exploit in one VM cannot reach the other, since the kernels are genuinely separate.

The downside is weight. A traditional VM might take 30 to 60 seconds to boot and consume hundreds of megabytes of memory just to exist. For a workload like web hosting, where a single VM runs for months, that overhead is fine. For a workload like running a 2-minute build and then throwing the environment away, it becomes wasteful.

In 2018, AWS released Firecracker, an open-source virtualization tool that strips a VM down to the minimum needed to run one short-lived workload. Firecracker microVMs boot in around 125 milliseconds and add only a few megabytes of memory overhead each.
They provide VM-level isolation, with separate kernels and a hardware-enforced boundary that the CPU itself maintains, at something close to container-level speed. This is a new shape in the isolation tradeoff space, occupying a corner that did not exist before.

AWS originally built Firecracker to power Lambda, where it now runs at production scale across millions of concurrent functions. That track record gave Vercel a battle-tested foundation rather than an experimental one.

Vercel adopted Firecracker as the core of Hive. Each customer build runs in a microVM that Vercel calls a cell, and the relationship between cells and Firecracker processes is strictly one-to-one. Each Firecracker process manages exactly one cell, and each cell handles exactly one build.

Inside the cell sits a container that runs the actual build script. The container handles packaging, since it carries all the build tools and dependencies the customer’s project needs. The microVM handles isolation, since it provides the kernel-level boundary that containers alone cannot. Each layer does what it is good at.

This setup is the architectural answer to the trust problem. Vercel can now run a stranger’s code with confidence that, even if the code attempts something hostile, it cannot reach beyond the cell it is running in. The microVM is the wall, and the wall is enforced by the CPU’s virtualization features rather than by software alone.

Firecracker provides the isolation primitive, while the rest of Hive is the machinery that turns one isolated cell into a system capable of running thousands of builds across the world.

Inside Hive
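Before going further, it may help to pin down the one-to-one rules from the previous section in code. The following is a minimal Python sketch, not Vercel's implementation: the names `Cell`, `Host`, `start_cell`, and `assign_build` are hypothetical, and the process IDs are made up for illustration. It only models the two stated invariants, that each Firecracker process manages exactly one cell and each cell handles exactly one build.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Cell:
    """One microVM (hypothetical model, names are illustrative only)."""
    cell_id: str
    firecracker_pid: int            # the single Firecracker process behind this cell
    build_id: Optional[str] = None  # the single build this cell will ever run


class Host:
    """Tracks the cells on one machine and enforces the one-to-one rules."""

    def __init__(self) -> None:
        self._cells_by_pid: Dict[int, Cell] = {}

    def start_cell(self, cell_id: str, firecracker_pid: int) -> Cell:
        # Rule 1: a Firecracker process manages exactly one cell.
        if firecracker_pid in self._cells_by_pid:
            raise ValueError("this Firecracker process already manages a cell")
        cell = Cell(cell_id, firecracker_pid)
        self._cells_by_pid[firecracker_pid] = cell
        return cell

    def assign_build(self, cell: Cell, build_id: str) -> None:
        # Rule 2: a cell handles exactly one build and is never reused.
        if cell.build_id is not None:
            raise ValueError("this cell has already been given a build")
        cell.build_id = build_id
```

In this model, a second build never lands in an existing cell: the caller has to start a fresh cell backed by a fresh Firecracker process, which is the property that keeps one customer's build from sharing a kernel with another's.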