Etsy, an online marketplace for unique, handmade, and vintage items, has
seen high growth over the last five years. Then the pandemic dramatically
changed shoppers’ habits, leading to more consumers shopping online. As a
result, the Etsy marketplace grew from 45.7 million buyers at the end of
2019 to 90.1 million at the end of 2021 (a 97% increase), and from 2.5
million to 5.3 million sellers (a 112% increase) in the same period.
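The growth percentages quoted above follow from the standard relative-change calculation. As a minimal sketch (the helper name `percent_increase` is hypothetical, chosen only for illustration), the figures can be checked like this:

```python
def percent_increase(start: float, end: float) -> float:
    """Return the relative increase from start to end, as a percentage."""
    return (end - start) / start * 100


# Buyer and seller counts in millions, end of 2019 vs. end of 2021,
# taken from the figures quoted in the text.
buyers_growth = percent_increase(45.7, 90.1)
sellers_growth = percent_increase(2.5, 5.3)

print(f"Buyers grew by about {buyers_growth:.0f}%")
print(f"Sellers grew by about {sellers_growth:.0f}%")
```

Rounded to whole percentages, the results match the 97% and 112% figures in the text.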
The growth massively increased demand on the technical platform, scaling
traffic almost 3X overnight. Etsy also had significantly more customers for
whom it needed to continue delivering great experiences. To keep up with
that demand, they had to scale up infrastructure, product delivery, and
talent drastically. While the growth challenged teams, the business was never
bottlenecked. Etsy’s teams were able to deliver new and improved
functionality, and the marketplace continued to provide an excellent customer
experience. This article and the next form the story of Etsy’s scaling strategy.
Etsy’s foundational scaling work had started long before the pandemic. In
2017, Mike Fisher joined as CTO. Josh Silverman had recently joined as Etsy’s
CEO, and was establishing institutional discipline to usher in a period of
growth. Mike has a background in scaling high-growth companies, and along
with Martin Abbott wrote several books on the topic, including The Art of Scalability
and Scalability Rules.
Etsy relied on physical hardware in two data centers, presenting several
scaling challenges. With their expected growth, it was apparent that the
costs would ramp up quickly. Relying on physical hardware also hurt product
teams’ agility, as they had to plan capacity far in advance. In addition, the data centers were
based in one state, which represented an availability risk. It was clear
they needed to move onto the cloud quickly. After an assessment, Mike and
his team chose the Google Cloud Platform (GCP) as the cloud partner and
started to plan a program to move their many systems onto the cloud.
While the cloud migration was happening, Etsy was growing its business and
its team. Mike identified the product delivery process as being another
potential scaling bottleneck. The autonomy afforded to product teams had
caused an issue: each team was delivering in different ways. Joining a team
meant learning a new set of practices, which was problematic as Etsy was
hiring many new people. In addition, they had noticed several product
initiatives that did not pay off as expected. These indicators led leadership
to re-evaluate the effectiveness of their product planning and delivery
processes.
Strategic Principles
Mike Fisher (CTO) and Keyur Govande (Chief Architect) created the
initial cloud migration strategy with these principles:
Minimum viable product – A typical anti-pattern Etsy wanted to avoid
was rebuilding too much and prolonging the migration. Instead, they used
the lean concept of an MVP to validate as quickly and cheaply as possible
that Etsy’s systems would work in the cloud, and to remove the dependency on
the data center.
Local decision making – Each team could make its own decisions for what
it owned, with oversight from a program team. Etsy’s platform was split
into a number of capabilities, such as compute, observability and ML
infrastructure, along with domain-oriented application stacks such as search, bid
engine, and notifications. Each team ran proofs of concept to develop a
migration plan. The main marketplace application is a famously large
monolith, so it required a dedicated cross-team initiative.
No changes to the developer experience – Etsy views a high-quality
developer experience as core to productivity and employee happiness. It
was important that the cloud-based systems continued to provide
capabilities that developers relied upon, such as fast feedback and
sophisticated observability.
There was also a deadline, tied to existing data center contracts,
that they were very keen to hit.
Using a partner
To accelerate their cloud migration, Etsy wanted to bring on outside
expertise to help in the adoption of new tooling and technology, such as
Terraform, Kubernetes, and Prometheus. Unlike a lot of Thoughtworks’
typical clients, Etsy didn’t have a burning platform driving their
fundamental need for the engagement. They are a digital-native company
and had been using a thoroughly modern approach to software development.
Even without a single problem to focus on, though, Etsy knew there was
room for improvement. So the engagement approach was to embed across the
platform organization. Thoughtworks infrastructure engineers and
technical product managers joined the search infrastructure, continuous
deployment services, compute, observability, and machine learning
infrastructure teams.
An incremental federated approach
The initial “lift & shift” to the cloud for the marketplace monolith was the
most difficult.
The team wanted to keep the monolith intact with minimal changes.
However, it used a LAMP stack and so would be difficult to re-platform.
They did a number of dry runs testing performance and capacity. Though
the first cut-over was unsuccessful, they were able to quickly roll
back. In typical Etsy style, the failure was celebrated and used as a
learning opportunity. The migration was eventually completed in 9 months, less
time than the full year originally planned. After the initial migration, the
monolith was tweaked and tuned to fit better in the cloud,
adding features like autoscaling and auto-fixing bad nodes.
Meanwhile, other stacks were also being migrated. While each team
created its own journey, the teams were not completely on their own.
Etsy used a cross-team architecture advisory group to share broader
context, and to help pattern match across the company. For example, the
search stack moved onto GKE as part of the cloud migration, which took longer than
the lift and shift operation for the monolith. Another example is the
data lake migration. Etsy had an on-prem Vertica cluster, which they
moved to BigQuery, changing everything about it in the process.
Unsurprisingly to Etsy, optimization for the cloud didn’t stop once the
migration was complete. Each team continued to look for opportunities
to utilize the cloud to its full extent. With the help of the
architecture advisory group, they looked at things such as: how to
reduce the amount of custom code by moving to industry-standard tools,
how to improve cost efficiency and how to improve feedback loops.
Figure 1: Federated cloud migration