Prioritizing with limited resources is at the heart of an engineering manager’s day-to-day work. This is especially true for internal tools teams, where your customers are other teams within the company who use your systems. Virtually any decision you make will disappoint someone, and you’ll constantly be asked to justify your impact while making hard choices. To make those tough calls, you’ll need a strong model of how your work impacts the company’s core goals. In this post, I’ll walk through my experience building just such a model for Stripe’s ML Infrastructure team, and how it improved our users’ experience.

Motivation

When I joined Stripe’s ML Infrastructure group, everyone knew we wanted to make Stripe’s ML teams more productive, but opinions varied about what that meant. The infra team owned and maintained systems that were critical for the company’s core business, but difficult and painful for developers to use. Tradeoffs between maintenance and new feature development were contentious, and arguments about priorities were frequent. As a result, it was hard to make consistent progress over time, which was frustrating for our users and demoralizing for the team.

The first step towards improving this was to clarify our priorities so we could figure out which work actually had the most impact. Drawing on my time as an ML engineer (and the experience of the ML developers on our customer teams who helped me), I built a model inspired by the stocks-and-flows view of engineering development processes that Will Larson wrote about in An Elegant Puzzle.

The Model

In this model, the overall productivity of an ML engineering team is proportional to the rate at which they ship improvements into production – that’s what improves the production models that drive the business metrics. (As an example, one of our customer teams was working on fighting fraud, so improvements to their models directly helped safeguard Stripe’s customers and block fraud.) The flows work roughly like this (there’s a small simulation sketch after the list):

  • First, ML engineers complete experiments¹ at some experiment rate – some of them are successful and generate feature ideas we can use to improve models in production.
    • The rate at which we get new feature ideas depends on both experiment speed and experiment success rate, which in turn hinge on different things – tooling quality and idea quality, respectively.
  • These feature ideas are built and shipped into production at some development rate. Only once they reach production do they have a net impact on the models and the business.
    • We broke this down further along a number of axes related to model training, data preparation, coding, and deployment. Since development was so slow for us, it was worth getting into the weeds about exactly why.
  • Production issues also arise with some defect rate and are caught with some detection rate, causing outages that have to be fixed. Those fixes add back to the queue of feature ideas.
    • For simplicity we can bundle outages-to-be-fixed in with feature ideas, as long as the process of fixing a bug looks a lot like deploying a new model improvement. (In practice, it usually did for us.)
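
To make the flows concrete, here’s the simulation sketch promised above. All of the rates and the time horizon are made-up illustrative numbers rather than real figures from our team – the point is just to show how the stocks and flows interact.

```python
# Minimal stocks-and-flows sketch of the model above. Every rate and the
# time horizon are invented, illustrative numbers.

def simulate(weeks=52,
             experiment_rate=10.0,     # experiments started per week
             success_rate=0.2,         # fraction that yield a feature idea
             development_rate=1.5,     # ideas shipped to production per week
             defect_rate=0.1,          # defects per shipped improvement
             detection_rate=0.8):      # fraction of defects caught per week
    idea_backlog = 0.0        # stock: ideas (and fixes) waiting to ship
    shipped = 0.0             # stock: improvements live in production
    undetected_defects = 0.0  # stock: regressions nobody has noticed yet

    for _ in range(weeks):
        # Experiments generate new feature ideas.
        idea_backlog += experiment_rate * success_rate

        # Ideas get built and shipped, limited by the development rate.
        shipping = min(development_rate, idea_backlog)
        idea_backlog -= shipping
        shipped += shipping

        # Some shipments introduce defects; detected ones re-enter the queue
        # as fixes (bundled with feature ideas, as described above).
        undetected_defects += shipping * defect_rate
        detected = undetected_defects * detection_rate
        undetected_defects -= detected
        idea_backlog += detected

    return {"shipped": shipped,
            "idea_backlog": idea_backlog,
            "undetected_defects": undetected_defects}


if __name__ == "__main__":
    print(simulate())
```

Playing with the rates makes bottlenecks easy to reason about: doubling experiment_rate does nothing for shipped improvements if development_rate is the limiting flow, which is exactly the dynamic we ran into later.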

How well are we doing?

Even though this was an oversimplified model, it was a good starting point for prioritizing how we could help our customers. To better anchor it in reality, we took the model to our customer teams and worked with them to find metrics that would help us track and assess their experience.

A clear point of impact was helping experiments run faster. Even if we couldn’t improve the quality of ideas generated, we could increase the number of ideas ready to ship simply by improving iteration speed. This also naturally led to good discussions about how developers organize their work: what counts as a single experiment, and how do we track when it starts and stops? We used ideas from this discussion to build out operational metrics for the team, tracking things like models trained and notebooks created as proxies for the number of ideas our users were exploring.
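
As a sketch of what those operational metrics looked like in spirit (the event schema and field names below are invented, not our actual logging format), counting training runs per team per week from an event log gives a rough proxy for experiment activity:

```python
# Hypothetical proxy metric: model-training events per (team, ISO week).
from collections import Counter
from datetime import date


def weekly_training_runs(events):
    """Count training events per (team, ISO week) as a rough proxy
    for how many ideas each team is exploring."""
    counts = Counter()
    for event in events:
        if event["type"] != "model_trained":
            continue
        year, week, _ = event["date"].isocalendar()
        counts[(event["team"], f"{year}-W{week:02d}")] += 1
    return counts


events = [
    {"type": "model_trained", "team": "fraud", "date": date(2019, 3, 4)},
    {"type": "notebook_created", "team": "fraud", "date": date(2019, 3, 5)},
    {"type": "model_trained", "team": "fraud", "date": date(2019, 3, 6)},
]
print(weekly_training_runs(events))  # Counter({('fraud', '2019-W10'): 2})
```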

Getting feature ideas into production faster also jumped out as a great objective. We immediately started asking questions like “what are the steps that get in the way once prototyping is complete?” This was even a bit easier to measure than experiment rate – the start and end of a development effort is usually better defined than exploration work, which can drift from topic to topic.

Reducing the defect rate in production sounds uncontroversial. No one ever wants to ship junk, and for products that fight fraud, mistakes are often painfully expensive! But there is often a tradeoff between how fast you push feature ideas into production and how often you have to roll back production changes: removing steps so that you can ship faster can result in more outages.

This is painfully obvious when stated out loud. But when you’re choosing projects to improve ship-to-production speed, it’s important to keep that tradeoff in mind. Using both metrics together is a good example of a paired indicator, which can warn you if you’re over-optimizing for one goal at the expense of your overall productivity. (A great idea, but hardly a novel one – Andy Grove wrote about it in the classic book High Output Management.)
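
Here’s a minimal sketch of the paired-indicator idea, using invented deploy records – the only point is that the speed metric and the safety metric get computed and reported together:

```python
# Report deployment speed and rollback rate side by side, never one alone.

def paired_indicator(deploys):
    """Return (median days from idea-complete to production, rollback rate)."""
    lead_times = sorted(d["days_to_ship"] for d in deploys)
    median_lead = lead_times[len(lead_times) // 2]
    rollback_rate = sum(d["rolled_back"] for d in deploys) / len(deploys)
    return median_lead, rollback_rate


deploys = [
    {"days_to_ship": 12, "rolled_back": False},
    {"days_to_ship": 20, "rolled_back": True},
    {"days_to_ship": 9,  "rolled_back": False},
]
speed, safety = paired_indicator(deploys)
print(f"median days to ship: {speed}, rollback rate: {safety:.0%}")
```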

One last area we looked at was the detection rate – improving the speed with which we were able to identify model outages and regressions. While true outages were extremely rare for our system², there were a decent number of model regressions that took time for engineers to notice. More sophisticated monitoring – for example, tracking the data distribution of input features and model predictions to identify statistically significant shifts – could directly translate into faster fixes and better performance.
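
One generic way to do this kind of monitoring (a sketch, not the system we actually built) is to compare a recent window of a feature’s values or model scores against a reference window with a two-sample Kolmogorov–Smirnov test and alert on significant shifts:

```python
# Generic drift check: compare recent values of a feature (or model scores)
# against a reference window. The threshold is arbitrary.
import numpy as np
from scipy.stats import ks_2samp


def drifted(reference, recent, alpha=0.01):
    """Flag a statistically significant shift between two samples."""
    statistic, p_value = ks_2samp(reference, recent)
    return p_value < alpha, statistic, p_value


rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=5_000)  # e.g. last month's feature values
recent = rng.normal(0.3, 1.0, size=5_000)     # this week's values, mean has shifted

flag, stat, p = drifted(reference, recent)
print(f"drift={flag} ks_stat={stat:.3f} p={p:.2g}")
```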

What next?

Once we had a rough set of metrics together, we were able to look back at the model and reason about what the greatest bottlenecks were. In our case, it was pretty clear that the time to ship feature ideas into production was by far the biggest blocker. Our existing production stack was extremely focused on minimizing production defects, but not optimized for fast deployments in the same way.

It was time to go back to our roadmap and make tough decisions. We cut back our plans to improve flexibility in a few areas, and instead launched an ambitious project to speed up model deployment by decoupling some precomputation steps. Since there were multiple areas we could improve in parallel, we also scoped out some projects to improve experiment speed via improved tooling and notebooks.

To staff everything at once, I worked with leadership to find more headcount for the team – our model made it possible to demonstrate the business impact we’d get from launching these projects sooner. In my experience, it’s very difficult for an internal tools team to demonstrate this kind of ROI, and that can make getting investment a challenge. If you find yourself in a similar situation, I highly recommend building a model like this one and using it to talk through your team’s impact!

Notes

  1. By experiments here I meant all forms of data exploration and analysis, e.g. “let’s see if adding previous_login_count is a useful feature”. I should have come up with a better term to avoid confusion with production experiments (i.e. A/B tests). 

  2. Much to the credit of the great engineers at Stripe that I was fortunate enough to work with and learn from.