Introducing MLOps — Why we need it, and how to apply it in your company (2/3)
This is part II of an archive of my tech talk `Introducing MLOps — Why we need it, and how to apply it in your company` at Code Chrysalis in September 2021.
In the previous episode, we defined what MLOps is. Now we will discuss why we need it. Sure, efficiency is good and all, but what’s in it for our dev teams? Isn’t simply serving a model on a VM enough? If it ain’t broke, why fix it?
Table of Contents
- ML Productionisation and MLOps
- Why do we need MLOps? (this article)
- How do you apply MLOps? + a simple Flow
Why do we need MLOps?
A Case Study Wrapped in a Short Story
In the tech talk, we used as an example a simple story about a developer named Maurice, who has a food classifier running in a notebook.
Now, despite Maurice having built this model on his own, let’s say it is an extremely good classifier that can predict with pretty high accuracy whether a photo of food is a hotdog or not a hotdog. (And yes, this section of the tech talk was a Silicon Valley reference.)
At this point, Maurice’s team has only one member: himself, and he has developed this food classifier app for fun. However, at one point he meets a brilliant businessperson, Jim, who convinces him to build an app around it, claiming it can make the world a better place. Jim also has enough capital to fund Maurice’s work of developing this classifier into a full-blown app.
Let’s say after a few months, Maurice starts integrating this classifier into an app served to actual users, and continues to improve the model with different datasets and versions. On top of this, he now has to deal with user feedback, usability changes, and frontend/backend implementation for the app. Maurice and his notebook are now overwhelmed!
As a solo dev wearing too many hats, Maurice is now panicking over how to manage a plethora of datasets, model versions, and model implementations & analyses. Colab notebooks, Google Drive, and the single laptop running his service behind an `ngrok` tunnel can’t carry everything anymore.
Maurice now has to figure out a way to organize this whole operation, and fast; otherwise the system could fall apart, users would stop using the application, and Jim’s money would be wasted.
From this, Maurice and Jim decide to hire Mathieu, the MLOps maître, to look at their current system and advise on how it can be optimized for scale, growth, and speed.
As with any system optimization, Mathieu tells the team to look at the big picture first and map out the current architecture, so they can spot pain points and plan a migration to a more robust and manageable infrastructure.
They start with Maurice’s primitive notebook/Flask/mobile-app infrastructure and improve it phase by phase, following the ML / DEV / PROD phases from the first article.
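For context, the “primitive” serving layer here is something like the minimal sketch below: a single Flask endpoint on one laptop, exposed to the internet through `ngrok`. The model file, preprocessing, and class index are hypothetical stand-ins, not the actual code from the talk.

```python
# A minimal sketch of a single-laptop Flask serving setup, exposed via ngrok.
# The model path, preprocessing, and class index are hypothetical stand-ins.
import io

import torch
from flask import Flask, jsonify, request
from PIL import Image
from torchvision import transforms

app = Flask(__name__)

# Hypothetical: a TorchScript export of Maurice's classifier.
model = torch.jit.load("hotdog_classifier.pt")
model.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

@app.route("/predict", methods=["POST"])
def predict():
    # Read the uploaded photo and prepare a batch of one.
    image = Image.open(io.BytesIO(request.files["image"].read())).convert("RGB")
    batch = preprocess(image).unsqueeze(0)
    with torch.no_grad():
        logits = model(batch)
    is_hotdog = bool(logits.argmax(dim=1).item() == 0)  # assume class 0 = hotdog
    return jsonify({"hotdog": is_hotdog})

if __name__ == "__main__":
    # Running `ngrok http 5000` in another terminal exposes this publicly.
    app.run(port=5000)
```

This works fine for a demo, but the data, model weights, serving, and everything else live on one machine, which is exactly the bottleneck Mathieu is hired to fix.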
Mathieu helps Maurice build automated model experiment/release pipelines in the ML phase using the latest tools, builds a CI/CD pipeline for quick updates and releases in the DEV phase, and sets up monitoring and alerts for uptime and model performance in the PROD phase.
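As one concrete illustration of the PROD-phase monitoring piece, here is a minimal sketch of a scheduled model-performance check. The log format, accuracy threshold, and alert webhook are all assumptions for the sake of the example; the talk didn’t prescribe specific tooling.

```python
# A hedged sketch of a scheduled model-performance check (PROD phase).
# The log file, threshold, and webhook URL are hypothetical.
import json
import statistics

import requests

ACCURACY_FLOOR = 0.90  # assumed minimum acceptable accuracy
WEBHOOK_URL = "https://hooks.example.com/alerts"  # placeholder alert endpoint
LOG_PATH = "predictions.jsonl"  # hypothetical log: {"predicted": ..., "feedback": ...}

def recent_accuracy(path: str, window: int = 500) -> float:
    """Fraction of the last `window` predictions that matched user feedback."""
    with open(path) as f:
        records = [json.loads(line) for line in f]
    recent = records[-window:]
    return statistics.mean(r["predicted"] == r["feedback"] for r in recent)

def check_model_health() -> None:
    accuracy = recent_accuracy(LOG_PATH)
    if accuracy < ACCURACY_FLOOR:
        # A real setup would page someone; here we just post to a webhook.
        requests.post(WEBHOOK_URL, json={
            "text": f"Model accuracy fell to {accuracy:.1%}, below {ACCURACY_FLOOR:.0%}"
        })

if __name__ == "__main__":
    check_model_health()
```

Run on a schedule (cron, or whatever orchestrator the team adopts), a check like this turns silent model degradation into an actionable alert.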
Thanks to this, they are able to scale to millions of users, Jim can keep marketing smoothly without fear of downtime, and Maurice’s development experience is as good as ever despite the small team of three.
Recap
While this might be an oversimplified story with certain comedic elements, it shows how organising a very complex system into bite-size parts, and automating repeated operations, can do wonders not just for the developer workflow, but for the product as a whole.
For a cool real-world example of how a small team of engineers built a huge yet extremely optimized system that serves users at massive scale, you can check out the case study of Go-Jek, a ride-hailing service in Indonesia.
Going back, to give a quick recap of what Mathieu and Maurice did, they:
- streamlined data pipeline & model implementation for the food classifier
- analysed and verified each component in the ML, DEV, and PROD phase
- automated staging/production deployments and testing (see the sketch after this list)
- implemented model performance monitoring and uptime checks
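To make the automated-testing bullet concrete, here is a hedged sketch of the kind of smoke test a CI/CD pipeline could run against a staging deployment before promoting a release to production. The staging URL, fixture image, and response shape are hypothetical (the response shape matches the earlier Flask sketch).

```python
# A hedged sketch of a CI smoke test against a staging deployment.
# The staging URL, fixture path, and response shape are hypothetical.
import pathlib

import requests

STAGING_URL = "https://staging.example.com/predict"  # placeholder
FIXTURE = pathlib.Path("tests/fixtures/hotdog.jpg")  # a known hotdog image

def test_endpoint_is_up_and_sane():
    with FIXTURE.open("rb") as f:
        resp = requests.post(STAGING_URL, files={"image": f}, timeout=10)
    assert resp.status_code == 200
    body = resp.json()
    # The fixture is a hotdog, so a healthy model should agree.
    assert body["hotdog"] is True
```

If a test like this fails, the pipeline stops the release before users ever see the regression.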
To summarize why introducing MLOps to a pipeline can help, here are a few bullet points:
- ML is more experimental than classic software
- It eliminates the bottlenecks in your production pipeline that can and will cost you resources (time and $$$)
- Any untested component can make or break your product
- It eases product scaling by providing speed and reliability for your end-to-end ML pipeline
Next Episode
Now that we’ve seen a simplified example, as well as a real-world case study, of why we need MLOps, it’s time to get our hands dirty and dive into how we can apply it in the last installment, Part III: How do you apply MLOps? + a simple flow.