The brave new world of MLOps

Implementing ML

Jul 03, 2021

If you are in a startup, you will most likely need to build an ML system on a limited budget of both money and time. You will also want to focus on what is unique to your startup rather than reinvent the wheel. By now there is a lot of awareness of the hidden technical debt of machine learning.

The engineering discipline that deals with this is MLOps. It is about making ML happen in a company under practical constraints: not reinventing the wheel, and using standardized interfaces so that components can be swapped in and out as needs change.
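
To make the "standardized interfaces" idea concrete, here is a minimal sketch in Python. The `Model` interface, the two toy models and `train_and_score` are names I made up for illustration; nothing here comes from the panel or from any particular framework.

```python
from typing import Protocol, Sequence


class Model(Protocol):
    """Hypothetical interface: anything with fit/predict can plug into the pipeline."""

    def fit(self, xs: Sequence[float], ys: Sequence[int]) -> None: ...

    def predict(self, xs: Sequence[float]) -> Sequence[int]: ...


class MajorityClassModel:
    """Baseline that always predicts the most common training label."""

    def fit(self, xs, ys):
        labels = list(ys)
        self.label = max(set(labels), key=labels.count)

    def predict(self, xs):
        return [self.label for _ in xs]


class ThresholdModel:
    """Predicts 1 when the feature exceeds the mean of the training features."""

    def fit(self, xs, ys):
        self.threshold = sum(xs) / len(xs)

    def predict(self, xs):
        return [1 if x > self.threshold else 0 for x in xs]


def train_and_score(model: Model, xs, ys) -> float:
    """Pipeline code depends only on the interface, not on a concrete model."""
    model.fit(xs, ys)
    predictions = model.predict(xs)
    return sum(p == y for p, y in zip(predictions, ys)) / len(ys)
```

Because `train_and_score` only relies on `fit` and `predict`, swapping `MajorityClassModel()` for `ThresholdModel()` needs no change to the pipeline code.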

I attended a wonderful panel on MLOps (video below) on June 30th that covered how to avoid reinventing the wheel and how to adapt to new frameworks.

Panelists:

Chip Huyen at Stanford, ex-Snorkel and author of 200 tools
Laurence Moroney (Google)
Andrew Ng (Landing AI)
Robert Crowe (Google)
Rajat Monga (Entrepreneur, ex-Tensorflow Lead)

Notes:

Chip: On being asked “What is MLOps?”
- MLOps is the realization that ML is not just about the ML code but also about the data. In fact, we could keep the model the same and iterate on the data to improve performance. Hence we need to manage the entire data lifecycle.
Laurence:
- With ML we can change our logic frequently without changing the code.
- At Google we have one principle: "Focus on the user and all else will follow".
Andrew Ng: Even early on, ML teams should figure out what data to collect and how to version it.
Laurence: On being asked: “What does MLOps look like at Google?” … It depends.
- We have this little box in the middle (see image below), the ML code, which is a sort of meta-code. How can we scale it without requiring those developers to do everything?
- Good monitoring infrastructure
- Process management
- Data verification.
- To keep all of these working together.
- Standards between these, open infrastructure between these.
- We concretized many of these into TFX.
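
Two of the points above, versioning the data and verifying it, can be sketched with nothing but the standard library. This is my own toy illustration, not how TFX does it: the `SCHEMA` columns, `dataset_version` and `validate` are all hypothetical names.

```python
import hashlib
import json

# Hypothetical schema: column name -> (expected type, allowed range or None).
SCHEMA = {
    "age": (int, (0, 120)),
    "country": (str, None),
}


def dataset_version(rows):
    """Return a short content hash, so any change to the data yields a new version id."""
    canonical = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()[:12]


def validate(rows, schema=SCHEMA):
    """Return a list of human-readable problems; an empty list means the data passed."""
    problems = []
    for i, row in enumerate(rows):
        for column, (expected_type, bounds) in schema.items():
            if column not in row:
                problems.append(f"row {i}: missing {column}")
                continue
            value = row[column]
            if not isinstance(value, expected_type):
                problems.append(f"row {i}: {column} is not {expected_type.__name__}")
            elif bounds is not None and not bounds[0] <= value <= bounds[1]:
                problems.append(f"row {i}: {column}={value} outside {bounds}")
    return problems
```

The idea is to record `dataset_version(rows)` next to every trained model and to refuse to train when `validate(rows)` returns anything.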

Robert: On being asked “How is MLOps different in a small company vs Google?”
- The reason it is different at Google is partly historical. We had to build things when there wasn’t a solution out there we could use. We had to invent Tensorflow and before that DistBelief.
- They were our internal needs.
- I see it in a lot of companies. They need something and feel the need to build it, but that increases the maintenance burden going forward.
- People are inventing tools because they need them.

Rajat: On being asked: “Are you inventing tools, or are you using existing ones in your startup?”
- There is a lot of value in using Tensorflow (TFX).
- Really get that end to end thing working and then add more applications.
- I would recommend building the first model, and then building the pipeline around it.
- For us, to start with, each of these was more bespoke.
- When we had to scale we decided what to use. For each of these we try to use standard offerings.
- In the long term TFX is the right approach, but for a startup, I think it is not good advice to start with all the components of a full-fledged solution like TFX.

Andrew:
- I have a different take on this that Laurence and Robert might not like.
- I worked on DistBelief, a precursor to Tensorflow, and in retrospect a lot of bad decisions were made; you can hold me accountable for many of them. Thanks to Rajat and others these were fixed in TFX. For instance, DistBelief was too C++-centric.
- Lesson: Even though TFX is the state of the art today, part of me wonders if ten years from now TFX will be what DistBelief is today. By that I mean that something that is state of the art now might become limiting in the future!

Chip:
- I kind of agree. For instance, a lot of new teams are using streaming-based ML, but many legacy companies are built on batch systems. The tools your systems are built on box you in.
- It happened to me. From TF 1.0 to TF 2.0 I had to change everything in my course.
- When I am teaching students, I try to empathize with this changing landscape of tools. As a student it is challenging to learn MLOps since you don't just want to learn about tools. You want to learn about principles.

Rajat:
- It is hard to teach and learn at a philosophical level. It is hard to understand how one would apply something like that.
- At a high level, MLOps is about recognizing a problem the ML community is facing at large, and how do we codify that so that it is easy for the next person facing the same problem.

Laurence:
- Having to change upstream work because TF 2.0 differs from TF 1.0 was earlier compared to previous changes in tooling, but I think MLOps does not parallel previous changes. For instance, we are talking about TF 1.0 and TF 2.0, but that does not compare to Java 1.0 etc. The biggest problem with MLOps is that people don't know about it or why they need it. People largely knew what was different between Java and C++ and could easily talk about it. Not so for MLOps. It is this incredibly useful thing for which people are constantly reinventing the wheel.

Andrew: To add to that, very few job descriptions say MLOps, even though the role often requires a lot of that work.

Chip: I ask students not to focus on tools but to deploy a project, for instance a simple project on a phone.

Rajat: As a startup, make sure to launch some model, with the right metrics to hill climb. Then you can optimize. Don't try to get a perfect model before you have worked on production.
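
Here is how I picture "launch some model, then hill climb" in code. It is only a sketch under my own assumptions: the held-out examples, the two toy models and `promote_if_better` are invented for illustration, and accuracy stands in for whatever metric fits your product.

```python
# Hypothetical held-out examples as (feature, label) pairs.
HOLDOUT = [(1, 0), (2, 0), (3, 1), (4, 1), (5, 1)]


def always_one(x):
    """First launched model: crude, but it is in production and measurable."""
    return 1


def threshold_rule(x):
    """Candidate improvement: predict 1 only for features of 3 or more."""
    return int(x >= 3)


def accuracy(predict, examples):
    """Fraction of (feature, label) examples that predict() gets right."""
    return sum(predict(x) == y for x, y in examples) / len(examples)


def promote_if_better(current, candidate, examples, metric=accuracy):
    """One hill-climbing step: keep whichever model scores higher on the metric."""
    current_score = metric(current, examples)
    candidate_score = metric(candidate, examples)
    if candidate_score > current_score:
        return candidate, candidate_score
    return current, current_score
```

Each new idea then becomes one call to `promote_if_better(production_model, new_model, HOLDOUT)`, which only replaces the launched model when the metric actually improves.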

Laurence: There is a tension with non-ML stakeholders: they might think that AI/ML is magic pixie dust that can solve everything, and model development is held to a very high standard before it is deployed. It is actually deploying an ML application that allows more iteration and improvement.

Robert Crowe: Think about all the customers of your ML-driven service; a good product-manager-driven workflow is key in developing an ML application. ML is a multidisciplinary thing that at its core requires understanding the model and the data lifecycle. Hence it requires knowledge of the deployment chain and the nature of the input data.

Chip: On being asked “Almost all job profiles in MLOps require many years of experience. Where should people start?”
- I look for really good engineers and they can pick up ML on the job.

Andrew: On being asked “What are some important principles of MLOps?”
- Ensure consistently high quality data in the entire lifecycle of the ML project (scoping, data definition and processing)

Laurence:
- What we can learn from DevOps is that when multiple systems interact with each other, the interaction between them should be standardized. For instance, for observability, a number of systems conform to a standard.
- The challenge is that the boundaries of ML systems are quite fuzzy, and the main effort in TFX etc. is to define them.

Rajat: On being asked “What would help you to go faster?”
- Set up the flywheel from the start, starting with the product objective and touching upon all parts, and most importantly the data lifecycle.
- Divide into components at least mentally so that each of these can be iterated upon later.

Chip Huyen:
- A key problem is that there is little awareness of the centrality of labeling. Andrej Karpathy mentioned that labeling is at least as important as coding at Tesla.

The panel also showed a new MLOps Coursera course! I am enrolling today and I encourage you to do the same.

Disclaimer: These are my personal opinions only. Any assumptions, opinions stated here are mine and not representative of my current or any prior employer(s).

Applied ML | Recommender systems
