The following article tries to summarize what exactly MLOps is and gives a few guidelines for anyone trying to dive into it (or to assess whether you need to dive into it).
Let me start by saying that MLOps is the latest hype in AI. It started as a set of best practices in the field which, given the huge demand and opportunities, is now turning into a discipline of its own. If you need to summarize it for someone who is not too familiar with ML, you might say that MLOps is just ML + DevOps, but it is much more than that, and I will try to explain why next.
Why am I talking about this?
It is probably good to give a little background about me. I help companies put machine learning into production. My work usually involves helping them define the problem, gather and label the data, design and train the models, deploy them to production and maintain them afterwards. One of the nice things about my job is that I get to work with companies of all sizes, from Fortune 50 companies to startups, in a wide range of industries (healthcare, retail, heavy industry, etc.) and across the different machine learning fields: computer vision, natural language processing and predictive analytics. This has allowed me to learn about their processes and best practices while using a wide spectrum of tools along the way.
Tackling these diverse problems, I have:
- Created my own tools for different parts of the process and even got into a 500 Startups program for that (www.oktopus.ai)
- Helped other companies design and implement their tools for this
- Assessed and used many of the different tools out there, adding them from scratch to new pipelines as well as improving existing ones
What exactly is MLOps?
MLOps is the process of deploying a machine learning model into production. Some people include the design phase in it. I personally like to see it as ML lifecycle management. It is a set of best practices that aim to reduce the friction, failure points and manual work involved in deploying a first model into production, and then all subsequent re-trained models.
An ML lifecycle usually includes the following steps:
- Data collection, cleaning & labelling
- Data exploration, pre-processing & augmentation
- Feature engineering (includes feature selection, ranking and pruning)
- Modelling & experiment tracking
- Model design
- Fine tuning
- Production data collection
- Re-training (iteration starts here)
Ideally, these steps are as automated as possible, reducing the possibility of human error and minimizing manual work (such as tracking experiments in a spreadsheet…). These steps should also take into account regulatory requirements (such as data privacy).
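To make the spreadsheet point concrete, here is a minimal sketch of what automated experiment tracking can look like, assuming nothing more than a local JSON Lines file. The function names (`log_run`, `best_run`) and the record layout are my own illustration, not the API of any particular tool; dedicated trackers do much more than this.

```python
import json
import time
from pathlib import Path


def log_run(log_path, params, metrics):
    """Append one experiment record to a JSON Lines file and return it."""
    record = {"timestamp": time.time(), "params": params, "metrics": metrics}
    with Path(log_path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")
    return record


def best_run(log_path, metric, higher_is_better=True):
    """Return the logged record with the best value for `metric`, or None."""
    with Path(log_path).open(encoding="utf-8") as f:
        runs = [json.loads(line) for line in f if line.strip()]
    # Ignore runs that never reported this metric.
    runs = [r for r in runs if metric in r["metrics"]]
    if not runs:
        return None
    pick = max if higher_is_better else min
    return pick(runs, key=lambda r: r["metrics"][metric])
```

Calling `log_run("runs.jsonl", {"lr": 0.01}, {"val_accuracy": 0.88})` at the end of every training script is already enough to stop copying numbers by hand, and `best_run("runs.jsonl", "val_accuracy")` answers the question a spreadsheet is usually kept for.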
The simplest diagram that represents MLOps is the following:
What is so different from DevOps?
The main differences derive from the key role that data plays in the algorithms. Unlike traditional software, where you programmed rules based on the developer’s observations of the data, now the inference algorithm is defined by the training algorithm’s usage of data, and the more data you have the better. The consequences of this include:
- Whenever you want to improve an algorithm you will probably want to include new data if available
- Data needs to be labeled and validated or else you can hinder your algorithm’s performance
- Real data drifts, so you need to update your algorithm every now and then
- Data is subject to a lot of regulation and additional security measures ought to be taken
- Data requires a lot of storage, and transferring it can be costly. Never underestimate this.
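As an illustration of the labeling point in the list above, this is a minimal sketch of an automated label check, assuming labels arrive as `(item_id, label)` pairs and that you know the set of allowed classes. The function name and the two kinds of problems it reports are my own choices for the example; real validation usually involves more than this.

```python
from collections import defaultdict


def validate_labels(examples, allowed_labels):
    """Report unknown labels and items that received conflicting labels.

    `examples` is an iterable of (item_id, label) pairs.
    Returns (unknown, conflicts): a list of offending pairs and a dict
    mapping item_id to the sorted list of distinct labels it received.
    """
    allowed = set(allowed_labels)
    unknown = []
    labels_by_item = defaultdict(set)
    for item_id, label in examples:
        if label not in allowed:
            unknown.append((item_id, label))
        labels_by_item[item_id].add(label)
    conflicts = {
        item_id: sorted(labels)
        for item_id, labels in labels_by_item.items()
        if len(labels) > 1
    }
    return unknown, conflicts
```

Running a check like this before every training job is cheap, and it catches typos in class names and annotator disagreements before they quietly hurt the model.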
Additionally, one of the key advantages of machine learning is that it can learn complex relationships with otherwise non-trivial features and use them to predict a target variable’s behavior or make some sort of classification. The downside is that this is usually really compute intensive, as “learning complex relationships” basically means doing a lot of matrix operations. GPUs, which tend to be very good and fast at matrix operations, also tend to be significantly more expensive than CPUs. This forces us to be extremely careful with hardware resources for both training and inference. In many companies, traditional software good practices are out of the question because of hard limits on the available resources. This can include auto-scaling and deployments with no downtime (these models can take even a few minutes to load into memory, and a single instance might not be able to hold two models at once because it just does not have enough memory).
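The memory constraint is easy to check with back-of-the-envelope arithmetic. The sketch below assumes the weights dominate the footprint (parameter count times bytes per parameter) and adds a flat allowance for activations and framework buffers. The 20% `overhead_fraction` is a placeholder I picked for illustration, not a measured value, so measure your own model before relying on it.

```python
def model_memory_bytes(num_parameters, bytes_per_parameter=4):
    """Rough weight footprint: parameters times bytes each (4 for float32)."""
    return num_parameters * bytes_per_parameter


def can_swap_without_downtime(num_parameters, device_memory_bytes,
                              bytes_per_parameter=4, overhead_fraction=0.2):
    """True if the old and the new model copy fit on the device at once.

    `overhead_fraction` is an assumed allowance for activations and
    framework buffers; it is a placeholder, not a measurement.
    """
    one_copy = model_memory_bytes(num_parameters, bytes_per_parameter)
    needed = 2 * one_copy * (1 + overhead_fraction)
    return needed <= device_memory_bytes
```

Under these assumptions, a 1.5 billion parameter float32 model needs roughly 14.4 GB to hold two copies, which fits on a 16 GiB device, while a 3 billion parameter model does not. In the second case a zero-downtime swap needs a second instance, which is exactly the kind of cost you want to know about in advance.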
What is the value added by MLOps?
If done correctly, it can enable your organization to improve your model’s performance at a very low cost. You can significantly improve your development speed if:
- you successfully implement a pipeline which allows you to use production data to re-train your model
- reproducing experiments is easy to do. This involves:
  - spinning up the necessary hardware (as you probably turned your training instance off to save a few bucks)
  - getting the exact version of the dataset you used to train originally
  - knowing exactly which parameters you used in the original training
- experimenting is not a time-consuming task because you were able to automate hyperparameter search, rolling back to previous experiments, etc.
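The reproducibility items above come down to recording, for every run, which data and which parameters were used. Here is a minimal sketch, assuming the dataset is a directory of files small enough to read into memory. The names `dataset_fingerprint` and `run_manifest` are hypothetical, and a dataset versioning tool would handle this far more efficiently for large data.

```python
import hashlib
from pathlib import Path


def dataset_fingerprint(data_dir):
    """Combine every file's relative path and contents into one SHA-256 digest."""
    digest = hashlib.sha256()
    root = Path(data_dir)
    # Sort so the result does not depend on filesystem listing order.
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        # Reads the whole file at once; stream in chunks for large files.
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


def run_manifest(data_dir, params, seed):
    """Bundle what is needed to tell whether two runs used the same setup."""
    return {
        "dataset_sha256": dataset_fingerprint(data_dir),
        "params": dict(params),
        "seed": seed,
    }
```

If you store the manifest next to each trained model, then checking whether a re-run really matches the original is a simple equality test between two dictionaries.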
Additionally, you can potentially avoid issues such as data drift: data slowly changing in production from what your model was trained on.
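One simple way to put a number on drift for a single numeric feature is the Population Stability Index (PSI), which compares how the training-time values and the production values spread across the same bins. The sketch below is one possible implementation under my own assumptions: equal-width bins over the reference range, out-of-range values clamped into the edge bins, and a small floor to avoid taking the log of zero. It is only one of many ways to detect drift.

```python
import math


def population_stability_index(reference, production, bins=10):
    """PSI between two non-empty numeric samples.

    Bins are equal-width over the reference range; production values
    outside that range are counted in the first or last bin.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # guard against a constant reference

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    ref = proportions(reference)
    prod = proportions(production)
    return sum((p - r) * math.log(p / r) for r, p in zip(ref, prod))
```

A PSI of 0 means both samples fall into the bins in identical proportions. A commonly quoted rule of thumb treats values above roughly 0.25 as a significant shift, but that threshold is a convention rather than a guarantee, so calibrate it on your own data before wiring it to an alert.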
What skills does your team need to do MLOps?
It is not too strange to find a company that has a data science or machine learning team and a DevOps team. Then, how do you get from there to MLOps?
MLOps requires taking ownership of the entire data pipeline, from the first dataset used to train the first model to the production data on which the model makes its predictions. Modeling and training involve a lot of small details that make all the difference, so whoever is in charge of MLOps in your organization had better be well informed about them.
Also, handling large volumes of data and correctly sizing hardware is not an easy task, so you need someone who really knows infrastructure and can leverage cloud computing services like Databricks, SageMaker, etc. Otherwise you might end up paying large cloud bills caused by unnecessary data transfers or storage, idle instances, etc. Docker does not play too well with GPUs either (at the time of writing, for example, GPUs are not supported by docker-compose), and you need someone who is able to sort out compatibility issues between low-level libraries (cuDNN, CUDA, etc.).
Having the right person or team in charge of MLOps can have a large impact on business metrics. It allows you to focus on what really matters when trying to improve performance. Along these lines, I recommend this Andrew Ng lecture, where he basically recommends improving the data rather than the model’s architecture or parameters when trying to improve performance.
Let’s get to business
The good news about MLOps is that many tools are blooming at the moment, some taking care of really specific tasks and others of the entire pipeline. The downside is that the market is still not mature, and there are no clear market leaders for most of the issues. Keep in mind that this makes it hard to give a recommendation since, as I see it, there is no recipe or perfect solution. Any recommendation should also be revisited frequently, given how fast these (and new) tools change.
For specific use cases I recommend tools like DVC for dataset versioning, Airflow for workflow orchestration, Kubernetes and Kubeflow for deployment orchestration, MLflow for experiment tracking (although it is intended for more than just that), Weights & Biases also for experiment tracking, and Spark (and PySpark) for data processing.
In this link you can find a more exhaustive list of the different tools out there.