MLOps@Azure: Architectural overview
Recently, I designed an MLOps pipeline for Azure [1,2,3,4]. The use case involved deploying several Jupyter notebooks whose reporting functionality had to be accessible from a browser. Here are my takeaways.
Code
Production readiness: Deploying Jupyter notebooks in a cloud environment is possible but has several drawbacks. Notebooks are mainly intended for interactive (non-sequential) use and mix the actual logic with code for plotting figures and formatting text. Because a notebook file also stores cell outputs and metadata, its diffs are hard to read in version control systems, which are required for traceability in production. Consequently, it is a good idea to convert the notebooks to Python (.py) files (via jupyter nbconvert --to python <initial file>) and refactor the resulting files [5].
Some guiding aspects:
— Remove all code related to figure plotting
— Introduce functions for workflow steps (e.g. create_csv(), generate_test_data(), train_model())
— Add functionality for specifying relevant parameters externally (e.g. command line parameters, configuration file)
— Make the import of data flexible (e.g. via command line arguments or by posting to a REST service component created with FastAPI/Flask)
— Export relevant results to a file (or somewhere else)
— Separate the training and inference steps into different files (e.g. by exporting the trained model in the training file and importing it in the corresponding inference file)
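As an illustration of the points above, here is a minimal sketch of what a refactored training file could look like. It uses only the standard library; the toy linear model, the parameter names (--n-samples, --seed, --output-dir) and the file names model.pkl and summary.json are assumptions made for this example, not part of the original project.

```python
import argparse
import json
import pickle
import random
from pathlib import Path


def generate_test_data(n_samples, seed):
    """Create synthetic (x, y) pairs following y = 2x + 1 plus small noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_samples):
        x = rng.uniform(0.0, 10.0)
        data.append((x, 2.0 * x + 1.0 + rng.gauss(0.0, 0.1)))
    return data


def train_model(data):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in data)
    variance = sum((x - mean_x) ** 2 for x, _ in data)
    slope = covariance / variance
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


def parse_args(argv=None):
    """Read the workflow parameters from the command line."""
    parser = argparse.ArgumentParser(description="Train and export a toy model.")
    parser.add_argument("--n-samples", type=int, default=200)
    parser.add_argument("--seed", type=int, default=0)
    parser.add_argument("--output-dir", type=Path, default=Path("artifacts"))
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    args.output_dir.mkdir(parents=True, exist_ok=True)

    data = generate_test_data(args.n_samples, args.seed)
    model = train_model(data)

    # Export the trained model so a separate inference file can import it.
    model_path = args.output_dir / "model.pkl"
    with model_path.open("wb") as handle:
        pickle.dump(model, handle)

    # Export the relevant results to a file instead of plotting them.
    summary = {"n_samples": args.n_samples, "seed": args.seed, **model}
    (args.output_dir / "summary.json").write_text(json.dumps(summary, indent=2))
    return model_path
```

Calling main(["--n-samples", "500", "--output-dir", "out"]) runs the whole workflow; in a real script one would typically call main() from an if __name__ == "__main__" guard. A separate inference file would then only need to load model.pkl with pickle.load and apply the stored slope and intercept.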
An interesting question is how to sustain these code adjustments, so that you do not have to update the Python files manually whenever the reference notebook changes. Technically, this can be solved by structuring your code so that interactive and non-interactive code live in different notebooks. However, if different people (e.g. data scientist and MLOps engineer) are involved, it also boils down to finding a solution that everyone feels comfortable with.
Production shipment: The standard approach is to dockerize your app, which then runs in the cloud on a virtual machine or Kubernetes cluster (e.g. created with the help of Ansible/Terraform/Helm charts). The images can be built via Jenkins/GitLab/Azure DevOps, pushed to a registry such as Azure Container Registry or JFrog (Marketplace/self-hosted), and deployed from there to the host environment.
Some guiding aspects:
— Docker images should be built according to best practices (such as minimal image size, intuitive directory structure) and contain only one workflow
— Infrastructure files (such as Helm charts) should be validated early on with the corresponding commands (e.g. helm lint) and checked on a toy Kubernetes cluster (e.g. MicroK8s)
— Each workflow should have its own pipeline.
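The early validation mentioned above can be wired into a build step with a few lines of Python. The following is only a sketch: the helper names and the chart and image names used in the usage note are hypothetical, and the tools are skipped when they are not installed rather than treated as a failure.

```python
import shutil
import subprocess


def helm_lint_command(chart_dir):
    """Command line that validates a Helm chart directory."""
    return ["helm", "lint", chart_dir]


def docker_build_command(image_tag, context_dir="."):
    """Command line that builds one image for one workflow."""
    return ["docker", "build", "--tag", image_tag, context_dir]


def run_if_available(command):
    """Run the command and return its exit code, or None if the tool is missing."""
    if shutil.which(command[0]) is None:
        return None  # tool not on PATH, e.g. on a developer laptop without Helm
    return subprocess.run(command, check=False).returncode
```

A CI job could call run_if_available(helm_lint_command("charts/report")) before run_if_available(docker_build_command("report:0.1.0")) and stop as soon as one of them returns a non-zero exit code.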
Data
Production readiness: The key factors for integrating data into the overall design are the kind of data (structured text, unstructured text, image, video, …), the kind of processing (batch vs. streaming) and how the data is retrieved (internal/external partners, few/many data sources, …). Text-based data (numerical, categorical, time series data) can be readily persisted in a suitable database or messaging platform. In the more general case, too, versioning of data (e.g. with DVC) and transfer to the cloud (e.g. to Azure Blob Storage) are possible. Overall, such technical challenges are quite manageable. However, in many cases you will have to worry about securing your (business-sensitive) data against access by third parties, including your cloud provider. Closely related to that, data governance is crucial: for example, you have to set up appropriate Azure policies to comply with applicable legal regulations such as GDPR.
A guiding aspect:
— Governance and security aspects are first-class citizens and should be incorporated in the initial design.
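To make the idea of data versioning mentioned above more concrete, here is a conceptual sketch based on content hashes. It is not how DVC is implemented or configured; it only illustrates the underlying idea that a data set version can be identified by the hashes of its files. All function names are assumptions for this example.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 digest of a file, read in chunks to keep memory usage flat."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir):
    """Map each file below data_dir (by relative path) to its content digest."""
    root = Path(data_dir)
    return {
        path.relative_to(root).as_posix(): file_digest(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(old_manifest, new_manifest):
    """Files that were added, removed or modified between two manifests."""
    names = old_manifest.keys() | new_manifest.keys()
    return sorted(n for n in names if old_manifest.get(n) != new_manifest.get(n))
```

Storing such a manifest next to the code in version control makes it possible to state which data a given model was trained on, without committing the data itself.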
Production shipment: The proper way to transfer data to the cloud depends on your infrastructure and requirements. Once in the cloud, the data is initially stored in a data lake or a comparable store (Azure Data Lake, Azure Cosmos DB, or MongoDB/PostgreSQL/Kafka/Spark via Marketplace or containerized). From there, it runs through a data pipeline (Azure Data Factory, containerized services), which most notably includes steps for data cleaning and validation. The resulting output is stored in a feature store (Azure Machine Learning, or MongoDB/PostgreSQL/Kafka/Spark via Marketplace or containerized), which provides the input data for the actual machine learning process.
A guiding aspect:
— A central design question is whether your business case is built around online learning (with streaming data), offline learning (with batch data), or a mixture of both.
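The cleaning and validation step of the data pipeline described above can be sketched as a small batch function. The record layout (sensor_id, value) and the plausibility range are hypothetical and only serve as an example; a real pipeline would take its rules from the business case.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical plausibility range for the example feature.
VALUE_MIN = -100.0
VALUE_MAX = 100.0


@dataclass(frozen=True)
class Reading:
    sensor_id: str
    value: float


def clean_record(raw: dict) -> Optional[Reading]:
    """Return a validated Reading, or None if the raw record is unusable."""
    sensor_id = raw.get("sensor_id")
    if not isinstance(sensor_id, str) or not sensor_id.strip():
        return None
    try:
        value = float(raw["value"])
    except (KeyError, TypeError, ValueError):
        return None
    # The range check also rejects NaN, since comparisons with NaN are False.
    if not (VALUE_MIN <= value <= VALUE_MAX):
        return None
    return Reading(sensor_id.strip(), value)


def run_cleaning_step(raw_records):
    """Split a batch into cleaned readings and rejected raw records."""
    cleaned, rejected = [], []
    for raw in raw_records:
        reading = clean_record(raw)
        if reading is None:
            rejected.append(raw)
        else:
            cleaned.append(reading)
    return cleaned, rejected
```

Keeping the rejected records instead of silently dropping them makes it easier to audit the pipeline later, which ties in with the governance aspect above.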
Cloud architecture
For supervised learning, the actual machine learning process consists of a training, a validation and an inference pipeline. During the training step, the model learns from data that is labeled with respect to the target of interest. The accuracy of the resulting model is then checked in the validation pipeline. For these two tasks, the initial data set is split, a common choice being 70 percent for training and 30 percent for validation. The trained model can be exported in a serialization format such as pickle and archived in a suitable storage solution (JFrog, Azure Machine Learning). In the next step, the model can be downloaded and imported for the actual production (inference) task. Thus, training and inference can in principle be completely decoupled. However, monitoring the model in production gives valuable insights into its performance and can detect drifts [6], which reduce the predictive accuracy of the model. The newly collected data can be fed back into the data pipeline and used to retrain the model. Consequently, the overall workflow has a loop-like structure, with training and inference iteratively improving each other.
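The loop described above can be sketched end to end with the standard library. Everything here is a toy under stated assumptions: the data is synthetic, the "model" is a single threshold on one feature, and the drift score is a deliberately crude measure (shift of the feature mean in units of the training standard deviation), not an established drift test.

```python
import pickle
import random
import statistics


def make_rows(n, seed, shift=0.0):
    """Synthetic (feature, label) rows: class 0 centred at 0, class 1 at 4."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        label = i % 2
        rows.append((rng.gauss(4.0 * label, 1.0) + shift, label))
    return rows


def split_data(rows, train_fraction=0.7, seed=0):
    """Shuffle and split the rows into a training and a validation part."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def train_model(train_rows):
    """Place the decision threshold midway between the two class means."""
    xs = [x for x, _ in train_rows]
    mean_0 = statistics.fmean([x for x, y in train_rows if y == 0])
    mean_1 = statistics.fmean([x for x, y in train_rows if y == 1])
    return {
        "threshold": (mean_0 + mean_1) / 2.0,
        # Training statistics are stored with the model for later drift checks.
        "train_mean": statistics.fmean(xs),
        "train_stdev": statistics.pstdev(xs),
    }


def predict(model, x):
    return int(x > model["threshold"])  # assumes class 1 has the larger mean


def accuracy(model, rows):
    return sum(predict(model, x) == y for x, y in rows) / len(rows)


def drift_score(model, production_values):
    """Shift of the production feature mean, in training standard deviations."""
    shift = statistics.fmean(production_values) - model["train_mean"]
    return abs(shift) / model["train_stdev"]
```

In this sketch, the training side would call split_data, train_model and accuracy and then archive pickle.dumps(model); the inference side would restore the model with pickle.loads, call predict, and periodically compute drift_score on recent inputs to decide whether retraining is due.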
For completeness, the customer web interface can be implemented with a web framework (e.g. Flask) and the resulting app deployed to Azure App Service [7].
Some guiding aspects:
— One rather general aspect (which is quite often forgotten): There is no inherent performance boost from running applications in the cloud. In practice, it is quite possible that your workload actually runs slower than on your on-premises infrastructure (depending on many factors such as the allocated hardware, network traffic, …). In the spirit of “no free lunch”, this means that you really have to invest in the design of your cloud applications to get the best possible performance.
— Another rather fundamental aspect is the coupling to the cloud provider, which can result in vendor lock-in. For complex applications (with a lot of assets belonging to your company) you should avoid deep integration into the Azure cloud stack (such as Azure Machine Learning Studio). Otherwise, it could become very costly to switch to a different provider, for instance after changes in pricing or terms of service. The same reasoning also applies to decisions on tooling (e.g. databases, frameworks, …).
— An ML pipeline has a lot of moving parts and involves a lot of tools (e.g. for version control and artifact management) [8]. Consequently, it does not make sense to start by implementing a fully automated pipeline; it is better to get there in intermediate steps (a ‘bottom-up’ approach). This idea is captured by the ‘maturity model’ [9], which suggests increasing the degree of automation and sophistication incrementally.
— Popular frameworks for ML lifecycle management include Kubeflow and MLflow.
— The key best practices are velocity (of the pipeline), validation (to find bugs) and versioning (of all relevant artifacts).
— The release strategy is often quite important and there are several well-established options to choose from [10].
Conclusion
We have discussed the building blocks of an MLOps pipeline. Since it is usually far more complex than a common DevOps pipeline, design decisions need more careful consideration, and the development and deployment strategies play a bigger role.
[1] https://ml-ops.org
[2] https://github.com/EthicalML/awesome-production-machine-learning
[3] https://github.com/microsoft/MLOps
[4] https://learn.microsoft.com/en-us/azure/machine-learning/v1/concept-azure-machine-learning-architecture?source=recommendations&view=azureml-api-1
[5] https://learn.microsoft.com/en-us/azure/machine-learning/v1/how-to-convert-ml-experiment-to-production?view=azureml-api-1
[6] https://www.datacamp.com/tutorial/understanding-data-drift-model-drift
[7] https://www.analyticsvidhya.com/blog/2020/10/how-to-deploy-machine-learning-models-in-azure-cloud-with-the-help-of-python-and-flask/
[8] https://learn.microsoft.com/en-us/azure/architecture/example-scenario/mlops/aml-decision-tree
[9] https://learn.microsoft.com/en-us/azure/architecture/example-scenario/mlops/mlops-maturity-model
[10] https://neptune.ai/blog/model-deployment-strategies