
Building ML Infrastructure and Deploying Production-Ready Models

Course Overview

This course is designed to guide you through the critical steps needed to transform machine learning prototypes into reliable, scalable, and maintainable production systems.

Participants will learn how to move beyond experimentation in Jupyter notebooks and develop a deep understanding of the practical challenges involved in deploying machine learning models in real-world environments — both in the cloud and on-premise infrastructures.

The course follows the complete MLOps lifecycle, focusing on:

  • Productionizing machine learning code: understanding why notebooks are excellent for research but insufficient for production, and how to structure code for scalability and reliability.
  • Model deployment strategies: exploring different ways to serve machine learning models using APIs, and choosing appropriate protocols such as REST, gRPC, WebSocket, or MQTT depending on the application's needs.
  • Containerization and virtualization: using Docker to package applications for reproducibility and portability across different environments.
  • Container Orchestration: managing multi-service systems with Docker Compose, and understanding how Kubernetes supports large-scale, distributed deployments.
  • Infrastructure management: comparing cloud and on-premise deployments, and automating infrastructure creation with Infrastructure as Code tools.
  • Data storage and versioning: introducing object storage solutions such as S3-compatible systems to manage datasets, model artifacts, and logs, ensuring reproducibility and auditability.
  • CI/CD pipelines for ML: automating code testing, container builds, and model deployment using modern CI/CD systems, focusing on GitHub Actions.
  • Data pipelines and orchestration frameworks: designing data-centric workflows using tools like Airflow, Dagster, and Prefect to support ingestion, preprocessing, and model retraining.

By the end of this course, participants will have a clear understanding of how to bridge the gap between machine learning research and production-grade AI systems.

They will be equipped with the knowledge and best practices to build machine learning services that are reliable, maintainable, scalable — and ready for real-world deployment.

Table of Contents

  1. From Notebooks to Production Code
  2. Model Deployment and Serving
  3. Containers and Virtual Machines
  4. Introduction to Container Orchestration
  5. Infrastructure Management: Cloud, On-Premise and Infrastructure as Code
  6. Scalable Storage for Machine Learning: Object Storage
  7. Continuous Integration and Continuous Deployment (CI/CD) for Machine Learning
  8. Data Pipelines and Orchestration Frameworks

1. From Notebooks to Production Code

Machine learning development often begins in Jupyter notebooks, which are a powerful tool for data exploration, visualization, and rapid experimentation.

However, they are not ideal for building production-ready machine learning systems. In this section, we explore why notebooks don't scale to production and how to transition toward modular, testable Python code, using a realistic example.

Why notebooks don’t scale to production

Notebooks are great for prototyping, but they introduce a number of limitations when used in collaborative, production-grade environments.

Notebooks are not plain text

Jupyter notebooks are stored as JSON files. This format is:

  • Unfriendly to version control tools like Git (merging is difficult, diffs are unreadable)
  • Fragile: one corrupted cell or metadata block can break the notebook
  • Not human-friendly: changes in code and output are interleaved and hard to review

While tools like nbdime help, the experience is still much less elegant than using standard .py files.

Hard to reuse or test

Notebook code is typically structured top-down, without reusable functions or clean separation of concerns. This means:

  • You cannot easily import logic from one notebook into another
  • There's no consistent way to pass parameters
  • It's difficult to write unit tests for any code inside a notebook

This makes notebooks essentially “black boxes” — similar to Excel sheets with hidden formulas.

Poor integration with tooling

Most tools in the Python ecosystem — linters, formatters like black, IDEs, static analyzers — are designed for .py files, not notebooks. Code in a notebook:

  • Cannot be formatted consistently
  • Often lacks docstrings or typing
  • Cannot be validated automatically by CI/CD

No clear interface

Notebook cells don’t expose standard function interfaces or entry points. It’s possible to run them with tools like Papermill or nbconvert, but this often involves non-standard hacks (e.g. parameterizing with environment variables).

This lack of a defined entrypoint makes notebooks unsuitable as production APIs, training jobs, or inference services.

What to do instead

The recommended pattern is to:

  • Move your logic into modular Python code (organized in .py files)
  • Use notebooks only as thin wrappers that call this logic and visualize outputs

This hybrid approach allows you to retain the strengths of notebooks (exploration and visualization), while unlocking all the benefits of modern software engineering practices.

Benefits of refactoring into Python modules

  • You can version and diff your code using Git
  • You can write unit tests and run them automatically
  • You can reuse the same functions in notebooks, training scripts, APIs, and pipelines
  • You can run code in CI/CD, Docker, and orchestration frameworks
  • You can separate your logic cleanly from your visualizations

Notebooks as thin wrappers

Instead of putting all your code into a notebook, structure your project like this:

  • All business logic lives in data.py, model.py, visualize.py, etc.
  • The notebook becomes a thin orchestration layer:
    • Calls functions from modules
    • Loads inputs, trains model...
    • Displays plots or tables
  • Scripts like train_model.py or run_inference.py can automate training or prediction workflows. Each script serves a single purpose, without the clutter of a notebook.

This makes your project easier to understand, easier to maintain, and ready for production.

Example: Refactoring a messy notebook

Let’s take a typical, messy machine learning notebook and refactor it into clean, modular components.

Original notebook (all-in-one, exploration phase)

# cell 1: install dependencies
!pip install pandas scikit-learn matplotlib seaborn

# cell 2: import libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression

# cell 3: load data
df = pd.read_csv("data.csv")
X = df[["feature1", "feature2"]]
y = df["target"]

# cell 4: visualize data
sns.pairplot(df)
plt.title("Input feature distribution")
plt.show()

# cell 5: train model
model = LinearRegression()
model.fit(X, y)

# cell 6: plot model output
plt.scatter(X["feature1"], y, label="True")
plt.plot(X["feature1"], model.predict(X), color="red", label="Prediction")
plt.legend()
plt.title("Regression Results")
plt.show()

# cell 7: save model
import joblib
joblib.dump(model, "model.joblib")

This is very common in practice, but has serious limitations:

  • Everything is mixed together (data, training, visualization, saving)
  • Impossible to test
  • Hard to reuse
  • Impossible to use in CI/CD or deployment

Refactor into reusable modules

Project structure:

mlproject/
├── data.py
├── model.py
├── visualize.py
├── train_model.py
├── run_inference.py
├── requirements.txt
├── tests/
│   └── test_model.py
└── notebooks/
    └── explore_and_plot.ipynb

Each file now has a single responsibility. Don’t focus too much on the exact filenames — this is just an example to illustrate the idea. You have a lot of creative freedom in how you separate your code.

data.py

Handles loading data and extracting features. In a more complex scenario, the data could, for example, be downloaded from object storage (S3), and the code might include functions to clean up or preprocess the data.

import pandas as pd

def load_data(path="data.csv"):
    df = pd.read_csv(path)
    X = df[["feature1", "feature2"]]
    y = df["target"]
    return X, y

def load_raw_df(path="data.csv"):
    return pd.read_csv(path)

model.py

Encapsulates the model training, saving, loading, and prediction logic.

from sklearn.linear_model import LinearRegression
import joblib

def train_model(X, y):
    model = LinearRegression()
    model.fit(X, y)
    return model

def save_model(model, path="model.joblib"):
    joblib.dump(model, path)

def load_model(path="model.joblib"):
    return joblib.load(path)

def predict(model, X):
    return model.predict(X)

visualize.py

Keeps plotting logic clean and reusable, separate from model code.

import matplotlib.pyplot as plt
import seaborn as sns

def plot_input_distribution(df):
    sns.pairplot(df)
    plt.title("Input feature distribution")
    plt.show()

def plot_prediction(X, y, y_pred):
    plt.scatter(X["feature1"], y, label="True")
    plt.plot(X["feature1"], y_pred, color="red", label="Prediction")
    plt.legend()
    plt.title("Regression Results")
    plt.show()

train_model.py

Simple script to automate model training and saving.

from data import load_data
from model import train_model, save_model

X, y = load_data()
model = train_model(X, y)
save_model(model)

This is the kind of script that can run in CI/CD — no visualizations, just logic.
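If you want to parameterize the script, for example to point it at different datasets in CI, a small command-line interface is a natural extension. The following is only a sketch; the argument names are illustrative and not part of the project above.

import argparse

from data import load_data
from model import train_model, save_model

def main():
    parser = argparse.ArgumentParser(description="Train and save the regression model")
    parser.add_argument("--data", default="data.csv", help="Path to the input CSV file")
    parser.add_argument("--output", default="model.joblib", help="Where to save the trained model")
    args = parser.parse_args()

    X, y = load_data(args.data)
    model = train_model(X, y)
    save_model(model, args.output)

if __name__ == "__main__":
    main()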

run_inference.py

Performs inference using a trained model.

from data import load_data
from model import load_model, predict

X, _ = load_data()
model = load_model()
y_pred = predict(model, X)

print(y_pred[:5])

Useful in production scenarios, or when exposing model predictions via a REST API.

tests/test_model.py

Unit tests made possible by modular code.

from data import load_data
from model import train_model, predict

def test_model_training_and_prediction():
    X, y = load_data()
    model = train_model(X, y)
    preds = predict(model, X)

    assert len(preds) == len(y)
    assert preds[0] is not None

This can now be executed using:

pytest tests/
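If you prefer tests that do not depend on data.csv being available, you can also build a small synthetic dataset directly inside the test. A possible variant (not part of the project above) could look like this:

import pandas as pd

from model import train_model, predict

def test_model_on_synthetic_data():
    # Small synthetic dataset with a perfectly linear target
    X = pd.DataFrame({"feature1": [1.0, 2.0, 3.0, 4.0], "feature2": [0.5, 1.0, 1.5, 2.0]})
    y = 2 * X["feature1"] + 3 * X["feature2"]

    model = train_model(X, y)
    preds = predict(model, X)

    assert len(preds) == len(y)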

Refactored notebook (cleaned up)

The notebook now becomes a clean interface for analysis and visualization.

from data import load_data, load_raw_df
from model import train_model, predict
from visualize import plot_input_distribution, plot_prediction

df = load_raw_df()
plot_input_distribution(df)

X, y = load_data()
model = train_model(X, y)
y_pred = predict(model, X)

plot_prediction(X, y, y_pred)

This makes the notebook readable, maintainable, and focused — while all reusable logic lives in modules.

Python packages: requirements.txt

Instead of installing packages inside your notebook using !pip install ..., it’s better to define your dependencies in a dedicated file called requirements.txt. This is a standard practice in Python projects.

This approach has several benefits:

  • It keeps your notebook clean and focused on logic, not package management.
  • It allows others (or automation tools) to install everything with a single command: pip install -r requirements.txt
  • It enables better version control and reproducibility.

Pinned versions ensure that everyone uses the exact same library versions, which is essential for consistent results across environments.

pandas==2.2.3
scikit-learn==1.6.1
matplotlib==3.10.1
seaborn==0.13.2

Even minor version changes can introduce bugs or incompatibilities — especially in fast-evolving libraries like pandas or scikit-learn.

To generate this file automatically from your current environment, run:

pip freeze > requirements.txt

Virtual environments

To avoid polluting your system Python and to isolate dependencies per project, it’s a good idea to use a virtual environment. This ensures that packages are installed locally for the project rather than system-wide.

Typical usage:

python3 -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -r requirements.txt

You can add .venv/ to your .gitignore so that it's not tracked in version control.

We’ll take this further in later chapters by containerizing the entire environment using Docker, which captures not just Python packages but also the underlying system dependencies. In fact, requirements.txt will still play a key role inside Docker: it will be used to install the exact Python dependencies when building Docker images.

Maintaining a clean requirements.txt and using virtual environments are great first steps toward building reliable and reproducible ML workflows.

Notebooks conclusion

By refactoring notebooks into Python modules:

  • You gain maintainability, testability, and automation
  • You enable CI/CD and containerization
  • You keep the benefits of notebooks (for visuals and exploration)
  • You future-proof your projects for production use

This principle — separating core logic from notebooks — is foundational for everything we’ll cover next.


2. Model Deployment and Serving

In previous courses, we've explored how to train machine learning models using datasets, notebooks, and Python scripts. However, training a model is only part of the journey. The next crucial step is making the model accessible to users or other systems — a process known as model serving.

Client-Server vs. Edge Deployment

There are two primary approaches to deploying machine learning models:

Server-Side Deployment

This is the most common method. The trained model is hosted on powerful server hardware, and clients — such as web browsers, mobile apps, or other services — send requests to the server to obtain predictions. This approach offers several advantages:

  • Centralized Control: Easier to update and maintain the model.
  • Resource Availability: Servers typically have more computational power.
  • Security: Data can be processed in a controlled environment.

Edge Deployment

In this approach, the model is deployed closer to the data source, such as on a user's device or an embedded system. Advantages include:

  • Reduced Latency: Faster responses as data doesn't need to travel to a server.
  • Offline Functionality: Models can operate without an internet connection.
  • Privacy: Sensitive data doesn't leave the local device.

While edge deployment has its benefits, it also presents challenges, such as limited computational resources and difficulties in updating models. For now, we'll focus on server-side deployment, with edge deployment covered in a subsequent course.

Communication Protocols for Model Serving

To make a server-hosted model accessible, we need to establish a communication protocol between the client and the server. Let's explore the most relevant protocols:

REST APIs

REST (Representational State Transfer) is a widely adopted architectural style for designing networked applications. It leverages standard HTTP methods, making it compatible with a vast array of clients.

  • GET: Retrieve data from the server.
  • POST: Submit data to the server, often resulting in a new resource.
  • PUT: Update or replace an existing resource.
  • DELETE: Remove a resource from the server.

These methods align with CRUD (Create, Read, Update, Delete) operations, providing a clear and intuitive interface for clients. REST's stateless nature ensures scalability, as each request contains all the information needed for processing.

Example:

Imagine a model that predicts house prices. A client might send a POST request to /predict with a JSON payload containing features like square footage and location. The server processes this request and returns a JSON response with the predicted price.

Advantages:

  • Simplicity: Easy to understand and implement.
  • Wide Adoption: Supported by most programming languages and tools.
  • Scalability: Statelessness facilitates horizontal scaling.

Considerations:

  • Overhead: JSON payloads can be verbose, leading to increased bandwidth usage.
  • Latency: Each request-response cycle involves establishing a new connection, which can add latency.

gRPC

gRPC is a high-performance, open-source RPC (Remote Procedure Call) framework developed by Google. It uses Protocol Buffers (Protobuf) for efficient serialization, resulting in faster communication and smaller payloads compared to traditional formats like JSON.

  • Statically Typed: Protobuf enforces a strict schema, enhancing type safety and reducing errors.
  • Efficient Serialization: Binary format ensures compact messages, ideal for high-throughput systems.
  • HTTP/2 Support: Enables multiplexing and improved performance over traditional HTTP/1.1.

gRPC is particularly suited for internal microservices communication, where performance and type safety are paramount.

Advantages:

  • Performance: Efficient binary serialization reduces latency.
  • Strong Typing: Early detection of errors through strict schemas.
  • Streaming: Supports client, server, and bidirectional streaming.

Considerations:

  • Complexity: Requires learning Protobuf and setting up additional tooling.
  • Browser Support: Not natively supported in browsers, limiting its use in web applications.

WebSockets

WebSockets provide a full-duplex communication channel over a single, long-lived connection. This allows for real-time data exchange between client and server.

  • Bi-directional Communication: Both client and server can send messages independently.
  • Persistent Connection: Eliminates the overhead of establishing a new connection for each message.
  • Cross-Platform Support: Beyond browsers, many programming languages and platforms support WebSockets, making them viable for edge devices and non-browser clients.

This protocol is ideal for applications requiring immediate feedback, such as live dashboards or chat applications.

Advantages:

  • Real-Time Communication: Enables instant data exchange.
  • Efficiency: Reduces overhead by maintaining a single connection.
  • Firewall-Friendly: Operates over standard HTTP ports, easing traversal through firewalls.

Considerations:

  • Resource Management: Maintaining numerous open connections can strain server resources.
  • Complexity: Requires managing connection states and handling reconnections.
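To make the idea concrete, here is a minimal client sketch using the third-party websockets package. It assumes a hypothetical WebSocket endpoint at ws://localhost:8000/ws that accepts a JSON payload and replies with a prediction; the server side is not shown.

import asyncio
import json

import websockets  # third-party package: pip install websockets

async def main():
    # Hypothetical endpoint; any JSON-speaking WebSocket server would work the same way
    async with websockets.connect("ws://localhost:8000/ws") as ws:
        await ws.send(json.dumps({"feature1": 3.5, "feature2": 1.2}))
        reply = await ws.recv()
        print("Server replied:", reply)

asyncio.run(main())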

MQTT

MQTT (Message Queuing Telemetry Transport) is a lightweight messaging protocol designed for low-bandwidth, high-latency, or unreliable networks.

  • Publish/Subscribe Model: Clients can subscribe to topics and receive messages without polling.
  • Minimal Overhead: Designed for constrained devices and networks.
  • Persistent Sessions: Supports message retention and delivery guarantees.

MQTT is prevalent in IoT scenarios, where devices need to send data to a central server with minimal resource consumption.

Advantages:

  • Lightweight: Ideal for devices with limited resources.
  • Scalability: Efficiently handles thousands of concurrent connections.
  • Reliability: Offers different Quality of Service (QoS) levels to ensure message delivery.

Considerations:

  • Limited Browser Support: Not natively supported in web browsers.
  • Security: Requires additional measures to secure communication channels.
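As a small illustration, here is a sketch of publishing a single message with the paho-mqtt helper API. The broker hostname and topic are made up for the example.

import json

import paho.mqtt.publish as publish  # third-party package: pip install paho-mqtt

# Hypothetical broker and topic; QoS 1 asks the broker to acknowledge delivery
publish.single(
    topic="sensors/device42/features",
    payload=json.dumps({"feature1": 3.5, "feature2": 1.2}),
    qos=1,
    hostname="broker.example.com",
)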

OPC UA

OPC UA (Open Platform Communications Unified Architecture) is a machine-to-machine communication protocol for industrial automation.

  • Platform Independence: Works across various operating systems and hardware.
  • Integrated Security: Built-in authentication and encryption mechanisms.
  • Complex Data Modeling: Supports rich data structures and relationships.

While OPC UA is more niche, it's indispensable in industrial settings, facilitating interoperability between diverse systems and devices.

Advantages:

  • Standardization: Widely adopted in industrial automation, ensuring compatibility across vendors.
  • Security: Robust security features tailored for industrial environments.
  • Flexibility: Supports both client-server and publish-subscribe communication models.

Considerations:

  • Complexity: Steeper learning curve compared to other protocols.
  • Overhead: More resource-intensive, which may not be suitable for all applications.

Protocol Comparison Summary

  • REST APIs: Best for web applications requiring stateless, request-response communication.
  • gRPC: Ideal for internal microservices needing efficient, strongly-typed communication.
  • WebSockets: Suitable for real-time applications requiring bi-directional communication.
  • MQTT: Designed for IoT devices needing lightweight, publish-subscribe messaging.
  • OPC UA: Tailored for industrial automation with complex data modeling and security needs.

Choosing the right protocol depends on your specific use case, considering factors like client type, network conditions, and performance requirements.

Introduction to FastAPI

FastAPI is a modern Python web framework designed specifically for building high-performance REST APIs with minimal effort. It's built on top of standard Python type hints, making it both intuitive and fast.

Why is FastAPI so useful?

  • Simplicity: You can define an entire API by writing just a few Python functions and adding decorators.
  • Speed: Thanks to asynchronous capabilities and efficient tooling, it performs extremely well.
  • Automatic Documentation: Every FastAPI application includes a fully interactive API documentation page — powered by Swagger UI — without requiring any extra setup.

FastAPI is especially attractive for data scientists and ML engineers because it allows them to expose models as APIs without switching away from Python.

Creating Routes with FastAPI

Let’s begin with a basic example. Below is a small FastAPI application that exposes a REST API for working with blog posts.

from fastapi import FastAPI
from pydantic import BaseModel
from typing import List

app = FastAPI()

# Example blog post structure
class BlogPost(BaseModel):
    title: str
    content: str

posts = []

@app.get("/posts", response_model=List[BlogPost])
def list_posts():
    return posts

@app.post("/posts")
def create_post(post: BlogPost):
    posts.append(post)
    return {"message": "Post added"}

This API allows you to:

  • GET /posts: retrieve a list of all posts.
  • POST /posts: add a new post.

This example doesn’t use a real database — it stores posts in memory — but it clearly shows the mechanics of route definitions. Each route is defined by a Python function and annotated with a decorator like @app.get(...) or @app.post(...).

You can automatically validate inputs using Pydantic models, like BlogPost, and FastAPI will generate detailed documentation pages for you.

To run your FastAPI application:

  • Development mode: fastapi dev main.py. This enables auto-reload and is ideal during development.

  • Production mode: fastapi run main.py. This disables reload and makes the app accessible on the network.

In larger projects, it's common to version your APIs (e.g. using paths like /v1/posts) to manage changes over time while keeping older clients compatible.
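One way to implement such versioning is with FastAPI's APIRouter, which groups routes under a common prefix. The sketch below is illustrative rather than a full application:

from fastapi import APIRouter, FastAPI

app = FastAPI()
v1 = APIRouter(prefix="/v1")

@v1.get("/posts")
def list_posts_v1():
    return []

# Every route registered on the router is now served under /v1, e.g. /v1/posts
app.include_router(v1)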

Using FastAPI to Serve Model Predictions

Now that you understand the basics of defining routes, let’s look at how you can use FastAPI to serve a trained machine learning model. Here’s a simplified example with a PyTorch model saved to disk.

from fastapi import FastAPI
from pydantic import BaseModel
import torch

app = FastAPI()

# Dummy PyTorch model (replace with your real model)
# weights_only=False lets torch.load restore a full pickled model object
# (recent PyTorch versions default to weights_only=True)
model = torch.load("model.pt", weights_only=False)
model.eval()

class InputData(BaseModel):
    feature1: float
    feature2: float

@app.post("/predict")
def predict(data: InputData):
    with torch.no_grad():
        inputs = torch.tensor([[data.feature1, data.feature2]])
        prediction = model(inputs)
    return {"prediction": prediction.item()}

This exposes a POST endpoint at /predict. A client can send a JSON payload like:

{
  "feature1": 3.5,
  "feature2": 1.2
}

The FastAPI app will pass this input into your PyTorch model and return the prediction in the HTTP response.

This is an easy and effective way to turn a trained ML model into a live, production-ready API.
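For completeness, here is how a Python client could call this endpoint, assuming the service is running locally on port 8000:

import requests

response = requests.post(
    "http://localhost:8000/predict",
    json={"feature1": 3.5, "feature2": 1.2},
)
response.raise_for_status()
print(response.json())  # {"prediction": <model output>}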

Swagger UI: Built-In API Documentation and Testing

One of FastAPI’s most powerful features is the automatic documentation system based on Swagger UI.

Once your FastAPI app is running, you can visit http://localhost:8000/docs in your browser to see:

  • A list of all defined endpoints
  • The request format and response schema for each route
  • Example inputs for every route
  • A Try it out button to execute live HTTP requests from the browser

This means that even if you haven’t written any frontend yet, you or your users can already explore and test your API directly in the browser.

Swagger UI makes it easier to:

  • Understand the structure of the API
  • Debug endpoints during development
  • Share your API with other developers, who can test it immediately
  • Replace tools like Postman for quick manual testing

This automatic documentation is especially useful when exposing your ML models to teams who might not have access to the backend code. It improves collaboration, onboarding, and transparency — all for free, just by using FastAPI.

Frontends for Model APIs

Now that we’ve built a backend REST API to serve model predictions, we need a way for users or systems to interact with it. This means we need a frontend — a client that can send requests to our backend.

There are many kinds of clients: web browsers, smartphone apps, desktop applications, or even embedded devices. Each use case comes with its own trade-offs. In this section, we’ll focus primarily on browser-based frontends, as they are the most accessible and widely used.

JavaScript Frameworks

Frontend frameworks like React and Vue.js are the most common way to build browser applications that consume REST APIs.

These applications are written in JavaScript and run entirely in the browser. When the user interacts with the interface (clicking a button, submitting a form), the frontend sends HTTP requests directly to the REST API — often using AJAX or Fetch.

This architecture is very flexible and scalable. It allows you to completely decouple the frontend from the backend. The backend serves model predictions (or other logic), while the frontend focuses on the user interface.

Frameworks like React, Vue.js, and Svelte dominate this space. We won’t cover them in depth, but it’s important to know they exist — and they’re likely what frontend developers will use to interface with your ML backend.

Python-Based Frontends with Flask

Sometimes, especially in prototyping or internal tools, it’s convenient to write the frontend in Python too.

With a minimal web framework like Flask, you can serve both the frontend and backend from the same language — even the same server. In this setup, the user's browser sends a request to Flask, and Flask itself sends a second request to the model-serving REST API.

Here’s a simplified example:

from flask import Flask, request, render_template_string
import requests

app = Flask(__name__)

TEMPLATE = """
<form action="/predict" method="post">
  <input name="feature1" type="text" />
  <input name="feature2" type="text" />
  <button type="submit">Submit</button>
</form>
<p>Prediction: {{ prediction }}</p>
"""

@app.route("/", methods=["GET"])
def homepage():
    return render_template_string(TEMPLATE, prediction="")

@app.route("/predict", methods=["POST"])
def predict():
    f1 = float(request.form["feature1"])
    f2 = float(request.form["feature2"])
    response = requests.post("http://localhost:8000/predict", json={"feature1": f1, "feature2": f2})
    return render_template_string(TEMPLATE, prediction=response.json()["prediction"])

This is very simple — no JavaScript at all — but it lets you build useful prototypes quickly, especially as a Python developer. In production, the extra hop from browser → Flask → REST API adds some overhead, but for many internal tools, that’s perfectly acceptable.

Dashboarding Frameworks: Dash, Streamlit, and Gradio

If your goal is to quickly prototype an interactive ML application or dashboard, Python dashboarding frameworks are extremely useful.

The most prominent tools in this space are Plotly Dash, Streamlit, and Gradio. They allow you to build rich interfaces without writing any HTML or JavaScript.

The general architecture is similar to Flask: the dashboard runs on the server, and server-side Python code can make calls to your backend REST API.

Here’s a small example using Plotly Dash:

from dash import Dash, html, dcc, Input, Output, State
import requests

app = Dash(__name__)

app.layout = html.Div([
    dcc.Input(id="f1", type="number", placeholder="Feature 1"),
    dcc.Input(id="f2", type="number", placeholder="Feature 2"),
    html.Button("Submit", id="submit"),
    html.Div(id="output")
])

@app.callback(
    Output("output", "children"),
    Input("submit", "n_clicks"),
    # State values are read when the button is clicked, without triggering the callback themselves
    State("f1", "value"),
    State("f2", "value"),
)
def call_api(_, f1, f2):
    if f1 is None or f2 is None:
        return ""
    response = requests.post("http://localhost:8000/predict", json={"feature1": f1, "feature2": f2})
    return f"Prediction: {response.json()['prediction']}"

if __name__ == "__main__":
    app.run(debug=True)

This builds a tiny app with two input boxes and a prediction output. It’s perfect for internal tools, demos, and low-volume use cases.

  • Dash is the most scalable of these tools, but still not suitable for high-traffic applications.
  • Streamlit is great for exploration, with a friendly interface for data scientists.
  • Gradio makes it easy to wrap ML models into interactive components and is especially popular for showcasing models.

These tools are limited in customization compared to full JavaScript frameworks, but they’re very productive — and entirely in Python.
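As an example of how little code these tools require, here is a minimal Gradio sketch that wraps the /predict endpoint from earlier. It assumes the FastAPI service is running locally on port 8000:

import gradio as gr
import requests

def predict(feature1, feature2):
    # Calls the backend REST API introduced earlier in this chapter
    response = requests.post(
        "http://localhost:8000/predict",
        json={"feature1": feature1, "feature2": feature2},
    )
    return response.json()["prediction"]

demo = gr.Interface(fn=predict, inputs=["number", "number"], outputs="text")
demo.launch()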

Beyond the Browser

Browser-based clients aren’t the only consumers of your model APIs.

  • Smartphone apps may use Kotlin, Swift, or Flutter.
  • Desktop applications may use C#, Qt, or Electron.
  • Embedded and IoT devices may use C++, Go, or Rust.

All of these can send HTTP requests (or MQTT messages, etc.) to your model server, just like a browser would.

In this section, we focused on web-based frontends — the most common way to interact with APIs. But remember that any system that can connect to your API over the network can act as a frontend.

Authentication and Access Control

When exposing your model through an API, it's important to consider who should be allowed to access it. Without authentication, anyone with the URL can call your endpoints — potentially overloading your server, accessing sensitive data, or modifying system state through update or delete routes.

Even seemingly harmless GET endpoints can be abused in large volumes, leading to performance issues or denial-of-service attacks. For routes that create or modify data (like POST, PUT, DELETE), access control is absolutely essential.

Authentication is the process of verifying who the user or client is. Authorization comes afterward — determining what that user is allowed to do. For example, a user may be authenticated, but only authorized to read data, not modify it.

Here are several common authentication mechanisms:

HTTP Basic Authentication

This is the simplest form: clients send a username and password with each request, encoded in the HTTP headers. It’s easy to implement but insecure unless combined with HTTPS (encrypted connections). It’s rarely used in production but still useful for internal tools or quick testing.

Bearer Tokens and JWT (JSON Web Tokens)

A more robust method is to issue a token to authenticated users. This token — often a signed JWT — is sent with each request using the Authorization: Bearer <token> header. JWTs can carry user claims (like roles or permissions) and are stateless, meaning the server doesn’t need to store session data.
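As an illustration, FastAPI ships with an HTTPBearer security dependency that extracts the token from the Authorization header. The sketch below uses a placeholder check; a real application would instead verify a signed JWT's signature and expiry with a dedicated library.

from fastapi import Depends, FastAPI, HTTPException
from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer

app = FastAPI()
bearer_scheme = HTTPBearer()

def verify_token(credentials: HTTPAuthorizationCredentials = Depends(bearer_scheme)):
    # Placeholder check only; replace with proper JWT validation in production
    if credentials.credentials != "expected-token":
        raise HTTPException(status_code=401, detail="Invalid or missing token")

@app.get("/secure-data", dependencies=[Depends(verify_token)])
def secure_data():
    return {"message": "You are authenticated"}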

OAuth2

OAuth2 is an industry-standard protocol for delegating access. It’s more complex, but widely used — for example, when logging in with Google or GitHub. It allows users to authenticate through an identity provider and get access tokens to call your API. Many cloud services (like AWS, Azure, Google Cloud) use OAuth2 internally for securing APIs.

Mutual TLS (mTLS)

In typical HTTPS connections, the server has a certificate, but the client does not. In mutual TLS, both server and client authenticate with certificates. This is ideal for machine-to-machine communication, such as embedded devices or PLCs authenticating to your model server. Each device holds a unique private key and certificate, allowing for secure, identity-bound access — without user accounts or passwords.

In this course, we won't go deep into authentication implementation, but it's important to understand the basics — especially when your model is exposed to users, systems, or devices beyond your direct control.


3. Containers and Virtual Machines

In the previous chapters, we discussed how to train machine learning models and expose them through production-ready APIs. But to reliably deploy these services, we also need to package and run them in a controlled and repeatable way — not just on our development machine, but across staging and production systems as well.

This is where virtual machines and containers come into play.

Before diving into Docker and containerization workflows, we first need to understand the basic concepts behind virtual machines and containers, how they differ, and why containerization is such a powerful tool for MLOps.

What is a Virtual Machine?

A virtual machine (VM) is a full emulation of a computer system. It allows you to run an entire operating system as a guest inside another host operating system. This is made possible by a hypervisor (such as VMware, VirtualBox, or KVM) that provides the underlying virtualization layer.

Each VM includes its own:

  • Operating system kernel
  • File system and libraries
  • Networking and system tools
  • Processes, memory, and compute space

Because of this full isolation, virtual machines are very secure and flexible — they can run any OS (Linux, Windows, BSD), independent of the host system. You can use them to simulate complex systems, test different software environments, or create reproducible infrastructure snapshots.

However, VMs come with overhead:

  • Each VM can use several gigabytes of RAM and disk space.
  • Startup time is slow (tens of seconds to minutes).
  • Managing many VMs can be resource-intensive and complex.

What is a Container?

A container, in contrast, does not emulate an entire operating system. Instead, it uses the host OS kernel but provides process and file system isolation for the application it runs.

A container includes:

  • Your application code
  • Its dependencies (e.g. Python packages)
  • Any system-level binaries or shared libraries needed by the app

But it shares the host OS kernel, which makes containers:

  • Much faster to start (milliseconds or seconds)
  • Much lighter on system resources (megabytes instead of gigabytes)
  • Much easier to deploy at scale

This lightweight isolation is what makes containers ideal for building and deploying microservices — each container does one thing, and does it well.

Comparing Virtual Machines and Containers

Let’s briefly compare both technologies in terms of functionality and use cases.

Virtual Machines:

  • Run a full guest OS, including its own kernel.
  • Provide strong isolation and full system emulation.
  • Heavier on disk, memory, and CPU resources.
  • Ideal for infrastructure-level virtualization, OS-level testing, or isolating security-critical workloads.

Containers:

  • Share the host OS kernel.
  • Offer process-level isolation and dependency packaging.
  • Lightweight and faster to start and stop.
  • Perfect for deploying applications and services in a repeatable and portable way.

In practice, it’s common to run containers inside virtual machines. The VM handles system-level concerns (security, OS-level configuration, infrastructure), while containers handle application-level concerns (code, dependencies, runtime environment). This hybrid approach combines the best of both worlds.

Why Containers Matter for Machine Learning

For machine learning workflows, containers solve several real-world problems:

  • Reproducibility: ML models depend on exact versions of libraries like NumPy, PyTorch, and CUDA. A container ensures the entire environment — not just the Python code — is packaged and versioned.
  • Portability: A container that runs on your laptop will also run on a server, on a colleague’s machine, or in a cloud cluster. You don’t need to “set up the environment” every time.
  • Deployment: You can deploy training jobs, APIs, dashboards, or batch scripts as isolated services, without interfering with other tools or environments.
  • Traceability: If a model was trained using container ml-env:2.1.3, you can later re-run that exact container and trace the results — even years later.

These features make containers a foundational tool in any production ML pipeline.

Versioning and Reproducibility

A key strength of containers is the ability to version-control your environments:

  • You can build a container image and tag it (e.g. ml-service:1.0.0).
  • Each tag corresponds to a specific set of libraries, OS packages, and configurations.
  • Anyone can pull the image and run it with consistent behavior, regardless of their underlying system.

If something goes wrong, you can easily roll back to an earlier image. If a colleague needs to reproduce your results, they can run the exact same version of the container.

This aligns closely with the principles of software engineering: deterministic builds, rollback strategies, and environment isolation — all of which are crucial for scalable MLOps.

Common Container Technologies

Before we dive into the details of Docker, it’s important to know that Docker is not the only container technology.

Other popular options include:

  • Podman: A daemonless container engine compatible with Docker’s CLI, often used in more secure environments (e.g. Red Hat).
  • LXC (Linux Containers): A lower-level container runtime focused on system-level isolation. Popular in virtualization platforms like Proxmox.
  • CRI-O: A lightweight container runtime used by Kubernetes as an alternative to Docker, optimized for container orchestration.

Despite these alternatives, Docker remains a widely used container platform, especially in data science and DevOps contexts.

We’ll use Docker throughout this course — not because it’s the only choice, but because it offers a simple, practical, and standard way to build and run containers.

Introduction to Docker

In the previous section, we introduced containers and how they differ from virtual machines. We now turn our attention to Docker, the most widely adopted tool for working with containers.

Docker allows developers to package applications and all their dependencies into a standardized, portable container image. These images can be shared with others and run on any machine that supports Docker, ensuring consistent behavior across environments.

By using Docker, we avoid the classic "it works on my machine" problem. We can develop an application locally and be confident that it will run exactly the same on a colleague’s laptop, a test server, or a production deployment — all because it’s wrapped inside the same image.

Before we begin working with Docker, it’s important to understand a key assumption: Docker images are Linux-based. Even if you run Docker on macOS or Windows, behind the scenes your system runs a lightweight Linux virtual machine, and your containers run inside it. On native Linux systems, Docker uses the host’s kernel directly, making it more lightweight and efficient. While there are ways to run Windows containers, we focus exclusively on Linux-based containers throughout this course.

Running Existing Docker Containers

To start exploring Docker, we first learn how to run containers from existing images. Docker Hub (https://hub.docker.com) is a public registry that hosts thousands of pre-built container images that you can use directly.

For example, to run a simple test container:

docker run hello-world

This pulls a minimal image and verifies that Docker is working on your machine.

To start a simple Python container:

docker run -it python:3.12

This starts a Python 3.12 container in interactive mode (-it), giving you a Python REPL inside the container. You are now running a Linux container that has Python installed — isolated from your host system. To exit the Python interpreter and stop the container, simply type exit().

Other essential Docker commands you'll frequently use:

  • docker ps: Lists running containers.
  • docker ps -a: Lists all containers, including those that have stopped.
  • docker stop <container>: Gracefully stops a running container.
  • docker rm <container>: Removes a stopped container.
  • docker images: Lists all locally stored images.

These commands help you manage the lifecycle of containers and images as you experiment and build more complex projects.

Common Docker Run Flags

To run real-world services with Docker, you often need to pass extra options to docker run. These flags control how containers interact with your system, your network, and your development workflow. Below are the most commonly used flags, with a short explanation of each.

-d (detached mode)

Runs the container in the background. By default, Docker containers run in the foreground and occupy your terminal. For long-running services — such as databases, web servers, or machine learning inference APIs — you typically want to start them in detached mode so they continue running after you close your terminal or SSH session.

--rm (remove after exit)

Automatically removes the container when it exits. This prevents leftover stopped containers from piling up on your system. It's especially useful for temporary one-off jobs. If you’re running an interactive container or test script and you don’t need the container afterwards, add --rm to clean up automatically.

-e (environment variables)

Passes environment variables into the container. Containers run in isolated environments and don’t inherit variables from your host by default. You need to explicitly set each one you want to pass.

  • Example: -e POSTGRES_PASSWORD=secret sets the environment variable POSTGRES_PASSWORD inside the container.

-p (port mapping)

Exposes container ports to the host machine. By default, ports inside a container are not accessible from your host. You must explicitly map them if you want to connect to the service from your browser or from another application.

  • Example: -p 8080:80 makes port 80 from inside the container accessible on port 8080 of your host.
  • The format is host_port:container_port.

-v (bind mount)

Mounts a folder from your host system into the container’s file system. Containers are isolated by default — they do not have access to your local files. Bind mounts let you share files between the host and the container.

  • Example: -v ./data:/app/data mounts the local folder ./data into the path /app/data inside the container.
  • Useful for sharing datasets, configuration files, or saving outputs.

--name (assign a container name)

Gives the container a human-readable name, which makes it easier to manage. If you don’t use this flag, Docker will generate a random name using a whimsical two-word combination (like stoic_fermi or epic_pike), which is fun but not very descriptive.

  • With a name assigned, you can refer to the container using commands like docker stop mydatabase.

These flags are essential for developing, debugging, and deploying containerized services.

Example: Running a PostgreSQL Container

Here’s a complete example of running PostgreSQL in Docker using several of the flags we just discussed:

docker run -d \
  --rm \
  -e POSTGRES_PASSWORD=secret \
  -p 5432:5432 \
  -v ./pgdata:/var/lib/postgresql/data \
  --name my-postgres \
  postgres:17

This command runs PostgreSQL version 17 in the background (-d), and:

  • Removes the container automatically when it stops (--rm)
  • Sets the database password using an environment variable (POSTGRES_PASSWORD)
  • Exposes port 5432, allowing local applications to connect to the database
  • Mounts a local folder (./pgdata) into the container to persist database data
  • Assigns a name to the container (my-postgres) for easier reference

This setup gives you a fully functioning PostgreSQL server running inside a container, ready for local development or testing.

Building Custom Docker Images

Besides running existing images, you can also build your own using a Dockerfile. A Dockerfile is a plain-text script that describes how to construct an image — from installing dependencies to copying in your code.

Let’s start with a basic example that serves a Python FastAPI application:

FROM python:3.12-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install -r requirements.txt

# copy source code
COPY . .

CMD ["fastapi", "run", "main.py"]

You can build this image with the following command:

docker build -t mlservice:1.0 .

This creates a Docker image called mlservice with the tag 1.0. You can now run it just like any other image.

Let’s now break down the most important instructions in a Dockerfile:

FROM

This sets the base image. Every Docker image is based on another image — typically one that provides a minimal Linux system with some preinstalled tools. In this case, python:3.12-slim gives you a lightweight Debian image with Python 3.12 preinstalled.

WORKDIR

This sets the working directory inside the image. Any commands that follow (like COPY or RUN) will use this directory as their base. This helps you avoid writing long absolute paths in each line.

COPY

This copies files from your host machine into the image.

  • COPY requirements.txt . copies the requirements.txt file into the working directory (/app).
  • COPY . . copies all source code from the current directory into /app.

RUN

Executes a command inside the image while it’s being built. Here we use it to install Python dependencies from requirements.txt using pip. These commands are cached as build layers — meaning Docker will skip them on future builds if the inputs haven’t changed.

CMD

This sets the default command that runs when a container is started from the image. Here, we tell Docker to start the FastAPI service using fastapi run main.py.

The CMD instruction takes a list format (["command", "arg1", "arg2"]) to avoid relying on a shell. This is more predictable and cross-platform.

These five instructions — FROM, WORKDIR, COPY, RUN, and CMD — are the foundation of most Dockerfiles. With just these, you can build and run production-grade container images for your Python projects.

Publishing Docker Images to a Registry

Once you've built a Docker image, you might want to share it — with your team, your servers, or the broader world. This is done by pushing the image to a container registry, which acts as a versioned storage location for your images.

By default, Docker uses Docker Hub (https://hub.docker.com). If you don’t specify a registry when logging in, tagging, or pushing an image, Docker Hub is used automatically.

Docker Hub (default)

Docker Hub is the default public container registry. To push an image here:

  • Log in to Docker Hub:

    docker login
    
  • Tag the image with your Docker Hub username:

    docker tag mlservice:1.0 your-username/mlservice:1.0
    
  • Push the image:

    docker push your-username/mlservice:1.0
    
  • You (and others) can now pull and run the image from any machine using:

    docker run your-username/mlservice:1.0
    

However, note that Docker Hub imposes strict rate limits for unauthenticated or free-tier users. If you're working on open source or need to avoid throttling, consider an alternative.

Quay.io (by Red Hat)

Quay.io is a registry hosted by Red Hat. It's especially popular in the open-source community and has more generous rate limits than Docker Hub.

docker login quay.io
docker tag mlservice:1.0 quay.io/your-org/mlservice:1.0
docker push quay.io/your-org/mlservice:1.0

Custom or Private Registries

Many organizations host their own container registry. This allows them to keep control of their images, improve performance, and keep image transfers within their own infrastructure.

A self-hosted registry might run on something like registry.example.com.

docker login registry.example.com
docker tag mlservice:1.0 registry.example.com/your-team/mlservice:1.0
docker push registry.example.com/your-team/mlservice:1.0

Self-hosted registries are especially useful when:

  • You want full control over your deployment pipeline
  • You work in a restricted or private environment
  • You want to avoid external dependencies or rate limits
  • You want fast image downloads within a local network

Understanding Images and Containers

A Docker image is a static, read-only snapshot that defines what files and programs exist inside a container. It is built once and can be reused many times.

A Docker container is a running instance of an image — it's a live process that uses the image as its root filesystem. Containers are isolated from your host system and from each other unless you explicitly allow communication.

Each time you run an image, Docker creates a new container with a temporary writable layer. When the container is stopped and removed, changes made inside it are lost unless you persist data via mounts.

Layers and Caching in Docker

Docker images are built in layers. Each instruction in your Dockerfile creates a new layer. Docker caches these layers to avoid rebuilding them unnecessarily.

For example, if your Dockerfile contains:

COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .

Then Docker will only rerun the pip install command if requirements.txt changes. If you update your source code but not your dependencies, the cached layers are reused, speeding up builds.

Tip: Place frequently changing instructions (like COPY . .) near the bottom of your Dockerfile, and stable steps (like installing system packages) near the top. This maximizes caching efficiency.

Base images (like python:3.12-slim) are also layers. They are downloaded once and reused across images, saving disk space.

Why You Should Not Use latest

One of Docker’s biggest advantages is reproducibility. Once an image is built and pushed to a registry, anyone can pull and run the exact same environment.

But reproducibility only works if you avoid using floating tags like latest.

For example, if your Dockerfile starts with:

FROM python:latest

Then every time you build the image, it may use a different Python version — leading to inconsistencies and subtle bugs.

Instead, always use pinned versions:

FROM python:3.12

The same advice applies to your Python packages. Don’t just install them freely — use a requirements.txt file with exact versions:

fastapi==0.110.0
pandas==2.2.3
scikit-learn==1.6.1
matplotlib==3.10.1

By pinning both your Docker base image and your Python dependencies, you ensure that:

  • Your builds are consistent across time and machines
  • Bugs are easier to reproduce and fix
  • Deployments are safer and more predictable

This combination — a fixed Dockerfile and a pinned requirements file — is the foundation for reliable machine learning operations.


4. Introduction to Container Orchestration

In earlier chapters, we explored how to build and run individual Docker containers using off-the-shelf images from public registries or custom-built images defined via Dockerfiles. This works well for isolated, single-purpose tasks, but quickly breaks down when building real-world applications.

In real-world production systems, multiple containers often need to work together to support the full ML workflow. A typical deployment might include:

  • A model server (e.g. FastAPI serving a PyTorch model)
  • A database to store user inputs (e.g. Postgres)
  • An object storage service for datasets and model files (e.g. S3)
  • A frontend for interacting with users (e.g. React, Vue.js)
  • A caching layer for performance (e.g. Redis)
  • Monitoring and observability tools like Prometheus or Grafana

The challenge is no longer just “how do I run one container,” but “how do I run and manage a group of containers that work together?”

Each of these components may run in a separate container. To run them reliably, they need to be started in the right order, connected over a shared network, configured via environment variables, and sometimes assigned persistent storage. If any component crashes, it may need to be restarted automatically. And when running in production, you may want to scale some services across multiple replicas or even multiple machines.

This problem — managing a group of related containers as a single system — is the domain of container orchestration.

Two tools are commonly used for this purpose:

  • Docker Compose, a developer-friendly orchestration tool for local or single-node setups.
  • Kubernetes, a production-grade orchestration platform for distributed, scalable environments.

We’ll start with Docker Compose, because it’s simpler to understand and sufficient for many use cases, especially during development or for smaller deployments. Later, we’ll introduce Kubernetes and compare the two systems in detail.

Docker Compose: Lightweight, Local Orchestration

Docker Compose is a developer-friendly tool that allows you to define and run multi-container applications using a simple YAML configuration file. It was created to help teams manage small to medium-sized projects locally or on single-node servers.

To understand Docker Compose more concretely, let’s look at a minimal example that captures a common machine learning deployment pattern.

This project includes:

  • A frontend, such as a Vue.js application
  • A backend, written in FastAPI and serving a machine learning model
  • A PostgreSQL database for persistent data storage

This is not a fully runnable example — it references placeholder image names like my-frontend and assumes a ./backend folder containing a Dockerfile — but it illustrates the basic structure and syntax of a Compose file.

services:
  frontend:
    image: my-frontend:1.0
    ports:
      - "3000:3000"

  backend:
    build:
      context: ./backend
    ports:
      - "8000:8000"
    environment:
      - DATABASE_URL=postgresql://user:${PASSWORD}@db:5432/appdb

  db:
    image: postgres:15
    environment:
      - POSTGRES_USER=user
      - POSTGRES_PASSWORD=${PASSWORD}
      - POSTGRES_DB=appdb
    volumes:
      - pgdata:/var/lib/postgresql/data

volumes:
  pgdata:

Let’s break down what this file does:

  • The frontend runs from a pre-built image and is exposed on port 3000 of the host.
  • The backend is built from a local folder named ./backend, which should include a Dockerfile. It communicates with the database using an environment variable DATABASE_URL.
  • The db service uses the official Postgres image. It initializes the database and uses a named volume (pgdata) to persist its state.

All three services share a private internal network and can communicate with each other using their service names (frontend, backend, db). However, only the frontend and backend are exposed to the host machine.

You can start this system with a single command:

docker compose up

Core Docker Compose Commands

Here are the most useful Docker Compose commands you’ll use:

  • docker compose up: Starts all services, building any required images if necessary.

  • docker compose up -d: Starts all services in detached mode, so they run in the background.

  • docker compose up --build: Forces a rebuild of any services with local Dockerfiles, even if the Dockerfile hasn’t changed. This is useful when you’ve modified the base image or source code.

  • docker compose down: Stops and removes the project’s containers and networks, as defined in your Compose file.

  • docker compose build: Builds the images for any services defined with a build: context.

  • docker compose ps: Lists all running containers in the current Compose project.

  • docker compose logs: Shows the logs from all services.

  • docker compose logs -f backend: Follows the live logs of a specific service, such as the backend.

These commands are all you need to manage, debug, and iterate on your Compose applications.

Managing Images and Builds

Docker Compose supports two main ways to launch a service: using pre-built images or building images locally from source.

Using a pre-built image

This is the simplest and most common option. If you already have a container image available on Docker Hub or another registry, you can reference it using the image: directive.

services:
  frontend:
    image: registry.example.com/my-frontend:1.0

Compose will pull the image if it’s not available locally and then start the container.

Building a local image

If you're actively developing a service — like a backend model server — you typically want to build the container from your local code using a Dockerfile.

services:
  backend:
    build:
      context: ./backend

Here, ./backend is a folder that contains a Dockerfile. Docker Compose will use this as the build context, create an image, and then start a container from it.

To ensure your local image gets rebuilt, use:

docker compose up --build

This forces Docker to rebuild the image even if the Dockerfile hasn't changed, which is useful when you've modified source code or base images.

Networking and Inter-Service Communication

One of Docker Compose's most powerful features is its built-in networking model. When you run docker compose up, Docker automatically creates a private network and connects all the defined services to it. This allows containers to communicate with each other using simple hostnames — specifically, their service names.

Internal communication by service name

Each service in a Compose file is assigned a DNS name equal to its service name. This means that services can talk to each other using those names, without needing to know the container's IP address.

For example, if your backend service wants to connect to your db service (PostgreSQL), you can use the hostname db:

services:
  backend:
    image: my-backend
    environment:
      - DATABASE_URL=postgresql://user:${PASSWORD}@db:5432/appdb

  db:
    image: postgres:15

Here, the backend can reach the Postgres server using the hostname db, which resolves to the internal IP address of the database container on the Docker network.

This is much simpler than managing IP addresses manually, and ensures that everything will still work even if Docker reassigns container IPs when restarting.

Exposing ports to the outside world

By default, services are only accessible from within the Docker Compose network. If you want to access a service from outside (e.g. from your browser or Postman), you need to explicitly publish its ports using the ports: directive.

services:
  frontend:
    image: my-frontend
    ports:
      - "3000:3000"

  backend:
    image: my-backend
    ports:
      - "8000:8000"

This configuration maps:

  • Port 3000 in the frontend container to port 3000 on your host
  • Port 8000 in the backend container to port 8000 on your host

Now you can open your browser and go to http://localhost:3000 to access the frontend, or make a request to http://localhost:8000/predict to call a machine learning API in the backend.
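
For instance, once the stack is up, you could check the backend from your host with a quick request (the /predict route and its JSON payload are placeholders for whatever your API actually defines):

curl -X POST http://localhost:8000/predict -H "Content-Type: application/json" -d '{"text": "hello"}'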

Security benefits of limiting port exposure

Not every service should be accessible from outside the Docker network. For example, a database should typically only be accessed by your backend — not by external users or applications. In Docker Compose, you simply omit the ports: section to keep a service internal:

services:
  db:
    image: postgres:15
    environment:
      - POSTGRES_USER=user
      - POSTGRES_PASSWORD=${PASSWORD}
      - POSTGRES_DB=appdb

This makes the database reachable by the backend (which can connect to db:5432), but not by anything outside the Docker environment. If you tried to connect from your laptop to localhost:5432, it would fail — because the port is not published.

This design improves security by reducing the system's attack surface. Only the services that need to be public are made public, and everything else stays internal.

Volumes and Persistence

When working with Docker Compose, it’s important to understand what happens to your containers and their data over time. By default, containers created with docker compose up are ephemeral: they are recreated from their images whenever you take the project down and bring it up again, and anything written inside a container is lost when that container is removed.

If you run:

docker compose down

Docker Compose stops and removes the project’s containers and networks, along with any data stored inside the containers. The next time you run docker compose up, even if the image is unchanged, Docker will create new containers from scratch. This behavior is fine for stateless services like frontend servers or APIs, but it becomes a problem for stateful services, such as databases or caching layers, which need to preserve data across runs.

To retain data between container restarts, you must use some form of persistent storage. Docker Compose offers two main options: bind mounts and named volumes.

Bind mounts

A bind mount maps a folder on your host machine into the container. This is useful when you want to directly share files between your system and the container — for example, to make code or data available in both environments.

services:
  notebook:
    image: jupyter/scipy-notebook
    ports:
      - "8888:8888"
    volumes:
      - ./notebooks:/home/jovyan/work

In this example, the ./notebooks folder on your host is mounted inside the container at /home/jovyan/work. Any changes made in the notebook interface are saved directly to your host file system.

Bind mounts are easy to use and great for development, but they come with a few drawbacks:

  • File permission issues can arise, especially on Windows or when the container runs as a different user.
  • The container depends on the structure and presence of files on the host machine.
  • They are not portable — the Compose file may break on another computer if the paths don’t exist.

Named volumes

A named volume is a Docker-managed storage location that exists independently of your host file system. Named volumes are ideal for production use, or anytime you want Docker to manage persistence automatically.

services:
  db:
    image: postgres:15
    environment:
      - POSTGRES_USER=user
      - POSTGRES_PASSWORD=${PASSWORD}
      - POSTGRES_DB=appdb
    volumes:
      - pgdata:/var/lib/postgresql/data

volumes:
  pgdata:

This configuration creates a named volume called pgdata. Docker Compose ensures that this volume persists across container restarts — even if you run docker compose down and up again, the database files will still be there.

Compared to bind mounts, named volumes are:

  • Easier to manage and more portable between systems
  • Isolated from the host’s file system structure
  • Less prone to permission problems
  • Created automatically by Docker, and removed only when you explicitly delete them (for example with docker compose down -v)

When to use persistence

  • Use bind mounts when developing code, editing notebooks, or needing to sync files between host and container.
  • Use named volumes when storing persistent application data such as databases, caches, or logs that shouldn’t be tied to a specific file path on the host.

Understanding the difference between bind mounts and volumes is essential when building robust, multi-service environments. If your project includes a database, object storage layer, or experiment tracker, always think about how and where the data should be persisted. Docker Compose gives you the flexibility to choose the approach that fits your workflow.

Environment Variables and Configuration

Environment variables are a common convention across containerized applications. Many public Docker images — such as postgres, redis, mlflow, or nginx — support configuration via environment variables out of the box. It’s essential to consult the image documentation to know which variables are supported.

Likewise, if you write your own services (in Python, Go, Node.js, etc.), it’s a good practice to make your configuration parameters configurable via environment variables. This enables reproducible deployments, easier testing, and compatibility with Compose, Kubernetes, and CI/CD pipelines.

In Docker Compose, environment variables are not inherited from your host machine by default. This means that if you have a variable set in your terminal (on Linux, macOS, or Windows), it won’t automatically be visible inside your containers. You must explicitly pass them into the container using one of several supported mechanisms.

Let’s walk through the three most common ways to define environment variables in Docker Compose.

Inline environment variables in the Compose file

This is the simplest option. You define the environment variables directly under the environment: section of a service. This approach is convenient for fixed or default values that don't change across environments.

services:
  frontend:
    image: my-frontend
    environment:
      - DEBUG=true
      - MAX_WIDTH=1000
      - NAME=example.com

This configuration sets three variables that will be available to the frontend container. Inline variables are straightforward, but they hardcode values into the Compose file — which is not ideal for secrets or values that vary between deployments.

Variable substitution using a .env file

Another approach is to define your environment variables in a separate .env file. Docker Compose automatically loads this file and allows you to reference its contents inside the Compose YAML using ${VARIABLE_NAME} syntax.

Example .env file:

BACKEND_PORT=8000
DEBUG=true
DATABASE_URL=postgresql://user:${PASSWORD}@db:5432/appdb

Compose file using variable substitution:

services:
  backend:
    build:
      context: ./backend
    ports:
      - "${BACKEND_PORT}:${BACKEND_PORT}"
    environment:
      - DEBUG=${DEBUG}
      - DATABASE_URL=${DATABASE_URL}

This technique is useful when you want your configuration to be customizable by each user or deployment environment. The .env file is typically included in .gitignore, and teams may provide a .env.example file to illustrate the required settings.

Loading all variables from a file into a service

If you have a large number of environment variables — for example, service-specific secrets or third-party credentials — you can load them all from a dedicated file using the env_file: directive.

services:
  backend:
    build:
      context: ./backend
    env_file:
      - secrets.env

Example secrets.env file:

SECRET_KEY=supersecretvalue
JWT_ISSUER=https://my-app.com
TOKEN_EXPIRATION=3600

This approach is especially helpful for isolating secret configuration from public project settings. In production, secret files are often injected dynamically through CI/CD pipelines or mounted as volumes from secure storage, rather than being committed to version control.

Service Lifecycle and Restart Behavior

When using Docker Compose to run long-lived services — such as APIs, model servers, dashboards, or databases — it’s important to understand how these services behave over time, especially in the face of crashes or system reboots.

By default, Docker Compose does not keep your services running if something goes wrong. If a service crashes or your machine reboots, the container will not restart automatically. This is because the default restart policy is no.

For production scenarios or any setup where uptime matters, this default behavior is rarely what you want. Let's explore how to manage service lifecycle more explicitly using restart policies.

Why restart policies matter

Restart policies determine how the Docker engine reacts when a container stops. Without one, containers are treated as disposable: they run once, and if they exit — whether due to an error or a system shutdown — they simply stay stopped.

If you're running backend services, dashboards, or databases as part of your Docker Compose setup, you likely want them to behave more like daemons: always running, even if the host machine restarts or the process crashes.

This is where restart policies come in. They tell Docker Compose to automatically bring containers back online in various scenarios.

Common restart policy options

Docker supports several restart policies, which can be set per service in your Compose file:

services:
  backend:
    build:
      context: ./backend
    restart: always

Here’s what the available options mean:

  • no (default): Do not restart the container automatically. This is safe for short-lived tasks, one-off scripts, or development experiments — but unsuitable for anything that should remain available.

  • always: Restart the container if it ever stops. This includes normal failures as well as system reboots. Ideal for services that should be running 24/7, like model APIs or dashboards.

  • unless-stopped: Restart the container like always, except if you manually stop it. This is useful if you want services to persist across reboots but still retain the ability to shut them down manually (e.g. with docker compose stop).

  • on-failure: Only restart the container if it exits with a non-zero exit code — and optionally, up to a certain number of retries. This is more relevant for short-lived jobs or batch tasks.

Example with unless-stopped:

services:
  mlflow:
    image: mlflow/mlflow
    restart: unless-stopped

This ensures the MLflow tracking server comes back up after crashes or reboots, but stays stopped if the user has explicitly taken it down.
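
For short-lived jobs, the on-failure policy can be combined with a retry limit. A minimal sketch, assuming a hypothetical training job service (the :3 retry-count suffix requires a reasonably recent version of the Compose specification):

services:
  train-job:
    build:
      context: ./training
    restart: on-failure:3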

Kubernetes: Industrial-Scale Container Orchestration

While Docker Compose is ideal for simpler scenarios, it lacks the flexibility and scalability needed in larger, distributed systems. This is where Kubernetes comes in. Originally developed by Google and now maintained by the Cloud Native Computing Foundation (CNCF), Kubernetes is the industry standard for managing containers across multiple servers.

Kubernetes is not a tool in the same sense as Docker Compose. It’s a platform, and setting it up requires running a control plane with several internal components. These include schedulers, service discovery systems, network controllers, volume provisioners, and more.

A central idea in Kubernetes is declarative configuration. Instead of issuing commands to create and start containers, you define the desired state of the system — for example, “run one instance of a PostgreSQL database with this password and a persistent volume.” Kubernetes takes care of matching the real state to your desired state, automatically creating or restarting containers as needed.

Here’s a basic example using Kubernetes YAML to run a Postgres service:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pgdata
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi

---

apiVersion: apps/v1
kind: Deployment
metadata:
  name: db
spec:
  replicas: 1
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
    spec:
      containers:
        - name: db
          image: postgres:15
          env:
            - name: POSTGRES_USER
              value: user
            - name: POSTGRES_PASSWORD
              value: password
            - name: POSTGRES_DB
              value: appdb
          volumeMounts:
            - name: pgdata
              mountPath: /var/lib/postgresql/data
      volumes:
        - name: pgdata
          persistentVolumeClaim:
            claimName: pgdata

As you can see, even a simple service like Postgres requires a bit more boilerplate than Docker Compose. But Kubernetes brings enormous power: automatic restarts, load balancing, scaling, monitoring, health checks, and tight integration with cloud services.
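
One piece is missing from this sketch: other pods normally reach the database through a Service, which gives it a stable DNS name, much like the service name in Compose. A minimal example matching the Deployment above (in a real cluster, the password would also come from a Kubernetes Secret rather than a plain value):

apiVersion: v1
kind: Service
metadata:
  name: db
spec:
  selector:
    app: db
  ports:
    - port: 5432
      targetPort: 5432

You apply these manifests with kubectl apply -f <file>.yaml, and Kubernetes reconciles the cluster to match.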

It’s also important to understand that Kubernetes is not a single implementation, but rather a standardized set of APIs and behaviors. The major cloud providers (AWS, Azure, and GCP) offer managed Kubernetes services, and lightweight distributions such as K3s or MicroK8s let you run your own clusters on-premise or at the edge. Kubernetes is flexible and powerful — but also complex to manage and maintain.

The complexity comes not just from the YAML files, but from running and maintaining the cluster itself: distributed file systems, service discovery, ingress controllers, GPU scheduling, and so on.

Alternatives to Kubernetes

A few other tools exist in this space, though they are much less commonly used today:

  • Docker Swarm: An older clustering system built into Docker itself. It attempted to make orchestration easy but never gained the adoption or feature set of Kubernetes. Today it is in maintenance mode and rarely chosen for new projects.
  • Nomad: Developed by HashiCorp, Nomad is a general-purpose orchestrator that supports containers but hasn’t reached the popularity of Kubernetes.
  • OpenShift: A powerful enterprise platform built on top of Kubernetes. It adds developer tooling, CI/CD integration, and advanced security — but comes with a high operational cost and learning curve. It's mainly used by large enterprises building internal developer platforms.

While these alternatives are interesting from an architectural point of view, in practice, most professionals and organizations choose either Docker Compose for local/small projects, or Kubernetes for production-scale deployments.

Compose vs. Kubernetes: A Conceptual Comparison

Both Docker Compose and Kubernetes allow you to define multi-container systems — but they solve problems at different scales.

Docker Compose is designed for:

  • Single-machine environments
  • Local development and prototyping
  • Small projects with few containers
  • Easy configuration and startup

Kubernetes is designed for:

  • Multi-node, production environments
  • High availability and redundancy
  • Scalable and resilient infrastructure
  • Automated monitoring and maintenance

They are not mutually exclusive. A common workflow is:

  • Start with Docker Compose for local development.
  • Convert your setup to Kubernetes when your project needs to scale.

To make this comparison even clearer: there are tools that automatically convert a compose.yaml file into Kubernetes manifests. One such tool is kompose. This shows how closely the two tools align in terms of concepts — despite the differences in scale and complexity.
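
As a rough illustration, converting a Compose file is a single command; kompose typically writes one manifest per generated Kubernetes resource into the current directory:

kompose convert -f compose.yaml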

Advanced Capabilities in Kubernetes

Once you understand the basic concepts of Kubernetes, it's worth highlighting a few powerful features that go far beyond what Docker Compose can offer. These features demonstrate why Kubernetes is the tool of choice for large-scale, production-grade systems.

Horizontal Scaling and Load Balancing

One of Kubernetes’ core strengths is its ability to scale services horizontally. In a Docker Compose setup, each service typically runs as a single container. With Kubernetes, you can declare that a service should have n replicas — and Kubernetes will ensure they are created, distributed across the cluster, and automatically load-balanced.

If one container crashes, Kubernetes will replace it. If you want to handle more traffic, you can scale up with a single command or automatically based on CPU usage.

This makes Kubernetes ideal for high-availability systems or workloads with variable demand.
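
As a sketch of what this looks like in practice, you can scale a Deployment imperatively, or declare an autoscaler that adjusts the replica count based on CPU usage (the backend name and the thresholds below are placeholders):

kubectl scale deployment backend --replicas=5

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: backend
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: backend
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70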

Rolling Updates and Self-Healing

Kubernetes can perform rolling updates, gradually replacing old versions of your application with new ones while minimizing downtime. If something goes wrong during an update, Kubernetes can automatically roll back to the previous stable version.

Kubernetes also performs continuous health checks. If a container fails its liveness or readiness probe, Kubernetes will restart it. This ensures that your services stay available even in the face of unexpected failures.

Docker Compose offers only limited support for this: you can declare health checks and restart policies, but there are no rolling updates, and containers that fail their health checks are not automatically replaced.
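
For reference, in Kubernetes a liveness probe is declared on the container itself. A minimal sketch, assuming the backend exposes a /health endpoint on port 8000:

containers:
  - name: backend
    image: my-backend:1.0
    ports:
      - containerPort: 8000
    livenessProbe:
      httpGet:
        path: /health
        port: 8000
      initialDelaySeconds: 10
      periodSeconds: 15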

Persistent Storage Across the Cluster

In Kubernetes, you can define persistent volumes that are backed by storage systems outside the individual nodes — like cloud block storage, distributed file systems (e.g. Ceph), or networked storage (e.g. NFS, iSCSI). These are managed independently of the containers and can be reused across restarts or reassignments.

In contrast, Docker Compose relies on local volumes tied to a single host. This works well for development, but doesn't scale to multi-node clusters.

Declarative Resource Management

Kubernetes lets you specify detailed resource requests and limits for each container: how much CPU and memory it needs, and how much it is allowed to consume. The scheduler uses this information to decide where to run containers, optimizing for efficiency and fairness.

You can also define affinity rules, taints and tolerations, and more — allowing for complex placement logic (e.g. "only run this workload on GPU nodes").

This level of control is critical in large teams and production systems. Docker Compose supports only basic per-container resource limits, with no scheduler to act on them.
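
A minimal sketch of such a specification on a container (the values are illustrative; a GPU request would use an extended resource such as nvidia.com/gpu):

resources:
  requests:
    cpu: "500m"
    memory: 1Gi
  limits:
    cpu: "2"
    memory: 4Gi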

Choosing the Right Tool for the Job

Docker Compose is a great way to get started with orchestration. It’s developer-friendly, well-supported, and fits perfectly with the kind of systems we’re building throughout this course: machine learning APIs, training pipelines, dashboards, and so on.

But it’s important to keep an eye on the bigger picture. In real-world MLOps environments — especially at scale — teams will often move toward Kubernetes for its rich ecosystem and operational capabilities.


5. Infrastructure Management: Cloud, On-Premise and Infrastructure as Code

As we’ve seen in previous chapters, building modern machine learning applications often involves more than just writing code. We’ve discussed how to containerize software with Docker, orchestrate it locally with Docker Compose, and scale it across nodes using Kubernetes. But all of this raises a crucial question:

Where do we actually run our containers and services?

Up to now, you may have experimented on your own laptop or a university server. But in real-world deployments, this won’t be enough. You’ll need infrastructure that is reliable, scalable, and always available — whether you’re serving models to users, running scheduled training jobs, or hosting a web frontend.

In this chapter, we explore the practical side of infrastructure management: running your workloads either in the cloud or on your own servers, understanding what services cloud providers offer, and how to automate everything to keep your system consistent and maintainable. Along the way, we’ll introduce key concepts like virtualization, managed container platforms, and infrastructure as code.

What Is a Cloud Provider?

A cloud provider is a company that rents out computing resources over the internet — things like virtual machines, storage, networking, databases, and more. Instead of buying and maintaining your own physical servers, you pay only for what you use, and the provider takes care of the hardware, maintenance, and uptime.

This model is especially powerful for machine learning workflows, where compute needs can vary dramatically over time. You might need a GPU server for training this week, and nothing next week. With the cloud, you don’t need to invest in your own data center — you just rent what you need, when you need it.

The three major cloud providers used across the industry today are:

  • Amazon Web Services (AWS): The market leader, offering a massive range of services, including compute (EC2), storage (S3), databases, machine learning tools (like SageMaker), and many more.

  • Microsoft Azure: Popular in enterprise environments, Azure integrates well with Microsoft technologies (like Active Directory, Windows Server, and Office 365) and offers a broad set of cloud services.

  • Google Cloud Platform (GCP): Known for its data and AI tooling, including managed services for Kubernetes (GKE), machine learning (Vertex AI), and big data processing (BigQuery).

These three make up the majority of the global cloud market and will be our focus in this chapter.

Other cloud providers exist as well. Some notable ones include:

  • IBM Cloud – Offers enterprise services and some machine learning support.
  • Heroku – A simpler platform-as-a-service provider, popular with startups and smaller web applications.
  • DigitalOcean – Offers developer-friendly virtual machines and container hosting with a simplified interface.

While these alternatives have their place, AWS, Azure, and Google Cloud are by far the most common in both startups and large-scale enterprises — and they offer the widest support for modern ML infrastructure.

Understanding Infrastructure Layers

When working with cloud platforms, you’ll often come across three key terms: Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS). These describe different levels of abstraction that cloud providers offer — from giving you raw infrastructure to delivering ready-to-use software.

You can think of them as layers in a stack:

  • IaaS gives you the raw building blocks — virtual machines, networking, and storage — but you manage most of what’s running on top.
  • PaaS provides you with a managed environment to run your applications — like hosting a web API or a database — without worrying about the underlying infrastructure.
  • SaaS delivers fully finished applications that you can just use — like email, chat, or an AI model accessible via API.

Let’s briefly explore each:

Infrastructure as a Service (IaaS)

At the IaaS level, cloud providers like AWS, Azure, and Google Cloud give you virtual machines (VMs) that act like your own servers in the cloud. You can install whatever software you like, run containers, or even set up your own Kubernetes cluster. You’re responsible for configuring and maintaining everything inside the VM.

  • Example: AWS EC2, Azure Virtual Machines, Google Compute Engine

Platform as a Service (PaaS)

PaaS offerings abstract away the operating system and let you focus on your application. You might push your code, upload a container, or connect a database — and the platform handles scaling, patching, and networking for you.

  • Example (Web Development): AWS Elastic Beanstalk, Azure App Service, Google App Engine
  • Example (SQL Databases): AWS RDS, Azure SQL Database, Google Cloud SQL

Software as a Service (SaaS)

SaaS is the highest level of abstraction. You don’t manage code or containers — you just use a service via a user interface or API. Most people use SaaS every day without thinking about it.

  • Example (AI/ML): OpenAI’s ChatGPT API, AWS Rekognition, Google Vision API

We’ll explore each of these layers in more detail throughout the chapter. For now, just keep in mind that cloud services are organized along this spectrum — from raw infrastructure to full applications — and choosing the right level depends on how much control and responsibility you want to take on.

Virtual Machines in the Cloud (IaaS)

In the early days of IT, companies bought and maintained their own physical servers. These machines were expensive, required dedicated space and power, and had to be manually maintained by in-house system administrators. Every service — from a web app to a database — ran directly on an operating system installed on bare metal.

Today, the majority of infrastructure has moved to the cloud, thanks to a technology called virtualization. Instead of owning servers, you can now rent virtual machines (VMs) — software-defined computers that behave just like physical ones but run on shared hardware in massive cloud data centers.

This model is known as Infrastructure as a Service (IaaS). You’re still responsible for everything inside the virtual machine — installing software, configuring services, handling updates — but you don’t have to manage the physical hardware. You get more flexibility and scalability without needing a server room.

Cloud providers offer much more than just virtual CPUs. The IaaS layer also includes:

  • Block storage (like virtual hard drives) for persistent data
  • Object storage for storing large datasets, files, or models
  • Virtual networking (private networks, firewalls, load balancers) so services can communicate securely
  • Custom machine types, including access to GPUs for machine learning workloads

Here’s what IaaS looks like on the major cloud platforms:

  • AWS

    • Virtual machines: EC2 (Elastic Compute Cloud)
    • Block storage: EBS (Elastic Block Store)
    • Object storage: S3 (Simple Storage Service)
    • Networking: VPC (Virtual Private Cloud), Elastic Load Balancer
  • Microsoft Azure

    • Virtual machines: Azure Virtual Machines
    • Block storage: Azure Disks
    • Object storage: Azure Blob Storage
    • Networking: Azure Virtual Network, Azure Load Balancer
  • Google Cloud Platform (GCP)

    • Virtual machines: Compute Engine
    • Block storage: Persistent Disks
    • Object storage: Cloud Storage
    • Networking: VPC Network, Cloud Load Balancing

The key benefit of IaaS is control. You can run anything you want inside a VM — from traditional software like web servers, relational databases, and caching systems, to custom applications like model training pipelines, experiment trackers, or internal dashboards. But with that control comes responsibility: you have to manage updates, security and scaling yourself.

Containers in the Cloud: Deployment Models and Trade-offs

As teams grow beyond basic virtual machines, many choose to move up the stack and adopt containerization. Containers offer a more lightweight, portable way to package applications — making it easier to develop, test, and deploy software consistently across environments.

But while containers simplify application deployment, running containers in the cloud introduces new architectural decisions: Where should your containers run? Who should manage the infrastructure underneath? How much flexibility do you need?

There are several common approaches:

Running containers manually on virtual machines

A straightforward method is to provision VMs through a cloud provider (like AWS EC2, Azure Virtual Machines, or Google Compute Engine), and then install Docker or Podman directly on them. This setup offers full control and flexibility — you're in charge of the container runtime, network configuration, updates, and scaling logic.

This approach is particularly useful when integrating with legacy systems, or when your deployment needs are relatively small and stable. However, it also means taking on a lot of operational responsibility, from monitoring to security patching.

Running your own Kubernetes cluster

For more complex systems, teams often deploy Kubernetes on top of virtual machines. Kubernetes offers a powerful framework for managing large numbers of containers, with features like service discovery, automatic scaling, and self-healing.

But running Kubernetes yourself is a serious undertaking. You must install and upgrade the control plane, secure communication between components, configure high availability, and maintain the overall health of the system. This level of control is useful for advanced use cases — but comes with a high maintenance cost.

Using a managed container platform

To reduce the operational burden of running containers manually, most cloud providers offer a range of managed container services. These platforms allow you to focus on building and deploying applications, while the cloud provider takes care of provisioning infrastructure, managing orchestration, and scaling workloads.

There are several types of managed container services available — from basic container execution platforms that simply run a single container without orchestration, to batch processing systems optimized for large-scale parallel jobs, to full-featured container orchestration platforms with autoscaling and service discovery built in.

Among these, two of the most widely used categories are:

  • Managed Kubernetes services, where the cloud provider runs and maintains the Kubernetes control plane for you.
  • Serverless container platforms, where containers are launched on-demand in response to traffic, with no need to manage servers or clusters.

We’ll explore both of these options in more detail in the next sections — but it’s important to note they are part of a broader ecosystem of container tools, each suited to different use cases and levels of complexity.

Managed Kubernetes Services

Kubernetes is the industry standard for orchestrating containers, but running it yourself is complex. It involves setting up control planes, managing certificates, networking, scaling, monitoring — and maintaining all of that over time. For many teams, this operational overhead is a distraction from the real goal: running their applications reliably.

That’s where managed Kubernetes services come in. All major cloud providers — AWS, Azure, and Google Cloud — offer fully managed Kubernetes platforms: EKS (Elastic Kubernetes Service), AKS (Azure Kubernetes Service), and GKE (Google Kubernetes Engine). These services run the Kubernetes control plane for you, manage upgrades and security patches, and integrate tightly with their own networking and storage systems.

As a user, this means you no longer worry about how Kubernetes itself is deployed. You write your Kubernetes manifests — specifying how many replicas a service should have, what kind of persistent volume it needs, or how traffic should be load balanced — and the platform takes care of the rest. You don’t know which exact VM your container runs on, and you don’t need to. If you request a persistent disk, the platform will provision storage for you. If you expose a service with a LoadBalancer, the provider will automatically provision and configure a cloud-native load balancer.

This abstraction lets you focus on your application and deployment logic, without getting bogged down in low-level infrastructure details. You still retain full control over how your containers are configured, how they scale, and how they communicate — but the cloud platform handles the heavy lifting of scheduling, networking, and failover.

For example, a team could deploy a FastAPI model server to a managed Kubernetes cluster, configure horizontal scaling based on CPU usage, mount a persistent volume for logs, and expose it to the internet behind a load balancer — all using standard Kubernetes YAML, and without ever logging into a VM.
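
The “expose it behind a load balancer” step, for example, is just a Service of type LoadBalancer; on a managed cluster, the provider provisions the actual cloud load balancer for you. A minimal sketch with placeholder names and ports:

apiVersion: v1
kind: Service
metadata:
  name: model-api
spec:
  type: LoadBalancer
  selector:
    app: model-api
  ports:
    - port: 80
      targetPort: 8000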

Managed Kubernetes strikes a useful balance: you get the power and flexibility of Kubernetes, but without having to run Kubernetes itself.

Serverless Container Platforms

Serverless computing is a model where you hand over your code — or a container — to the cloud provider, and they take care of running it, scaling it, and stopping it when it’s no longer needed. You don’t manage servers, virtual machines, or container orchestration. Instead, you define what should happen, and the platform handles infrastructure behind the scenes.

A key advantage of serverless is scaling to zero. If no requests are coming in, the application can fully spin down — consuming no resources and incurring no cost. When demand increases, the platform automatically scales up to handle concurrent requests, then scales back down again as needed. This makes serverless ideal for event-driven systems, bursty traffic, and lightweight ML inference workloads.

There are two common approaches:

Code-level serverless lets you deploy individual functions in a high-level language (like Python or JavaScript), triggered by events such as HTTP requests or file uploads. This is useful for lightweight logic and simple backend tasks.

Container-level serverless allows you to deploy full Docker containers, including your own libraries, dependencies, or ML models — ideal for more complex services.

Common platforms include AWS Lambda, Azure Functions, Google Cloud Functions, Cloudflare Workers (for code-level), and AWS Fargate, Azure Container Apps, and Google Cloud Run (for container-level).

Both models let you focus on application logic without managing infrastructure, and they’re especially useful when you need elastic scaling, fast deployment, and cost efficiency for low-traffic or unpredictable workloads.

Platform as a Service (PaaS): Simplifying Development

Platform as a Service (PaaS) allows developers to focus on writing application code without managing servers, containers, or infrastructure. You write your code, deploy it, and the platform takes care of running it, scaling it, and keeping it available.

For example, imagine you're building a web application in Python using Flask. On a PaaS like AWS Elastic Beanstalk, Azure App Service, or Google App Engine, you can deploy your code directly — often just by pushing it to a repository. The platform sets up the runtime environment, installs dependencies, and connects your app to the internet. You don’t need to configure web servers or handle load balancing — that’s all abstracted away.

Another common use case is a backend service that needs to connect to a managed database. Most PaaS platforms let you provision one with a few clicks. Once it's ready, you get a connection string and start building — no need to install, patch, or back up the database yourself.

This level of abstraction saves time, but comes with trade-offs. PaaS platforms are more opinionated than containers or virtual machines — they work best for common workloads like web apps or APIs, but offer less flexibility for custom runtimes, system-level access, or specialized networking.

Still, for many projects, PaaS offers the fastest and simplest path to deployment. It’s ideal when you just want to run your application — and let the platform handle the rest.

PaaS for Machine Learning

While Platform-as-a-Service is often associated with web and backend development, all major cloud providers — AWS, Azure, and Google Cloud — offer high-level platforms specifically designed for machine learning. These services aim to support the full machine learning lifecycle: from data preparation and model training to deployment, monitoring, and versioning — all without requiring teams to manage underlying infrastructure.

These platforms are especially useful for data scientists and ML engineers who want to focus on models and datasets rather than scaling, networking, or DevOps. By offering a tightly integrated suite of tools, they make it easier to go from prototype to production with less friction.

There is some variation across providers — AWS SageMaker, Azure Machine Learning, and Google Vertex AI all offer broadly similar capabilities — but for clarity, we’ll focus here on AWS SageMaker, which is generally considered the most mature and widely adopted of the three.

Key Services Offered by AWS SageMaker

SageMaker includes a wide range of tools that support different stages of the machine learning workflow:

  • SageMaker Studio – an all-in-one web-based IDE for building and managing ML workflows
  • SageMaker Data Wrangler – a visual tool for exploring and transforming datasets
  • SageMaker Training – scalable infrastructure for training models, including GPU and distributed training
  • SageMaker Autopilot – automatic model building for tabular data
  • SageMaker Hyperparameter Tuning – managed optimization of training parameters
  • SageMaker Inference – deployment options for real-time endpoints, batch jobs, or asynchronous inference
  • SageMaker Model Monitor – automatic drift detection and model health tracking
  • SageMaker Pipelines – building reusable, CI/CD-style ML workflows
  • SageMaker Feature Store – managing and sharing features across training and inference
  • SageMaker Experiments – tracking and comparing model runs

These services are designed to work together, but can also be used selectively. You might use SageMaker just for training, or just for inference, while managing the rest of your workflow elsewhere.

Trade-offs of Using ML PaaS Platforms

While the convenience of managed ML platforms is appealing, there are trade-offs to consider.

One of the most important is cost. These services often charge a premium for the infrastructure and automation they provide. Training a model or hosting an endpoint on SageMaker will generally cost more than running the same workflow manually on virtual machines or Kubernetes — simply because the provider is doing more of the work for you.

However, managing your own infrastructure isn’t free either. The time you spend setting up, monitoring, and maintaining a training pipeline or deployment stack also costs money — especially when things break. Using a managed platform means you trade some of that operational complexity for a higher bill, but also for better reliability and fewer distractions from your core work.

Another trade-off is flexibility. SageMaker (and similar platforms) are designed to support common workflows, but they are opinionated. If your ML workflow requires custom runtimes, unusual deployment patterns, or specific system-level optimizations, you may find yourself fighting the platform — or falling back to lower-level tools like containers or virtual machines.

In the end, it’s a question of scale, priorities, and internal expertise. For many teams, a platform like SageMaker provides a fast, reliable way to get models into production without hiring infrastructure engineers. For others — especially those with strict cost constraints, niche requirements, or deep DevOps experience — building a custom ML stack might be the better path.

Software as a Service (SaaS): Using AI Tools Directly

At the highest level of abstraction, we find Software as a Service (SaaS) — fully built applications or APIs that offer ready-made functionality. With SaaS, you're no longer writing application code to train or deploy models. Instead, you're consuming a finished service, usually via a simple web interface or an API.

These AI services are often pre-trained on large datasets and expose high-level capabilities such as image recognition, speech-to-text, text summarization, or language translation. You send data to an endpoint, and the service responds with a result. There's no need to worry about infrastructure, scaling, or model performance — that’s all managed for you by the provider.

All three major cloud platforms offer SaaS-style AI tools:

  • AWS: Rekognition (vision), Polly (text-to-speech), Transcribe (speech-to-text), Comprehend (NLP)
  • Azure: Cognitive Services (image tagging, face detection, speech recognition, etc.)
  • Google Cloud: Vision AI, Translation API, Speech-to-Text, and others

These services are useful when you want to integrate common AI capabilities into your product without building models yourself. For example, you could build an internal tool that analyzes product images using AWS Rekognition, or a customer-facing chatbot that relies on Azure’s language understanding API.

Importantly, SaaS isn’t limited to cloud providers. Many specialized vendors offer AI functionality as a service. A well-known example is OpenAI, whose ChatGPT and GPT-4 APIs let you send a prompt and receive a generated text — whether you’re building a chatbot, writing assistant, or summarization tool. Similar APIs exist for image generation, speech synthesis, and even music generation.

These APIs are typically accessed through token-based pricing models: you pay per request, per character, per second of audio, or similar units. This makes them easy to integrate and scale — especially when you don’t have the resources or time to train and host your own models.

For engineers building intelligent applications, this means you can build a completely custom front-end, user workflow, or factory interface — and let someone else provide the intelligence behind the scenes. Whether you're classifying images from a production line or summarizing documents in a knowledge base, you don’t need to reinvent the model. You just plug into a service that already works.

Of course, there are trade-offs. These services are convenient, but often more expensive — and they come with limits on customization and transparency. Still, for many teams, SaaS AI tools offer an unbeatable combination of speed, reliability, and reduced complexity.

On-Premise Infrastructure: Building Your Own Virtualization Environment

While cloud computing offers scalability and convenience, some organizations opt for on-premise infrastructure — managing their own physical servers and virtualization platforms. This approach offers more control over data, compliance, and performance, though it requires more in-house responsibility and expertise.

What Is On-Premise Infrastructure?

On-premise infrastructure means owning and operating your own hardware — including servers, storage systems, and networking equipment — typically housed in a server room or private data center. Rather than provisioning machines from a cloud provider, you purchase and manage the physical systems yourself, giving you full authority over setup, maintenance, and security.

To make the most of these resources, teams usually run a virtualization platform — software that allows them to deploy multiple isolated workloads on the same physical machines. The most common approach is to install a bare-metal hypervisor, such as Proxmox VE, VMware ESXi, or Microsoft Hyper-V, directly onto the hardware. These platforms can run full virtual machines (VMs), as well as Linux containers (e.g. LXC), depending on the level of isolation and performance required.

Why Choose On-Premise?

Choosing to run infrastructure on-premise — instead of in the cloud — is often a strategic decision based on a combination of technical, financial, and organizational needs.

For many, it begins with data control. Sensitive information, such as medical or legal records, may fall under strict regulations that limit where data can reside or how it must be handled. On-premise deployments give you precise control over data location and access.

Closely related is security and privacy. Some organizations prefer to isolate critical systems entirely from the internet, applying custom security policies, enforcing physical access controls, and avoiding third-party dependencies.

There are also performance considerations. Applications that process large local datasets or require extremely low latency — like industrial systems or real-time analytics — often benefit from being physically close to the hardware they're interacting with.

In some cases, cost is a motivating factor. Cloud platforms excel at scaling and flexibility, but for predictable, long-running workloads, owning hardware can be more economical over time. Capital expenses may pay off, especially when usage patterns are stable.

Finally, existing infrastructure or in-house expertise can tip the balance. If a team already has a functioning data center, backup systems, and skilled sysadmins, staying on-premise might be the most efficient and cost-effective option.

Core Features of Virtualization Platforms

Modern virtualization platforms provide a suite of tools for deploying and managing services on owned infrastructure. They act as the control layer between physical servers and the workloads you run — offering capabilities similar to those you'd find in the cloud, but under your own control.

Beyond individual servers, most platforms also support clustering: the ability to group multiple physical nodes into a unified environment. This makes it possible to migrate workloads between machines, balance load across hardware, and ensure high availability in the event of hardware failure. Platforms like Proxmox VE, VMware vSphere, and Microsoft Hyper-V all support clustering for a wide range of setups.

Here are some of the key features commonly found across virtualization platforms:

  • Virtual Machine (VM) Management: Run fully isolated virtual machines with their own operating systems and resource limits — ideal for workload separation and compatibility.
  • Container Support: Use containers (e.g. LXC) for lightweight, fast-deploying workloads that don't need full OS-level isolation.
  • Networking Capabilities: Define virtual networks, VLANs, and bridges to manage communication between VMs and containers, or with the outside world.
  • Storage Solutions: Integrate local storage, NAS, or distributed systems like Ceph, with support for snapshots, replication, and thin provisioning.
  • Backup and Restore: Schedule backups, create snapshots, and restore workloads — often with support for external backup servers.
  • High Availability (HA): Automatically detect node failures and restart affected services elsewhere in the cluster to minimize downtime.
  • Security Features: Enforce access controls and firewall rules directly within the virtualization layer, providing strong boundaries between workloads.
  • Web-Based Management Interface: Configure and monitor your environment through a centralized browser UI with dashboards, logs, and alerts.

Together, these features allow organizations to build a robust and scalable infrastructure that’s fully under their own control — often rivaling the flexibility of cloud environments, but with added ownership and autonomy.

Automating Infrastructure: Infrastructure as Code

As infrastructure grows more complex, managing it manually becomes error-prone and unsustainable. Clicking through cloud dashboards or running ad-hoc scripts might work for quick experiments, but it doesn’t scale — and it certainly doesn’t guarantee reproducibility. That’s where Infrastructure as Code (IaC) comes in.

The idea behind Infrastructure as Code is simple but powerful: instead of configuring servers, networks, or containers by hand, you describe your desired infrastructure in a declarative format, typically using text-based configuration files. These files serve as the source of truth for what your environment should look like, and they can be versioned, reviewed, and reused — just like any other code.

This approach brings many advantages. It allows for automation, so you can provision infrastructure consistently across environments (e.g. dev, staging, production) without introducing human error. It also supports reproducibility: you can rebuild your setup from scratch, test changes safely, and roll back if needed. And it makes collaboration and auditing easier, because infrastructure definitions are stored and shared in plain files — no one needs to remember exactly what they clicked.

One of the most widely used tools for infrastructure as code is Terraform. Terraform allows you to define cloud resources — such as virtual machines, storage, and networking — using a simple, declarative language. You describe what you want (e.g. “a virtual machine in AWS with 8 GB RAM and a certain network configuration”), and Terraform handles the API calls to make it happen. Because it works across multiple providers, you’re not locked into a single cloud interface or vendor-specific tooling.
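
The day-to-day workflow mirrors this declarative idea: you write the configuration, preview the changes Terraform would make, and then apply them. In a typical project directory:

terraform init      # download the providers referenced in the configuration
terraform plan      # show what would be created, changed, or destroyed
terraform apply     # make the real infrastructure match the configuration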

Kubernetes, too, embraces infrastructure as code. Its configuration model is built around YAML files that describe the desired state of your system: which services should run, how they should scale, which resources they need, and so on. Tools like Argo CD and Flux take this further by introducing GitOps — a workflow in which a Git repository holds the complete state of the cluster. When you commit changes, these tools automatically reconcile the cluster to match. This brings the benefits of code review, history tracking, and CI/CD pipelines directly to infrastructure and deployment.

Whether you're working in the cloud or on-premise, deploying containers or managing VMs, the key takeaway is this:

Modern machine learning infrastructure is programmable.

From spinning up virtual machines and scaling Kubernetes clusters, to running serverless functions or deploying entire applications with a single command — today’s tools make it possible to automate nearly everything. Technologies like Docker, Kubernetes, and Terraform empower teams to treat infrastructure as software: scalable, reproducible, and built for automation from the ground up.

Hybrid and Multi-Cloud Infrastructure

Not every organization chooses between clouds and on-premise infrastructure in absolute terms.

In practice, many adopt a hybrid approach, combining both models to suit different needs. For example, a company might train large models on a powerful on-premise GPU server — to reduce cloud compute costs or meet data locality requirements — and then deploy them using a cloud-based platform for scalability and availability. This way, each workload runs in the environment best suited to its requirements.

A related concept is multi-cloud infrastructure — using multiple cloud providers at once. This could mean running production services across AWS, Azure, and Google Cloud, either to improve fault tolerance, satisfy client requirements in regulated industries, or leverage best-in-class tools that are only available from specific providers (such as advanced analytics on Google Cloud or enterprise integrations on Azure). It’s a strategy that increases flexibility, though it also adds complexity in terms of operations and integration.

A key concern in both hybrid and multi-cloud environments is vendor lock-in — the risk of becoming dependent on a single provider’s ecosystem. If a cloud vendor raises prices, suffers from performance issues, or discontinues a service you rely on, migrating away can be difficult or even impractical. While major providers are generally stable, this dependence still limits flexibility and negotiating power. The risk becomes more pronounced when teams rely on high-level services like Platform-as-a-Service or Software-as-a-Service — tools that are easy to adopt but tightly integrated with a specific cloud’s APIs and infrastructure. These services offer speed and convenience, but at the cost of portability.

To stay flexible, many teams invest in vendor-neutral technologies like containers, Kubernetes and Terraform. These tools abstract away provider-specific details and offer a consistent deployment model across environments. While they don’t eliminate lock-in entirely, they help decouple your application logic and infrastructure definitions from any one platform.

Ultimately, there is no universal answer. The right mix of infrastructure depends on your team’s skills, priorities, and context. Some companies value control and self-hosting; others benefit from the speed and scale of cloud services. Most land somewhere in between — and that's perfectly fine.

This concludes our exploration of infrastructure management. You've now seen the full range of options for provisioning and running compute environments — from virtual machines and containers to platforms and services at every level of abstraction. With this foundation, you're ready to explore the next steps towards a production ML system.


6. Scalable Storage for Machine Learning: Object Storage

Modern machine learning projects go far beyond writing code. As teams move toward automated pipelines, continuous experimentation, and production deployment, they generate and consume large volumes of data — often in the form of binary files that need to be stored efficiently.

Traditional version control tools like Git are designed for source code, not for managing evolving datasets or model artifacts. In this chapter, we introduce the role of object storage in MLOps and explore how scalable storage systems underpin every stage of the machine learning lifecycle.

Why Machine Learning Needs Scalable Storage

In traditional software engineering, version control is straightforward: changes happen in source code, and tools like Git let developers track these changes line by line. But in machine learning, code is only one part of the story. A typical ML workflow generates and relies on many other types of data — most of which are far too large, too binary, or too dynamic for Git to handle.

For example, consider the training phase of a PyTorch model. You’ll need access to datasets — possibly consisting of thousands of high-resolution images or time series data captured from sensors. These datasets may grow or evolve over time as new samples are collected, especially in live environments. During training, your models generate checkpoints, experiment metadata, and performance logs. Once deployed, the model produces predictions and telemetry logs that may also need to be stored, analyzed, or reprocessed. All of these artifacts are critical to understanding, reproducing, and improving your model — but they don’t belong in your Git repository.

This separation becomes even more important in production systems where automated retraining is involved. If your application supports a data flywheel — where user interactions or sensor input generate new labeled data — your model can continuously improve as more data comes in. In such a system, data may be ingested, cleaned, versioned, retrained, and deployed with minimal human intervention. These processes generate large binary files — models, datasets, metrics — that change frequently and need to be stored and retrieved programmatically. This kind of lifecycle is entirely outside the scope of Git.

In short, machine learning requires a new kind of storage — one that is:

  • Scalable: to handle growing datasets and multiple model versions.
  • Binary-friendly: for files like images, audio, and serialized model weights.
  • API-accessible: so tools and scripts can upload and retrieve files on demand.
  • Compatible across environments: so files can be accessed easily from training jobs, inference services, CI/CD pipelines, or any mix of cloud, on-premise or edge systems.

This is where object storage becomes essential. It provides a reliable, scalable way to store large files with flexible access methods — making it a foundational building block for any machine learning infrastructure. In the next section, we’ll explore what object storage is, how it works, and why it has become the standard choice for managing ML data at scale.

What Is Object Storage?

Object storage is a modern approach to storing large amounts of unstructured data — files that don’t fit neatly into tables or structured databases. In the context of machine learning, it has become a foundational building block for storing datasets, model artifacts, logs, and other binary files that need to be reliably persisted and accessed at scale.

At its core, object storage is a write-once, read-many system. You upload a file (called an "object") to the storage service, and it’s stored immutably — meaning you don’t update or modify it in place. If you need a new version, you upload a new file. This immutability simplifies versioning and reduces the complexity of managing changes over time.

Object storage systems differ significantly from traditional file systems or block storage:

  • Compared to file systems (like what you use on your laptop or in a network drive), object storage uses a flat namespace. While you can simulate folder hierarchies using prefixes (e.g. project1/images/), there’s no real directory structure — just a collection of uniquely named objects. This makes the system highly scalable and easier to distribute across many servers.

  • Compared to block storage, which provides low-level access to disk blocks (essentially virtual hard drives you format and manage like a local disk), object storage is much higher-level. You don’t deal with formatting or file systems; you just send a binary blob to the system and retrieve it later via an API call.

Before object storage became common, teams used network file systems like NFS or protocols like FTP to store and share data. These systems worked well in small, centralized environments but struggled with scalability, redundancy, and global access. Cloud providers needed something more flexible — something that could serve petabytes of data across regions, handle millions of files, and remain durable and cost-efficient.

That’s where object storage comes in. It’s designed for scalability, durability, and low cost. Cloud platforms like AWS, Azure, and Google Cloud offer object storage as one of their core services, allowing users to store and retrieve files over the internet using simple APIs. These systems are ideal for storing large datasets, images, videos, audio files, logs, and machine learning artifacts — any file you want to store and retrieve efficiently, without worrying about the underlying infrastructure.

Because of its simplicity and scalability, object storage is now the default choice for storing binary data in ML pipelines. It integrates well with training scripts, deployment tools, and CI/CD workflows — and unlike traditional storage solutions, it can scale from a single file to millions without changing your architecture.

The S3 Standard and Cloud Compatibility

Amazon S3 (Simple Storage Service) was launched by AWS in 2006 and quickly became one of the most widely used cloud storage solutions. It introduced a clean, minimal API for storing and retrieving binary objects over HTTP — designed for scalability, durability, and global access. As the popularity of AWS grew, so did the reach of S3, and over time, its interface evolved into a kind of industry default for object storage.

While S3 itself is a proprietary service, its API is based on simple and well-understood REST principles. It uses standard HTTP verbs like GET, PUT, DELETE, and POST to interact with storage "buckets" and "objects", and because it’s so transparent and accessible, other vendors and platforms began implementing it too — both cloud providers and open-source communities.

Today, the S3 API has become a de facto standard for object storage across the industry. Major platforms like Azure Blob Storage, Google Cloud Storage, Backblaze B2, Cloudflare R2 and Wasabi now offer S3-compatible interfaces, either natively or through compatibility layers. This allows you to develop your application once — using the S3 API — and then switch between providers with minimal changes, or none at all.

The ecosystem goes even further: open-source projects like MinIO and Ceph offer full S3-compatible object storage systems that you can run on your own infrastructure. These are useful in multiple contexts:

  • Prototyping and development: Local S3-compatible servers make it easy to build and test ML applications without incurring cloud costs.
  • Private deployments: For teams that manage on-premise infrastructure, MinIO or Ceph can serve as the backend for production-grade object storage with the same API that cloud systems use.

This widespread compatibility is one of the key strengths of S3 as a model. While there is no official standards body maintaining it, the ecosystem has organically converged on a shared interface. That means your tools, libraries, and workflows — once built around S3 — can remain portable across environments and vendors.

Understanding the S3 REST API

Amazon S3 (Simple Storage Service) offers a RESTful API that enables interaction with your storage resources using standard HTTP methods. In this section, we'll explore the core concepts and operations of the S3 REST API, including buckets, objects, and the primary HTTP methods used to manage them.

Buckets and Objects

  • Buckets: These are containers for storing objects (files). Each bucket has a globally unique name within AWS and serves as the top-level namespace for your data.

  • Objects: These are the individual files stored within buckets. Each object is identified by a unique key (filename) within its bucket.

HTTP Methods and Operations

The S3 REST API utilizes standard HTTP methods to perform operations on buckets and objects:

  • PUT: Create or update a bucket or object.
  • GET: Retrieve an object or list the contents of a bucket.
  • DELETE: Remove an object or bucket.
  • HEAD: Retrieve metadata from an object or bucket without returning the object itself.

Creating a Bucket

To create a new bucket, you send a PUT request to the desired bucket endpoint:

PUT / HTTP/1.1
Host: my-bucket.s3.us-east-1.amazonaws.com
Authorization: AWS4-HMAC-SHA256 Credential=ACCESS_KEY/...

The response will confirm the creation of the bucket.

Uploading an Object

To upload an object to a bucket, you use the PUT method with the object's key:

PUT /my-object.txt HTTP/1.1
Host: my-bucket.s3.us-east-1.amazonaws.com
Content-Length: [object size]
Content-Type: text/plain
Authorization: AWS4-HMAC-SHA256 Credential=ACCESS_KEY/...

The request body contains the content of the object.

Downloading an Object

To download an object, you send a GET request to its key:

GET /my-object.txt HTTP/1.1
Host: my-bucket.s3.us-east-1.amazonaws.com
Authorization: AWS4-HMAC-SHA256 Credential=ACCESS_KEY/...

The response will include the object's data in the body.

Deleting an Object

To delete an object, you use the DELETE method:

DELETE /my-object.txt HTTP/1.1
Host: my-bucket.s3.us-east-1.amazonaws.com
Authorization: AWS4-HMAC-SHA256 Credential=ACCESS_KEY/...

The response will confirm the deletion.

Object Metadata

Each object stored in S3 has metadata — information stored alongside the file. This includes:

  • System metadata: Content type, content length, last modified date, storage class, and more.
  • User-defined metadata: Arbitrary key-value pairs provided at upload time. These can be useful for tagging objects with experiment IDs, model names, or custom notes relevant to your ML pipeline.

This metadata can be retrieved using the HEAD operation or returned in response headers during a GET request.

Additional S3 REST API Operations

Beyond the basic operations, the S3 REST API provides several other methods to manage your storage resources:

  • ListBuckets: Retrieves a list of all buckets owned by the authenticated sender of the request.
  • HeadBucket: Determines if a bucket exists and if you have permission to access it, without returning the bucket's contents.
  • HeadObject: Retrieves metadata from an object without returning the object itself. Useful for checking object existence and properties.
  • CopyObject: Creates a copy of an object that is already stored in Amazon S3.
  • ListObjectsV2: Returns some or all (up to 1,000) of the objects in a bucket. You can use request parameters as selection criteria to return a subset of the objects.
  • DeleteObjects: Enables you to delete multiple objects from a bucket using a single HTTP request.

These endpoints form the foundation of object storage systems. They are deliberately simple, which is one of the reasons why the S3 API has been so successful.

For a comprehensive list and detailed documentation of all S3 REST API operations, you can refer to the Amazon S3 API Reference.

Using S3 in Python with Boto3

Most machine learning tools and frameworks are written in Python, so it’s no surprise that interacting with S3-compatible object storage from Python is a common need. The most widely used Python library for this is Boto3, developed by Amazon for AWS services. But because S3 is a de facto standard, you can use Boto3 with almost any S3-compatible service.

This makes Boto3 a convenient and flexible choice, even if you’re not deploying on AWS. Whether you're storing a trained model, saving logs from inference, or retrieving datasets during training, Python + Boto3 gives you everything you need.

Here’s a basic example that shows common S3 operations from Python:

import boto3

# Initialize the S3 client (works with AWS, MinIO, or other S3-compatible systems)
s3 = boto3.client(
    's3',
    endpoint_url='https://s3.example.com',  # Use your endpoint here
    aws_access_key_id='YOUR_ACCESS_KEY',
    aws_secret_access_key='YOUR_SECRET_KEY',
    region_name='us-east-1'
)

# 1. Create a new bucket
s3.create_bucket(Bucket='my-bucket')

# 2. Upload a local file (e.g., trained model)
s3.upload_file('model.pt', 'my-bucket', 'models/model.pt')

# 3. Download a file (e.g., for inference)
s3.download_file('my-bucket', 'models/model.pt', 'downloaded_model.pt')

# 4. List objects in the bucket
response = s3.list_objects_v2(Bucket='my-bucket')
for obj in response.get('Contents', []):
    print(obj['Key'])

# 5. Get metadata for an object
metadata = s3.head_object(Bucket='my-bucket', Key='models/model.pt')
print(metadata['ContentLength'], metadata['LastModified'])

# 6. Delete an object
s3.delete_object(Bucket='my-bucket', Key='models/model.pt')

To connect to any S3-compatible service, you need credentials — typically an access key ID and a secret access key. These serve as your programmatic login: the access key ID identifies the client, and the secret key is used to sign the request.

Rather than using your personal login credentials, it's standard practice to create dedicated access keys for applications, with restricted permissions. For example, your model training job might have permission to upload models to S3, but your inference service might only be allowed to read them. Most cloud providers (and MinIO) let you define these access policies per key.

When initializing the boto3.client, you also specify an endpoint_url. This is especially important if you’re not using AWS. For example, to connect to a MinIO server or another cloud provider’s S3-compatible storage, you provide the correct endpoint here.
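
As a small sketch of this setup (the endpoint URL, bucket, and environment variable names here are assumptions), credentials are usually read from the environment rather than hardcoded, and the same client code works against MinIO or a cloud provider by changing only the endpoint:

import os
import boto3

# Read credentials from environment variables instead of hardcoding them
s3 = boto3.client(
    's3',
    endpoint_url=os.environ.get('S3_ENDPOINT_URL', 'http://localhost:9000'),  # e.g. a local MinIO server
    aws_access_key_id=os.environ['S3_ACCESS_KEY_ID'],
    aws_secret_access_key=os.environ['S3_SECRET_ACCESS_KEY'],
    region_name='us-east-1'
)

# The rest of the code is identical regardless of the backing provider
s3.upload_file('model.pt', 'my-bucket', 'models/model.pt')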

File Formats for Machine Learning Data

When building scalable machine learning pipelines, how you store your data matters. The wrong file format can waste storage, slow down processing, and make collaboration harder. While beginners often default to formats like CSV or Python pickle files, these choices don’t scale well.

In this section, we explore what makes a file format well-suited for machine learning workflows and offer recommendations tailored to typical ML artifacts.

Choosing the Right Format

Good file formats for ML should meet three key criteria:

  • Compressed: Reducing file size lowers storage costs and speeds up I/O. Smaller files also upload and download faster, even after accounting for compression and decompression time.

  • Portable: Avoid formats tied to a single tool or framework. Using open standards (like JSON, Parquet, ONNX) helps you share datasets or models between teams, tools, or even programming languages.

  • Safe: Certain formats, like Python's pickle, can execute arbitrary code when loading — a serious security risk in untrusted environments. Production systems should never rely on such formats for storage or exchange.

Text, Configs, and Logs

Plaintext logs, YAML configs, and JSON files are common in ML workflows — but they can become large and unwieldy over time. Compressing them is an easy win.

For archiving or transferring logs, tools like gzip (widely supported), zstd (fast with strong compression), or xz (better ratio, slower) are recommended. These compression formats are standard in the industry and integrate well with S3 storage.

Configuration files (e.g. config.yaml, params.json) should remain human-readable but can be compressed if needed for transfer. Logs produced by training runs or inference services — especially at scale — should always be compressed before storage.
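
As a minimal sketch (the file and bucket names are hypothetical, and credentials are assumed to be configured in the environment), a training log can be gzip-compressed before it is uploaded:

import gzip
import shutil
import boto3

# Compress the raw log file to train_run.log.gz
with open('train_run.log', 'rb') as src, gzip.open('train_run.log.gz', 'wb') as dst:
    shutil.copyfileobj(src, dst)

# Upload the compressed file to object storage
s3 = boto3.client('s3')
s3.upload_file('train_run.log.gz', 'my-bucket', 'logs/train_run.log.gz')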

Tabular and Time Series Data

CSV and JSON are easy to use but not efficient. They lack built-in compression, schema enforcement, and support for partial reads. This becomes a problem when working with large datasets.

Instead, use Apache Parquet — a binary, compressed, columnar format designed for analytics:

  • It supports selective reads (e.g. just one column), which speeds up downstream processing.
  • It compresses well using built-in algorithms like Snappy or Zstandard.
  • It’s compatible with Pandas, Polars, DuckDB, Spark, and other tools used in ML and analytics.

Parquet is especially suited for storing sensor data, user events, model metrics, and feature tables — particularly when querying or filtering is part of the pipeline.
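
A short sketch of what this looks like in practice (assuming pandas with the pyarrow engine installed; the file and column names are made up):

import pandas as pd

df = pd.DataFrame({
    'timestamp': pd.date_range('2024-01-01', periods=1_000, freq='min'),
    'sensor_id': ['s1'] * 1_000,
    'value': range(1_000),
})

# Write a compressed, columnar file (snappy is a common default codec)
df.to_parquet('sensor_data.parquet', compression='snappy')

# Read back only the columns you need -- much cheaper than parsing a full CSV
subset = pd.read_parquet('sensor_data.parquet', columns=['timestamp', 'value'])
print(subset.head())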

Images

Storing raw images (e.g. BMP, TIFF) consumes a lot of space. Compression is essential — but the choice between lossy and lossless depends on your use case:

  • PNG offers lossless compression. Use it when data integrity matters — such as with segmentation masks or images with few colors.
  • JPEG is lossy but highly efficient. It's suitable for natural images where minor visual artifacts are acceptable.
  • WebP and AVIF are newer alternatives that offer better compression ratios and support both lossy and lossless modes.

For model training, JPEG is often sufficient. But if your application depends on pixel-perfect accuracy (e.g. medical imaging), prefer lossless formats.
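
As a quick illustration using Pillow (file names are placeholders), the same image can be saved losslessly as PNG or as a smaller, lossy JPEG:

from PIL import Image

img = Image.open('scan.bmp')

# Lossless: every pixel is preserved (larger file)
img.save('scan.png')

# Lossy: much smaller, acceptable when minor visual artifacts don't matter
img.convert('RGB').save('scan.jpg', quality=90)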

Audio

Audio data varies in size and quality. Choose the format based on the level of fidelity required:

  • WAV is uncompressed and large but preserves all signal detail.
  • FLAC is lossless and significantly smaller — a strong default for high-quality training data.
  • MP3 and OGG offer lossy compression — acceptable for tasks like speaker identification or music classification, where perfect reconstruction isn’t necessary.

Store audio at consistent sampling rates (e.g. 16kHz or 44.1kHz), and prefer mono over stereo if the task allows it.

Storing Machine Learning Models

Model files are also binary artifacts — and just like datasets, they benefit from clear structure and safe storage.

  • PyTorch models are typically stored with torch.save() as .pt or .pth files. These are easy to use but tightly coupled to Python and PyTorch versions.
  • For portability, consider ONNX, an open standard for representing ML models across frameworks and platforms. ONNX models can be exported from PyTorch, TensorFlow, and scikit-learn, and deployed on edge devices or in cross-language environments.
  • Model files should be compressed using .tar.gz or .zip, especially when bundling them with config files, training parameters, or preprocessing scripts.

Where possible, include metadata — like training settings, evaluation metrics, or Git commit hashes — alongside the model file in a structured format (e.g. metadata.yaml). This improves reproducibility and auditing down the line.
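
A minimal sketch of this pattern (the model, paths, and metadata fields are illustrative; assumes PyTorch and PyYAML are installed):

import torch
import torch.nn as nn
import yaml

model = nn.Linear(10, 1)  # stand-in for a real trained model

# Native PyTorch checkpoint -- convenient, but tied to Python and PyTorch
torch.save(model.state_dict(), 'model.pt')

# Portable ONNX export for cross-framework or edge deployment
dummy_input = torch.randn(1, 10)
torch.onnx.export(model, dummy_input, 'model.onnx')

# Structured metadata stored alongside the artifacts
metadata = {
    'git_commit': 'abc1234',        # hypothetical values
    'val_accuracy': 0.93,
    'training_samples': 12000,
}
with open('metadata.yaml', 'w') as f:
    yaml.safe_dump(metadata, f)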

By choosing the right format for each type of artifact, you make your ML systems easier to scale, debug, and share. These decisions may seem minor — but in large pipelines, they have a major impact.

Versioning for Reproducibility and Auditability

In machine learning workflows, reproducibility is essential. You often need to answer questions like: What dataset was used to train this model? or What were the exact predictions generated last month? To answer these, you must version your data, models, and outputs — just like you version your source code with Git.

Object storage systems like S3 are well-suited to store this evolving data, but they don’t handle versioning for you automatically. There are three main approaches to consider, depending on your needs.

Manual Versioning: Hashes, Timestamps, and Naming Conventions

The simplest way to version your data is by naming your files accordingly. For example:

s3://my-bucket/datasets/2024-01-01/images.parquet
s3://my-bucket/models/model_v1.2.3.pt
s3://my-bucket/logs/train_run_abc123.log

You can embed timestamps, Git commit hashes, or experiment IDs into filenames. This gives you full control and doesn’t rely on any built-in features. It’s simple and flexible — but entirely up to you to manage.

This approach works well in early stages of development or small teams, but it becomes error-prone at scale. It's easy to accidentally overwrite files or lose track of what was used when.
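
A small sketch of such a naming convention (the key layout and bucket name are just examples):

import subprocess
from datetime import datetime, timezone

import boto3

# Build a key that encodes both the date and the exact code version
date = datetime.now(timezone.utc).strftime('%Y-%m-%d')
commit = subprocess.check_output(['git', 'rev-parse', '--short', 'HEAD'], text=True).strip()
key = f'models/{date}/model_{commit}.pt'

s3 = boto3.client('s3')
s3.upload_file('model.pt', 'my-bucket', key)
print(f'Uploaded to s3://my-bucket/{key}')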

Native S3 Versioning

S3 itself supports basic versioning at the object level. If you enable versioning on a bucket, every time you upload an object with the same key (i.e., the same filename), S3 stores the new version alongside the old one.

This means you can recover older versions, or inspect the full change history of an object. However, S3 versioning is:

  • Object-level only: You can’t snapshot a full directory or dataset.
  • Unstructured: There’s no concept of experiment tracking or metadata linkage.

Still, enabling versioning is a valuable safety net. It can protect against accidental overwrites or deletions, especially for critical data.
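
For reference, enabling versioning and inspecting an object's history with Boto3 looks roughly like this (bucket and key names are placeholders):

import boto3

s3 = boto3.client('s3')

# Turn on versioning for the whole bucket (a one-time operation)
s3.put_bucket_versioning(
    Bucket='my-bucket',
    VersioningConfiguration={'Status': 'Enabled'}
)

# Every subsequent upload of the same key creates a new version
s3.upload_file('model.pt', 'my-bucket', 'models/model.pt')

# List all stored versions of that key
versions = s3.list_object_versions(Bucket='my-bucket', Prefix='models/model.pt')
for v in versions.get('Versions', []):
    print(v['VersionId'], v['LastModified'], v['IsLatest'])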

Versioning Layers on Top: DVC and LakeFS

For more advanced use cases — where you want Git-like behavior over large datasets — tools like DVC (Data Version Control) and LakeFS add a layer of versioning on top of S3.

These tools work by storing metadata about your data layout and content:

  • DVC integrates with Git to version pointers to your data stored in S3 or other backends. You get branching, commit history, and reproducible pipelines — without storing large files in Git itself.
  • LakeFS turns your S3-compatible bucket into a Git-like versioned filesystem, supporting branches, merges, and commits across your entire dataset.

This makes it possible to track dataset evolution, reproduce old training runs, and safely collaborate on data.

S3 Versioning Summary:

  • Manual versioning is flexible but error-prone.
  • Native S3 versioning helps at the object level.
  • Tools like DVC and LakeFS provide full dataset-level versioning on top of S3.

For machine learning systems where auditability and traceability are important — especially in production — versioning is not optional. Choosing the right approach helps you move fast without sacrificing reproducibility.

Multipart Uploads: Reliable Transfers for Large Files

Large files — such as trained models or datasets — are often tens or hundreds of gigabytes in size. Uploading them as a single object can fail due to timeouts, network interruptions, or size limits.

Multipart uploads solve this by letting you split a file into smaller parts, upload them independently, and reassemble them on the server.

This approach improves resilience (only failed parts need retrying), speed (parts can be uploaded in parallel), and flexibility (uploads can be resumed if interrupted). It's a built-in feature of the S3 API and recommended for files over 100 MB.

Under the hood, you first initiate a multipart upload (getting an UploadId), then upload each part with a numbered request, and finally complete the upload by sending a manifest of all parts. S3 handles the assembly — and libraries like Boto3 manage all of this behind the scenes, so you rarely need to deal with these steps manually.

Example: Uploading a Large File in Python

import boto3
from boto3.s3.transfer import TransferConfig

# Enable multipart uploads for files over 100 MB
config = TransferConfig(multipart_threshold=100 * 1024 * 1024)

s3 = boto3.client('s3')
s3.upload_file(
    Filename='model.tar.gz',
    Bucket='ml-artifacts',
    Key='models/2024-05/model.tar.gz',
    Config=config
)

By using multipart uploads — especially when storing large binary artifacts in object storage — you ensure faster, safer, and more robust transfers across all environments.

Advanced Features and Best Practices in Amazon S3

Beyond basic storage and retrieval, Amazon S3 offers advanced functionalities that enhance data management, security, and cost-efficiency — crucial for machine learning applications.

Lifecycle Rules: Automating Data Management

S3 Lifecycle configurations allow you to define rules that automatically transition objects between storage classes or delete them after a specified period. This is particularly useful for managing data that becomes less critical over time, such as:

  • Transitioning: Moving infrequently accessed data to cost-effective storage classes like S3 Glacier.
  • Expiration: Automatically deleting outdated datasets or model versions to free up storage.

Implementing lifecycle rules helps in optimizing storage costs and maintaining a clean data environment.
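
As a sketch with Boto3 (the bucket name, prefix, and retention periods are arbitrary), a lifecycle rule can move old logs to Glacier and delete them after a year:

import boto3

s3 = boto3.client('s3')

s3.put_bucket_lifecycle_configuration(
    Bucket='ml-artifacts',
    LifecycleConfiguration={
        'Rules': [
            {
                'ID': 'archive-old-logs',
                'Filter': {'Prefix': 'logs/'},
                'Status': 'Enabled',
                # Move to cheaper storage after 90 days...
                'Transitions': [{'Days': 90, 'StorageClass': 'GLACIER'}],
                # ...and delete entirely after one year
                'Expiration': {'Days': 365},
            }
        ]
    }
)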

Object Locking: Ensuring Data Immutability

S3 Object Lock enables a write-once-read-many (WORM) model, preventing objects from being deleted or overwritten for a defined retention period. This feature is essential for:

  • Compliance: Meeting regulatory requirements by safeguarding critical data.
  • Auditability: Preserving model versions and datasets used in production for future audits.

Object Lock can be configured in two modes:

  • Governance Mode: Allows privileged users to alter retention settings.
  • Compliance Mode: Strictly enforces retention policies, prohibiting any modifications during the retention period.
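
For illustration (the bucket must have Object Lock enabled when it is created; the names and dates here are hypothetical), a retention period can be placed on an individual object like this:

from datetime import datetime, timezone

import boto3

s3 = boto3.client('s3')

# Keep this model version immutable until the given date
s3.put_object_retention(
    Bucket='ml-artifacts',          # bucket created with Object Lock enabled
    Key='models/model_v1.2.3.pt',
    Retention={
        'Mode': 'GOVERNANCE',       # or 'COMPLIANCE' for strict enforcement
        'RetainUntilDate': datetime(2026, 1, 1, tzinfo=timezone.utc),
    }
)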

Access Control: Fine-Grained Permissions

Controlling access to S3 resources is vital for security and operational efficiency. S3 supports:

  • Bucket Policies: JSON-based policies attached to buckets, defining permissions for users and actions.
  • IAM Policies: Identity-based policies that grant permissions to users or roles across AWS services.

For example, a bucket policy granting read-only access to a specific user might look like:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadOnlyAccess",
      "Effect": "Allow",
      "Principal": {"AWS": "arn:aws:iam::123456789012:user/MLUser"},
      "Action": ["s3:GetObject"],
      "Resource": ["arn:aws:s3:::ml-models/*"]
    }
  ]
}

Implementing precise access controls ensures that users and applications have appropriate permissions, enhancing security.

Pre-Signed URLs: Secure Temporary Access

Pre-signed URLs grant time-limited access to S3 objects without exposing AWS credentials. They are useful for:

  • Secure Sharing: Allowing temporary access to datasets or model artifacts.
  • Controlled Uploads: Enabling users to upload data without granting them full S3 access.

In Python, you can generate a pre-signed URL using Boto3:

import boto3

s3_client = boto3.client('s3')
response = s3_client.generate_presigned_url('get_object',
                                            Params={'Bucket': 'ml-models',
                                                    'Key': 'model.pkl'},
                                            ExpiresIn=3600)
print(response)

This URL allows access to model.pkl for one hour.
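
The same mechanism works for controlled uploads: a pre-signed PUT URL lets a client upload a single object without holding any S3 credentials of its own. A sketch (bucket and key are placeholders; the client-side upload here uses the requests library):

import boto3
import requests

s3_client = boto3.client('s3')

# Generate a URL that allows uploading exactly one object for 15 minutes
upload_url = s3_client.generate_presigned_url(
    'put_object',
    Params={'Bucket': 'ml-datasets', 'Key': 'incoming/new_samples.parquet'},
    ExpiresIn=900
)

# The client (which has no AWS credentials) performs a plain HTTP PUT
with open('new_samples.parquet', 'rb') as f:
    response = requests.put(upload_url, data=f)
print(response.status_code)  # 200 on success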

From OLTP to OLAP: Turning Live Data into ML-Ready Datasets

Most developers are familiar with traditional relational databases like MySQL or PostgreSQL. These are examples of OLTP systems (Online Transaction Processing): databases designed for fast, frequent operations like inserts, updates, and deletes. They’re optimized for application performance: when a user updates their profile or a sensor pushes new readings every second, OLTP databases handle that load with speed and consistency.

Some domains use more specialized OLTP systems. For example, in industrial settings or IoT environments, time-series databases like InfluxDB or TimescaleDB are used to store timestamped data from sensors and machines. These databases are excellent at handling live, high-frequency, mutable data — but they’re not designed for training machine learning models.

That’s because machine learning has very different requirements.

To train a model, we don’t want the data to change every time we rerun the pipeline. We need frozen datasets — consistent snapshots of the world at a specific point in time. This ensures reproducibility: if we retrain a model with the same code and the same data, we should get the same result. If the data has changed — even slightly — the model may behave differently, and debugging becomes almost impossible.

This is where OLAP systems (Online Analytical Processing) come into play. OLAP systems are optimized for reading large volumes of data, often organized in a denormalized or columnar format. Rather than serving users in real time, OLAP systems serve analysts and data pipelines with historical, structured datasets.

In many modern data workflows, the OLAP layer is built on top of object storage — and most commonly, this means Amazon S3 or an S3-compatible system. Large datasets are stored as compressed columnar files (typically Parquet), and read by tools like Pandas, Polars, DuckDB or Spark. S3 isn’t a database in itself, but it acts as the backing store for an analytics system — a data lake that scales nearly infinitely and plays well with modern ML tooling.

For smaller teams or simpler applications, you may not need a dedicated OLAP database at all. Just storing cleaned, versioned datasets in S3 — using good file formats and naming conventions — is often enough. But for more advanced querying, indexing, or joining across datasets, OLAP engines like ClickHouse, Databricks, Snowflake, or BigQuery offer powerful interfaces to slice and aggregate data directly on S3 or other distributed storage.

But how does data move from your live OLTP database to this frozen OLAP world?

The answer is batch processing, typically through ETL (Extract, Transform, Load) or ELT (Extract, Load, Transform) pipelines:

  1. Extract: Pull data from a live source — e.g. a time-series database, or a relational database like PostgreSQL.
  2. Transform: Clean and structure the data, filter irrelevant records, convert timestamps, normalize fields.
  3. Load: Write the transformed data into S3 — typically as Parquet files, organized by date, event type, or another logical key.

This doesn’t happen continuously, but in batches — every few hours, once per day, or whenever it fits your system. That way, training pipelines can rely on stable datasets, and you retain a historical record of what was used when.
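
A highly simplified sketch of one such batch job (the connection string, table, query, and S3 layout are assumptions, and writing directly to s3:// paths with pandas requires the s3fs package):

from datetime import date, timedelta

import pandas as pd
from sqlalchemy import create_engine

# 1. Extract: pull yesterday's records from the live OLTP database
run_date = date.today() - timedelta(days=1)
engine = create_engine('postgresql://user:password@db-host:5432/app')
df = pd.read_sql(f"SELECT * FROM sensor_readings WHERE reading_date = '{run_date}'", engine)

# 2. Transform: clean and normalize before freezing the snapshot
df = df.dropna(subset=['value'])
df['timestamp'] = pd.to_datetime(df['timestamp'], utc=True)

# 3. Load: write a compressed Parquet file into the data lake, partitioned by date
df.to_parquet(f's3://ml-datalake/sensor_readings/date={run_date}/part-000.parquet')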

S3 doesn’t replace your transactional databases — you’ll still use OLTP systems to manage user interactions, sensor inputs, or metadata. But in ML workflows, S3 (or an OLAP system built on top of it) becomes the central store for all training and inference data.

Designing this flow well — from live updates to batch analytics — is essential for building scalable and reproducible machine learning systems.


7. Continuous Integration and Continuous Deployment (CI/CD) for Machine Learning

Why Automate Machine Learning Workflows?

In the early days of machine learning, many projects lived in notebooks or ad-hoc scripts. A researcher or engineer would experiment with models locally, tweak a few hyperparameters, run training scripts, and maybe save a model file to disk. If the model worked well, the same person might turn it into an API, manually upload it to a server, and hope everything ran as expected.

This approach can work — but only in the earliest stages of a project. As soon as multiple people are involved, or you need to retrain regularly, or your system is deployed in production and needs to evolve safely, manual workflows become a bottleneck.

Modern machine learning systems don’t just run code once — they:

  • Ingest new data regularly,
  • Train and evaluate models repeatedly,
  • Deploy models to production environments,
  • Monitor those models over time,
  • And evolve based on new requirements, feedback, or data.

Each of these steps involves many moving parts: data preprocessing scripts, model training code, dependency installation, infrastructure setup, deployment logic, and validation checks. If any of this is done manually — or inconsistently — you quickly run into problems:

  • "It worked on my machine" — but fails in production.
  • "Which model version is live right now?" — nobody’s sure.
  • "Can we retrain with last month’s data?" — not easily.

To build reliable and scalable systems, we need a way to automate all of this. That’s where CI/CD comes in.

From Dev and Ops to DevOps: A Brief History

Before we get to CI/CD, let’s rewind a bit. In traditional software development, teams were split into developers (who wrote the code) and operations (who ran the infrastructure). Developers would build an application, hand it off to the Ops team, and move on. This handoff often led to friction:

  • Devs wanted to push changes quickly. Ops wanted to keep things stable.
  • Devs worked in flexible environments. Ops had to deal with real-world constraints.
  • When something broke, it was always someone else’s fault.

Out of this tension emerged a new culture: DevOps. The core idea was simple but powerful:

Break the wall between development and operations by automating the path from code to production.

In a DevOps mindset, developers are also responsible for making sure their code runs in production. And instead of handing things over manually, they use automated pipelines to test, package, and deploy code safely and consistently.

This shift gave rise to two key practices:

  • Continuous Integration (CI): Automatically testing and validating code every time a change is made.
  • Continuous Deployment (CD): Automatically pushing that validated code into production — or at least into a staging environment.

Together, CI/CD pipelines became the backbone of modern software delivery.

Continuous Integration (CI)

Continuous Integration is about making sure that developers don’t work in isolation for too long. Instead of each developer working on their own branch for days or weeks and then facing painful merge conflicts, CI encourages everyone to frequently integrate their changes into a shared branch — often called the main or development branch.

But this only works if you have trust in that shared branch — and that trust comes from automated testing and careful code review. Pull requests provide the structure for both: they run tests automatically and create space for teammates to review and discuss changes before merging.

Every time someone opens a pull request or merges code, CI systems automatically:

  • Fetch the latest version of the codebase,
  • Install dependencies,
  • Run unit tests, integration tests, linting checks,
  • Optionally build the application or package.

The goal is simple: catch problems early, before they reach production. If a test fails, the pipeline blocks the change from being merged. This ensures that the main branch stays healthy and that the team can move quickly without stepping on each other’s toes.

In practice, CI is your team’s first line of defense against bugs, regressions, and quality issues.

Continuous Deployment / Delivery (CD)

If CI is about making sure code is safe, CD is about making sure it gets delivered.

In the past, many teams would batch up changes and release them once every few months — often in stressful, high-risk deployments. Today, CD flips that model on its head. The idea is to:

  • Deploy small changes often, rather than large changes rarely.
  • Automate the deployment process as much as possible.
  • Deliver value to users quickly, so feedback can be gathered faster.

CD typically comes in two flavors:

  • Continuous Delivery means changes are automatically tested and packaged, and then ready to be deployed — but a human decides when to push the button.
  • Continuous Deployment goes one step further: changes that pass all tests are deployed automatically, often within minutes of being merged.

This tightens the feedback loop with users, uncovers bugs earlier, and encourages a more Agile, iterative mindset. And because deployments happen often, they become routine, not scary.

Why Machine Learning Projects Need CI/CD

At first glance, ML projects may not seem like traditional software systems — they involve data science, experimentation, notebooks, and training jobs. But underneath all that, they are still software projects. They involve:

  • Code (data pipelines, training scripts, APIs),
  • Dependencies (Python packages, CUDA versions),
  • Artifacts (models, logs, metrics),
  • Infrastructure (GPUs, cloud services),
  • And users who rely on the output.

And ML systems are actually harder to manage than traditional software:

  • The data changes constantly — so your model might break even if the code doesn’t.
  • Models need to be retrained, re-evaluated, re-packaged — often repeatedly.
  • Bugs might not show up in tests — but in silent model drift or subtle regressions.

That’s why CI/CD is not just useful for ML projects — it’s essential.

You need Continuous Integration (CI) to:

  • Automatically test data and code,
  • Validate model training pipelines,
  • Catch issues in model performance early.

You need Continuous Deployment (CD) to:

  • Re-deploy updated models safely,
  • Ship new features quickly (e.g. improved preprocessing),
  • Gather feedback from users on real-world performance.

ML projects may be built differently than traditional software — but they still live in source control, they still go through testing, and they still need to be reliable, reproducible, and safe to update. That’s why the same principles of CI/CD apply — and even more so.

Creating Your First GitHub Actions Workflow

So far, we’ve talked about CI/CD in theory. Now it’s time to make it real.

In this course, we’ll use GitHub Actions — GitHub’s built-in system for automation. It lets you run scripts and pipelines directly from your repository, in response to events like code pushes, pull requests, or manual triggers. Because it’s built into GitHub, you don’t need to install or configure anything extra to get started.

A GitHub Actions workflow is defined in a simple YAML file that lives in your repository — specifically under the .github/workflows/ folder. This means your automation scripts live right next to your code, are version-controlled like everything else, and evolve with your project. It’s one of the reasons GitHub Actions has become a popular choice in both traditional software teams and machine learning workflows.

Each workflow consists of three main parts. First, a trigger defines when the workflow runs — for example, when someone pushes code to the main branch or opens a pull request. Then, the workflow contains one or more jobs — logical units of work, like running tests or building a container. Each job runs independently and can be parallelized or sequenced. Finally, each job contains a series of steps — individual commands or reusable actions that are executed in order. These steps can install dependencies, run Python scripts, execute training code, or validate results.

Let’s look at a minimal example to tie all this together. Suppose you want to run Python unit tests every time someone pushes code to the main branch. Here’s what that workflow might look like:

📄 File path: .github/workflows/test.yml

name: Run tests on main branch

on:
  push:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest

    steps:
    - name: Check out code
      uses: actions/checkout@v4

    - name: Set up Python
      uses: actions/setup-python@v5
      with:
        python-version: '3.13'

    - name: Install dependencies
      run: pip install -r requirements.txt

    - name: Run unit tests
      run: pytest .

This workflow is triggered by any push to the main branch. It defines a single job called test, which runs on a standard Ubuntu environment. That job has four steps: it checks out your code, sets up Python, installs your dependencies, and runs your tests using pytest.

Once you commit this file, GitHub will pick it up automatically and run it whenever the trigger condition is met. You’ll see the results — including logs, errors, and runtime — in the Actions tab of your repository.

This is the core of CI in practice: every time you change your code, your project gets tested automatically. In the next sections, we’ll dig deeper into each part of GitHub Actions — from workflow triggers to jobs, runners, secrets, and more.

Workflow Triggers: When Do GitHub Actions Run?

Now that you've seen how to define a basic GitHub Actions workflow, let's explore what causes that workflow to run. In GitHub Actions, these are called triggers — events that start your workflow. Triggers can be automatic (like pushing code or opening a pull request) or manual (like clicking a button in the GitHub UI). Understanding these triggers is key to building useful automation.

Triggering on Code Pushes

The most basic and widely used trigger is push. It activates a workflow whenever someone pushes code to a branch in your repository.

This is particularly useful for automating deployment processes. For instance, when code is pushed to the main branch, a workflow can be set up to automatically deploy a web dashboard or upload a new model to your inference server. While tests can also be run on push events, it’s often more effective to run them during pull requests to catch issues before merging.

In your workflow YAML file, you would configure it like this:

on:
  push:
    branches: [main]

This setup ensures that every push to main triggers the workflow.

Triggering on Pull Requests

The pull_request trigger runs workflows in response to pull request events, such as opening a new PR or pushing new commits to an existing one. This is ideal for running tests, linters, and other checks to ensure code quality before any changes are merged.

Pull request workflows help catch issues early and protect shared branches from breaking changes. By configuring required status checks, you can make sure that a pull request cannot be merged unless the workflow passes. This helps teams enforce quality standards without relying on manual oversight.

These required checks are configured in GitHub’s repository settings, using branch protection rules. You can set up a rule for your main branch (or any other critical branch) and require that specific workflows must pass before merging is allowed.

on:
  pull_request:

This configuration ensures that any pull request, regardless of the target branch, will trigger the workflow.

Scheduling Workflows

Sometimes, you need workflows to run at regular intervals — like retraining a model every night or cleaning up old data weekly. The schedule trigger lets you run workflows on a cron-like schedule.

For example, to run a workflow every day at midnight UTC:

on:
  schedule:
    - cron: '0 0 * * *'

This setup is perfect for tasks that need to happen regularly, regardless of code changes.

Manually Triggering Workflows

The workflow_dispatch trigger allows you to run a workflow manually from the GitHub UI, CLI, or API. This is useful for tasks like deploying to production or running scripts on demand.

You can define input parameters to customize the workflow execution:

on:
  workflow_dispatch:
    inputs:
      environment:
        description: 'Deployment environment'
        required: true
        default: 'staging'

With this configuration, a "Run workflow" button appears in the GitHub Actions tab, allowing you to select the environment before execution.

Triggering Workflows from Other Workflows

The workflow_call trigger lets you define reusable workflows that can be called from other workflows. This promotes modularity and reduces duplication.

For example, you might have a common testing workflow:

# .github/workflows/test.yml
on:
  workflow_call:

Then, in another workflow, you can call this one:

jobs:
  call-tests:
    uses: ./.github/workflows/test.yml

This approach helps maintain consistency across your workflows.

Jobs and Runners: Where and How Workflows Execute

Now that you’ve seen how to trigger workflows, it’s time to talk about where and how those workflows actually run.

Every GitHub Actions workflow is made up of one or more jobs. A job is a collection of steps that execute in a clean environment — like installing dependencies, running tests, training a model, or deploying an application. Each job runs on a runner, which is a virtual machine that executes your code.

GitHub-Hosted Runners

By default, jobs run on GitHub-hosted runners. These are virtual machines provided by GitHub that spin up automatically and come pre-installed with many common tools. You can specify the operating system using the runs-on key:

runs-on: ubuntu-latest

You can also use windows-latest or macos-latest, but in practice, most data science and machine learning work happens on Linux, especially in cloud environments. Tools like Python, Docker, CUDA, and most ML libraries are Linux-first — and it’s the environment you'll most commonly use in production. So in this course, we’ll focus almost exclusively on Linux runners.

Running Jobs in Parallel or Sequence

Each job in a workflow is isolated and can be run in parallel with others — which can speed up your CI/CD pipelines significantly. For example, you might want to run tests and linting in two separate jobs, and they can execute at the same time:

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      # test steps

  lint:
    runs-on: ubuntu-latest
    steps:
      # linting steps

If one job depends on another, you can define the order using the needs: keyword:

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      # build steps

  deploy:
    runs-on: ubuntu-latest
    needs: build
    steps:
      # deployment steps

This ensures deploy only runs after build has completed successfully.

Self-Hosted Runners

While GitHub’s runners are convenient, they aren’t always enough — especially in machine learning projects where:

  • You need access to GPUs or high-performance hardware,
  • Your code depends on a private environment, such as an internal network or on-premise data source,
  • You want faster access to local databases, large datasets, or custom tools,
  • You’re building Docker images that need to interact with private container registries,
  • Or you just need more control over the environment (e.g. specific CUDA versions or system libraries).

In these cases, GitHub Actions supports self-hosted runners — your own machines (in the cloud, on-premise, or even on your laptop) that run jobs exactly the same way GitHub-hosted runners do.

To use them, you register your machine with GitHub, and it listens for incoming jobs. In your workflow, you simply target the runner by label:

runs-on: self-hosted

You can also define custom labels (e.g. gpu, training, secure) to route specific jobs to specific machines:

runs-on: [self-hosted, gpu]

This flexibility allows you to integrate GitHub Actions into private infrastructure while still benefiting from the same workflows, syntax, and automation.

Workflow Steps: Running Commands and Scripts

By now, you've seen what a job looks like inside a GitHub Actions workflow. Each job is made up of one or more steps, which are executed sequentially — one after the other.

In contrast to jobs (which can run in parallel), steps always run in order, inside the same runner environment. That means files created or commands run in one step are still available in the next step. This makes steps ideal for scripting logical tasks: installing dependencies, running tests, downloading data, or cleaning up afterward.

You’ve already seen examples like this:

steps:
  - name: Install dependencies
    run: pip install -r requirements.txt

  - name: Run tests
    run: pytest

The simplest steps use the run: keyword to execute shell commands — just like you'd type them into a terminal. These can be anything from Bash one-liners to full scripts, such as:

  • Installing packages with apt-get or pip,
  • Running a training script with python train.py,
  • Calling a CLI tool or any Linux software.

Using GitHub Actions from the Marketplace

In addition to writing your own shell commands, you can also plug in pre-built actions created by the community. These are published in the GitHub Marketplace, and they do everything from checking out your code to deploying your app to the cloud.

You’ve already used some of the most common ones:

- name: Check out code
  uses: actions/checkout@v4

- name: Set up Python
  uses: actions/setup-python@v5
  with:
    python-version: '3.10'

These steps don’t run shell commands — they use the uses: keyword to run an action, which is a small reusable component with its own logic.

Some other popular actions include:

  • actions/cache – for caching dependencies (e.g. pip, conda),
  • actions/upload-artifact – for saving files from your workflow (e.g. models, logs),
  • docker/build-push-action – for building and pushing Docker images,
  • community S3 sync actions (such as jakejarvis/s3-sync-action) – to sync a directory with a remote S3 bucket,
  • aws-actions/configure-aws-credentials – to configure your AWS credentials,
  • aws-actions/amazon-ecs-deploy-task-definition – to deploy containers to ECS.

Actions make your workflows cleaner, more maintainable, and often more efficient. You don’t need to reinvent the wheel — you just plug in a building block and go. If you want to explore what’s available, visit the GitHub Actions Marketplace.

Environment Variables and Secrets

In any CI/CD pipeline, you often need to configure things dynamically: which environment to deploy to, where to upload artifacts, which credentials to use, and so on. GitHub Actions provides two key mechanisms for injecting configuration into your workflows: environment variables and secrets.

Using Environment Variables

Environment variables are a standard way to pass configuration into scripts. In GitHub Actions, you can define them at different levels:

  • At the workflow level (shared across all jobs),
  • At the job level (shared across all steps in a job),
  • Or directly within a single step.

For example:

jobs:
  deploy:
    runs-on: ubuntu-latest
    env:
      ENVIRONMENT: production
    steps:
      - name: Print environment
        run: echo "Deploying to $ENVIRONMENT"

Environment variables like this are useful for passing runtime parameters (e.g. dataset names, URLs, experiment flags). But they are not secure — they’re just plain text, visible in logs and exposed to anyone with access to the repository.

Introducing GitHub Secrets

For sensitive values like API keys, database passwords, or cloud credentials, you should use GitHub Secrets.

Secrets are encrypted variables that you store securely in your GitHub repository settings. They can only be accessed by workflows running in that repository — and never appear in logs by default.

To add a secret:

  1. Go to your GitHub repository.
  2. Click on SettingsSecrets and variablesActions.
  3. Click New repository secret and enter a name like AWS_ACCESS_KEY_ID.

You can access a secret in a workflow using the secrets context:

env:
  AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
  AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}

These environment variables can then be used by tools inside your steps — all without ever exposing the keys in your source code.

Example: Uploading to S3

Here’s a simple example of using secrets to upload files to an S3 bucket:

- name: Upload to S3
  run: aws s3 cp model.pkl s3://my-ml-models/ --region eu-west-1
  env:
    AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
    AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}

This approach keeps your credentials safe and your repository clean — and it scales with you as you deploy to more environments or services.

By managing secrets this way, your CI/CD pipeline becomes the secure bridge between your code and your infrastructure. Whether you’re pushing models to S3, triggering API calls, or logging into cloud services, GitHub Secrets help you do it safely, cleanly, and automatically.

Why CI/CD Matters in Machine Learning

CI/CD isn’t just a convenience for software developers — it’s a foundational practice for building reliable, scalable machine learning systems. As your ML projects grow beyond experiments and prototypes, automation becomes essential. CI lets you test and validate your code and models automatically, ensuring that every change — from preprocessing scripts to model logic — is safe and reviewable. CD enables you to deliver those changes quickly and consistently, whether you're shipping a model update to production or publishing a retraining pipeline.

For ML teams, CI/CD can be used to:

  • Run integration tests on data pipelines and model outputs,
  • Lint and validate code automatically before merges,
  • Deploy models or APIs to staging or production environments,
  • Schedule nightly training or evaluation runs,
  • Share reproducible results with teammates through version-controlled automation.

Together, these practices help you move faster without breaking things. And while CI/CD doesn't solve every problem in ML (especially data-driven workflows — which we’ll explore next), it’s the first step toward treating machine learning as an engineering discipline — not just an experiment.


8. Data Pipelines and Orchestration Frameworks

As machine learning projects grow beyond experimentation, they often involve many interconnected steps: pulling in fresh data, cleaning it, training new models, evaluating results, and sometimes retraining or rolling back based on performance. Managing these steps manually — or stitching them together with scripts — quickly becomes error-prone and hard to maintain.

That’s where data orchestration frameworks come in. Tools like Airflow, Dagster, and Prefect are designed to define and run multi-step workflows in a reliable, repeatable way. They help you coordinate dependencies, schedule jobs, monitor progress, and recover from failures — all with automation in mind.

Unlike CI/CD systems, which are typically triggered by changes in code (like a Git push), these frameworks are often triggered by data events or schedules. For example, “run this training pipeline every night,” or “only start if new files arrive in a cloud bucket.”

In this chapter, we’ll look at what these tools are, why they exist, and how they differ. We won’t go deep into any one framework — instead, the goal is to give you a bird’s-eye view of the orchestration landscape so you know what’s out there and when to reach for it.

Understanding the Role of Orchestration Frameworks

By now, you’ve seen how powerful CI/CD pipelines can be for machine learning teams. Tools like GitHub Actions help you test code, package models, deploy APIs, and integrate with cloud services — all automatically, every time your code changes.

But at some point, code isn’t the only thing that changes.

In real-world ML systems, the arrival of new data often triggers the next step in your pipeline. A model might need to retrain daily, but only if enough new data has been collected. You might want to preprocess 10 files in parallel, but then aggregate the results before moving on. Or you might want to resume a long-running job from the exact step where it failed — without starting over.

These kinds of workflows stretch the limits of traditional CI/CD tools. Systems like GitHub Actions were built for code-driven automation — they're great at reacting to commits, pull requests, or version tags. But they aren’t designed for workflows that are:

  • Triggered by data, not just code (e.g. new files in an S3 bucket),
  • Non-linear, with complex dependencies between tasks,
  • Dynamic, changing shape based on runtime input,
  • Long-running, potentially taking hours or days to complete,
  • Retryable, where individual steps can fail and be resumed in isolation.

This is where orchestration frameworks shine.

Tools like Airflow, Dagster, and Prefect are built from the ground up to handle data-centric workflows. They let you define workflows as DAGs — Directed Acyclic Graphs — where each node is a task, and edges define the dependencies between them. These tools manage execution order, handle failures gracefully, and give you visibility into every task’s status and output.

More importantly, they speak the language of data. They integrate with cloud storage, databases, APIs, and ML tools. They support rich scheduling, custom triggers, and modular design. Many have plugin systems for tools like dbt, S3, Kubernetes, and more — because they know that real pipelines are part of a broader ecosystem.

In short:

  • CI/CD tools help you test and ship software.
  • Orchestration frameworks help you run and manage workflows.

And in machine learning, you often need both.

DAGs: The Backbone of Modern Workflows

At the heart of every orchestration framework — whether it’s Airflow, Dagster, or Prefect — is the Directed Acyclic Graph, or DAG. This is a simple but powerful structure for defining workflows: a set of tasks (nodes) connected by dependencies (edges) that must be followed in a specific order.

The “directed” part means tasks flow in a single direction — from start to finish. The “acyclic” part ensures there are no loops: you never go backwards or re-run earlier tasks accidentally. This makes DAGs ideal for data workflows, which typically follow a clear and logical progression.

Take the example illustrated below:

Here, we load sales data, split it by currency, convert each currency to USD in parallel, then summarize and generate reports. This is a perfect real-world DAG:

  • Tasks like “Convert Euros to USD” and “Convert Pounds to USD” can run in parallel.
  • Tasks like “Summarize by Region” only start after all currency conversions are done — enforcing dependencies.
  • If one step fails, orchestrators can retry just that step — offering robust failure handling.

And beyond static graphs like this, modern tools (like Dagster or Prefect) support dynamic DAGs — pipelines that adjust at runtime depending on the input data or configuration. For example: if new currencies appear, new branches of the graph can be created automatically, without rewriting the pipeline.
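
As a small sketch of what this looks like in code (using Prefect-style syntax; the task bodies are omitted and the data shape is assumed), the set of branches is derived from the data at runtime:

from prefect import flow, task

@task
def convert_to_usd(currency: str, rows: list) -> list:
    ...  # call an exchange-rate service and convert each row

@flow
def currency_pipeline(sales_by_currency: dict):
    converted = []
    # One task run per currency actually present in today's data --
    # the graph grows or shrinks with the input, no pipeline rewrite needed
    for currency, rows in sales_by_currency.items():
        converted.append(convert_to_usd(currency, rows))
    return converted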

This level of control and flexibility is far beyond what traditional CI/CD tools can offer. In GitHub Actions, you can run jobs in parallel or sequence, but you can’t express conditional logic, runtime branches, or fine-grained retries the same way.

In short: DAGs are what turn your machine learning workflows from scripts into systems — structured, observable, and built to scale.

Apache Airflow: The Veteran

Airflow is the oldest and most widely adopted orchestration framework in the data world. Originally built at Airbnb, it's designed around the idea of writing your workflows as Python code — where each task is a function, and their dependencies form a Directed Acyclic Graph (DAG).

Airflow is highly extensible, has a rich plugin ecosystem, and is deeply embedded in many enterprise stacks. But it also comes with some trade-offs:

Strengths:

  • Battle-tested in production.
  • Excellent UI for tracking runs and debugging.
  • Huge community and plugin support.

⚠️ Limitations:

  • Largely static DAGs: the graph must be defined up front (dynamic task mapping only arrived in recent Airflow 2.x releases).
  • Boilerplate-heavy and sometimes awkward for dynamic workflows.
  • Can be overkill for small or fast-moving ML teams.

Here’s an example of what an Airflow DAG looks like:

from datetime import datetime
from airflow.decorators import dag

@dag(schedule_interval="@daily", start_date=datetime(2024, 1, 1))
def daily_pipeline():
    # clean_data, train_model, evaluate_model are @task-decorated functions defined elsewhere
    clean_data() >> train_model() >> evaluate_model()

daily_pipeline()  # instantiating the decorated function registers the DAG with Airflow

Each task is written in Python, and >> is Airflow’s way of saying “run this task after the previous one.” Simple, clear, and great for scheduled pipelines — as long as you can live with its rigidity.
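For context, each of those names refers to a task function. With Airflow's TaskFlow API, a task is just a decorated Python function (a minimal sketch; the function body is a placeholder):

from airflow.decorators import task

@task
def clean_data() -> str:
    # Placeholder body: clean the raw sales data and return the path to the result
    return "/data/cleaned/sales.parquet"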

Dagster: The Asset-Centric Approach

Dagster is a modern orchestration framework that takes a different approach: instead of thinking in terms of tasks, Dagster encourages you to model your workflows around data assets — the things you produce and consume, like cleaned datasets, trained models, or reports.

This mindset shift brings powerful benefits: you get built-in lineage tracking, better testability, and fine-grained observability. Dagster is designed with type safety, modularity, and developer experience in mind, making it a great fit for machine learning projects that evolve over time.

Strengths:

  • Asset-first design makes dependencies and lineage explicit.
  • Strong support for testing, type checking, and modular pipelines.
  • Excellent developer experience and web UI.

⚠️ Limitations:

  • Still evolving; smaller ecosystem than Airflow.
  • Some newer concepts (like assets vs. jobs) can be confusing at first.
  • Best suited for Python-first teams.

Here’s what a simple asset graph might look like:

from dagster import asset

# fetch_from_s3 and train are placeholder helpers assumed to be defined elsewhere
@asset
def raw_data():
    return fetch_from_s3()

@asset
def trained_model(raw_data):
    # Dagster infers the dependency on raw_data from the parameter name
    return train(raw_data)

Each asset is a reusable, testable unit. Dagster handles the wiring, scheduling, and execution — and keeps track of what was run, when, and with which inputs.
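For local runs and tests, Dagster can also materialize assets in-process. Here is a minimal sketch, assuming the two assets above are importable from your module:

from dagster import materialize

# Executes both assets locally, respecting the raw_data -> trained_model dependency
result = materialize([raw_data, trained_model])
assert result.success

This is one of the practical payoffs of the asset-centric model: the same definitions serve as production pipeline, local test, and lineage record.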

Prefect: The Flexible Newcomer

Prefect is a newer orchestration tool built with developer experience in mind. It aims to solve many of the usability issues found in older systems like Airflow. With a clean, Pythonic API and support for dynamic workflows, Prefect is especially attractive to data scientists and ML practitioners who want minimal friction and fast iteration.

It supports rich task state tracking, automatic retries, and both cloud-hosted and self-hosted execution — making it a good choice for teams that want orchestration without heavy infrastructure upfront.

Strengths:

  • Very easy to get started — especially for Python users.
  • Supports dynamic, runtime-generated DAGs.
  • Good for rapid prototyping and research workflows.

⚠️ Limitations:

  • Less opinionated — great for flexibility, but can lead to inconsistent structures.
  • Some features (e.g., the managed UI and team collaboration tooling) require Prefect Cloud.
  • Still maturing in large-scale, enterprise use cases.

Here’s what a minimal Prefect pipeline might look like:

from prefect import flow

# clean_data, train_model, and evaluate_model are assumed to be @task-decorated functions
@flow
def my_pipeline():
    data = clean_data()
    model = train_model(data)
    evaluate_model(model)

if __name__ == "__main__":
    my_pipeline()

This reads almost like regular Python code — but under the hood, Prefect tracks execution, retries failures, and lets you monitor the flow in its UI.
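Retries, for instance, are configured directly on the task decorator. A minimal sketch, where the training body is a placeholder:

from prefect import task

@task(retries=3, retry_delay_seconds=60)
def train_model(data):
    # If training raises an exception, Prefect retries up to 3 times,
    # waiting 60 seconds between attempts.
    return fit_model(data)  # fit_model is a hypothetical training helper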

Combining CI/CD and Data Orchestration: Who Does What?

As you’ve seen, CI/CD systems and data orchestration frameworks serve different purposes — but they often need to coexist in real-world machine learning projects. So how do they fit together?

Let’s start with the core idea: CI/CD is built for code, while orchestration frameworks are built for data.

Two Tools, Two Responsibilities

In a typical setup:

  • CI/CD tools like GitHub Actions or GitLab CI handle:

    • Running unit tests, linters, and integration checks,
    • Packaging and deploying software,
    • Building Docker images or Python packages,
    • Releasing APIs, dashboards, or CLI tools to users or servers.
  • Data orchestration tools like Airflow, Dagster, or Prefect handle:

    • Ingesting and transforming incoming data,
    • Retraining models when new data becomes available,
    • Running batch inference jobs,
    • Producing versioned data and model artifacts.

Orchestration frameworks are better at long-running, data-triggered, and branching workflows. CI/CD tools shine when you’re dealing with event-driven code changes and structured release pipelines.

A Tale of Two Origins

In practice, teams often start with just one tool — based on where they come from:

  • A software-first team might begin with CI/CD tools and add orchestration later when workflows become too complex or data-driven.
  • A data-first team might build everything in Airflow or Prefect, and only later realize they need CI/CD to handle packaging and deployment.

Eventually, larger or more mature teams end up with both — and that’s where integration becomes key.

Integration Patterns: When Tools Meet

There’s no single rule for how these tools should interact, but a few common patterns have emerged:

  • CI/CD triggers orchestration: When a new version of your code is merged, GitHub Actions could deploy updated DAGs or pipelines to your orchestrator (e.g., “Here’s the new training code, go retrain with it.”).

  • Orchestration triggers CI/CD: When a pipeline finishes training a model, it could trigger a deployment pipeline — for instance, by pushing a new model to S3 and pinging a CI/CD job to deploy it.

  • Decoupled via object storage: The orchestrator writes the model to a known location (e.g., S3). Separately, your production system or CI/CD pipeline watches that location and pulls in the newest model when it changes.

Each of these is valid — and which one you choose depends on your team’s structure, reliability needs, and deployment setup. Some teams want the orchestrator to "own" the model and push it. Others want production to stay in control and pull from a safe location. Some use CI/CD as the central router that handles promotion and deployment logic. There’s no wrong answer — just trade-offs.
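As a concrete illustration of the decoupled pattern, here is a minimal sketch of the "pull" side, assuming the orchestrator writes versioned model files under a known S3 prefix (the bucket and prefix names are hypothetical):

import boto3

BUCKET = "ml-artifacts"            # hypothetical bucket the orchestrator writes to
PREFIX = "models/sales-forecast/"  # hypothetical prefix holding versioned model files

def latest_model_key() -> str:
    """Return the key of the most recently written model artifact under PREFIX."""
    s3 = boto3.client("s3")
    response = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)
    objects = response.get("Contents", [])
    if not objects:
        raise RuntimeError(f"No model artifacts found under s3://{BUCKET}/{PREFIX}")
    # list_objects_v2 returns at most 1,000 keys per call, which is enough for a sketch.
    newest = max(objects, key=lambda obj: obj["LastModified"])
    return newest["Key"]

The production service, or a CI/CD job, can call something like this on a schedule and reload the model only when the key changes, which keeps deployment firmly in the consumer's hands.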


Wrapping It All Up: The MLOps Journey

Over the course of this journey, we’ve covered a lot of ground.

We started with notebooks — the familiar playground of every data scientist — and gradually worked our way toward robust, production-ready systems. Along the way, we explored how to refactor experiments into clean code, serve models through REST APIs, and deploy them reliably using Docker containers.

From there, we moved into container orchestration with tools like Docker Compose and Kubernetes, learned how to interact with cloud infrastructure and object storage, and introduced CI/CD pipelines to bring automation, safety, and repeatability to our software changes.

Finally, we stepped into the world of data orchestration frameworks, where tools like Airflow, Dagster, and Prefect help manage the messy, dynamic nature of real-world data workflows.

Each chapter of this course zoomed out a little further — from scripts to services, from services to infrastructure, from infrastructure to automation, and from automation to orchestration. Together, these topics paint a high-level picture of what it means to build and operate machine learning systems in production.

Of course, this isn’t the full story. Production ML systems in large companies might also include feature stores, data catalogs, real-time inference, edge deployment, data lakes, monitoring stacks, and more. But what we’ve built here is a solid foundation — one that gives you the vocabulary, intuition, and architectural patterns to start building your own MLOps systems with confidence.

Whether you're a solo practitioner, a startup team, or part of a larger engineering organization, these are the tools and principles you’ll encounter again and again.

Thanks for following along — and good luck building!