Edge AI Deployment: Optimizing and Scaling AI Models on Devices

Course Overview

As artificial intelligence moves from cloud platforms to physical environments — from factory floors to remote infrastructure, from consumer devices to autonomous vehicles — there is a growing need to deploy machine learning models at the edge.

This course prepares participants to bring AI to edge devices, where computational power is limited, connectivity is unreliable, and energy consumption must be minimized.

Through a series of structured chapters and hands-on components, you will learn how to optimize models, package them for deployment, and run them efficiently on real-world hardware.

Edge AI is a key enabler for applications such as predictive maintenance, smart sensing, visual inspection, robotics, and embedded analytics — especially in industrial and IoT contexts.

By following this course, you will:

  • Understand the constraints and opportunities of edge computing.
  • Get an overview of edge hardware and how to choose the right platform.
  • Learn model compression and optimization techniques for resource efficiency.
  • Explore model serialization formats such as ONNX and how to export models from frameworks like PyTorch.
  • Gain hands-on experience with ONNX Runtime, including backend selection, graph-level optimizations, and quantization techniques for efficient inference.
  • Learn how to deploy high-performance models using TensorRT, with practical guidance on engine building, precision tuning, and memory management.
  • Discover how to further optimize models using NVIDIA’s Model Optimizer, including quantization-aware training, structured pruning, and support for sparsity acceleration.

Whether you're a data scientist, ML engineer, or embedded systems developer, this course provides the tools and context you need to bridge the gap between machine learning models and operational edge deployment.

Table of Contents

  1. Introduction to Edge AI
  2. Overview of Edge Hardware and Accelerators
  3. Model Optimization Techniques
  4. Exporting Models with the ONNX Format
  5. Efficient Inference with ONNX Runtime
  6. High-Performance Deployment with TensorRT
  7. Advanced Techniques with Model Optimizer

1. Introduction to Edge AI

Edge AI refers to the deployment of machine learning inference on physical devices located close to the source of data — rather than relying on remote cloud servers. These devices, often called “edge devices,” operate outside traditional data centers and are typically embedded into systems such as machines, vehicles, equipment or sensors.

Unlike cloud infrastructure, edge devices function under significant resource constraints. They usually have limited compute power, memory, and storage. Network bandwidth may be constrained or intermittent, and in many cases — such as battery-powered devices — energy usage must also be minimized. These constraints have a direct impact on how machine learning systems are designed and deployed in edge environments.

In Edge AI systems, machine learning models are typically trained on external compute resources — such as cloud platforms or high-performance workstations — and then optimized and deployed to run inference on the device itself. These models must be lightweight and efficient, which often requires the use of techniques like quantization, pruning, or model distillation. Once deployed, the models run locally on the device, responding to input from cameras, microphones, accelerometers, or other sensors.

In some cases, edge devices may also perform on-device training or fine-tuning, for example to adapt to new conditions or user-specific data — though this remains relatively rare and is often limited by hardware, storage capacity and energy constraints.

Edge AI is commonly applied in domains where data is generated continuously by physical systems and needs to be processed nearby. Use cases include industrial automation (e.g. vision systems or predictive maintenance), automotive systems and robotics (e.g. real-time control and navigation), healthcare devices (e.g. diagnostics and wearables) or remote infrastructure (e.g. monitoring smart meters or pipelines). Each of these contexts presents different challenges in terms of performance, connectivity, and hardware design — but they all share the need to bring intelligent behavior closer to the edge of the network.

From Cloud to Edge: Why the Shift?

Most machine learning systems today are developed and deployed in the cloud. This makes sense: cloud platforms offer vast compute resources, virtually unlimited storage, and elastic scalability. They accelerate experimentation and reduce operational costs, making them ideal environments for training and serving models — especially during early development or when scaling applications.

In this traditional setup, models are trained and hosted on remote servers. When a prediction is needed, client devices — such as phones, browsers, or machines — send input data to the cloud, wait for a response, and receive the output. This architecture has clear advantages: it's easy to manage, it supports powerful models, and it's relatively quick to deploy and update. For many applications, this pattern remains the best option.

But as machine learning moves out of the data center and into physical environments — factory floors, vehicles, hospitals, wearables, and remote installations — new constraints emerge. These systems are often subject to strict requirements on latency, reliability, connectivity and privacy. In such settings, sending data to the cloud is no longer ideal. Sometimes it's technically infeasible. Sometimes it's legally restricted. Sometimes it's just too slow, too expensive, or too brittle.

This shift has created a growing need for edge AI: systems where inference happens on the device itself, without relying on continuous cloud access. In some domains, edge computing is simply the only viable option. Autonomous vehicles, for example, cannot afford the latency of a cloud round-trip when making driving decisions. Medical devices may need to function offline or protect sensitive patient data locally. Industrial machines on a production line may be located in facilities with limited or unstable connectivity — yet they still need to respond in real time.

Even when it's not strictly required, moving AI to the edge can offer strategic advantages. Systems that operate independently of the cloud become more resilient. Processing data locally can reduce bandwidth costs and limit privacy risks. And in products with long operational lifetimes — like embedded sensors or durable industrial equipment — local intelligence can ensure functionality without requiring constant server-side infrastructure.

Edge AI is not a rejection of the cloud. Rather, it’s a response to the practical and architectural demands of deploying AI in the physical world. In the rest of this chapter, we’ll explore the technical motivations for this shift — including latency, bandwidth, reliability, privacy, and cost — each of which helps explain why edge computing is gaining traction across a wide range of industries.

Latency: Real-Time Response in the Physical World

In many real-world systems, decisions must be made in milliseconds — not seconds. Latency isn’t just a matter of convenience; it can be a functional or safety requirement. Think of a robotic arm detecting a faulty part on a conveyor belt, a drone adjusting its trajectory mid-flight, or a vehicle detecting a pedestrian in its path. These systems require tight control loops that react instantly to incoming sensor data.

Cloud inference introduces an unavoidable delay. Even under ideal conditions, sending data to a remote server, processing it, and returning a result can easily take 50 to 100 milliseconds — and that’s assuming low-latency networks and no queuing delays. In practice, that’s often too slow for real-time control. Worse, the delay can vary unpredictably depending on network conditions, making it unsuitable for systems that demand deterministic response times.

By contrast, Edge AI allows inference to happen directly on the device, with latencies measured in single-digit milliseconds — or even microseconds on specialized hardware. This enables fast, reliable responses and makes it possible to embed intelligent behavior into time-critical systems.

If your product needs to perceive and act in the physical world — and especially if it needs to do so fast — cloud latency is often a deal-breaker. Edge inference provides the timing guarantees that such systems demand.

Bandwidth: Managing Data Volume Locally

Many edge applications generate a continuous stream of high-volume data — especially from sources like cameras, microphones, or high-frequency sensors. Sending all of this raw data to the cloud for inference isn’t just slow — it’s often impractical or too expensive. Network capacity becomes a bottleneck, especially in environments with limited connectivity or when relying on cellular or satellite links that incur per-byte costs.

Edge AI reduces this burden by performing inference directly on the device. Instead of transmitting entire video streams or sensor logs, the device can extract insights locally and send only the relevant results — such as anomaly alerts, classification labels, or aggregated summaries. This drastically reduces bandwidth usage and enables systems to operate efficiently even on constrained networks.

In more advanced setups, devices can selectively upload interesting samples — for example, edge cases, misclassifications, or rare events — to support monitoring or retraining in the cloud. This allows teams to maintain model quality over time without flooding the network, striking a balance between independence and central coordination.

In bandwidth-constrained environments, Edge AI isn’t just a performance optimization — it’s what makes the entire system feasible.

Reliability and Resilience: When the Network Fails

When deploying AI in the physical world, network reliability becomes a critical risk factor. Many systems operate in environments where connectivity is poor, intermittent, or simply unavailable — from factory floors and rural infrastructure to mobile platforms like vehicles, ships, and drones. If your AI system depends on a stable internet connection to function, it’s exposed to a single point of failure: when the network goes down, the intelligence disappears.

This is not acceptable in many domains. Industrial machinery may need to perform safety checks or quality inspection even if the internet connection is temporarily lost. Medical devices may be used in ambulances, remote clinics, or under field conditions where cloud access is unreliable or non-existent. Defense or aerospace systems must maintain autonomy in disconnected or hostile environments. In all these cases, failure due to network loss is not an option.

Resilience isn't just about handling extreme cases — it can also be a strategic differentiator. Products that continue to work predictably during outages build user trust and reduce support burden. Systems that degrade gracefully rather than failing outright are more attractive in competitive markets.

For some applications, it's also possible to combine both worlds using a hybrid fallback model: when connectivity is available, a system might use a more powerful cloud-based model, but when the connection is lost, it automatically falls back to a local edge model. This provides performance without sacrificing reliability.
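
As a rough illustration of this fallback pattern, the sketch below prefers a remote model when the network is reachable and drops back to an on-device model otherwise. It is a minimal sketch under assumed names: cloud_predict and local_predict are hypothetical stand-ins for a real remote endpoint and a real embedded inference call.

```python
import socket

def is_connected(host: str = "8.8.8.8", port: int = 53, timeout: float = 1.0) -> bool:
    """Cheap connectivity probe: try to open a TCP connection to a well-known host."""
    try:
        socket.create_connection((host, port), timeout=timeout).close()
        return True
    except OSError:
        return False

def cloud_predict(sample):
    # Hypothetical placeholder for a call to a more powerful cloud-hosted model.
    raise ConnectionError("no cloud endpoint in this sketch")

def local_predict(sample):
    # Hypothetical placeholder for on-device inference with a lightweight model.
    return {"label": "ok", "source": "edge"}

def predict(sample):
    """Use the cloud model when possible; fall back to the local model otherwise."""
    if is_connected():
        try:
            return cloud_predict(sample)
        except Exception:
            pass  # network dropped mid-request: fall through to the local model
    return local_predict(sample)

print(predict({"sensor": [0.1, 0.2, 0.3]}))
```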

Privacy, Security, and Compliance

One of the most important benefits of Edge AI is that data can stay on the device. This local processing model has major implications for privacy, security, and regulatory compliance, particularly in sensitive domains like healthcare, defense, finance, and consumer electronics.

When raw data never leaves the edge device, the risk of exposure is significantly reduced. There’s no need to transmit personal or sensitive information across the internet, store it in centralized databases, or manage access controls in the cloud. This dramatically shrinks the system’s attack surface and simplifies compliance with legal frameworks like the GDPR or HIPAA.

In some cases, legal or contractual requirements explicitly prohibit data from crossing geographic borders — or even leaving the physical premises. Edge AI naturally supports these constraints by enabling inference and decision-making at the point of capture.

This model supports privacy-by-design. A system can extract value from data without ever storing or transmitting it. For example, a smart camera might detect the number of people entering a store, classify the age range of visitors, or identify suspicious activity — but send only aggregate statistics or alerts, not the raw video. This avoids handling personally identifiable information entirely, which simplifies compliance and reduces ethical concerns.

While some industries are bound by hard legal constraints, in many others, strong privacy guarantees are a business advantage. Consumers and clients increasingly expect transparency and discretion. Systems that make it clear that their data stays local — and never leaves the device — are easier to trust, easier to deploy, and often more appealing to customers.

Cost Considerations: The Surprising Economics of Edge

Cloud-based AI is convenient — but it’s not always cheap. Inference in the cloud means paying for compute time, data storage, and network traffic, often on a per-request or per-byte basis. When serving predictions to thousands or millions of devices, these costs can scale unpredictably. What starts as a few cents per request can snowball into a major recurring expense — especially for real-time systems or high-frequency data sources.

Bandwidth costs are an important factor here. Many edge applications involve heavy data — images, audio, high-rate sensor streams — and streaming this raw input to the cloud for processing can be expensive. On mobile networks, this can drive up subscription costs, congest the connection, or degrade the user experience. Edge AI avoids this by sending only summaries or alerts when necessary, reducing data usage and saving money.

Edge systems also offer cost predictability. Instead of ongoing cloud charges, inference costs are tied to a one-time hardware investment — often on devices that are already deployed and underutilized. Even if new edge hardware is required, the cost is fixed upfront. This is particularly valuable in industries with strict budgeting, long procurement cycles, or product lifetimes measured in decades.

And the longer a device remains in the field, the better the economics get. An edge device that runs the same model for five years can amortize its hardware cost over millions of inferences — while a cloud-based solution would keep generating bills indefinitely. In this way, Edge AI is not just a technical architecture, but a long-term cost strategy that can unlock scalable intelligence without scaling your operating expenses.

Power Efficiency and Battery-Powered Intelligence

Running AI in the cloud requires continuous data transmission, which consumes a lot of energy on the device side. For edge devices, especially those running on batteries or solar power, this model can become unsustainable. Edge AI addresses this by eliminating the round trip: instead of streaming data to a server, devices process it locally, reducing communication overhead.

Power efficiency becomes a core design constraint, not just a nice-to-have. Modern edge chips — especially those built on ARM architectures or equipped with Neural Processing Units (NPUs) — can run inference tasks at a fraction of the energy cost compared to general-purpose CPUs or GPUs.

Lower power usage also means less heat, which is critical in enclosed or thermally constrained environments — such as wearables, automotive systems, or sealed industrial enclosures where active cooling isn’t feasible.

This unlocks entirely new applications. Battery-powered sensors, wearables, and remote monitoring systems can now include machine learning without sacrificing battery life. In some setups, tiny microcontrollers (with NPUs) can execute models while consuming just milliwatts — running for months or even years on a single battery.

Paired with low-power wireless protocols like LoRa, NB-IoT, or BLE, these systems can communicate only when needed, further reducing energy use. This not only lowers operational costs, but also supports sustainability goals and makes AI viable in places where reliable power is unavailable.

Edge AI doesn’t just shrink models — it shrinks their energy footprint too. And in many environments, that’s the difference between something that works in the lab, and something that works in the real world.

Strategic Deployment: Cloud, Edge, or Both?

Choosing where to run machine learning models — in the cloud, at the edge, or somewhere in between — is a strategic decision, not just a technical one. Each option comes with trade-offs in terms of performance, cost, flexibility, and control.

Running inference in the cloud offers many advantages: centralized management, access to powerful hardware, scalable infrastructure, and fast iteration cycles. But it also often means building on top of cloud ecosystems like AWS, Azure, or Google Cloud — each of which comes with proprietary tooling, complex pricing models, and the risk of vendor lock-in.

By contrast, Edge AI gives teams more autonomy. Models run on local hardware — hardware that they control — without being tied to a specific cloud provider. This allows for tighter integration with devices, greater cost predictability, and systems that can operate independently of internet access or external infrastructure. That independence can be a competitive advantage, especially in industrial, medical, or defense settings.

Still, this is rarely an all-or-nothing decision. In practice, many systems adopt a hybrid architecture: training models in the cloud, deploying them to the edge, and occasionally uploading selected data for monitoring or retraining. Some systems might perform inference in the cloud under normal conditions, but fall back to local models when bandwidth is constrained or connectivity is lost. Hybrid setups can offer the best of both worlds.

There’s even a middle ground between cloud and edge: on-premise servers. These are high-performance machines located in the same facility as the edge devices they serve — for example, in a factory or hospital. On-prem deployments offer cloud-like compute capacity without requiring internet access. Devices can connect over local networks, achieving low-latency, high-bandwidth communication while keeping all data within the facility. This setup is particularly valuable when privacy, performance, or uptime requirements rule out external services.

Finally, it’s important to acknowledge that Edge AI is not always the right tool. If your application requires large-scale foundation models, real-time collaboration across devices, or constant model updates, cloud inference may remain the better fit. Edge AI should be chosen deliberately, when the constraints or goals of the application justify it.


2. Overview of Edge Hardware and Accelerators

In the previous chapter, we explored what edge AI is and why many companies are choosing to run machine learning models locally. Once you’ve made the decision to adopt edge AI, the next question naturally follows: what kind of hardware will run my model?

This chapter provides a practical overview of the edge AI hardware landscape. We’ll explore the types of processors available — from tiny microcontrollers to powerful GPUs — and explain how to match them to your application needs. Whether you’re buying off-the-shelf boards or integrating compute into custom devices, understanding the available options is key to building efficient, reliable edge systems.

Industrial Priorities in Hardware Choice

When selecting hardware for consumer electronics, most people focus on performance, features, and price. But industrial edge AI deployments follow a very different logic. Here, reliability, regulatory compliance, and long-term viability often outweigh sheer compute power.

One of the most important — and often underestimated — requirements is long-term availability. Industrial systems have long design cycles and even longer deployment lifespans. Companies that spend years developing a product want the hardware to remain available for at least a decade, including spares for service and repairs. A streetlight controller, for instance, might be installed in 2027 but still need replacement parts in 2037. This makes disappearing parts a major liability — even small hardware changes can trigger redesigns or re-certification.

That brings us to regulatory certification. Industrial devices often operate in safety- or compliance-critical environments. Meeting standards like CE, FCC, or UL isn’t optional — it’s a legal requirement. Off-the-shelf compute modules are attractive here, because many come pre-certified, saving time and cost. In low volumes, this can be decisive. But at high volume, custom designs may be cheaper per unit — even if that means managing the certification process in-house.

Environmental durability is another essential factor. Edge AI systems are rarely used in climate-controlled offices. Instead, they might be installed outdoors, on moving vehicles, or on factory floors. Exposure to dust, vibration, temperature extremes, and humidity means consumer-grade hardware often won’t survive. Industrial systems demand ruggedized components, wide-temperature tolerances, and mechanical stability. Some are even designed for space applications, where radiation resilience becomes critical.

Even when a board looks perfect on paper, it may not be available at scale. Many developers underestimate sourcing risk — supply chains fluctuate, parts are discontinued, and lead times stretch without warning. Hardware that’s easy to buy in small batches might be hard to source in volume. Industrial buyers must consider vendor roadmaps, supply guarantees, and regional distribution networks before locking in a platform.

Industrial edge AI systems are built to last, operate in tough environments, and comply with strict rules. Choosing the right hardware means thinking beyond benchmarks — and weighing the full ecosystem that surrounds a chip or board.

Tooling, SDKs, and Software Ecosystems

When it comes to deploying AI at the edge, hardware alone is not enough. What separates a usable platform from a frustrating one often isn’t the chip — it’s the software stack around it. For industrial teams especially, the difference between a smooth deployment and a stalled project often comes down to the quality of SDKs, drivers, and documentation.

This is why toolchain maturity has become a first-class priority in hardware selection. Mature platforms like NVIDIA Jetson or Intel OpenVINO don’t just offer solid performance — they offer reliable software ecosystems. They come with build systems that work, drivers that compile, and support teams that answer questions. In contrast, cheaper alternatives may look great on paper but quickly reveal hidden costs when the SDK is poorly maintained or the documentation is incomplete.

From a business perspective, this has major implications. Engineering time is expensive, and debugging vendor-specific quirks can consume weeks. If your team spends a month trying to coax an NPU into accepting a model, that lost time can easily outweigh any savings on hardware. That’s why industrial developers increasingly view software support as part of total cost of ownership — not just how much the board costs, but how long it takes to make it useful.

Another key concern is software portability. High-level abstractions — like ONNX Runtime or TensorFlow Lite — allow developers to build once and deploy across devices. This saves time but introduces some overhead. In performance-critical systems, teams often fall back to lower-level vendor SDKs (like NVIDIA TensorRT or Rockchip RKNPU toolchains) to extract maximum efficiency. This tradeoff — speed versus portability — is central to edge AI engineering.

Fortunately, the edge ecosystem is maturing. Industry standards like ONNX aim to unify how models are represented and deployed across hardware. This growing standardization helps developers move between platforms — or future-proof their software against hardware changes.

Good hardware needs great software. Without robust toolchains and support, even powerful processors can become bottlenecks. For industrial edge AI, ecosystem maturity often matters as much as raw compute.

Understanding CPUs, GPUs, and NPUs

To run an AI model on edge hardware, you need compute — something that can execute the mathematical operations of your neural network. In practice, that means choosing between three main types of processors: CPUs, GPUs, and NPUs.

Let’s start with what you already know. A CPU (Central Processing Unit) is the most flexible compute type. It can run anything: operating systems, drivers, control logic, signal processing, and — yes — machine learning. But it’s not particularly fast at neural inference. CPUs are designed for general-purpose computing, with a few cores optimized for sequential logic and branching code. That makes them essential for handling the “glue logic” in most systems, but relatively inefficient at the kinds of matrix operations that dominate deep learning.

This is where GPUs (Graphics Processing Units) shine. Originally built to accelerate image rendering, GPUs are specialized for highly parallel computation. Instead of a few powerful cores, they have hundreds or thousands of simpler ones, which is exactly what you want for matrix multiplication, convolution, and activation functions. That’s why deep learning frameworks like PyTorch run so well on desktop and datacenter GPUs — and why embedded versions, like NVIDIA Jetson, use the same idea in a smaller package. GPUs are general-purpose enough to support a wide range of models, but fast enough to handle real-time inference for video and vision tasks.

But GPUs are still overkill for many edge deployments. Enter the NPU — Neural Processing Unit. These are application-specific accelerators: custom silicon built to do one thing well — run neural networks. Unlike GPUs, they aren’t designed to be general-purpose. They usually support a specific subset of layers, quantized to int8 or fp16, and are deeply optimized for energy efficiency. That makes them ideal for small, always-on tasks, especially in constrained environments like wearables, drones, or smart cameras.

Most NPUs are tightly integrated into a larger System-on-Chip (SoC) that also includes a CPU (and sometimes a GPU). The CPU sets things up, the NPU does the heavy lifting. You’ll find this pattern in platforms from NXP, STMicroelectronics, Texas Instruments, and many others. But not all NPUs are equal. Some only support outdated models or lack compatibility with modern network layers like attention or depthwise convolutions. Others, like Hailo or newer Qualcomm chips, support transformer blocks and even small language or vision-language models. The capabilities vary widely — and so does the tooling.

It’s worth stressing this point: performance specs can be misleading. A board might advertise “10 TOPS” (Tera Operations Per Second), but if your model uses layers that aren’t supported — or if memory bandwidth becomes a bottleneck — your real-world speed will suffer. This is why edge AI isn’t just about benchmarks. It’s about operator compatibility, toolchain maturity, and how well the hardware maps to your specific model.

Matching AI Workloads to Hardware

The type of model you want to run often determines the kind of hardware you need. Time series, audio, images, video or language all place very different demands on a system.

Start with the lightest workloads: time series models. These include things like temperature, vibration, flow rate, or accelerometer data — often sampled at low rates, from fractions of a Hz to a few kHz. Because of this, the models that process them — usually classifiers or anomaly detectors — are small and efficient. Many can run comfortably on microcontrollers (MCUs) with just a few hundred kilobytes of RAM. This is the domain of TinyML: gesture recognition, electrical analysis, machine failure prediction — all possible on sub-watt devices.

Next up is audio. Human speech and environmental sounds require higher sampling rates — typically 16 to 48 kHz — and demand more compute than simple time series. Classification tasks like keyword spotting, speaker identification, or glass break detection can still run on optimized MCUs or low-end NPUs. But continuous speech recognition (e.g. Whisper) requires more memory and processing power, usually pushing you into Linux-class processors with NPUs or embedded GPUs.

Image-based models raise the bar again. A 224×224 RGB image contains over 50,000 pixels — orders of magnitude more data than a few sensor readings. Tasks like image classification or object detection (e.g. with MobileNet or R-CNN variants) need significant matrix compute and often rely on quantized NPUs or embedded GPUs for real-time performance. You might run simple classification on an MCU with a camera — but anything more advanced, like semantic segmentation, requires hardware with deeper acceleration support and enough memory to hold intermediate feature maps.

Video workloads are even more demanding. Not only do they need to process high-resolution images, but they must do so continuously — 15, 30, or even 60 frames per second. This is where high-end NPUs, dedicated AI SoCs, or embedded GPUs become essential. Tasks like license plate recognition, pose estimation, or multi-object tracking are typically out of reach for low-end boards.

Finally, there’s a new frontier: transformer-based models, such as Whisper (audio), Vision Transformers (ViT) or LLaMA (language). These require large amounts of RAM, fast memory access, and support for operations like multi-head attention. While some modern NPUs and embedded GPUs are beginning to support them, this is still the cutting edge — and many platforms lack full compatibility. For these models you often need desktop-grade GPUs.

The structure and frequency of your input data — and the complexity of your model — should guide your hardware choices. Matching the workload to the hardware is essential for building efficient, scalable edge AI systems.

Microcontrollers for TinyML

Microcontrollers (MCUs) are the smallest class of compute platforms in edge AI — and also the most widespread. They’re used in everything from smart sensors and wearables to industrial automation and home appliances. Traditionally built for control tasks and simple I/O, MCUs now play a growing role in machine learning thanks to advances in TinyML — the practice of running ML models on devices with kilobytes or megabytes of memory.

Modern MCUs typically run at 10–200 MHz, with KBs to a few MB of RAM, and are designed for ultra-low power operation. That makes them ideal for always-on sensing in constrained environments: vibration analysis on motors, fall detection in elderly care devices, keyword spotting in voice interfaces, or anomaly detection in pressure and flow sensors.

One of the most widely used families in industrial and embedded systems is the STMicroelectronics STM32 series. These chips range from simple Cortex-M0 controllers to high-end STM32H7 variants with floating-point support and multiple megabytes of RAM. At the cutting edge is the STM32N6, which adds a dedicated neural processing unit (NPU) — enabling real-time computer vision tasks like YOLOv8 at up to 30 FPS, all within a power budget suitable for battery operation. This marks a major step forward: object detection on MCU-class hardware was unthinkable just a few years ago.

In many real-world deployments, these microcontrollers act as smart sensors, running inference locally and only transmitting data when something of interest happens — saving bandwidth, energy, and cloud costs.

Microcontrollers make AI possible in places where power is scarce, bandwidth is limited, and simplicity is essential. They won’t run LLMs — but they will quietly monitor the world, and speak up when it matters.

Embedded Linux SoCs (With or Without NPU)

Microcontrollers are efficient, but they have limits — especially when your models grow larger, your inputs become richer, or your application needs to run Linux. That’s where embedded Linux System-on-Chips (SoCs) come in. Often called microprocessors (MPUs), these platforms strike a balance between performance, power, and flexibility — making them a go-to for many edge AI applications.

Unlike MCUs, these chips can run full Linux distributions, offer hundreds of megabytes (or gigabytes) of RAM, and support advanced I/O like Ethernet, USB, and camera interfaces. That makes them well suited to tasks like robotics, computer vision, human–machine interfaces, audio processing, and industrial control systems. Think of factory robots navigating their workspace, or smart security cameras analyzing video in real time — all without needing a server or GPU.

Many Linux-capable SoCs now include dedicated neural accelerators (NPUs). These offload inference from the CPU, enabling smooth real-time AI even under tight power budgets. Popular examples include:

  • NXP i.MX 8M Plus: a well-documented SoC with a 2.3 TOPS NPU, widely used in industrial vision and IoT.
  • Texas Instruments TDA4VM: optimized for machine vision and edge analytics, with real-time cores and up to 8 TOPS of AI acceleration.
  • Rockchip RK3588: a high-performance SoC featuring an octa-core CPU (4x Cortex-A76 and 4x Cortex-A55), a Mali GPU, and a 6 TOPS NPU, supporting 8K video processing and advanced AI applications.
  • STM32MP25: bridging STMicroelectronics’ MCU heritage with MPU features, this SoC adds a 1.35 TOPS NPU for embedded vision tasks.

These platforms are easier to develop for than bare-metal MCUs — they support standard Linux toolchains and offer faster iteration for developers. But they do come with trade-offs: higher power draw, more complex hardware, and a steeper bill of materials.

In the edge AI landscape, embedded Linux SoCs fill a vital middle ground: more power and flexibility than MCUs, but still compact and efficient enough for embedded devices.

Single-Board Computers and Add-On Modules

When building an edge AI product, you don’t always want to design your own PCB from scratch. Many teams instead rely on single-board computers (SBCs) or compute modules, which offer pre-integrated platforms for fast development and deployment. These solutions package SoCs into ready-to-use boards — often with memory, storage, power regulation, and I/O already included.

A single-board computer is a fully functional computer on one PCB. It boots on its own, runs Linux, and typically includes HDMI, USB, Ethernet, and other interfaces. These are great for prototyping or small-scale deployments. Examples include the Radxa ROCK 5, which pairs the Rockchip RK3588 with 8K video and AI capabilities, and the well-known Raspberry Pi 5, which, while limited in AI performance, remains widely used due to its community support and general-purpose flexibility.

But when you need a more integrated solution — especially for embedded or industrial products — add-on modules (or computer-on-modules) are often the better choice. These are compact boards that plug into a carrier board using high-speed connectors (like SMARC or custom mezzanine interfaces). The compute module handles high-frequency design (CPU, memory, power), while the carrier board adds application-specific I/O: motor controllers, industrial buses, camera connectors, or custom interfaces.

For example, the Toradex Aquila family supports Texas Instruments and NXP i.MX SoCs, and offers long-term availability along with pre-certified Linux support. SolidRun’s system-on-module series features modular i.MX8-based compute modules for vision and control applications. Even the Raspberry Pi Compute Module 5 follows this pattern, enabling companies to embed a Pi into a custom product without having to design high-speed circuitry themselves.

This modular approach is especially attractive in industrial and medical markets, where certification, reliability, and lifecycle support are essential. Designing with add-on modules simplifies development and lets teams focus on their application logic — not DDR routing, power sequencing, or RF compliance.

The NVIDIA Jetson Series

At the high end of embedded AI computing sits NVIDIA’s Jetson platform — a family of compute modules that combine ARM CPUs, powerful NVIDIA GPUs, and dedicated AI acceleration. Jetsons are designed for demanding edge applications: autonomous robots, industrial inspection, multi-camera analytics, and real-time video understanding. If your use case involves large models, high frame rates, or complex decision-making on-device, Jetson is often the go-to.

The current generation — the Jetson Orin series — spans a range of power and performance options:

  • The Jetson Orin Nano delivers up to 67 TOPS at just 7–25W, ideal for compact AI workloads.
  • The Orin NX steps up to 157 TOPS at 10–40W, handling advanced tasks like multi-stream video analytics.
  • At the top, the Jetson AGX Orin offers up to 275 TOPS at 15–60W, enough to run full transformer models or sophisticated robotic planners — all on-device.

What makes Jetson stand out isn’t just the hardware — it’s the software ecosystem. Every module runs JetPack, NVIDIA’s development stack that includes CUDA, cuDNN, TensorRT, and DeepStream. This gives you access to the same tools used in datacenter AI — but optimized for embedded form factors. Jetson is fully supported by frameworks like PyTorch, and integrates easily with NVIDIA’s pretrained models.

Jetson modules are compute boards — meant to be embedded into a carrier board. Development kits come with a reference carrier, but many companies build their own or buy from vendors like Connect Tech or Forecr. This allows integration into robots, kiosks, drones, or other custom hardware. Jetson modules are also the base for many ruggedized industrial computers — including some sold by Siemens and other automation vendors.

Of course, this power comes at a price. Jetson boards cost hundreds of euros, and draw more power than MCU or NPU-class systems. Lifespan is another factor: Jetson modules are supported for up to 8 years, which is decent, but not long enough for some safety-critical or ultra-long-lifecycle products.

Still, for many teams, Jetson hits the sweet spot: a well-supported, high-performance platform that lets you bring cutting-edge AI to the edge — without reinventing your software stack.

External AI Accelerators (Hailo)

Not every edge device comes with an NPU built in — and not every application needs one all the time. That’s where external AI accelerators come in. These are compact, dedicated processors designed to offload neural inference from the main CPU or SoC. They connect over high-speed interfaces like PCIe, typically using the M.2 form factor, and act as plug-in AI engines for otherwise limited systems.

Historically, similar products like the Intel Neural Compute Stick or Google Coral USB Accelerator used USB interfaces and offered basic inference support. But those devices are now deprecated, and newer workloads demand more bandwidth and better integration.

Today’s accelerators can deliver serious performance, often rivaling embedded NPUs while offering more deployment flexibility. Among these, Hailo stands out as a leading industrial-grade option.

The Hailo-8 is a compact PCIe or M.2 module that delivers up to 26 TOPS of AI inference at just 2.5W. That’s enough for real-time object detection, multi-camera video analytics, or industrial quality control — all on compact, passively cooled hardware. Its little sibling, the Hailo-8L, targets lighter applications at up to 13 TOPS with even lower power consumption.

More recently, Hailo-15 brings together an ARM CPU with up to 20 TOPS of integrated AI acceleration. It’s designed as a smart vision processor — ideal for camera-based systems where you want AI and image processing on a single chip.

What makes Hailo particularly compelling isn’t just the hardware — it’s the software stack. The Hailo Dataflow Compiler supports ONNX and TensorFlow models and has a reputation for being far more usable than many embedded NPU SDKs. There’s active community momentum too: the Raspberry Pi AI Kit, for instance, brings Hailo-8L to the Raspberry Pi 5 — unlocking industrial-grade inference for hobbyist and prototyping use cases.

Hailo accelerators are also well integrated into the industrial ecosystem. Vendors like Toradex, SolidRun, Variscite, Kontron, and Advantech offer boards with Hailo modules built in or supported via expansion slots. Automotive-grade versions, extended temperature support, and long-term availability make them viable for serious deployments — not just experiments.

One caveat: the accelerator is only as fast as your system’s ability to feed it. If the host CPU is underpowered, or the PCIe bandwidth is constrained, you may run into performance ceilings. But when paired with a capable host, Hailo modules offer an elegant path to scalable, low-power AI at the edge.

FPGAs for Deterministic AI

Most AI acceleration happens on fixed-function hardware — CPUs, GPUs, NPUs — each optimized for specific types of computation. But there’s another class of devices that offers a different kind of advantage: Field-Programmable Gate Arrays (FPGAs). These are reconfigurable chips that allow developers to define their own logic circuits — effectively building custom hardware for each application.

In the context of AI, FPGAs are used when deterministic behavior, ultra-low latency, or precise control over data flow is required. This makes them attractive for real-time signal processing, sensor fusion, or tightly integrated vision pipelines in environments like industrial automation or high-speed robotics.

Two major vendors dominate this space: AMD (formerly Xilinx) and Intel (via Altera). Both offer SoC platforms that combine standard ARM processors with programmable logic:

  • AMD’s Kria K26 module integrates a Zynq UltraScale+ MPSoC, combining a quad-core Cortex-A53 with FPGA fabric. It’s designed for edge AI workloads and comes with a developer-friendly carrier board (KV260) aimed at vision applications.
  • Intel’s Agilex 5 FPGAs mix ARM cores with AI-tuned DSP blocks, targeting mid-range applications in automation and robotics. Their FPGA AI Suite supports standard ML frameworks like PyTorch, bridging the gap between software and hardware development.

FPGAs are not plug-and-play — they require more specialized knowledge and longer development cycles than other accelerators. But for applications where timing is critical and customizability is key, they offer capabilities that CPUs, GPUs, and NPUs simply can’t match.

Intel-compatible Edge Devices

Many embedded edge AI platforms today are built around ARM processors — from microcontrollers to SoCs with integrated NPUs. But ARM isn’t the only game in town. Many industrial and commercial systems still rely on x86 platforms, powered by Intel or AMD CPUs.

Intel-compatible edge devices come in a wide range of form factors — from mini PCs to industrial single-board computers (SBCs) and computer-on-modules. A good example is the Intel N100, a recent low-power CPU that offers solid performance at just 6W TDP. You’ll find N100-based boards from vendors like Radxa, ASRock Industrial, and AAEON, many with support for industrial temperature ranges, wide voltage input, and long-term availability. These platforms are often used in control panels and compact automation systems.

What makes x86 attractive isn’t just the hardware — it’s the software flexibility. You can run full Linux or Windows distributions, reuse existing drivers, and install well-known tools without needing special builds or patches. This matters in practice: it means your team can deploy with familiar stacks, update systems with standard package managers, and reduce integration overhead.

For AI tasks, Intel’s OpenVINO toolkit is a major selling point. It provides a mature, developer-friendly environment for optimizing and deploying models on Intel CPUs and integrated GPUs. OpenVINO supports popular formats like ONNX, PyTorch, and TensorFlow, and includes tools to quantize and convert models for faster inference. In many industrial use cases — like visual inspection, anomaly detection, or predictive maintenance — the combination of x86 hardware and OpenVINO is powerful enough to avoid the need for dedicated accelerators.
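
As a quick illustration, a minimal OpenVINO inference call might look like the sketch below. It assumes the current Python API (roughly openvino 2023 and later) and a hypothetical model.onnx file with a 1x3x224x224 input; check the documentation for your installed version.

```python
import numpy as np
import openvino as ov

core = ov.Core()
model = core.read_model("model.onnx")          # ONNX is read directly, no separate conversion step
compiled = core.compile_model(model, "CPU")    # "GPU" would target an Intel integrated GPU

# Hypothetical input shape for an image classifier.
frame = np.random.rand(1, 3, 224, 224).astype(np.float32)

result = compiled([frame])[compiled.output(0)] # run inference and pick the first output
print(result.shape)
```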

Compared to embedded ARM platforms, x86 boards tend to be less integrated — they’re not usually embedded into custom PCBs, and they often come as full computers rather than chips. But that’s also a strength. For many businesses, the ability to buy a rugged, standards-compliant, ready-to-use box with strong vendor support is a huge advantage.

These x86 platforms may not be as power-efficient as ARM or as specialized as NPUs, but they offer a uniquely flexible, well-supported, and widely available option for edge AI — especially when development speed, ecosystem maturity, or operating system compatibility are top priorities.

Industrial PCs (IPCs)

When edge AI needs more horsepower than embedded platforms can provide, many teams turn to Industrial PCs (IPCs). These systems combine the power and flexibility of standard PC hardware with ruggedized designs built for harsh industrial environments. Think of them as desktop-class machines — but built to run 24/7 on a factory floor, inside a vehicle, or in a remote enclosure.

IPCs typically feature x86 CPUs from Intel or AMD, offering strong single-threaded and multithreaded performance. They support full operating systems like Linux or Windows, and often ship in fanless enclosures with DIN rail mounting, wide temperature support, and fieldbus connectivity for industrial protocols like Modbus or Profinet. You’ll find them in production lines, inspection systems, and machine control panels — wherever a standard PC would fail due to dust, vibration, or extreme temperatures.

At the high end, IPCs can be equipped with dedicated GPUs, including NVIDIA cards. This unlocks serious AI potential: systems with NVIDIA RTX GPUs can run full transformer models or process high-resolution video in real time. Popular vendors include Advantech, AAEON, OnLogic and many others.

These systems vary widely in size and capability — from palm-sized fanless boxes to full rack-mounted units — but they all emphasize reliability, modularity, and longevity. Many vendors offer extended lifecycle support, and most IPCs are designed for tool-free servicing or modular I/O expansion, making them practical to maintain in the field.

Importantly, IPCs are still just PCs at heart. They run standard software, support mainstream development tools, and integrate well with enterprise IT systems. That means fast prototyping, easy updates, and no need to learn exotic toolchains. But that general-purpose nature also means IPCs are not ideal for hard real-time tasks — they lack the deterministic timing guarantees needed for tight control loops or safety-critical automation.

PLCs with AI Capabilities

Programmable Logic Controllers (PLCs) are the backbone of real-time industrial automation. Designed for deterministic control, they run highly reliable, time-critical tasks — like managing a conveyor belt, synchronizing robotic arms, or ensuring safety interlocks trigger within milliseconds.

Unlike general-purpose computers, PLCs use real-time operating systems and execute their logic in tight, predictable cycles — often programmed in domain-specific languages like Structured Text (ST).

Some vendors support hybrid architectures where a real-time kernel runs alongside a standard operating system like Linux or Windows on the same device. This allows developers to combine deterministic control loops with more flexible IT workloads — such as data logging, visualization, or even AI inference — all on a single platform.

What makes PLCs so appealing in industrial settings is their ecosystem integration. Vendors like Siemens and Beckhoff offer not just the controller, but a whole ecosystem of I/O modules, fieldbus support, diagnostic tools, and certified libraries. This tightly integrated approach simplifies deployment, ensures reliability, and helps meet regulatory standards. But it comes at the cost of vendor lock-in, with software, tooling, and extension modules often tied to a single manufacturer.

AI is now entering this world. Several pathways exist to deploy AI models on PLCs:

  • Some platforms, like Beckhoff TwinCAT, allow you to run lightweight AI models (in ONNX format) directly inside the control logic. These models can be triggered during the PLC cycle, offering fast inference with deterministic timing.
  • Another option is external AI inference: models run on companion edge devices connected via industrial Ethernet protocols like Profinet or EtherCAT. The PLC handles control, while the nearby module or industrial PC handles inference.
  • In more advanced setups, inference may be offloaded to a central AI server or even the cloud. Siemens Industrial Edge, for instance, supports this kind of architecture, with AI models managed and deployed from a central platform.

This spectrum — from real-time AI on the PLC, to AI on a local module, to cloud-based inference — gives system designers flexibility. The tradeoff is always the same: timing precision versus model complexity. Real-time integration enables closed-loop control but limits model size and framework support. Offloading to companion systems unlocks more powerful AI but introduces latency and integration effort.

AI is increasingly finding its way into PLC-based systems, but adoption is still catching up. PLC vendors have built frameworks that make it possible to integrate AI into industrial automation. Platforms like Siemens Industrial Edge or Beckhoff TwinCAT Machine Learning are ready to deploy models at the edge — offering tooling, lifecycle management, and integration with existing automation protocols.

What’s still evolving is how quickly factory floor integrators and system engineers adopt these tools. As AI use cases mature in industry — from predictive maintenance to visual inspection — PLC ecosystems are well-positioned to serve as the bridge between classical control and intelligent automation.

Wrapping Up the Hardware Landscape

While this chapter focused on edge hardware — from microcontrollers and NPUs to industrial PCs and PLCs — it's worth noting that AI workloads don't always stop at the edge. When real-time latency isn’t critical, or when model complexity exceeds what industrial systems can handle, it's common to offload processing to an on-premise GPU workstation or local AI server. These machines are connected over reliable Ethernet and offer datacenter-grade performance without sending data to the cloud.

For many edge AI applications, the real challenge isn’t just raw power — it’s finding the right hardware for the job. Power constraints, latency requirements, certification demands, and ecosystem maturity all shape what "right" means. A low-power microcontroller might be perfect for vibration analysis, while a rugged Jetson AGX system makes sense for autonomous inspection robots.

There is no universal answer. Edge AI spans a broad landscape, and choosing wisely means balancing technical needs with practical constraints. The best engineers don’t just chase TOPS or brand names — they understand the trade-offs, the standards, and the lifecycles that define real-world deployments. In the end, it’s not just about what the hardware can do — it’s about what you can do with it.


3. Model Optimization Techniques

Why Optimize Models for the Edge?

Modern machine learning models are rarely built with embedded systems in mind. Whether trained in PyTorch or TensorFlow, most architectures assume access to workstation-class GPUs, lots of RAM, and fast disk I/O. When these models are exported and pushed to the edge, however, they often collide with reality: tiny memory footprints, limited compute power, and strict latency constraints.

Critically, the memory demand of a model isn’t limited to its weight file. Every parameter — each learned weight or bias — typically uses a 32-bit floating point number, meaning 4 bytes per parameter. A 1-million-parameter model, stored in FP32, already takes 4 MB just for weights. But that’s only part of the story.

During inference, the model also computes intermediate activation maps — the outputs of each layer — and these are often much larger than the weights themselves. Especially in convolutional or vision models, activation maps can occupy 5–10× more memory than the weights. These must often be held in RAM simultaneously, depending on the model and inference engine. In practice, inference-time memory use can easily be an order of magnitude higher than model storage size.
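
A quick back-of-the-envelope calculation makes this concrete. The sketch below applies the 4-bytes-per-parameter rule and the rough 5–10× activation multiplier mentioned above; the exact activation footprint depends on the architecture and the inference engine.

```python
def weight_size_mb(num_params: int, bytes_per_param: int = 4) -> float:
    """Storage needed for the weights alone (FP32 uses 4 bytes per parameter)."""
    return num_params * bytes_per_param / 1e6

def rough_inference_ram_mb(num_params: int, activation_factor: float = 7.0) -> float:
    """Very rough peak RAM during inference: weights plus intermediate activations.
    The activation_factor is an illustrative midpoint of the 5-10x rule of thumb."""
    weights = weight_size_mb(num_params)
    return weights + activation_factor * weights

for name, params in [("1M-parameter model", 1_000_000),
                     ("ResNet-50 (~25M)", 25_000_000)]:
    print(f"{name}: weights ~{weight_size_mb(params):.0f} MB, "
          f"inference RAM ~{rough_inference_ram_mb(params):.0f} MB (rough)")
```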

This is why optimization matters. The models you train in the cloud are rarely the models you run in the field. Without adjustments — whether in architecture, data precision, or inference logic — even moderately sized networks can overwhelm edge devices. A standard ResNet might not fit in RAM. An unquantized transformer could blow past real-time inference budgets.

Optimization is not about premature micro-efficiency. It’s a practical response to the constraints of deployment. It ensures that the intelligence you’ve built survives the transition from training lab to physical product. The rest of this chapter explores how to make that transition work — through quantization, pruning, distillation, and smarter design choices.

How Big Are Modern Models, Really?

Let’s look at the actual size and resource needs of common deep learning models — and compare them to what edge devices can realistically handle.

  • ResNet-50, a popular image classification backbone, has around 25 million parameters and requires about 100 MB of storage for the weights in FP32. Running a single inference can consume hundreds of megabytes of RAM due to intermediate feature maps and activation buffers. This is no issue on a desktop GPU — but on a Jetson Orin Nano (with 4 GB RAM total), it already pushes the limits, especially when running alongside other processes or camera pipelines.

  • YOLOv5s, one of the smaller object detection models, still weighs in at 7.5 million parameters (≈30 MB in FP32), with an even larger footprint during inference. It fits comfortably on Jetson Orin or industrial x86 boards, but struggles on Raspberry Pi 5, where CPU-only inference often exceeds 500 ms per frame (depending on input resolution), making real-time use difficult without hardware acceleration.

  • MobileNetV2, designed for efficiency, drops to around 3.5 million parameters and typically consumes ~14 MB in storage. On a Radxa Rock 3, it can run at reasonable frame rates using INT8 quantization — which cuts inference time by more than half.

  • For microcontrollers like the STM32F4 series, you’re typically working with 128–512 KB of RAM, and perhaps 1 MB of flash — total. Models running on these devices (TinyML-class networks) need to be extremely compact — often just tens of kilobytes, using INT8 weights and minimal intermediate storage.

The bottom line: even relatively small models are too large for naive deployment. Without optimization they either won’t fit at all, or will miss performance targets by a wide margin. That’s why the techniques in the rest of this chapter aren’t just nice to have — they’re often the only way to make deployment feasible.

Who's Doing the Optimization? Frameworks and Toolchains

Before we dive into specific techniques like quantization or pruning, it's worth introducing the software frameworks that make these optimizations possible. When deploying models to edge devices, you're not just working with PyTorch or TensorFlow — you're stepping into a broader ecosystem of inference engines, compiler toolchains, and vendor SDKs that translate your model into something runnable, fast, and hardware-compatible.

These frameworks can be grouped into three categories:

1. Training Frameworks

This is where models are built and trained, usually in the cloud or on a developer workstation.

  • PyTorch is by far the most widely used framework today — flexible, Pythonic, and dominant in research and industry prototyping.
  • TensorFlow / Keras still has strong adoption in production and mobile environments, though it’s less favored for new research work.

While training frameworks aren’t optimized for deployment, they provide the starting point — and often include tools to help convert or quantize models for inference elsewhere.

2. General-Purpose Inference Runtimes

These run models on a wide range of devices, often with cross-platform support and hardware abstraction.

  • ONNX Runtime is a high-performance, open source inference engine designed to run models exported in the ONNX format. It supports CPU, GPU, and NPU backends, and is vendor-neutral. Many optimizations — including quantization, operator fusion, and graph rewrites — happen automatically.
  • TensorFlow Lite (TFLite) is a lightweight inference engine for deploying TensorFlow models to mobile and embedded devices (Linux and RTOS). It includes tools for quantization and supports many ARM-based platforms, including Cortex-M microcontrollers.
  • TorchScript is PyTorch’s built-in export format, used to freeze and run models outside of the Python runtime — for example on mobile apps or in C++ environments. It’s an easy way to deploy your model while staying within the PyTorch ecosystem, and exporting to TorchScript is often simpler than exporting to ONNX.
  • ExecuTorch is a new lightweight runtime from PyTorch, designed for deploying models on mobile and edge devices. It builds on PyTorch 2’s export tools and targets platforms like Android, iOS, and microcontrollers. While still new, it’s a promising option for future on-device inference — though it’s not yet as mature as the others.

3. Vendor-Specific Optimizers and SDKs

These toolchains are tightly coupled to specific hardware platforms and extract maximum performance through low-level compilation and tuning.

  • TensorRT (NVIDIA) compiles models — usually in ONNX format — into highly optimized execution engines for NVIDIA GPUs and Jetson modules. It supports FP16, INT8, layer fusion, calibration, and memory planning.
  • OpenVINO is Intel’s toolkit for optimizing and deploying models on Intel hardware, including CPUs, integrated GPUs, and NPUs. It offers post-training optimization, quantization, pruning-aware workflows (via NNCF), and a strong runtime.
  • TI EdgeAI SDK (Texas Instruments) targets their TDA series SoCs and supports model compilation, quantization, and hardware-aware optimization.
  • STM32Cube.AI (STMicroelectronics) lets you import trained models and compile them into efficient C code for STM32 microcontrollers, applying quantization and static memory allocation.
  • Hailo SDK (Hailo-8 accelerators) allows converting and compiling models for efficient execution on their external NPUs, with ONNX and TensorFlow support.
  • RKNPU Toolchain (Rockchip) provides support for Rockchip’s NPU-equipped SoCs, converting models from ONNX or TensorFlow into efficient execution graphs.
  • And many more...

In this chapter, we focus mostly on ONNX Runtime, TensorRT, and OpenVINO — because they’re widely used, industrially relevant, and often form the core of real-world edge deployments. PyTorch remains our starting point for model training and export. We’ll reference these frameworks frequently as we explore optimization techniques — and by the end of the chapter, you’ll understand not just what the optimizations do, but who applies them and how.

Numeric Precision and Model Data Types

Most machine learning models are trained using 32-bit floating point (FP32) values for weights and activations. This format is precise and stable, but it’s also large — each number takes 4 bytes of memory — and slow to compute on many edge devices.

Fortunately, during inference, we don’t always need that much precision. By switching to smaller numeric formats, we can dramatically reduce memory usage and improve speed — as long as the hardware and runtime support it.

The most common alternatives are FP16 and INT8. Half-precision floating point (FP16) cuts memory use in half and is supported natively by many GPUs. Integer formats like INT8 go even further — down to just 1 byte per value — and offer major performance gains on accelerators that support them. These include NPUs in Rockchip and NXP SoCs, Intel CPUs (via OpenVINO), and microcontrollers.

There are also niche formats like BF16 (Brain Float 16-bit, used in Google TPUs and some Intel hardware) and INT4 (supported on newer Qualcomm chips and high-end NVIDIA GPUs), which are gaining traction in specialized hardware for extreme compression. But for most edge applications, the real workhorses are FP16 and INT8.

PyTorch even supports automatic mixed precision (AMP) — a technique that selectively uses FP16 where safe, while falling back to FP32 for numerically sensitive operations. This is especially useful when running inference on NVIDIA GPUs, and it works transparently during both training and inference. It doesn’t reduce model size, but it can yield a significant speed boost with minimal effort.
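
As a minimal sketch (assuming a CUDA-capable device; the tiny model here is only a stand-in for your own network), AMP-style inference looks like this:

import torch
import torch.nn as nn

# Stand-in model and input; in practice, use your own trained network
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10)
).cuda().eval()
x = torch.rand(1, 3, 224, 224, device="cuda")

# autocast runs FP16 kernels where it is numerically safe, FP32 elsewhere
with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.float16):
    output = model(x)
print(output.shape)  # torch.Size([1, 10])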

The trade-off is always the same: smaller formats are faster and more efficient, but slightly less precise. Some models tolerate this easily — especially vision or audio classifiers — while others may lose accuracy unless quantization is done carefully.

Now that we’ve discussed some of these smaller data types, how do we actually convert our model to use them? That’s what quantization is all about. There are two main approaches: post-training quantization and quantization-aware training. Let’s look at each.

Post-Training Quantization (PTQ)

Quantization is one of the most effective ways to make machine learning models smaller and faster — especially on edge hardware. The idea is simple: instead of storing weights and activations as 32-bit floats, we convert them to lower-precision types, usually 8-bit integers (INT8). This reduces model size by up to 4× and often speeds up inference significantly, especially on devices with native INT8 support.

However, this conversion comes at a cost: because integer types have fewer possible values than floating point numbers, quantization introduces small rounding errors called quantization noise. Every float-to-int conversion loses a bit of precision — and when many layers accumulate this noise, it can slightly shift the model’s output. For most tasks, this is fine: classification confidence might change by 0.5%, or bounding boxes may shift by a pixel. But in sensitive models or under high compression (e.g. INT4), this noise can lead to noticeable drops in accuracy.

The simplest way to apply quantization is after training — without retraining or modifying the model architecture. This is known as post-training quantization (PTQ). It works by analyzing the trained model and computing scaling factors that map floating-point values into integer ranges. The result is a model that behaves roughly the same, but uses much smaller numbers.

There are two main variants:

  • Dynamic quantization only quantizes the weights ahead of time. Activations are still computed in floating point and only quantized on-the-fly during inference. This is quick and easy to apply, and works well for models like LSTMs or transformers where activations vary a lot.

  • Static quantization goes further: it quantizes both weights and activations, but requires a calibration dataset — a small sample of typical inputs — to estimate activation ranges. This provides better performance, especially on vision models and convolutional networks, but takes more setup.

In most cases, the conversion works by computing a scale factor and (optional) zero-point for weights and activations. These tell the inference engine how to convert between float and integer representations. The model architecture remains the same, but the math is carried out in lower precision.

float_value = scale × (int_value - zero_point)

When the zero-point is omitted, it’s usually because the framework or hardware uses symmetric quantization, which simplifies computation. TensorRT, for example, only supports symmetric quantization for both weights and activations.

Quantization parameters can be calculated either per-tensor or per-channel. Per-tensor quantization uses a single scale and zero-point for the entire layer, which is efficient but less accurate when channels vary a lot. Per-channel quantization assigns separate parameters to each channel (typically along the output dimension of a convolutional layer), improving accuracy at the cost of slightly more storage. In practice, weights are often quantized per-channel, while activations use per-tensor quantization for simplicity.
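
To make the mapping concrete, here is a small NumPy sketch of asymmetric per-tensor quantization using the formula above. The weight values and the unsigned 8-bit range are illustrative; real toolkits add refinements such as percentile-based range clipping.

import numpy as np

# Hypothetical FP32 weights of a single layer
w = np.array([-1.2, -0.4, 0.0, 0.7, 2.5], dtype=np.float32)

# Per-tensor asymmetric quantization to unsigned INT8 (0..255)
qmin, qmax = 0, 255
scale = (w.max() - w.min()) / (qmax - qmin)
zero_point = int(round(qmin - w.min() / scale))

w_int = np.clip(np.round(w / scale) + zero_point, qmin, qmax).astype(np.uint8)
w_dequant = scale * (w_int.astype(np.float32) - zero_point)

print(w_int)      # integer codes stored in the quantized model
print(w_dequant)  # close to the original values, with quantization noise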

Most major frameworks support PTQ directly:

  • In PyTorch, you can apply dynamic or static quantization using torch.quantization.
  • ONNX Runtime includes a quantization toolkit that supports both modes, with export paths from PyTorch or TensorFlow.
  • TensorRT supports INT8 models via calibration steps, often requiring representative input data to compute accurate ranges.
  • OpenVINO offers post-training optimization through the Neural Network Compression Framework (NNCF).

Post-training quantization doesn’t come for free: some accuracy loss is normal, especially in models with sensitive numerical behavior. But for many edge applications — such as image classification, object detection, or audio commands — the drop is minimal, and the gains in size and speed are well worth it.
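
As a minimal sketch of the PyTorch route, dynamic post-training quantization takes only a few lines. The stand-in model and the choice to quantize only Linear layers are illustrative:

import torch
import torch.nn as nn

# Stand-in model; in practice this is your trained network
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10)).eval()

# Convert the weights of all Linear layers to INT8;
# activations stay in FP32 and are quantized on-the-fly at inference time
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.rand(1, 128)
print(quantized(x).shape)  # torch.Size([1, 10])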

Quantization-Aware Training (QAT)

While post-training quantization is fast and convenient, it can lead to a drop in accuracy — especially for models that are sensitive to small changes in numerical precision. That’s where Quantization-Aware Training (QAT) comes in.

QAT simulates quantization during training. Instead of training with high-precision values and converting afterward, QAT inserts “fake quantization” operations into the model. These simulate the effect of using INT8 weights and activations, while still storing gradients and parameters in floating point. This allows the model to adapt to quantization noise during training, which typically leads to higher accuracy in the final quantized model.

The training process looks almost the same — but under the hood, every layer is already behaving like it’s quantized. By the time the model is exported, it’s fully prepared to run at low precision, with fewer surprises in accuracy or behavior.

In PyTorch, QAT is supported via the torch.quantization workflow — using prepare_qat() before training and convert() afterward. In OpenVINO, QAT is integrated into the NNCF framework, which can instrument a PyTorch model to simulate quantization and export it directly for deployment. TensorRT and ONNX Runtime can both run QAT-trained models, provided the final model is properly exported with quantization annotations.
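
A minimal sketch of that workflow, assuming a simple model built from standard layers and an existing training loop (omitted here):

import torch
import torch.nn as nn

# Stand-in model; real QAT setups usually also fuse Conv+BN+ReLU first.
# QuantStub/DeQuantStub mark the FP32-to-INT8 boundaries for eager-mode quantization.
model = nn.Sequential(
    torch.quantization.QuantStub(),
    nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.Flatten(), nn.Linear(8 * 28 * 28, 10),
    torch.quantization.DeQuantStub(),
)

# Attach a QAT configuration and insert fake-quantization observers
model.qconfig = torch.quantization.get_default_qat_qconfig("fbgemm")
model_prepared = torch.quantization.prepare_qat(model.train())

# ... run your usual training loop on model_prepared here ...

# After training, convert to a genuinely quantized INT8 model
model_int8 = torch.quantization.convert(model_prepared.eval())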

QAT adds some complexity and training time, but it’s often the best way to get high-performance, low-precision models without compromising accuracy. It’s particularly useful when post-training quantization leads to unacceptable errors — which can happen in models with small layers, outlier-sensitive activations, or tight accuracy requirements.

Pruning and Sparsity

Pruning is the process of removing parts of a neural network that contribute little to its output, with the goal of reducing model size and accelerating inference. While many deep networks are over-parameterized — especially those trained in high-resource environments — only a fraction of their weights may be truly essential. Pruning exploits this redundancy by simplifying the network after training, often with minimal loss in accuracy.

There are two main strategies:

  • Unstructured pruning removes individual weights that are close to zero. This creates a sparse weight matrix, where many values are zero but the original tensor shapes remain unchanged.
  • Structured pruning removes entire filters, channels, or blocks. This changes the architecture itself — reducing the number of operations per layer and potentially shrinking intermediate activations.

Unstructured pruning removes individual weights — typically those close to zero — without altering the overall shape of the model. This can lead to significant reductions in the number of nonzero parameters, which helps when compressing the model for storage (e.g. using sparse encodings or compression algorithms). However, once the model is loaded into memory, most runtimes convert sparse tensors back into dense format, and standard hardware still performs the zero multiplications. Unless the inference engine explicitly supports sparse execution, there’s no runtime speedup — only a smaller file on disk.

Structured pruning removes entire units of computation — such as channels, filters, or even whole layers. This directly reduces both compute cost and memory usage during inference, since the pruned components are fully eliminated from the graph. In convolutional networks, this often means pruning entire filters or feature maps. Structured pruning produces models that are not just smaller, but actually faster to run. However, this often comes with a larger drop in accuracy, since the model’s structure is changed more aggressively than in unstructured pruning.

One special case is fine-grained structured sparsity, such as 2:4 sparsity, which is supported by some modern inference engines and hardware. In this format, every group of 4 weights must contain exactly 2 zeros — allowing for accelerated execution on compatible devices, like recent NVIDIA GPUs. This pattern strikes a balance between unstructured flexibility and structured regularity, enabling both compression and real inference acceleration, but only when the underlying kernels are optimized for it.

In PyTorch, pruning is available through torch.nn.utils.prune, which supports both unstructured and structured strategies. TensorRT can accelerate structured sparsity when models follow the 2:4 pattern and are compiled appropriately. OpenVINO, via the NNCF toolkit, offers structured pruning-aware training workflows and can export optimized models that benefit from sparsity-aware execution across Intel hardware platforms.
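
As a minimal sketch of that API (the layer, pruning amounts, and norm choices here are arbitrary):

import torch
import torch.nn as nn
from torch.nn.utils import prune

layer = nn.Linear(64, 32)

# Unstructured: zero out the 50% of weights with the smallest magnitude
prune.l1_unstructured(layer, name="weight", amount=0.5)

# Structured: remove 25% of the output neurons (rows), ranked by L2 norm
prune.ln_structured(layer, name="weight", amount=0.25, n=2, dim=0)

# Make the pruning permanent: drop the mask and keep the zeroed weights
prune.remove(layer, "weight")

print(float((layer.weight == 0).float().mean()))  # fraction of zeroed weights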

Choosing What to Prune

Pruning decisions are typically guided by heuristics that estimate the importance of weights or structures within the network. Common strategies include:

  • Magnitude-Based Pruning: Weights with the smallest absolute values are pruned under the assumption that they contribute less to the model's output.
  • Gradient-Based Pruning: Weights are evaluated based on the gradients of the loss function with respect to each weight. Weights with smaller gradients are considered less important.
  • Activation-Based Pruning: Neurons or filters that are rarely activated (i.e., produce outputs close to zero) are pruned, as they have minimal impact on the model's predictions.

These pruning strategies are often used in an iterative loop rather than applied all at once. A typical workflow is: prune a small portion of weights or channels, evaluate the impact, fine-tune the model to recover any lost accuracy, and then repeat. This trial-and-error cycle helps avoid over-pruning and allows the model to gradually adapt.

Graph-Level Optimizations and Operator Fusion

Modern inference engines don’t just run your model as-is — they transform and optimize it first. These transformations happen at the level of the computation graph: the directed acyclic graph of operations (nodes) and data (edges) that defines your model. The goal is to reduce memory use, eliminate inefficiencies, and improve execution speed — all without changing the model’s outputs.

One of the most important techniques is operator fusion. This involves combining multiple consecutive operations into a single, more efficient kernel. A common example is fusing convolution + batch normalization + ReLU into one operation. Instead of executing three separate steps — each with its own memory access — the fused version computes everything in a single pass. This reduces both memory traffic and kernel launch overhead, which are major bottlenecks on CPUs and embedded GPUs.

Another common optimization is constant folding. If parts of the computation graph depend only on fixed parameters — like multiplying a tensor by a static scale — these operations can be evaluated ahead of time, so the model runs fewer steps during inference. This shrinks the graph and reduces compute load.

Other techniques include:

  • Dead code elimination: Removing operations or branches that are never used — often leftovers from training or unused model heads.
  • Transpose folding: Merging reshapes or layout conversions (e.g. NHWC ↔ NCHW) with adjacent operations to reduce unnecessary memory movement.
  • Memory reuse: Reallocating intermediate buffers smartly to reduce peak RAM usage, especially important on devices with limited memory.
  • Layout optimization: Reordering tensors in memory to match the preferred format of the hardware, improving memory coalescing and minimizing cache misses.
  • Static scheduling: Precomputing the execution order and memory plan of all operations at compile time — avoiding the overhead of dynamic graph interpretation and reducing kernel launch overhead.

These optimizations are typically applied automatically by deployment frameworks:

  • ONNX Runtime applies a wide range of graph-level optimizations when you load a model, especially if you enable its optimization flags (see the sketch just after this list).
  • TensorRT compiles the graph into a highly optimized execution engine, aggressively fusing operations and reordering computation where allowed.
  • OpenVINO also performs extensive graph rewriting, including fusion patterns tailored to Intel hardware.
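
For ONNX Runtime specifically, a minimal sketch of enabling those flags looks like this (the file names are placeholders, and writing out the optimized graph is optional but handy for inspection in Netron):

import onnxruntime as ort

sess_options = ort.SessionOptions()
# Enable all graph-level optimizations (fusion, constant folding, layout changes)
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
# Optionally save the optimized graph so you can inspect what was fused
sess_options.optimized_model_filepath = "model_optimized.onnx"

session = ort.InferenceSession("model.onnx", sess_options,
                               providers=["CPUExecutionProvider"])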

When you export a model from PyTorch — using ONNX or TorchScript — you’re handing off the structure of the model to these runtimes. They then apply these optimizations based on the graph’s shape and the hardware target. As a result, two models with identical architectures might run very differently, depending on how well they were exported and optimized.

You, as a developer, typically don’t need to apply graph optimizations manually. But it’s important to know they exist — especially when comparing runtimes or wondering why a model runs faster in one tool than another. In many cases, it’s not the model that changed — it’s the graph execution plan that got smarter.

Knowledge Distillation

Knowledge distillation is a technique for training small, efficient models (called students) by having them learn from the outputs of a larger, more accurate model (the teacher). Rather than training the student solely on the original labeled dataset, we also train it to mimic the behavior of the teacher — capturing not just what the right answer is, but how confident the teacher is across all possible classes or outputs.

This process typically improves the generalization ability of the student, leading to higher accuracy than if it were trained on ground truth labels alone — especially when the student model is much smaller or when the training dataset is limited.

In practice, the student model is trained using a combined loss:

  • One part compares the student's predictions to the true labels (as usual).
  • The other part compares the student's outputs to the soft predictions of the teacher.

This lets the student absorb knowledge encoded in the teacher’s decision boundaries — such as which classes are often confused or how to behave on ambiguous inputs.

The main trade-off is that you need access to a trained teacher model, and the training setup becomes slightly more complex. But when you’re targeting deployment on constrained edge hardware, distillation is often one of the most effective ways to maintain accuracy in a compact model.

Knowledge distillation is widely used in edge AI to shrink models without losing performance. Examples include:

  • MobileNet or EfficientNet distilled from large ResNet or Vision Transformer models.
  • YOLO-Tiny variants trained with supervision from full YOLOv5/YOLOv8 models.
  • DistilBERT, a smaller transformer (66 million parameters) trained to match BERT’s outputs, often used on mobile or constrained servers.

In PyTorch, implementing distillation is straightforward. During training, you compute two losses: one between the student and the teacher's soft targets, and one between the student and the ground truth. The total loss is usually a weighted sum of the two.
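
A minimal sketch of such a combined loss is shown below; the temperature and weighting factor are typical but arbitrary choices:

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft-target term: match the teacher's softened output distribution
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-target term: ordinary cross-entropy against the true labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Example with random tensors standing in for a real batch
student_logits = torch.randn(8, 10)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
print(distillation_loss(student_logits, teacher_logits, labels))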

Distillation is also highly compatible with quantization and pruning. You can train a small student network using knowledge distillation, then apply quantization-aware training to obtain an even smaller INT8-compatible model. Subsequently, you can prune the student to further reduce compute and memory usage.

Alternative Techniques for Knowledge Distillation

The method we just described — where a student learns from the soft outputs of a teacher — is the most common form of knowledge distillation. But there are several alternative techniques used in industry, each with slightly different strengths.

  • Feature-Based Distillation: Instead of only mimicking final outputs, the student also matches the intermediate feature maps of the teacher. This is especially effective in vision models, where students can learn spatial patterns from deeper networks like ResNet.
  • Multi-Teacher Distillation: A student learns from multiple teacher models, each trained differently. This is useful when building a compact model that must generalize across tasks — for example, combining an object detector and a segmentation model into one shared student.
  • Self-Distillation: A model learns from its own internal knowledge, typically by training its shallow layers to mimic the outputs of deeper layers. This approach requires no external teacher and can help improve generalization, even in fully supervised setups.
  • Self-Supervised Distillation: Methods like DINO or BYOL train a model without labels by using a teacher that is an exponential moving average of the student. The model learns to produce consistent outputs for different augmented views of the same input. This is useful for pretraining models when labeled data is scarce.

These techniques are less common than output-based distillation, but they’re useful to know — especially when dealing with specialized constraints or data limitations.

Generating and Using Synthetic Data

Synthetic data refers to artificially generated data used to train or improve machine learning models. Instead of collecting real-world measurements or images, synthetic data is created through simulation, procedural generation, or machine learning models themselves. While the data isn’t “real,” it’s designed to closely mimic real-world conditions — allowing developers to expand datasets, balance class distributions, or model rare scenarios that are difficult to capture in practice.

Synthetic data isn’t strictly a model optimization technique — it doesn’t make your model smaller or faster. But it can improve performance in powerful and practical ways, especially in industrial edge AI deployments where real-world data is scarce, noisy, or difficult to label. In many cases, synthetic data becomes a way to compensate for accuracy loss introduced by aggressive model compression or to boost generalization in models that are otherwise constrained by hardware or dataset size.

Traditionally, synthetic data is created using 3D rendering engines, simulation environments, or algorithmic transformations of existing datasets. For example:

  • In industrial vision: tools like Unity Perception or Blender can generate realistic scenes with objects under different lighting, rotation, or occlusion.
  • In robotics: simulated environments like NVIDIA Isaac Sim allow safe, repeatable generation of sensor data — including camera feeds, depth maps, and IMU signals.
  • In audio and speech: background noise, reverberation, and synthetic voice samples can be used to train robust classifiers or detectors.

Even basic data augmentation techniques — such as flipping, cropping, or blending images — can be seen as a lightweight form of synthetic data generation. These methods remain extremely effective and are widely used.

More recently, synthetic data has taken a leap forward with the rise of foundation models — including large generative models for text, images, and even video. These models can now be used to generate labeled datasets, simulate rare edge cases, or even fill gaps in underrepresented data regions. For example:

  • A large vision-language model can generate images on demand — “red fish on blue conveyor belt” — with enough variability to simulate real production settings.
  • A foundation model could auto-label an unlabeled dataset of camera images or sensor logs, enabling rapid dataset expansion without manual annotation.

These synthetic labels may not be perfect, but when used in combination with real data or as pretraining material, they can significantly improve model robustness — especially for small, quantized, or pruned models that have limited capacity.

Looking forward, this synergy between powerful, general-purpose models (running in the cloud) and compact, specialized models (running at the edge) is likely to grow. In industrial contexts, the future of edge AI may rely not just on optimized architectures, but on training small models using synthetic or auto-labeled data — derived from systems much larger than the ones being deployed.

Lightweight Vision Models for the Edge

Most state-of-the-art deep learning models are designed for high accuracy on benchmark datasets using powerful GPUs. But in edge environments, where compute and memory are limited, these models quickly become impractical. Instead of pushing a desktop-grade architecture onto a constrained device, it often makes more sense to start with a lightweight model — one designed from the ground up for efficiency.

This section introduces the most commonly used lightweight architectures for image classification, object detection, and semantic segmentation on edge devices. These models are not just smaller versions of their full-size counterparts — they are built on fundamentally different design principles that aim to reduce computation, memory usage, and model size without compromising too much on accuracy.

Design Principles of Efficient Vision Models

Many lightweight models rely on architectural tricks that drastically reduce the number of operations required per inference:

  • Depthwise-separable convolutions: Instead of a full 3D convolution across input channels, the operation is split into a cheap depthwise convolution (per channel) followed by a pointwise (1×1) convolution. This is the core idea behind MobileNet, reducing computation by roughly 8–9× for 3×3 kernels (see the sketch after this list).
  • Inverted residuals and linear bottlenecks: Introduced in MobileNetV2, this technique allows the model to expand and contract feature dimensions efficiently, maintaining expressiveness without bloating parameter count.
  • Squeeze-and-excitation blocks: Introduced in SENet and used in models like MobileNetV3 and EfficientNet, these blocks adaptively scale channel responses using a lightweight attention mechanism, improving accuracy with a small overhead.
  • Neural architecture search (NAS): Used in EfficientNet, NAS automates the design of efficient models by searching for layer configurations that offer the best accuracy-to-compute tradeoff.
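
As a sketch of the first idea in the list above, a depthwise-separable block in PyTorch might look like this (the channel counts are arbitrary):

import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    # Depthwise 3x3 conv (one filter per input channel) followed by a 1x1 pointwise conv
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

block = DepthwiseSeparableConv(32, 64)
x = torch.rand(1, 32, 56, 56)
print(block(x).shape)  # torch.Size([1, 64, 56, 56])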

Key Model Families

Here’s an overview of the most widely used lightweight models for computer vision at the edge — not a complete list, but a practical shortlist of what works today in real deployments:

  • MobileNetV1/V2/V3: MobileNet remains the most popular family of efficient CNNs for image classification. MobileNetV2 and V3 are especially well-optimized for edge inference and are supported by almost every toolchain. MobileNetV2 (3.4M parameters) and MobileNetV3-Small (2.5M) are very popular.
  • EfficientNet-Lite: A scaled-down version of EfficientNet designed specifically for mobile and embedded platforms. EfficientNet-Lite0 to Lite4 offer strong accuracy-to-size ratios, with built-in support for quantization and good compatibility with ONNX and TensorRT. Slightly heavier than MobileNet, but more accurate.
  • YOLO-Tiny (v3–v8): A stripped-down version of the popular YOLO object detector. Variants like YOLOv3-Tiny and YOLOv5n offer good performance with real-time inference on CPUs and embedded GPUs.
  • MobileNet-SSD: Combines a MobileNet backbone with a single-shot detection (SSD) head for real-time object detection. Extremely lightweight, and supported by OpenVINO out of the box.
  • DeepLabv3+ with MobileNet: For segmentation tasks (e.g. lane marking, object masks), DeepLabv3+ with MobileNet as backbone offers a great trade-off between accuracy and efficiency. It’s widely used in robotics applications.

In practice, most successful edge deployments start with one of these models — and then apply further optimization (quantization, pruning, distillation) depending on the hardware and performance needs.

Lightweight Models for Time Series and Audio

Not all edge AI workloads involve cameras. In many industrial and embedded applications, the input comes from time series sensors or audio streams — such as temperature, vibration, accelerometers, microphones, or electric current. These signals vary widely in sampling rate: from a few hertz in sensor logs to tens of kilohertz in audio. But fundamentally, they are all sequences over time — and the same architectural ideas often apply across this spectrum.

Unlike vision models, which operate on 2D spatial data, time series and audio models are typically designed to extract temporal patterns. These can be simple spikes or periodic signals, or complex sequences like spoken commands or mechanical failures unfolding over time.

There’s no one-size-fits-all solution, but a few model families stand out as practical and mature for edge deployment:

  • 1D Convolutional Networks (CNN-1D): These apply convolution along the time axis to extract local temporal features. They work well for low-frequency signals like vibration, flow, or current. Many TinyML models for anomaly detection or gesture recognition use 1D CNNs.
  • Small RNNs: Recurrent networks like LSTM and GRU are still used when capturing long-term time dependencies is important and latency constraints allow for it. However, they are often harder to train and optimize than convolutional models, and less efficient on modern edge hardware.
  • Temporal Convolutional Networks (TCNs): TCNs replace recurrence with dilated 1D convolutions, giving them long receptive fields while remaining easy to parallelize. They’re a solid choice for predictive maintenance, sensor forecasting, or event classification, and often outperform RNNs in real-time tasks.
  • Spectrogram-based CNNs: For audio, it’s common to convert the waveform into a Mel spectrogram or MFCCs and then apply a 2D CNN — often a MobileNet or custom small CNN. This transforms the problem into a visual one and makes use of mature tooling. Widely used for keyword spotting, speaker ID, or fault detection.
  • CRNNs (Convolutional Recurrent Networks): Popular in audio tasks, these combine 2D CNNs (e.g. on spectrograms) with recurrent layers (LSTM or GRU) to model time-dependent structure. Useful for sound event detection, voice activity, or wake word detection. Heavier than plain CNNs, but more accurate for long or variable-length audio.

In short, 1D CNNs, TCNs and spectrogram-based CNNs cover most practical needs for edge inference on time-based signals. Whether it's a stream of sensor readings or high-frequency audio, these architectures help make sense of time — without wasting time, memory, or power.
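
As a minimal sketch of the 1D CNN approach described above, here is a tiny classifier for fixed-length sensor windows; the channel counts, window length, and number of classes are placeholders:

import torch
import torch.nn as nn

# Tiny 1D CNN for single-channel windows of 256 samples, 4 output classes
model = nn.Sequential(
    nn.Conv1d(1, 16, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool1d(2),
    nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),  # global average pooling over the time axis
    nn.Flatten(),
    nn.Linear(32, 4),
)

x = torch.rand(8, 1, 256)  # batch of 8 sensor windows
print(model(x).shape)      # torch.Size([8, 4])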

Neural Architecture Search (NAS)

Neural Architecture Search (NAS) is a technique for automatically discovering neural network architectures — rather than designing them by hand. In theory, it promises the best of both worlds: architectures that are as compact and fast as possible, without sacrificing accuracy. The process involves defining a search space of possible building blocks — layers, connections, kernel sizes, etc. — and then using algorithms (like reinforcement learning, evolutionary strategies, or gradient-based search) to explore that space and find the optimal configuration.

In practice, NAS is a resource-intensive process. Training and evaluating each candidate model is expensive, especially when searching for architectures that work well on constrained hardware. Some platforms try to simulate performance during the search or use proxy metrics, but even so, NAS typically involves training hundreds or thousands of models. It’s not something you run on a laptop — it’s something that companies like Google, Facebook, or Microsoft do at scale using large compute clusters.

Fortunately, the results of NAS are publicly available and widely used. Architectures like MobileNetV3 and EfficientNet were discovered (or refined) using NAS techniques. They’ve already been benchmarked across hardware targets and scaled into multiple sizes — from tiny edge-compatible variants to large models for cloud inference. For example, MobileNetV3 comes in “Small” and “Large” variants, and EfficientNet spans eight sizes (B0 to B7), with the EfficientNet-Lite variants tuned specifically for mobile and embedded inference.

The truth is, very few teams in industry perform architecture search themselves, and that’s perfectly fine. It’s expensive, hard to validate, and rarely necessary — especially when most deployment challenges can be addressed through careful model selection. Starting with an off-the-shelf model and applying optimization techniques will usually get you 99% of the way there. The last 1% — what NAS might offer — often isn’t worth the engineering cost (unless you’re Google).

What Do These Frameworks Really Support?

By now, we’ve explored a wide range of optimization techniques — from quantization and pruning to distillation and architecture design. But the real question for deployment is: which of these techniques are actually supported by the tools you’re using? After all, training a sparse, quantized model doesn’t help if your runtime can’t run it efficiently.

This section gives you a high-level guide to the capabilities of the most important deployment frameworks in edge AI: ONNX Runtime, TensorRT, and OpenVINO. These are the tools that turn a trained model — often exported from PyTorch — into something that can run efficiently on your device.

ONNX Runtime

ONNX Runtime is Microsoft's general-purpose inference engine for models exported in the ONNX format. It’s widely used across platforms — from cloud to edge — and supports a broad range of model types and hardware backends (CPU, GPU, NPU). Its strength lies in portability and extensibility, not bleeding-edge acceleration.

Supported techniques:

  • Post-training quantization (INT8) — both dynamic and static quantization are supported via the ONNX quantization toolkit.
  • Operator fusion and graph optimization — automatic fusions applied at runtime, including Conv+BN+ReLU and others.
  • Reduced precision (FP16, BF16) — supported on compatible hardware (e.g., GPUs, Intel CPUs).

ONNX Runtime is a strong choice when targeting cross-platform deployments.

TensorRT

TensorRT is NVIDIA’s inference engine, tightly optimized for Jetson devices and NVIDIA GPUs. It’s not a general-purpose framework — it’s a compiler that turns models into highly optimized execution engines for a specific GPU. When used correctly, it delivers some of the fastest inference available.

Supported techniques:

  • INT8 and FP16 quantization — highly optimized, with support for calibration and QAT exports.
  • Graph-level optimization and aggressive fusion — TensorRT builds a static inference graph with fused kernels, custom memory layout, and scheduling.
  • Sparsity-aware execution (2:4 structured sparsity) — supported on recent GPUs, but limited to specific patterns.

TensorRT excels when maximum performance on NVIDIA hardware is the goal.

OpenVINO

OpenVINO is Intel’s optimization toolkit for running inference on CPUs and integrated GPUs. Unlike TensorRT, it supports a wider range of hardware, and unlike ONNX Runtime, it includes advanced post-training optimization tools.

Supported techniques:

  • Post-training quantization (INT8, BF16) — supported via the Neural Network Compression Framework (NNCF).
  • Quantization-aware training — integrated with PyTorch training loops through the NNCF framework.
  • Operator fusion and constant folding — automatically applied during model conversion.
  • Structured pruning — NNCF supports pruning-aware training; results are compatible with OpenVINO runtime.

OpenVINO is especially strong when deploying to Intel-based edge devices, or when you want to apply multiple optimizations in a single workflow.

Together, these tools form the practical backbone of edge AI deployment. Whether you're exporting a quantized MobileNet, pruning a time series model, or compiling a vision network for GPU acceleration, the techniques we’ve covered are not just theoretical — they’re supported in real toolchains today. As we move into the next chapters, we’ll apply these ideas hands-on to turn trained models into efficient, deployable systems.


4. Exporting Models with the ONNX Format

Why Export Models? From Training to Inference

When you train a machine learning model in PyTorch or TensorFlow, you're working inside a large, flexible training framework — one that's designed for experimentation, gradient computation, and dynamic model definition. These frameworks are perfect for research and development, but they're not intended for production inference, especially on embedded or edge devices.

There are a few key reasons for this:

  • Training frameworks are heavy. PyTorch and TensorFlow ship with everything needed for backpropagation, autograd, debugging, visualization, and sometimes even GPU compilation. These features add significant memory, binary size, and dependency overhead — most of which is unnecessary at inference time.

  • They rely on Python. Most training workflows assume a Python runtime. But Python isn’t always available or suitable in production, particularly on embedded systems, mobile devices, or industrial controllers. Many edge platforms favor compiled languages or runtime environments with tighter control over memory and startup time.

  • Inference needs different performance characteristics. During deployment, the goal is fast, efficient prediction — with minimal latency and memory use. Training frameworks often prioritize flexibility over speed, and they don’t apply the kinds of static graph optimizations, memory planning, or quantization strategies that inference engines specialize in.

To bridge this gap, we export models from training frameworks into lightweight, portable formats. These formats are designed for inference: they represent only the forward computation graph and are optimized for compactness and speed. Once exported, models can be loaded by inference runtimes — like ONNX Runtime, TensorRT, or OpenVINO — which take over the job of executing the model efficiently on the target device.

There are several export formats in use today:

  • TorchScript is PyTorch’s internal format, used for saving models as scripted or traced computation graphs. It’s ideal when staying inside the PyTorch ecosystem — for example when deploying a model in a C++ mobile app — but it’s not portable to other runtimes.

  • TensorFlow Lite (TFLite) is a compact format for TensorFlow models, especially optimized for mobile devices and microcontrollers. It supports quantization and pruning, but only works within the TensorFlow stack.

  • ONNX (Open Neural Network Exchange) is a cross-framework format that’s supported by both PyTorch and TensorFlow (and many others), and can be executed by a wide range of inference engines. ONNX separates model training from model execution, making it a popular choice for deployment in production.

In this chapter and the next, we’ll focus on ONNX and ONNX Runtime because they are open, widely supported, and fit naturally into the PyTorch-to-deployment workflow that many edge AI projects rely on today.

What Is ONNX? A Cross-Framework Model Format

ONNX — short for Open Neural Network Exchange — is an open standard for representing machine learning models. It was created by Microsoft and Facebook in 2017 with a simple goal: make trained models portable across frameworks and platforms.

At its core, ONNX is a serialization format. It defines how to store a model as a file — typically with a .onnx extension — including all the information needed to run inference:

  • The computation graph, representing the layers or operations that process the input data.
  • The parameters, such as weights and biases, stored as binary tensors.
  • Input and output definitions, including shapes and data types.
  • Optional metadata, including version info, author tags, and custom annotations.

Each operation in ONNX is defined by an operator (like Conv, Relu, Add, Gemm). These operators come from standardized opsets (operator sets) — versioned collections of supported operations. The ONNX community maintains a growing list of operators, and different frameworks or runtimes may support different opset versions. This is why specifying the opset version during export is important.
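
If you are unsure which opset an exported file uses, you can read it directly from the model's opset_import field (this assumes an already exported model.onnx):

import onnx

model = onnx.load("model.onnx")
for opset in model.opset_import:
    # An empty domain refers to the standard ONNX operator set
    print(opset.domain or "ai.onnx", opset.version)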

While ONNX is best known for supporting deep learning models (CNNs, RNNs, transformers), it also supports traditional machine learning models, such as decision trees, linear regression, and SVMs, via tools like skl2onnx, which convert scikit-learn models to ONNX.

ONNX does not run models by itself. It is just the file format. To execute a model, you need an inference engine like ONNX Runtime, TensorRT, or OpenVINO, which parse the ONNX file and convert it into executable kernels optimized for the target hardware.

Note that the naming of ONNX and ONNX Runtime can be a bit confusing. ONNX Runtime is just one of several engines that can run ONNX models — the two are separate.

Protocol Buffers (protobuf)

ONNX files are stored in Protocol Buffers (protobuf) format — a compact, binary representation that is both efficient and cross-platform. You don’t need to write protobuf yourself. Most tools generate and read ONNX files automatically.

ONNX files are not compressed by default. While Protocol Buffers (protobuf) is a compact and efficient binary format, it focuses on structure and serialization speed — not compression. It avoids the overhead of text formats like JSON or XML, but it doesn’t apply any compression algorithms internally. The result is that ONNX files are often smaller than equivalent models in plain text, but still fully uncompressed on disk.

If you're storing many ONNX models — or distributing them over the network — it often makes sense to compress them using standard tools like gzip, zip, or xz. This can significantly reduce storage size, especially for large models with repeated weight patterns. Just be aware that ONNX inference engines expect uncompressed files, so you’ll need to decompress them before loading.
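
For example, a small sketch using Python's standard gzip module (the file names are illustrative):

import gzip
import shutil

import onnx

# Compress the exported model for storage or transfer
with open("model.onnx", "rb") as f_in, gzip.open("model.onnx.gz", "wb") as f_out:
    shutil.copyfileobj(f_in, f_out)

# Decompress again before handing the file to an inference engine
with gzip.open("model.onnx.gz", "rb") as f_in, open("model_restored.onnx", "wb") as f_out:
    shutil.copyfileobj(f_in, f_out)

model = onnx.load("model_restored.onnx")  # runtimes expect the uncompressed file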

Exporting PyTorch Models to ONNX

The most common starting point for ONNX export is a model trained in PyTorch. Fortunately, PyTorch includes native support for exporting models to the ONNX format using the torch.onnx.export() function. This section walks you through a typical export workflow — using real code and highlighting best practices.

Step 1: Define or load your model

You can use any trained model — for this example, we’ll use a basic convolutional network.

import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 8, kernel_size=3, stride=1, padding=1)
        self.relu = nn.ReLU()
        self.flatten = nn.Flatten()
        self.fc = nn.Linear(8 * 28 * 28, 10)  # MNIST-style 28x28 images

    def forward(self, x):
        x = self.relu(self.conv(x))
        x = self.flatten(x)
        return self.fc(x)

model = TinyCNN()
model.eval()  # must be in eval mode during export!

Step 2: Prepare dummy input

PyTorch needs a sample input to trace the computation graph.

dummy_input = torch.rand(1, 1, 28, 28)  # batch size 1, grayscale image

Step 3: Call torch.onnx.export

This function traces the model’s forward pass and writes the .onnx file.

torch.onnx.export(
    model,                          # the PyTorch model
    dummy_input,                    # dummy input tensor
    "tiny_cnn.onnx",                # output filename
    export_params=True,             # store trained weights
    opset_version=20,               # ONNX opset version to use
    do_constant_folding=True,       # fold constant expressions
    input_names=['input'],          # name your input tensor(s)
    output_names=['output'],        # name your output tensor(s)
    dynamic_axes={
        'input': {0: 'batch_size'},
        'output': {0: 'batch_size'}
    }
)
  • This will create a file named tiny_cnn.onnx containing the model’s graph, weights, and input/output names.
  • opset_version: This controls which ONNX operator definitions are used. Always check which opset versions your target runtime supports; the exporter in your installed PyTorch version may also lag a version or two behind the newest ONNX opset.
  • dynamic_axes: This allows flexibility in batch size or sequence length. Without this, the exported model will be fixed to the exact input shape of the dummy input.
  • do_constant_folding: This simplifies the graph by precomputing constant expressions. Recommended for deployment.

Inspecting and Visualizing ONNX Models

Once you've exported your model to ONNX, it's useful to inspect the file — both for understanding and for debugging potential issues. Can you verify that everything you expect is really in there?

Text-Based Inspection with Python

ONNX provides a simple Python API that lets you load and print the contents of a model. This is particularly useful when you want to do quick, scriptable inspections, or integrate model validation into your pipeline.

import onnx

# Load the ONNX model from disk
model = onnx.load("model.onnx")

# Print a human-readable version of the graph
print(onnx.helper.printable_graph(model.graph))

# Print input and output of graph
print(model.graph.input)
print(model.graph.output)

The output shows the model as a directed graph of operations. You’ll see nodes with names like Conv, Relu, and Gemm, each taking named inputs and producing outputs. It also prints the model’s declared inputs and outputs, their shapes, and types.

Visual Model Inspection with Netron

For a more intuitive, interactive experience, use Netron — a graphical viewer for ONNX and many other model formats. Netron lets you open a .onnx file and explore its structure visually: you'll see the full computation graph with layers connected by arrows, along with shape and type annotations at each step.

To get started, go to netron.app or run:

import netron
netron.start("model.onnx")

This opens a browser window where you can click through the model layer by layer. Clicking on a node shows its input/output shapes, the operation type (e.g., Conv, Add, Reshape), and any attributes (like kernel size or padding).

Netron is also extremely helpful when comparing two versions of a model — for example, before and after applying quantization or graph optimizations. You’ll immediately see if some layers were fused, removed, or replaced by more efficient alternatives.

Whether you’re debugging a broken export, preparing for quantization, or just want to see what your training pipeline actually produced, learning to inspect ONNX files is an essential part of working with ONNX models.

Common Pitfalls When Exporting to ONNX

Exporting a PyTorch model to ONNX is not always a smooth, one-click process. In many cases, it just works — especially for standard models using common layers. But in real-world projects, it’s quite normal to run into problems during export or inference.

Let’s walk through some of the most common things that can go wrong, and how to think about fixing them.

Tracing vs. Scripting

The default export mechanism in PyTorch uses tracing — it runs your model once on a sample input and records the operations that were executed. This works well for many models, but it breaks if your model contains dynamic control flow, such as if statements or loops that depend on the input data. In these cases, the traced graph won’t reflect the full behavior of the model.

If you run into this, the fix is to use TorchScript scripting instead. Scripting parses the Python source code of your model and creates a static graph that captures the control logic.

scripted_model = torch.jit.script(model)
torch.onnx.export(scripted_model, dummy_input, "scripted.onnx", ...)

This approach often solves problems where the export fails or produces an incomplete model.

Structural Checks and Shape Inference

Once you have an ONNX file, it’s good practice to check whether it’s structurally valid:

import onnx
model = onnx.load("model.onnx")
onnx.checker.check_model(model)

This doesn’t guarantee that the model will run, but it helps catch low-level issues — like invalid graph edges or missing metadata — before you load it into a runtime.

If your model loads but tools like Netron can’t show the shapes of tensors, it’s likely missing shape information. You can try fixing this by applying shape inference:

from onnx import shape_inference
inferred_model = shape_inference.infer_shapes(model)
onnx.save(inferred_model, "inferred.onnx")

This can also help runtimes like ONNX Runtime or OpenVINO apply optimizations correctly.

When Export Fails: What Else Can Go Wrong?

Sometimes the export simply fails with an error message — or worse, it silently completes but produces a broken model. Here are a few typical reasons:

  • Unsupported operations: PyTorch layers like grid_sample, einsum, or certain types of indexing might not have ONNX equivalents. You may need to replace them.
  • In-place operations: Operations like x += 1 or x.relu_() can confuse tracing. Replace them with out-of-place versions like x = x + 1.
  • Device consistency: Ensure that both your model and dummy input are on the same device — either CPU or GPU — before exporting.
  • Dynamic shapes: If you don’t declare dynamic_axes, your exported model will only work with inputs of the exact shape used during export.
  • Model not in eval mode: Always call model.eval() before exporting to ensure consistent behavior from layers like BatchNorm and Dropout.

And of course, even if the export succeeds, that doesn’t mean your ONNX Runtime or TensorRT engine will be able to run it. Different runtimes support different opset versions and operations — so you may find that a valid ONNX file still can’t be deployed. This is part of the trial-and-error nature of real-world deployment.


5. Efficient Inference with ONNX Runtime

Once you've exported your model to ONNX, the next step is running it efficiently on your target system. That’s where ONNX Runtime comes in handy — an optimized, cross-platform inference engine developed by Microsoft for executing ONNX models. It's designed to take the model you trained, and run it as fast as possible on the hardware you care about.

What makes ONNX Runtime especially appealing is that it's not tied to any specific hardware vendor. It's built to be a general-purpose runtime — lightweight, production-ready, and highly extensible. You can use it to deploy models on your development laptop, a Linux-based ARM board, or even mobile platforms like Android. It's also widely used in server environments, where it powers inference at scale for cloud APIs and backend systems. Really, the only thing it doesn't support is microcontrollers.

One of the core strengths of ONNX Runtime is that it supports a wide range of execution providers — backends that map model operations to specific hardware or libraries. Out of the box, you can run models on CPU (with multi-threading), GPU (CUDA), and other backends like DirectML, TensorRT, or OpenVINO. ONNX Runtime handles the graph loading, operator dispatch, and backend selection, so you can focus on deploying your model — not rewriting it for each platform.

Finally, ONNX Runtime isn’t just a Python tool — it also provides bindings for C++, C#, JavaScript, Java (for Android), and more. This makes it suitable for integrating inference into embedded applications, mobile apps, and even web environments.

In short, ONNX Runtime is the practical, industry-ready bridge between your trained model and its real-world deployment. It abstracts away the complexity of low-level optimization and backend compatibility — so you can focus on building systems that run fast and run everywhere.

Running Inference with ONNX Runtime

Once you’ve exported your model to ONNX, running inference with ONNX Runtime is simple and efficient. In this section, we’ll walk through the entire process in Python — from loading the model to getting predictions.

Step 1: Load the ONNX Model

Create an inference session by loading the .onnx file. This step initializes the model and prepares the execution backend.

import onnxruntime as ort

session = ort.InferenceSession("model.onnx")

To explicitly request a GPU or specific backend:

session = ort.InferenceSession("model.onnx",
    providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])

ONNX Runtime will use the first available provider from the list.

Step 2: Inspect Model Inputs and Outputs

You can programmatically check what the model expects — including input names, shapes, and types.

for input in session.get_inputs():
    print("Input:", input.name, input.shape, input.type)

for output in session.get_outputs():
    print("Output:", output.name, output.shape, output.type)

Typical output:

Input: input ['batch_size', 1, 28, 28] tensor(float)
Output: output ['batch_size', 10] tensor(float)

This tells you how to structure your NumPy inputs.

Step 3: Prepare Input Data as NumPy Arrays

Make sure the data matches the expected shape and dtype. If your ONNX model expects float32, convert accordingly:

import numpy as np

image = np.random.rand(1, 1, 28, 28).astype(np.float32)  # shape and type match

For batch inference, stack inputs:

batch = np.stack([image1, image2])  # image1, image2 each shaped (1, 28, 28); result: (2, 1, 28, 28)

Step 4: Run Inference with session.run()

Run inference by passing a dictionary of input names to NumPy arrays:

inputs = { 'input': image }
outputs = session.run(None, inputs)

If your model has multiple inputs, provide all of them in the dictionary.

To get named outputs (optional):

outputs = session.run(['output'], inputs)

The result is always a list of NumPy arrays — one per output.

Using Dynamic Shapes (Batch Size)

If your ONNX model was exported with dynamic_axes, you can change the batch size at runtime. ONNX Runtime will infer the shape dynamically as long as the input axis is marked as variable.

If you see a shape mismatch error, double-check:

  • The input name matches exactly.
  • The dtype is correct (usually np.float32).
  • The shape respects any required dimensions (e.g., channels, height, width).

Example: Full Inference Pipeline

import onnxruntime as ort
import numpy as np

session = ort.InferenceSession("model.onnx")
input_name = session.get_inputs()[0].name

# Generate batch of 2 grayscale images
batch = np.random.rand(2, 1, 28, 28).astype(np.float32)

# Run inference
outputs = session.run(None, {input_name: batch})
print("Predictions:", outputs[0])

Comparing PyTorch and ONNX Model Outputs

Once you’ve exported your PyTorch model to ONNX and loaded it into ONNX Runtime, one of the most important things to do is validate correctness. We want to make sure that the model still behaves the same — or almost the same — after all transformations and optimizations.

This is especially important if you're planning to apply graph optimizations or quantization, or if you're integrating this export into a CI/CD pipeline. A small bug in your export code or a mismatch in model behavior can silently break downstream systems. To catch this early, we compare model outputs.

Why outputs will differ (a little)

The first thing you need to know is: don’t expect exact numerical equality.

Even without quantization, ONNX Runtime and PyTorch will produce slightly different floating-point values for the same input. This is due to:

  • Operator-level optimizations (like fused layers or reordered math).
  • Differences in how ONNX Runtime and PyTorch implement low-level operations.
  • Variations in CPU/GPU instruction paths or numerical libraries (e.g. oneDNN vs ATen).

As long as the differences are small, this is totally acceptable. But if the model outputs deviate significantly, it may mean something went wrong — for example, a mis-exported operator or a bad quantization step.

Example: How to compare model outputs

Let’s step through an example of how to run both models and compare their outputs:

import torch
import onnxruntime as ort
import numpy as np

# Define dummy input
input_tensor = torch.rand(1, 1, 28, 28)

# Run inference with PyTorch (assumes `model` is the trained model you exported)
model.eval()
with torch.no_grad():
    torch_out = model(input_tensor).cpu().numpy()

# Run inference with ONNX Runtime
ort_session = ort.InferenceSession("model.onnx")
onnx_out = ort_session.run(None, {'input': input_tensor.numpy()})[0]

# Compare outputs
if np.allclose(torch_out, onnx_out, rtol=1e-03, atol=1e-05):
    print("✅ Outputs are close enough.")
else:
    print("⚠️ Significant difference!")
    print("Max error:", np.abs(torch_out - onnx_out).max())

What this does:

  • It runs the same input through both the original PyTorch model and the exported ONNX version.
  • It uses np.allclose() to check that the outputs are numerically similar — within a small relative (rtol) and absolute (atol) tolerance.
  • If the values differ by more than expected, it prints a warning and the maximum absolute difference.

Verifying that your ONNX-exported model behaves like your PyTorch model is a simple but essential quality check. Use np.allclose() to validate similarity, inspect max error if needed, and remember: small differences are fine — large ones are not.

Execution Providers and Backend Selection

When you load an ONNX model into ONNX Runtime, the runtime doesn’t execute the model by itself — instead, it delegates the actual computation to something called an execution provider. Execution providers are backend engines that know how to run the operations in your model on specific hardware. By switching providers, ONNX Runtime can adapt the same model to different platforms and devices without modifying the model itself.

An execution provider is essentially a plugin that knows how to run a part or all of your ONNX graph using a certain library, runtime, or accelerator. Some providers are general-purpose, like the default CPU provider, which works everywhere. Others are highly specialized — like NVIDIA’s TensorRT or Intel’s OpenVINO — and can significantly speed up inference if your model and hardware are compatible.

Most models can run across multiple execution providers. ONNX Runtime will try to use the first provider that supports a given operation, and fall back to others for any unsupported parts. You can explicitly configure the order and selection of providers when creating an inference session.

Here’s an example:

import onnxruntime as ort

# Prefer GPU (CUDA), but fall back to CPU if needed
session = ort.InferenceSession("model.onnx", providers=[
    "CUDAExecutionProvider", "CPUExecutionProvider"
])

You can check which providers are available in your environment:

print(ort.get_available_providers())

This prints something like:

['CUDAExecutionProvider', 'CPUExecutionProvider']

Note that get_available_providers() only reports what is installed; the actual priority is determined by the order of the providers list you pass to InferenceSession. ONNX Runtime will try CUDA first, and if an operation isn’t supported there (e.g. a rare operator), it will silently fall back to CPU for just that part of the graph.

It’s important to understand that not all providers support every operation, data type, or optimization — so it’s common in practice to mix them, or to experiment to see which combination gives the best performance for your use case.
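
To confirm which providers a given session actually ended up registering (for example, ONNX Runtime may silently drop CUDA if the GPU libraries are missing), you can ask the session itself. A minimal check, reusing the model.onnx file from the examples above:

import onnxruntime as ort

session = ort.InferenceSession("model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"])

# Providers the session actually registered, in priority order
print("Active providers:", session.get_providers())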

Popular Execution Providers

Below is a list of the most important execution providers:

  • Default CPU: The default backend supports all ONNX operations and runs entirely on CPU, on any platform (Linux, Windows, macOS). This provider is reliable and always available, but will be slower than other options.

  • CUDA: Uses NVIDIA’s CUDA stack to run inference on supported GPUs. This provider is faster than CPU for large models, especially vision models. Requires a compatible GPU and the CUDA + cuDNN libraries installed.

  • TensorRT: Builds an optimized inference engine using NVIDIA TensorRT. This can be much faster than CUDA alone, thanks to aggressive kernel fusion, memory planning, and INT8 support. However, startup times will be longer than with the CUDA provider.

  • XNNPACK: A highly optimized backend for inference on ARM and x86 CPUs. Uses advanced techniques like vectorization and parallelization — including NEON SIMD instructions on ARM — to achieve fast, efficient execution.

  • OpenVINO: Targets Intel hardware — CPUs, integrated GPUs, and FPGAs. It applies graph optimizations, operator fusion, and quantization-aware acceleration.

  • DirectML: Uses Microsoft’s DirectML API to run inference on DirectX 12-compatible GPUs. This provider is Windows-only and designed for consumer-grade hardware like AMD or NVIDIA GPUs.

  • CoreML: Enables ONNX models to run via Apple’s Core ML framework. Available on macOS and iOS. This allows native integration of ONNX models into Apple apps, using the Apple Neural Engine where available.

Each provider comes with trade-offs. Some are easier to install, others provide more acceleration but require specific hardware or model structure. ONNX Runtime is designed to let you mix and match them — and it will always fall back to CPU for unsupported operations unless told otherwise.

Graph-Level Optimizations in ONNX Runtime

When you load an ONNX model into ONNX Runtime, it doesn’t run the graph exactly as it was exported. Instead, the runtime first applies a series of graph-level optimizations — transformations that simplify or fuse parts of the computation graph to make inference faster and more memory-efficient. These optimizations are automatic, happen during session initialization, and can significantly improve performance.

The idea is simple: many models contain redundant or inefficient patterns that emerge from training frameworks, like a Convolution followed by BatchNorm and ReLU, or a sequence of Reshape → Transpose → Reshape. ONNX Runtime recognizes these patterns and replaces them with fused operations that execute faster and use fewer memory operations.

Optimization Levels

ONNX Runtime lets you choose how aggressively it applies these transformations:

import onnxruntime as ort

sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
  • ORT_DISABLE_ALL: No graph optimizations (useful when model is already optimized).
  • ORT_ENABLE_BASIC: Applies safe and fast optimizations like constant folding and dead node removal.
  • ORT_ENABLE_EXTENDED: Adds pattern fusion (Conv+BN+ReLU, Reshape simplifications, etc.).
  • ORT_ENABLE_ALL: Includes layout optimizations and hardware-specific transforms.

By default, ONNX Runtime uses ORT_ENABLE_ALL, which is safe and recommended for most use cases.

Saving the Optimized Model

If you want to inspect what ONNX Runtime did to your model, you can save the optimized graph:

sess_options.optimized_model_filepath = "optimized_model.onnx"
session = ort.InferenceSession("model.onnx", sess_options=sess_options)

This writes the post-optimization model to disk. You can then open it in Netron or re-load it in Python to compare the structure to the original file. You’ll often notice that some layers have disappeared (due to folding or fusion) or been replaced by more efficient patterns.
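
For a quick comparison without opening Netron, you can also count the nodes in both graphs using the onnx package. A small sketch, assuming the file names used above:

import onnx

original = onnx.load("model.onnx")
optimized = onnx.load("optimized_model.onnx")

print("Original nodes: ", len(original.graph.node))
print("Optimized nodes:", len(optimized.graph.node))

A drop in the node count usually corresponds to folded constants and fused layer patterns.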

When to Disable Optimizations

In most cases, these optimizations are helpful and reliable. But in a few scenarios — such as debugging a broken model, benchmarking raw performance, or preparing for certain hardware-specific conversions (like TensorRT) — you might want to turn them off:

sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_DISABLE_ALL

You can also run the optimizer once offline (ahead of deployment), save the optimized .onnx, and then load that at runtime with optimizations disabled — this avoids repeating the optimization step on every device.
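
A sketch of what this offline-then-runtime split might look like, assuming the file names model.onnx and model_optimized.onnx:

import onnxruntime as ort

# Offline (run once, e.g. in your build pipeline): optimize and save
offline_opts = ort.SessionOptions()
offline_opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
offline_opts.optimized_model_filepath = "model_optimized.onnx"
ort.InferenceSession("model.onnx", sess_options=offline_opts)

# On the device: load the pre-optimized file and skip re-optimization
runtime_opts = ort.SessionOptions()
runtime_opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_DISABLE_ALL
session = ort.InferenceSession("model_optimized.onnx", sess_options=runtime_opts)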

Post-Training Quantization with ONNX Runtime

Quantization is one of the most effective ways to reduce the size and inference time of machine learning models. By replacing 32-bit floating-point values with lower-precision formats like 8-bit integers (INT8), we can shrink model size by up to 4x, reduce memory bandwidth, and speed up inference by 2–4x, depending on the hardware.

In ONNX Runtime, the easiest way to quantize a model is through post-training quantization (PTQ). This means you quantize a model after training is done, without needing to fine-tune or retrain anything. PTQ is non-invasive — you don’t change your training process — and it works on most PyTorch-exported ONNX models with minimal effort.

This approach contrasts with quantization-aware training (QAT), which simulates quantization during the training loop. While QAT can yield slightly better accuracy, it’s significantly more complex to set up and is not provided by ONNX Runtime’s quantization tooling. PyTorch does support QAT — but only through experimental APIs (torch.quantization), none of which are considered stable.

As of today, PTQ remains the most practical and reliable path to quantization when using ONNX Runtime. The accuracy drop is often minimal — typically just a few percentage points — which is perfectly acceptable in many real-world applications, especially when balanced against the gains in model size and inference speed.

There are two flavors of PTQ in ONNX Runtime:

  • Dynamic Quantization – weights are quantized ahead of time, but activations are quantized dynamically at runtime.
  • Static Quantization – both weights and activations are quantized ahead of time, using calibration data to determine value ranges.

Preprocessing Before Quantization

Before applying post-training quantization, it's recommended to preprocess your ONNX model. This step involves shape inference and graph optimization, which help the quantizer identify and convert more operators effectively. ONNX Runtime provides the quant_pre_process() function for this purpose:

from onnxruntime.quantization.shape_inference import quant_pre_process

quant_pre_process(
    input_model="model_fp32.onnx",
    output_model_path="model_processed.onnx"
)

This will save the preprocessed model as model_processed.onnx, ready for quantization.

Dynamic Quantization with ONNX Runtime

Dynamic quantization is the simplest and most accessible form of post-training quantization in ONNX Runtime. It reduces model size by converting weight tensors from 32-bit floating point (FP32) to 8-bit integers (INT8), while keeping activations in floating point during inference. These activations are quantized on the fly at runtime — hence the name “dynamic.”

The key advantage of dynamic quantization is that it requires no calibration dataset, no extra model preparation, and no changes to the training process. It works especially well for RNNs and transformer-based models. While it’s less effective on convolutional networks (where activations dominate the compute), it still provides size reductions and may yield speed-ups.

ONNX Runtime provides built-in support for dynamic quantization:

from onnxruntime.quantization import quantize_dynamic

input_model = "model_processed.onnx"
output_model = "model_dynamic.onnx"

# Apply dynamic quantization
quantize_dynamic(
    model_input=input_model,
    model_output=output_model,
    per_channel=True,
)

This code reads your original ONNX model (model_processed.onnx), replaces eligible weight tensors with INT8 quantized versions, and writes the quantized model to model_dynamic.onnx.

Parameters for quantize_dynamic():

  • per_channel: Enables per-channel quantization for weights. Useful in convolutional networks for better accuracy. Default is False.
  • op_types_to_quantize: Optional list to restrict quantization to certain operators (e.g. ['MatMul'], ['Conv']). By default, ONNX Runtime will quantize all supported operators.
  • weight_type=QuantType.QInt8: This determines the type of integer used for weights. Signed integers (QInt8) are common and compatible with most backends.
  • nodes_to_quantize / nodes_to_exclude: Let you include or skip specific nodes by name — useful for fine-grained control.

Once complete, you can inspect the quantized model in Netron. You’ll notice that some MatMul or Gemm layers have been replaced with quantized versions (like MatMulInteger or QLinearMatMul), and that weight tensors are now INT8. Activation nodes remain in FP32 — this is expected for dynamic quantization.

Dynamic quantization is a quick and effective way to shrink your model without complicating your workflow. Since only the weights are quantized and activations remain in floating point, it avoids many of the pitfalls of more aggressive quantization methods — and often retains nearly identical accuracy. For many edge AI scenarios, it’s a low-effort, high-reward optimization that makes sense as a first step before considering more advanced techniques.
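
To see the size reduction concretely, compare the two files on disk. A quick check, using the file names from the example above:

import os

fp32_mb = os.path.getsize("model_processed.onnx") / 1e6
int8_mb = os.path.getsize("model_dynamic.onnx") / 1e6
print(f"FP32 model: {fp32_mb:.2f} MB -> dynamically quantized: {int8_mb:.2f} MB")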

Static Quantization with ONNX Runtime

While dynamic quantization offers simplicity, static quantization can deliver even greater performance and memory savings — especially on models like CNNs where activations dominate computation. Static quantization works by converting both weights and activations to low-precision integers (typically INT8), using a calibration dataset to estimate the dynamic ranges of activations.

The main trade-off is setup complexity: you must provide representative input data to calibrate the activation ranges, and the model must be compatible with ONNX Runtime’s quantization tooling. But when applied correctly, static quantization gives better hardware utilization and often outperforms dynamic quantization.

ONNX Runtime provides built-in support for static quantization:

import numpy as np
from onnxruntime.quantization import quantize_static, CalibrationDataReader

class MyDataReader(CalibrationDataReader):
    def __init__(self):
        self.data = iter(  # Generate fake images
            {"input": np.random.rand(1, 1, 28, 28).astype(np.float32)}
            for _ in range(100)
        )
    def get_next(self):
        return next(self.data, None)

# Perform static quantization
quantize_static(
    model_input="model_processed.onnx",
    model_output="model_static.onnx",
    calibration_data_reader=MyDataReader(),
    per_channel=True,
)

This code reads your original ONNX model (model_processed.onnx), feeds in input samples via the calibration reader, and writes a statically quantized model to model_static.onnx.

Parameters for quantize_static():

  • calibration_data_reader: A class that yields representative input samples for calibration. This can be as simple as a loop over your dataset.
  • per_channel: Enables per-channel quantization for weights (recommended for CNNs).
  • calibrate_method: Choose between MinMax (default) or Entropy. MinMax is faster and often sufficient.
  • calibration_providers: Specifies which execution providers to use during the calibration phase. By default, this is ["CPUExecutionProvider"], but you can set it to use others (e.g., ["XnnpackExecutionProvider"]) if your hardware supports it.

Static quantization requires a bit more setup than dynamic quantization, but for many edge workloads, it gives better inference speed and lower runtime memory usage — particularly on devices with native INT8 support. With just a few lines of Python and a handful of representative inputs, you can turn a float-heavy model into a compact, deployment-ready INT8 graph.
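
As with the original ONNX export, it is worth sanity-checking the quantized model against its FP32 counterpart before deployment. A minimal sketch, assuming the input tensor is named input as in the earlier examples; expect noticeably larger differences than in the FP32-vs-FP32 comparison, since INT8 introduces quantization noise:

import numpy as np
import onnxruntime as ort

x = np.random.rand(1, 1, 28, 28).astype(np.float32)

fp32_out = ort.InferenceSession("model_processed.onnx").run(None, {"input": x})[0]
int8_out = ort.InferenceSession("model_static.onnx").run(None, {"input": x})[0]

# Quantization noise is expected; what matters is that predictions stay consistent
print("Max abs difference:", np.abs(fp32_out - int8_out).max())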

Float16 and Mixed Precision: Optimizing for GPUs

Throughout this chapter, we’ve focused heavily on INT8 quantization — a very performant technique for shrinking models and speeding up inference. But not all deployments run on microcontrollers or ARM boards. Sometimes, your model runs on a GPU-equipped device (like a Jetson or industrial PC), where floating-point computation is fast and well-supported. In those cases, converting your model to float16 (FP16) offers a great middle ground: better performance than float32, fewer risks than INT8, and often no noticeable drop in accuracy.

Unlike INT8, FP16 preserves the structure and behavior of your original model, just with smaller numbers. There’s no need for calibration or retraining — the conversion is straightforward and low-risk. FP16 models use half the memory of FP32, and run significantly faster on hardware that supports it — including NVIDIA/Intel GPUs, Apple silicon, and some NPUs (although most strongly prefer INT8).

INT8 quantization remains the most efficient choice on hardware with native INT8 support. But when you're working with GPU-based systems or other high-performance platforms that support FP16, converting to float16 is a practical and low-risk alternative. It won't always match INT8 in raw performance, but it often provides a strong balance between speed, simplicity, and accuracy — with minimal effort.

To apply float16 conversion in ONNX Runtime:

import onnx
from onnxconverter_common import float16

model = onnx.load("model_fp32.onnx")
model_fp16 = float16.convert_float_to_float16(model, keep_io_types=True)
onnx.save(model_fp16, "model_fp16.onnx")
  • keep_io_types: Whether model inputs/outputs should be left as float32.
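
Running the converted model is no different from before. A short sketch, assuming the same input name as in the earlier examples; because keep_io_types=True, you keep feeding float32 arrays, and only the internal computation uses FP16:

import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model_fp16.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"])

x = np.random.rand(1, 1, 28, 28).astype(np.float32)  # inputs stay float32
outputs = session.run(None, {"input": x})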

You may also encounter automatic mixed precision (AMP) — a technique that keeps numerically sensitive layers in FP32 while using FP16 elsewhere. Normally, AMP is enabled during training in PyTorch, and the resulting model contains a mix of data types. ONNX Runtime can execute these mixed-precision graphs directly, as long as they are properly exported from the training framework.

This concludes our deep dive into ONNX and ONNX Runtime. You’ve seen how to export models, run inference, apply graph-level optimizations, and tune model precision for different deployment targets. With these tools, you're well-equipped to bring your PyTorch models into real-world systems — and make them run faster, lighter, and smarter on the hardware that matters.


6. High-Performance Deployment with TensorRT

What Is TensorRT?

TensorRT is NVIDIA’s deep learning inference engine — a high-performance runtime that turns trained neural networks into compiled executables optimized for NVIDIA GPUs. While training happens in frameworks like PyTorch or TensorFlow, deployment is about running inference as fast and efficiently as possible — and that’s what TensorRT is built for.

At its core, TensorRT takes a trained model (usually exported to ONNX format), and converts it into a static inference engine — a hardware-specific binary that’s highly optimized for your exact GPU. This makes it ideal for deployment on everything from desktop RTX cards to server-grade H100s and even Jetson edge devices.

Unlike PyTorch or TensorFlow, TensorRT doesn’t interpret your model layer-by-layer at runtime. Instead, it performs a process called graph compilation, which analyzes the computation graph, applies a series of graph-level optimizations, and selects the fastest implementation for each layer — known as a tactic. This produces a highly optimized inference engine that runs substantially faster than naïve layer-by-layer execution.

While later sections will walk you through the Python code needed to build and run TensorRT engines, the next section explains what happens during engine creation — and why it matters for performance.

Graph Compilation and Tactic Selection

TensorRT doesn’t just “load and run” your model — it compiles it. This compilation process happens when you first build an engine from an ONNX file, and it’s what gives TensorRT its performance advantage. Here’s how it works:

Step 1: Graph Parsing

TensorRT starts by parsing your ONNX model into its internal representation — a directed acyclic graph (DAG) of operations and tensors. It validates node connectivity, checks operator support, and infers input/output shapes. If your model contains unsupported operations, this is where the build will fail.

Most standard ops from PyTorch and TensorFlow (via ONNX) are supported. If needed, you can write custom plugins — though for this course, we focus on models that work out-of-the-box.

Step 2: Graph Optimization

Next, TensorRT applies a series of graph-level optimizations, including:

  • Operator fusion: Layers like Conv + Bias + ReLU are merged into a single operation.
  • Constant folding: Subgraphs with static inputs (e.g., scale factors, masks) are evaluated at build time.
  • Dead node elimination: Nodes not connected to outputs are removed.
  • Precision calibration: TensorRT determines which layers can safely run in reduced precision (FP16, INT8), depending on your configuration.

These optimizations reduce kernel launches, minimize memory transfers, and simplify the execution graph — all of which lead to faster inference.

Step 3: Tactic Selection

This is the core differentiator of TensorRT. For each layer (e.g. convolution, matrix multiplication), there are many possible GPU kernels that could be used — varying in tiling strategy, memory layout, thread usage, and more.

TensorRT benchmarks a wide set of these implementations — called tactics — on your actual GPU, and selects the fastest one for each layer, based on your input shapes, precision (FP32/FP16/INT8), and GPU architecture.

For example: a Conv2D layer might have 30+ candidate kernels. TensorRT runs timed benchmarks, then locks in the fastest configuration as part of the engine.

This tactic benchmarking is what makes engine building slow but worthwhile — because once selected, each operation runs with the most efficient kernel available. The resulting engine is hardware-specific and can only be used on compatible GPUs (same compute capability and CUDA version).

Step 4: Memory Planning and Engine Serialization

Once tactics are chosen, TensorRT performs memory planning — allocating buffers for inputs, outputs, and intermediate activations, with smart reuse to minimize peak memory. It then generates a serialized engine, which can be saved to disk and reloaded later without recompiling.

The Full Engine Build Pipeline

This entire process — parsing, optimizing, tactic selection, memory planning — happens once, when you build the engine. After that, inference is instant: the compiled binary is loaded, and TensorRT executes the pre-planned operations with almost no overhead.

To summarize, TensorRT transforms your ONNX model into a custom GPU program, with:

  • A fused and simplified computation graph,
  • Hand-selected, benchmarked kernels (tactics),
  • Memory layouts optimized for your hardware,
  • Optional reduced precision (FP16, INT8) to improve speed and memory usage.

This is why TensorRT delivers such a large speed-up over vanilla PyTorch or ONNX Runtime on NVIDIA GPUs. It’s not just running the model — it’s compiling it into the fastest possible version for your specific device.

Building and Running TensorRT Engines

Once you’ve exported a model to ONNX format, the next step is to build a TensorRT engine — and use it to run inference on the GPU. In this section, we’ll walk through the full process using Python: from loading an ONNX file, building an optimized engine, and executing inference on example data.

We’ll use the official Python bindings for TensorRT, along with pycuda to handle memory transfers between CPU and GPU. This example assumes you have a working CUDA setup and a trained model saved as model.onnx.

Let’s break the process into three steps:

1. Load the ONNX Model

We start by reading the ONNX file and parsing it into TensorRT’s internal network graph.

import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network_flags = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
network = builder.create_network(network_flags)
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse ONNX model")
  • trt.Logger sets the logging level — use INFO to see warnings and parsing errors.
  • trt.Builder is the main interface for creating engines.
  • The EXPLICIT_BATCH flag is needed for ONNX models — it tells TensorRT that the model defines its own batch dimension.
  • trt.OnnxParser reads the model file and populates the network structure.
  • If parsing fails, we print all errors for debugging.

At this point, the model’s structure is loaded, but it hasn’t been optimized or compiled yet.

2. Build the TensorRT Engine

Now we configure the build process and compile the engine. This turns the graph into an optimized binary executable.

config = builder.create_builder_config()
config.max_workspace_size = 1 << 30  # 1 GB

if builder.platform_has_fast_fp16:
    config.set_flag(trt.BuilderFlag.FP16)

engine = builder.build_engine(network, config)
  • create_builder_config() creates a build configuration — here we define how TensorRT should optimize the model.
  • max_workspace_size limits how much temporary memory TensorRT can use during optimization (not runtime).
  • If the GPU supports fast FP16 operations, we enable half-precision execution — which reduces memory and increases speed.
  • build_engine() compiles the model: this is where graph optimizations and tactic selection happen.
  • The result is a hardware-specific inference engine.

This step can take several seconds or even minutes depending on model size and hardware.

3. Run Inference with the Engine

Once we have an engine, we create an execution context and run a forward pass with real data. This involves memory transfers between CPU and GPU.

import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np

context = engine.create_execution_context()
input_shape = engine.get_binding_shape(0)
output_shape = engine.get_binding_shape(1)

input_data = np.random.rand(*input_shape).astype(np.float32)
output_data = np.empty(output_shape, dtype=np.float32)

d_input = cuda.mem_alloc(input_data.nbytes)
d_output = cuda.mem_alloc(output_data.nbytes)

cuda.memcpy_htod(d_input, input_data)
context.execute_v2([int(d_input), int(d_output)])
cuda.memcpy_dtoh(output_data, d_output)

print("Output shape:", output_data.shape)
  • pycuda.autoinit initializes the CUDA context automatically.
  • engine.create_execution_context() prepares the engine to accept input.
  • get_binding_shape() retrieves the expected input/output shapes. These depend on how the model was exported.
  • We create NumPy arrays with random data to simulate an input batch, and empty arrays to receive the output.
  • cuda.mem_alloc() reserves memory on the GPU for inputs and outputs.
  • memcpy_htod copies data from host to device, while memcpy_dtoh does the opposite.
  • execute_v2() runs the inference using raw pointers to GPU memory buffers.

After this, output_data contains the result of the forward pass — exactly like PyTorch’s .forward(), but now compiled and running at full GPU speed.
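
To get a first impression of the speed-up, a rough latency measurement over the same buffers is enough. A sketch reusing the context, d_input, and d_output objects from above (for serious profiling, prefer CUDA events or Nsight Systems):

import time

# Warm-up: the first few calls include one-time initialization costs
for _ in range(10):
    context.execute_v2([int(d_input), int(d_output)])

# Timed runs
start = time.perf_counter()
for _ in range(100):
    context.execute_v2([int(d_input), int(d_output)])
elapsed = time.perf_counter() - start

print(f"Average latency: {elapsed / 100 * 1000:.2f} ms")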

Configuring the Engine Build Process

When you build a TensorRT engine from an ONNX model, you don’t just convert the graph — you compile it, and that process can be tuned using the builder configuration. This happens through the config object you pass to builder.build_engine().

By default, TensorRT tries to find a good balance between speed and compatibility, but you can influence how it optimizes your model using a few key settings.

Workspace Size

config.max_workspace_size = 1 << 30  # 1 GB

This sets the maximum temporary GPU memory TensorRT can use while building the engine (not during inference). A larger value can unlock faster tactics that need more memory; if the limit exceeds what your GPU can actually provide, TensorRT will simply skip those tactics, which can result in a slower engine.

Precision Flags

if builder.platform_has_fast_fp16:
    config.set_flag(trt.BuilderFlag.FP16)

If your GPU supports it, this enables FP16 precision for faster inference and reduced memory use. TensorRT will automatically pick layers that can safely run in half-precision. Later, you’ll also see how to enable INT8 — but that requires quantization.

Other Useful Options

  • config.set_flag(trt.BuilderFlag.SPARSE_WEIGHTS)

    Enables 2:4 structured sparsity, which can unlock extra speedups on supported GPUs (Ampere and newer). You’ll need a sparsity-aware model (e.g. pruned with 2:4 pattern) — we’ll cover this later with the Model Optimizer.

  • config.set_flag(trt.BuilderFlag.STRICT_TYPES)

    Forces TensorRT to only use layers in the specified precision (e.g. FP16), without automatically falling back to FP32. This is mostly used for debugging or when you're tuning for absolute minimum memory usage.

  • config.profiling_verbosity = trt.ProfilingVerbosity.DETAILED

    Turns on detailed performance logging — useful when analyzing execution traces with Nsight Systems or the TensorRT CLI profiler.

  • config.builder_optimization_level = 5

    Sets how hard TensorRT should try to optimize the model. Level 3 is the default, but higher levels (up to 5) allow longer engine build times in exchange for better performance. Great for final deployments — less useful during rapid iteration.

You won’t need these extra options at first, but it’s good to know they exist. For most cases, setting the workspace size and enabling FP16 is all you need to get started.

Managing Memory for Inference

When running inference with TensorRT, you need to manage GPU memory manually. That’s because your model runs entirely on the GPU, but your input data (like NumPy arrays) starts off on the CPU. So before inference, you need to copy data to the GPU, and after inference, copy results back to the CPU.

Allocating GPU Buffers

input_allocation = cuda.mem_alloc(input.nbytes)
output_allocation = cuda.mem_alloc(output_size * 4)  # float32 = 4 bytes

This allocates raw memory on the GPU. mem_alloc expects a size in bytes, so we multiply the number of floats by 4 to get the right size for float32. If you were working with float16, you’d use 2 bytes per value.

Copying Data to the GPU

cuda.memcpy_htod(input_allocation, input)

This copies data from host to device (CPU → GPU). Since your NumPy array lives on the CPU, you must explicitly transfer it before inference.

Copying Results Back

cuda.memcpy_dtoh(output, output_allocation)

After inference, this copies the prediction data from device to host (GPU → CPU), so you can inspect it in Python.

Why Use Integers or Raw Pointers?

TensorRT (via PyCUDA) doesn’t use NumPy arrays directly on the GPU. Instead, it works with raw memory pointers. That’s why everything is passed as opaque memory addresses, and why you allocate buffers using mem_alloc().

These pointers are passed into the engine’s execute_v2() call:

context.execute_v2([int(input_allocation), int(output_allocation)])

Each input/output must be passed in the order they appear in the engine bindings. The int() conversion ensures we’re passing raw addresses — not Python objects.

Manual memory management can feel low-level, but it gives you precise control — which is key to performance. You decide when and how much memory to allocate and transfer.
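
If your model has several inputs and outputs, it can help to derive buffer sizes from the engine bindings instead of hard-coding them. A sketch using the same TensorRT 8-style binding API as the examples above; it assumes the engine was built with fixed shapes (dynamic dimensions would appear as -1 and need to be resolved first):

import numpy as np
import pycuda.driver as cuda
import tensorrt as trt

def allocate_buffers(engine):
    """Allocate one host array and one device buffer per engine binding."""
    host_buffers, device_buffers = [], []
    for i in range(engine.num_bindings):
        shape = tuple(engine.get_binding_shape(i))
        dtype = trt.nptype(engine.get_binding_dtype(i))
        host = np.empty(shape, dtype=dtype)      # staging area on the CPU
        device = cuda.mem_alloc(host.nbytes)     # matching buffer on the GPU
        host_buffers.append(host)
        device_buffers.append(device)
    return host_buffers, device_buffers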

Dynamic Shapes and Batch Sizes

When building a TensorRT engine, you need to define the input tensor shape(s) the engine should support. This includes the batch size. TensorRT doesn’t just accept any input shape — it needs to know ahead of time which shapes might occur during inference so it can optimize accordingly.

This is done using an optimization profile, which defines a valid shape range using three values:

profile.set_shape(input_name, min_shape, opt_shape, max_shape)

Here’s what each shape means:

  • min_shape: The smallest input shape the engine should accept.
  • opt_shape: The most common (expected) shape — used to guide kernel selection.
  • max_shape: The largest input shape the engine should support.

TensorRT will optimize the engine across this range and benchmark tactics (kernels) based primarily on the opt_shape.

There are two common strategies:

Fixed Input Shape (best performance)

If you know exactly what shape your input will always have — for example, batch size 1 — you can set all three shapes to the same value. This produces the most optimized engine possible.

profile.set_shape("input", (1, 1, 28, 28), (1, 1, 28, 28), (1, 1, 28, 28))

This means the engine only supports input with batch size = 1, and nothing else. This is very common for edge AI devices, where inference is done one sample at a time.

Dynamic Batch Size (flexibility at a cost)

If you want your engine to support multiple batch sizes — for example from 1 to 16 — you can provide a range. TensorRT will support any shape between min_shape and max_shape, and use opt_shape during optimization.

profile.set_shape("input", (1, 1, 28, 28), (8, 1, 28, 28), (16, 1, 28, 28))

This engine can accept batch sizes from 1 to 16. The builder will pick tactics optimized for batch size 8 (the opt_shape).

If you expect varying input sizes — e.g. batch sizes that depend on request load — this is a good compromise.

But keep in mind: the wider the range, the harder it is to optimize. Engines with large shape ranges will generally run slower than engines optimized for a narrow or fixed shape.
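
Putting this together: the profile is created from the builder and attached to the builder config before the engine is built. A sketch that reuses the builder, network, and config objects from the build example earlier in this chapter, with an input tensor named input:

profile = builder.create_optimization_profile()
profile.set_shape("input", (1, 1, 28, 28), (8, 1, 28, 28), (16, 1, 28, 28))
config.add_optimization_profile(profile)

engine = builder.build_engine(network, config)

# At inference time, tell the execution context which concrete shape you are using
context = engine.create_execution_context()
context.set_binding_shape(0, (4, 1, 28, 28))  # e.g. a batch of 4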

Don't forget dynamic_axes

And remember: if you use dynamic shapes in TensorRT, your ONNX model must support them too. When exporting from PyTorch, use the dynamic_axes argument:

torch.onnx.export(
    model, dummy_input, "model.onnx",
    input_names=["input"],
    dynamic_axes={"input": {0: "batch_size"}}
)

This makes the batch dimension variable in the ONNX graph — a requirement if you plan to build a dynamic TensorRT engine.

Serializing and Reloading TensorRT Engines

Once you’ve built a TensorRT engine, you don’t want to rebuild it every time your application starts. Engine building can be slow — especially for large models. Fortunately, TensorRT lets you serialize an optimized engine to disk and reload it later with zero rebuild cost.

Saving the Engine to Disk

After building an engine with builder.build_engine(...), you can serialize it to a binary blob:

engine_bytes = engine.serialize()
with open("model.engine", "wb") as f:
    f.write(engine_bytes)

This writes the full engine — including the optimized graph, chosen tactics, memory plan, and precision info — to a file. You can include this file with your application, or store it in a model cache for reuse.

Loading the Engine at Runtime

Later, you can load the engine without needing the ONNX file or rebuilding anything:

import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
runtime = trt.Runtime(logger)

with open("model.engine", "rb") as f:
    engine_data = f.read()

engine = runtime.deserialize_cuda_engine(engine_data)
context = engine.create_execution_context()

Once loaded, you can run inference exactly like before — allocate buffers, bind inputs and outputs, and call context.execute_v2(...). There’s no need to re-parse ONNX or rebuild the engine.

Engine Portability and Deployment Considerations

While engine serialization is extremely useful, it comes with an important caveat: TensorRT engines are hardware-specific. That means:

  • Engines are only guaranteed to run on the same GPU architecture they were built on.
  • Even TensorRT version mismatches can prevent an engine from loading.
  • Precision-specific tactics (e.g. FP16, INT8) may only work on GPUs that support those modes.

Because of this, it’s common to generate multiple engine files, one per target GPU or compute capability. Alternatively, you can build the engine on first launch of your application — sometimes called just-in-time (JIT) optimization — and then cache the engine locally for future runs.

Serialization lets you skip slow engine builds, but you should know your deployment environment — and test accordingly.
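
A common way to implement the just-in-time approach is a small cache helper: build once on first launch, then load the serialized file on every subsequent start. A sketch, where build_engine_from_onnx() is a hypothetical stand-in for the parse-and-build steps shown earlier in this chapter:

import os
import tensorrt as trt

def load_or_build_engine(onnx_path, engine_path):
    """Reuse a cached engine if one exists, otherwise build it once and cache it."""
    logger = trt.Logger(trt.Logger.INFO)
    if os.path.exists(engine_path):
        with open(engine_path, "rb") as f:
            return trt.Runtime(logger).deserialize_cuda_engine(f.read())

    # First launch on this device: slow build from ONNX, then cache to disk
    engine = build_engine_from_onnx(onnx_path)
    with open(engine_path, "wb") as f:
        f.write(engine.serialize())
    return engine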

Running Models in Lower Precision

While TensorRT delivers major speedups through graph compilation and kernel optimization, some of the biggest performance gains come from running models in reduced numeric precision. Lowering the bit-width of weights and activations reduces memory usage, speeds up computation, and increases throughput — especially on GPUs with support for Tensor Cores.

TensorRT supports a wide range of numeric formats, but the most relevant ones for deployment today are:

  • FP32 (32-bit float) is the default format for most training workflows. It's precise and universally supported, but also the most memory- and compute-intensive option.
  • FP16 (16-bit float) cuts memory use in half and runs much faster on supported GPUs (Volta and newer). No calibration is needed — you simply enable the builder flag and TensorRT will use FP16 wherever safe.
  • INT8 (8-bit integer) provides even greater speed and compression, but requires explicit quantization. Models must include scaling parameters to map activations and weights to 8-bit integers — either through calibration (deprecated) or quantization-aware training.
  • FP8 (8-bit float) is a newer format supported on Hopper and some Ada GPUs. It enables ultra-low precision execution and is primarily used in large language model inference. It’s still considered experimental in most edge or embedded workflows.

In the past, TensorRT supported post-training quantization (PTQ) using calibrators and representative datasets — where a full-precision model was analyzed after training to estimate scaling factors for INT8 inference. As of recent releases, this workflow is deprecated.

NVIDIA now recommends exporting models that are already quantized — using quantization-aware training (QAT) — which yields more accurate results. This shift makes INT8 deployment more robust: instead of relying on runtime calibration, models carry explicit quantization metadata embedded in the graph.

The recommended way to quantize a model for TensorRT is by using the Model Optimizer library, which we’ll explore in the next chapter.


7. Advanced Techniques with Model Optimizer

The TensorRT Model Optimizer — often referred to as ModelOpt — is an open-source Python library from NVIDIA designed to apply training-aware model compression techniques before deployment. Unlike TensorRT, which focuses on optimizing inference after a model has been exported (typically in ONNX format), ModelOpt works upstream, directly on your PyTorch model, allowing you to compress and optimize it before ONNX export.

In the typical deployment pipeline:

PyTorch → (ModelOpt: optional) → ONNX → TensorRT → Engine

ModelOpt sits between training and export, giving you fine-grained control over model structure and precision, and preparing models for maximum efficiency during TensorRT inference. While TensorRT itself supports many advanced graph optimizations, it doesn’t modify your model’s architecture or weights.

That’s where ModelOpt comes in — offering advanced model-level compression techniques:

  • Quantization-Aware Training (QAT): Inserts fake quantization layers into your PyTorch model so that it can be fine-tuned to behave well under INT8 or even INT4 precision. This typically yields better accuracy than pure PTQ.
  • Structured Pruning: Removes unimportant filters, channels, or neurons to shrink the model and reduce FLOPs. Works well for vision and transformer models alike.
  • 2:4 Sparsity and Weight Compression: Enforces GPU-friendly sparsity patterns that can be exploited by TensorRT on Ampere and newer hardware for actual runtime acceleration.
  • Knowledge Distillation: Trains a smaller “student” model to mimic a larger “teacher,” improving accuracy in highly compressed models.
  • Neural Architecture Search (NAS): Automatically searches for efficient model structures that meet size, latency, or accuracy constraints — useful when hand-designed models fall short.

Each of these techniques modifies the model before export, so when you later convert to ONNX and build a TensorRT engine, the graph already reflects those optimizations — and TensorRT can take full advantage of them.

For example, a typical optimized deployment might look like this:

  1. Train a model in PyTorch as usual.
  2. Use ModelOpt to apply QAT and prune unnecessary channels.
  3. Export the compressed model to ONNX.
  4. Build a TensorRT engine that uses INT8 precision and sparse tensor cores.

This approach lets you combine model-level compression (via ModelOpt) with execution-level optimization (via TensorRT) — achieving significantly smaller, faster models with minimal loss in accuracy. It’s a powerful addition to any deployment pipeline, and the next sections will show you how to use it.

Quantization-Aware Training with Model Optimizer

Quantization-aware training (QAT) in Model Optimizer always starts with post-training quantization (PTQ). This first step inserts fake quantization layers into your PyTorch model based on a few representative inputs. If you're only looking for a simple way to shrink your model and speed up inference — and you're okay with a small accuracy drop — PTQ alone might be all you need. It's quick, doesn't require any retraining, and works well in many real-world cases.

But if accuracy is critical, you can go one step further. After PTQ, Model Optimizer lets you fine-tune the quantized model using QAT — which means training the model while it simulates INT8 behavior. This helps it adapt to the lower precision and recover most (or all) of the lost accuracy. The good news is: both PTQ and QAT are built into the same workflow, so upgrading from “quick and simple” to “optimized and accurate” takes very little extra effort.

Note: TensorRT only supports symmetric quantization, which means no zero-point is used — just a scale factor. As a result, dequantization is a simple multiplication (float = scale × int), making it very efficient on GPU hardware.
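
To make that note concrete, here is what symmetric INT8 quantization of a weight tensor boils down to, written in plain NumPy for illustration only (this is not ModelOpt code):

import numpy as np

weights = np.random.randn(64).astype(np.float32)

# Symmetric quantization: a single scale per tensor, no zero-point
scale = np.abs(weights).max() / 127.0
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

# Dequantization is just a multiplication: float ≈ scale * int
reconstructed = scale * q.astype(np.float32)
print("Max quantization error:", np.abs(weights - reconstructed).max())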

Step 1: Train a Full-Precision (FP32) Model

Before we apply quantization, we always start by training our model in full 32-bit floating point. Whether you're using a pretrained model or training from scratch, it's important to first reach a good level of accuracy with normal training. Quantizing a randomly initialized model doesn't make sense — the weights aren't meaningful yet. So treat this step exactly like any other training workflow.

import torch
import torchvision.models as models
import torch.nn as nn
import torch.optim as optim

model = TinyCNN()
optimizer = optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

model.train()
for epoch in range(100):
    for inputs, targets in train_loader:
        outputs = model(inputs)
        loss = loss_fn(outputs, targets)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
  • Loads a custom model (TinyCNN) — or could use a pretrained model.
  • Creates an optimizer and loss function — standard setup for supervised learning.
  • Runs a training loop — adjusts model weights to your dataset for a few epochs.

Step 2: Insert Fake Quantization Modules

Now that we have an accurate FP32 model, it’s time to prepare it for INT8 execution. The first step is post-training quantization: we calibrate the model using representative data, and ModelOpt inserts fake quantization modules to simulate low-precision behavior.

import modelopt.torch.quantization as mtq

calibration_data = [torch.rand(1, 1, 28, 28) for _ in range(100)]

def calibrate(model):
    model.eval()
    with torch.no_grad():
        for x in calibration_data:
            model(x)

config = mtq.INT8_DEFAULT_CFG
model_q = mtq.quantize(model, config, calibrate)
  • Creates a small set of representative calibration inputs to simulate real data.
  • Defines a calibration function that runs inference on this data to gather activation statistics.
  • Selects the default INT8 quantization config, which targets both weights and activations.
  • Calls quantize() to insert fake quantization modules and prepare the model for QAT.

Step 3: Fine-Tune the Quantized Model

Now that our model includes fake quantization modules, we can fine-tune it with quantization-aware training. Although it still runs in FP32 internally, the quantization layers simulate INT8 behavior — allowing the model to adapt to reduced precision. This helps recover the small accuracy drop introduced during calibration.

optimizer = optim.Adam(model_q.parameters(), lr=1e-4)

model_q.train()
for epoch in range(5):
    for inputs, targets in train_loader:
        outputs = model_q(inputs)
        loss = loss_fn(outputs, targets)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
  • Sets up a lower learning rate optimizer for gentle fine-tuning.
  • Enables training mode and runs a few epochs on the quantized model.
  • Allows the model to learn to compensate for quantization noise, improving INT8 accuracy.
  • QAT typically converges fast — often just a few epochs are enough.

Step 4: Export the Quantized Model to ONNX

Once training is complete, export the quantized model to ONNX. ModelOpt ensures that the quantization parameters (scales) are embedded in the graph.

dummy_input = torch.rand(1, 1, 28, 28)
torch.onnx.export(model_q, dummy_input, "model_qat.onnx")

You now have a fully quantized ONNX model, ready for deployment.
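
If you want to verify that the quantization parameters really made it into the graph, you can count the Q/DQ nodes in the exported file. A small check, assuming the file name used above:

import onnx
from collections import Counter

qat_model = onnx.load("model_qat.onnx")
ops = Counter(node.op_type for node in qat_model.graph.node)
print("QuantizeLinear nodes:  ", ops["QuantizeLinear"])
print("DequantizeLinear nodes:", ops["DequantizeLinear"])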

Step 5: Build a TensorRT Engine with INT8 Enabled

TensorRT can now load this ONNX file and build an INT8 engine. Since the quantization parameters are already baked into the graph, there's no need to provide calibration data again.

import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model_qat.onnx", "rb") as f:
    parser.parse(f.read())

config = builder.create_builder_config()
config.max_workspace_size = 1 << 30
config.set_flag(trt.BuilderFlag.INT8)  # Prefer INT8
config.set_flag(trt.BuilderFlag.FP16)  # FP16 fallback

engine = builder.build_engine(network, config)

TensorRT will now use true INT8 kernels during inference, leveraging Tensor Cores for maximum performance — with minimal loss in accuracy, thanks to QAT.

This workflow gives you the best of both worlds: the compact, high-speed inference of INT8 — and the accuracy of a model that’s trained to survive quantization.

Structured Pruning for Smaller, Faster Models

Structured pruning involves removing entire structures — such as filters, channels, or attention heads — from a neural network. This technique reduces computational complexity and model size, leading to faster inference times and lower resource consumption.

While PyTorch's built-in torch.nn.utils.prune module offers various pruning methods, it primarily applies masks to weights, zeroing them out without physically removing them. This approach may not yield significant performance gains on certain hardware.

In contrast, Model Optimizer provides advanced structured pruning capabilities that physically remove redundant structures from the model. This results in actual reductions in model size and computational load, making it highly effective for deployment on NVIDIA GPUs.

A typical pruning workflow involves:

  1. Starting with a fully trained FP32 model.
  2. Applying structured pruning using ModelOpt.
  3. Fine-tuning to recover accuracy.
  4. Exporting the optimized model for deployment.

Step 1: Train the Original Model

We begin by training our model to convergence in full precision. We need a fully trained model before we do any form of pruning.

model = TinyCNN()
optimizer = optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

model.train()
for epoch in range(100):
    # forward, loss, backwards, step
  • The model is trained in standard FP32 mode.
  • You should reach acceptable accuracy before pruning — pruning a poor model won’t help.

Step 2: Apply Structured Pruning

We now prune the model using ModelOpt’s prune() function, which permanently removes channels from convolutional layers to meet a given constraint (e.g. target FLOPs). The resulting model is structurally smaller — layers have fewer channels, and the architecture has physically changed.

import modelopt.torch.prune as mtp

dummy_input = torch.rand(1, 1, 28, 28)

def score_func(model):
    return evaluate_accuracy(model, validation_loader)

pruned_model, result = mtp.prune(
    model=model,
    dummy_input=dummy_input,
    constraints={"flops": "50%"},
    mode="fastnas",
    config={
        "data_loader": train_loader,
        "score_func": score_func,
    }
)
  • dummy_input: required to trace shapes through the model.
  • constraints: here we ask for 50% FLOP reduction.
  • mode="fastnas": fast structured pruning with optional accuracy preservation.
  • data_loader: training data to recalibrate batch norm layers during pruning.
  • score_func: fastnas will use this to test accuracy during pruning.

Constraints define your target — that is, how small or efficient the pruned model should be. The most common options are "flops" and "params" (parameters), which can be expressed as percentages ("50%") or absolute values (4.5e6).

Multiple pruning modes are supported, but for most computer vision and time series models, we recommend using mode="fastnas". This mode performs structured channel pruning using a fast neural architecture search strategy to find a smaller sub-network that meets your constraint while preserving accuracy.

The score_func lets the pruning algorithm evaluate how well different pruned architectures perform — by returning a validation accuracy metric. This guides the search toward architectures that retain performance even under tighter resource constraints.

Step 3: Fine-tune the Pruned Model

Pruning usually hurts accuracy. To recover performance, we fine-tune the new model for a few epochs.

optimizer = optim.Adam(pruned_model.parameters(), lr=1e-4)

pruned_model.train()
for epoch in range(5):
    for inputs, targets in train_loader:
        # forward, loss, backwards, step
  • Use a lower learning rate than the original training.
  • This helps the model adapt to the new structure without destabilizing earlier learning.

Step 4: Export the Pruned Model

The final model has a new, smaller architecture. You can now export it directly to ONNX for use with TensorRT.

torch.onnx.export(
    pruned_model,
    dummy_input,
    "pruned_model.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch_size"}, "output": {0: "batch_size"}}
)
  • No special conversion is needed — the architecture is already modified.
  • TensorRT will treat this as any other ONNX model. There’s no need to “tell” it the model was pruned.
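
Because the pruned architecture is physically smaller, you can see the effect directly by comparing parameter counts. A quick sketch; record the original count before calling mtp.prune(), since pruning may modify the model object in place:

def count_parameters(m):
    return sum(p.numel() for p in m.parameters())

n_original = count_parameters(model)  # measure this before pruning

# ... after pruning and fine-tuning:
print("Original parameters:", n_original)
print("Pruned parameters:  ", count_parameters(pruned_model))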

2:4 Structured Sparsity for Accelerated Inference

Structured sparsity is a powerful optimization technique — not just for reducing model size, but for actually speeding up inference on supported hardware. One of the most effective formats is 2:4 structured sparsity, a fine-grained pattern introduced with NVIDIA’s Ampere architecture and supported by Sparse Tensor Cores in recent GPUs (A100, RTX 30xx, Jetson Orin). This format is designed for hardware acceleration — and when used correctly, it can provide a 1.5–2× speedup for certain layers with no accuracy loss.

The idea behind 2:4 sparsity is simple: for every group of 4 consecutive weights, 2 must be zero. This predictable pattern allows the GPU to skip computations during matrix multiplications. With ModelOpt, you can automatically enforce this pattern during training — and export a model that’s ready to be accelerated by TensorRT.
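
To make the pattern concrete, here is what 2:4 sparsity looks like on a raw weight tensor, written in plain NumPy for illustration only (ModelOpt handles the masking and training details for you):

import numpy as np

weights = np.random.randn(2, 8).astype(np.float32)

# For every group of 4 consecutive weights, keep the 2 with the largest magnitude
groups = weights.reshape(-1, 4)
keep = np.argsort(np.abs(groups), axis=1)[:, 2:]   # indices of the 2 largest per group
mask = np.zeros_like(groups, dtype=bool)
np.put_along_axis(mask, keep, True, axis=1)
sparse = (groups * mask).reshape(weights.shape)

print(sparse)  # every group of 4 now contains exactly 2 zeros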

Step 1: Train a Dense Model

As always, begin by training your model in full precision. Structured sparsity is applied after the model has already learned useful features.

model = TinyCNN()
optimizer = optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

model.train()
for epoch in range(100):
    for inputs, targets in train_loader:
        # forward, loss, backwards, step

This training loop is no different from standard FP32 training. The important part is to achieve strong baseline accuracy before introducing sparsity.

Step 2: Apply 2:4 Sparsity with ModelOpt

ModelOpt makes it easy to impose the 2:4 sparsity pattern across supported layers (typically fully connected or convolutional). Under the hood, it uses a custom regularization method that encourages blocks of four weights to contain exactly two zeros — a constraint that matches the hardware requirements.

import modelopt.torch.sparsify as mts

sparse_model = mts.make_sparse_2_to_4(model)
  • make_sparse_2_to_4() rewrites weight tensors in-place to enforce the 2:4 pattern.
  • Only supported layers (e.g. Linear, Conv2d) are affected — BatchNorm and activations are untouched.
  • This step can be applied to a trained model or used during training via hooks.

You can optionally inspect the sparsity level per layer:

mts.print_sparsity_report(sparse_model)

Step 3: Fine-Tune the Sparse Model

After applying the sparsity mask, you’ll often see a slight drop in accuracy — especially if pruning was aggressive. To recover performance, run a few more epochs of fine-tuning. The sparse pattern is preserved, but weights are adjusted to better fit the constraint.

optimizer = optim.Adam(sparse_model.parameters(), lr=1e-4)

sparse_model.train()
for epoch in range(5):
    for inputs, targets in train_loader:
        # forward, loss, backwards, step
  • Use a smaller learning rate to avoid destabilizing the sparse weights.
  • The make_sparse_2_to_4() function uses masking layers internally to preserve the sparsity pattern during training.

Step 4: Export the Sparse Model to ONNX

Once fine-tuning is complete, export the model to ONNX. The sparsity pattern is implicit — no special operator is required — but TensorRT will detect it automatically during engine build.

dummy_input = torch.rand(1, 1, 28, 28)
torch.onnx.export(sparse_model, dummy_input, "sparse_model.onnx")

Step 5: Build a TensorRT Engine with Sparse Support

Now that your ONNX model is sparse, you can enable 2:4 acceleration in TensorRT by setting the SPARSE_WEIGHTS flag. This tells TensorRT to scan the model for supported patterns and apply sparse kernels where possible.

config.set_flag(trt.BuilderFlag.SPARSE_WEIGHTS)  # enable 2:4 acceleration
  • TensorRT will detect and accelerate layers with 2:4-compatible sparsity.
  • No further annotations or calibration are needed.

If your hardware meets the sparsity requirements, TensorRT will replace dense kernels with optimized sparse kernels — unlocking real speedups with no change in architecture or deployment logic.

Beyond the Basics: Advanced Features and Frontiers

While this chapter has focused on the core functionality of TensorRT and ModelOpt — including engine compilation, quantization-aware training, and structured pruning — both tools support more advanced workflows that extend their usefulness in real-world deployments.

AutoQuantize in ModelOpt performs per-layer precision selection to optimize the balance between model size and accuracy. Instead of uniformly quantizing all layers, it automatically chooses the best format—such as FP8, INT8, or skipping quantization — based on accuracy impact, guided by a user-defined target like average bit-width. This enables efficient, hardware-aware compression with minimal accuracy loss.

Knowledge distillation is also supported in ModelOpt. It enables the training of smaller “student” models that learn to replicate the behavior of larger, more accurate teacher models. This is especially useful when combined with quantization or pruning, as it allows compressed models to retain much of the original model's accuracy even under aggressive optimization.

For use cases that require more than just post hoc optimization, ModelOpt includes support for lightweight neural architecture search (NAS). This allows users to automatically search over model variants — varying filter sizes, layer depths, or channel counts — to find architectures that meet specific performance budgets in terms of parameters, FLOPs, or inference time.

Finally, TensorRT now supports ultra-low-precision datatypes such as INT4 and FP8. These formats allow further reductions in memory and compute requirements, and are particularly relevant in transformer-based models or large-scale inference workloads. While tooling around these formats is still evolving, they represent the next step in model compression and are increasingly supported in both software and hardware.

Together, these advanced techniques make it possible to build models that are not only smaller and faster, but also better suited to the hardware they’ll run on — bridging the gap between research and real-world deployment.


Wrapping Up: From Models to Edge Deployment

We’ve come to the end of our Edge AI deployment story.

Over the course of this guide, we’ve moved step by step from high-level concepts to practical deployment techniques, always with an eye on the real-world constraints of edge environments: limited compute, power efficiency, resilience, and latency.

You’ve seen how edge AI differs fundamentally from cloud-based ML, and how that affects the hardware choices we make — from microcontrollers to Jetson boards to AI accelerators. You’ve explored how to optimize models through techniques like quantization, pruning, and knowledge distillation, and how to choose architectures that fit your constraints.

You then got hands-on with deployment tools:

  • You learned how to export models using ONNX, and how to inspect and debug exported graphs.
  • You ran inference using ONNX Runtime, taking advantage of execution providers and built-in graph optimizations.
  • You built high-performance, hardware-specific inference engines using TensorRT, gaining full control over precision, memory, and performance.
  • And finally, you explored NVIDIA’s Model Optimizer, diving into advanced compression techniques like Quantization Aware Training, Structured Pruning, and Sparsity Acceleration — the kinds of tools that push deployment into the realm of production-grade, scalable systems.

While there’s always more to learn — new formats, hardware evolutions, or novel optimization methods — you now have a practical, structured understanding of how to bring models from the lab to the edge. Whether you’re working in industrial automation, smart sensing, robotics, or embedded analytics, these tools and workflows are designed to help you build AI systems that are fast, efficient, and ready for the real world.

Thanks again for following along — and good luck building what's next!