Fri 08 August 2025
Stop Shipping PyTorch: Why ONNX Makes Deploying T5 Models (Actually) Simple
Let's be honest: shipping a PyTorch model in production often feels like dragging an iceberg behind your app. It works great in your notebook, until you need to deploy.
Here’s the real pain:
- Your app now drags along all of torch, numpy, and friends. That means hundreds of megabytes of wheels and libraries.
- Your Docker images balloon—sometimes by gigabytes.
- The user’s environment gets brittle. Suddenly, you’re juggling Python versions, CUDA, and a dependency graph that looks like an accident report.
- Simple install? Forget it. Now you’re shipping half the Python ecosystem.
Let’s see this pain in action with a concrete example: deploying a T5 seq2seq model for inference.
Why ONNX? The Universal Model Format
ONNX is the “export” button your deployment pipeline needs. You train in PyTorch (or TensorFlow, or whatever), then convert to ONNX—a compact .onnx file. For inference, all you need is onnxruntime. That’s it. No torch, no CUDA drivers, no fighting your dependency tree.
The Extra Step: Model Conversion
Here’s the catch: you can’t run your PyTorch model directly with ONNX Runtime. You have to convert it first.
Let’s make this explicit. Suppose you trained a T5-small sequence-to-sequence model in PyTorch. For ONNX, you have to export it, and because T5 is an encoder-decoder model, the exported graph needs both the encoder’s input_ids and the decoder’s decoder_input_ids:
import torch
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-small")
model.eval()
model.config.use_cache = False  # skip past_key_value outputs so the exported graph stays simple

# T5 is an encoder-decoder model, so the export needs dummy encoder and decoder token IDs
dummy_input = torch.randint(0, 32128, (1, 8))          # (batch, encoder_seq_len)
dummy_decoder_input = torch.randint(0, 32128, (1, 4))  # (batch, decoder_seq_len)

torch.onnx.export(
    model,
    (dummy_input, {"decoder_input_ids": dummy_decoder_input}),
    "t5-small.onnx",
    input_names=["input_ids", "decoder_input_ids"],
    output_names=["logits"],
    opset_version=14,
    dynamic_axes={
        "input_ids": {0: "batch_size", 1: "seq_len"},
        "decoder_input_ids": {0: "batch_size", 1: "decoder_seq_len"},
    },
)
This code turns your PyTorch T5 model into an ONNX model. You need this extra step, even for stock HuggingFace models.
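Once the export finishes, it’s worth a quick sanity check that the graph loads and exposes the inputs you expect. A minimal sketch using onnxruntime:

import onnxruntime as ort

# Load the exported graph and inspect its interface
session = ort.InferenceSession("t5-small.onnx")
print([inp.name for inp in session.get_inputs()])   # expect something like ['input_ids', 'decoder_input_ids']
print([out.name for out in session.get_outputs()])  # the first output should be 'logits'

If the names or shapes don’t line up with what you intended, the input_names and dynamic_axes arguments in the export call are the first place to look.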
Inference: PyTorch vs. ONNX (T5 Example)
Let’s walk through what inference looks like on both stacks. You’ll see where ONNX adds some up-front complexity, but wins at deployment.
PyTorch Inference (Short and Sweet)
from transformers import T5Tokenizer, T5ForConditionalGeneration
model = T5ForConditionalGeneration.from_pretrained("t5-small")
tokenizer = T5Tokenizer.from_pretrained("t5-small")
input_text = "translate English to German: The house is wonderful."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids
output_ids = model.generate(input_ids)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
Great for research and prototyping, but this pulls in all of PyTorch, the HuggingFace Transformers library, and their dependencies. Shipping this as a CLI or server means a huge install for users.
ONNX Inference (More Work, Less Baggage)
ONNX requires a bit more boilerplate, but here’s how you run the same T5-small model after conversion:
import onnxruntime as ort
from transformers import T5Tokenizer
import numpy as np

tokenizer = T5Tokenizer.from_pretrained("t5-small")
ort_session = ort.InferenceSession("t5-small.onnx")

input_text = "translate English to German: The house is wonderful."
encoder_input_ids = tokenizer(input_text, return_tensors="np").input_ids.astype(np.int64)

# The exported graph has no generate(), so greedy decoding is implemented by hand.
def greedy_decode(encoder_input_ids, ort_session, tokenizer, max_length=30):
    # T5 starts decoding from the pad token (its decoder_start_token_id)
    decoder_input_ids = np.array([[tokenizer.pad_token_id]], dtype=np.int64)
    for _ in range(max_length):
        logits = ort_session.run(
            None,
            {"input_ids": encoder_input_ids, "decoder_input_ids": decoder_input_ids},
        )[0]
        # Append the most likely next token to the decoder sequence
        next_token = np.argmax(logits[:, -1, :], axis=-1, keepdims=True)
        decoder_input_ids = np.concatenate([decoder_input_ids, next_token], axis=-1)
        if next_token[0, 0] == tokenizer.eos_token_id:
            break
    return decoder_input_ids

output_ids = greedy_decode(encoder_input_ids, ort_session, tokenizer)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
This is longer and more manual than the PyTorch version, but you lose all the dead weight. Your deployable package shrinks to onnxruntime, a tokenizer, and a model file. No torch, no CUDA headaches, no gigabyte-scale ML framework. (If you want to trim further, T5’s tokenizer is SentencePiece-based, so you can load it with the standalone sentencepiece or tokenizers packages instead of pulling in all of transformers.)
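To make that concrete, here’s an illustrative requirements file for the ONNX deployment. The package names are real; exact version pins, and whether you keep transformers around just for its tokenizer, are up to you:

# requirements.txt for the ONNX deployment (illustrative)
onnxruntime     # runs the exported graph
numpy           # array inputs/outputs for onnxruntime
transformers    # used here only for T5Tokenizer
sentencepiece   # backend for the T5 tokenizer

Compare that to the PyTorch path, where torch alone dwarfs everything on this list.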
Why Bother? The Payoff Is Real
Yes, the ONNX approach requires a conversion step and (sometimes) a bit of extra inference code, especially for generative models like T5 that don’t have built-in decoding in ONNX.
But here’s what you gain:
- Tiny Install Size: ONNX Runtime is measured in megabytes, not hundreds.
- Lean Dependencies: You drop torch, CUDA, and the dependency sprawl that comes with them; numpy is essentially the only scientific package left.
- Fast, Optimized Inference: ONNX Runtime leverages your hardware efficiently and is often faster than PyTorch CPU inference.
- Trivial Packaging: Now you can build a simple wheel or even a standalone binary.
This matters especially for CLI tools, microservices, and environments where you do not want to ship the world just to run a model.
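For instance, a hypothetical CLI wrapper around the inference code above stays this small. (The translate_onnx module and its greedy_decode helper are assumed names referring to the earlier example, not an existing package.)

# cli.py: a thin, hypothetical wrapper around the ONNX inference example above
import argparse

import numpy as np
import onnxruntime as ort
from transformers import T5Tokenizer

# Assumed: the greedy_decode helper from the inference example lives in translate_onnx.py
from translate_onnx import greedy_decode


def main():
    parser = argparse.ArgumentParser(description="T5 translation via ONNX Runtime")
    parser.add_argument("text", help="e.g. 'translate English to German: The house is wonderful.'")
    parser.add_argument("--model", default="t5-small.onnx", help="path to the exported ONNX model")
    args = parser.parse_args()

    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    session = ort.InferenceSession(args.model)
    input_ids = tokenizer(args.text, return_tensors="np").input_ids.astype(np.int64)
    output_ids = greedy_decode(input_ids, session, tokenizer)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))


if __name__ == "__main__":
    main()

Everything a user needs fits in one small wheel plus the .onnx file, which is the whole point.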
Key Takeaways
- Deploying PyTorch models means dragging a huge stack into your app.
- ONNX requires a conversion step but pays you back with lightweight, fast, and robust deployments.
- T5 seq2seq in ONNX does require more manual inference code, but your users will thank you for a painless install and fast runtime.
Future-you will thank you for doing the extra work up front. For any production deployment, especially for tools, services, and anything public, stop shipping PyTorch. Export to ONNX, and make both your life and your users’ lives simpler.