Serving ML Models with gRPC
Skip REST and give gRPC a try
Written by Austin Poor
Most people who are looking to put their newly trained ML model into production turn to REST¹ APIs. Here’s why I think you should consider using gRPC instead.
Wait! What’s wrong with REST!?
Nothing! The main benefit of REST APIs is their ubiquity. Every major programming language has a way of making HTTP clients and servers. And there are several existing frameworks for wrapping ML models with REST APIs (e.g. BentoML, TF Serving, etc). But, if your use case doesn’t fit one of those tools (and even if it does), you may find yourself wanting to write something a little more custom. And the same thing that makes REST APIs versatile can also make them difficult to work with.
What is gRPC?
As its site states, gRPC is “a high performance, open source universal RPC framework,” originally developed at Google. The three main elements at the core of gRPC are: code generation, HTTP/2, and Protocol Buffers².
Protocol Buffers are a binary, structured data format designed by Google to be small and fast. Both gRPC services and their request/response message formats are defined in .proto files.
gRPC client and server code is then generated from those .proto definition files, in your preferred language. You then fill in the business logic to implement the API.
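As a sketch, a minimal proto3 definition for a model-serving API might look like the following. The service and message names (Predictor, PredictRequest, PredictResponse) are illustrative, not from any particular framework:

```proto
syntax = "proto3";

package predictor;

// A hypothetical prediction service: one unary RPC that takes a
// feature vector and returns a score.
service Predictor {
  rpc Predict (PredictRequest) returns (PredictResponse);
}

message PredictRequest {
  repeated float features = 1;
}

message PredictResponse {
  float score = 1;
}
```

Running protoc (with the grpcio-tools plugin, for Python) against this file generates the client stub and server base classes for you.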
HTTP/2-based transport provides gRPC with several key benefits:
- A binary protocol
- Bi-directional streaming
- Header compression
- Multiplexing several requests on the same connection
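Putting those pieces together, filling in the business logic on the Python side might look like the sketch below. The predictor_pb2 and predictor_pb2_grpc modules are hypothetical names that protoc would generate from a Predictor service definition, and the "model" is a stand-in:

```python
from concurrent import futures

import grpc

# Hypothetical modules generated by protoc from a Predictor service
# definition; the names are illustrative.
import predictor_pb2
import predictor_pb2_grpc


class PredictorServicer(predictor_pb2_grpc.PredictorServicer):
    def Predict(self, request, context):
        # Business logic goes here: run your model on the request's
        # features. This average is just a placeholder "model".
        score = sum(request.features) / len(request.features)
        return predictor_pb2.PredictResponse(score=score)


def serve():
    # Standard grpcio server setup: thread pool, register the servicer,
    # bind a port, and block until shutdown.
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
    predictor_pb2_grpc.add_PredictorServicer_to_server(PredictorServicer(), server)
    server.add_insecure_port("[::]:50051")
    server.start()
    server.wait_for_termination()
```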
Okay, but what does that mean for me?
Type Safety & Documentation
Since gRPC APIs are defined via protobufs, they are inherently documented and type-safe. REST APIs make no such guarantee: you would need extra tooling like OpenAPI to define and document your service, plus a library to validate client requests.
Speed, Binary Data & Streaming
gRPC takes full advantage of HTTP/2 and Protocol Buffers to make your API as fast as possible. gRPC messages are made up of efficiently packed binary data compared with REST’s plain-text, JSON-encoded messages.
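To make the size difference concrete, here is a small standard-library comparison. Packing a feature vector as raw 32-bit floats (similar in spirit to how protobuf encodes repeated floats, though not the actual wire format) takes a fraction of the bytes that JSON does:

```python
import json
import struct

# A feature vector of four floats, as a small ML model input might look.
features = [0.25, 1.5, -3.0, 42.0]

# JSON: human-readable text, one byte per character.
as_json = json.dumps({"features": features}).encode("utf-8")

# Packed binary: four little-endian 32-bit floats, 4 bytes each.
as_binary = struct.pack("<4f", *features)

print(len(as_json), len(as_binary))  # the binary form is under half the size
```

For larger inputs (and with protobuf's varint encoding and HTTP/2 header compression on top), the gap only grows.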
One commonly cited benchmark shows gRPC to be roughly 7-10 times faster than REST. While the difference may be less pronounced for small requests, the inputs to ML models are often large (e.g. large tables of data, images to be processed, or even video), and that is exactly where compression and binary formats shine.
In fact, since Protocol Buffers allow for binary data, a request could be a large table of data encoded with Apache Arrow or Apache Parquet. And further, thanks to the capabilities of HTTP/2, large binary messages can be broken up into chunks and streamed.
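On the client side, chunking a large payload for a streaming RPC can be as simple as a generator. The helper below is a hypothetical sketch (gRPC's Python client stubs accept an iterator of request messages for client-streaming calls; here we just yield raw byte chunks):

```python
def chunk_payload(data: bytes, chunk_size: int = 64 * 1024):
    """Yield fixed-size chunks of a large payload, as you would when
    feeding a client-streaming RPC one message per chunk."""
    for offset in range(0, len(data), chunk_size):
        yield data[offset : offset + chunk_size]


# e.g. a serialized Arrow table or an image, here just 200 KB of zeros.
payload = bytes(200_000)
chunks = list(chunk_payload(payload))
print(len(chunks))  # three full 64 KiB chunks plus a final partial chunk
```

The server reassembles (or incrementally processes) the chunks on its end, so neither side has to hold one giant message in a single frame.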
Downsides & Alternatives
gRPC certainly isn’t perfect. For example, here are some issues you might run into:
- Slower initial development
- Less commonly used
- Messages aren’t human-readable, which makes debugging more difficult
- Client libraries need to be generated
Other approaches exist that might fit better with your workflow. BentoML, TF Serving, and Ray Serve are some great options for serving ML models. Or, if you’re looking for something a little more customizable, FastAPI and Flask are two great options that might be a better match.
Also, as a partial step, you could consider swapping your message format from JSON to a binary encoding like BSON or MessagePack.
Conclusion / TL;DR
gRPC APIs are fast and easy to work with. They are inherently type-safe, they allow for bi-directional streaming messages (e.g. breaking large files into chunks), and they use a fast and efficient message format (Protocol Buffers).
Next time you need to serve up an ML model via an API, consider using gRPC!
- gRPC (gRPC Python library)
- Protocol Buffers
¹ I know the term “RESTful” is maybe thrown around a bit too liberally — and applied to any HTTP-based API. In this article, I use the colloquial definition of a REST API.
² This is a bit of an oversimplification - gRPC is highly customizable. For example, you could use JSON instead of protobufs and HTTP/1 instead of HTTP/2. But…should you?
Thank you so much for reading! If you have any thoughts, questions, or comments, I'd love to hear them. You can find me on Twitter, Mastodon, or LinkedIn.
If you liked this post, you might also like:
Handling ML Predictions in a Flask App
Don't let long-running code slow down your Flask app
Data Scientists, Start Using Profilers
Find the parts of your algorithm that are ACTUALLY slowing you down