Latency and throughput in machine learning
When managing a data science project, the first thing to do is always to understand the requirements of the different stakeholders.
For instance, suppose we are building a hotel recommendation system; different stakeholders will have different concerns:
- Product team: They want a model that returns recommended hotels within 100 milliseconds, since a 30% increase in latency reduces sales by 0.5%. They also point out that, on average, we have 100 visitors at any given second.
- Sales team: They want a model that recommends expensive hotels, as these bring higher service fees.
- After-sales team: They want hotels that generate less negative feedback.
- Management: They want a model that maximizes the margin.
Let’s see how these conflicting requirements affect our model deployment system.
To build and deploy a model that satisfies these requirements, we may have to run a relatively complex model: one that consists of multiple models, each addressing one requirement, whose outputs are then combined into a single prediction. This is known as an ensemble method.
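As a rough illustration, here is a minimal Python sketch of such an ensemble. All the model names, weights, and hotel features are hypothetical stand-ins; the point is only that several per-requirement scores are combined into one ranking.

```python
# Minimal ensemble sketch; model names, weights, and features are hypothetical.

def relevance_score(user, hotel):
    # Product team: how relevant the hotel is to this user (dummy feature here).
    return hotel["rating"] / 5.0

def margin_score(hotel):
    # Sales / management: prefer hotels with higher margin and fees.
    return hotel["margin"]

def satisfaction_score(hotel):
    # After-sales team: prefer hotels with fewer complaints.
    return 1.0 - hotel["complaint_rate"]

WEIGHTS = {"relevance": 0.5, "margin": 0.3, "satisfaction": 0.2}

def ensemble_score(user, hotel):
    # Combine the sub-model scores into one shared decision.
    return (WEIGHTS["relevance"] * relevance_score(user, hotel)
            + WEIGHTS["margin"] * margin_score(hotel)
            + WEIGHTS["satisfaction"] * satisfaction_score(hotel))

def recommend(user, hotels, k=5):
    # Rank all candidate hotels by the combined score and return the top k.
    return sorted(hotels, key=lambda h: ensemble_score(user, h), reverse=True)[:k]
```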
The downside of this approach is high latency: however complex and accurate the model is, it takes time to process the data, run it through the different models, and then combine their results into one prediction.
If we expose this machine learning system as an API and it takes longer than the 100 milliseconds required by our product team, the model may be useless in production.
In simple words, latency is the time it takes from receiving a query to returning the result.
Another aspect to consider is throughput, which is how many queries are processed within a given period of time (usually 1 second).
If the system always processes one query at a time, with 100 milliseconds of latency per query, then its throughput is 10 queries per second. That covers only 10 recommendations per second, so if 20 visitors arrive at the same moment, only 10 get a recommendation right away and the rest have to wait a little longer; and remember that we have 100 visitors on average at any given second.
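A quick back-of-envelope check of that gap, assuming purely sequential processing (one worker handling one query at a time):

```python
# Back-of-envelope capacity check for a sequential, one-query-at-a-time service.
latency_s = 0.100            # 100 ms per query
throughput = 1 / latency_s   # = 10 queries per second per worker

visitors_per_second = 100    # average load stated by the product team
workers_needed = visitors_per_second / throughput

print(f"Throughput per worker: {throughput:.0f} queries/s")
print(f"Workers needed for {visitors_per_second} visitors/s: {workers_needed:.0f}")
# -> 10 queries/s per worker, so roughly 10 parallel workers (or an equivalent
#    batching strategy) are needed just to keep up with the average load.
```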
Solutions:
To optimize this combination of latency and throughput, some useful approaches are:
- Use simpler models
- Quantization: reduce the precision of model weights and activations. This can significantly shrink the model’s size and inference time, at the cost of some accuracy (a minimal sketch follows this list).
- Pruning techniques: identify and remove less important weights or neurons from your neural network, reducing model size and improving inference speed.
- Utilize hardware accelerators like GPUs or TPUs, together with optimized inference runtimes (e.g., NVIDIA TensorRT).
- Batch processing: process requests in batches rather than one sample at a time. This reduces per-request overhead and improves throughput.
- Asynchronous inference: handle multiple inference requests concurrently instead of blocking on each one. This can reduce waiting times and improve throughput; the micro-batching sketch after this list combines both ideas.
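For quantization, one common option is dynamic quantization in PyTorch, which stores the weights of selected layer types as 8-bit integers. This is a minimal sketch assuming a recent PyTorch version; the model architecture and layer sizes are arbitrary placeholders.

```python
import torch
import torch.nn as nn

# A hypothetical recommendation scoring model (layer sizes are arbitrary).
model = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, 1),
)
model.eval()

# Dynamic quantization: store Linear weights as int8 and dequantize on the fly.
# This usually shrinks the model and speeds up CPU inference, at some accuracy cost.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
print(quantized(x))
```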
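And here is a minimal sketch of asynchronous micro-batching with Python’s asyncio: incoming requests are queued and served in small batches by a background worker, trading a few milliseconds of waiting for much higher throughput. The batch size, wait time, and model call are all hypothetical stand-ins, not a real recommender.

```python
import asyncio

MAX_BATCH = 32      # maximum requests per batch (hypothetical value)
MAX_WAIT_S = 0.01   # wait at most 10 ms to fill a batch (hypothetical value)

def model_predict_batch(users):
    # Hypothetical stand-in for one batched call to the real recommendation model.
    return [f"recommendations for {u}" for u in users]

async def batch_worker(queue):
    # Collect queued requests into small batches and answer them together.
    while True:
        requests = [await queue.get()]
        deadline = asyncio.get_running_loop().time() + MAX_WAIT_S
        while len(requests) < MAX_BATCH:
            timeout = deadline - asyncio.get_running_loop().time()
            if timeout <= 0:
                break
            try:
                requests.append(await asyncio.wait_for(queue.get(), timeout))
            except asyncio.TimeoutError:
                break
        results = model_predict_batch([user for user, _ in requests])
        for (_, future), result in zip(requests, results):
            future.set_result(result)

async def recommend(queue, user):
    # Each caller enqueues its request and awaits the worker's answer.
    future = asyncio.get_running_loop().create_future()
    await queue.put((user, future))
    return await future

async def main():
    queue = asyncio.Queue()
    worker = asyncio.create_task(batch_worker(queue))
    answers = await asyncio.gather(*(recommend(queue, f"user-{i}") for i in range(100)))
    print(len(answers), "requests served")
    worker.cancel()

asyncio.run(main())
```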
Conclusion
On top of the complexity of building ML models, a data scientist or engineer has to consider many production aspects, such as the latency and throughput discussed here, along with scalability and effectiveness, even before the project starts, to make sure resources are not wasted.