Standalone AI models for on-prem use

There is more than one reason why you might want to run your AI standalone. A simple one is that you aren’t comfortable sharing your company data with a third party. But which model should you aim for, and for which use cases?

Written by Mattias Skarin
Published on November 7, 2024

What to consider when selecting a standalone model?

  • Where you want to run it, or in short, the compute requirement. For example, if you plan to run the model on an IoT device, a smartphone, or a laptop, aim for models in the 1–11B parameter range (see the sketch after this list).
  • What use cases it is good for. Not all models are equal; what a model is good at depends largely on the data set it was trained on.
  • The license model. Regardless of functionality, the license still needs to be scrutinized. It may restrict specific use cases or prevent fine-tuning on your own data.

    For example, OpenAI does not allow the use of Whisper in high-stakes decisions:

    “Our usage policies prohibit use in certain high-stakes decision-making contexts, and our model card for open-source use includes recommendations against use in high-risk domains”

    (which basically rules out use in a wide range of contexts without human supervision)
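
As a back-of-the-envelope check for the first point above, the memory needed for a model’s weights is roughly parameters × bits per weight. Here is a minimal sketch; the function name, headroom factor, and example numbers are my own illustration, and real usage is higher once you add context cache and runtime overhead:

```python
def fits_in_memory(params_billion: float, ram_gb: float,
                   bits_per_weight: int = 4, headroom: float = 0.7) -> bool:
    """Rough check: do the quantized weights fit in this much RAM,
    using at most `headroom` of it so the OS, context cache, and
    runtime still have room?"""
    # 1e9 params * (bits / 8) bytes per param = params_billion * bits / 8 GB
    weight_gb = params_billion * bits_per_weight / 8
    return weight_gb <= ram_gb * headroom

# An 11B model at 4-bit quantization needs ~5.5 GB and just fits in 8 GB:
print(fits_in_memory(11, ram_gb=8))   # True
# A 70B model at 4-bit needs ~35 GB and clearly does not:
print(fits_in_memory(70, ram_gb=8))   # False
```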

Models to aim for – the short summary

  • For a balance between complexity and speed: Llama 3.2 8B is the best choice.
  • For handling large documents and complex reasoning: aim for Mistral NeMo 12B, but keep in mind that it requires more resources and runs slower.
  • For coding tasks in Java, Python, and Rust, consider Zamba2; for coding tasks in C#, consider DeepSeek-Coder.

General notes

2B–8B models are smaller and faster and can handle simpler tasks. They are lightweight and a good fit for smaller, real-time tasks.

10B–70B models are larger and slower, and better for complex, nuanced tasks. They offer stronger language understanding and generation, but are resource-intensive.

90B–405B models are for a server park with high-end GPUs.

LLaVA (Large Language and Vision Assistant) is a multimodal model that aims at general vision-and-language understanding by combining a visual encoder with a large language model.
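
To make “multimodal” concrete, here is a minimal sketch of sending an image to a locally served LLaVA model. It assumes an Ollama server on its default port with a LLaVA variant pulled (e.g. via `ollama pull llava`); the file name and prompt are placeholders, and other local runtimes expose similar APIs:

```python
import base64
import requests  # pip install requests

# Read the image and base64-encode it, as Ollama's generate API expects.
with open("chart.png", "rb") as f:  # placeholder file name
    image_b64 = base64.b64encode(f.read()).decode("ascii")

# Ask the local LLaVA model to interpret the image.
resp = requests.post(
    "http://localhost:11434/api/generate",  # Ollama's default endpoint
    json={
        "model": "llava",                 # assumes the model has been pulled
        "prompt": "What does this chart show?",
        "images": [image_b64],            # list of base64-encoded images
        "stream": False,                  # return one JSON object, not a stream
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])
```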

Quantization reduces the memory usage of a model by storing its weights with fewer bits of precision. The trade-off is accuracy: the fewer the bits, the more quality degrades. For example, an 8B-parameter model needs roughly 16 GB for its weights at FP16, 8 GB at Q8, and 4 GB at Q4. As a rule of thumb, aim for Q4 or higher.

Standalone AI Models

| Model | Parameters | Quantization | Best for |
| --- | --- | --- | --- |
| Llama 3.2 | 1B, 3B, 11B, 90B, 405B | Q8, FP16 | Advanced tools, large-scale tasks. The 1B and 3B models are text only; 11B and above can reason over high-resolution images. |
| Gemma 2 | 2B, 9B, 27B | Q8 | Efficient text generation and language tasks |
| Mistral NeMo | 12B, 70B | Q4 | Long-context tasks, multilingual support. Available under Apache 2.0. |
| LLaVA Phi 3 | 3.8B | – | Visual recognition, chart interpretation. Trained on additional document, chart, and diagram data sets; a fine-tuned LLaVA model with benchmark performance on par with the original LLaVA. |
| Nvidia NVLM 1.0 | 72B | – | Visual tasks, summarizing manual notes, interpreting images |
| Qwen2 | 0.5B, 1.5B, 7B, 72B | Q4, Q8 | Text processing, general AI tasks |
| DeepSeek-Coder | 16B, 236B | Q8, FP16 | Code generation, fill-in-the-middle tasks |
| CodeGemma | 2B, 7B | Q8 | Code generation, instruction-following tasks |
| Zamba2 Instruction-Tuned | 1.2B, 2.7B | – | Coding in Rust, Python, and Java; chat conversations. Low memory footprint; beats Gemma 2 and Mistral 7B. Apache 2.0 licensed. |

How can I take a standalone model for a spin?

If you would like to try out a standalone model on your own computer without coding in Python, try one of these three tools.
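
And if you do later want to script against a local model, most local runtimes expose an OpenAI-compatible HTTP endpoint, so a few lines of Python are enough. A minimal sketch, assuming an Ollama server on its default port with llama3.2 already pulled:

```python
from openai import OpenAI  # pip install openai

# Point the standard OpenAI client at the local server instead of the cloud.
client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
    api_key="unused",                      # required by the client, ignored locally
)

reply = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user",
               "content": "Give three reasons to run an AI model on-prem."}],
)
print(reply.choices[0].message.content)
```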
