Standalone AI models for on-prem use

There is more than one reason why you might want to run your AI standalone. A simple one is that you aren’t comfortable sharing your company data with a third party. But which model should you aim for, and which use cases should you consider it for?

Written by Mattias Skarin
Published on 2024-11-07

What should you consider when selecting a standalone model?

  • Where you want to run it – in short, the compute requirement. For example, if you plan to run the model on an IoT device, a smartphone, or a laptop, aim for models in the 1-11B parameter range (see the memory sketch after this list).
  • What use case it is good for. Not all models are equal: what a model is good at depends largely on the data set it was trained on.
  • The license model. Regardless of functionality, the license still needs to be scrutinized. It may restrict you from specific use cases, or from fine-tuning the model on your own data.

    For example, OpenAI does not allow the use of Whisper in high-stakes decisions:

    “Our usage policies prohibit use in certain high-stakes decision-making contexts, and our model card for open-source use includes recommendations against use in high-risk domains”

    (which basically rules out applicability in a wide range of contexts without human supervision)
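To gauge the compute requirement quickly, you can estimate memory use from the parameter count. The sketch below uses the rough rule that weights take parameters × bits ÷ 8 bytes; the 20% overhead for the KV cache and runtime buffers is an assumption, not a measured figure.

```python
# Rough memory estimate for hosting a model locally.
# Weights take roughly params * bits / 8 bytes; the 1.2 factor is an
# assumed ~20% overhead for the KV cache and runtime buffers.

def estimated_memory_gb(params_billion: float, bits_per_param: int) -> float:
    weights_gb = params_billion * bits_per_param / 8  # 1B params at 8 bits ~ 1 GB
    return weights_gb * 1.2

for name, params, bits in [
    ("Llama 3.2 3B @ Q4", 3, 4),          # fits a smartphone or laptop
    ("Llama 3.2 11B @ Q8", 11, 8),        # needs a well-equipped laptop or workstation
    ("Llama 3.1 405B @ FP16", 405, 16),   # needs a server park
]:
    print(f"{name}: ~{estimated_memory_gb(params, bits):.0f} GB")
```

The numbers (~2 GB, ~13 GB, and ~970 GB respectively) show why the 1-11B range is the realistic target for consumer hardware.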

Models to aim for – the short summary

  • For a balance between complexity and speed: Llama 3.1 8B is the best choice.
  • For handling large documents and complex reasoning: aim for Mistral-Nemo 12B, but keep in mind that it requires more resources and runs slower.
  • For coding tasks in Java, Python, and Rust, consider Zamba2. For coding tasks in C#, consider DeepSeek-Coder.
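If you would rather drive one of these models from code than from a chat UI, the official ollama Python client offers a minimal path. The sketch below assumes Ollama is installed and the model has already been pulled; the llama3.1:8b tag is illustrative.

```python
# Minimal local chat using the official ollama Python client (pip install ollama).
# Assumes the Ollama server is running and `ollama pull llama3.1:8b` has been done.
import ollama

response = ollama.chat(
    model="llama3.1:8b",  # illustrative tag; pick whichever model fits your hardware
    messages=[{"role": "user", "content": "Why might a company run AI models on-prem?"}],
)
print(response["message"]["content"])
```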

General notes

2B-8B models are small, fast, and handle simpler tasks well. They are lightweight and a good fit for smaller, real-time tasks.

10B-70B models are larger, slower, and better for complex, nuanced tasks. They perform better at language understanding and generation, but are resource-intensive.

90B-405B models are for a server park with high-end GPUs.

LLaVA (Large Language and Vision Assistant) is a multimodal model that aims at general vision-and-language understanding by combining a visual encoder with a large language model.
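As a concrete taste of the multimodal side, here is a sketch that asks a locally served LLaVA model about an image through Ollama's REST API. It assumes `ollama pull llava` has been run and the server is listening on its default port; chart.png is a placeholder path.

```python
# Send an image question to a local LLaVA model via Ollama's REST API.
import base64
import requests

with open("chart.png", "rb") as f:  # placeholder image path
    image_b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:11434/api/generate",  # Ollama's default endpoint
    json={
        "model": "llava",
        "prompt": "What trend does this chart show?",
        "images": [image_b64],
        "stream": False,  # return one JSON object instead of a token stream
    },
)
print(resp.json()["response"])
```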

Quantization reduces the memory usage of a model by storing its weights with fewer bits of precision. The trade-off is accuracy. As a rule of thumb, aim for Q4 or higher.
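To make the trade-off concrete, the arithmetic below compares the approximate weight footprint of a hypothetical 8B-parameter model at common quantization levels (weights only, runtime overhead excluded):

```python
# Approximate weight footprint at common quantization levels:
# size_gb = params (billions) * bits per parameter / 8.
PARAMS_B = 8  # a hypothetical 8B-parameter model

for label, bits in [("FP16", 16), ("Q8", 8), ("Q4", 4), ("Q2", 2)]:
    print(f"{label}: ~{PARAMS_B * bits / 8:.0f} GB")
# FP16 ~16 GB, Q8 ~8 GB, Q4 ~4 GB, Q2 ~2 GB. Below Q4 the accuracy loss
# usually outweighs the memory savings, hence the rule of thumb above.
```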

Standalone AI Models

| Model | Parameters | Quantization | Best for |
| --- | --- | --- | --- |
| Llama 3.2 | 1B, 3B, 11B, 90B (405B via Llama 3.1) | Q8, FP16 | Advanced tools, large-scale tasks. The 1B and 3B models are text-only; 11B and above can reason over high-resolution images. |
| Gemma 2 | 2B, 9B, 27B | Q8 | Efficient text generation and language tasks. |
| Mistral-Nemo | 12B | Q4 | Long-context tasks, multilingual support. Available under Apache 2.0. |
| LLaVA Phi 3 | 3.8B | – | Visual recognition, chart interpretation. Trained on additional document, chart, and diagram data sets. A LLaVA model fine-tuned from Phi 3, with benchmark performance on par with the original LLaVA. |
| Nvidia NVLM 1.0 | 72B | – | Visual tasks, summarizing manual notes, interpreting images. See Nvidia's release notes. |
| Qwen2 | 0.5B, 1.5B, 7B, 72B | Q4, Q8 | Text processing, general AI tasks. |
| DeepSeek-Coder | 16B, 236B | Q8, FP16 | Code generation, fill-in-the-middle tasks. |
| CodeGemma | 2B, 7B | Q8 | Code generation, instruction-following tasks. |
| Zamba2 (instruction-tuned) | 1.2B, 2.7B | – | Coding in Rust, Python, and Java, plus chat conversations. Low memory footprint; beats Gemma 2 and Mistral 7B. Apache 2.0 licensed. |

How can I take a standalone model for a spin?

If you would like to try out a standalone model on your own computer, but without coding in Python, give one of these three tools a spin.