Standalone AI models for on-prem use

There is more than one reason why you might want to run your AI standalone. A simple one is that you aren’t comfortable sharing your company data with a third party. But which model should you aim for, and for which use cases?

Written by Mattias Skarin
Published on November 7, 2024

What to consider when selecting a standalone model?

  • Where you want to run it, or in short, the compute requirement. For example, if you plan to run the model on an IoT device, a smartphone, or a laptop, aim for models in the 1–11B parameter range (see the sketch after this list).
  • What use cases it is good for. Not all models are equal; what a model is good at depends largely on the data set it was trained on.
  • The license model. Regardless of functionality, the license still needs to be scrutinized. It may restrict specific use cases or prevent fine-tuning on your own data.

    For example, OpenAI does not allow the use of Whisper in high-stakes decisions:

    “Our usage policies prohibit use in certain high-stakes decision-making contexts, and our model card for open-source use includes recommendations against use in high-risk domains”

    (which basically rules out use in a wide range of contexts without human supervision)
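
As a back-of-the-envelope check for the first point above, the memory needed for a model’s weights is roughly parameters × bits per weight. Here is a minimal sketch; the function name, headroom factor, and example numbers are my own illustration, and real usage is higher once you add context cache and runtime overhead:

```python
def fits_in_memory(params_billion: float, ram_gb: float,
                   bits_per_weight: int = 4, headroom: float = 0.7) -> bool:
    """Rough check: do the quantized weights fit in this much RAM,
    using at most `headroom` of it so the OS, context cache, and
    runtime still have room?"""
    # 1e9 params * (bits / 8) bytes per param = params_billion * bits / 8 GB
    weight_gb = params_billion * bits_per_weight / 8
    return weight_gb <= ram_gb * headroom

# An 11B model at 4-bit quantization needs ~5.5 GB and just fits in 8 GB:
print(fits_in_memory(11, ram_gb=8))   # True
# A 70B model at 4-bit needs ~35 GB and clearly does not:
print(fits_in_memory(70, ram_gb=8))   # False
```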

Models to aim for – the short summary

  • For a balance between complexity and speed: Llama 3.2 8B is the best choice.
  • For handling large documents and complex reasoning: aim for Mistral NeMo 12B, but keep in mind that it requires more resources and runs slower.
  • For coding tasks in Java, Python, and Rust, consider Zamba2; for coding tasks in C#, consider DeepSeek-Coder.

General notes

2B–8B models are smaller and faster and can handle simpler tasks. They are lightweight and a good fit for smaller, real-time tasks.

10B–70B models are larger and slower, and better for complex, nuanced tasks. They offer stronger language understanding and generation, but are resource-intensive.

90B–405B models are for a server park with high-end GPUs.

LLaVA (Large Language and Vision Assistant) is a multimodal model that aims at general vision-and-language understanding by combining a visual encoder with a large language model.
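
To make “multimodal” concrete, here is a minimal sketch of sending an image to a locally served LLaVA model. It assumes an Ollama server on its default port with a LLaVA variant pulled (e.g. via `ollama pull llava`); the file name and prompt are placeholders, and other local runtimes expose similar APIs:

```python
import base64
import requests  # pip install requests

# Read the image and base64-encode it, as Ollama's generate API expects.
with open("chart.png", "rb") as f:  # placeholder file name
    image_b64 = base64.b64encode(f.read()).decode("ascii")

# Ask the local LLaVA model to interpret the image.
resp = requests.post(
    "http://localhost:11434/api/generate",  # Ollama's default endpoint
    json={
        "model": "llava",                 # assumes the model has been pulled
        "prompt": "What does this chart show?",
        "images": [image_b64],            # list of base64-encoded images
        "stream": False,                  # return one JSON object, not a stream
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])
```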

Quantization reduces the memory usage of a model by storing its weights with fewer bits of precision. The trade-off is accuracy: the fewer the bits, the more quality degrades. For example, an 8B-parameter model needs roughly 16 GB for its weights at FP16, 8 GB at Q8, and 4 GB at Q4. As a rule of thumb, aim for Q4 or higher.

Standalone AI Models

| Model | Parameters | Quantization | Best for |
| --- | --- | --- | --- |
| Llama 3.2 | 1B, 3B, 11B, 90B, 405B | Q8, FP16 | Advanced tools, large-scale tasks. The 1B and 3B models are text only; 11B and above can reason over high-resolution images. |
| Gemma 2 | 2B, 9B, 27B | Q8 | Efficient text generation and language tasks |
| Mistral NeMo | 12B, 70B | Q4 | Long-context tasks, multilingual support. Available under Apache 2.0. |
| LLaVA Phi 3 | 3.8B | – | Visual recognition, chart interpretation. Trained on additional document, chart, and diagram data sets; a fine-tuned LLaVA model with benchmark performance on par with the original LLaVA. |
| Nvidia NVLM 1.0 | 72B | – | Visual tasks, summarizing manual notes, interpreting images |
| Qwen2 | 0.5B, 1.5B, 7B, 72B | Q4, Q8 | Text processing, general AI tasks |
| DeepSeek-Coder | 16B, 236B | Q8, FP16 | Code generation, fill-in-the-middle tasks |
| CodeGemma | 2B, 7B | Q8 | Code generation, instruction-following tasks |
| Zamba2 Instruction-Tuned | 1.2B, 2.7B | – | Coding in Rust, Python, and Java; chat conversations. Low memory footprint; beats Gemma 2 and Mistral 7B. Apache 2.0 licensed. |

How can I take a standalone model for a spin?

If you would like to try out a standalone model on your own computer without coding in Python, try one of these three tools.
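
And if you do later want to script against a local model, most local runtimes expose an OpenAI-compatible HTTP endpoint, so a few lines of Python are enough. A minimal sketch, assuming an Ollama server on its default port with llama3.2 already pulled:

```python
from openai import OpenAI  # pip install openai

# Point the standard OpenAI client at the local server instead of the cloud.
client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
    api_key="unused",                      # required by the client, ignored locally
)

reply = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user",
               "content": "Give three reasons to run an AI model on-prem."}],
)
print(reply.choices[0].message.content)
```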
