Benchmarks¶
LLM benchmarks for Jarvis are tracked in the docs/benchmarks/ directory of the main repository. These benchmarks measure command parsing accuracy, latency, and memory usage across different models and backends.
What Is Measured¶
- Command parsing accuracy -- Does the LLM correctly identify the intended command and extract the right parameters from a voice transcription?
- Latency -- Time from receiving the transcription to returning a tool call (inference time).
- Memory usage -- VRAM and RAM consumption for different model sizes and quantization levels.
Tested Configurations¶
Benchmarks cover multiple dimensions:
| Dimension | Variants |
|---|---|
| Models | Qwen 2.5 (7B, 14B), Qwen 3 (14B, 32B), Llama 3.1 8B, Hermes 3 8B, others |
| Backends | MLX (macOS Metal), GGUF (llama.cpp), vLLM (CUDA) |
| Quantization | Q4_K_M, Q6_K, FP16 |
| Adapters | Base model vs LoRA fine-tuned |
Where to Find Them¶
Benchmark results and comparison tables are maintained in:
Each benchmark run records the model, backend, quantization level, hardware, test suite version, and per-command accuracy breakdown.
Running Benchmarks¶
Use the E2E command parsing test suite to generate benchmark data:
cd jarvis-node-setup
# Run all command parsing tests
python test_command_parsing.py -o benchmark_results.json
# Run for specific commands
python test_command_parsing.py -c calculate get_weather send_email -o benchmark_results.json
Results include per-command success rates, average response times, and a confusion matrix showing which commands get misclassified.