LLM Proxy¶

The LLM proxy provides a unified inference API across multiple backends --- MLX (Apple Silicon), GGUF (llama.cpp), vLLM (NVIDIA), Transformers (HuggingFace), and REST (remote APIs). It runs a 2-model system (live for real-time voice commands, background for async tasks), manages LoRA adapter training and loading, and processes async jobs via a Redis queue.

Quick Reference¶


Ports	7704 (API server), 7705 (model service)
Health endpoint	`GET /health`
Source	`jarvis-llm-proxy-api/`
Framework	FastAPI + Uvicorn
Tier	2 --- Command Processing

Architecture¶

The service runs as three processes, started by run.sh:

graph LR
    CC["Command Center<br/>:7703"] -->|"/v1/chat/completions"| API["API Server<br/>:7704"]
    API -->|"/internal/model/chat"| MS["Model Service<br/>:7705"]
    API -->|"enqueue job"| Redis["Redis Queue"]
    Redis --> QW["Queue Worker"]
    QW -->|"/internal/model/chat"| MS
    MS --> Backend["LLM Backend<br/>(MLX / GGUF / vLLM / Transformers / REST)"]

    subgraph "Process 1"
        API
    end
    subgraph "Process 2"
        MS
        Backend
    end
    subgraph "Process 3"
        QW
    end

Process	Port	Purpose
API Server	7704	Public-facing FastAPI app. Proxies chat requests to the model service, serves settings/training/pipeline endpoints. Does not load models.
Model Service	7705	Internal FastAPI app. Owns `ModelManager`, loads backends, runs inference. Protected by `X-Internal-Token`.
Queue Worker	---	RQ (Redis Queue) worker. Processes async jobs: background chat, adapter training, vision inference. Always uses the `background` model.
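
As a rough illustration of the `X-Internal-Token` protection on the model service, the check could look like the sketch below. The function name and the idea of a single shared secret are assumptions made for this example; the source does not show how the token is configured or validated.

```python
import hmac
from typing import Mapping


def is_internal_request(headers: Mapping[str, str], expected_token: str) -> bool:
    """Return True when the X-Internal-Token header matches the shared secret.

    Hypothetical helper: the real service may validate the token differently.
    """
    # An unset secret must never authorize anything.
    if not expected_token:
        return False
    # HTTP header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    supplied = normalized.get("x-internal-token", "")
    # compare_digest avoids leaking the mismatch position through timing.
    return hmac.compare_digest(supplied.encode(), expected_token.encode())
```

A caller such as the API server or queue worker would then attach the same header to every request it sends to port 7705.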

macOS exception

On macOS, the model service is disabled (RUN_MODEL_SERVICE=false). The API server loads models in-process to access Metal/MLX directly. The jarvis CLI handles this automatically.

2-Model System¶

The service maintains two model slots to balance latency and capability:

Slot	Purpose	Used By	Example
live	Real-time voice commands. Optimized for low latency.	Chat endpoint (default)	Qwen3-14B-Q6_K.gguf
background	Heavier async tasks. Can be a larger model.	Queue worker (always)	Qwen3-32B-Q4_K_M.gguf

Memory optimization: If both slots resolve to the same model path and backend type, ModelManager creates only one backend instance and shares it between them. This roughly halves model memory compared with loading the same model twice, which matters on constrained hardware.
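
The sharing rule can be pictured as a lookup keyed by backend type and model path. This is a minimal sketch, not the actual `ModelManager`: the `SlotRegistry` class and its `factory` argument are hypothetical names introduced for illustration.

```python
from typing import Callable, Dict, Tuple


class SlotRegistry:
    """Hypothetical helper: one backend instance per (backend type, model path)."""

    def __init__(self, factory: Callable[[str, str], object]) -> None:
        self._factory = factory
        self._instances: Dict[Tuple[str, str], object] = {}
        self.slots: Dict[str, object] = {}

    def assign(self, slot: str, backend_type: str, model_path: str) -> object:
        key = (backend_type, model_path)
        # Only build a backend the first time this combination is seen.
        if key not in self._instances:
            self._instances[key] = self._factory(backend_type, model_path)
        # Later slots with the same key point at the same object.
        self.slots[slot] = self._instances[key]
        return self.slots[slot]
```

Assigning `live` and `background` with identical arguments calls the factory once, so both slots reference a single loaded model.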

Configuration Cascade¶

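Each model slot checks settings in order and uses the first one that is set: DB setting -> environment variable -> legacy fallback -> default. The background slot has no legacy variable or default of its own; it falls back to whatever the live slot resolves to.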

Live model:
  Backend: model.live.backend → JARVIS_LIVE_MODEL_BACKEND → JARVIS_MODEL_BACKEND → "GGUF"
  Path:    model.live.name    → JARVIS_LIVE_MODEL_NAME    → JARVIS_MODEL_NAME

Background model:
  Backend: model.background.backend → JARVIS_BACKGROUND_MODEL_BACKEND → (falls back to live)
  Path:    model.background.name    → JARVIS_BACKGROUND_MODEL_NAME    → (falls back to live)
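
The cascade for the backend setting can be sketched as below. The setting keys and variable names come from the listing above; the function names are hypothetical, and treating an empty string as "not set" is an assumption of this sketch.

```python
from typing import Mapping, Optional, Sequence


def first_set(candidates: Sequence[Optional[str]]) -> Optional[str]:
    """Return the first non-empty candidate, or None if none is set."""
    for value in candidates:
        if value:  # assumption: empty strings count as unset
            return value
    return None


def resolve_live_backend(db: Mapping[str, str], env: Mapping[str, str]) -> str:
    # DB setting -> slot env var -> legacy env var -> default.
    return first_set([
        db.get("model.live.backend"),
        env.get("JARVIS_LIVE_MODEL_BACKEND"),
        env.get("JARVIS_MODEL_BACKEND"),
    ]) or "GGUF"


def resolve_background_backend(db: Mapping[str, str], env: Mapping[str, str]) -> str:
    # DB setting -> slot env var -> whatever the live slot resolves to.
    return first_set([
        db.get("model.background.backend"),
        env.get("JARVIS_BACKGROUND_MODEL_BACKEND"),
    ]) or resolve_live_backend(db, env)
```

In practice `env` would be `os.environ` and `db` a view of the settings table; plain dictionaries work for experimenting with the precedence.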

Hot Swap¶

Models can be swapped at runtime without restarting the service:

POST /internal/model/unload --- unloads all models (used before adapter training to free GPU VRAM)
POST /internal/model/reload --- reloads models from current settings

Inference Backends¶

All backends extend LLMBackendBase and implement generate_text_chat() and unload(). Optional methods include generate_vision_chat(), generate_text_chat_stream(), load_adapter(), and remove_adapter().

GGUF (llama.cpp)¶

The primary local backend. Works on all platforms.


Library	`llama-cpp-python`
GPU	Metal (macOS), CUDA (Linux)
Adapter support	Constructor-based reload (destroys + recreates model with `lora_path`)
Multi-GPU	Yes, via `JARVIS_GGUF_TENSOR_SPLIT` (e.g., `"0.5,0.5"` for 2 GPUs)

Features:

Thread-safe inference via threading.Lock
Context caching (hash-based prefix matching)
Flash attention support (JARVIS_FLASH_ATTN=true)
Mirostat sampling
Warmup inference on load
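
To illustrate the first feature, lock-guarded inference can be sketched as a thin wrapper. `LockedBackend` is a hypothetical name; the real backend presumably holds the lock internally rather than wrapping a callable.

```python
import threading
from typing import Callable, List


class LockedBackend:
    """Hypothetical wrapper that allows one inference call at a time."""

    def __init__(self, generate: Callable[[List[dict]], str]) -> None:
        self._generate = generate
        self._lock = threading.Lock()

    def generate_text_chat(self, messages: List[dict]) -> str:
        # Concurrent callers queue up here, so the underlying model
        # only ever sees one request at a time.
        with self._lock:
            return self._generate(messages)
```

The trade-off is that requests are serialized: a second caller waits for the first generation to finish instead of running in parallel.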

Key env vars: JARVIS_N_GPU_LAYERS (-1 for all), JARVIS_N_THREADS, JARVIS_N_BATCH (512), JARVIS_FLASH_ATTN (true), JARVIS_GGUF_SPLIT_MODE, JARVIS_GGUF_TENSOR_SPLIT.

MLX (Apple Silicon)¶

Native Apple Silicon backend using the MLX framework.


Library	`mlx-lm`
GPU	Metal (unified memory)
Adapter support	In-place weight swap --- no model reload needed
Platform	macOS only

Features:

KV cache prefix matching: finds the longest common prefix between the cached tokens and the new prompt's tokens, trims the cache to that prefix, and processes only the remaining suffix. Requests that repeat the same system prompt therefore skip recomputing it.
Dynamic LoRA adapter swap via mlx_lm.tuner.utils.load_adapters() / remove_lora_layers() (modifies weights in place)
Vision support via PIL image handling
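
The prefix-matching idea can be shown on plain token ID lists. This sketch covers only the bookkeeping (how many cached tokens to keep, which tokens still need a forward pass); the function names are hypothetical and no MLX calls are involved.

```python
from typing import List, Sequence, Tuple


def common_prefix_length(cached: Sequence[int], new: Sequence[int]) -> int:
    """Count the leading tokens shared by the cached and the new prompt."""
    length = 0
    for cached_token, new_token in zip(cached, new):
        if cached_token != new_token:
            break
        length += 1
    return length


def plan_cache_reuse(cached: Sequence[int], new: Sequence[int]) -> Tuple[int, List[int]]:
    """Return (cached tokens to keep, suffix tokens still to process)."""
    keep = common_prefix_length(cached, new)
    # Everything after the shared prefix is trimmed from the cache,
    # and only the new prompt's suffix goes through the model.
    return keep, list(new[keep:])
```

For example, a cache holding tokens `[1, 2, 3, 4]` and a new prompt `[1, 2, 9, 9]` share two tokens, so two cache entries are kept and only `[9, 9]` is processed.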

vLLM (High-Throughput GPU)¶

High-throughput backend for Linux with NVIDIA GPUs.


Library	`vllm`
GPU	CUDA (NVIDIA)
Adapter support	Per-request LoRA via `LoRARequest` --- no reload, concurrent adapters
Multi-GPU	Yes, via `tensor_parallel_size`

Features:

Native prefix caching
Per-request LoRA selection (multiple adapters served concurrently)
JSON structured output via StructuredOutputsParams
Manual chat template formatting (ChatML, Llama3, Mistral)
Supports GGUF files with explicit tokenizer override

Key env vars: JARVIS_VLLM_TENSOR_PARALLEL_SIZE, JARVIS_VLLM_GPU_MEMORY_UTILIZATION (0.9), JARVIS_VLLM_MAX_LORAS (1), JARVIS_VLLM_MAX_LORA_RANK (64).

Transformers (HuggingFace)¶

General-purpose backend using the HuggingFace ecosystem.


Library	`transformers`, `torch`
GPU	CUDA, MPS, CPU (auto-detected)
Adapter support	PEFT LoRA via `PeftModel.from_pretrained()`
Quantization	BitsAndBytes 4-bit / 8-bit

Key env vars: JARVIS_DEVICE, JARVIS_TORCH_DTYPE, JARVIS_USE_QUANTIZATION, JARVIS_QUANTIZATION_TYPE.

REST (Remote API Proxy)¶

Proxies inference to remote APIs --- useful for cloud LLMs, hosted inference servers, or distributed setups. This is how you connect Jarvis to OpenAI, Anthropic, Ollama on another machine, or your own hosted GPU server.


Library	`httpx` (async)
Providers	OpenAI, Anthropic, Ollama, LM Studio, generic
Auth	Bearer token, API key, or custom headers
Vision	Yes (converts images to data URLs)

Quick Setup¶

Set the backend to REST and point it at your provider:

OpenAI:

# In .env or via settings DB
JARVIS_LIVE_MODEL_BACKEND=REST
JARVIS_LIVE_REST_MODEL_URL=https://api.openai.com
JARVIS_REST_PROVIDER=openai
JARVIS_REST_MODEL_NAME=gpt-4o
JARVIS_REST_AUTH_TYPE=bearer
JARVIS_REST_AUTH_TOKEN=sk-your-api-key

Anthropic:

JARVIS_LIVE_MODEL_BACKEND=REST
JARVIS_LIVE_REST_MODEL_URL=https://api.anthropic.com
JARVIS_REST_PROVIDER=anthropic
JARVIS_REST_MODEL_NAME=claude-sonnet-4-20250514
JARVIS_REST_AUTH_TYPE=api_key
JARVIS_REST_AUTH_TOKEN=sk-ant-your-api-key

Ollama (remote):

JARVIS_LIVE_MODEL_BACKEND=REST
JARVIS_LIVE_REST_MODEL_URL=http://192.168.1.50:11434
JARVIS_REST_PROVIDER=ollama
JARVIS_REST_MODEL_NAME=qwen2.5:14b
JARVIS_REST_AUTH_TYPE=none

Self-hosted GPU server:

JARVIS_LIVE_MODEL_BACKEND=REST
JARVIS_LIVE_REST_MODEL_URL=https://llm.yourdomain.com
JARVIS_REST_PROVIDER=openai
JARVIS_REST_MODEL_NAME=your-model
JARVIS_REST_AUTH_TYPE=bearer
JARVIS_REST_AUTH_TOKEN=your-server-token
JARVIS_REST_REQUEST_FORMAT=openai

Hybrid Setup (Local Live + Cloud Background)¶

You can mix local and remote backends --- e.g., fast local model for real-time voice, larger cloud model for deep research:

# Live: local GGUF for low-latency voice commands
JARVIS_LIVE_MODEL_BACKEND=GGUF
JARVIS_LIVE_MODEL_NAME=.models/Qwen3-14B-Q6_K.gguf

# Background: cloud model for async tasks (deep research, summarization)
JARVIS_BACKGROUND_MODEL_BACKEND=REST
JARVIS_BACKGROUND_REST_MODEL_URL=https://api.openai.com
JARVIS_REST_BACKGROUND_MODEL_NAME=gpt-4o
JARVIS_REST_AUTH_TYPE=bearer
JARVIS_REST_AUTH_TOKEN=sk-your-api-key

Full REST Configuration¶

Variable	Default	Description
`JARVIS_LIVE_REST_MODEL_URL`	---	Base URL for live model API
`JARVIS_BACKGROUND_REST_MODEL_URL`	---	Base URL for background model API (falls back to live)
`JARVIS_REST_PROVIDER`	`generic`	Provider type: `openai`, `anthropic`, `ollama`, `lmstudio`, `generic`
`JARVIS_REST_MODEL_NAME`	---	Model name for live requests
`JARVIS_REST_BACKGROUND_MODEL_NAME`	---	Model name for background requests
`JARVIS_REST_AUTH_TYPE`	`none`	Auth type: `bearer`, `api_key`, `custom`, `none`
`JARVIS_REST_AUTH_TOKEN`	---	Auth token or API key
`JARVIS_REST_AUTH_HEADER`	`Authorization`	Custom auth header name (for `custom` auth type)
`JARVIS_REST_REQUEST_FORMAT`	`openai`	Request format: `openai`, `ollama`, `chatml`, `generic`
`JARVIS_REST_TIMEOUT`	`60`	Request timeout in seconds

The provider setting (JARVIS_REST_PROVIDER) controls response parsing, since different APIs return results in different shapes. The request format (JARVIS_REST_REQUEST_FORMAT) controls how messages are serialized. For most OpenAI-compatible APIs (vLLM, LM Studio, text-generation-webui), set both to openai.

Mock (Testing)¶

Returns [mock-text:model] <input>. Used in tests.

Backend Comparison¶

Backend	Platform	Adapter Loading	Multi-GPU	Streaming	Vision
GGUF	All	Model reload	Tensor split	Yes	Separate backend
MLX	macOS	In-place swap	N/A (unified)	Yes	Yes
vLLM	Linux+CUDA	Per-request	Tensor parallel	Yes	Separate backend
Transformers	All	PEFT merge	device_map	No	Separate backend
REST	All (network)	N/A	N/A	No	Yes

Redis Queue¶

Async jobs are processed via RQ (Redis Queue). The queue worker runs as a separate process and always uses the background model.

Job Types¶

Type	Description	Callback
`chat`	Background LLM inference	Posts result to callback URL
`adapter_train`	LoRA adapter training	Posts job status to callback URL
`vision`	Vision inference (image + text)	Posts result to callback URL

Submitting Jobs¶

POST /internal/queue/enqueue

{
  "job_id": "unique-id",
  "job_type": "chat",
  "request": { "messages": [...], "model": "background" },
  "callback_url": "http://jarvis-command-center:7703/api/v0/callback",
  "ttl_seconds": 300,
  "idempotency_key": "optional-dedup-key"
}
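
A client could assemble this request with the standard library as sketched below. The helper name is hypothetical, and the authentication headers are left out on purpose because the app-auth scheme is not described here; a real call would need them.

```python
import json
import urllib.request
from typing import List, Optional


def build_enqueue_request(
    base_url: str,
    job_id: str,
    messages: List[dict],
    callback_url: str,
    ttl_seconds: int = 300,
    idempotency_key: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST for a background chat job."""
    payload = {
        "job_id": job_id,
        "job_type": "chat",
        "request": {"messages": messages, "model": "background"},
        "callback_url": callback_url,
        "ttl_seconds": ttl_seconds,
    }
    # The dedup key is optional, so only include it when provided.
    if idempotency_key is not None:
        payload["idempotency_key"] = idempotency_key
    return urllib.request.Request(
        url=f"{base_url}/internal/queue/enqueue",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # auth headers omitted
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen()` sends it; the job result then arrives later at `callback_url` rather than in the HTTP response.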

Deduplication¶

Jobs with the same job_id + idempotency_key are deduplicated via Redis SET NX with a TTL. This prevents duplicate training jobs or repeated inference requests.
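
The SET NX pattern works because only the first writer of a key gets a truthy reply. Below is a minimal sketch: `claim_job` accepts any client whose `set()` takes `nx` and `ex` keyword arguments (as redis-py's does), the key layout is a hypothetical choice, and `InMemoryClient` is a stand-in so the example runs without a Redis server.

```python
from typing import Dict, Optional


def claim_job(client, job_id: str, idempotency_key: str, ttl_seconds: int) -> bool:
    """Return True only for the first caller to claim this job within the TTL."""
    key = f"dedup:{job_id}:{idempotency_key}"  # hypothetical key layout
    # SET key value NX EX ttl: the write succeeds only if the key is absent.
    return bool(client.set(key, "1", nx=True, ex=ttl_seconds))


class InMemoryClient:
    """Demonstration stand-in for a Redis client; it ignores expiry."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def set(self, name: str, value: str, *, nx: bool = False, ex: Optional[int] = None):
        if nx and name in self._data:
            return None  # mirrors the nil reply when NX blocks the write
        self._data[name] = value
        return True
```

A second `claim_job` call with the same `job_id` and `idempotency_key` returns `False`, which is the signal to skip enqueueing. Once the TTL expires on a real Redis server, the same pair can be submitted again.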

Configuration¶

Variable	Default	Description
`REDIS_URL`	---	Full Redis connection URL
`REDIS_HOST`	localhost	Redis host (if `REDIS_URL` not set)
`REDIS_PORT`	6379	Redis port
`REDIS_DB`	0	Redis database number
`REDIS_PASSWORD`	---	Redis password
`LLM_PROXY_QUEUE_NAME`	`llm_proxy_jobs`	Queue name
`RUN_QUEUE_WORKER`	true	Whether to start the worker process

LoRA Adapter Training¶

The service manages the full adapter lifecycle: training, storage, caching, and inference-time loading.

Training Flow¶

sequenceDiagram
    participant Client
    participant API as API Server :7704
    participant DB as PostgreSQL
    participant Redis
    participant Worker as Queue Worker
    participant MS as Model Service :7705
    participant GPU

    Client->>API: POST /internal/queue/enqueue (adapter_train)
    API->>DB: Create TrainingJob (QUEUED)
    API->>Redis: Enqueue job
    Redis->>Worker: Dequeue
    Worker->>MS: POST /internal/model/unload (free GPU VRAM)
    Worker->>GPU: Run training subprocess
    Note over GPU: train_adapter_mlx.py (macOS)<br/>train_adapter.py (Linux)
    GPU-->>Worker: Training complete
    Worker->>Worker: Zip adapter artifacts
    Worker->>DB: Update job (COMPLETE)
    Worker->>MS: POST /internal/model/reload
    Worker->>Client: POST callback_url (result)
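
The worker's part of this sequence can be sketched with injected callables, which keeps the ordering visible without touching a GPU. All four parameter names are hypothetical, and reloading models in a `finally` block is an assumption of this sketch: the diagram only shows the success path.

```python
from typing import Callable


def run_adapter_training(
    model_service_post: Callable[[str], None],
    train_and_zip: Callable[[], str],
    update_job: Callable[[str, str], None],
    send_callback: Callable[[dict], None],
) -> str:
    """Unload models, train, record the result, reload models, notify the client."""
    model_service_post("/internal/model/unload")  # free GPU memory first
    try:
        artifact_path = train_and_zip()  # training subprocess + zip step
        update_job("COMPLETE", artifact_path)
    finally:
        # Assumption: reload even when training raises, so inference resumes.
        model_service_post("/internal/model/reload")
    send_callback({"status": "COMPLETE", "artifact": artifact_path})
    return artifact_path
```

If `train_and_zip` raises, the models are still reloaded and the exception propagates to the caller; how the real worker records and reports a failed job is not covered here.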

Training Parameters¶

Variable	Default	Description
`JARVIS_ADAPTER_LORA_R`	16	LoRA rank
`JARVIS_ADAPTER_LORA_ALPHA`	32	LoRA alpha (scaling)
`JARVIS_ADAPTER_LORA_DROPOUT`	---	Dropout rate
`JARVIS_ADAPTER_LEARNING_RATE`	---	Learning rate
`JARVIS_ADAPTER_EPOCHS`	---	Number of training epochs
`JARVIS_ADAPTER_BATCH_SIZE`	---	Training batch size
`JARVIS_ADAPTER_MAX_SEQ_LEN`	---	Max sequence length
`JARVIS_ADAPTER_TRAIN_DTYPE`	---	Training dtype
`JARVIS_ADAPTER_TRAIN_LOAD_IN_4BIT`	---	4-bit quantized training

Adapter Storage¶

Adapters are stored locally and optionally synced to S3/MinIO:

Local cache: LLM_PROXY_ADAPTER_DIR (default: /tmp/jarvis-adapters)
S3 storage: s3://{bucket}/{prefix}/{dataset_hash}/adapter.zip
LRU cache: In-memory adapter path cache (max 10 entries, configurable) with optional disk eviction
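
An LRU path cache of this kind is commonly built on `OrderedDict`. The sketch below is one possible shape, not the service's implementation: `AdapterPathCache` is a hypothetical name, and returning the evicted path is just one way to let the caller handle the optional disk eviction.

```python
from collections import OrderedDict
from typing import Optional


class AdapterPathCache:
    """Hypothetical in-memory LRU map from adapter hash to local path."""

    def __init__(self, max_entries: int = 10) -> None:
        self._max_entries = max_entries
        self._paths: "OrderedDict[str, str]" = OrderedDict()

    def get(self, adapter_hash: str) -> Optional[str]:
        path = self._paths.get(adapter_hash)
        if path is not None:
            # A hit makes this entry the most recently used.
            self._paths.move_to_end(adapter_hash)
        return path

    def put(self, adapter_hash: str, path: str) -> Optional[str]:
        """Store a path; return the evicted path, if any."""
        self._paths[adapter_hash] = path
        self._paths.move_to_end(adapter_hash)
        if len(self._paths) > self._max_entries:
            # Drop the least recently used entry (the oldest in order).
            _, evicted_path = self._paths.popitem(last=False)
            return evicted_path
        return None
```

When `put` returns a path, the caller can delete that directory from disk if disk eviction is enabled, or simply ignore it.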

Resolution order: local cache -> local zip extraction -> S3 download.
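
That resolution order can be sketched with `pathlib` and `zipfile`. The on-disk layout (`<dir>/<hash>/` for extracted adapters, `<dir>/<hash>.zip` for archives) and the `download_zip` callable standing in for the S3 client are assumptions made for this example.

```python
import zipfile
from pathlib import Path
from typing import Callable


def resolve_adapter(
    adapter_dir: Path,
    adapter_hash: str,
    download_zip: Callable[[str, Path], None],
) -> Path:
    """Return a directory holding the adapter, fetching it only if needed."""
    extracted = adapter_dir / adapter_hash          # hypothetical layout
    archive = adapter_dir / f"{adapter_hash}.zip"   # hypothetical layout

    # 1. Local cache: already extracted on disk.
    if extracted.is_dir():
        return extracted

    # 3. S3 download: only when no local zip exists either.
    if not archive.is_file():
        adapter_dir.mkdir(parents=True, exist_ok=True)
        download_zip(adapter_hash, archive)

    # 2. Local zip extraction (also used right after a download).
    with zipfile.ZipFile(archive) as bundle:
        bundle.extractall(extracted)
    return extracted
```

Injecting the download step keeps the function testable offline and leaves the choice of S3 or MinIO client to the caller.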

Adapter Loading at Inference¶

Adapters are loaded per-request via the adapter_settings field in chat requests:

{
  "messages": [...],
  "adapter_settings": {
    "hash": "abc123",
    "scale": 1.0,
    "enabled": true
  }
}

How each backend handles this:

Backend	Mechanism	Reload Required
GGUF	Destroys model, recreates with `lora_path`	Yes (full reload)
MLX	`load_adapters()` / `remove_lora_layers()`	No (in-place swap)
vLLM	`LoRARequest` per request	No (concurrent)
Transformers	`PeftModel.from_pretrained()`	Partial (merge/unmerge)

Pipeline System¶

The pipeline system orchestrates multi-step model builds: generate training data -> train adapter -> validate -> merge -> convert to GGUF/MLX.

Method	Path	Description
`POST`	`/v1/pipeline/build`	Start pipeline build
`GET`	`/v1/pipeline/status`	Current pipeline status
`POST`	`/v1/pipeline/cancel`	Cancel running pipeline
`GET`	`/v1/pipeline/logs`	SSE stream of build logs
`GET`	`/v1/pipeline/artifacts`	List models/adapters/GGUF/MLX on disk

Pipeline endpoints require superuser JWT authentication.
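
Because `/v1/pipeline/logs` is a server-sent events stream, a client has to split the response into events. The sketch below parses only the standard `data:` field from an iterable of text lines; the content of each log payload is not specified here, so it is returned as an opaque string.

```python
from typing import Iterable, Iterator, List


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each complete server-sent event."""
    buffer: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # A single leading space after the colon is not part of the value.
            buffer.append(value[1:] if value.startswith(" ") else value)
        # Comment lines (":") and other fields (event, id, retry) are skipped.
    # An event without a terminating blank line is dropped.
```

Any HTTP client that can iterate over response lines can feed this generator, as long as it also sends the superuser JWT the endpoint requires.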

Embeddings¶

A separate EmbeddingManager provides text embeddings via sentence-transformers:

POST /v1/embeddings

Variable	Default	Description
`JARVIS_EMBEDDING_MODEL`	`all-MiniLM-L6-v2`	Embedding model name

The embedding model runs on CPU, independently of the LLM backends. With the default model, embeddings have 384 dimensions.

Full API Reference¶

Public API (Port 7704)¶

Method	Path	Auth	Description
`POST`	`/v1/chat/completions`	App auth	OpenAI-compatible chat completions
`GET`	`/v1/models`	None	List loaded models
`GET`	`/v1/engine`	None	Inference engine info
`POST`	`/v1/embeddings`	App auth	Text embeddings
`GET`	`/v1/training/status/{job_id}`	None	Training job status
`GET`	`/v1/adapters/date-keys`	None	Date key vocabulary
`POST`	`/v1/pipeline/build`	Superuser JWT	Start pipeline build
`GET`	`/v1/pipeline/status`	Superuser JWT	Pipeline status
`POST`	`/v1/pipeline/cancel`	Superuser JWT	Cancel pipeline
`GET`	`/v1/pipeline/logs`	Superuser JWT	SSE pipeline logs
`GET`	`/v1/pipeline/artifacts`	Superuser JWT	List on-disk artifacts
`GET`	`/settings/`	Combined auth	List all settings
`PUT`	`/settings/{key}`	Combined auth	Update setting
`POST`	`/internal/queue/enqueue`	App auth	Submit async job
`GET`	`/health`	None	Health check

Internal API (Port 7705)¶

Method	Path	Auth	Description
`POST`	`/internal/model/chat`	Internal token	Run chat inference
`POST`	`/internal/model/chat/stream`	Internal token	Streaming chat (SSE)
`GET`	`/internal/model/models`	Internal token	List loaded models
`POST`	`/internal/model/unload`	Internal token	Unload all models
`POST`	`/internal/model/reload`	Internal token	Reload models
`GET`	`/health`	None	Model service health

Dependencies¶

PostgreSQL --- training jobs table, settings table
Redis --- async job queue
MinIO/S3 (optional) --- adapter artifact storage

Dependents¶

jarvis-command-center --- primary consumer for intent classification and response generation
jarvis-tts --- LLM-generated wake word responses

Impact if Down¶

No LLM-based command parsing or response generation. Voice commands requiring intent classification will fail. Commands with pre_route() fast-path matching may still work.