Skip to content

Whisper API

The Whisper API provides speech-to-text transcription via whisper.cpp. It supports speaker identification through voice profile enrollment, returning both the transcribed text and a speaker confidence score.

Quick Reference

Port 7706
Health endpoint GET /health
Source jarvis-whisper-api/
Framework FastAPI + Uvicorn
Backend whisper.cpp
Tier 3 - Specialized

API Endpoints

Method Path Description
GET /health Health check
POST /v1/audio/transcriptions Transcribe audio to text
POST /api/v0/voice-profiles/enroll Enroll a voice profile for speaker ID
GET /api/v0/voice-profiles List enrolled voice profiles
DELETE /api/v0/voice-profiles/{id} Delete a voice profile

Speaker Identification

Whisper returns transcription results with speaker metadata:

{
  "text": "what's the weather like",
  "speaker": {
    "user_id": 1,
    "confidence": 0.87
  }
}

The command center uses this to resolve display names and load user-specific memories.

Environment Variables

Variable Description
WHISPER_MODEL_PATH Path to the whisper.cpp model file
JARVIS_AUTH_BASE_URL Auth service URL for node validation

Dependencies

  • whisper.cpp -- speech-to-text engine
  • jarvis-auth -- validates node credentials
  • jarvis-logs -- structured logging

Dependents

  • jarvis-command-center -- sends audio for transcription

Impact if Down

Speech-to-text transcription is unavailable. If the command center uses server-side transcription (rather than on-device), voice input cannot be processed. Text-based commands still work.