Speech-to-Text Providers
An STT provider implements IJarvisSpeechToTextProvider to convert audio files into text. The node calls the provider after recording audio from the microphone.
Interface Reference
from abc import ABC, abstractmethod
from typing import Optional
from dataclasses import dataclass

@dataclass
class TranscriptionResult:
    text: str  # The transcribed text
    speaker_user_id: int | None  # Identified user ID (if speaker ID is enabled)
    speaker_confidence: float  # Confidence score for speaker identification (0.0-1.0)

class IJarvisSpeechToTextProvider(ABC):
    @property
    @abstractmethod
    def provider_name(self) -> str:
        """Unique name for this provider. Example: 'jarvis_whisper'."""
        ...

    @abstractmethod
    def transcribe(self, audio_path: str) -> Optional[str]:
        """Transcribe an audio file to text. Returns None if transcription fails."""
        ...

    def transcribe_with_speaker(self, audio_path: str) -> TranscriptionResult:
        """Transcribe audio and identify the speaker.

        Default implementation wraps transcribe() with no speaker identification.
        Override this to add speaker ID support.
        """
        text = self.transcribe(audio_path)
        return TranscriptionResult(
            text=text or "",
            speaker_user_id=None,
            speaker_confidence=0.0,
        )
The transcribe_with_speaker method has a default implementation that delegates to transcribe(). You only need to override it if your STT backend supports speaker identification.
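As a minimal sketch of that default in action, the hypothetical `EchoProvider` below implements only `transcribe()` and inherits `transcribe_with_speaker()`. The interface is restated in condensed form so the snippet runs on its own; in a real node you would import it from the project instead. `EchoProvider` and its "return the path as text" behavior are illustrative assumptions, not part of Jarvis.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass
class TranscriptionResult:
    text: str
    speaker_user_id: Optional[int]
    speaker_confidence: float


class IJarvisSpeechToTextProvider(ABC):
    # Condensed copy of the interface above, so this sketch is standalone.
    @property
    @abstractmethod
    def provider_name(self) -> str: ...

    @abstractmethod
    def transcribe(self, audio_path: str) -> Optional[str]: ...

    def transcribe_with_speaker(self, audio_path: str) -> TranscriptionResult:
        text = self.transcribe(audio_path)
        return TranscriptionResult(
            text=text or "", speaker_user_id=None, speaker_confidence=0.0
        )


class EchoProvider(IJarvisSpeechToTextProvider):
    """Hypothetical provider: 'transcribes' by returning the file path."""

    @property
    def provider_name(self) -> str:
        return "echo"

    def transcribe(self, audio_path: str) -> Optional[str]:
        # Treat an empty path as a failed transcription.
        return audio_path or None


provider = EchoProvider()
result = provider.transcribe_with_speaker("hello.wav")  # inherited default
failed = provider.transcribe_with_speaker("")  # transcribe() returned None
```

Note that a failed transcription surfaces as an empty `text` string through the default wrapper, not as `None`, so callers of `transcribe_with_speaker()` should check for an empty string.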
Built-in Implementations
JarvisWhisperClient
The primary STT provider. Proxies audio through the Command Center's media endpoint, which forwards it to the jarvis-whisper-api service for transcription.
class JarvisWhisperClient(IJarvisSpeechToTextProvider):
    provider_name = "jarvis_whisper"

    def transcribe(self, audio_path: str) -> Optional[str]:
        result = self.transcribe_with_speaker(audio_path)
        return result.text if result.text else None

    def transcribe_with_speaker(self, audio_path: str) -> TranscriptionResult:
        # 1. Read the audio file
        with open(audio_path, "rb") as f:
            audio_data = f.read()

        # 2. Upload as multipart form data to Command Center
        response = self.jcc_client.post(
            "/api/v0/media/stt/transcribe",
            files={"audio": ("audio.wav", audio_data, "audio/wav")},
        )

        # 3. Parse response (includes speaker identification if available)
        data = response.json()
        # Tolerate a missing or null "speaker" field
        speaker = data.get("speaker") or {}
        return TranscriptionResult(
            text=data.get("text", ""),
            speaker_user_id=speaker.get("user_id"),
            speaker_confidence=speaker.get("confidence", 0.0),
        )
Speaker identification: When voice profiles are enrolled on the Whisper API, the response includes a speaker object with user_id and confidence. This enables personalized responses and user-specific memories.
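One way a caller might use these fields is to trust the speaker ID only above a confidence threshold. The sketch below is an illustration under stated assumptions: `parse_response`, `resolve_speaker`, `MIN_SPEAKER_CONFIDENCE`, and the 0.75 cutoff are hypothetical and not part of Jarvis. Only the `text`, `speaker.user_id`, and `speaker.confidence` response shape comes from the example above.

```python
from dataclasses import dataclass
from typing import Any, Optional

# Hypothetical cutoff; tune it against your own enrolled voice profiles.
MIN_SPEAKER_CONFIDENCE = 0.75


@dataclass
class TranscriptionResult:
    text: str
    speaker_user_id: Optional[int]
    speaker_confidence: float


def parse_response(data: dict[str, Any]) -> TranscriptionResult:
    """Build a result from a response dict shaped like the example above."""
    speaker = data.get("speaker") or {}  # tolerate a missing or null speaker
    return TranscriptionResult(
        text=data.get("text", ""),
        speaker_user_id=speaker.get("user_id"),
        speaker_confidence=speaker.get("confidence", 0.0),
    )


def resolve_speaker(result: TranscriptionResult) -> Optional[int]:
    """Return the user ID only when identification is confident enough."""
    if result.speaker_user_id is None:
        return None
    if result.speaker_confidence < MIN_SPEAKER_CONFIDENCE:
        return None
    return result.speaker_user_id


known = parse_response(
    {"text": "lights on", "speaker": {"user_id": 7, "confidence": 0.92}}
)
unsure = parse_response(
    {"text": "lights on", "speaker": {"user_id": 7, "confidence": 0.40}}
)
anonymous = parse_response({"text": "lights on"})
```

Falling back to an anonymous user when `resolve_speaker()` returns `None` keeps a low-confidence match from attaching memories to the wrong person.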
KeyboardProvider
A development/testing provider that reads text from standard input instead of processing audio:
class KeyboardProvider(IJarvisSpeechToTextProvider):
    provider_name = "keyboard"

    def transcribe(self, audio_path: str) -> Optional[str]:
        # Ignores the audio file entirely
        text = input("You: ").strip()
        return text or None
This is useful for testing commands without a microphone or audio setup. Set "stt_provider": "keyboard" in your node config to use it.
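The same provider can also be driven from an automated test by patching `input()`. The sketch below assumes nothing about the Jarvis test suite: it uses a standalone stand-in for `KeyboardProvider` (base class omitted so the snippet runs on its own) and a hypothetical helper, `transcribe_typed`.

```python
from typing import Optional
from unittest.mock import patch


class KeyboardProvider:
    """Standalone stand-in for the provider above (base class omitted)."""

    provider_name = "keyboard"

    def transcribe(self, audio_path: str) -> Optional[str]:
        # The audio path is ignored; text comes from standard input.
        text = input("You: ").strip()
        return text or None


def transcribe_typed(typed: str) -> Optional[str]:
    """Run the provider with input() patched to return `typed`."""
    with patch("builtins.input", return_value=typed):
        return KeyboardProvider().transcribe("ignored.wav")


command = transcribe_typed("  turn on the lights  ")
blank = transcribe_typed("   ")  # whitespace-only input counts as no text
```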
Writing a Custom Provider
Here is an example that uses OpenAI's Whisper API (cloud):
from stt_providers.base import IJarvisSpeechToTextProvider, TranscriptionResult
from typing import Optional
import httpx

class OpenAIWhisperProvider(IJarvisSpeechToTextProvider):
    @property
    def provider_name(self) -> str:
        return "openai_whisper"

    def __init__(self):
        self._api_key = self.secret_service.get_secret("OPENAI_API_KEY")

    def transcribe(self, audio_path: str) -> Optional[str]:
        if not self._api_key:
            return None
        try:
            with open(audio_path, "rb") as f:
                response = httpx.post(
                    "https://api.openai.com/v1/audio/transcriptions",
                    headers={"Authorization": f"Bearer {self._api_key}"},
                    files={"file": ("audio.wav", f, "audio/wav")},
                    data={"model": "whisper-1"},
                    timeout=30.0,  # httpx defaults to 5 s, which can be short for uploads
                )
        except (OSError, httpx.HTTPError):
            # Honor the interface contract: failures return None instead of raising
            return None
        if response.status_code == 200:
            return response.json().get("text")
        return None
Save as stt_providers/openai_whisper_provider.py, then set "stt_provider": "openai_whisper" in your node config.
Audio Format
STT providers receive a file path to a WAV audio file. The node's audio recording system handles format conversion before calling the provider, so you can assume:
- Format: WAV (PCM)
- Sample rate: 16000 Hz
- Channels: Mono (1 channel)
- Bit depth: 16-bit
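If a custom provider wants to verify these assumptions defensively, the standard-library `wave` module can read the header. The following is a minimal sketch, not something the node requires: `matches_expected_format` and `write_silence` are hypothetical helper names, and the short demo at the end only generates silent test files in a temporary directory.

```python
import os
import tempfile
import wave

EXPECTED_RATE = 16000       # Hz
EXPECTED_CHANNELS = 1       # mono
EXPECTED_SAMPLE_WIDTH = 2   # bytes per sample, i.e. 16-bit


def matches_expected_format(audio_path: str) -> bool:
    """Return True if the file is 16 kHz, mono, 16-bit PCM WAV."""
    try:
        with wave.open(audio_path, "rb") as wav:
            return (
                wav.getframerate() == EXPECTED_RATE
                and wav.getnchannels() == EXPECTED_CHANNELS
                and wav.getsampwidth() == EXPECTED_SAMPLE_WIDTH
            )
    except (wave.Error, EOFError, OSError):
        # Not a readable PCM WAV file (or the file is missing).
        return False


def write_silence(audio_path: str, rate: int, channels: int) -> None:
    """Write 0.1 s of 16-bit silence, for demonstration only."""
    frames = rate // 10
    with wave.open(audio_path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * frames * channels)


with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.wav")
    stereo = os.path.join(tmp, "stereo.wav")
    write_silence(good, rate=16000, channels=1)
    write_silence(stereo, rate=44100, channels=2)

    good_ok = matches_expected_format(good)
    stereo_ok = matches_expected_format(stereo)
    missing_ok = matches_expected_format(os.path.join(tmp, "missing.wav"))
```

A provider could call such a check at the top of `transcribe()` and return `None` on a mismatch, which keeps malformed input from reaching a paid or slow backend.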