Describe the bug
/voice records audio successfully (level meter reacts, mic capture confirmed via raw PulseAudio capture) but every transcription comes back empty, for all three models offered in the /voice model picker:
- nemotron-3.5-asr-streaming-0.6b
- nemotron-speech-streaming-en-0.6b
- nemotron-speech-streaming-es-0.6b
All three share the same nemotron_speech (RNNT) architecture, so switching models does not help — there is no working model in the current picker.
Root cause (traced to source)
The Foundry Local Core native audio transcription path throws on every call:
Microsoft.ML.OnnxRuntimeGenAI.OnnxRuntimeGenAIException: MultiModalProcessor cannot be created. nemotron_speech is not a registered multi-modal model type.
at Microsoft.ML.OnnxRuntimeGenAI.Result.VerifySuccess(IntPtr)
at Microsoft.ML.OnnxRuntimeGenAI.MultiModalProcessor..ctor(Model)
at Microsoft.AI.Foundry.Local.AudioClient.Transcribe(String, String, Nullable`1)
This is not a stale/outdated onnxruntime-genai engine issue. I downloaded and compared the public onnxruntime-genai v0.14.0 release (published 2026-05-29, after the "Multilingual Streaming Nemotron ASR" PR merged 2026-05-22) against the bundled runtime (labeled 0.14.1 in deps_versions.json, but not a publicly-tagged release). Both contain full, identical NemotronSpeechModel/NemotronSpeechState support (confirmed via symbol inspection of libonnxruntime-genai.so).
Cross-checking against microsoft/onnxruntime-genai's src/models/model.cpp (https://github.com/microsoft/onnxruntime-genai/blob/main/src/models/model.cpp) confirms this is architecturally expected:
- nemotron_speech is an RNNT-type model (ModelType::IsRNNT) and is correctly routed to NemotronSpeechModel when the model is loaded (CreateModel() in model.cpp). This is why "Model loaded successfully: nemotron-3.5-asr-streaming-0.6b-generic-cpu:3" appears in the Foundry log and the model shows as loaded.
- However, MultiModalProcessor's constructor uses a separate, hardcoded processor_factory_ registry that only contains vision+text chat model types (e.g. phi4mm, gemma4). RNNT/TDT/ALM (Whisper) audio model types were never meant to go through MultiModalProcessor at all.
So the bug is that Microsoft.AI.Foundry.Local.Core's AudioClient.Transcribe/streaming session path unconditionally constructs a MultiModalProcessor for audio transcription, instead of dispatching RNNT/TDT/ALM model types (nemotron_speech, Parakeet TDT, Whisper) through their own dedicated processing path (the same one used successfully at model-load time). Since Microsoft.AI.Foundry.Local.Core.so is a closed-source component, this can't be worked around or patched client-side — no onnxruntime-genai version swap fixes it, since the engine itself already works correctly and simply refuses (correctly, by its own design) to construct a MultiModalProcessor for a non-multimodal-chat model type.
Affected version
GitHub Copilot CLI 1.0.69-0 (also reproduced on 1.0.66-1 — same bundled runtime pins: onnxruntime-genai 0.14.1, foundry-local-core 1.2.3, so the bug is not CLI-version-specific).
Steps to reproduce the behavior
- On Linux x64 (reproduced under WSL2/WSLg, but likely affects any Linux x64 install), run /voice, enable it, and select any of the three offered models.
- Press the voice-record shortcut and speak clearly for several seconds (a raw parec capture confirmed that real speech is captured at healthy signal levels, and the in-app level meter visibly reacts).
- Stop recording.
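For anyone reproducing this, one way to verify the "healthy signal levels" part independently is to measure the RMS and peak level of the raw capture. The following is a minimal sketch, assuming the parec capture was saved as a 16-bit PCM WAV file; the function name wav_levels is illustrative and not part of any tool mentioned here.

```python
import array
import math
import sys
import wave


def wav_levels(path):
    """Return (rms_dbfs, peak_dbfs) for a 16-bit PCM WAV file."""
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM samples")
        frames = wf.readframes(wf.getnframes())

    samples = array.array("h")
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    if not samples:
        raise ValueError("file contains no audio frames")

    full_scale = 32768.0
    rms = math.sqrt(sum(s * s for s in samples) / len(samples)) / full_scale
    peak = max(abs(s) for s in samples) / full_scale

    def to_db(x):
        # Digital silence has no finite level in dBFS.
        return 20 * math.log10(x) if x > 0 else float("-inf")

    return to_db(rms), to_db(peak)
```

Calling wav_levels("capture.wav") on a recording that is pure silence returns negative infinity for both values, which would point at the capture pipeline rather than at transcription.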
Expected behavior
Spoken audio is transcribed to text and inserted into the input box.
Actual behavior
No error is shown to the user; the recording UI behaves normally, but the transcript is always empty. ~/.github-copilot-cli/logs/foundry.core*.log shows:
[ERR] Error executing audio_transcribe: Microsoft.ML.OnnxRuntimeGenAI.OnnxRuntimeGenAIException: MultiModalProcessor cannot be created. nemotron_speech is not a registered multi-modal model type.
...
Audio stream stop: session <id>, final flush text: '', full transcript: ''
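To check whether another install is hitting the same failure, the Foundry log can be scanned for the exception text. This is a minimal sketch: the log glob and the message fragment are taken from this report, while the function name find_processor_errors is illustrative.

```python
import glob
import os

# Log location and message fragment as quoted in this report.
LOG_GLOB = os.path.expanduser("~/.github-copilot-cli/logs/foundry.core*.log")
MARKER = "is not a registered multi-modal model type"


def find_processor_errors(pattern=LOG_GLOB):
    """Return (path, line_number, line) for every log line containing MARKER."""
    hits = []
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8", errors="replace") as fh:
            for number, line in enumerate(fh, start=1):
                if MARKER in line:
                    hits.append((path, number, line.rstrip("\n")))
    return hits
```

An empty result on a machine where /voice still returns empty transcripts would suggest a different failure than the one described here.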
Additional context
- Confirmed mic capture and PulseAudio pipeline are fully healthy (not a WSLg/audio-hardware issue) via direct raw parec -d RDPSource capture showing strong, correctly-timed speech signal.
- Confirmed via a standalone script using the bundled foundry-local-sdk package directly (bypassing the Copilot CLI JS entirely) that audioClient.transcribe() throws the identical MultiModalProcessor/nemotron_speech error against a real captured WAV file, ruling out any Copilot-CLI-specific bug in the JS layer.
- Suggested fix: in Foundry Local Core's audio transcription path, dispatch by ModelType::IsRNNT/IsTDT/IsALM (mirroring CreateModel()'s dispatch in onnxruntime-genai's model.cpp) instead of unconditionally constructing a MultiModalProcessor.
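The difference between the current behavior and the suggested fix can be modeled in a few lines. This is only an illustrative sketch, not Foundry Local Core or onnxruntime-genai code: the function names, the exception class, and the returned path labels are hypothetical, and the two type sets contain only the model types named in this report, so they are not complete.

```python
# Illustrative model of the dispatch problem; NOT Foundry Local Core code.
# The type strings come from this report; the sets are deliberately incomplete.
RNNT_TYPES = {"nemotron_speech"}
MULTIMODAL_CHAT_TYPES = {"phi4mm", "gemma4"}


class UnsupportedModelType(Exception):
    """Stands in for the OnnxRuntimeGenAIException seen in the log."""


def current_path(model_type):
    """Behavior described in this report: always take the multimodal path."""
    if model_type not in MULTIMODAL_CHAT_TYPES:
        raise UnsupportedModelType(
            f"MultiModalProcessor cannot be created. {model_type} is not a "
            "registered multi-modal model type."
        )
    return "multimodal_processor"


def suggested_path(model_type):
    """Suggested fix: choose the transcription path from the model type first."""
    if model_type in RNNT_TYPES:
        return "rnnt_streaming"  # hypothetical label for the dedicated path
    return current_path(model_type)
```

In this model, current_path("nemotron_speech") raises with the same message as the log, while suggested_path("nemotron_speech") reaches the dedicated path and multimodal chat types keep their existing route.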