Integrating Windows.Media SpeechRecognition in WPF Applications Using .NET
Rick Strahl discusses the practical integration of Windows.Media SpeechRecognition within WPF applications using .NET. The post highlights code samples, SDK dependencies, workarounds for WinRT/.NET issues, and language management.
By Rick Strahl
Windows includes a built-in speech recognition engine exposed through the Windows.Media.SpeechRecognition WinRT APIs. In .NET, you can access these features via the Windows SDK contract assemblies and the CsWinRT projections, which expose them to desktop applications. This article details the integration of these speech capabilities into a WPF application, highlights the necessary dependencies, discusses challenges around SDK integration, and offers practical solutions.
Introduction
The Windows.Media SpeechRecognition features replace the older System.Speech functionality, providing a much-improved recognition engine with a familiar but more capable API. This guide focuses on:
- Integrating Windows Media SpeechRecognition in WPF
- Building a simple wrapper class for speech dictation
- Handling the required SDK and WinRT dependencies
- Dealing with SDK/WinRT integration quirks
- Working through relevant pitfalls and workarounds
Required Dependencies
Because the SpeechRecognizer class is part of the Windows SDK and originated from the WinRT/UWP development period, you need two NuGet packages:
<PackageReference Include="Microsoft.Windows.SDK.Contracts" Version="10.0.22621.2" />
<PackageReference Include="Microsoft.Windows.CsWinRt" Version="*" />
- SDK.Contracts must match the minimum Windows version you intend to target (10.0.22621 covers current Windows 10 and Windows 11 releases).
- Ensure the matching Windows SDK is installed on your development machine, typically via Visual Studio Installer > Individual Components or Windows Desktop Development.
Creating a VoiceDictation Class
This reusable class handles:
- Starting listening for dictation
- Stopping the listener
- Shutting down on idle
- Handling and fixing up dictation results
Typical usage involves wiring commands to hotkeys or menu options:
// Initialization
if (mmApp.Configuration.EnableVoiceDictation)
    VoiceDictation = new VoiceDictation();

// Command example
public CommandBase StartListeningCommand { get; set; }
void Command_StartListening() {
    StartListeningCommand = new CommandBase((parameter, command) => {
        Model.Window.VoiceDictation?.StartAsync().FireAndForget();
    }, (p, c) => true);
}

public CommandBase StopListeningCommand { get; set; }
void Command_StopListening() {
    StopListeningCommand = new CommandBase((parameter, command) => {
        Model.Window.VoiceDictation?.Stop();
    }, (p, c) => true);
}
VoiceDictation Class: Key Implementation Details
- Singleton Engine: You should have only one SpeechRecognizer instance at any time due to resource constraints and async tracking. Initialization isn’t fast, and overlapping engines can cause instability.
- Language: The engine uses the current system UI language by default, but can be set via an IETF language code (e.g., en-US, de-DE).
- Event Hookup: Crucial events include ResultGenerated (for generated text results) and Completed (for idle/timeouts/end of session).
Sample snippet:
public class VoiceDictation {
    private SpeechRecognizer _recognizer;
    private bool _isCompiled;
    private bool _isDisposed;

    public bool IsDictating { get; private set; }

    public VoiceDictation() {
        // Language selection: use the configured IETF code, or the system default
        if (string.IsNullOrEmpty(mmApp.Configuration.VoiceDictationLanguage))
            _recognizer = new SpeechRecognizer();
        else
            _recognizer = new SpeechRecognizer(new Language(mmApp.Configuration.VoiceDictationLanguage));

        // Free-form dictation constraint
        var dictation = new SpeechRecognitionTopicConstraint(SpeechRecognitionScenario.Dictation, "dictation");
        _recognizer.Constraints.Add(dictation);

        _recognizer.ContinuousRecognitionSession.ResultGenerated += ContinuousRecognitionSession_ResultGenerated;
        _recognizer.ContinuousRecognitionSession.AutoStopSilenceTimeout = TimeSpan.FromMinutes(1);
        _recognizer.ContinuousRecognitionSession.Completed += ContinuousRecognitionSession_Completed;

        Keyboard.AddKeyDownHandler(mmApp.Window, KeydownHandler);
    }

    // ...
}
Commands: Example hotkeys include F4 to start and ESC to stop dictation. Language changes require recreating the engine.
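For reference, a keydown handler along these lines can map those keys to the start and stop operations (the KeydownHandler name comes from the constructor above; the exact Markdown Monster logic may differ):

private void KeydownHandler(object sender, KeyEventArgs e) {
    // Illustrative mapping: F4 starts a dictation session, Esc stops an active one
    if (e.Key == Key.F4 && !IsDictating) {
        e.Handled = true;
        _ = StartAsync();
    }
    else if (e.Key == Key.Escape && IsDictating) {
        e.Handled = true;
        Stop();
    }
}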
Starting and Stopping Dictation
- CompileConstraintsAsync: The recognizer must be compiled (async op) before listening.
- Session Start/Stop: All major operations (start, stop, compile) on SpeechRecognizer are async, but WinRT uses IAsync patterns incompatible with .NET Tasks. Reflection-based .AsTask() wrappers are used as a workaround.
public async Task StartAsync(DictationListenModes listenMode = DictationListenModes.EscPressed) {
    if (IsDictating) return;

    try {
        await EnsureCompiledAsync();

        var action = _recognizer.ContinuousRecognitionSession.StartAsync();
        await AsTask(action);

        IsDictating = true;
        // Show progress/status
    }
    catch (Exception ex) when (ex.Message.Contains("privacy")) {
        // Open privacy settings
        ShellUtils.GoUrl("ms-settings:privacy-speech");
    }
}
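Neither EnsureCompiledAsync() nor Stop() is shown in this excerpt; a rough sketch of what they might look like, following the same AsTask() pattern (not the exact implementation):

async Task EnsureCompiledAsync() {
    if (_isCompiled) return;

    // CompileConstraintsAsync() returns a WinRT IAsyncOperation, hence the AsTask() shim
    var op = _recognizer.CompileConstraintsAsync();
    await AsTask<SpeechRecognitionCompilationResult>(op);
    _isCompiled = true;
}

public void Stop() {
    if (!IsDictating) return;
    IsDictating = false;

    // StopAsync() is also a WinRT async operation; fire-and-forget is fine for shutdown
    var action = _recognizer.ContinuousRecognitionSession.StopAsync();
    _ = AsTask(action);
}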
Speech Features: Windows speech recognition must be enabled in the privacy settings, or the async start will fail. You can open the relevant settings page programmatically via Process.Start("ms-settings:privacy-speech").
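Note that on modern .NET, launching a protocol URI like this requires UseShellExecute (the ShellUtils.GoUrl helper used in the sample wraps this); roughly:

Process.Start(new ProcessStartInfo("ms-settings:privacy-speech") {
    UseShellExecute = true   // required for ms-settings: and other protocol URIs on .NET Core/.NET 5+
});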
Handling Asynchronous Pattern and Dueling References
- WinRT methods return IAsyncAction and IAsyncOperation<T>, not Task.
- The .AsTask() extension isn't available due to ambiguous/duplicated type signatures between the SDK and WinRT projections. Reflection is used to access these at runtime.
Custom Reflection-based workaround (simplified):
Task AsTask(object action) { /* ...Reflection logic... */ }
Task AsTask<T>(object action) { /* ...Generic version with reflection... */ }
This is a cautionary tale: if you need to await these operations from standard .NET Task-based code, Reflection may be unavoidable.
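The full Reflection code isn't reproduced here. As one rough way to bridge the gap without compiling against the ambiguous extension methods, you could poll the operation's IAsyncInfo.Status via Reflection; this is a simplified sketch, not the original implementation:

// Awaits a WinRT IAsyncAction/IAsyncOperation passed as object by polling its Status.
// Requires using System.Linq and System.Threading.Tasks.
async Task AsTask(object winRtAsync) {
    var type = winRtAsync.GetType();

    // Status is defined on Windows.Foundation.IAsyncInfo; fall back to the interface
    // if the projected class implements it explicitly
    var statusProp = type.GetProperty("Status")
        ?? type.GetInterfaces().FirstOrDefault(i => i.Name == "IAsyncInfo")?.GetProperty("Status");

    // AsyncStatus: Started = 0, Completed = 1, Canceled = 2, Error = 3
    while (Convert.ToInt32(statusProp.GetValue(winRtAsync)) == 0)
        await Task.Delay(10);

    // GetResults() rethrows any error the WinRT operation produced (if accessible)
    type.GetMethod("GetResults")?.Invoke(winRtAsync, null);
}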
Capturing and Processing Recognized Speech
- ResultGenerated Event: Receives dictated text and inserts it into the editor or UI control.
- UI Thread: Use Dispatcher to access UI elements safely.
- Text Fix-Up: Punctuation, commands (like “stop recording”), and spacing are handled via custom methods to improve context and accuracy (a rough fix-up sketch follows the handler below).
private async void ContinuousRecognitionSession_ResultGenerated(
    SpeechContinuousRecognitionSession sender,
    SpeechContinuousRecognitionResultGeneratedEventArgs args) {

    if (args.Result.Status != SpeechRecognitionResultStatus.Success ||
        !mmApp.Configuration.EnableVoiceDictation)
        return;

    var text = args.Result?.Text;
    if (string.IsNullOrEmpty(text)) return;

    await mmApp.Window.Dispatcher.InvokeAsync(async () => {
        var ctrl = Keyboard.FocusedElement;
        if (ctrl is TextBox tb) {
            // Insert text logic here
        }
        // Editor logic, fix-up, etc.
    });
}
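The fix-up methods themselves aren't shown; as a rough idea of the kind of processing involved (the keyword list and replacements here are illustrative, not the actual Markdown Monster code, and require using System.Text.RegularExpressions):

private static string FixUpDictatedText(string text) {
    // Map a few spoken punctuation keywords to symbols (illustrative, English-only)
    text = Regex.Replace(text, @"\bperiod\b", ".", RegexOptions.IgnoreCase);
    text = Regex.Replace(text, @"\bcomma\b", ",", RegexOptions.IgnoreCase);
    text = Regex.Replace(text, @"\bnew line\b", "\n", RegexOptions.IgnoreCase);

    // Remove stray spaces the recognizer tends to leave before punctuation
    text = Regex.Replace(text, @"\s+([.,!?])", "$1");
    return text;
}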
Language Keywords and Switching
- Keyword Issues: Command words like “space” and “return” are checked in English, but may fail in other languages, requiring further handling.
- Language Switching: Only possible by recreating the speech engine with the desired language code.
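Because the class reads mmApp.Configuration.VoiceDictationLanguage in its constructor, switching boils down to tearing down the old instance and creating a new one, along these lines (assuming the class implements IDisposable, which its _isDisposed field suggests):

// Persist the new IETF language code, then rebuild the dictation engine
mmApp.Configuration.VoiceDictationLanguage = "de-DE";

Model.Window.VoiceDictation?.Dispose();
Model.Window.VoiceDictation = new VoiceDictation();  // constructor picks up the new language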
Deployment and SDK/WinRT Issues
- Large Dependency Size: Using the SDK and WinRT adds ~30MB to deployables.
- Negotiating Ambiguous Types: Overlapping APIs result in complicated workarounds, mainly through Reflection.
Alternative: Win+H Shortcut
Windows provides a native speech-to-text overlay (Win+H) for basic dictation. It's less integrated but doesn't require code.
Summary
Windows.Media SpeechRecognition delivers robust dictation capabilities compared to the legacy System.Speech engine. Despite some integration challenges (large dependencies, API overlap, and async integration issues), it’s a practical option for advanced voice features in desktop applications. Direct programmatic integration delivers a much better user experience than global dictation shortcuts.
This post appeared first on Rick Strahl's Blog.