Writing a Screen Reader in Rust
For Windows

A screen reader is the one piece of software its users never look at.

It narrates the operating system out loud to someone who can't read the pixels: every button, every menu, every field that takes focus. For the people who rely on it, the screen reader is the interface. Everything else is implementation detail.

Until recently, I had no idea how one actually worked.

How I got here

I first got exposed to web accessibility through a deadline. In June 2025 a new German law, the Barrierefreiheitsstärkungsgesetz (Accessibility Strengthening Act), comes into effect, requiring most websites to be accessible. German site owners are scrambling, and our clients at @next.motion have started asking for audits and improvements.

While working on the accessibility of one site, I got curious about something the checklist never explains: how does a screen reader actually work?

So I did the thing I always do with a black box. I tried to build one.

To keep the scope small, I limited myself to Windows, by far the most popular OS in my target audience.

What it has to do

Stripped down, a basic screen reader needs to:

  • Know which UI element you're currently on
  • Read text aloud
  • Listen and react to input
  • Play sounds for feedback

And, just as deliberately, a list of things mine would not do:

  • Have a GUI
  • Read anything beyond the focused element

That second omission is the big one. Reading real web content means fighting the browser, which sandboxes its content from exactly this kind of access. That's a project in its own right. For now, the focused element is more than enough to learn from.

Reaching into the OS

Screen readers live close to the metal. They need low-level access to OS APIs to pull text out of other applications and poke at their controls. Rust is more or less the only low-level language I'm comfortable in, so the choice made itself. The borrow checker™ is the cherry on top. I'd rather not chase dangling pointers through someone else's UI tree.

After that it was a matter of finding the right things to stand on.

For reading the UI: leexgone/uiautomation-rs, a Rust wrapper around the Windows UI Automation API. It exposes nearly every control on the desktop and lets you interact with them programmatically. Exactly the first item on my list.

For speech, Windows already ships a text-to-speech engine; I just needed a way in. retep998/winapi-rs provides raw FFI bindings to the whole Windows API. That covers reading text aloud.

For the rest, there's rodio for audio playback and mki for keyboard input, straight off crates.io.

A name

I called it Aria, after ARIA: the set of roles and attributes that tell assistive tech what an HTML element actually is. A screen reader is one of the main things that reads them. It felt right to name the reader after the vocabulary it reads.

Before writing any code, I also made a simple logo. Premature for a side project, maybe, but it makes the thing feel real.

Aria Logo

Reading the screen

UI Automation makes it easy to hook into Windows interface events. To follow focus, I created a FocusChangedEventHandler and handed it to the automation runtime:

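use std::sync::Mutex;
use uiautomation::events::{CustomFocusChangedEventHandler, UIFocusChangedEventHandler};
use uiautomation::{UIAutomation, UIElement};

struct FocusChangedEventHandler {
    // Last element announced, so repeated events can be ignored.
    previous_element: Mutex<Option<UIElement>>,
}

let automation = UIAutomation::new().unwrap();
let focus_changed_handler = FocusChangedEventHandler {
    previous_element: Mutex::new(None),
};
// Wrap the custom handler in the type UI Automation expects, then register it.
let focus_changed_handler = UIFocusChangedEventHandler::from(focus_changed_handler);
automation.add_focus_changed_event_handler(None, &focus_changed_handler).unwrap();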

Its handle method fires every time focus moves. Or rather, every time Windows thinks focus moves, which turned out to be several times for a single tab press. Without a guard, Aria would announce the same button two or three times in a row, stuttering over itself. So I keep the previous element around and bail early when nothing actually changed:

impl CustomFocusChangedEventHandler for FocusChangedEventHandler {
    fn handle(&self, sender: &uiautomation::UIElement) -> uiautomation::Result<()> {
        let mut previous = self.previous_element.lock().unwrap();

        // Windows can report the same focus change several times; skip repeats.
        if let Some(prev_elem) = previous.as_ref() {
            if prev_elem.get_runtime_id()? == sender.get_runtime_id()? {
                return Ok(());
            }
        }

        *previous = Some(sender.clone());

        // Propagate lookup failures instead of panicking inside the callback.
        let name = sender.get_name()?.trim().to_string();
        let control_type = sender.get_control_type()?;

        log::info!("Focus changed to: {}", name);

        Ok(())
    }
}

That control_type turned out to be more useful than it looks. It tells me not just what to say, but how the moment should feel.

Speaking

The text-to-speech side is refreshingly boring. Create a SpeechSynthesizer, hand it a string, get an audio stream back, play it:

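// Assumed import paths: these are the WinRT types as projected by the `windows` crate.
use windows::core::HSTRING;
use windows::Media::Core::MediaSource;
use windows::Media::Playback::MediaPlayer;
use windows::Media::SpeechSynthesis::SpeechSynthesizer;

let synthesizer = SpeechSynthesizer::new()?;

// Synthesize the text and block until the audio stream is ready.
let stream = synthesizer
    .SynthesizeTextToStreamAsync(&HSTRING::from("Hello, World!"))?
    .get()?;

// Wrap the stream in a media source, tagged with the stream's own content type.
let source = MediaSource::CreateFromStream(
    &stream,
    &HSTRING::from(stream.ContentType()?),
)?;

let player = MediaPlayer::new()?;
player.SetSource(&source)?;

// Play() returns immediately; playback continues in the background.
player.Play()?;

// Placeholder: keep the process alive long enough for the sentence to finish.
std::thread::sleep(std::time::Duration::from_secs(2));

Ok(())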

The sleep at the end is a placeholder, not a design. It just keeps the program alive long enough to hear the sentence finish. Making speech interrupt and queue properly came later.
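One way to retire a fixed sleep is to wait for a "playback finished" signal instead. The sketch below shows only that pattern, using the standard library. `start_playback` is a hypothetical stand-in that fakes audio with a timer thread; in the real program the signal would have to come from the player itself, for example from the media player's end-of-playback event.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for starting playback. Returns a receiver that
/// gets exactly one message once the (pretend) audio has finished.
fn start_playback(length: Duration) -> mpsc::Receiver<()> {
    let (done_tx, done_rx) = mpsc::channel();
    thread::spawn(move || {
        thread::sleep(length); // pretend the sentence is being spoken
        let _ = done_tx.send(()); // the receiver may already be gone; ignore that
    });
    done_rx
}

/// Block until playback reports completion, or give up after `limit`.
/// Returns true if the done signal arrived in time.
fn wait_for_playback(done: &mpsc::Receiver<()>, limit: Duration) -> bool {
    done.recv_timeout(limit).is_ok()
}
```

Calling `wait_for_playback(&done, Duration::from_secs(10))` returns as soon as the sentence ends rather than after a guessed delay, and the timeout keeps a stuck player from hanging the program.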

Earcons

When focus lands on a text field, Aria plays a sound:

if control_type == ControlType::Edit || control_type == ControlType::ComboBox {
    play_sound(INPUT_FOCUSSED_SOUND);
}

This is the detail I ended up caring about most. That little sound is an earcon: a sound that means something without saying anything. The tick of a turn signal. The beep of a truck backing up. Nobody explains them to you, and everybody understands them.

Before a word is spoken, a sound has already told you where you are.

A sighted user gets this for free: a text field looks like a text field, and your fingers know to type before you've finished thinking. Take the visuals away and that instant, wordless sense of place has to come from somewhere else. Sound is the obvious place to put it.

The implementation is deliberately dumb. Three sounds (startup, shutdown, input-focused), baked straight into the binary as byte arrays. No asset pipeline, no config:

use rodio::{Decoder, OutputStream, Sink};
use std::io::Cursor;

pub const STARTUP_SOUND: &[u8] = include_bytes!("../assets/sounds/startup.mp3");
pub const SHUTDOWN_SOUND: &[u8] = include_bytes!("../assets/sounds/shutdown.mp3");
pub const INPUT_FOCUSSED_SOUND: &[u8] = include_bytes!("../assets/sounds/input-focussed.mp3");

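pub fn play_sound(sound_data: &[u8]) {
    // Copy the bytes so the spawned thread owns its data.
    let sound_data_clone = sound_data.to_vec();

    // Play on a separate thread so the caller never blocks on audio.
    std::thread::spawn(move || {
        // Keep `_stream` bound: dropping the OutputStream ends playback.
        let (_stream, stream_handle) = OutputStream::try_default().unwrap();
        let sink = Sink::try_new(&stream_handle).unwrap();

        let cursor = Cursor::new(sound_data_clone);
        let source = Decoder::new(cursor).unwrap();

        sink.append(source);
        // Blocks only this thread until the sound has finished.
        sink.sleep_until_end();
    });
}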

Then playing one is a single call:

play_sound(STARTUP_SOUND);
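To connect the earcon back to the focus handler, the decision "which sound, which words" can be pulled into a small pure function that is easy to test without Windows. This is a sketch under assumptions, not Aria's actual code: `ControlKind`, `Earcon`, `FocusCue` and `cue_for` are hypothetical names, `ControlKind` stands in for the UI Automation control type, and the "unlabeled" fallback for nameless elements is an invented choice.

```rust
/// Hypothetical stand-in for the few control types this sketch cares about.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ControlKind {
    Button,
    Edit,
    ComboBox,
    Other,
}

/// Hypothetical identifier for a sound, instead of raw bytes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Earcon {
    InputFocused,
}

/// What to play and what to say when focus lands on an element.
#[derive(Debug, PartialEq, Eq)]
struct FocusCue {
    earcon: Option<Earcon>,
    speech: String,
}

fn cue_for(name: &str, kind: ControlKind) -> FocusCue {
    // Same rule as the `if` above: only text-entry controls get a sound.
    let earcon = match kind {
        ControlKind::Edit | ControlKind::ComboBox => Some(Earcon::InputFocused),
        ControlKind::Button | ControlKind::Other => None,
    };

    // Assumption: say something explicit rather than nothing for empty names.
    let name = name.trim();
    let speech = if name.is_empty() {
        "unlabeled".to_string()
    } else {
        name.to_string()
    };

    FocusCue { earcon, speech }
}
```

A caller would play `cue.earcon` first and speak `cue.speech` second, so the sound arrives before the words.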

Listening for keys

Keyboard input works the same way focus does. You hand mki a closure and it calls you back:

mki::bind_any_key(Action::handle_kb(|key| {
    use Keyboard::*;

    match key {
        Escape => TTS::stop(true).unwrap(),
        _ => on_keypress(format!("{:?}", key)),
    }
}));

Escape cuts off whatever Aria is saying, which makes it the single most important key on the keyboard when a computer won't stop talking at you. Every other key gets read back as you press it.
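The callback above decides and acts in one place. A possible refinement, sketched here with hypothetical names, is to split the decision into a pure function so it can be tested without a keyboard hook. `Key` is a tiny stand-in for mki's much larger `Keyboard` enum, and `KeyAction` and `action_for` are inventions for illustration.

```rust
/// Hypothetical subset of keys; the real key enum is far larger.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Key {
    Escape,
    Enter,
    Space,
    A,
}

/// What the screen reader should do in response to a key press.
#[derive(Debug, PartialEq, Eq)]
enum KeyAction {
    StopSpeech,
    Announce(String),
}

fn action_for(key: Key) -> KeyAction {
    match key {
        // Escape always silences the reader.
        Key::Escape => KeyAction::StopSpeech,
        // Everything else is echoed using the key's Debug name,
        // mirroring the format!("{:?}", key) call in the callback.
        other => KeyAction::Announce(format!("{:?}", other)),
    }
}
```

The keyboard callback then shrinks to a `match` on the returned `KeyAction`, with the side effects (stopping speech, speaking the name) kept at the edge.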

Putting it together

The rest was wiring: making speech asynchronous so it could be interrupted, a small command-line interface, a config file for the parts people would actually want to change. Each piece was simple on its own. The interesting part was always the seam between them: what should happen the instant focus moves while a sentence is still being read.
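One plausible answer to that seam, sketched with the standard library only, is that a focus change discards everything still waiting to be spoken while ordinary messages line up behind each other. The names `SpeechQueue`, `Urgency`, `say` and `take_next` are hypothetical, and the policy itself is an assumption rather than a description of how Aria resolves it.

```rust
use std::collections::VecDeque;

/// How a new utterance relates to speech that is already pending.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Urgency {
    /// Wait for everything already queued.
    Queue,
    /// Drop everything pending and speak this next.
    Interrupt,
}

#[derive(Debug, Default)]
struct SpeechQueue {
    pending: VecDeque<String>,
    /// Bumped on every interrupt. A playback loop can remember the value it
    /// started with and stop early once the number has moved on.
    generation: u64,
}

impl SpeechQueue {
    fn say(&mut self, text: &str, urgency: Urgency) {
        if urgency == Urgency::Interrupt {
            self.pending.clear();
            self.generation += 1;
        }
        self.pending.push_back(text.to_string());
    }

    /// Next utterance to speak, if any.
    fn take_next(&mut self) -> Option<String> {
        self.pending.pop_front()
    }

    fn generation(&self) -> u64 {
        self.generation
    }
}
```

With this shape, the focus handler would call `say(..., Urgency::Interrupt)` and everything else would use `Urgency::Queue`. Cutting off the sentence that is already playing is left to whatever drives the audio, which is where the generation counter comes in.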

I didn't set out to ship a screen reader, and Aria isn't one you'd want to rely on. But building even this much changed how I think about the work. The gaps you can gloss over on screen (a button with no name, a field that announces itself as nothing) become the entire experience when sound is all you have.

You don't really understand an interface until you've used it with your eyes closed.

The build and the source (MIT) are on GitHub.