I remember reading a thread some time ago about the possibility of having a large language model based screen reader running natively on hardware like a Mac or iPhone, in order to produce high-quality artificially generated voices for screen readers. At the time, the conclusion was that it would not be practical, mainly because it would require constant internet connectivity to generate the speech, and because artificial voice platforms charged by the token, and the average screen reader goes through many thousands of these each day.
With the quantum leaps in artificial intelligence technology, and the ability to run fairly comprehensive large language models on mainstream hardware, do you think we will see artificial voices for screen readers soon?
For example, some of Google's new artificially generated voices are pretty realistic.
I wouldn't want to use these all the time; I prefer my screen reader to sound robotic for the most part, but I can definitely think of use cases where I would find the technology useful.
Comments
agentic screen readers
I would be more thrilled at the idea of my screen reader being able to act on instructions, like the OpenAI Operator or the browser extension that Google previewed back in December, but at a more localised level. That would kind of be the end of inaccessible interfaces, unlabeled buttons, and clickable elements that are not friendly to the normal screen reader.
Blind will become even less…
Blind will become even less tech literate with that; I don't like this idea...
You're conflating a couple…
You're conflating a couple of things here. Large language models like GPT-4 or Google's Gemini aren't exactly ideal for this, since they're non-deterministic and they're not meant for this. You'd be much better served by a generative speech synth, similar to what 11Labs does, or OpenAI's Voice Engine. Heck, we've already seen Voice Engine hallucinate a couple of times, sometimes censoring words with beeps and other times just saying them, particularly profanity. If you mean running an open-source text-to-speech engine, or any sort of text-to-speech engine that's generative in nature, it's entirely possible; impractical, but entirely possible. It's a question of hardware and optimization more than a question of capability, since we're more than capable of synthesizing natural speech in a generative manner. But it takes time, and when using a screen reader, latency is not ideal.
While generative AI is what powers large language models, image models, etc., it isn't a large language model. A large language model is applying the concept of generative AI to producing text specifically. Of course, you can apply the same thing to audio and get something that synthesizes speech in a generative manner, and that's what 11Labs does. Regardless, I wouldn't be confident using it day-to-day because, as already stated, they're non-deterministic. That means their outputs can be pretty random sometimes, and needless to say, dealing with hallucinations when using a computer isn't great for productivity. I don't think most screen reader users want natural speech anyway. What they really want is speed, and lots of it, too. It doesn't matter how natural something sounds. If you've gotten used to it, you practically interpret it as data instead of sounds. If you ask any seasoned screen reader user if they even take note of the voice, I guarantee you they'll tell you that no, they don't in fact recognize the voice as a voice with gendered attributes and specific resonances, etc. It's just a stream of text, basically.
If you want neural text-to…
If you want neural text-to-speech in general for books or whatever, you don't need to rely on generative AI. Look up Piper or the Sonata Voices for NVDA add-on. Microsoft Edge also has really high quality neural voices. They're non-generative, so they're not quite as emotive or as natural as something like 11 Labs, but they don't hallucinate, and if you present the same sentence to them 5, 6, 10, 20 times, they won't say it a little differently each time. Plus, they're a lot more efficient to run. I can run the local copy of Microsoft Edge's neural voices with insane smoothness on my Windows machine; it's just like 50 milliseconds of extra latency. Without major advances to the architecture, though, they'll still be in the uncanny valley. Not like, for instance, ChatGPT's voice mode, which, yeah, if I heard that on the phone, I would probably not recognize as AI, since most of the issues are technical (missing high frequencies, etc.) rather than emotive or in how it says things.
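For anyone curious how lightweight these local engines can be, here's a minimal sketch of driving Piper from Python. It assumes you have the piper binary on your path (for example from the piper-tts package) and have downloaded a model; the model filename and sample text are placeholders, not any commenter's actual setup, and the exact CLI flags may differ between Piper versions.

    # Rough sketch: piping text through the Piper CLI for local, offline
    # neural text-to-speech. Model name and output path are examples only.
    import subprocess

    text = "Local neural text-to-speech, no internet connection required."
    subprocess.run(
        ["piper", "--model", "en_US-lessac-medium.onnx", "--output_file", "out.wav"],
        input=text.encode("utf-8"),
        check=True,
    )
    print("Wrote out.wav")

The point is simply that nothing here needs a server: the text goes in on standard input and a WAV file comes out, all on the local machine.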
I think large action model…
I think large action models were theorized with the AI Pin and Humane, but... well, we know what happened with both. :) Thanks to the two comments above mine; I couldn't have explained it better myself.
I think both Humane and…
I think both Humane and Rabbit products could be made more accessible, but it seems like their current state is more of a design choice than anything else.
It's a bit ironic, considering Humane claimed to be committed to accessibility when I used to interact with them on Discord. They certainly "talked the talk," but unfortunately, they didn't "walk the walk."
I haven't been active on their server for a while now. The last time I checked, it was mostly people trying to stay optimistic, but you could tell even they were starting to lose hope.
As for CosmOS, it introduces a completely new way of interacting with a computer. Traditional screen readers, like the ones many of us are used to, don't really align with this kind of design.
AI voices
AI voices are more than likely going to be a thing in time; how reliable and responsive they are will be a large question in their adoption. If nothing else, they can be trained by the developer and then run locally, so internet connectivity isn't always necessary.
Right now I wouldn't trust an operator AI not to hallucinate and press the wrong button, but in time, perhaps. It might be an interesting option for gaming accessibility, if cheaters don't result in measures to prevent this. Maybe not an option for competitive online games, but for the kind of games that thrive on mod support it could be viable.
My problem with human-like screen reader voices
Do you even take note of how natural the speech is at this point? Like, if you've used a screen reader for any length of time, you don't take note of the voice at all. If you're used to it, that is. You turn the rate high enough, and suddenly all those pesky artifacts disappear like magic. Screen readers aren't tools for entertainment, they're for efficiency. The only way we can be as efficient as our sighted colleagues is if we turn the rate up. Their goal isn't to read something in a human manner. Their goal is to present information as quickly as possible. If you want something read in a human-like voice, you can use a text-to-speech program or something. And we already have that. Eleven Labs. Microsoft Edge. I'm pretty sure Speech Central also does it. I just don't see the utility of neural voices in screen readers.
They are so slow that even…
They are so slow that even now Eloquence is still very, very popular. That tells you a lot.
It's data we need, not slick…
It's data we need, not slick human-like voices... At least, not most of the time. Fast and clean is ideal. I do wish we could use the neural ones from Microsoft; they're very pleasant to listen to. But as for human-like speech for VoiceOver, it's already speaking far faster than a human would. Consistency is the key with screen reader voices, so we can learn them. Humanistic variation would be terrible.
Saying that... if we had access to Eleven Labs to read books, rather than navigating our devices, that would be great. I'd like to see Kindle latch on to this in its app.
It's True!
I prefer my screen reader to be fast first and expressive second. I stumbled across a documentary about the NVDA founders, and the guy was using eSpeak with rate boost on, probably at rate 70. And I thought I was fast at rate 55 with boost on. Gotta increase that thing once in a while these days.
But yes, when it's about comprehension, understanding each and every word, figuring out which AppleVis commenter offended whom with what exact words, I might prefer an expressive voice, something like Microsoft's neural voices, or even what 11 Labs offers. I even maintain a reading-mode configuration profile activated with a keyboard shortcut, which automatically switches to this nice-sounding voice and brings the speech rate down to a more human-sounding level when I want to listen to something in supreme detail, all with one keyboard shortcut. It switches back with the same shortcut.
Same, or something similar…
Same, or something similar. Critical listening and simply trying to parse information to get to the crux of the data are quite different things.
Quantity vs Quality
A couple of things.
First, don't knock the desire to use a human-like voice as a daily driver for TTS. There are all kinds of computer cowboys and keyboard samurai. There are also all kinds of hearing conditions.
While I will agree that nothing gets you the information as concisely as something like Eloquence or eSpeak, there are situations where taking in information from a human-like TTS voice is more comprehensible than something more robotic.
Personally, I use the old Vocalizer voices for NVDA as my daily, and Eloquence when working with anything code-related.
Auditory Learners
Or some kind of label like that. Once, long, long ago, in a high school far away, with that first Toshiba laptop that had built-in speech, I got into an argument with a girl who sat behind me in a literature class. Years before, some sort of study had supposedly shown that speeding up the rate of recorded human speech helped many people, probably blind people, process and retain the information. Thus were born those tape players that could speed up speech while keeping it at roughly the same pitch.
For some reason, I told her that I turned the speech rate up to read things on my computer.
She said, "No! People aren't dogs!"
I handed her the headphones and had the computer read something.
She still didn't believe me.
My thought is: how fast a rate could one of these LLM or AI things, about which I am ignorant, learn to generate understandable speech? In other words, could the AI make machine speech extremely fast and also not irritating to a user as that user gives it feedback, like it's too choppy, etc.? Could we have finely tuned, ultra-high-rate speech to truly make people into dog listeners, whatever she meant by that?
Here's the thing
We won't know what AI voices will be capable of until enough research has gone into them. It's entirely possible they could eventually become as consistent as Eloquence, with the ability to speak just as fast, or maybe not. It's worth experimenting with. There's also the use case where you might want to read fiction, such as out-of-copyright or freely released stuff, using your screen reader, and in these cases, if it's for pleasure, using a slower, more natural voice may make sense. It also depends what the goal of the training is, since it could be trained to sound like Eloquence just as readily as to sound like a human.
I wonder if the "not dogs" thing was her still insisting fast speech would be ultrasonic; I've no idea on that front. Some people just don't make sense.
AtIcosa
Training the AI voice! That's kind of what I was trying to get at. Thanks.
The AI could tune and mold itself to what the listener wants and needs in an interactive way without the listener having to be a programming guru. Although, you have to wonder if both the listener and the AI are training each other, like with the dog/owner situation...
Okay, but why train…
Okay, but why train something to sound like a formant-synthesis-based text-to-speech voice which can go at high speech rates when we already have Eloquence and eSpeak? Like, it feels like trying to emulate a piece of hardware you already have. Why do that? Generative TTS is more than capable of creating natural speech at a fairly low latency, and I think that's its niche. You can't really broaden it past that, because you're just accomplishing the same goal with more overhead.
For the record, I use Samantha at rate 100. In a quick test with the say command and the time command, which lets you time how long something takes to finish, I get a WPM of 700, sometimes 710, etc., testing on fairly short blocks of text. Of course, it might be a bit slower with larger blocks of text because of punctuation and the like, but again, I feel very confident saying it's right around the 700 WPM mark.
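For anyone who wants to try that kind of measurement themselves, here's a rough sketch that does the timing in Python rather than with the shell's time command. It assumes macOS with the say command available; the voice and sample text are placeholders, and note that say uses its own default rate (or the -r flag, in words per minute) rather than VoiceOver's rate setting.

    # Rough sketch: estimate a macOS voice's effective words-per-minute
    # by timing the say command. macOS only; voice/text are examples.
    import subprocess
    import time

    text = "The quick brown fox jumps over the lazy dog. " * 20
    voice = "Samantha"  # any installed macOS voice

    start = time.perf_counter()
    subprocess.run(["say", "-v", voice, text], check=True)
    elapsed = time.perf_counter() - start

    words = len(text.split())
    print(f"{words} words in {elapsed:.1f} s, about {words / (elapsed / 60):.0f} WPM")

As the commenter notes, short snippets will overstate the rate a bit compared with long passages full of punctuation and pauses.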
@jim pickens
Because not every person's hearing, including the brain end of the processing, is the same, as Brian pointed out. It could just as well be done with a natural-sounding voice. I'm thinking about those people on debate teams who learn to speed-talk to cram more and more of their arguments into the time limit. They don't sound like a simulated human voice on a computer that gets choppy and weird-sounding at a high rate.
It probably won't happen, but it's interesting to think about all the possibilities of an AI-based screen reader that trains itself to the user's needs.
just use EQ. EQ is basically…
just use EQ. EQ is basically a way to shape the frequency balance of an audio signal: boost certain parts, reduce certain parts, etc. Say, for instance, you have high-frequency loss; you can just compensate for that. Heck, I believe Apple does that already with their hearing aid compensation features for AirPods and such. There's really no need for AI, so long as that's what you mean. Of course, I could be misunderstanding, so feel free to clarify.
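To make the EQ idea concrete, here's a minimal sketch of one simple treble boost: mix a high-passed copy of the signal back into the original, which roughly approximates a high-shelf EQ for high-frequency hearing loss. The filenames, cutoff, and gain are placeholders, and it assumes a mono 16-bit WAV plus NumPy and SciPy installed.

    # Minimal treble-boost sketch (placeholder values throughout).
    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import butter, sosfiltfilt

    rate, audio = wavfile.read("speech.wav")   # mono 16-bit WAV assumed
    audio = audio.astype(np.float64)

    # 4th-order Butterworth high-pass at 4 kHz, applied forward and backward.
    sos = butter(4, 4000, btype="highpass", fs=rate, output="sos")
    treble = sosfiltfilt(sos, audio)

    gain_db = 6.0                              # how much extra treble to add
    boosted = audio + (10 ** (gain_db / 20) - 1) * treble

    # Normalize to avoid clipping, then write back out as 16-bit.
    boosted *= 32767 / np.max(np.abs(boosted))
    wavfile.write("speech_boosted.wav", rate, boosted.astype(np.int16))

Real hearing-aid-style compensation (like Apple's) is more sophisticated than this, but the basic idea is the same: raise the bands you hear poorly, leave the rest alone.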
Clarification
The EQ isn't what I intended; it's more like the empty spaces between sounds being reduced, but the sounds of the words also being smoothed out. Getting rid of digital wow and flutter, maybe... I don't know what to call it. If I turn up the rate of speech on some voices, they get a choppy quality that's distracting, and they start reminding me of the old-timey tape players for the blind that sped up the rate of a recorded voice without raising the pitch. As I said, there are debaters, and I'm sure other public speakers, who learn to speak very rapidly without sounding like that. So you throw in that some people can't handle the sound of a Klatt voice like eSpeak no matter what you do to it, or even that a user might just not like the sound of any of the voices on a device, and the AI would make it possible to adapt a screen reader to what a user needs without the user having to be a programmer. I'm talking about a fully AI screen reader.
Voices and Noises
I believe what OldBear is talking about is Noise, with a capital "N".
Regarding Samantha, the high-efficiency version of Samantha (downloadable from Freedom Scientific) is really good, even at higher rates; or there's the standard Samantha voice on iOS, if that is your preference. :)
Why replicate existing options
I'm not suggesting someone immediately jump into producing a product based on AI voices, but it's entirely valid for researchers to mess with these things, because you never know what you might find. The increased overhead might be acceptable if they find they can replicate the speed of Eloquence with more clarity, or maybe not, but we won't know until we try. We're approaching "why change it if it works" territory, and at that point we'd never have developed cars because horses exist. Again, I'm not suggesting a new product or a full change of the industry, just that experimentation happens, because why not?
There are also times I've had to switch to a more realistic voice when I've been ill with a really bad headache; a niche use, but still a valid one.