I remember reading a thread some time ago about the possibility of having a large language model based screen reader running natively on hardware like a Mac or iPhone, in order to produce high-quality artificially generated voices for screen readers. At the time, the conclusion was that it would not be practical, mainly because it would require constant internet connectivity to generate the speech, and because artificial voice platforms charged by the token, and the average screen reader goes through many thousands of these each day.
With the quantum leaps in artificial intelligence technology, and the ability to run fairly comprehensive large language models on mainstream hardware, do you think we will see artificial voices for screen readers soon?
For example, some of Google's new artificially generated voices are pretty realistic.
I wouldn't want to use these all the time; I prefer my screen reader to sound robotic for the most part, but I can definitely think of use cases where I would find the technology useful.
Comments
agentic screen readers
I would be more thrilled at the idea of my screen reader being able to act on instructions, like the OpenAI Operator or the browser extension that Google previewed back in December, but at a more localised level. That would kind of be the end of inaccessible interfaces, unlabeled buttons, and clickable elements that are not friendly to the normal screen reader.
Blind will become even less…
Blind will become even less tech literate with that; I don't like this idea...
You're conflating a couple…
You're conflating a couple of things here. Large language models like GPT-4 or Google's Gemini aren't exactly ideal for this, since they're non-deterministic and they're not meant for this. You'd be much better served by a generative speech synth, similar to what 11Labs does, or OpenAI's Voice Engine. Heck, we've already seen Voice Engine hallucinate a couple of times, sometimes censoring words with beeps and other times just saying them, particularly profanity. If you mean running an open-source text-to-speech engine, or any sort of text-to-speech engine that's generative in nature, it's entirely possible; impractical, but entirely possible. It's a question of hardware and optimization more than a question of capability, since we're more than capable of synthesizing natural speech in a generative manner. But it takes time, and when using a screen reader, latency is not ideal.
While generative AI is what powers large language models, image models, etc., it isn't a large language model. A large language model is applying the concept of generative AI to producing text specifically. Of course, you can apply the same thing to audio and get something that synthesizes speech in a generative manner, and that's what 11Labs does. Regardless, I wouldn't be confident using it day-to-day because, as already stated, they're non-deterministic. That means their outputs can be pretty random sometimes, and needless to say, dealing with hallucinations when using a computer isn't great for productivity. I don't think most screen reader users want natural speech anyway. What they really want is speed, and lots of it, too. It doesn't matter how natural something sounds. If you've gotten used to it, you practically interpret it as data instead of sounds. If you ask any seasoned screen reader user if they even take note of the voice, I guarantee you they'll tell you that no, they don't in fact recognize the voice as a voice with gendered attributes and specific resonances, etc. It's just a stream of text, basically.
If you want neural text-to…
If you want neural text-to-speech in general for books or whatever, you don't need to rely on generative AI. Look up Piper or the Sonata Voices for NVDA add-on. Microsoft Edge also has really high quality neural voices. They're non-generative, so they're not quite as emotive or as natural as something like 11 Labs, but they don't hallucinate, and if you present the same sentence to them 5, 6, 10, 20 times, they won't say it a little differently each time. Plus, they're a lot more efficient to run. I can run the local copy of Microsoft Edge's neural voices with insane smoothness on my Windows machine; it's just like 50 milliseconds of extra latency. Without major advances to the architecture, though, they'll still be in the uncanny valley. Not like, for instance, ChatGPT's voice mode, which, yeah, if I heard that on the phone, I would probably not recognize as AI, since most of the issues are technical (missing high frequencies, etc.) rather than emotive or in how it says things.
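For anyone curious how lightweight these local engines can be, here's a minimal sketch of driving Piper from Python. It assumes you have the piper binary on your path (for example from the piper-tts package) and have downloaded a model; the model filename and sample text are placeholders, not any commenter's actual setup, and the exact CLI flags may differ between Piper versions.

    # Rough sketch: piping text through the Piper CLI for local, offline
    # neural text-to-speech. Model name and output path are examples only.
    import subprocess

    text = "Local neural text-to-speech, no internet connection required."
    subprocess.run(
        ["piper", "--model", "en_US-lessac-medium.onnx", "--output_file", "out.wav"],
        input=text.encode("utf-8"),
        check=True,
    )
    print("Wrote out.wav")

The point is simply that nothing here needs a server: the text goes in on standard input and a WAV file comes out, all on the local machine.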
I think large action model…
I think large action models were theorized with the AI Pin and Humane, but... well, we know what happened with both. :) Thanks to the two comments above mine; I couldn't have explained it better myself.
I think both Humane and…
I think both Humane and Rabbit products could be made more accessible, but it seems like their current state is more of a design choice than anything else.
It's a bit ironic, considering Humane claimed to be committed to accessibility when I used to interact with them on Discord. They certainly "talked the talk," but unfortunately, they didn't "walk the walk."
I haven't been active on their server for a while now. The last time I checked, it was mostly people trying to stay optimistic, but you could tell even they were starting to lose hope.
As for CosmOS, it introduces a completely new way of interacting with a computer. Traditional screen readers, like the ones many of us are used to, don't really align with this kind of design.
AI voices
AI voices are more than likely going to be a thing in time; how reliable and responsive they are will be a large question in their adoption. If nothing else, they can be trained by the developer and then run locally, so internet connectivity isn't always necessary.
Right now I wouldn't trust an operator AI not to hallucinate and press the wrong button, but in time, perhaps. It might be an interesting option for gaming accessibility, if cheaters don't result in measures to prevent this. Maybe not an option for competitive online games, but for the kind of games that thrive on mod support it could be viable.
My problem with human-like screen reader voices
Do you even take note of how natural the speech is at this point? Like, if you've used a screen reader for any length of time, you don't take note of the voice at all. If you're used to it, that is. You turn the rate high enough, and suddenly all those pesky artifacts disappear like magic. Screen readers aren't tools for entertainment, they're for efficiency. The only way we can be as efficient as our sighted colleagues is if we turn the rate up. Their goal isn't to read something in a human manner. Their goal is to present information as quickly as possible. If you want something read in a human-like voice, you can use a text-to-speech program or something. And we already have that. Eleven Labs. Microsoft Edge. I'm pretty sure Speech Central also does it. I just don't see the utility of neural voices in screen readers.
They are so slow that even…
They are so slow that even now Eloquence is still very, very popular. That tells you a lot.
It's data we need, not slick…
It's data we need, not slick human-like voices... At least, not most of the time. Fast and clean is ideal. I do wish we could use the neural ones from Microsoft; they're very pleasant to listen to. But as for human-like speech for VoiceOver, it's already speaking far faster than a human would. Consistency is the key with screen reader voices, so we can learn them. Humanistic variation would be terrible.
Saying that... if we had access to Eleven Labs to read books, rather than navigating our devices, that would be great. I'd like to see Kindle latch on to this in its app.
It's True!
I prefer my screen reader to be fast first and expressive second. I stumbled across a documentary about the NVDA founders, and the guy was using eSpeak with rate boost on, probably at rate 70. And I thought I was fast at rate 55 with boost on. Gotta increase that thing once in a while these days.
But yes, when it's about comprehension, understanding each and every word, figuring out which AppleVis commenter offended whom with what exact words, I might prefer an expressive voice, something like Microsoft's neural voices, or even what 11 Labs offers. I even maintain a reading-mode configuration profile activated with a keyboard shortcut, which automatically switches to this nice-sounding voice and brings the speech rate down to a more human-sounding level when I want to listen to something in supreme detail, all with one keyboard shortcut. It switches back with the same shortcut.
Same, or something similar…
Same, or something similar. Critical listening and simply trying to parse information to get to the crux of the data are quite different things.
Quantity vs Quality
A couple of things.
First, don't knock the desire to use a human-like voice as a daily driver for TTS. There are all kinds of computer cowboys and keyboard samurai. There are also all kinds of hearing conditions.
While I will agree that nothing gets you the information as concisely as something like Eloquence or eSpeak, there are situations where taking in information from a human-like TTS voice is more comprehensible than something more robotic.
Personally, I use the old Vocalizer voices for NVDA as my daily, and Eloquence when working with anything code-related.
Auditory Learners
Or some kind of label like that. Once, long, long ago, in a high school far away, with that first Toshiba laptop that had built-in speech, I got into an argument with a girl who sat behind me in a literature class. Years before, some sort of study had supposedly shown that speeding up the rate of recorded human speech helped many people, probably blind people, process and retain the information. Thus were born those tape players that could speed up speech while keeping it at roughly the same pitch.
For some reason, I told her that I turned the speech rate up to read things on my computer.
She said, "No! People aren't dogs!"
I handed her the headphones and had the computer read something.
She still didn't believe me.
My thought is: how fast a rate could one of these LLM or AI things, about which I am ignorant, learn to generate understandable speech? In other words, could the AI make machine speech extremely fast and also not irritating to a user as that user gives it feedback, like it's too choppy, etc.? Could we have finely tuned, ultra-high-rate speech to truly make people into dog listeners, whatever she meant by that?
Here's the thing
We won't know what AI voices will be capable of until enough research has gone into them. It's entirely possible they could eventually become as consistent as Eloquence, with the ability to speak just as fast, or maybe not. It's worth experimenting with. There's also the use case where you might want to read fiction, such as out-of-copyright or freely released stuff, using your screen reader, and in these cases, if it's for pleasure, using a slower, more natural voice may make sense. It also depends what the goal of the training is, since it could be trained to sound like Eloquence just as readily as to sound like a human.
I wonder if the "not dogs" thing was her still insisting fast speech would be ultrasonic; I've no idea on that front. Some people just don't make sense.
AtIcosa
Training the AI voice! That's kind of what I was trying to get at. Thanks.
The AI could tune and mold itself to what the listener wants and needs in an interactive way without the listener having to be a programming guru. Although, you have to wonder if both the listener and the AI are training each other, like with the dog/owner situation...
Okay, but why train…
Okay, but why train something to sound like a formant-synthesis-based text-to-speech voice which can go at high speech rates when we already have Eloquence and eSpeak? Like, it feels like trying to emulate a piece of hardware you already have. Why do that? Generative TTS is more than capable of creating natural speech at a fairly low latency, and I think that's its niche. You can't really broaden it past that, because you're just accomplishing the same goal with more overhead.
For the record, I use Samantha at rate 100. In a quick test with the say command and the time command, which lets you time how long something takes to finish, I get a WPM of 700, sometimes 710, etc., testing on fairly short blocks of text. Of course, it might be a bit slower with larger blocks of text because of punctuation and the like, but again, I feel very confident saying it's right around the 700 WPM mark.
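For anyone who wants to try that kind of measurement themselves, here's a rough sketch that does the timing in Python rather than with the shell's time command. It assumes macOS with the say command available; the voice and sample text are placeholders, and note that say uses its own default rate (or the -r flag, in words per minute) rather than VoiceOver's rate setting.

    # Rough sketch: estimate a macOS voice's effective words-per-minute
    # by timing the say command. macOS only; voice/text are examples.
    import subprocess
    import time

    text = "The quick brown fox jumps over the lazy dog. " * 20
    voice = "Samantha"  # any installed macOS voice

    start = time.perf_counter()
    subprocess.run(["say", "-v", voice, text], check=True)
    elapsed = time.perf_counter() - start

    words = len(text.split())
    print(f"{words} words in {elapsed:.1f} s, about {words / (elapsed / 60):.0f} WPM")

As the commenter notes, short snippets will overstate the rate a bit compared with long passages full of punctuation and pauses.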
@jim pickens
Because not every person's hearing, including the brain end of the processing, is the same, as Brian pointed out. It could just as well be done with a natural-sounding voice. I'm thinking about those people on debate teams who learn to speed-talk to cram more and more of their arguments into the time limit. They don't sound like a simulated human voice on a computer that gets choppy and weird-sounding at a high rate.
It probably won't happen, but it's interesting to think about all the possibilities of an AI-based screen reader that trains itself to the user's needs.
just use EQ. EQ is basically…
just use EQ. EQ is basically a way to shape the frequency balance of an audio signal: boost certain parts, reduce certain parts, etc. Say, for instance, you have high-frequency loss; you can just compensate for that. Heck, I believe Apple does that already with their hearing aid compensation features for AirPods and such. There's really no need for AI, so long as that's what you mean. Of course, I could be misunderstanding, so feel free to clarify.
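To make the EQ idea concrete, here's a minimal sketch of one simple treble boost: mix a high-passed copy of the signal back into the original, which roughly approximates a high-shelf EQ for high-frequency hearing loss. The filenames, cutoff, and gain are placeholders, and it assumes a mono 16-bit WAV plus NumPy and SciPy installed.

    # Minimal treble-boost sketch (placeholder values throughout).
    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import butter, sosfiltfilt

    rate, audio = wavfile.read("speech.wav")   # mono 16-bit WAV assumed
    audio = audio.astype(np.float64)

    # 4th-order Butterworth high-pass at 4 kHz, applied forward and backward.
    sos = butter(4, 4000, btype="highpass", fs=rate, output="sos")
    treble = sosfiltfilt(sos, audio)

    gain_db = 6.0                              # how much extra treble to add
    boosted = audio + (10 ** (gain_db / 20) - 1) * treble

    # Normalize to avoid clipping, then write back out as 16-bit.
    boosted *= 32767 / np.max(np.abs(boosted))
    wavfile.write("speech_boosted.wav", rate, boosted.astype(np.int16))

Real hearing-aid-style compensation (like Apple's) is more sophisticated than this, but the basic idea is the same: raise the bands you hear poorly, leave the rest alone.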
Clarification
The EQ isn't what I intended; it's more like the empty spaces between sounds being reduced, but the sounds of the words also being smoothed out. Getting rid of digital wow and flutter, maybe... I don't know what to call it. If I turn up the rate of speech on some voices, they get a choppy quality that's distracting, and they start reminding me of the old-timey tape players for the blind that sped up the rate of a recorded voice without raising the pitch. As I said, there are debaters, and I'm sure other public speakers, who learn to speak very rapidly without sounding like that. So you throw in that some people can't handle the sound of a Klatt voice like eSpeak no matter what you do to it, or even that a user might just not like the sound of any of the voices on a device, and the AI would make it possible to adapt a screen reader to what a user needs without the user having to be a programmer. I'm talking about a fully AI screen reader.
Voices and Noises
I believe what OldBear is talking about is Noise, with a capital "N".
Regarding Samantha, the high-efficiency version of Samantha (downloadable from Freedom Scientific) is really good, even at higher rates; or there's the standard Samantha voice on iOS, if that is your preference. :)
Why replicate existing options
I'm not suggesting someone immediately jump into producing a product based on AI voices, but it's entirely valid for researchers to mess with these things, because you never know what you might find. The increased overhead might be acceptable if they find they can replicate the speed of Eloquence with more clarity, or maybe not, but we won't know until we try. We're approaching "why change it if it works" territory, and at that point we'd never have developed cars because horses exist. Again, I'm not suggesting a new product or a full change of the industry, just that experimentation happens, because why not?
There are also times I've had to switch to a more realistic voice when I've been ill with a really bad headache; a niche use, but still a valid one.