AI-powered 'scene description' A toy or a tool?

By Unregistered User (not verified), 29 March, 2024

Forum

Assistive Technology

Multi-modal models like GPT4 and Claude 3 are being used to provide scene descriptions to blind and visually impaired people. This technology uses the ability of AI models to analyse images and then generate natural language descriptions of the visual content.

How it works – the user sends an image to the system, and the multi-modal AI model analyses the different elements in the image, such as objects, people, scenery, and activities. It then generates a descriptive paragraph or set of sentences that aim to convey the essential visual information to the user.

For example, if the image showed a group of people having a picnic in a park, the description might say something like: "This image shows a group of five people sitting on a red and white checkered blanket in a grassy park. There are trees in the background and a sunny sky with a few clouds. The people appear to be eating food from a wicker picnic basket and smiling."

The value of this technology for blind and visually impaired individuals is that it can provide them with an improved understanding and mental picture of visual scenes that they cannot directly perceive. This can enhance their overall experience and ability to engage with visual content in various contexts, such as social media, websites, or everyday life.

However, it's important to note that there are limitations to the current state of this technology. While the descriptions generated by AI models can be quite accurate and detailed, they may not always capture every nuance or subtlety present in an image. Additionally, the descriptions can sometimes be subjective or influenced by the biases present in the training data used to develop the model.

So, to answer the question of whether this technology is a toy or a useful tool, the answer likely lies somewhere in between. While it is certainly a promising and innovative application of AI, it should not be viewed as a perfect or complete solution to the challenges faced by blind and visually impaired individuals. Rather, it can be a helpful supplementary tool that can enhance accessibility and provide additional context, but it should be used together with other assistive technologies and reasonable adjustments.

As for when this technology will become a more reliable and widespread tool, that largely depends on continued advancements in AI and multi-modal modelling capabilities. As the underlying models improve in accuracy, robustness, and the ability to capture nuanced visual details, the scene descriptions they generate will become more comprehensive and reliable. Additionally, as more real-world testing and feedback is gathered from blind and visually impaired users, the technology can be further refined and tailored to their specific needs.

In summary, while the current state of AI-powered scene description technology for the blind and visually impaired is promising, it should be viewed as a complementary tool that can enhance accessibility, but not a complete solution. As AI capabilities continue to advance, the potential of this technology to become a more reliable and widespread assistive tool will likely increase.

What do you think? A toy or a tool?

Options

Comments

The AI letting you down may be a floored assumption.

If your cats are like others, theres no way they’re going to allow this highly memable moment, Their magnum opus, the crowning glory of a life’s work. 2 cats 1 photo to go down how you expect as the lowly photographer. No no no, they’ll have been plotting this one for weeks. It’s easy to blame the AI for hallucinations as the cause of your injuries and humiliation but it seems more likely to me that it was the cats acting skills. Pretending to be asleep isn’t easy while laying in ambush and yet somehow they pulled it off. Against the odds, with Adrenalin coursing through their veins. Counting down to 0 hour. Anticipating going over the edge and yet they did it. You should be a proud cat mum not doubting the future of assistive tech. Let them bask in the glory of a jape well executed.

P.S. I’d stick all the cups and glasses in the dishwasher just to be safe if you know what I mean.

Thoughts on toys and tools.

I’m pretty sure the main improvements needed are a video feed streaming to the AI, The AI having persistent memory of whats actually happening and how to carry context forward from one moment to the next and a voice first interface. GPS and mapping data would be nice too. I think Multi modal AI is already a tool, it’s just not as powerful as it’s going to be one day. Hopefully soon. When you think about it, what blind people are missing is the ability to see. Cameras have been able to do this for well over 100 years. They started out not having an active role in the process of capturing light, instead they were just a container for some film. Then in the 90’s they started actually recording the visual data themselves when we got digital cameras. They didn’t understand what they were doing in any way but at least they were now actively doing the light capturing and recording. Then about 15 years ago, they started understanding meaning in a very rudimentary way. They got smile detection, face detection and other low level processing tricks. Very recently they started being able to put everything together with some serious brain power behind the operation. A system now exists that takes a photo, processes it and describes exactly whats happening. It understands the visual world in a way that means it can describe it to us. It’s not perfect but nor is a learner driver after their 3rd lesson. They get things wrong all the time and so does AI. It’s only a beginner though and it has 10’s of billions of dollars being spent on its training program. Once the technology exists to stream a constant feed to the brains and the brains have enough power to keep track of both meaning and change over time, then we move from describing a still image to describing the world with context and thats what our eyes can’t do for our brain. It’s not ideal because it’ll have to use language to communicate what it sees and how we should understand what it sees but I think it will be like having a sighted person, always there, always explaining, never tired or bored or irritated. Never needing to feel like we’re putting on anyone to ask what we need to know and what sighted people take for granted. A voice first interface seems like the most efficient communication to me but maybe someone far brighter will come up with something better. Maybe AI could take care of that for us. Figuring out a communication system thats fast enough and efficient enough to get far more information over to our brain than a simple voice can manage. It seems pretty certain that the next 1 - 10 years are going to change things for blind people in very fundamental ways.

A really exciting time to be blind.

Inclusion

Well for anyone who has tried Personality mode on Pixie bot it can definitely be used as a toy. But I think that is overlooking how enriching this kind of technology can be. Sure in it's current form you have to be careful about what you can use it for and it does have some serious limitations.

A couple of examples for me. I got married last year to my long-suffering sighted girlfriend. Now weddings are very, very visual affairs because it seems that they are entirely centred around the photographers. At least that's what I found out. Fortunately the ones we had were good fun but I was worried that I was going to be missing out. And after the event I was bombarded by photos on WhatsApp. For example, one was sent and the only text was "Wow!". I've given up trying to educate the members of my family about the merits of photos. But I was able to send that to Be My AI and from what it told me I could figure out that it was a photo of me, my Dad and wife and get a feel for what was going on. This sort of thing helped me avoid feel excluded from an event that was at least half supposed to be centred around me. And similarly the memes and jokes sent around that otherwise would have made me miserable or I'd have had to take a stand or whatever. I actually felt like I could take part.

But on the other hand, when we got back the 600 odd photos from the photographers, what am I honestly going to do with that? It's not realistic that I am going to look at each one in turn and send them to Be My AI. After one or two it would become a bit redundant anyway. And the AI doesn't know who was there so can't give much context. I'd really like AI to be able to process the photos and be able to convert all that information into some kind of memories of the day. Maybe it could use photos from my contacts to figure out who the people are, or maybe that's just creepy I don't know.

On a similar note, the other big event happened recently when I lost my dog. For non-dog owners this might not seem a big deal, but she was really special to me. And up until a week or two before she went I realised that all my memories of her were in photos that I could no longer see, and my sadly not great memory, and that made me incredibly upset. I was lucky enough to have the Metas and fortunately she had a few days were she perked up and I captured some videos but still it's not quite the same.

So again personally for me to be able to have something go through a load of photos and somehow extract from them something that could enhance my memories would be really good.

Again a lot of this is knowing the context of photos and who the humans and animals in them are. But there should be some information on the photos that it could know, for example, that a photo was taken at such and such a place in the Lake District and my two dogs were goofing around on the beach or something.

Maybe it's just because I'm still not really adjusted to thinking about these sorts of things as a blind person, and maybe I've just not been collecting things in a way I can use. But I'm not going to be the only one who goes blind later in life and then has this cold feeling. I remember Shaun from Double Tap talking about how he had deleted all his photos and that horrified me.

As I get older I start to realise that my memory isn't perfect and isn't getting better, so I hope this is something that AI will be able to help me with as it gets improved over time.

But I wouldn't consider any of this sort of thing a toy. Sure in its current form it's not perfect. It thought my terrier was a labrador in one photo, but I would say it is so much better than nothing and does provide me with a tiny bit of comfort.

I remember a few years ago using the Google and Samsung image recognising tools, I forget what they are called. One of them thought my whippet was an anteater, the other thought she was a pigeon. So things have improved since then.

I think a lot of blind stuff is treated as utility. How do I read this letter? How do I find out what this bottle is? How do I navigate to this place? But great though those things are, there is also a place for things that enrich your life, help you to feel included or give you comfort. These aren't completing specific tasks but just make life a bit better. I can't wait to find out where this is going.

The first time I tried Be My AI it blew me away, as I'm sure it did everyone.

I remember on the day before my wedding demonstrating it to my Mum and Dad and it gave such an incredible description of their sitting room. I could then say, ok what's in that painting over there - tell me about the characters in it, describe the boat, what kind of style is it etc etc and it was just incredible. But not only that due to the way it described them I could think "oh yes, I remember that painting" and then it triggers other childhood memories and feelings.

I know it is flawed at the moment, but I think with many things it doesn't matter as long as it gets the overall tone and the gist of it right it is still very enriching.

So all this waffle is to say that yes I think it can be used as a toy, but I think that is doing it a gross disservice. I feel so lucky to be in an age where this sort of thing exists at all.

Screw the bully’s.

Who has been bullying you? I’ve not been around much for the last few months but just don’t let it change what you do. Your posts are always thoughtful and interesting if others aren’t interested in pontificating on the future and the incredible possibilities that are ahead of us then thats their loss.

I reckon be excited.

Why the hell not. Maybe the bullish will be more right and maybe the bears will but either way. We’re going wherever we’re going. Whoever was most right or wrong isn’t important, the important thing is having fun getting there. If being excited about exciting things is what floats your boat then cool, if fear and anxiety about change at a previously unknown rate is more your thing then perhaps the talks about exciting things isn’t the right place to be but either way, don’t let other peoples fear and negativity affect how you are in the world. IMHO.

I think for me it's more of a cool toy at the moment.

But it could be very useful in the future.

I can't wait for a video feed, oh it will drain the battery like hell but my god am I going to use it to get around, a video feed with AI and GPS that's accessible to the blind on glasses would be my dream.

tool definitely!

the qualitative improvement something like BeMyAI has brought to my life has been huge, personally. Yes, it's imperfect, yes, it has to be used as a supplimentory tool for right now, and yes, it has got it's biases, but all of this doesn't take away from the utility AI has. As someone who has to interact with visual information on a regular basis, and as someone who's job demands that I be constantly aware of this visual info, the kind of indipendents this is brought is been amazing. And like couple of you pointed out, some easily implementable things like becoming context-aware through location information can take it to the next level. @MRG, as much as some will call it creepy, the idea of an app being able to spot people in my contact list has been something I've been wanting for ever since BeMyEyes debuted. Now think of such capabilities being loaded into a smart glass like Seleste, along with the ability to process live feeds. And there you have it. And I have not even started on the kind of workplace accessibility AI can bring about. So, you get me... I guess what we want in the immediate future would be a hybrid kind of app that processes info partly on-device and partly online. As in, it gathers stuff like location info, and other such context-specific info and sends it along with the image, making the describtion more usable.

little more than a novelty for now

I am excited about this stuff, but I think that it won't be truly useful until it has persistent memory and contextual awareness. That doesn't mean I don't enjoy sending pictures and hearing the descriptions. It would be amazing if it could at least pull in location data to tell me that this was a picture taken at a particular restaurant on my trip to Florida. I doubt we'll see a system which identifies people in our contacts, but, maybe, just like DAISY players and talking book libraries are restricted, maybe someone will build an app for us. Maybe Celeste will be the one.

how to use Claude and Gemini for visual tasks

Hey folks, any help with this would be appreciated. Thanks!

Where are my cats?

I'm planning on trying to teach seeing AI to recognize my cats with it's new features. Some times my old cat likes to escape and before raising an alarm, I'd like to look around to find him. :)

Someone mentioned a live video feed to the AI. I really think this is a standout feature in Seeing AI. I think it's what will set Celeste apart also.

Glasses loaded with Seeing AI

could in itself be so much of a handy kind of tech. the goodness of Seeing AI, with the ease of use that comes with the glass.

I Like It!

Definitely a tool for me. Perhaps I thought differently when I first found out about all this stuff, but now that I've used it a bit I can honestly say that it has been very helpful. I recently noticed that the latest update to Seeing AI contains some additional AI capabilities which seem to be more precise or something like that. I tried out the object description thing, but couldn't quite get it. It has something like 4 steps, and there's some sort of vibration feedback. Anyone else tried this out yet?

I've tried the new Seeing AI features.

Frankly, I think Be My Eyes is far superior. I compared the description of a birthday party in both apps and Be My AI gave me much more information, and it sounded more natural.

Getting Information

I want AI to just do it, and red me what it says without having to find the button on the screen to take a picture, and to do it without having to keep the device perfectly still. In fact, I want AI to do it better than a sighted person could do it. I was attempting to post a long explanation of the tediousness of trying to get Seeing AI to tell me the pressure information molded into the side of a hand truck tire this morning. However, trying to type an explanation and also make sense was just as tedious and frustrating. Ya, I know AI could help with that too...

Its about the physical device as well

@OldBear what you ask for is possible even now with wearable devices, say, smartglasses, loded with AI, and equipped with the right kind of 'modules'. The tedium results from our own inability to hold the device steady, click a picture, the field of view a phone camera covers etc. What would again be a small revolution would be the ability for the end-user to design custom 'modules' based on the kind of visual information they constantly need plus continuous clicking (like what's there in Seleste). For example, for some of us, color/currency identification might be very important, while for others, stuff like how many people are there at a given location, if they're facing this group of people etc might be of more use.

Still Green

AI needs to ripen a bit. And that just gave me an idea. I need to test some of these apps to see if they can detect the color of various fruit on my trees as they ripen.
Anyway, Gokul, what you said about stabilizing the camera covers most of my gripe, and I like your idea of being able to set up different functions to fit my needs. I mean in a real world situation--at least for me--where I don't have a tripod or scanner box handy. On top of that, Seeing AI doesn't utilize the side button for taking pictures, and doesn't work with my bluetooth clicker. I guess I could just take a picture with the camera app and recognize it with Seeing AI.

I feel like we’ve plateaued

I’m still waiting for realtime or close to realtime video. I’ve tested a few of these for pictures, the interfaces are a little different and I prefer one or the other for their little quirks but there is only so much you can do with describing a picture. I know a live video feed would drain my battery like crazy but it would be worth it to not have to keep taking pictures or wait for the AI to process picture after picture in pseudo realtime, they call it video but we aren’t quite there yet. Ideally this could be done on device for even closer to realtime info but I’ll keep dreaming.

Agreed but we’ll see

You aren’t wrong but I’m not giving up hope. Perhaps they have relationships behind the scenes, it's not bad PR as we’ve seen with Be My AI. Maybe they figure a $10 or $20 monthly subscription would be doable for more of us, it's certainly an easier pill to swallow than Aira’s prices so far. I fear you’re right but hope you’re wrong.

Plateauing

it seems that way because the core service IE seen describtion has been around for sometime now but there're stuff to be done, like live feed as mentioned earlier. But for the whole thing to break new ground like BeMyAI did to start with, the focus should shift to multi-sensory approaches of conveying information; and involve the community in a big way in development. Then and only then can solutions to practical, day-to-day problems be sort out.

Re: Getting information and audio description on demand

This feature is already available in the Be My Eyes app. No need to open the app or click any buttons.

Be My Eyes now has many Siri shortcuts built in. For example, without opening the app, you can use shortcuts such as:

Ask Be My Eyes (followed by your question)
Read with Be My Eyes
Describe Quickly with Be My Eyes
Describe Fully with Be My Eyes

For example, I held up my phone just now and asked "Ask Be My Eyes what is in this room and what color the rug is", and Be My Eyes responded with the contents of the room and the color and texture of the rug.

Yes, would be nice to have a video capability, but the current features are pretty nice and easy to use if one knows about them!

Give it a shot.

--Pete

Do I trust the human?

It was a disturbing thought when I thought about human verification. AI can be wrong, but is it sitting there in a weird mood, thinking I'm going to mess with the blind people today...? Or could the dice roll land you with a human that calls any color from chocolate to off white, brown?

Perhaps, AI that checks AI

It's not always with the screen description type of AI, but if I need to be absolutely sure that the red wire is red, I look at it with multiple color apps. I also sometimes check a known, red item to be sure those apps are actually giving me a correct reading. I feel an echo of panic when I think of having to ask a human, but that's probably a problem with me.
So I wonder, can you pit AI description against AI description, over and over for an image, and eventually get very close to an accurate result? Weed out a lot of the AI aberrations. Train yourself to interpret the AI results, like looking at something from many angles. In a sense, narrow your eyes like an artist, and pick out the shapes and shades from the object your trying to understand.
As an example, I brought up in some other thread here a picture of a hawk in my camera roll. VO AI says it's an owl standing on a pile of trash. Seeing AI says it's a bird standing on a fish, though, it has also said it is a hawk. Merlin Bird ID says it is a Cooper's hawk. A human said it is a bird with a smaller bird in its talons on the ground and it is probably eating it.
None of these gave me all the information, but putting it together, I am very certain it is a Cooper's hawk, which are common around here, and it has preyed on a smaller bird, probably a dove or pigeon. I do note that AI does not identify the correct object the hawk is standing on.
* As a side note, I can almost completely rule out that the hawk is standing on a fish because of where I live, but it is still useful that it interprets the hawk as standing on an animal.

Context sensitivity

@OldBear this is where context/location sensitivity as mentioned at several of these discussions could come into play. If the AI model were able to use your location/the location information in the picture while interpretting it, it could make a more informed comment as to the bird variety etc.

Interactive beyond further questions

Yes, Merlin Bird ID does use my location. I have no idea if Seeing AI uses my location, but Micro Soft probably has that information, as does Apple for Voice Over.
Can you tell any of these AI apps information and have it reanalyze the image? For example, telling it the hawk is probably not standing on a fish because I live in such and such desert.

I think @mr grieves pretty…

I think @mr grieves pretty much summed up what I was going to say. I'm not one to waffle on, lol. At its current stage, I think it's a bit of both.

If all the processing were done on the device, I would probably use it more due to the speed. Even standalone AI devices have to go through the whole process of taking a picture, sending it up to the cloud, analyzing, and describing the image. And even after all that, you still have to deal with hallucinations.

Speaking of pictures, I just received one my brother took from his office window at work.

The picture shows a panoramic view of a city skyline under a partly cloudy sky. In the foreground, there are lower buildings with flat roofs, some with visible industrial equipment. A dense cluster of mid-rise buildings is seen further back, leading to a distant skyline with numerous high-rise buildings. Notable features include a large bridge and a tower with a spire, possibly a landmark, in the far background. Greenery is interspersed throughout the urban area, with trees and open spaces visible between the buildings. The overall impression is of a bustling, modern city with a mix of architectural styles and abundant green spaces.

That bridge that it's describing in the distance is the Sydney Harbor Bridge.

But the AI isn't going to tell you the name of the bridge, or maybe it depends.

When someone was demoing the AI pin, it mentioned the Golden Gate Bridge in the distance, so not sure why this one didn't mention the Sydney harbour Bridge.

Identifying context-specific information.

I believe gpt4, and therefore apps like be my AI can already identify landmarks and stuff like that if we "reanalyse" with the location information. If in the above scenario, if you tell that the picture is of the skyline of Sydney, it could probably give the landmark name.
Something like Gemini could in fact use the search engine capabilities to simply find it out without us giving any info on our part...

A description with the location I got with Meta AI

The image shows a stunning waterfall, known as Tortum Waterfall, located in Uzundere, Turkey. It's surrounded by lush greenery and towering rocks that create a picturesque scene. The water cascades down from a great height, forming a misty veil around the base of the falls, where a rainbow appears to arc across the sky. The overall effect is breathtakingly beautiful, making it an ideal spot for nature lovers and photographers.

Thanks, that answers my question

Refining the description with questions and more information would work for me. My original point was to get closer and closer to trusting the AI descriptions by cross-checking and contrasting them against each other. I'm not yet using most of these apps being discussed.
@ Lottie, I sometimes build or make repairs to electronics or electric machinery for my own personal use; nothing very complicated. The color recognition apps in combination with an 18% gray card, like what is used in photography, have been a game changer for me. As far as a support worker... I haven't had anything remotely like that since the 90s when I was in college.

Well, here's another situation...

If I'm making a set of earrings or a neckless, I might double check the colors of the beads/charms or the metals to be sure they are all the same before I get started. I might also run it through the AI to be sure a picture of the finished item at least shows the item. I did not have a way of taking pictures of my work back when I made ceramic sculptures, and only have a few examples other people took of them to brag about etc.

@Emre TEO

That's really quite nice - assuming it is accurate and you weren't just pointing your phone at your bath tap.

How were you analysing it with meta? Is this using the meta ray-bans or is there an app you can use? I did see that the meta apps would be rolling out with some sort of AI built in (using Llama 3 or whatever it is called) It would be nice to get something like that built into facebook which doesn't even have a proper share option.
itself.

I believe photos have the location coordinates embedded into them (assuming you allow it) so the AI shouldn't need to guess where the photo was taken. What might be nice is if the direction the camera was pointing at could also be recorded there and then the AI would be able to have a pretty good guess at what you were looking at even without analysing the photo. (Obviously only for certain types of photos.)

Description by rayban meta

I took the photo with rayban meta and copied the answer I got with look and ask from the meta view app.

Re: Look and Ask

Ah, thank you - that is fantastic. I am a little jealous - I have the Ray-bans but am not in the US. I think having that sort of thing right on your face makes a big difference.

Meta AI

Meta AI has roled out for several countries. It's integrated into the Meta apps such as facebook and Instagram apart from a web interface I checked the web interface out; it's interesting; your typical chatbot but offers easy image generation capabilities. But it has no upload picture option so that you can get one described. Haven't checked out how the app integration works; if we'll be able to share a picture from insta and ask meta to give me a describtion. That'd be prittey cool.