Our dreams have come true - Gemini 2.0 is released with its real-time audio/video streaming capabilities!

This is bonkers good. I used…

This is bonkers good. I used it to look at some instructions for a mouth guard fitting, asked it to help me set my dishwasher cycle using the lights on the control panel, and it identified a coffee pod. I didn't know where the writing was so I said I'll keep rotating and I want you to look for text, and when it finished it identified it.

This is, quite frankly, game changing.

I suspect Open AI will be announcing this over the next couple of days to. They do seem to be moving in lockstep, suggesting that Be My Eyes will be getting this too in the near future but tailored to those without sight. Very exciting.

Also, just a pointer, your markdown is slightly off, you're missing the opening bracket.

https://aistudio.google.com/live

Top tip... If you bring…

Top tip...

If you bring this up, each time it will ask you to allow access to the camera and microphone.

Go into settings, right down to the bottom and apps. Scroll down to safari, and keep scrolling down until you get to settings for websites, it's a heading so you can skip by heading if you like. Tap on camera, set to allow (it's probably on ask), and then back out and do the same with microphone.

Second top tip:

Make this a bookmark on your homescreen by opening the link, going to the share tab on the bottom bar, scroll down to add to homescreen, and bob's you're aunt/uncle.

I think there are limitations on use. It could be timing or it might be the number of requests involved with processing the audio and video. It just means that you can't leave it on indefinitely, but you can certain do tasks that would otherwise have taken sighted help.

It's not perfect, it tends to agree with what you are saying so worth double checking with questions like, what button am I currently touching. Trying to catch it out seems to be a good way of avoiding issues and gaining some confidence in what it can do.

perhaps i'm misunderstanding something.

It's interesting that I can just pick up something and ask it what I'm holding and it gets it right sometimes but I asked it to tell me to rotate the can until it can see the text in full and it says ok I'll tell you, to rotate the can until I can see the text, and it doesn't. Is this a bug or am I misunderstanding how this wirks?

Phrasing

It's probably just a matter of phrasing. Keep at it, I imagine sooner or later it will understand what you are trying to ask of it, and provide you with the information you want.

Can't open the link.

When I double tap arx on the link it doesn't work. I tried coppying the link and then pasting it but it gives me an error.

Oh, I don't know about…

Oh, I don't know about specific website permissions. I do recall something similar.

Also, sorry for the terminology...

It's really really really really really really cool.

I think the way it is working is taking pictures every second so when you make an inquiry it is looking at the most recent images. I only had a brief play this morning but it seemed you have to focus it's attention on the thing you're holding. What am I holding? Is there any text on it? I'm going to move it until you can read something. Can you see anything now?... That sort of thing.

I think there is certainly going to be a nack to this and it's still buggy, it's beta though, expected. It's the start of what we've been waiting for though. A really really really really good thing that possibly alters the sport.

Wow

This is great. Hopefully soon we won't have to allow access everytime. Only slight confusion is we have a mic and camera button but I think you don't have to allow access to both each time. doing the camera seems to open both. I think this is what Envision are using in their ally app on the glasses.

There's not many times

It isn't often that I feel I should get crazy crazy excited about something but having just tried this for the first time. Wow, just, wow.
Spotlight is on you now, open AI. I suspect that's what we'll get on the final day of the 12 days of open AI.

About permissions...

How to Stop iPhone from Asking for Camera/Mic Permissions Every Time

Go to the website.
Tap on the page menu button.
Click on the More button at the bottom-right corner of the screen.
Look for the heading "Website Settings for...".
Change the microphone and camera settings from "Ask" to "Allow."

Thank me later! 🎉

Great

Mert Ozer as requested thanks lol. This worked.

Actually spoke to soon

Hi Mert Ozer are you saying that as soon as you now open the webpage you can start talking? Because just closed and reopened the site and I still have to open the camera button. Double checked and my settings are showing as allow. Tried to find another button that may help but no luck.

@Lee My pleasure...

It’ll get much better, but even now, it’s unbelievable. I just don’t think it has the ability to stay on track and keep providing details nonstop. For example, when I was navigating my high school, I asked the AI to describe the doors I was passing through. It started by describing two or three doors it could see, but then I had to say “go ahead” or “keep going” every time. So, it’s not continuous, and using this feature while traveling could be a bit dangerous for me. But look, we couldn’t even get detailed image descriptions until two years ago!

@Lee

Not really, by allowing the permissions from website settings we get rid of it asking for mic/camera permissions every time but I still have to start the camera/and the mic. I feel like that's how it should be, though. It's a web UI to input text, audio, image.

Top tip

Yes, It's bloody brilliant! Now all we want is this on a wearable; Google did demonstrate Astra yesterday so we'll have it in the near future I guess.
That aside, top tip: you can tell it that you're visually impaired at the start of each session and it'll remember that fact for the duration of that session. This'll help it assist you appropriately as far as identifying text etc. Hopefully once it becomes a full public release, it'll have permenant memmory.

Busy through Saturday

I am busier than a one-handed pancake chef through Saturday, but couldn't resist taking ten minutes to play around with this. It's freaking brilliant. Maybe I'll have time to explore further next week.

Google Gemini app?

Any idea when this might appear on the app instead of through Safari?

Really interesting

Don't have time to test this out right now but it sounds awesome.
I hope, as others have said, that OpenAI takes note of this (and hopefully have their own version ready soon) but also that the "blind-specific" apps/services, Be My Eyes, Seeing AI, and so on, also watches this closely, because having that integrated in something specifically made for blind people would be nothing short of fantastic.

asking to look out for something seems not to work

Hello,

I was just playing 5 minutes with this tool. But when I say: "tell me when you see a person in front of the camera." And I walk into the frame of the camera. I get no reaction. Maybe other people have better results with that kind of questions.

I hope OpenAI will present their own version of AVM with vision today, tomorrow or next week.

Hit and miss

Tried this outside regarding a bus stop. It said it would let me know when it saw it. Total silence. However, it may have been the connection. Seemed to drop off a lot outside with 4g signal in the UK. So, it may get better as inside I asked it to tell me when it saw a cup and it worked.

absolute amazing

We just need a way to make it speak faster.

Speaking faster is fine, but

So long as they do not get rid of the voice they are using, I will be happy. It almost sounds like you're talking to an actual person.
Almost ...

The really amazing thing

Okay, live video description is pretty amazing.

Aside from that, the really amazing thing is that you talk to it without having to deal with onscreen dictation buttons, like a real human; it talks back without you having to tap the speech bubble; and it has access to new images as it needs it, without you having to find and double-tap any buttons. This user interface makes all the difference in the world, in my opinion. and it's not like that is new technology or anything. It's simply that existing image description app developers have never bothered to design it this way.

real-time monitoring

Has anyone actually gotten real-time monitoring to work? Is it even supposed to do this? So far I need to ask every time if something changes or I want feedback. For me, it's almost the same as Meta AI's "Look and Describe" command without the WakeWords which obviously makes it quicker and smoother so over-all a plus but IMHO not a total game-changer from what we already have. Don't get me wrong though, this is definitely headed in a good direction!

Also, I find folks comments on speech feedback interesting. For me, I'm talking to a computer so I don't want or expect feedback to be slow, emotional or humanistic. Give me abbreviated/useful/effective information in order for me to be agile and productive.

This feature is now rolling…

This feature is now rolling out on Chat GPT for paid plans.

Coming to a future near you ...

Coming to a future near you, having something like this integrated into a smart phone. Be at Apple, Google, or whatever's clever.

Questions

This looks like a great step forward, though I’m having limited success so far.
It’s not using the voice I selected, and now I can’t find the option to change it. The various menus don’t seem to work too well with VoiceOver. Anyone else having more luck?
The “tell me when you see…” idea isn’t working for me at all. I have to keep asking, which suggests it’s just taking pictures really, not video. Is this any different to Ally for those on that beta too? Still a nice slick interface though.
When wearing AirPods, sound sometimes reverts to the phone speaker when I start a session. Anyone else seen this?
Dave

The Voice

I hope it just comes to the application so I can use voice I wish

My (limited) experiences

I've been testing mine, mainly by standing in front of a window, and constantly saying, "what do you see"? Please note that I live in a metro city, in the downtown area, in a high-rise apartment building, several floors up, so outside is quite lively with people in traffic and such. I have yet to play with any of the settings, I simply give it camera and microphone access, and start chatting away. I will say that, out of the box as it were, it is rather detailed, and pretty accurate at least as far as my belongings inside my home.
It was even able to read the small print off of a soda can, which I have not been able to get any other type of AI service to do. Ever.
iPhone SE 2022 running iOS 18.2.

David, it is indeed taking…

David, it is indeed taking photographs rather than the conventional idea of video. The other way to look at it, is it is a single frame per second video, or that's the regularity on the Open AI version.

Ollie

Hey Ollie. Apologies yeah, I get that. But it doesn’t feel like it is actually taking repeated pictures, because if I say tell me when you see a bottle and then start scanning the room with the camera, it only finds it if I keep asking the question. So is it actually only looking each time I speak?
In that way it seems kinda the same as Envision Ally.
I’ve only played with it a little bit so far though, and definitely see the amazing direction this stuff is going.
Dave

Like chat GPT, it doesn't…

Like chat GPT, it doesn't seem able to give a real time response, IE, tell me when this happens, or you see this. I don't think it can act on its own. I'm hoping, like in Andy's video all those months ago, that Be My Eyes will allow us to have visual triggers. It seems to be aware of what has happened, what is going on, but lacks the ability to speak of its own volition.

Gemini is better in this department

When compared to chat gpt. If you tell it multiple times that you're blind, you need real-time info etc, it does respond to a little extend, unlike chat gpt which stubbornly refuses. In both the cases, it is not that the system itself cannot do real-time monitoring, it's rather that there's some restriction placed on it. In the case of Gemini, it appears to be some instruction, like say "respond to visual info on detecting a spoken prompt", rather than an explicit restriction, which seems to be the case with chat gpt.

the bugger just halusinated an awsome future!

So I was trying to set-up Windscribe in my windows pc which as everyone knows, is not accessible with screenreaders or, at least JAWS. So I thought, why not just share my screen to Gemini live and make it read the screen and then use my keyboard along, which seemd like a nifty solution, but that was till it read the first screen and asked me, do you want me to click on the quick connect button? (note that I had already given it the context; stuff like I'm blind and that this app is totally inaccessible with my screenreader etc). To say that I was plesently surprised would be an understatement. I said "Sure go ahead" and then it went on to click that button, talk about the next screen, select a free server etc, etc, and basically to complete the process. And then to just make sure, I took a picture of the screen with BME and duh, nothing had happened, and everything was as it was in the beginning.
My conclusions: if you have read/heard etc about the mariner browser extention, I bet they're working to make it part of Gemini live/project astra etc. and possibilities for accessibility as far as such a thing is concerned are just enormous! Google is already into the agentic future, and I wouldn't at all be surprised if that is one of the anouncements made by OpenAI during the next 5 days of their ongoing 12 day string of anouncements.

The mighty Google

All hail Google. All hail our AI Overlords! 🤖

I think this is something…

I think this is something that google will have over open AI as it has a suite of apps, chrome, youtube, sheets, docs etc, which can all be nicely bound together and accessed by the google AI. Open AI can't do this. This is what apple AI wil be doing in the new year, though I'm not sure how deep it will go.

There was that app some developer was working on, and I can't remember the name of it, which was working on mac with an ability to take control of the mouse etc. I think that is where it is all headed. It was a little flaky, not because of the concept but because of the AI the app was using. I can see in the very near future a way of telling a computer to do tasks and the task is done. That's kinda what we want in the first place, the least friction between concept and enactment.

Has anyone been able to create a shortcut to start this process?

So, There are actions in shortcuts to open chat GPT and start voice mode, there isn't one to hit the camera switch at that point. Do any of you guys have a work around? I would love to use assistive touch to do this, but when getting to the draw jesture part of the flow, in the assistive touch options, voiceover goes silent. I know there were some Be My Eyes shortcuts months ago that launched the app, took a pic, then asked the system a question, so I know this is possible in theory, because I'm trying to do a similar task.

Step-by-step instructions to try out the live feature

Hi everyone, can someone please give me instructions on how to try out the live feature where I can point my phone at something, ask what do you see? am I right in assuming that this does not work with the Gemini app yet? If so, what website do I go to? And how do I enable the camera etc.? I have read through this thread, but for some reason it’s not working for me. I am not sure what I am doing wrong? I am using an iPhone 10. Will it work with this particular phone? Or do I need a later model? Thanks in advance for any help, Jason

the link works now.

You should be able to tap on the link in the first post, then on the page there's two buttons, one for mic and one for camra, tab on both those buttons and agree to the agreement stuff, then you can talk to the phone.

Gemini live video interaction cut off after a minute or two each

I’ve observed over multiple live video/audio interactions with the AI that after about a minute to a minute and a half or maybe two minutes, the AI quit responding to my questions with my video running. Then I have to hit the refresh button on my browser and again activate the Camera access button from the AI studio page and restart. I’d replicated this multiple times. Are you all also experiencing what seems is a very very tight and small amount of time that we are being allowed per interaction with the live video description interaction with the AI? Less than two minutes constraint on this time is horrible. I am a ChatGPT plus subscriber and there is no such tight time constraint with their live video based interaction with ChatGPT.

I mean ...

I mean, this is essentially a live beta. Furthermore, you said it yourself that you subscribe to ChatGPT, whereas this is free ... for now. In that respect it only makes sense to limit its use.

Yeah, I was getting this…

Yeah, I was getting this limit too. It's fine though, it's free.

I'll probably put this up on a different thread but, does anyone know of any good, and non-dorky chest mounts for iPhone? I've got the ray-bans but they are far less useful over here in the UK and am thinking, until the LIve AI for Meta rolls out here, I'm better off with my phone, but could do with a hands free option.

Looks like AIRA might have…

Looks like AIRA might have some competition coming for it. This is awesome!

8 to 10 minutes I second the request for good chest strap

I seem to get about eight or 10 minutes and then I do need to refresh it and hit the camera button again. I would be grateful for suggestions for chest straps as well. Would be amazing to mow this thing and let her rip

Only letting me do recordings

It seems like it's only letting me record a video and then ask questions about it. It's not letting me talk to it while doing a live video. I'm doing this in Safari on an iPhone 15 pro. Any ideas about what I might be doing wrong?

RE: Recordings

What I do:
1. Launch the shortcut that I set up on my home screen, to the link in the original post of this thread.
2. Double tap the "camera" button, and give permissions for it.
3. Start conversing with the AI in real time.
A note on step 2, after giving permissions, I noticed a subtle change in the audio quality of my iPhone. This is how I know that the microphone has been activated, even though I did nothing to the microphone button, giving camera permission seems to have also given microphone permissions as well. Just a heads up on that.

Also a tip, though this may have already been mentioned, once the page is loaded, go to page settings next to the address bar, and give full permission to camera and microphone. This will only be for this webpage, but will make things a lot easier moving forward.

HTH.

Unclear language perhaps

I wonder are you being misled by the label on the camera button?
For me, VoiceOver says “Camera, description, start recording”. However it is not creating a recording, you can simply start asking questions.
Dave

Share screen

How can I share my phone screen on gemini live? Only the camera appears

Not sure that you can do that on the iphone

It would be neat but I don't think it's possible. I have done it on the pc and it is very neat

Vs ChatGpt

Wow! Sounds awesome. I’m a ChatGpt Plus subscriber and have been marveling at this same feature on the latest version of its voice chat feature, and now it sounds like I need to go try this Gemini option too. What a time to be blind! Thanks for sharing this

Our dreams have come true - Gemini 2.0 is released with its real-time audio/video streaming capabilities!

Options

Comments