In this edition of the AppleVis Extra, David Nason speaks to Saqib Shaikh, a key member of the team behind Microsoft's Seeing AI, winner of Best App in the 2024 AppleVis Golden Apple awards.
Transcript
Disclaimer: This transcript was generated by Aiko, an AI-powered transcription app. It is not edited or formatted, and it may not accurately capture the speakers’ names, voices, or content.
Hello there, and welcome to another episode of the AppleVis Extra Podcast.
My name is David Nason and I am delighted to be joining you again to talk about the 2024 Golden Apple Awards.
Today's focus is the best app category.
We had 10 brilliant nominees in this category, and they were 1Password, Anytime Podcast Player, Drafts, Mona for Mastodon, Oko: Cross Streets and Maps, PiccyBot, Seeing AI, Tape It Pro Audio Recorder, Todoist: To-Do List and Calendar, and VoiceVista.
So a great list of nominees there, very strong.
Our runners up, getting special mention, are PiccyBot and VoiceVista.
So again, huge well done to them for being runners up.
Our winner and a very familiar winner, I think it's their fourth Golden Apple, which is amazing, is Microsoft's Seeing AI.
So huge well done to them.
It's an app that's been around a long time and is still available in the App Store if you search for Seeing AI from Microsoft.
So yes, huge well done to Seeing AI for winning the best app, Golden Apple 2024.
And to chat about it, I am delighted to be joined from the Microsoft Seeing AI team by Saqib Shaikh.
Hello Saqib, thank you so much for joining me on the podcast today.
Hi, thank you so much for having me.
It's a pleasure to be here.
And you're here because Seeing AI, not for the first time, won a Golden Apple Award for 2024 in the best app category, which is fantastic.
So congratulations on that, first of all.
Thank you.
A great honor and actually a surprise.
And it means so much because it's from the community.
You know, there are so many awards from either the government or industry, but I think the ones from the community just mean that much more.
Absolutely.
You know, to be voted by the people who are using apps and there's, you know, a great range of apps nominated this year.
So to top the poll is absolutely brilliant.
And actually, before we jump into too much detail, maybe for the uninitiated, if there are any in our community, can you give us a quick rundown of what Seeing AI is?
Good point.
Seeing AI, we talk about it as a talking camera app or a visual assistant.
It's a mobile app where you open it up and it uses the camera to tell you what it can see.
And it has different modes for different tasks in your daily life: from reading things to you, whether that's instantly or more slowly with formatting, through to describing photos that you take, or from your camera roll, or from other apps, all the way through to very task-specific channels, as we call them, like finding something or exploring the world around you or knowing if the lights are on.
Amazing.
What do you do yourself on the team, out of interest, and have you been there long?
Have you been part of Seeing AI from the start? Can you tell us a bit?
Yeah, I've been at Microsoft for many, many years, close to 20, I guess.
Wow, my goodness.
But Seeing AI was something I started with some colleagues, and so I've been doing this since, well, we launched in 2017.
So that's just over seven years of being out there.
And of course, we worked on it a bit before then.
And so I now lead the team.
And yeah, that's me.
So you actually helped set up the whole thing from day one for Seeing AI?
Yeah, so it was a fun story.
So Microsoft had its very first hackathon.
These are events where the CEO basically said, you can have a week off work to make whatever you want.
And, you know, I'd studied AI at college way back when.
And I thought, okay, I want to make an AI app for blind people.
And, you know, it went nowhere.
It was a side project.
But then, you know, I met more colleagues at Microsoft who were interested, and, you know, you get the right people together.
And eventually it got the CEO's attention.
And then eventually we launched.
And then eventually my manager was like, you know, this thing has wings, you should do it full time.
And it became my day job, which, you know, I never planned.
But it's been an incredible journey since.
That's kind of how I got into accessibility as well.
Doing it on the side of my desk.
And then eventually it becomes your job, but you're just spending an awful lot of time on it.
What's your background?
Are you a product manager or are you a developer or what's your own kind of?
Yeah, I'm a developer, software engineer.
That's my background.
And then, you know, obviously, over time, I've started doing more and more of the product management type stuff as well.
And yeah, I really enjoy that hybrid.
Yeah, absolutely brilliant.
When did it come out then?
Did you say the project started in 2017, or it came out in 2017?
Because it feels like one of those apps that's been so ingrained now, it feels like I've had it forever.
I can't remember when it started.
Yeah, it came out in 2017.
That was when we shipped 1.0.
Okay, amazing.
That would have been with a set of channels like Short Text, Document, Product, Colour.
It had a lot of the channels from day one, I think, didn't it?
Yeah, it really had the core channels.
And the big thing then was even the basic image descriptions.
It was the first time I had ever seen AI descriptions when we started developing Seeing AI.
And then, you know, all the other things like reading text are very popular, even back then.
And it was just really cool to see all what AI could do all in one place.
And then over time, it's really been, I think, a conversation between scientists and the blind community, figuring out, you know, how do we make great experiences that leverage the latest technology available?
So then we added more and more channels over time.
And I seem to remember it was kind of put out there as almost a research project, is that right?
Yeah, exactly.
We always talked about it as a research project, and to some extent it still is, I think, though over time, as we've got lots and lots of users, it has obviously become a full-fledged Microsoft product.
But we still have our roots in research, talking to the scientists, like I say, to figure out how we push the boundaries of what's possible.
And it's so interesting because obviously in the last, I don't know, two years, is it two and a half years now?
I'm not sure when OpenAI initially burst on the scene.
And then obviously we've had Microsoft involved with that, but also Copilot.
And then you've got, you know, Gemini from Google and Claude and all of these, you know, this absolute explosion in AI.
But you guys introduced us to the power of AI, or the potential of it at least, you know, eight years ago, I guess.
So it must be amazing to see how far it's come.
It has, like, you know, AI for people who are blind was, you know, not a mainstream thing way back when we started.
Excuse me.
But now, yeah, there's so much more we can do and things are moving incredibly fast.
And it's a very exciting time because just the tools that we have at our disposal are, you know, they're so much better now and they're getting better every few months.
Yeah, absolutely.
And I think, you know, this is a 2024 award, technically.
And the big improvement in 2024, the big addition, was video.
Firstly, is that something you ever thought you'd see, you know, being able to actually describe videos to people? But also, how big of a leap was that last year, I suppose?
Yeah, that was a huge thing.
And honestly, until, you know, maybe two years ago, like you say, I thought video was something that was years and years in the future.
It was always the goal.
I've described, you know, Seeing AI from day one as, you know, the vision is: what if I had an assistant that could take the place of a sighted guide, or, you know, someone describing what's around me or answering my questions as I go about my life when there's not a human available?
Of course, you know, when people are around, people are great, but having AI to fill that role.
And I think the past two years have really shown that, oh my goodness, a lot of those pieces that we dreamt of way back when are actually, to some extent, possible now.
And even though we don't have real-time video just yet, the ability to take a video and describe it scene by scene, as we do, is just a huge leap forward, both for the short kind of videos that Seeing AI supports, and we're also trying to bring that technology to sort of movie-quality content as well.
So yeah, as a blind user myself, very exciting times.
So for those who don't know the video feature, you can basically share an MP4 file with Seeing AI, or you can open it from the Browse Photos feature.
And then, at the beginning of every scene right now, we will pause the video and describe what's about to happen.
And yeah, it's really cool.
It might describe that someone's walking into the room or someone's kicking a ball or whatever it is.
And the quality is continuing to get better.
So yeah, excited to see where that goes.
And then eventually, like I say, bring it to more types of content.
But we're starting with those short social media type videos.
And can you import a video that isn't in your own camera roll?
I can't remember.
So if it's not in your camera roll, then you use the share sheet, so you can share it from any other app.
What we would love to do is to be able to describe videos in web players, whether that's a streaming service like Netflix or a video player like YouTube.
Can't do that today.
Unfortunately, it has to be a video file that you have, but you never know.
Maybe one day.
And I mean, even you were speaking about the goal of live video.
And like you say, it's great when there's people around, but even when there are people around, we'd love to feel independent when we can.
And that's, I think, what this kind of AI, whether it's the video stuff or even the more basic stuff like short text reading, even currency checks and all that, gives you: that sense of genuine, you know, independence that we maybe didn't have before with a lot of things.
Exactly.
And having that, it is about choice, right?
Everyone should be able to, you know, be independent when they want and work with others when they prefer and use technology when that's appropriate for them.
And so the real time video or the real time assistance is definitely something we're working on.
And I'm very, very excited about when we're able to deliver that.
Yeah, absolutely.
And I think, I don't know if it was 2024 or 2023, but the World channel got improved as well with the object finding, which I guess is along the same lines: it kind of guides you to an object.
You can teach it specific objects as well, can't you?
So you can say 'where is the bag' or kind of general things, but if you want to teach it your specific bag, you can do that as well.
Yeah, we call it Find My Things.
And yeah, that was this past year, 2024.
And this is interesting because I think it's the first time an app has allowed blind people to actually train the AI.
And where you have all these, what the media would call large language models, this is the total opposite.
All the training is happening in just a few seconds on your device.
So this was a project, we did it with our research lab to see, okay, how do blind people train the AI?
How do we do all that on a device?
And as you mentioned, once you've taught it to recognize your objects, it provides you with an audio experience to find them.
So, you know, you drop your AirPods, for example, then maybe you could use this to pick them up again.
That's funny.
That's the first thing I taught it.
Where's my AirPods?
Because they're so small.
I'm always losing mine.
Absolutely.
And, you know, you haven't stopped there.
And I think that's one of the interesting things about Seeing AI as well.
It wasn't just something you released a few years ago and maintained.
You've continued to not just develop the quality through better technology, but actually come up with new ideas of what it could do.
And now as well this year, I guess because there's so much in the app now, you've decided to do an overhaul of the user interface, which people may have seen in the last few days.
Yeah.
And often when you're making an app, you think, you know, what are the one or two things people will use it for?
With Seeing AI, though, I feel like we want to provide greater autonomy and independence and inclusion for all parts of life.
So there's an unlimited number of possibilities and we love hearing from the blind community and then figuring out, okay, what are the challenges we can solve?
And there are still many, many unsolved challenges, despite all the hype in the media about the capabilities of AI.
You know, we could make a huge list of things which AI can't do today.
And so in some ways, I feel the blind community is often leading the way, and not just blind people, people with disabilities, because we have a different need, a greater need.
It stretches what we need AI, or technology more generally, to do for us.
So, like I said, we kept putting more and more into the app.
Now, in the beginning, we had this idea of channels, like the channels on a TV or radio, and you tell it what you're interested in.
But that list can't get infinitely long.
And so you end up with the most useful things getting a bit hidden, but also, AI is getting better at allowing you just to ask a question, so you don't need to switch channels so much.
So this is really what prompted the recent redesign of the app that you mentioned.
Yeah, I think an example of that is you no longer have to go to a different channel for reading short text or scanning a document.
It opens up on the Read tab and it will just do short text.
And if you want to do a scan of the document, you still can by selecting an option.
Exactly, yes.
So we really did two things here.
The first is we introduced tabs along the bottom: Read, Describe, and More.
And then the More tab has your old list of channels for everything else you might want to do, like currency or colour or light, or even the World channel you mentioned.
But we found people spend most of their time doing either read or describe.
So the Read tab combines the Short Text, Document, and Handwriting channels.
So as you say, it's reading short text.
If you want to have document alignment, you just toggle that switch or take a photo yourself.
And then the Describe tab brings together the Scene and Person channels that we had before.
So it'll tell you who it can see around you and help you align them to take a picture.
But then it has the rich descriptions that we've had for maybe a year or more now.
But it's just more front and center.
And as those have got more reliable, we now feel comfortable that it can be the default experience.
Yeah.
And you've expanded, again, I think it was last year, how much you can interact.
So it's not simply taking a picture and it giving you an answer; you can now ask questions in return.
Yes, exactly.
Both for documents and for general images.
You've got that Q&A feature.
Which is definitely useful.
It is.
I love, like, you know, being in a menu, sorry, being in a restaurant and asking questions about the menu after I take a photo.
But then sometimes I just want to get a description of one part of it.
And that's where the ability to ask questions is very helpful.
Yeah, no, I've used that.
I use that all the time, all the time, literally.
And I mean, there's quite a lot of apps in this space.
Obviously, it's grown.
You guys were probably the first movers, and then we've got things like Be My AI and, you know, Envision, and there's, you know, various wearables coming along.
So where do you feel Seeing AI stands out?
Do you think there are particular aspects it's great at, where it stands out?
As a product, I think what stands out is just the holistic nature: from, you know, reading things in real time, to having very rich document scanning that also lets you navigate by headings and tables and all that formatting.
And, as you mentioned, asking questions, but then also the usual image descriptions, which are so common now.
It's funny to say that such an advanced technology is common, but yes, it is.
But as much as the product, I think what Seeing AI represents is the team's desire to push the state of the art forward.
Like I say, a conversation with the blind community on the one side and scientists on the other, trying to figure out how we enable these new experiences.
Because as someone who's blind, AI is so amazing.
Yet, as I go about my daily life, I can think of dozens of small things that I wish there was some technology to help with.
And so I look to someday in the future where more and more of those will be solved as well.
Definitely.
Yeah, keep pushing.
And I think it's a great thing for us to hear as well because, you know, Soundscape obviously was a brilliant app for us, and unfortunately it got discontinued.
Obviously, Microsoft open-sourced it, and we have a couple of other great apps, another Soundscape app and VoiceVista, that have kind of picked it up and run with it, which is amazing.
But I think we are all delighted here that Microsoft is still very much committed to Seeing AI.
Yeah, they absolutely are.
And I am very appreciative that, you know, the powers that be at Microsoft are supportive of our work and yeah, let us keep carrying on.
Yeah, and it's interesting that you mentioned like one of its strengths is its holistic nature.
I think that's right.
People talk about it as like a Swiss Army knife or, you know, a multi-tool for blind people.
And there are things it does that others maybe don't as well.
I like that I can take a picture with us, get the AI to describe it, and then there is actually an option to save.
And I can use both the front and the back facing camera.
So for if you're a blind photographer or you just want a bit of help getting the right photos, you can actually use seeing AI.
Whereas some of the others are just there for description, but they don't actually let you save the picture.
Well, that's interesting.
Yeah, that's definitely something that I use because I use this as my primary camera as well now.
And you mentioned user feedback.
So how does that work?
How do you go about getting ideas from the community and getting feedback and all that kind of stuff?
We're really looking at all the sources.
And so, you know, you can always reach out to us directly at seeingai@microsoft.com, but we're also looking at all the reviews and the blogs and podcasts and X/Twitter.
But of course, the AppleVis forums are a valuable source as well.
So really, we try to be everywhere, make sure that we have our fingers on the pulse of the community.
And then for more direct contact, you can email us, or we have the beta group as well, which is really valuable for hearing from people on early ideas.
And is there a simple way to apply for the beta group or is that a selected group or how does that work?
It sort of is selected, but it's not, you know, if you're interested in joining, you can email us, and as we have capacity to add more people, we will.
Do you know what the email address is to contact the team?
Yes, seeingai@microsoft.com.
Nice and simple. seeingai@microsoft.com.
Yeah, and I know you've been active in the community before. For example, I remember when there was a bug where the help prompts were popping up more frequently than they should, and you were on the forum and you answered.
So I think that's great to see as well, that people know you are truly engaged with the community.
Yeah, I lead this project at Microsoft.
But then, of course, I'm a blind guy and a user as well.
So I really appreciate AppleVis and the site and all the work you guys do.
Brilliant.
And I suppose the last question would be: any other plans for this year, or any future plans that you'd like to share?
We keep tinkering in the lab.
So, you know, there are a lot of great ideas of, you know, what this new powerful AI could do to help the blind community.
So, yeah, watch this space. We keep prototyping, trying different things, as they become ready and reliable and something we think people can trust.
Because it's important that we not only have cool technology, but something reliable and trustworthy.
But yeah, as things are ready, they will be out.
And whether you'll be able to answer this question or not, I don't know.
But I'll ask anyway, do you have any plans around wearables at all?
And whether that's, you know, I don't know if Microsoft are working on their own, or just, you know, working with other third parties who have wearable devices out there.
So, yes, since we did our first sort of launch many years ago, wearables is the thing that people have really wanted.
What is exciting at the moment is we've got a lot more options, and they're actually devices you can buy rather than, you know, more prototype things.
And this past year, actually, we also launched a partnership with ARx Vision, who make a USB headset with a camera on it.
And that currently only works on Android, but that was the first wearable we've supported.
But we're talking to other companies and just seeing what other devices we can easily integrate with, so that you get the Seeing AI experience hands-free.
Brilliant.
Yeah, I think there'll be loads of excitement anytime there's anything to announce in that space.
I think it's one of those areas everybody's really excited to see now; like you say, wearables seem to be becoming viable and affordable for everybody.
Yeah, you know, already I'm trying to carry my cane, open the door, carry my bag.
And then where's your phone going to go?
I need a fourth hand.
Yeah, exactly.
And who wants to walk down the street holding the phone in front of them, really, if they don't have to?
So yeah, I think that's a huge next step for the likes of Seeing AI.
And that's brilliant.
So is there anything else you'd like to cover or mention before we wrap up?
I think we covered everything, but ultimately I'm just excited to keep going, hearing from all of the audience about, you know, what's interesting you and what you wish Seeing AI could do.
You know, we thrive on your feedback, and the things we make are purely to serve this community.
So do let us know by email or you can probably find me around on the Internet.
And yeah, I'm excited by the new wave of AI.
And let's see what we can dream up in the labs and eventually ship.
Amazing.
Well, thank you so much for joining me, and congratulations again for winning Best App of 2024 in the Golden Apples on AppleVis.com.
Thank you so much.
It's a pleasure.
Thanks everyone for listening.
Bye bye.
Comments
Still love this app
I would love to see this app on an affordable pair of smart glasses like the Meta Ray-Bans. Ever since the new interface, I can't find the world and scene channels. When I go to the 'more' tab, I can find product, person, currency, find my things, colour and light. Have the world and scene options been discontinued? I've looked under the 'describe' tab but nothing there either. It's just take photo, browse photos and recognise a face.