Describing images using local LLMs and a shortcut

By Diego, 11 May, 2026

Hey guys!
I would like to try playing games on the iPhone. Since Gemini uses up a lot of tokens quickly, I would like to ask how good local models are at describing images and whether it is possible to make a shortcut for this.
The idea would be the following:
I press a button on the controller and the option changes. I make a gesture on the iPhone screen with VoiceOver. Silently, it takes a screenshot of the screen, sends it to an LLM with a specific prompt, speaks the answer, and deletes the prompt.
Do you think it would work? I want to find out which option is in focus, the player's status, and so on.

Comments

Interesting Idea

This is an interesting idea and I’d be interested to see how other people see this working.

When you say “send it to an LLM”, what do you mean though?

For instance, I’ve got a Mac mini M4 with 24GB of RAM. I’m running Ollama on it to host my LLMs, and I’m in the early stages of playing with OpenClaw to orchestrate AI agent tasks.

From what I have found so far, I can process prompts in Ollama using LLMs like Qwen 2.5 14B and Gemma 3 12B pretty comfortably locally when they are used directly. Performance does drop off a cliff when you bring OpenClaw into the mix, though. I only get decent performance on that using a small and snappy remote model, like GPT-4o mini, which serves the purpose for what I want.

If all you wanted from processing an image was, say, text recognition, I’d guess a smaller model like the local ones I listed above might do a reasonable job. If you are after rich, detailed descriptions of screens and graphics, though, I doubt you’d get that without using a much higher-end local model, say a 30B or 70B one. You’d then need a real top-end machine (say an M5 Pro with 64GB of RAM) to even come close to considering it a usable option for fully local processing.
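To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It only counts the memory needed to hold the model weights, using the common rule of thumb of parameters times bits per weight. The function name and the list of sizes are made up for illustration, and real usage will be higher once you add the context cache and runtime overhead.

```python
def weights_gb(params_billions, bits_per_weight):
    """Approximate memory for the weights alone, in GB (10^9 bytes).

    Rule of thumb only: ignores the context cache, activations and
    any runtime overhead, so treat the result as a lower bound.
    """
    return params_billions * bits_per_weight / 8


# Illustrative sizes, compared at 4-bit quantisation and 16-bit precision.
for size in (7, 12, 30, 70):
    print(
        f"{size}B: ~{weights_gb(size, 4):.1f} GB at 4-bit, "
        f"~{weights_gb(size, 16):.1f} GB at 16-bit"
    )
```

By this estimate a 70B model needs around 35 GB for the weights alone even at 4-bit, which is why it is out of reach for a 24GB machine.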

People on here may have far more expertise than me, but I imagine what you are talking about is an iPhone shortcut that would communicate with a device that could host the LLM and other software and process your requests, like a Mac acting as a server, or perhaps something like an AWS EC2 instance, the latter of which would obviously have its own costs and setup needed.
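As a rough sketch of the "Mac acting as a server" idea, this is what the request to Ollama's `/api/generate` endpoint could look like, written in Python for readability. The host name, model tag and timeout are placeholders I made up, not a tested setup. A shortcut would send the same JSON body with its "Get Contents of URL" action instead of Python. Note that Ollama only listens on localhost by default, so the Mac would need to be configured to accept connections from the phone.

```python
import base64
import json
import urllib.request

# Placeholder address: replace with wherever your Ollama server is reachable.
OLLAMA_URL = "http://mac-mini.local:11434/api/generate"


def build_payload(image_bytes, prompt, model="gemma3:12b"):
    """Build the JSON body for Ollama's /api/generate endpoint.

    The model tag is an assumption; use any vision-capable model you have pulled.
    """
    return {
        "model": model,
        "prompt": prompt,
        # Ollama expects images as a list of base64-encoded strings.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        # Ask for one complete JSON reply instead of a stream of chunks.
        "stream": False,
    }


def describe(image_bytes, prompt, url=OLLAMA_URL, timeout=60):
    """Send the screenshot and prompt, and return the model's text reply."""
    body = json.dumps(build_payload(image_bytes, prompt)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

Calling `describe(screenshot_bytes, "Which menu option is in focus?")` would return the text that the shortcut then hands to VoiceOver to speak.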

There are all sorts of workflows you could make to achieve what you are hinting at, but certainly everything I’ve read seems to indicate that most people settle for a hybrid approach, where local models do the simple grunt tasks to process as much as possible locally before offloading the heavy lifting to remote LLMs like Gemini or OpenAI.

What’s your motivator here out of interest? Privacy? An AI hobby project?

local llm on iPhone

Hi.
The idea is to run LLMs directly on the iPhone, hence the need to run small models.
My PC is a ROG Ally X, and even with its 24 GB of RAM, which is shared between the integrated GPU and system memory, it can't run them well.
Even when I take a screenshot of the Chrome shortcut and ask, just for testing, "what item is this?", it times out.
I think it's because Windows isn't very good at running LLMs, unless you have a super machine. From what I've seen people say, Macs can do the job better.
That's why I would like to try running directly on the iPhone.
Apps do this, and I know Apple has a system for using LLMs, so I thought this might be possible.

Ahh I See

Ahh ok!

I am really no expert here, but when you think about Apple running LLMs on the iPhone for Apple Intelligence on-device processing, you need to remember that these models are highly specialised and designed to carry out very narrow and specific sets of tasks. You can get away with a micro model for this kind of thing, and I'd argue that it's part of the reason why Apple Intelligence just isn't very good at the moment and why it offloads to OpenAI so often :)

Maybe in the future, iPhones might be powerful enough to run more complex models locally, or maybe more ways will have been found to compress/quantise models to increase options, but as much as I’d love to see you prove me wrong, I think your options for running local AI projects on the iPhone are going to be limited.

For instance, I’ve hardly gone into Xcode at all, and I’d imagine Apple exposes some SDKs for Apple Intelligence now, but I’d argue that even with these, you are unlikely to be able to produce anything more sophisticated than what native iOS apps already offer.

I’d love to see you give it a go though if you have the time and the knowhow :)

Incidentally, as I understand it, the reason why people say that Macs are “better” at local AI is because they have unified memory. That is, in my 24GB Mac mini example, both the GPU and CPU have access to all of that memory. People who boast about running huge local models on Windows on YouTube, Reddit, etc. no doubt have four-figure GPUs with huge amounts of VRAM and top-end processors. Like you, I thought that more RAM means you can run bigger models on Windows, but from what I’ve read, on Windows in particular it’s the amount of VRAM on the GPU that counts, as this is where the models are loaded.

If you are getting timeouts locally though, as I was getting a few days ago, there are things you can do to try and help. For instance, reduce the context window size (basically put, the amount of memory your AI has), or reduce the maximum number of tokens it is allowed to generate. Also, asking any LLM to recognise an image takes a surprising amount of power, and apps like Be My Eyes fool us into thinking it’s really straightforward because they outsource the work to commercial models.
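If you happen to be using Ollama, those two knobs are the `num_ctx` (context window) and `num_predict` (maximum tokens to generate) options, which can be set per request. Here is a minimal Python sketch of a request body with both turned down. The function name and the values 2048 and 64 are just starting points I picked for illustration, not recommendations.

```python
def small_request(prompt, model, num_ctx=2048, num_predict=64):
    """Build an Ollama /api/generate body with a reduced memory footprint.

    num_ctx caps the context window, num_predict caps the reply length.
    The default values here are illustrative guesses, not tuned settings.
    """
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {
            "num_ctx": num_ctx,
            "num_predict": num_predict,
        },
    }


# Example: a short question with a short answer allowed.
body = small_request("What item is this?", model="your-small-model")
print(body["options"])
```

A smaller context window means less memory reserved up front, and a smaller reply limit means the model stops sooner, so both can help a slow machine answer before the timeout hits.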

Try installing a really small model, say one with about 7 billion parameters, keep prompts really short and simple, and work up from there until you find the point where it breaks.