Oh actually that’s a great card for LLM serving!
Use the llama.cpp server from source, it has better support for Pascal cards than anything else:
https://github.com/ggml-org/llama.cpp/blob/master/docs/multimodal.md
Gemma 3 is a hair too big (like 17-18GB), so I’d start with InternVL 14B Q5K XL: https://huggingface.co/unsloth/InternVL3-14B-Instruct-GGUF
Or Mixtral 24B IQ4_XS for more ‘text’ intelligence than vision: https://huggingface.co/unsloth/Mistral-Small-3.2-24B-Instruct-2506-GGUF
I’m a bit ‘behind’ on the vision model scene, so I can look around more if they don’t feel sufficient, or walk you through setting up the llama.cpp server. Basically it provides an endpoint which you can hit with the same API as ChatGPT.
You can still use the IGP, which might be faster in some cases.