@brucethemoose

brucethemoose@lemmy.world · 23 hours ago

You can still use the IGP, which might be faster in some cases.

brucethemoose@lemmy.world · edit-2 1 day ago

Oh actually that’s a great card for LLM serving!

Use the llama.cpp server from source, it has better support for Pascal cards than anything else:

https://github.com/ggml-org/llama.cpp/blob/master/docs/multimodal.md

Gemma 3 is a hair too big (like 17-18GB), so I’d start with InternVL 14B Q5K XL: https://huggingface.co/unsloth/InternVL3-14B-Instruct-GGUF

Or Mixtral 24B IQ4_XS for more ‘text’ intelligence than vision: https://huggingface.co/unsloth/Mistral-Small-3.2-24B-Instruct-2506-GGUF

I’m a bit ‘behind’ on the vision model scene, so I can look around more if they don’t feel sufficient, or walk you through setting up the llama.cpp server. Basically it provides an endpoint which you can hit with the same API as ChatGPT.

brucethemoose@lemmy.world · edit-2 4 days ago

1650

You mean GPU? Yeah, it’s good, I was strictly talking about purchasing a laptop for LLM usage, as most are less than ideal for the money. Laptop vram pools are relatively small and SO-DIMMS are usually very slow.

Things will get much better once the “Max” AMD SKUs proliferate.

brucethemoose@lemmy.world · edit-2 4 days ago

Yeah, just paying for LLM APIs is dirt cheap, and they (supposedly) don’t scrape data. Again I’d recommend Openrouter and Cerebras! And you get your pick of models to try from them.

Even a framework 16 is not good for LLMs TBH. The Framework desktop is (as it uses a special AMD chip), but it’s very expensive. Honestly the whole hardware market is so screwed up, hence most ‘local LLM enthusiasts’ buy a used RTX 3090 and stick them in desktops or servers, as no one wants to produce something affordable apparently :/

brucethemoose@lemmy.world · edit-2 4 days ago

I was a bit mistaken, these are the models you should consider:

https://huggingface.co/mlx-community/Qwen3-4B-4bit-DWQ

https://huggingface.co/AnteriorAI/gemma-3-4b-it-qat-q4_0-gguf

https://huggingface.co/unsloth/Jan-nano-GGUF (specifically the UD-Q4 or UD-Q5 file)

they are state-of-the-art at this size, as far as I know.

brucethemoose@lemmy.world · edit-2 4 days ago

8GB?

You might be able to run Qwen3 4B: https://huggingface.co/mlx-community/Qwen3-4B-4bit-DWQ/tree/main

But honestly you don’t have enough RAM to spare, and even a small model might bog things down. I’d run Open Web UI or LM Studio with a free LLM API, like Gemini Flash, or pay a few bucks for something off openrouter. Or maybe Cerebras API.

…Unfortunely, LLMs are very RAM intensive, and >4GB (more realistically like 2GB) is not going to be a good experience :(

brucethemoose@lemmy.world · edit-2 4 days ago

Actually, to go ahead and answer, the “fastest” path would be LM Studio (which supports MLX quants natively and is not time intensive to install), and a DWQ quantization (which is a newer, higher quality variant of MLX models).

Hopefully one of these models, depending on how much RAM you have:

https://huggingface.co/mlx-community/Qwen3-14B-4bit-DWQ-053125

https://huggingface.co/mlx-community/Magistral-Small-2506-4bit-DWQ

https://huggingface.co/mlx-community/Qwen3-30B-A3B-4bit-DWQ-0508

https://huggingface.co/mlx-community/GLM-4-32B-0414-4bit-DWQ

With a bit more time invested, you could try to set up Open Web UI as an alterantive interface (which has its own built in web search like Gemini): https://openwebui.com/

And then use LM Studio (or some other MLX backend, or even free online API models) as the ‘engine’

Alternatively, especially if you have a small RAM pool, Gemma 12B QAT Q4_0 is quite good, and you can run it with LM Studio or anything else that supports a GGUF. Not sure about 12B-ish thinking models off the top of my head, I’d have to look around.

brucethemoose@lemmy.world · edit-2 4 days ago

Honestly perplexity, the online service, is pretty good.

As for local running, one question first: how much RAM does your Mac have? This is basically the factor for what model you can and should run.

brucethemoose@lemmy.world · edit-2 4 days ago

I don’t understand.

Ollama is not actually docker, right? It’s running the same llama.cpp engine, it’s just embedded inside the wrapper app, not containerized. It has a docker preset you can use, yeah.

And basically every LLM project ships a docker container. I know for a fact llama.cpp, TabbyAPI, Aphrodite, Lemonade, vllm and sglang do. It’s basically standard. There’s all sorts of wrappers around them too.

You are 100% right about security though, in fact there’s a huge concern with compromised Python packages. This one almost got me: https://pytorch.org/blog/compromised-nightly-dependency/

This is actually a huge advantage for llama.cpp, as it’s free of python and external dependencies by design. This is very unlike ComfyUI which pulls in a gazillian external repos. Theoretically the main llama.cpp git could be compromised, but it’s a single, very well monitored point of failure there, and literally every “outside” architecture and feature is implemented from scratch, making it harder to sneak stuff in.

brucethemoose@lemmy.world · edit-2 4 days ago

OK.

Then LM Studio. With Qwen3 30B IQ4_XS, low temperature MinP sampling.

That’s what I’m trying to say though, there is no one click solution, that’s kind of a lie. LLMs work a bajillion times better with just a little personal configuration. They are not magic boxes, they are specialized tools.

Random example: on a Mac? Grab an MLX distillation, it’ll be way faster and better.

Nvidia gaming PC? TabbyAPI with an exl3. Small GPU laptop? ik_llama.cpp APU? Lemonade. Raspberry Pi? That’s important to know!

What do you ask it to do? Set timers? Look at pictures? Cooking recipes? Search the web? Look at documents? Do you need stuff faster or accurate?

This is one reason why ollama is so suboptimal, with the other being just bad defaults (Q4_0 quants, 2048 context, no imatrix or anything outside GGUF, bad sampling last I checked, chat template errors, bugs with certain models, I can go on). A lot of people just try “ollama run” I guess, then assume local LLMs are bad when it doesn’t work right.

brucethemoose@lemmy.world · edit-2 4 days ago

Totally depends on your hardware, and what you tend to ask it. What are you running? What do you use it for? Do you prefer speed over accuracy?

brucethemoose@lemmy.world · edit-2 4 days ago

TBH you should fold this into localllama? Or open source AI?

I have very mixed (mostly bad) feelings on ollama. In a nutshell, they’re kinda Twitter attention grabbers that give zero credit/contribution to the underlying framework (llama.cpp). And that’s just the tip of the iceberg, they’ve made lots of controversial moves, and it seems like they’re headed for commercial enshittification.

They’re… slimy.

They like to pretend they’re the only way to run local LLMs and blot out any other discussion, which is why I feel kinda bad about a dedicated ollama community.

It’s also a highly suboptimal way for most people to run LLMs, especially if you’re willing to tweak.

I would always recommend Kobold.cpp, tabbyAPI, ik_llama.cpp, Aphrodite, LM Studio, the llama.cpp server, sglang, the AMD lemonade server, any number of backends over them. Literally anything but ollama.

…TL;DR I don’t the the idea of focusing on ollama at the expense of other backends. Running LLMs locally should be the community, not ollama specifically.

brucethemoose@lemmy.world · edit-2 6 days ago

Practically though, Microsoft wanted the Chief (and Cortana I guess) because that’s what sells. So that was kinda the constraint.

They could have sent him on a space boat somewhere more remote though. Basically infinite without Halo 4/5 and a way more intimate plot.

brucethemoose@lemmy.world · edit-2 6 days ago

Forza Motorsport

Yeah, it seems weird to me.

It looks like they have the framework for a more elaborate multiplayer/matchmaking system and… just didn’t really use it?

I am not even touching FM now because the 10-minute races are so awful (and they got rid of all the 20 minute ones).

brucethemoose@lemmy.world · edit-2 6 days ago

Yeah, Halo 3 wrote them into a corner.

Still, they could have creatively ‘reset’ the scope. Focus on a frontier story, a prequel, some kind of cataclysm, probably reset a lot of characters. Even if they lost the silent-chief ‘feel’ of the trilogy, the tone change would have been excused (like Reach to an extent).

Infinite tried this, I guess, but they hauled over too much baggage from previous games and lore.

Honestly, if I were in charge of Halo I’m not sure what I’d do now… As I’m sure ‘make Halo Infinite multiplayer only, kill any semblance of story there, f that wierd stuff in the novels and start over with a narrower setting’ would get shot down.

brucethemoose@lemmy.world · edit-2 6 days ago

Sometimes there’s a rock hard justification. Orbital mechanics is a great one. Game engines are literally not built for physics at celestial scales.

KSA’s feature scope is way narrower than, say, a game with tons of NPCs and voxels and elaborate foilage and MMO-scale multiplayer and such. The DayZ guy and that studio are also pretty experienced at this point.

So yeah, I agree! And I’m glad KSA is seemingly progressing well, actually…

brucethemoose@lemmy.world · edit-2 6 days ago

It’s not so simple, there are papers on zero data ‘self play’ or other schemes for using other LLM’s output.

Distillation is probably the only one you’d want for a pretrain, specifically.

brucethemoose@lemmy.world · edit-2 6 days ago

…even after a major reboot of the game engine…

Custom engines claim another victim.

Game dev is hard. Game engines are apparently impossible and cost prohibitive these days, unless licensed out en masse. They’re killing studios and franchises left and right.

brucethemoose@lemmy.world · edit-2 6 days ago

I elaborated below, but basically Musk has no idea WTF he’s talking about.

If I had his “f you” money, I’d at least try a diffusion or bitnet model (and open the weights for others to improve on), and probably 100 other papers I consider low hanging fruit, before this absolutely dumb boomer take.

He’s such an idiot know it all. It’s so painful whenever he ventures into a field you sorta know.

But he might just be shouting nonsense on Twitter while X employees actually do something different. Because if they take his orders verbatim they’re going to get crap models, even with all the stupid brute force they have.

brucethemoose@lemmy.world · edit-2 6 days ago

There’s some nuance.

Using LLMs to augment data, especially for fine tuning (not training the base model), is a sound method. The Deepseek paper using, for instance, generated reasoning traces is famous for it.

Another is using LLMs to generate logprobs of text, and train not just on the text itself but on the *probability a frontier LLM sees in every ‘word.’ This is called distillation, though there’s some variation and complication. This is also great because it’s more power/time efficient. Look up Arcee models and their distillation training kit for more on this, and code to see how it works.

There are some papers on “self play” that can indeed help LLMs.

But yes, the “dumb” way, aka putting data into a text box and asking an LLM to correct it, is dumb and dumber, because:

You introduce some combination of sampling errors and repetition/overused word issues, depending on the sampling settings. There’s no way around this with old autoregressive LLMs.
You possibly pollute your dataset with “filler”
In Musk’s specific proposition, it doesn’t even fill knowledge gaps the old Grok has.

In other words, Musk has no idea WTF he’s talking about. It’s the most boomer, AI Bro, not techy ChatGPT user thing he could propose.