• ☆ Yσɠƚԋσʂ ☆@lemmygrad.mlOP
      1 day ago

      Sure, but it’s still expensive for most people to get frontier model performance locally. I expect that in a few years the models will get optimized enough that even ones that run on modest hardware will be able to do everything a current frontier model does. And that’s going to be the big game changer because there isn’t going to be much demand for models as a service at that point.

  • loathsome dongeater@lemmygrad.ml
    3 days ago

    Does anyone use local LLMs? I don’t use LLMs myself, just the occasional shooting the shit with z.ai, but in mainstream discussion local LLMs are almost never brought up except as a potential hedge against the AI bubble bursting, and then only by people who have used local models for less than five minutes in their entire lives.

    The hardware requirements make local models unlikely for most people. Everyone who talks about trying (not using) local models seems to have a MacBook Pro. If the bubble bursts, then the future will probably be large open-source models that can be vendored by anyone.

    • ☆ Yσɠƚԋσʂ ☆@lemmygrad.mlOP
      3 days ago

      I run local models on a MacBook Pro, incidentally. A 32-billion-parameter model can do a lot of useful stuff, I find. The progress on making models smaller and faster has been very rapid, and I fully expect that within a few years you’ll be able to run the equivalent of current frontier models on a local machine. On top of that, we’re seeing things like ASIC chips being developed that implement the model in hardware. These could end up like GPU cards that you just plug into your computer.
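
      For anyone curious, a minimal sketch of what running a ~32B model locally can look like, assuming the llama-cpp-python bindings and an already-downloaded quantized GGUF file (the filename below is just a placeholder):

      ```python
      # Minimal sketch: chat with a local ~32B model via llama-cpp-python.
      # Assumes a quantized GGUF file is already on disk; the path is hypothetical.
      from llama_cpp import Llama

      llm = Llama(
          model_path="models/qwen2.5-32b-instruct-q4_k_m.gguf",  # placeholder filename
          n_ctx=8192,       # context window; larger values need more RAM
          n_gpu_layers=-1,  # offload all layers to the GPU / Apple Silicon if available
      )

      reply = llm.create_chat_completion(
          messages=[{"role": "user", "content": "Summarise the main idea of RAII in two sentences."}],
          max_tokens=256,
      )
      print(reply["choices"][0]["message"]["content"])
      ```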

      The tech industry has gone through many mainframe-to-personal-computer cycles over the years. When new tech appears, it initially requires a huge amount of computing power to run. But over time people figure out how to optimize it, the hardware matures, and it becomes possible to run this stuff locally. I don’t see why this tech should be any different.

      • CriticalResist8@lemmygrad.ml
        3 days ago

        That ASIC chip prototype is pretty impressive. You can try it on https://chatjimmy.ai/ without an account; ask it to write something big like an essay or a guide - literally the longest wait is getting connected to the API, and then the answer appears instantly.

        The only limitation right now is that they put a small Llama 8B model on the chip, but it’s a prototype and proof of concept, of course. I’m sure China will soon print a full DeepSeek model on such a chip lol.

        Right now there isn’t much interest in making AI more efficient to run, but yeah, there’s no reason we won’t see advances there. China is already doing a lot to squeeze models into smaller hardware.

        I don’t run LLMs locally because what I’m limited to isn’t great (context size especially), but the way things are going, I think we’ll definitely start to see open options open up, if only because academia requires it.

          • CriticalResist8@lemmygrad.ml
            1 day ago

            Oh man you’re underselling it to the rest of the website haha.

            But it’s tough to convey just how fast this is without seeing it. 15,749 tokens/s is what I get, and most responses from the big models might be a bit over 1000 tokens, maybe 2000 if they’re stretching it (including chain of thought). The longest I got DeepSeek to go recently was just a bit below 5000 tokens.

            But at such speeds, all of these generation lengths - 1000, 2000, 5000 - are basically done in the blink of an eye. 5000 tokens will be written in a third of a second, or just slightly above the average reaction time.
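
            Back-of-the-envelope arithmetic for that claim, using the throughput figure above:

            ```python
            # How long a 5000-token answer takes at ~15,749 tokens/s.
            tokens = 5_000
            tokens_per_second = 15_749           # throughput reported above
            print(tokens / tokens_per_second)    # ~0.32 s, just above a typical human reaction time (~0.25 s)
            ```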

            Unfortunately Jimmy seems limited to ~1000 tokens per generation, so we won’t really be able to push it to its limits lol.

              • CriticalResist8@lemmygrad.ml
                17 hours ago

                I hope they can scale it, and not only that, but that others are able to replicate it. This definitely has potential: it would bypass the entire GPU/TPU problem and the layer architecture, which is very inefficient.

                Speed isn’t the be-all and end-all, but it’s not just the speed, it’s also being able to run this fully locally. Imagine a PCI card for these chips, where you just swap out the chip for another when you want to switch models.

                I’m just hopium-posting, mind you lol - they clearly ran into bottlenecks if all they can offer is a ‘tiny’ Llama 8B model. The fabrication required to etch an 800B-parameter model onto such a chip is orders of magnitude beyond this, and at that point it might cost as much as a new CPU. But it does leave the GPU available for other things, and it lets everyone run SOTA models.

                Really hope this goes somewhere, or if not this exact thing, then something similar enough.

                • Che's Motorcycle@lemmygrad.ml
                  6 hours ago

                  I think that last bit is exactly right. It doesn’t have to be exactly this that catches on, but the model of massive data centers that run their chips into oblivion every 6-12 months is peak monopoly capital irrationality.

        • ☆ Yσɠƚԋσʂ ☆@lemmygrad.mlOP
          3 days ago

          That’s what I’m thinking too. There’s no reason why you couldn’t make a chip like this for a full-blown DeepSeek model, and then when new models come out you just print new chips for them. The really nice part is that their approach doesn’t need DRAM either, because the state of each transistor acts as memory; it just needs a bit of SRAM, which we don’t have a shortage of.

          I’m fully convinced that the whole AI-as-a-service business model is going to be very short-lived. Ultimately, nobody really likes their data going out to some company, or having to pay subscription fees to use the models. If we start getting these kinds of specialized chips, they’re going to be a game changer.

          • CriticalResist8@lemmygrad.ml
            1 day ago

            I could, however, totally see an economy where the chips themselves, while cheap to produce, cost a premium based on the model and the number of parameters.

            Because the tech is certainly impressive, and they have a proof of concept. I don’t know how scalable this is for them (or for others), but it clearly works and shows immediate advantages. If it could integrate with existing consumer hardware, like, say, a PCI card you plug the chip into and swap out when you want to change the model, anybody could easily have this at home.

            But with capitalism we’d probably have to settle for DRM’d chips that self-destruct after X many tokens generated lol.

    • certified sinonist@lemmygrad.ml
      3 days ago

      I use local models. To me the entire point of AI falls apart unless you can run it independently.

      I just think generating videos and stuff is more exciting and sensational. But if you get any use out of LLMs, it’s a no-brainer to set one up instead of paying for a subscription.