For context: I created a video search engine last year, shut it down, and put the data online. You can read about it here: https://www.bendangelo.me/2024/07/16/failed-attempt-at-creating-a-video-search-engine/
I put that project on hold because of scaling issues. Anyway, I'm back with another idea. I've been frustrated with how AI slop is ruining the internet, and recently it's been hitting YouTube pretty hard with AI videos. I'm brainstorming a tool for people to self-host:
- Self-hosted crawler: pick which sites/videos to index (blogs, forums, YT channels, etc.).
- AI chat interface: ask questions like "Show me Rust tutorials from 2023" or "Summarize recent posts about homelab backups."
- Optional sharing: pool indexes with trusted friends/communities.
Why?

- No Google/YouTube spam, only content you choose.
- Works offline (archive forums, videos, docs).
- Local AI (Mistral) or cloud (paid) for smarter searches.
Would this be useful to you? What sites would you crawl? Any killer features I’m missing?
Prototype in progress—just testing interest!
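Here's roughly the crawler loop I have in mind; just a sketch, nothing final (the allowlist and names are placeholders, and a real version would respect robots.txt and per-site rate limits):

```python
# Minimal allowlist crawler sketch: only fetches domains the user opted in to.
import time
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

ALLOWLIST = {"example-blog.com", "forum.example.org"}  # user-chosen sites (placeholder)

def crawl(seed_urls, max_pages=100, delay=2.0):
    seen, queue, pages = set(), list(seed_urls), []
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        if url in seen or urlparse(url).netloc not in ALLOWLIST:
            continue
        seen.add(url)
        try:
            resp = requests.get(url, timeout=10, headers={"User-Agent": "selfhost-crawler/0.1"})
        except requests.RequestException:
            continue
        if resp.status_code != 200:
            continue
        soup = BeautifulSoup(resp.text, "html.parser")
        pages.append({
            "url": url,
            "title": soup.title.get_text(strip=True) if soup.title else "",
            "text": soup.get_text(" ", strip=True),
        })
        # Follow links; the allowlist check above keeps us on chosen sites.
        for a in soup.find_all("a", href=True):
            queue.append(urljoin(url, a["href"]))
        time.sleep(delay)  # be polite so we don't look like an abusive bot
    return pages
```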
I personally have zero interest in AI search, if you mean LLMs. The fact that it can make stuff up also means it can miss stuff. Neither is acceptable for a search engine.
If you mean some kind of deterministic algorithm for indexing and searching, then maybe.
Also, attempting to crawl sites locally sounds like a great way to get banned from those sites for looking like a bot.
I can’t imagine self hosting an LLM-based search engine would be too viable. The hardware demands, even for a relatively small quantised model, are considerable. Doubly so if you don’t have a GPU to accelerate with.
Yeah, absolutely. And running a GPU 24/7 to occasionally search is just a waste of power. I'm not convinced that Google's and Bing's AI search makes financial sense either. Google dropped live search (where the results updated in real time as you typed) because it was too expensive, so how does LLM search end up cheaper than live search?!
Edit: This is the live search thing: https://searchengineland.com/test-google-updating-search-results-as-you-type-49116 ~~Annoyingly hard to find, and I can't find the articles on its cancellation, but from memory it was related to expense.~~
Edit 2: It was Google Instant, and its death was blamed on mobile and on wanting to unify the mobile/desktop experience. I do vaguely remember expense being an unofficial/rumored reason, but I can't back that up.
You realize the GPU sits idle when not actively being used, right?
It'd be cheaper if you host it locally: essentially just your normal electricity bill, which is the entire point of what OP is saying lol.
Idle is low power, not zero power. And it won't be idle when it's scraping and parsing the sites, so depending on how much scraping it's doing, that could be significant non-idle energy usage.
The GPU is already running because it's in the device. By this logic I shouldn't have a GPU in my homelab until I want to use it for something; RIP Jellyfin and Immich, I guess.
I get the impression you don't really understand how local LLMs work. You likely wouldn't need a very large model to run basic scraping; it would just depend on what OP has in mind, or what kind of schedule it runs on. You should consider the difference between a megacorp's server farm and some rando using this locally on consumer hardware (which seems to be OP's intent).
I didn't say you can't have a GPU, but to me it's wasteful. I keep my Jellyfin server off when not in use, and use WoL to start it when it's needed.
I have played with local LLMs, and the models I used were unimpressive, but without knowing what OP has in mind, we can't know how much power it will use. If it just spins up the GPU once a day for 20 minutes, that's probably okay; you won't even notice it. But anyone like me who doesn't already have a GPU in their lab will probably notice it quite clearly on their power bill.
A megacorp's server farm is huge, but it's also amortised over millions of users; they probably don't need one GPU per customer, so the efficiency isn't necessarily bad. (Although at the moment, given megacorps are tripping over themselves to throw compute at LLM training, this may not be true.)
You can run DeepSeek on a Raspberry Pi.
At a level you'd need for a search engine?
AI uses far more resources than standard search engines, and it comes at a time when the whole planet needs to slow down climate change.
No. Take out the AI slop and I might be interested.
lol that’s fair. I’m just brainstorming here
Why the hell does everything have to be AI for you people to be happy? I just plain don't understand it. We know that AI hurts your critical thinking and reasoning skills, and we continue to just pack AI into everything… Doesn't make sense. Sooner or later you're gonna need to ask ChatGPT whether or not you need to wipe your own ass.
There are various levels of AI here
Storing embeddings/vectors in a search index can make your searches smarter and more relevant. The embeddings squeeze related concepts closer together than pure keyword approaches, which if done well increases retrieval quality.
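As a concrete illustration, here's a minimal sketch of embedding-based retrieval; the model name and documents are just examples, not anyone's actual implementation:

```python
# Toy semantic search: embed documents once, then rank by cosine similarity.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model would do

docs = [
    "How to configure restic backups for a homelab",
    "Rust 1.70 release notes and new features",
    "Baking sourdough bread at home",
]
doc_vecs = model.encode(docs, normalize_embeddings=True)

def search(query, top_k=2):
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q  # cosine similarity, since vectors are normalized
    return [(docs[i], float(scores[i])) for i in np.argsort(-scores)[:top_k]]

# Related concepts rank high even without shared keywords.
print(search("homelab backup tools"))
```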
RAG tools and AI searches are just a layer on top of your index. When done well these can be really useful in annotating your results and speeding up finding things.
That's useful when you're searching, say, an error message and the AI is able to iterate on keywords, skim a GitHub issue about it, and skip to the resolution.
Similarly it’s good when you’re researching something but don’t have the exact words, AI search can iterate and capture your intent, then run several queries based on that.
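A minimal sketch of that layering, assuming a local Ollama instance with a pulled model and reusing the toy `search()` from the embedding sketch above (all names illustrative):

```python
# RAG sketch: retrieve top documents, then let a local model answer over them.
import ollama

def rag_answer(query):
    hits = search(query, top_k=3)  # retrieval step from the sketch above
    context = "\n".join(doc for doc, _ in hits)
    resp = ollama.chat(
        model="mistral",  # whatever local model you have pulled
        messages=[
            {"role": "system", "content": "Answer using only the provided context. Say which document you used."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return resp["message"]["content"]
```

The key point is that the index does the finding and the model only annotates what was found, which limits how much it can make up.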
I don’t find the hallucination problem significant in practice with a lot of AI search tools, but I have found AI is vulnerable to certain types of SEO spam that a human would never fall for.
As an example, most companies have a "comparison to" or "alternatives to" blog post. The AI does not critically consider the fact that a service is hosting a blog post shilling its own product. So asking a search AI for options actually gives poor results, because it will return the shilled posts that appear first in search.
AI search also adds an additional silent layer of filtering, which you need to be conscious of.
But search engines are something we actually figured out a few years ago; what advantage is AI going to bring? Do we also need AI wheels now?
This is the "smart" thing all over again. I don't need a smart toilet or a smart toothbrush.
For the average consumer, AI is a novelty at this point, even though we have been using bits and pieces of it for a long while now. But it's hitting its stride in stuff like face swaps, neat TikTok videos, and making weird pictures. I liken it to when "the cloud" came to town. Hell, we'd been uploading to servers and running apps on servers for a long while before "the cloud" happened. Everyone and their brother trampled each other to move their entire operations to the cloud. Then, as the dust settled, we started realizing that not everything that could be in the cloud should be in the cloud, and things got back to normal. But just the words "the cloud" made CEOs jizz their pants at one time.
Samesies with AI. It's a selling point. There was a thread here, I believe, talking about an AI rice cooker. The "AI" part sells it, even though we've been making excellent rice for millennia. I use AI; I find it a faster way to cut through all the searches and get bulleted points to deviate from. I realize it's not best practice to rely on AI's word, but I use it as a springboard into further investigation.
No. Never would I self-host a search engine.
The crawler would eat up so many more resources than I am ever willing to spend.
I've been thinking for years that maybe there's a way to do a collaborative crawler and indexer, similar to how collaborative science is done, and probably using p2p protocols.
Get a bunch of people together to create the perfect search engine in these dire times.
Doesn’t something like this exist already? https://en.wikipedia.org/wiki/Distributed_search_engine
Potentially that would be a good application of federation and distributed computing.
An Internet Archive-like distributed tool that then feeds into local tokenization and indexing.
Alternatively, a centralized service that generates indices that are then queried locally would save a lot of energy.
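To sketch the "centralized indices, local queries" idea, a prebuilt full-text index can be a plain SQLite file with FTS5; the URL and table schema here are entirely hypothetical:

```python
# Fetch a prebuilt SQLite FTS5 index once, then run all queries locally.
import sqlite3
import urllib.request

INDEX_URL = "https://indexes.example.org/selfhosted-sites.db"  # hypothetical service
urllib.request.urlretrieve(INDEX_URL, "sites.db")

con = sqlite3.connect("sites.db")
# Assumes the service published: CREATE VIRTUAL TABLE pages USING fts5(url, title, body);
for url, title in con.execute(
    "SELECT url, title FROM pages WHERE pages MATCH ? ORDER BY rank LIMIT 10",
    ("homelab backups",),
):
    print(url, "-", title)
```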
No. I’m so bloody fed up with AI “search” solutions that return everything on the fucking planet except what I want. Text search has been a solved problem for a decade. All I want out of a search engine is to be deterministic, stable, and reliable, and to look in titles, descriptions, and keywords. Vibe processing is completely unnecessary and will only create issues.
If you really want to iNnoVAte, then consider creating an index with transcripts and summaries that users can search by keywords.
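That deterministic version is tiny to build, for what it's worth; a sketch with SQLite's built-in FTS5 (the transcript data is made up, and it's the same FTS5 idea as the prebuilt-index sketch above, just built locally):

```python
# Deterministic keyword search over video transcripts: no model involved.
import sqlite3

con = sqlite3.connect("transcripts.db")
con.execute("CREATE VIRTUAL TABLE IF NOT EXISTS videos USING fts5(title, transcript)")
con.execute(
    "INSERT INTO videos VALUES (?, ?)",
    ("Rust tutorial 2023", "in this video we cover ownership, borrowing, lifetimes..."),
)
con.commit()

# Stable, repeatable results ranked by BM25: same query, same output, every time.
for (title,) in con.execute(
    "SELECT title FROM videos WHERE videos MATCH ? ORDER BY rank", ("lifetimes",)
):
    print(title)
```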
Seems nifty. Bake in stuff like selecting your AI provider (support local Llama, a local OpenAI-compatible API, and third parties if you have to, I guess lol), and make sure it's dockerized (or is relatively easy to dockerize; bonus points for including a compose file).
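Provider selection can be cheap to support, since local servers like Ollama expose an OpenAI-compatible endpoint; a sketch, with the base URL and model name being whatever your setup uses:

```python
# One client, swappable backends: point base_url at a local server or a cloud API.
from openai import OpenAI

def make_client(provider="local"):
    if provider == "local":
        # Ollama's OpenAI-compatible endpoint; the api_key is ignored but required.
        return OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
    return OpenAI()  # falls back to OPENAI_API_KEY in the environment

client = make_client("local")
resp = client.chat.completions.create(
    model="mistral",  # whatever model you have pulled locally
    messages=[{"role": "user", "content": "Summarize recent posts about homelab backups."}],
)
print(resp.choices[0].message.content)
```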
Oh, being able to hook into a self-hosted engine like SearXNG would be nice too; you can do that with the Oobabooga web search plugin currently, as an example.
No.
I don't want AI slop from big corpos, and you think I'm gonna want AI slop that's just as wasteful and harmful just because it's "locally produced"? That's a Republican-ish crap line of thought.
No.
AI search offers me nothing that “normal” search doesn’t also offer.
But it uses a thousand times more resources.
10 years ago people were shocked by the size of Google’s server halls. Now imagine the increase in size/numbers through AI.
Fuck this shit. The internet isn’t what’s driving the climate catastrophe, it’s how people use it.
Not really. I could use a good self-hosted search engine. I mean, all the existing projects (which is just YaCy, to my knowledge) are a bit dated. Nowadays we've only got metasearch engines, and we're relying on Google, Bing, etc.
But I don’t need any chatbot enhancements. That’s usually something I skip when using Google or Bing because it doesn’t work well. The AI summaries tend to be wrong, and it’s bad at looking up niche information, which is something I need a search engine to be able to find. The AI just cites the most common slop, or at best the Wikipedia article. But I don’t really need any fancy software to get there… So for me, we don’t need any AI augmentation.
And I think the old way of googling was fine. Just teach people to put in the words that are likely to be in the article they want to find. That'd be something like "Rust new features 2023" or "homelab backup blog". Sure, you can strap on a chatbot and put in entire natural language questions. But I think that's completely unnecessary. We have brains and we're perfectly able to translate our questions into search queries with little effort… if somebody teaches us what to type into the search bar, and why.
If I wanted to self-host a search engine, I'd just use a proper one that actually searches content rather than regurgitating bullshit.
Search engines worked just fine until Google and Microsoft decided that they wanted to sell their AI products.
I think SearXNG already has AI integration. Not sure how it works though. I don't think I would personally use AI for things other than summarising what I search, but it is a useful feature to have.
Does it? I haven’t seen that in my instance settings, will have to take another gander.
Sorry, I was wrong. I think I probably saw it in a blog post where they mentioned creating an AI search engine using SearXNG and Ollama. I don’t see any mention of native Ollama integration in the SearXNG docs
I think the really useful idea here is solving the scaling issue by limiting the source sites to a known good set. 95% of the time I am not looking for results from unknown sites. In fact I actively work to get information from the sites I trust.
Yeah that’s the idea. Let people build their own lists and share them.
AI has become an abbreviation for "bad" and I wouldn't want that, but yes, I've been interested for a while in building language models into search engines, to give queries more reach into the document semantics. Unfortunately, naive approaches like matching vector embeddings instead of (or alongside) search terms seem near useless and just clutter up the results.
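One step past the naive version is fusing rankings rather than mixing raw scores; a sketch of reciprocal rank fusion over a keyword result list and an embedding result list (both lists are placeholders):

```python
# Reciprocal rank fusion (RRF): combine two rankings without comparing raw scores,
# which is one way to keep embedding matches from drowning out keyword matches.
def rrf(keyword_ranked, vector_ranked, k=60):
    scores = {}
    for ranking in (keyword_ranked, vector_ranked):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Placeholder result lists, best first, as returned by each subsystem.
keyword_hits = ["doc3", "doc1", "doc7"]
vector_hits = ["doc1", "doc9", "doc3"]
print(rrf(keyword_hits, vector_hits))  # docs strong in both rankings float to the top
```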
I’d be interested in knowing what approaches you’re using. FOSS, I hope.