• 0 Posts
  • 42 Comments
Joined 2 years ago
Cake day: August 29th, 2023


  • So this blog post is framed positively towards LLMs and is too generous in accepting many of the claims around them, but even so, its conclusions are pretty harsh on practical LLM agents: https://utkarshkanwat.com/writing/betting-against-agents/

    Basically, the author has tried extensively, in multiple projects, to make LLM agents work in various useful ways, but in practice:

    The dirty secret of every production agent system is that the AI is doing maybe 30% of the work. The other 70% is tool engineering: designing feedback interfaces, managing context efficiently, handling partial failures, and building recovery mechanisms that the AI can actually understand and use.

    The author strips down, simplifies, and sanitizes everything going into the LLMs, and then implements both automated checks and human confirmation on everything that comes out (that kind of scaffolding is sketched at the end of this comment). At that point it makes you question what value you are even getting out of the LLM. (The real answer, which the author only indirectly acknowledges, is attracting idiotic VC funding and upper management approval.)

    Critical as the author is, they still don’t acknowledge a lot of the bigger problems. API costs are a major expense and design constraint on the LLM agents they have built, but the author doesn’t acknowledge that prices are likely to rise dramatically once VC subsidization runs out.
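
    To make that 70/30 split concrete, here’s a minimal sketch of what the non-LLM scaffolding around a single agent step tends to look like. This is my own illustration, not the author’s code; call_model, ALLOWED_ACTIONS, and the other names are hypothetical stand-ins:

    ```python
    import json

    ALLOWED_ACTIONS = {"create_ticket", "close_ticket"}  # hypothetical tool whitelist
    MAX_RETRIES = 3

    def call_model(prompt: str) -> str:
        """Stand-in for a real LLM API call; here it just returns a canned response."""
        return '{"action": "create_ticket", "args": {"title": "example"}}'

    def sanitize_input(raw: str) -> str:
        """Strip and truncate what goes in -- the 'managing context efficiently' part."""
        return raw.strip()[:4000]

    def validate_output(text: str) -> dict:
        """Reject anything that isn't well-formed JSON naming a whitelisted action."""
        data = json.loads(text)  # raises json.JSONDecodeError on malformed output
        if data.get("action") not in ALLOWED_ACTIONS:
            raise ValueError(f"disallowed action: {data.get('action')!r}")
        return data

    def agent_step(task: str) -> dict | None:
        prompt = sanitize_input(task)
        for _ in range(MAX_RETRIES):
            try:
                data = validate_output(call_model(prompt))
                break
            except ValueError as err:  # json.JSONDecodeError is a subclass of ValueError
                # 'recovery mechanism': feed the error back and retry
                prompt = f"{prompt}\n\nYour last answer was rejected: {err}. Return valid JSON."
        else:
            return None  # give up after repeated failures

        # final gate: a human still has to approve everything the model proposes
        if input(f"Run {data['action']} with {data.get('args')}? [y/N] ").lower() != "y":
            return None
        return data
    ```

    Everything here except the call_model stub is ordinary software engineering, which is basically the author’s point.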


  • Is this “narrative” in the room with us right now?

    I actually recall someone pro-LLM recently trying to push that sort of narrative (that it’s only already-mentally-ill people being pushed over the edge by ChatGPT)…

    Where did I see it… oh yes, lesswrong! https://www.lesswrong.com/posts/f86hgR5ShiEj4beyZ/on-chatgpt-psychosis-and-llm-sycophancy

    This has all the hallmarks of a moral panic. ChatGPT has 122 million daily active users according to Demand Sage, that is something like a third the population of the United States. At that scale it’s pretty much inevitable that you’re going to get some real loonies on the platform. In fact at that scale it’s pretty much inevitable you’re going to get people whose first psychotic break lines up with when they started using ChatGPT. But even just stylistically it’s fairly obvious that journalists love this narrative. There’s nothing Western readers love more than a spooky story about technology gone awry or corrupting people, it reliably rakes in the clicks.

    The ~~call~~ narrative is coming from inside the ~~house~~ forum. Actually, this is even more of a deflection: not even claiming they were already on the edge, but that the number of delusional people is just the base rate (with no actual stats on rates of psychotic breaks, because on lesswrong vibes are good enough).


  • I think we mocked this one back when it came out on /r/sneerclub, but I can’t find the thread. In general, I recall Yudkowsky went on a mini-podcast tour a few years back, and the general trend was that he didn’t interview that well, even by lesswrong’s own standards. He tended to simultaneously assume too much background familiarity with his writing, so anyone not already familiar with it would be lost, while failing to add anything actually new for anyone who was already familiar with it. And there were lots of circular arguments and repetitious discussion with the hosts. I guess that’s the downside of hanging around within your own echo chamber blog for decades instead of engaging with wider academia.


  • For purposes of something easily definable and legally valid, that makes sense, but it is still so worthy of mockery and sneering. Also, even if they needed a benchmark like that for their bizarre legal arrangements, there was no reason besides marketing hype to call that threshold “AGI”.

    In general the definitional games around AGI are so transparent and stupid, yet people still fall for them. AGI means performing at least at human level across all cognitive tasks. Not across all benchmarks of cognitive tasks, but the tasks themselves. Not superhuman in some narrow domains and blatantly stupid in most others. To be fair, the definition might not be that useful, but it’s not really in question.




  • Gary Marcus has been a solid source of sneer material and debunking of LLM hype, but yeah, you’re right: he’s been taking victory laps over a bar set so, so low by promptfarmers and promptfondlers. Also, side note, his negativity towards LLM hype shouldn’t be misinterpreted as general skepticism towards all AI… in particular, Gary Marcus is pretty optimistic about neurosymbolic hybrid approaches; it’s just that his predictions and hypothesizing are pretty reasonable and grounded relative to the sheer insanity of LLM hypsters.

    Also, new possible source of sneers in the near future: Gary Marcus has made a lesswrong account and started directly engaging with them: https://www.lesswrong.com/posts/Q2PdrjowtXkYQ5whW/the-best-simple-argument-for-pausing-ai

    Predicting in advance: Gary Marcus will be dragged down by lesswrong, not lesswrong dragged up towards sanity. He’ll start using lesswrong lingo and terminology and quoting P(some event) based on numbers pulled out of his ass. Maybe he’ll even start to be “charitable” to meet their norms and avoid downvotes (I hope not, his snark and contempt are both enjoyable and deserved, but I’m not optimistic, based on how the skeptics and critics within lesswrong itself learn to temper and moderate their criticism to stay within the site). Lesswrong will moderately upvote his posts when he is sufficiently deferential to their norms and window of acceptable ideas, but won’t actually learn much from him.


  • Unlike with coding, there are no simple “tests” to try out whether an AI’s answer is correct or not.

    So for most actual practical software development, writing tests is in fact an entire job in and of itself, and it’s a tricky one, because covering even a fraction of the use cases and complexity the software will actually face when deployed is really hard. Simply letting LLMs brute-force trial-and-error their code through a bunch of tests won’t actually get you good working code.

    AlphaEvolve kind of did this, but it was targeting very specific, well-defined, well-constrained algorithms that could have very specific evaluations written for them, and it was using an evolutionary algorithm to guide the trial-and-error process (the general shape of that kind of loop is sketched at the end of this comment). They don’t say exactly in their paper, but that probably meant generating code hundreds or thousands or even tens of thousands of times to produce relatively short sections of code.

    I’ve noticed a trend where people assume other fields have problems LLMs can handle, but the actually competent experts in those fields know why LLMs fail at key pieces.
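
    For a rough sense of what “an evolutionary algorithm guiding the trial-and-error process” means in the abstract, here’s a toy sketch. This is my own illustration, not AlphaEvolve’s actual method: the “program” being evolved is just a coefficient vector, and mutate stands in for asking the model to rewrite a piece of code. The whole approach only works because evaluate is a cheap, fully automated, well-defined scorer:

    ```python
    import random

    def evaluate(candidate: list[float]) -> float:
        """Automated, well-defined scorer -- the thing AlphaEvolve's target problems have
        and most real software doesn't. Here the 'program' is just a coefficient vector
        and the score is closeness to a fixed target."""
        target = [3.0, -1.0, 2.0]
        return -sum((c - t) ** 2 for c, t in zip(candidate, target))

    def mutate(candidate: list[float]) -> list[float]:
        """Stand-in for 'ask the LLM to propose a modified version of the code'."""
        return [c + random.gauss(0, 0.5) for c in candidate]

    def evolve(generations: int = 200, population_size: int = 20) -> list[float]:
        population = [[random.uniform(-5, 5) for _ in range(3)] for _ in range(population_size)]
        for _ in range(generations):
            # score every candidate with the automated evaluator and keep the best quarter
            survivors = sorted(population, key=evaluate, reverse=True)[: population_size // 4]
            # refill the population with mutated offspring of the survivors; note how many
            # candidate 'programs' get generated and scored over the whole run
            population = survivors + [mutate(random.choice(survivors))
                                      for _ in range(population_size - len(survivors))]
        return max(population, key=evaluate)

    if __name__ == "__main__":
        best = evolve()
        print(best, round(evaluate(best), 4))
    ```

    Swap evaluate for a real-world test suite that takes minutes to run and only covers a fraction of the real behaviour, and both the economics and the guarantees fall apart.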