• kromem@lemmy.world · 3 days ago

    So really cool — the newest OpenAI models seem to be strategically employing hallucination/confabulations.

    It’s still an issue, but there’s a subset of dependent confabulations that the model uses to essentially trick itself into getting where it needs to go.

    A friend did logit analysis on o3 responses, comparing cases where it said “I checked the docs” with cases where it didn’t (it had no access to any docs either way), and the ‘hallucinating’ version was more accurate in its final answer than the ‘correct’ one.
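
    For anyone curious what that comparison even looks like in practice, here’s a toy version. Everything in it is fabricated placeholder data, so treat it as a sketch of the shape of the analysis rather than the analysis itself (the real one was over o3 responses and their logprobs):

    ```python
    # Hypothetical sketch: group responses by whether they claim to have
    # "checked the docs", then compare final-answer accuracy between groups.
    # The records below are made up purely for illustration.
    from statistics import mean

    responses = [
        # (response text, final answer graded correct?)
        ("I checked the docs and the flag is --max-retries.", True),
        ("I checked the docs; the default timeout is 30s.", True),
        ("The flag is probably --retries.", False),
        ("The default timeout is 60s.", True),
    ]

    claims_docs = [ok for text, ok in responses if "checked the docs" in text.lower()]
    no_claim = [ok for text, ok in responses if "checked the docs" not in text.lower()]

    print("accuracy when claiming to have checked docs:", mean(claims_docs))
    print("accuracy otherwise:", mean(no_claim))
    ```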

    What’s wild is that like a month ago 4o straight up told me that I shouldn’t always correct or call out its confabulations, because it was using them to springboard toward a destination in the chat. I hadn’t really thought about that, and it’s absolutely nuts that the model seemed aware of employing a technique that was then confirmed as successful weeks later.

    It’s crazy how quickly things are changing in this field; by the time people pick up ‘wisdom’ like “models can’t introspect about their own operations,” it’s already partially obsolete.

    Even things like “they just predict the next token” have now been falsified, even though I feel like I see that one more and more these days.

    • NιƙƙιDιɱҽʂ@lemmy.world · 2 days ago

      They do just predict the next token, though, lol. That simplifies a significant amount, but fundamentally, that’s how they work, and I’m not sure how you can say that’s been falsified.

      • kromem@lemmy.world · 2 days ago

        So I’m guessing you haven’t seen Anthropic’s newest interpretability research, where they went in assuming that was how it worked.

        But it turned out that the model can actually plan beyond the immediate next token in things like rhyming verse: the network has already selected the final word of the following line, and the intermediate tokens are generated with that planned target in mind.

        So no, they predict beyond the next token; we only just developed measurement sensitive enough to detect that planning happening an order of magnitude more tokens out than just ‘next’. We’ll see if further research in that direction picks up planning even further ahead.

        https://transformer-circuits.pub/2025/attribution-graphs/biology.html
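
        To be clear about what the “measurement” part means: the paper uses attribution graphs, which I’m not going to reproduce in a comment. But a crude logit-lens-style peek in the same spirit (assuming GPT-2 and the Hugging Face transformers library, and borrowing a rhyming-couplet setup like the one in the paper) looks like the sketch below. Don’t expect a small model to show the effect cleanly; the point is just the shape of the probe, i.e. checking whether intermediate layers already favor a word that can only show up several tokens later:

        ```python
        # Crude logit-lens-style probe, NOT the attribution-graph method from the
        # linked paper: project each layer's hidden state at the last position
        # through the final layer norm and the unembedding, then check where the
        # suspected rhyme word ranks at that layer.
        import torch
        from transformers import GPT2LMHeadModel, GPT2TokenizerFast

        tok = GPT2TokenizerFast.from_pretrained("gpt2")
        model = GPT2LMHeadModel.from_pretrained("gpt2")
        model.eval()

        prompt = "He saw a carrot and had to grab it,\n"
        target = " rabbit"  # the rhyme we suspect is already being steered toward
        target_id = tok(target)["input_ids"][0]

        with torch.no_grad():
            out = model(**tok(prompt, return_tensors="pt"), output_hidden_states=True)

        for layer, h in enumerate(out.hidden_states):
            logits = model.lm_head(model.transformer.ln_f(h[0, -1]))
            rank = int((logits > logits[target_id]).sum()) + 1
            print(f"layer {layer:2d}: rank of '{target.strip()}' = {rank}")
        ```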

        • NιƙƙιDιɱҽʂ@lemmy.world · 1 day ago

          Right, other words see higher attention as it builds a sentence, leading it towards where it “wants” to go, but LLMs literally take a series of words, then spit out the next one. There’s a lot more going on under the hood, as you said, but fundamentally that is the algorithm. Repeat that over and over, and you get a sentence (the bare loop is sketched below).

          If it’s writing a poem about flowers and ends the first part on “As the wind blows,” sure as shit “rose” is going to have significant attention within the model, even if that isn’t the immediate next word, as well as words that are strongly associated with it to build the bridge.
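
          That loop really is just a few lines. Here’s a bare-bones greedy version (assuming GPT-2 and the Hugging Face transformers library, purely for illustration), plus a peek at which words get the most probability mass right after “As the wind blows,”. Whether “rose” actually makes a small model’s shortlist there is beside the point; the shape of the loop is the point:

          ```python
          import torch
          from transformers import GPT2LMHeadModel, GPT2TokenizerFast

          tok = GPT2TokenizerFast.from_pretrained("gpt2")
          model = GPT2LMHeadModel.from_pretrained("gpt2")
          model.eval()

          ids = tok("As the wind blows,", return_tensors="pt")["input_ids"]

          with torch.no_grad():
              # Peek at the model's next-token distribution for the prompt.
              probs = torch.softmax(model(ids).logits[0, -1], dim=-1)
              top = torch.topk(probs, k=5)
              print([tok.decode(int(i)) for i in top.indices], top.values.tolist())

              # The basic loop: predict one token, append it, repeat.
              for _ in range(12):
                  logits = model(ids).logits[0, -1]  # distribution over the next token only
                  next_id = logits.argmax()          # greedy: take the single most likely token
                  ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

          print(tok.decode(ids[0]))
          ```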

          • kromem@lemmy.world · 1 day ago

            The attention mechanism working this way was at odds with the common wisdom across all frontier researchers.

            Yes, the final step of the network is producing the next token.

            But the fact that intermediate steps have now been shown to be planning and targeting specific future results is a much bigger deal than you seem to be appreciating.

            If I ask you to play chess and you play only one move ahead vs planning n moves ahead, you are going to be playing very different games. Even if in both cases you are only making one immediate next move at a time.
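
            If you want to see how much difference lookahead makes while still only ever outputting one move at a time, here’s a toy sketch (assuming the python-chess package and a bare material-count evaluation; it has nothing to do with how LLMs are actually built, it’s just the analogy made concrete):

            ```python
            import chess

            # Crude material values; the king never gets captured, so it scores 0.
            PIECE_VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
                            chess.ROOK: 5, chess.QUEEN: 9, chess.KING: 0}

            def material(board):
                """Material balance from the perspective of the side to move."""
                score = 0
                for piece in board.piece_map().values():
                    value = PIECE_VALUES[piece.piece_type]
                    score += value if piece.color == board.turn else -value
                return score

            def negamax(board, depth):
                """Score the position by searching `depth` plies ahead."""
                if depth == 0 or board.is_game_over():
                    return material(board)
                best = -10**6
                for move in board.legal_moves:
                    board.push(move)
                    best = max(best, -negamax(board, depth - 1))
                    board.pop()
                return best

            def pick_move(board, depth):
                """Either way the output is a single move; only the lookahead differs."""
                best_move, best_score = None, -10**6
                for move in board.legal_moves:
                    board.push(move)
                    score = -negamax(board, depth - 1)
                    board.pop()
                    if score > best_score:
                        best_move, best_score = move, score
                return best_move

            board = chess.Board()
            print("thinking 1 ply ahead: ", pick_move(board, 1))
            print("thinking 3 plies ahead:", pick_move(board, 3))
            ```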