Need to let loose a primal scream without collecting footnotes first? Have a sneer percolating in your system but not enough time/energy to make a whole post about it? Go forth and be mid: Welcome to the Stubsack, your first port of call for learning fresh Awful you'll near-instantly regret.
Any awful.systems sub may be subsneered in this subthread, techtakes or no.
If your sneer seems higher quality than you thought, feel free to cut'n'paste it into its own post; there's no quota for posting and the bar really isn't that high.
The post-Xitter web has spawned soo many "esoteric" right wing freaks, but there's no appropriate sneer-space for them. I'm talking redscare-ish, reality-challenged "culture critics" who write about everything but understand nothing. I'm talking about reply-guys who make the same 6 tweets about the same 3 subjects. They're inescapable at this point, yet I don't see them mocked (as much as they should be).
Like, there was one dude a while back who insisted that women couldn't be surgeons because they didn't believe in the moon or in stars? I think each and every one of these guys is uniquely fucked up and if I can't escape them, I would love to sneer at them.


ZITRON DROPPED
I don't really understand what point Zitron is making about each query requiring a "completely fresh static prompt", nor about the relative ordering of the user and static prompts. Why would these things matter?
There are techniques for caching some of the steps involved with LLMs. Like I think you can cache the tokenization and maybe some of the work the attention heads are doing if you have a static, known prompt? But I don't see why you couldn't just do that caching separately for each model your model router might direct things to, and if you have multiple prompts, just keep a separate cache for each one. This creates a lot of memory overhead, but not excessively more computation… well, you do need to do the computation to generate each cache once. I don't find it that implausible that OpenAI managed to screw all this up somehow, but I'm not quite sure the exact explanation of the problem Zitron has given fits together.
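To make that concrete, here's a toy sketch of what I mean by a separate cache per model/prompt pair (purely illustrative Python; `expensive_prefill`, the model names, and the string "caches" are all made up, and a real serving stack would be caching tokenization and KV tensors, not strings):

```python
from functools import lru_cache

def expensive_prefill(model_id: str, static_prompt: str) -> str:
    # Stand-in for the real work: tokenizing the static prompt and running
    # the model over it once to build a reusable prefix (KV) cache.
    print(f"prefilling {model_id} over {len(static_prompt)} chars...")
    return f"kv-cache[{model_id}]"

@lru_cache(maxsize=None)  # memory grows with each (model, prompt) pair...
def get_prefix_cache(model_id: str, static_prompt: str) -> str:
    # ...but the expensive prefill runs only once per pair.
    return expensive_prefill(model_id, static_prompt)

def answer(model_id: str, static_prompt: str, user_turns: str) -> str:
    cache = get_prefix_cache(model_id, static_prompt)
    return f"decode({cache}, {user_turns!r})"

# The router can bounce between models; each one pays its prefill cost once.
print(answer("model-a", "You are the cheap router target.", "hi"))
print(answer("model-b", "You are the expensive router target.", "hi"))
print(answer("model-a", "You are the cheap router target.", "hi again"))  # cache hit, no new prefill
```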
(The order of the prompts vs. user interactions does matter, especially for caching… but I think you could just cut and paste the user interactions to separate them from the old prompt and stick a new prompt on in whatever order works best? You would get wildly varying quality in the output as it switches between models and prompts, but this wouldn't add in more computation…)
Zitron mentioned a scoop, so I hope/assume someone did some prompt hacking to get GPT-5 to spit out some of its behind-the-scenes prompts and he has solid proof of what he is saying. I wouldn't put anything past OpenAI, for certain.
I think this hinges on the system prompt going after the user prompt, for some non-obvious router-related reason, meaning that at each model change the input is always new and thus uncacheable.
Also, going by the last Claude system prompt that leaked, these things can be like 20,000 tokens long.
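Rough toy illustration of why that ordering would wreck the cache (made-up numbers and structure, nothing scraped from the actual router): prompt caching can only reuse the longest prefix that exactly matches an earlier request, so if a ~20,000-token static prompt sits behind user turns that change every message, it gets re-prefilled every single time:

```python
# Toy model of prefix reuse: count how many leading segments of the current
# request exactly match a previous request (whitespace tokens as a stand-in).

def cached_tokens(previous: list[str], current: list[str]) -> int:
    n = 0
    for a, b in zip(previous, current):
        if a != b:
            break
        n += len(a.split())
    return n

system = "system-prompt " * 20_000          # pretend ~20k tokens of static prompt
turn1, turn2 = "user: hi", "user: tell me more"

# System prompt first: the big static prefix matches across turns -> cheap.
first  = [system, turn1]
second = [system, turn1, turn2]
print("system-first, reused tokens:", cached_tokens(first, second))

# System prompt after the user turns: the prefix now starts with user text
# that changes every turn, so the 20k-token prompt is re-prefilled each time.
first  = [turn1, system]
second = [turn1, turn2, system]
print("system-last, reused tokens:", cached_tokens(first, second))
```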