It is a concern.
Check out https://tiktokenizer.vercel.app/?model=deepseek-ai%2FDeepSeek-R1 and try entering some freeform hexadecimal data - you’ll notice that it does not cleanly segment the hexadecimal numbers into individual tokens.
I’m well aware, but you don’t necessarily need to see each character to translate to bytes.
It’s not out of the question that we get emergent behaviour where the model can connect non-optimally mapped tokens and still translate them correctly, yeah.
I’m confused. Is the concern that the model doesn’t properly recognize when it is using software to identify something like a hex pattern?
The concern is that the model doesn’t actually see the world in terms of distinct hexadecimal digits, but as tokens of variable size. You can see this using the Tiktokenizer web app: enter some text and it will split it into the series of tokens the model will actually process.
It’s not impossible for the model to work it out anyway, but it is a reason for this type of task to be a bit harder for LLMs.
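To make the mismatch concrete, here is a rough sketch in Python. The vocabulary is made up for illustration and is not the real DeepSeek-R1 (or any other model's) vocabulary, and greedy longest-match is a simplification of how BPE tokenizers actually merge. It only shows how token boundaries can fall in the middle of a byte.

```python
# Hypothetical toy vocabulary, invented for this example.
# It is NOT taken from any real tokenizer.
TOY_VOCAB = {"dead", "be", "ef", "4d5", "a9"} | set("0123456789abcdef")


def greedy_tokenize(text, vocab):
    """Split text by always taking the longest vocabulary entry that matches.

    Real BPE tokenizers apply learned merge rules instead; this is only a
    simplified stand-in that produces variable-size tokens.
    """
    max_len = max(len(entry) for entry in vocab)
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            raise ValueError(f"no vocabulary entry matches at position {i}")
    return tokens


def byte_pairs(hex_text):
    """Split a hex string into the two-digit groups a human would read as bytes."""
    return [hex_text[i:i + 2] for i in range(0, len(hex_text), 2)]


hex_text = "deadbeef4d5a90"
tokens = greedy_tokenize(hex_text, TOY_VOCAB)

print("bytes: ", byte_pairs(hex_text))
print("tokens:", tokens)
# bytes:  ['de', 'ad', 'be', 'ef', '4d', '5a', '90']
# tokens: ['dead', 'be', 'ef', '4d5', 'a9', '0']
```

Here the byte `5a` ends up split across the tokens `4d5` and `a9`, so the model would have to recover the byte boundaries itself instead of getting them for free.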