Yesterday I had a brilliant idea: why not parse the wiki of my favorite tabletop roleplaying game into YAML via an LLM? I had tried the same with BeautifulSoup a couple of years ago, but the pages are very inconsistent, which makes them quite difficult to parse using traditional methods.
- https://dsa.ulisses-regelwiki.de/Kul_Auelfen.html
- https://dsa.ulisses-regelwiki.de/erw_zauber_sf.html?erw_zaubersf=Alchimieanalytiker
- https://dsa.ulisses-regelwiki.de/KSF_Alter_Adersin.html
However, my attempts to parse it with a local Mistral model (the one you get with ollama pull mistral) were not very successful. At first it insisted on writing more than just the YAML code, and later it had trouble with more complex pages like https://dsa.ulisses-regelwiki.de/zauber.html?zauber=Abvenenum. So I thought I had to give it some examples in the system prompt. One example helped a little, but when I included more, it sometimes just returned one of the examples I had given it in the system prompt.
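Both symptoms described above can at least be detected after the fact. The following is a minimal sketch, not a fix for the model itself: `extract_yaml_block` and `is_copied_example` are hypothetical helper names, and the assumption is that the extra prose wraps a fenced YAML block and that a copied few-shot example comes back nearly verbatim.

```python
import re

# Built at runtime so this snippet can itself sit inside a fenced block.
FENCE = "`" * 3

# Matches a fenced block, optionally labelled yaml/yml, and captures its body.
_FENCED = re.compile(
    FENCE + r"(?:yaml|yml)?[ \t]*\r?\n(.*?)" + FENCE,
    re.DOTALL | re.IGNORECASE,
)


def extract_yaml_block(reply):
    """Return the body of the first fenced block, or the whole reply if none exists."""
    match = _FENCED.search(reply)
    if match:
        return match.group(1).strip()
    return reply.strip()


def _normalize(text):
    """Collapse all whitespace so formatting differences do not hide a copy."""
    return " ".join(text.split())


def is_copied_example(output, examples):
    """True if the output is identical (ignoring whitespace) to a few-shot example."""
    normalized = _normalize(output)
    return any(normalized == _normalize(example) for example in examples)
```

A reply flagged by `is_copied_example` could simply be retried or set aside for manual checking.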
To give some idea: the bold text should become the keys in the YAML structure, and the text that follows each one is its value. Sometimes values need to be parsed a bit further, like separating page numbers from book names. I would give examples for all of that.
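To make that target structure concrete, here is a rough sketch of the bold-label-to-key mapping using only the standard library. It is not expected to cope with the inconsistent pages; it could serve for generating few-shot examples or for cross-checking model output. The HTML snippet in the usage note and the "book name, then page, then number" layout are assumptions for illustration, not taken from the wiki, and `parse_bold_pairs` and `split_source` are hypothetical names.

```python
import re
from html.parser import HTMLParser


class BoldKeyParser(HTMLParser):
    """Treat text inside <b>/<strong> as a key and the text after it as its value."""

    def __init__(self):
        super().__init__()
        self.in_bold = False
        self.key_parts = []
        self.current_key = None
        self.pairs = {}

    def handle_starttag(self, tag, attrs):
        if tag in ("b", "strong"):
            self.in_bold = True
            self.key_parts = []

    def handle_endtag(self, tag):
        if tag in ("b", "strong") and self.in_bold:
            self.in_bold = False
            key = "".join(self.key_parts).strip().rstrip(":").strip()
            if key:
                self.current_key = key
                self.pairs.setdefault(key, "")

    def handle_data(self, data):
        if self.in_bold:
            self.key_parts.append(data)
        elif self.current_key is not None:
            self.pairs[self.current_key] += data


def parse_bold_pairs(html):
    """Return {bold label: following text}, with whitespace and a leading colon removed."""
    parser = BoldKeyParser()
    parser.feed(html)
    parser.close()
    return {k: " ".join(v.split()).lstrip(": ") for k, v in parser.pairs.items()}


# Assumed layout: "<book name> page <number or range>".
_SOURCE = re.compile(
    r"^(?P<book>.+?)\s+(?:pages?|p\.)\s*(?P<pages>\d+(?:\s*-\s*\d+)?)\s*$",
    re.IGNORECASE,
)


def split_source(value):
    """Split a value into book name and pages; pages is None if the layout differs."""
    match = _SOURCE.match(value)
    if match:
        return {"book": match.group("book"), "pages": match.group("pages")}
    return {"book": value, "pages": None}
```

For a made-up snippet such as `<p><strong>Source:</strong> Core Rules page 12</p>`, `parse_bold_pairs` yields a `Source` key whose value `split_source` separates into a book name and a page number.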
Any ideas on which model to use for that, or how to improve the results?


I honestly have not been able to ensure that. It partially works to just note in the JSONC that a particular key is optional, but that is no guarantee. More generally, I try to avoid adding optional keys and mostly leave it up to the LLM to put any such line in a catch-all miscellaneous section. We do manual checking afterwards, so some inaccuracy is accepted.
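As a rough illustration of the catch-all idea, the same policy can also be enforced after the model has answered, which may make the manual check quicker. This is only a sketch under assumptions: the output has already been parsed into a dict, and `fold_unexpected_keys` and the key name `miscellaneous` are hypothetical choices.

```python
def fold_unexpected_keys(record, required_keys, catch_all="miscellaneous"):
    """Keep required keys, move everything else into the catch-all section.

    Returns (cleaned_record, missing_required_keys).
    """
    required = list(required_keys)
    cleaned = {}
    misc = {}

    # Preserve whatever the model already placed in the catch-all section.
    existing = record.get(catch_all)
    if isinstance(existing, dict):
        misc.update(existing)
    elif existing is not None:
        misc["unparsed"] = existing

    for key, value in record.items():
        if key == catch_all:
            continue
        if key in required:
            cleaned[key] = value
        else:
            misc[key] = value

    # Required keys the model left out are reported for manual review.
    missing = [key for key in required if key not in cleaned]
    if misc:
        cleaned[catch_all] = misc
    return cleaned, missing
```

Records with a non-empty list of missing keys are the ones worth looking at first during manual checking.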
On a separate note, larger, more capable models usually perform better.
Ah, I see. While I plan for manual checking, it has to be >>90% right to be a viable solution. Anyway, when I find the time to try it out, I will come back with my results. If you have any additional ideas, feel free to share them!