Yesterday I had a brilliant idea: why not parse the wiki of my favorite table top roleplaying game into yaml via an llm? I had tried the same with beautfifulsoup a couple of years ago, but the page is very inconsistent which makes it quite difficult to parse using traditional methods.

However, my attempts where not very successful to parse with a local mistral model (the one you get with ollama pull mistral) as it first insisted on writing more than just the yaml code and later had troubles with more complex pages like https://dsa.ulisses-regelwiki.de/zauber.html?zauber=Abvenenum So I thought I had to give it some examples in the system prompts, but while one example helped a little, when I included more, it sometimes started to just return an example from the ones I gave to it via system prompt.

To give some idea: the bold stuff should be keys in the yaml structure, the part that follows the value. Sometimes values need to be parsed a bit more like separating pages from book names - I would give examples for all that.

Any idea what model to use for that or how to improve results?

  • VoxAliorum@lemmy.mlOP
    link
    fedilink
    English
    arrow-up
    1
    ·
    edit-2
    7 months ago

    I tried feeding the html which didn’t work at all and then just the raw text of the tag with id main (so the text in the white area, but no html tags). It didn’t feel like the task was too difficult in the sense that it never produced good results but that it was too often deviating from the task talking about stuff or not sticking to the pattern once more than one pattern was introduced.

    Could you elaborate how userjs might help? Haven’t heard of it before but a quick google search didn’t make it immediately obvious. As I hinted before I tried using a python script with beautifulsoup parsing but due to the page being inconsistent, my results where debatable.

    • HelloRoot@lemy.lol
      link
      fedilink
      English
      arrow-up
      1
      ·
      edit-2
      7 months ago

      it was too often deviating from the task talking about stuff or not sticking to the pattern

      Yeah that sounds like it can’t keep a large enough context. Maybe try a beefier model.

      I just suggested userjs because it runs directly in the browser and can use js dom parsers. Also userjs could inject a button that downloads the yaml. Idk if thats desired.

      The page doesn’t seem too complex, as you said - you just have to find the tag with the bold text and then the following paragraphs. A simple loop based parser logic will do.