• who@feddit.org
    10 days ago

    Unfortunately, robots.txt cannot express rate limits, so it would be an overly blunt instrument for the situation GP describes. HTTP 429 (Too Many Requests) would be a better fit.
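
    On the server side, a 429 is usually paired with a Retry-After header telling the client how long to back off; per RFC 9110 its value is either a number of seconds or an HTTP-date. A minimal sketch of how a polite crawler could parse it (`retry_after_seconds` is an illustrative name, not an existing API):

    ```python
    import datetime
    from email.utils import parsedate_to_datetime

    def retry_after_seconds(value: str) -> int:
        """Delay requested by a Retry-After header: either an
        integer number of seconds or an HTTP-date (RFC 9110)."""
        try:
            return int(value)
        except ValueError:
            # HTTP-date form; clamp past dates to "retry now"
            target = parsedate_to_datetime(value)
            now = datetime.datetime.now(datetime.timezone.utc)
            return max(0, int((target - now).total_seconds()))

    print(retry_after_seconds("120"))  # 120
    ```

    A crawler would sleep for that many seconds before retrying the request.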

    • redjard@lemmy.dbzer0.com
      9 days ago

      Crawl-delay is just that: a simple directive to add to robots.txt that sets a minimum delay between successive requests, capping the crawl rate. It used to be widely followed by all but the worst crawlers …
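
      For reference, a fragment like this asks crawlers to leave at least ten seconds between requests (the directive is nonstandard, so the exact interpretation of the value varies by crawler):

      ```
      User-agent: *
      Crawl-delay: 10
      ```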

      • who@feddit.org
        9 days ago

        Crawl-delay

        It’s a nonstandard extension without consistent semantics or wide support, but I suppose it’s good to know about anyway. Thanks for mentioning it.

    • S7rauss@discuss.tchncs.de
      10 days ago

      I was responding to their question of whether scraping the site is considered harmful. I would say that as long as they are not ignoring robots.txt and are really only pulling data once a day, they shouldn’t be contributing a significant amount of traffic.