• daq@lemmy.sdf.org
    9 hours ago

    I’m not sure how they actually implemented it, but you can easily block ML crawlers via Cloudflare. Isn’t just about every small site or service behind CF anyway?

    • grysbok@lemmy.sdf.org
      8 hours ago

      Last I checked, Cloudflare requires the user to have JavaScript and cookies enabled. My institution doesn’t want to require those because it would likely impact legitimate users as well as bots.

      • daq@lemmy.sdf.org
        7 hours ago

        Huh? I can reach my site via curl, which has neither. How did you come up with this random set of requirements?

        • grysbok@lemmy.sdf.org
          6 hours ago

          Odd. I just tried

          curl https://www.scrapingcourse.com/cloudflare-challenge

          and got

          Enable JavaScript and cookies to continue

          I’m clearly not on the same setup as you are, but my off-the-cuff guess is that your curl command was issued from a system that Cloudflare already recognized (IP whitelist, cookies, I dunno).
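
For anyone who wants to check this from a script rather than by eye, here is a rough sketch of how one might tell a challenge page apart from a normal response. The function name is made up for illustration, and the checks are assumptions rather than anything confirmed in this thread: it treats a `cf-mitigated: challenge` response header as a challenge marker, and otherwise falls back to a 403/503 status combined with the "Enable JavaScript and cookies to continue" text quoted above. It is a heuristic, not Cloudflare's own definition.

```python
def looks_like_cloudflare_challenge(status, headers, body):
    """Heuristic guess at whether an HTTP response is a Cloudflare challenge page.

    status:  integer HTTP status code
    headers: dict of response header names to values
    body:    response body as text
    """
    # Header names are case-insensitive, so normalize before looking anything up.
    normalized = {name.lower(): value.lower() for name, value in headers.items()}

    # Assumption: challenge responses carry this header.
    if normalized.get("cf-mitigated") == "challenge":
        return True

    # Fallback: a blocking status plus the message seen in the curl output above.
    blocked = status in (403, 503)
    marker = "enable javascript and cookies to continue"
    return blocked and marker in body.lower()


# Example with hand-written values (no network access needed).
sample_body = "<html>Enable JavaScript and cookies to continue</html>"
print(looks_like_cloudflare_challenge(403, {"Server": "cloudflare"}, sample_body))
print(looks_like_cloudflare_challenge(200, {}, "<html>hello</html>"))
```

Feeding it the status, headers, and body from any HTTP client should show whether a given client is being challenged or passed straight through, which is the difference the two curl results above seem to come down to.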

          Anyways, I’m reading through this blog post on using cURL with Cloudflare-protected sites and I’m finding it interesting.

          • daq@lemmy.sdf.org
            4 hours ago

            Of course their challenge requires those things. How else could they implement it? Most users will never be presented with a challenge, though, and it is trivial to disable if you never want to challenge anyone. I was just saying CF blocks ML crawlers.