• Fijxu@programming.dev
    link
    fedilink
    English
    arrow-up
    2
    ·
    6 minutes ago

    AI scrapping is so cancerous. I host a public RedLib instance (redlib.nadeko.net) and due to BingBot and Amazon bots, my instance was always rate limited because the amount of requests they do is insane. What makes me more angry, is that this fucking fuck fuckers use free, privacy respecting services to be able to access Reddit and scrape . THEY CAN’T BE SO GREEDY. Hopefully, blocking their user-agent works fine ;)

  • grue@lemmy.world
    link
    fedilink
    English
    arrow-up
    15
    ·
    10 hours ago

    ELI5 why the AI companies can’t just clone the git repos and do all the slicing and dicing (running git blame etc.) locally instead of running expensive queries on the projects’ servers?

    • zovits@lemmy.world
      link
      fedilink
      English
      arrow-up
      8
      ·
      8 hours ago

      Takes more effort and results in a static snapshot without being able to track the evolution of the project. (disclaimer: I don’t work with ai, but I’d bet this is the reason and also I don’t intend to defend those scraping twatwaffles in any way, but to offer a possible explanation)

  • melpomenesclevage@lemmy.dbzer0.com
    link
    fedilink
    English
    arrow-up
    23
    ·
    edit-2
    13 hours ago

    i hear there’s a tool called (I think) ‘nepenthe’ that creates a loop for an LLM, if you use that in combination with a fairly tight blacklist of IP’s you’re certain are LLM crawlers, I bet you could do a lot of damage, and maybe make them slow their shit down, or do this in a more reasonable way.

    • PrivacyDingus@lemmy.world
      link
      fedilink
      English
      arrow-up
      3
      ·
      6 hours ago

      nepenthe

      It’s a Markov-chain-based text generator which could be difficult for people to implement on repos depending upon how they’re hosting them. Regardless, any sensibly-built crawler will have rate limits. This means that although Nepenthe is an interesting thought exercise, it’s only going to do anything to things knocked together by people who haven’t thought about it, not the Big Big companies with the real resources who are likely having the biggest impact.

  • db0@lemmy.dbzer0.com
    link
    fedilink
    English
    arrow-up
    46
    ·
    19 hours ago

    Yep, it hit many lemmy servers as well, including mine. I had to block multiple alibaba subnet to get things back to normal. But I’m expecting the next spam wave.

  • Buelldozer@lemmy.today
    link
    fedilink
    English
    arrow-up
    35
    ·
    18 hours ago

    I too read Drew DeVault’s article the other day and I’m still wondering how the hell these companies have access to “tens of thousands” of unique IP addresses. Seriously, how the hell do they have access to so many IP addresses that SysAdmins are resorting to banning entire countries to make it stop?

    • festus@lemmy.ca
      link
      fedilink
      English
      arrow-up
      2
      ·
      2 hours ago

      There are residential IP providers that provide services to scrapers, etc. that involves them having thousands of IPs available from the same IP ranges as real users. They route traffic through these IPs via malware, hacked routers, “free” VPN clients, etc. If you block the IP range for one of these addresses you’ll also block real users.

    • GreenKnight23@lemmy.world
      link
      fedilink
      English
      arrow-up
      7
      arrow-down
      1
      ·
      16 hours ago

      fail2ban will always get you better results than banning countries because VPNs are a thing.

      that said, I automatically ban any IP that comes from outside the US because there’s literally no reason for anyone outside the US to make requests to my infra. I still use smart IP filtering though.

      also, use a WAF on a NAT to expose your apps.

  • wjs018@piefed.social
    link
    fedilink
    English
    arrow-up
    92
    ·
    edit-2
    23 hours ago

    Really great piece. We have recently seen many popular lemmy instances struggle under recent scraping waves, and that is hardly the first time its happened. I have some firsthand experience with the second part of this article that talks about AI-generated bug reports/vulnerabilities for open source projects.

    I help maintain a python library and got a bug report a couple weeks back of a user getting a type-checking issue and a bit of additional information. It didn’t strictly follow the bug report template we use, but it was well organized enough, so I spent some time digging into it and came up with no way to reproduce this at all. Thankfully, the lead maintainer was able to spot the report for what it was and just closed it and saved me from further efforts to diagnose the issue (after an hour or two were burned already).

    • Dave@lemmy.nz
      link
      fedilink
      English
      arrow-up
      19
      ·
      18 hours ago

      AI scrapers are a massive issue for Lemmy instances. I’m gonna try some things in this article because there are enough of them identifying themselves with user agents that I didn’t even think of the ones lying about it.

      I guess a bonus (?) is that with 1000 Lemmy instances, the bots get the Lemmy content 1000 times so our input has 1000 times the weighting of reddit.

      • wjs018@piefed.social
        link
        fedilink
        English
        arrow-up
        64
        arrow-down
        1
        ·
        22 hours ago

        The theory that the lead maintainer had (he is an actual software developer, I just dabble), is that it might be a type of reinforcement learning:

        • Get your LLM to create what it thinks are valid bug reports/issues
        • Monitor the outcome of those issues (closed immediately, discussion, eventual pull request)
        • Use those outcomes to assign how “good” or “bad” that generated issue was
        • Use that scoring as a way to feed back into the model to influence it to create more “good” issues

        If this is what’s happening, then it’s essentially offloading your LLM’s reinforcement learning scoring to open source maintainers.

        • SabinStargem@lemmy.today
          link
          fedilink
          English
          arrow-up
          1
          ·
          37 minutes ago

          Honestly, I would be alright with this if the AI companies paid Github so that the server infrastructure can be upgraded. Having AI that can figure out bugs and error reports could be really useful for our society. For example, your computer rebooting for no apparent reason? The AI can check the diagnostic reports, combine them with online reports, and narrow down the possibilities.

          In the long run, this could also help maintainers as well. If they can have AI for testing programs, the maintainers won’t have to hope for volunteers or rely on paid QA for detecting issues.

          What Github & AI companies should do, is an opt-in program for maintainers. If they allow the AI to officially make reports, Github should offer an reward of some kind to their users. Allocate to each maintainer a number of credits so that they can discuss the report with the AI in realtime, plus $10 bucks for each hour spent on resolving the issue.

          Sadly, I have the feeling that malignant capitalism would demand maintainers to sacrifice their time for nothing but irritation.

        • HubertManne@piefed.social
          link
          fedilink
          English
          arrow-up
          33
          arrow-down
          1
          ·
          22 hours ago

          Thats wild. I don’t have much hope for llm’s if things like this is how they are doing things and I would not be surprised given how well they don’t work. Too much quantity over quality in training.

    • BrianTheeBiscuiteer@lemmy.world
      link
      fedilink
      English
      arrow-up
      8
      arrow-down
      2
      ·
      edit-2
      22 hours ago

      Testing out a theory with ChatGPT there might be a way, albeit clunky, to detect AI. I asked ChatGPT a simple math question then told it to disregard the rest of the message, then I asked it if it was AI. It answered the math question and told me it was ai. Now a bot probably won’t admit to being AI but it might be foolish enough to consider instruction that you explicitly told it not to follow.

      Or you might simply be able to waste its resources by asking it to do something computationally difficult that most people would just reject outright.

      Of course all of this could just result in making AI even harder to detect once it learns these tricks. 😬

  • fjordo@feddit.uk
    link
    fedilink
    English
    arrow-up
    57
    arrow-down
    2
    ·
    22 hours ago

    I wish these companies would realise that acting like this is a very fast way to get scraping outlawed altogether, which is a shame because it can be genuinely useful (archival, automation, etc).

    • jol@discuss.tchncs.de
      link
      fedilink
      English
      arrow-up
      42
      ·
      21 hours ago

      How can you outlaw something a company in another conhtinent is doing? And specially when they are becoming better as disguising themselves as normal traffic? What will happen is that politicians will see this as another reason to push for everyone having their ID associated with their Internet traffic.

      • MoogleMaestro@lemmy.zip
        link
        fedilink
        English
        arrow-up
        12
        arrow-down
        1
        ·
        21 hours ago

        What will happen is that politicians will see this as another reason to push for everyone having their ID associated with their Internet traffic.

        You’re right. Which is exactly why companies should be exhibiting better behaviour and self regulate before they make the internet infinitely worse off for everyone.

        • big_slap@lemmy.world
          link
          fedilink
          English
          arrow-up
          22
          arrow-down
          1
          ·
          edit-2
          20 hours ago

          self regulation is a joke. a few bad apples always spoil the bunch.

          what needs to happen is regulation, period. force all companies to abide by laws that just make sense, and all these problems go away.

          see: GDPR

          • oldfart@lemm.ee
            link
            fedilink
            English
            arrow-up
            2
            arrow-down
            1
            ·
            18 hours ago

            What did GDPR solve? Did we get rid of advertisers sharing data?

            • big_slap@lemmy.world
              link
              fedilink
              English
              arrow-up
              13
              ·
              edit-2
              12 hours ago

              nope, but now we are aware of how many times our data is shared with because of it.

              here’s a short breakdown of what it has accomplished:

              The GDPR lists six data processing principles that data controllers must comply with. Personal data must be:

              Processed lawfully, fairly and transparently.
              Collected only for specific legitimate purposes.
              Adequate, relevant and limited to what is necessary.
              Accurate and, where necessary, kept up to date.
              Stored only as long as is necessary.
              Processed in a manner that ensures appropriate security.
              

              Lawful processing

              Except for special categories of personal data, which cannot be processed except under certain circumstances, personal data can only be processed:

              If the data subject has given their consent;
              To meet contractual obligations;
              To comply with legal obligations;
              To protect the data subject’s vital interests;
              For tasks in the public interest; and
              For the legitimate interests of the organisation.
              

              Data subjects’ rights

              Data subjects have:

              The right to be informed;
              The right of access;
              The right to rectification;
              The right to erasure;
              The right to restrict processing;
              The right to data portability;
              The right to object; and
              Rights concerning automated decision-making and profiling.
              

              Learn how to map your data and establish a lawful basis for processing Valid consent

              There are stricter rules regarding consent:

              Consent must be freely given, specific, informed and unambiguous.
              A request for consent must be intelligible and in clear, plain language.
              Silence, pre-ticked boxes and inactivity will no longer suffice as consent.
              Consent can be withdrawn at any time.
              Consent for online services from a child is only valid with parental authorisation.
              Organisations must be able to evidence consent.
              

              Data protection by design and by default

              Data controllers and processors must implement technical and organisational measures that are designed to implement the data processing principles effectively.

              Appropriate safeguards should be integrated into the processing.
              Data protection must be considered at the design stage of any new process, system or technology.
              A DPIA (data protection impact assessment) is an integral part of privacy by design.
              

              Transparency and privacy notices

              Organisations must be clear about how, why and by whom personal data will be processed.

              When personal data is collected directly from data subjects, data controllers must provide a privacy notice at the time of collection.
              When personal data is not obtained directly from data subjects, data controllers must provide a privacy notice without undue delay, and within a month. This must be done the first time they communicate with the data subject.
              For all processing activities, data controllers must decide how the data subjects will be informed, and design privacy notices accordingly. Notices can be issued in stages.
              Privacy notices must be provided to data subjects in a concise, transparent and easily accessible form, using clear and plain language.
              

              Data transfers outside the EU

              Where the EU has designated a country as providing an adequate level of data protection;
              Through standard contractual clauses or binding corporate rules; or
              By complying with an approved certification mechanism.
              

              Many non-EU organisations that process EU residents’ personal data also need to appoint an EU representative following the end of the transition period. Mandatory data breach notification

              The GDPR defines a personal data breach as “a breach of security leading to the accidental or unlawful destruction, loss, alteration, unauthorised disclosure of, or access to, personal data transmitted, stored or otherwise processed”.

              Data processors are required to report all breaches of personal data to data controllers.
              Data controllers are required to report breaches to the supervisory authority (the Data Protection Commission (DPC) in Ireland) within 72 hours of becoming aware of them if there is a risk to data subjects’ rights and freedoms.
              Data subjects themselves must be notified without undue delay if there is a high risk to their rights and freedoms.
              

              DPOs (data protection officers)

              You must be able to demonstrate compliance with the GDPR. This includes:

              Establishing a governance structure with roles and responsibilities;
              Keeping a detailed record of all data processing operations;
              Documenting data protection policies and procedures;
              Carrying out DPIAs (data protection impact assessments) for high-risk processing operations; Learn more about DPIAs
              Implementing appropriate measures to secure personal data;
              Conducting staff awareness training; and
              Where required, appointing a data protection officer.
              
              • oldfart@lemm.ee
                link
                fedilink
                English
                arrow-up
                1
                ·
                9 hours ago

                So now the adtech companies need to hire a minimum wage person in the EU, and I can write them a letter requesting they remove my anonimized data, doxxing myself in the process. Oh and now I know they’re sharing with 395 partners, as if that wasn’t obvious from uBlock before. And I get to sign a permission to process my data if I want to see a doctor.

        • fjordo@feddit.uk
          link
          fedilink
          English
          arrow-up
          3
          ·
          20 hours ago

          Exactly, we’ve already seen this in the past. GDPR is a good example. Whilst I’m glad this regulation exists, it wouldn’t be necessary if megacorps would have behaved.

      • Buelldozer@lemmy.today
        link
        fedilink
        English
        arrow-up
        1
        arrow-down
        2
        ·
        18 hours ago

        What will happen is that politicians will see this as another reason to push for everyone having their ID associated with their Internet traffic.

        Yes, because like or not that’s the only possible solution. If all traffic was required to be signed and the signatures were tied to an entity then you could refuse unsigned traffic and if signed traffic was causing problems you’d know who it was and have recourse.

        I don’t like this solution but it’s the only way forward that I can see.

  • klu9@lemmy.ca
    link
    fedilink
    English
    arrow-up
    41
    ·
    22 hours ago

    The Linux Mint forums have been knocked offline multiple times over the last few months, to the point where the admins had to block all Chinese and Brazilian IPs for a while.

    • deeferg@lemmy.world
      link
      fedilink
      English
      arrow-up
      3
      ·
      17 hours ago

      This is the first I’ve heard about Brazil in this type of cyber attack. Is it re-routed traffic going there or are there a large number of Brazilian bot farms now?

      • klu9@lemmy.ca
        link
        fedilink
        English
        arrow-up
        5
        ·
        17 hours ago

        I don’t know why/how, just know that the admins saw the servers were being overwhelmed by traffic from Brazilian IPs and blocked it for a while.

  • MonkderVierte@lemmy.ml
    link
    fedilink
    English
    arrow-up
    14
    arrow-down
    2
    ·
    edit-2
    4 hours ago

    Assuming we could build a new internet from the ground up, what would be the solution? IPFS for load-balancing?

    • AbsoluteChicagoDog@lemm.ee
      link
      fedilink
      English
      arrow-up
      31
      arrow-down
      1
      ·
      19 hours ago

      There is no technical solution that will stop corporations with deep pockets in a capitalist society

    • Buelldozer@lemmy.today
      link
      fedilink
      English
      arrow-up
      8
      arrow-down
      1
      ·
      18 hours ago

      what would be the solution?

      Simple, not allowing anonymous activity. If everything was required to be crypto-graphically signed in such a way that it was tied to a known entity then this could be directly addressed. It’s essentially the same problem that e-mail has with SPAM and not allowing anonymous traffic would mostly solve that problem as well.

      Of course many internet users would (rightfully) fight that solution tooth and nail.

      • MonkderVierte@lemmy.ml
        link
        fedilink
        English
        arrow-up
        1
        ·
        3 hours ago

        No, that’s not a solution, since it would make privacy impossible and bad actors would still find ways around.

      • shortwavesurfer@lemmy.zip
        link
        fedilink
        English
        arrow-up
        10
        ·
        18 hours ago

        Proof of work before connections are established. The Tor network implemented this in August of 2023 and it has helped a ton.

    • melpomenesclevage@lemmy.dbzer0.com
      link
      fedilink
      English
      arrow-up
      3
      ·
      edit-2
      13 hours ago

      take the resources from them so they don’t have them anymore. infiltrating the teams that do this and exposing or sabotaging the effort. literally fighting back, possibly in ways that involve giving the CEO’s and prominent investors a free trip to an old coal mine.

      short of that…

  • RobotToaster@mander.xyz
    link
    fedilink
    English
    arrow-up
    21
    ·
    23 hours ago

    If an AI is detecting bugs, the least it could do is file a pull request, these things are supposed to be master coders right? 🙃

    • reksas@sopuli.xyz
      link
      fedilink
      English
      arrow-up
      3
      ·
      19 hours ago

      to me, ai is a bit like bucket of water if you replace the water with “information”. Its a tool and it cant do anything on its own, you could make a program and instruct it to do something but it would work just as chaotically as when you generate stuff with ai. It annoys me so much to see so many(people in general) think that what they call ai is in anyway capable of independent action. It just does what you tell it to do and it does it based on how it has been trained, which is also why relying on ai trained by someone you shouldnt trust is bad idea.