Lemmy newb here, not sure if this is right for this /c.

This is an article I found by someone who hosts their own website and micro-social network. It covers their experience with web-scraping bots that refuse to respect robots.txt, and how they deal with them.

  • drkt@lemmy.dbzer0.com
    3 months ago

    I have plenty of spare bandwidth and babysitting resources, so my approach is largely to waste their time. If they poke my honeypot, they get poked back and have to escape a tarpit specifically designed to waste their bandwidth above all. It costs me nothing because of my circumstances, but I know it costs them because their connections are metered. I also know it works because they largely stop crawling the domains I employ this on. I am essentially making my domains appear hostile.
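For anyone wondering what a bandwidth-wasting tarpit might look like, here is a minimal sketch in Python. It is not the commenter's actual setup, which isn't described in the thread. The honeypot path, chunk size, and port are all made-up assumptions: the idea is that a path disallowed in robots.txt is only ever requested by bots that ignore it, and anything requesting it gets an HTTP response body that never ends.

```python
import asyncio
import random
import string

# Hypothetical values for illustration only.
HONEYPOT_PATH = "/secret-admin/"  # assumed path, listed as Disallow in robots.txt
CHUNK_SIZE = 64 * 1024            # assumed number of junk bytes per write


def is_honeypot_hit(request_line: str) -> bool:
    """Return True if an HTTP request line targets the honeypot path."""
    parts = request_line.split()
    return len(parts) >= 2 and parts[1].startswith(HONEYPOT_PATH)


def junk_chunk(size: int, rng: random.Random) -> bytes:
    """Build `size` bytes of random lowercase text and spaces."""
    alphabet = string.ascii_lowercase + " "
    return "".join(rng.choices(alphabet, k=size)).encode("ascii")


async def handle(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
    """Serve an endless body to honeypot visitors and a 404 to everyone else."""
    try:
        request_line = (await reader.readline()).decode("latin-1")
        if not is_honeypot_hit(request_line):
            writer.write(
                b"HTTP/1.1 404 Not Found\r\n"
                b"Content-Length: 0\r\nConnection: close\r\n\r\n"
            )
            await writer.drain()
            return
        # No Content-Length: the body only ends when the client disconnects.
        writer.write(
            b"HTTP/1.1 200 OK\r\n"
            b"Content-Type: text/html\r\nConnection: close\r\n\r\n"
        )
        rng = random.Random()
        while True:
            writer.write(junk_chunk(CHUNK_SIZE, rng))
            # drain() waits until the client has accepted the data, so the
            # send rate follows whatever the bot is willing to download.
            await writer.drain()
    except ConnectionError:
        pass  # the client gave up and closed the connection
    finally:
        writer.close()


async def main(host: str = "127.0.0.1", port: int = 8080) -> None:
    """Run the tarpit until interrupted."""
    server = await asyncio.start_server(handle, host, port)
    async with server:
        await server.serve_forever()
```

To try it, call `asyncio.run(main())` and request the honeypot path with any HTTP client. Note that every byte sent is also a byte of your own upload, so this only makes sense if, like the commenter, you have bandwidth to spare.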

    It does mean that my residential IP ends up on various blocklists but I’m just at a point in my life where I don’t give an unwiped asshole about it. I can’t access your site? I’m not going to your site, then. Fuck you. I’m not even gonna email you about the false-positive.

    It is also fun to keep a log of which IPs that have poked the honeypot also have open ports, and to automate a process of siphoning information out of those ports. I’ve been finding a lot of hacked NVRs recently that I think are part of some IoT botnet used to scrape the internet.

    • melroy@kbin.melroy.org
      3 months ago

      I found a very large botnet, mainly in Brazil but also in several other countries, and abuseipdb.com is not marking those IPs as a threat. We need a better solution.

      I think a honeypot is a good approach. Another is to require proof of work on the client side. Or we need a better place to share the IPs of all these stupid web-scraping bots.
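To make the proof-of-work idea concrete, here is a rough hashcash-style sketch in Python. It is one possible scheme, not a description of any particular tool: the server hands out a random challenge, the client has to find a nonce whose SHA-256 hash has a given number of leading zero bits, and the server checks the answer with a single hash. The function names and the difficulty value are assumptions made up for this example, and in a real deployment the solving step would run as JavaScript in the visitor's browser rather than in Python.

```python
import hashlib
import secrets

DIFFICULTY_BITS = 16  # assumed difficulty; each extra bit doubles the client's work


def make_challenge() -> str:
    """Server side: issue a fresh random challenge for each visitor."""
    return secrets.token_hex(16)


def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the start of a hash digest."""
    value = int.from_bytes(digest, "big")
    return len(digest) * 8 - value.bit_length()


def verify(challenge: str, nonce: int, bits: int = DIFFICULTY_BITS) -> bool:
    """Server side: checking a claimed solution costs one hash."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= bits


def solve(challenge: str, bits: int = DIFFICULTY_BITS) -> int:
    """Client side: brute-force nonces until one passes verification."""
    nonce = 0
    while not verify(challenge, nonce, bits):
        nonce += 1
    return nonce
```

The asymmetry is the point: a single visitor pays a fraction of a second once, while a scraper hitting thousands of pages pays it on every challenge, and the server only ever pays one hash per check.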