Web Access Optimization

I host web services (for fun) that receive essentially zero attention from the public (which is fine). Consequently, most of the traffic to my host comes from bots. I have limited upstream bandwidth, so I would like to understand and (where necessary) filter that traffic.

The focus here is not comprehensive security for a web host, but rather how to manage the traffic that is still allowed to reach your site after the host has been comprehensively secured.

Levels of Protection

  1. Network
  2. Host (OS/Distro)
  3. Host (fail2ban)
  4. Host Service (robots.txt)
  5. Host Service (modules)
  6. Middleware
  7. Application

Types of Bad Actors
SEO scrapers
This can go either way if you are running a business, but SEO (search engine optimization) scrapers generally scrape your site even if you are _not_ paying for their service. In the end they make money off of your content without compensating you for it.

Vulnerability Scanners
Once they are aware of your website, many clients will probe your implementation for a range of known vulnerabilities. This is obvious in your logs: a single client makes a burst of seemingly random requests, looking for specific files or trying odd inputs that could crash or compromise your host.
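
As a rough illustration of what that looks like in practice, here is a minimal Python sketch that counts probe-like requests per client IP. It assumes your server writes Common or Combined Log Format lines; the pattern list, the `count_probes` name and the sample log lines are all made up for the example and would need tuning against your own logs.

```python
import re
from collections import Counter

# Hypothetical probe patterns; tune these to what your own logs show.
PROBE_PATTERNS = re.compile(
    r"(\.env$|\.git/|/wp-login\.php|/phpmyadmin|\.\./|/etc/passwd)",
    re.IGNORECASE,
)

# Pulls the client IP, method, path and status out of a
# Common/Combined Log Format line.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) .*?"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3})'
)


def count_probes(lines):
    """Return a Counter of probe-like requests per client IP."""
    hits = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m and PROBE_PATTERNS.search(m.group("path")):
            hits[m.group("ip")] += 1
    return hits


# Invented sample lines using documentation-only IP ranges.
SAMPLE_LOG = [
    '203.0.113.7 - - [01/Jan/2025:00:00:01 +0000] "GET /.env HTTP/1.1" 404 196',
    '203.0.113.7 - - [01/Jan/2025:00:00:02 +0000] "GET /wp-login.php HTTP/1.1" 404 196',
    '203.0.113.7 - - [01/Jan/2025:00:00:03 +0000] "GET /phpMyAdmin/index.php HTTP/1.1" 404 196',
    '198.51.100.9 - - [01/Jan/2025:00:00:04 +0000] "GET /node/1 HTTP/1.1" 200 5120',
]

if __name__ == "__main__":
    for ip, n in count_probes(SAMPLE_LOG).most_common():
        print(ip, n)
```

A client that racks up several of these hits in a short time is almost certainly a scanner rather than a reader, so the per-IP count is a reasonable starting point for deciding whom to block.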

ATO (account takeover)
Similarly, many scanners will try to register new accounts or take over existing ones.

SPAM posters
Some clients exist solely to post spam to your website, usually in the form of comments on existing content.

DDoS
Finally, at its worst, a botnet can be pointed at your IP in an attempt to saturate your links and make your site inaccessible.

Types of Good Actors
Search Engine scrapers
Related to the above, legitimate search engines also need to scrape content to make it searchable. They do monetize your content, but they can also drive traffic to your website (which you can monetize).

AI scrapers
These fall into two categories: AI training scrapers and SEO scrapers. Both scrape your content, but for different reasons. In the end they make money off of your content without compensating you for it.

User Traffic
Finally, we get to the real user traffic to your site. This is the best kind of access.

Implementation

  • Network
  • Passive and active measures at the network level provide a base level of protection:

    • Eliminate access from invalid or illegal types of traffic (per protocol)
    • Minimize your attack surface (e.g. firewall)
    • IDS/IPS protection which bans known bad actors (e.g. specific IPs)
    • IDS/IPS protection which bans known bad traffic (e.g. specific patterns)
  • Host (OS/Distro)
  • Ensure your system is patched and secured properly against unauthorized access.

    Ensure your web stack is patched and secured properly against compromise.

    When bad traffic gets through you should not be vulnerable to known exploits.

  • Host (fail2ban)
  • fail2ban is a great resource for punishing rude scrapers (those that make too many requests): it pattern-matches log entries and enforces limits (i.e. temporarily blocks IPs).

    This service is also great for pattern-matching known exploits that you do not have patches for, or other types of predictable traffic that are inappropriate.

  • Host Service (robots.txt)
  • Some bots honor robots.txt. Most do not. For those that do, you can tune their access here to block or rate limit them.

  • Host Service (modules)
  • tbd tbd tbd Apache modules

  • Middleware
  • tbd tbd tbd PHP

  • Application
  • Drupal has modules that can help slow down overposting or block bad actors. The advantage here is that you can provide feedback to end users, whereas fail2ban just blocks and drops traffic outright.

    There are also provisions you would expect on any platform for preventing inappropriate traffic:

    • No anonymous posting
    • No anonymous commenting
    • Require CAPTCHA for all content
    • Require approval for all content
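
To make the "feedback to end users" point concrete, here is a small language-neutral sketch (written in Python, not Drupal or PHP) of an application-level throttle that rejects rapid reposting but tells the user how long to wait. The `PostThrottle` class, its 60-second default and the message text are all hypothetical; this is not how any particular Drupal module works.

```python
import math


class PostThrottle:
    """Require a minimum gap between posts from the same user."""

    def __init__(self, min_gap_seconds=60.0):
        self.min_gap = min_gap_seconds
        self.last_post = {}  # user id -> timestamp of last accepted post

    def check(self, user, now):
        """Return (allowed, message); the message is feedback for the user."""
        last = self.last_post.get(user)
        if last is not None and now - last < self.min_gap:
            wait = math.ceil(self.min_gap - (now - last))
            return False, f"Please wait {wait} seconds before posting again."
        # Only accepted posts reset the timer.
        self.last_post[user] = now
        return True, "Post accepted."


if __name__ == "__main__":
    throttle = PostThrottle(min_gap_seconds=60.0)
    print(throttle.check("alice", now=0.0))
    print(throttle.check("alice", now=10.0))
    print(throttle.check("alice", now=61.0))
```

Unlike a dropped connection, the rejected user gets a message and can simply try again later, which matters if a real person ever trips the limit.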

Summary
This web stack offers a number of ways to protect your system, limit unauthorized (or inappropriate) access and ease maintenance.

State of the Art
I employ a lot of network-level provisions to block traffic. They do not catch as much traffic as I would like, which I think is due in part to the fact that I use open source lists for blocking bad traffic.
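
For readers who have not worked with such lists, the matching itself is simple. The sketch below uses Python's standard `ipaddress` module to check a client against a list of IPs and CIDR ranges. The one-entry-per-line format with `#` comments is an assumption (many public lists look like this, but not all), and the function names and sample entries are invented for the example.

```python
import ipaddress


def load_blocklist(lines):
    """Parse one IP or CIDR range per line; skip blanks and '#' comments."""
    networks = []
    for line in lines:
        entry = line.split("#", 1)[0].strip()
        if entry:
            # strict=False accepts entries with host bits set, e.g. 203.0.113.5/24.
            networks.append(ipaddress.ip_network(entry, strict=False))
    return networks


def is_blocked(ip, networks):
    """Return True if ip falls inside any listed network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)


# Invented entries using documentation-only address ranges.
SAMPLE_LIST = [
    "# example blocklist",
    "203.0.113.0/24",
    "198.51.100.23  # single host",
    "",
]

if __name__ == "__main__":
    nets = load_blocklist(SAMPLE_LIST)
    for ip in ("203.0.113.77", "198.51.100.23", "198.51.100.24"):
        print(ip, is_blocked(ip, nets))
```

A list like this only catches addresses somebody has already reported, which is one reason a fresh scanner can slip straight past it.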

I use fail2ban and robots.txt a lot. This does slow down greedy actors and in some cases blocks bad actors or actors with buggy scrapers (looking at Meta here).
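
If you want to check what a given robots.txt actually tells a well-behaved bot, Python's standard `urllib.robotparser` can evaluate it. The rules and the `ExampleSEOBot` user agent below are invented for illustration and are not my real configuration. Note also that `Crawl-delay` is a non-standard directive that only some crawlers honor.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: shut out one scraper entirely, slow everyone else down.
ROBOTS_TXT = """\
User-agent: ExampleSEOBot
Disallow: /

User-agent: *
Crawl-delay: 10
Disallow: /user/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

if __name__ == "__main__":
    # The named bot is blocked everywhere.
    print(parser.can_fetch("ExampleSEOBot", "/node/1"))
    # Other bots may read content but not the /user/ paths.
    print(parser.can_fetch("SomeOtherBot", "/node/1"))
    print(parser.can_fetch("SomeOtherBot", "/user/login"))
    # Requested delay in seconds between requests for other bots.
    print(parser.crawl_delay("SomeOtherBot"))
```

This only shows what a compliant crawler would do. Anything that ignores robots.txt still has to be handled by fail2ban or the firewall.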

Finally, I use CAPTCHA, approvals and user accounts to ensure that posts and comments that make it to the site come from real people (in other words, me).

In the future I would like to take a closer look at Drupal modules, and I would also like to dive deeper into PHP and Apache and understand them better before I write about them.

For the most part I get a LOT of spam comments that I have to delete, and I do see quite a few brute-force hacking attempts that I would like to block at the IP level (or perhaps detect at the app level).
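
One way to think about brute-force detection is as a sliding window of failures per IP. The sketch below is loosely modeled on fail2ban's `maxretry` and `findtime` settings, but it is not fail2ban's code; the `FailureWindow` class and its defaults are hypothetical, and something similar could live at the app level.

```python
from collections import defaultdict, deque


class FailureWindow:
    """Flag an IP once it has max_retry failures within find_time seconds."""

    def __init__(self, max_retry=5, find_time=600):
        self.max_retry = max_retry
        self.find_time = find_time
        self.failures = defaultdict(deque)  # ip -> timestamps of recent failures

    def record_failure(self, ip, now):
        """Record one failed login at time `now`; return True if ip should be banned."""
        window = self.failures[ip]
        window.append(now)
        # Drop failures that have aged out of the window.
        while window and now - window[0] > self.find_time:
            window.popleft()
        return len(window) >= self.max_retry


if __name__ == "__main__":
    fw = FailureWindow(max_retry=3, find_time=60)
    for t in (0, 10, 20):
        print(t, fw.record_failure("203.0.113.7", now=t))
```

Slow attackers that stay under the threshold are not caught by this, so the window and retry count are a trade-off between catching patient bots and locking out someone who simply forgot a password.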
