Web Access Optimization

I host web services (for fun) that receive essentially zero attention from the public (which is fine). Consequently, most of the traffic to my host comes from bots. I have limited upstream bandwidth, so I would like to understand and (where necessary) filter that traffic.

The focus here is not comprehensive security for a web host, but rather how to manage the traffic that is still allowed to reach your site after the host has been comprehensively secured.

Levels of Protection

  1. Network
  2. Operating System
  3. Host (fail2ban)
  4. Host Service (robots.txt)
  5. Host Service (Apache modules)
  6. Middleware (PHP)
  7. Application (Drupal)

Types of Bad Actors
SEO scrapers
SEO (search engine optimization) scrapers collect your content for commercial SEO services. This can go either way if you are running a business, but generally they scrape your site even if you are _not_ paying for their service. In the end they make money off your content without compensating you for it.

Vulnerability Scanners
Once they become aware of your website, many clients will probe it for a range of known vulnerabilities. This is obvious in your logs: a single client makes a burst of seemingly random requests, looking for specific files or trying odd inputs that could crash or compromise your host.
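
As a rough illustration of spotting this pattern in the logs, here is a minimal sketch that groups 404 responses by client IP and flags clients that miss on several distinct URLs. It assumes Apache's common/combined log format; the function name, the threshold, and the sample lines are all made up for the example and would need tuning against real traffic.

```python
import re
from collections import defaultdict

# Matches the start of an Apache common/combined log line:
# client IP, identity, user, [timestamp], "METHOD path protocol", status
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3})'
)


def find_probable_scanners(lines, min_distinct_404s=3):
    """Return {ip: sorted distinct paths that returned 404} for clients
    at or above the threshold (a hypothetical default, not a standard)."""
    misses = defaultdict(set)
    for line in lines:
        m = LOG_LINE.match(line)
        if m and m.group("status") == "404":
            misses[m.group("ip")].add(m.group("path"))
    return {
        ip: sorted(paths)
        for ip, paths in misses.items()
        if len(paths) >= min_distinct_404s
    }


# Invented sample lines using documentation-reserved IP ranges.
SAMPLE = [
    '203.0.113.7 - - [10/Oct/2024:13:55:36 +0000] "GET /wp-login.php HTTP/1.1" 404 196 "-" "curl/8.0"',
    '203.0.113.7 - - [10/Oct/2024:13:55:37 +0000] "GET /.env HTTP/1.1" 404 196 "-" "curl/8.0"',
    '203.0.113.7 - - [10/Oct/2024:13:55:38 +0000] "GET /phpmyadmin/ HTTP/1.1" 404 196 "-" "curl/8.0"',
    '198.51.100.2 - - [10/Oct/2024:14:01:02 +0000] "GET /node/1 HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    '198.51.100.2 - - [10/Oct/2024:14:01:03 +0000] "GET /favicon.ico HTTP/1.1" 404 196 "-" "Mozilla/5.0"',
]

if __name__ == "__main__":
    for ip, paths in find_probable_scanners(SAMPLE).items():
        print(ip, paths)
```

On the sample data only 203.0.113.7 is flagged; the second client has a single 404 for a favicon, which is normal browser behavior. Counting distinct paths rather than raw 404s keeps a client that retries one broken link from being flagged.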

ATO (account takeover)
Similarly, many scanners will try to register new accounts or take over existing ones.

SPAM posters
Some clients exist solely to post SPAM to your website, usually as comments on existing content.

DDoS
Finally, at its worst, a botnet can be pointed at your IP in an attempt to saturate your links and make your site inaccessible.

Types of Good Actors
Search Engine scrapers
Related to the above, legitimate search engines also need to scrape content to make it searchable. They do monetize your content, but they can also drive traffic to your website (which you can monetize).

AI scrapers
These fall into two categories: AI training and SEO (search engine optimization) scrapers. They both scrape the content but for different reasons. In the end they make money off of your content without compensating you for it.

User Traffic
Finally, we get to the real user traffic to your site. This is the best kind of access.

Implementation

  • Network
  • Passive and active measures at the network level provide a base level of protection:

    • Eliminate access from invalid or illegal types of traffic (per protocol)
    • IDS/IPS protection which bans known bad sources of traffic
  • Operating System
  • Ensure your system is patched and secured properly against unauthorized access.

    Ensure your web stack is patched and secured properly against compromise.

  • Host (fail2ban)
  • tbd tbd tbd

  • Host Service (robots.txt)
  • Some bots honor robots.txt. Most do not. For those that do, you can use it to block them or rate limit their access.

  • Host Service (Apache modules)
  • tbd tbd tbd

  • Middleware (PHP)
  • tbd tbd tbd

  • Application (Drupal)
  • There are many measures available at the application level, most of them aimed at spam prevention:

    • No anonymous posting
    • Require approval for posts
    • Require CAPTCHA for posts
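
To make the robots.txt item above more concrete, here is a small sketch that checks how a well-behaved crawler would interpret a set of rules, using Python's standard `urllib.robotparser`. The bot name `ExampleSEOBot` and the paths are invented for the example. `Crawl-delay` is a non-standard directive that only some crawlers respect, so treat the rate-limiting part as best effort.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one named bot entirely, and ask everyone
# else to slow down and stay out of the account pages.
ROBOTS_TXT = """\
User-agent: ExampleSEOBot
Disallow: /

User-agent: *
Crawl-delay: 10
Disallow: /user/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

if __name__ == "__main__":
    # The named bot is blocked everywhere.
    print(parser.can_fetch("ExampleSEOBot", "/node/1"))
    # Other bots may read content but not the account pages.
    print(parser.can_fetch("SomeOtherBot", "/node/1"))
    print(parser.can_fetch("SomeOtherBot", "/user/login"))
    # Requested delay (in seconds) for bots covered by the wildcard entry.
    print(parser.crawl_delay("SomeOtherBot"))
```

This only shows what a compliant crawler should do. Bots that ignore robots.txt have to be handled at the other levels in the list.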

Summary
This web stack offers a number of ways to protect your system, limit unauthorized (or inappropriate) access and ease maintenance.
