Sinkholing - Face the Threat, Beat the Threat

Part II - Implementation

In this second part of our three-part series on Sinkholing, we dive into the details of implementation.

In part one of our series on Sinkholing, we got into general ideas about how Sinkholing works, why it is needed, and how it fits into our infrastructure. Now, we dig deeper into the details. First, let’s do a short recap of Sinkholing’s general architecture.

Architecture recap

Every request that lands on our platform first hits HAProxy on our frontend machines and is then sent to Varnish, which handles routing to our backend microservices. All logs from this chain are shipped to Elasticsearch. ElastAlert then watches the logs for predefined events and saves information about detected offenders into Redis. The circle closes with HAProxy fetching that information from Redis to update the internal filters it uses to block further requests from unwanted sources.

High-level Sinkholing overview
Source: Showmax

Find them all

ElastAlert

Let’s start with ElastAlert, as it ties the whole sinkholing process together - it’s the place where the blocking begins. As mentioned previously in this series, ElastAlert is a tool that periodically runs a specified set of queries against Elasticsearch and raises an alert when it finds a match. Rules for ElastAlert are configured via YAML files, one per rule.

Rules

Imagine that you would like to ban an IP based on a number of unsuccessful login attempts. What would such a rule look like? ElastAlert has a rich set of predefined rule types. For this example, a rule called Frequency can be used. It raises an alert when there is at least a certain number of events in a given time frame. Here is an example rule config for the scenario outlined above:

index: logstash-*
type: frequency
name: Blogpost Test Rule Banning Unsuccessful Logins
num_events: 20
timeframe:
  minutes: 1

filter:
  - term:
      normalized_url: "/login"
  - range:
      http_code:
        from: 400
        to: 499

query_key:
  - client_ip
  - normalized_url


# Alerts from this custom alerter are stored into Redis
alert: "elastalert_alerts.showmax_alerts.sinkholing_alerts.BananaAlerter"

# Options required by Alerter
redis_ban_time: 900            # 15 minutes
redis_passwd: <some pass>      # leave empty for no password
redis_port: 6380
redis_host: <redis host>
redis_timeout: 1               # [s] - socket timeout
Password guessing detection


This rule triggers an alert when at least 20 requests within one minute fail with a 4xx HTTP error code and come from the same IP to the same API endpoint (a URL normalized to /login in this case). Using normalized URLs is important here: it prevents attackers from bypassing the rule by appending random query parameters. The alert option lets us choose any alerting mechanism that suits our needs. There are already many predefined alert types, including email, various chat services, HTTP POST, etc., but we decided to write our own.
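To make the rule's logic concrete, here is a minimal Python sketch of a frequency check over a sliding time window. It is not how ElastAlert is implemented internally, and the normalize_url helper is an assumption: it simply drops the query string, whereas the real normalization in our logging pipeline is not shown in this post.

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit


def normalize_url(url):
    """Drop the query string and fragment so '/login?x=1' counts as '/login'."""
    return urlsplit(url).path or "/"


class FrequencyDetector:
    """Flags a key once it has num_events hits inside a sliding time window."""

    def __init__(self, num_events=20, timeframe_s=60, watched_url="/login"):
        self.num_events = num_events
        self.timeframe_s = timeframe_s
        self.watched_url = watched_url
        self.hits = defaultdict(deque)  # (client_ip, url) -> event timestamps

    def observe(self, client_ip, url, http_code, ts):
        """Record one request; return True when the threshold is reached."""
        path = normalize_url(url)
        # Mirror the rule's filter: only 4xx responses on the watched URL count.
        if path != self.watched_url or not 400 <= http_code <= 499:
            return False
        window = self.hits[(client_ip, path)]
        window.append(ts)
        # Forget events that have fallen out of the time frame.
        while window and ts - window[0] > self.timeframe_s:
            window.popleft()
        return len(window) >= self.num_events
```

Because the counter is keyed on the normalized path, requests to /login?a=1 and /login?b=2 land in the same bucket, which is exactly why random query parameters do not help an attacker.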

Custom Alerter

The process of writing a custom alerter is quite simple. ElastAlert is written in Python, so you just need to subclass the Alerter class and implement the alert and get_info methods. The former receives a list of documents from Elasticsearch that satisfy the rule (the matches parameter), from which you can extract all the info you need. For the needs of sinkholing, we extract the source IP, User Agent, and timestamp and ship them to Redis as JSON. We also set a Redis key expiration (configurable via the redis_ban_time parameter in the rule definition) so that offenders are automatically unblocked after a certain time period has passed. A redacted version of the alert function might look like this:

def alert(self, matches):

    """
    This function is called when matches are found.
    Matches are converted to JSON and sent to Redis
    """

    red = self.connect_redis()
    pipe = red.pipeline()

    for match in matches:
        # extract info from ES document
        ip = get_ip(match)
        url = get_url(match)

        if not ip:
            logger.info('No IP available, skipping match.')
            continue

        red_key = 'sm:banana:{}'.format(ip)

        data = {
            'client_ip4': ip,
            'reason': self.rule.get('name', 'Unknown elastalert rule'),
            'request_id': match.get('request_id', ''),
            'timestamp': match.get('@timestamp', ''),
            'ua': get_ua(match),
            'url': url,
            'ver': 1,
        }

        red_val = json.dumps(data)

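        # Queue the write; `ex` sets the key expiration (ban time) in seconds
        pipe.set(red_key, red_val, ex=self.rule['redis_ban_time'])

    # Send all queued writes to Redis in a single round trip
    pipe.execute()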
Custom alerter for ban info extraction
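One way to keep such an alerter easy to test is to separate building the record from talking to Redis. The sketch below is an illustration of that idea, not our production code: build_ban_record is a hypothetical helper, and the field names client_ip and user_agent are assumptions about the shape of the Elasticsearch document.

```python
import json


def build_ban_record(match, rule_name, ban_time_s):
    """Return (key, value, ttl) for one match, or None when it has no IP."""
    ip = match.get("client_ip")  # assumed field name
    if not ip:
        return None
    payload = {
        "client_ip4": ip,
        "reason": rule_name,
        "request_id": match.get("request_id", ""),
        "timestamp": match.get("@timestamp", ""),
        "ua": match.get("user_agent", ""),  # assumed field name
        "url": match.get("normalized_url", ""),
        "ver": 1,
    }
    key = "sm:banana:{}".format(ip)
    return key, json.dumps(payload, sort_keys=True), ban_time_s
```

The alerter can then loop over the matches, skip the None results, and queue each remaining record on the Redis pipeline with the TTL as the key expiration.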


Ban them all

Pushing the records to Redis triggers a chain reaction, but before we take a look at the rest of the path of the ban info, we need to familiarize ourselves with how HAProxy works and how it can be used for banning.

HAProxy - the Great Sinkholing Barrier

HAProxy serves as a main gateway for every request trying to find its way into our infrastructure. The basic structure of HAProxy configuration has several sections:

  • defaults
  • frontend
  • backend
  • listen

While defaults is pretty self-explanatory, the frontend and backend sections are the most important for us and deserve some attention. All requests land in the frontend section, where their fate is decided according to a predefined set of rules. The frontend section can be used for SSL stripping, header management, URL rewrites, redirections, and more. The request, if not refused, continues to the selected backend, which defines where requests go when they leave the proxy. HAProxy’s backends are equipped with a rich set of features for load-balancing, health-checking, etc. Now that we roughly know the flow of requests in HAProxy, we can take a look at the most interesting part - how to send rogue requests to a sinkhole.

ACLs

HAProxy uses so-called ACLs (Access Control Lists) that provide a configurable way to make decisions based on facts extracted from requests, responses, and any environmental status. We are primarily concerned with client IP addresses, although sinkholing could be configured to ban based on other client information such as User Agent. Actions on requests are taken according to the results of the tests performed by ACLs. The basic syntax of ACLs is as follows:

acl <aclname> <criterion> [flags] [operator] [<value(s)>] …

To check whether a request comes from a certain IP (let’s say 123.123.123.123), one could write the following ACL:

acl ip_ban src -m ip -n 123.123.123.123

Let’s break it down:

  • acl - the ACL definition keyword
  • ip_ban - the name of the ACL
  • src - specifies the part of the request to look at (source of information)
  • -m ip - match the value as an IP address
  • -n - disables DNS resolution

So this ACL will resolve to true for every request that comes from IP 123.123.123.123 (note that multiple IPs - separated by spaces - can be specified). Reading this, you may worry that hardcoding IPs into the config file is not really an ideal way to specify who should be banned. If that’s what you’re thinking, you’re correct. It would require a reload of HAProxy every time the list of banned offenders changes. Fortunately, HAProxy allows ACL values to be changed via its admin socket interface. To use this feature, we must assign a unique ID to the ACL via the -u flag. We will assign ID 0 and, since specifying IPs directly in the config is no longer required, leave the value list empty. The final rule will look like this:

acl ip_ban src -u 0 -m ip -n

This, however, raises another issue. How do we update ACL values via the socket? Fortunately, it’s not rocket science :) There are three basic commands one can use:

  • add acl #<ID> <value>
  • show acl #<ID>
  • clear acl #<ID>

For example:

> for i in {1..10}; do echo "add acl #0 127.0.0.${i}"| socat /run/haproxy/admin.sock stdio; done

> echo "show acl #0" | socat /run/haproxy/admin.sock stdio
0x1566a30 127.0.0.1
0x15d9f50 127.0.0.2
0x15d9ff0 127.0.0.3
0x15da090 127.0.0.4
0x15da130 127.0.0.5
0x15da1d0 127.0.0.6
0x15da270 127.0.0.7
0x15da310 127.0.0.8
0x15da3b0 127.0.0.9
0x15da450 127.0.0.10

> echo "clear acl #0" | socat /run/haproxy/admin.sock stdio
Updating ACL via socket
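The same exchange can be scripted without socat. The following is a minimal sketch using only Python's standard library; the function names are hypothetical, the socket path is taken from the shell example above, and it assumes the socket's default non-interactive mode, in which HAProxy answers one command and then closes the connection.

```python
import socket


def haproxy_command(command, socket_path="/run/haproxy/admin.sock", timeout_s=2.0):
    """Send one command to the HAProxy admin socket and return the raw reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout_s)
        sock.connect(socket_path)
        sock.sendall(command.encode("ascii") + b"\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # connection closed, the reply is complete
                break
            chunks.append(data)
    return b"".join(chunks).decode("ascii", errors="replace")


def parse_show_acl(reply):
    """Extract values from 'show acl' lines such as '0x1566a30 127.0.0.1'."""
    values = []
    for line in reply.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0].startswith("0x"):
            values.append(parts[1])
    return values
```

With these two helpers, parse_show_acl(haproxy_command("show acl #0")) returns the list of currently banned IPs.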


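The main drawback of this approach, as we use it, is that the three commands above cannot remove an individual IP from the ACL. If one IP needs to be removed, the whole ACL must be cleared and then refilled with the rest of the IPs. (HAProxy’s socket interface also documents a del acl command for deleting single entries; the tooling described here does not rely on it.) Your tooling needs to be ready for this :)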
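The clear-and-refill bookkeeping can be sketched as a small planning function. This is an illustration only; plan_acl_update is a hypothetical name, and it merely produces the command strings that would be sent to the admin socket.

```python
def plan_acl_update(current, wanted, acl_id=0):
    """Return the socket commands that turn the `current` ACL into `wanted`."""
    current, wanted = set(current), set(wanted)
    commands = []
    if current - wanted:
        # Something has to go: wipe the ACL and refill it from scratch.
        commands.append("clear acl #{}".format(acl_id))
        to_add = wanted
    else:
        # Nothing expired: only the new entries need to be added.
        to_add = wanted - current
    commands.extend("add acl #{} {}".format(acl_id, ip) for ip in sorted(to_add))
    return commands
```

Note that between the clear and the last add, some still-banned IPs are briefly let through, so it pays to keep that window short.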

The Sinkhole

Now that we know how to mark requests that should be banned, let’s send them to the sinkholing backend! For this, HAProxy uses a use_backend clause that can be bundled with a condition:

use_backend bk_sinkholing if ip_ban

That says to go to backend bk_sinkholing if the ip_ban acl resolves to true.

The only thing left to configure is the backend itself. Backends in HAProxy are configured using a backend keyword followed by its name. The backend in our case is essentially empty, with no backend services specified, because we don’t want the request to go anywhere else. This is where the actual sinkhole is :). The only thing the backend now does is send back an error message to let users know that they were banned.

To actually send a synthetic response and status, we use a neat trick. HAProxy generates a 503 error because there is no backend server available within the bk_sinkholing backend, and it lets us supply a custom error file for each error code. Our custom file for 503 contains the message for the user together with a more appropriate status code (429 Too Many Requests). In the future, we do plan to be nice and create a Sorry Page for banned users that would allow them to be unbanned if they pass a test such as solving a CAPTCHA. The HAProxy configuration is ready for it; we just need to route the requests from the backend section to the sorry page.

To sum it up, a very basic config for HAProxy that could be used for sinkholing may look like this:

global
     stats socket /var/run/haproxy/haproxy.sock mode 770 level admin
     # <shortened>

defaults
     # <shortened>

frontend showmax
    # <shortened>
    # Sinkholing
    acl ip_ban src -u 0 -m ip -n
    use_backend bk_sinkholing if ip_ban
    default_backend <some other backend>

backend bk_sinkholing
    errorfile 503 /etc/haproxy/errors/429.http

Where the content of the error file is:

HTTP/1.0 429 Too Many Requests
Cache-Control: no-cache
Connection: close
Content-Type: application/json

{
    "error_code": "HAP1007",
    "lang": "eng",
    "message": "Too many requests. You have been banned. Please slow down a bit..."
}
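Since the error file is just a raw HTTP response stored on disk, it can be generated rather than hand-written, which guarantees the JSON body is valid. Below is a minimal sketch; build_errorfile is a hypothetical helper, and the headers are copied from the file above.

```python
import json


def build_errorfile(status_line, body):
    """Render a raw HTTP response suitable for use as an error file."""
    payload = json.dumps(body, indent=4)
    headers = [
        status_line,
        "Cache-Control: no-cache",
        "Connection: close",
        "Content-Type: application/json",
    ]
    # HTTP uses CRLF line endings and an empty line before the body.
    return "\r\n".join(headers) + "\r\n\r\n" + payload + "\n"


content = build_errorfile(
    "HTTP/1.0 429 Too Many Requests",
    {
        "error_code": "HAP1007",
        "lang": "eng",
        "message": "Too many requests. You have been banned. Please slow down a bit...",
    },
)
```

The resulting string can then be written to the path referenced by the errorfile directive.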

The HAProxy updater

If you have been paying attention, you know that there is only one thing left to solve - the synchronization of ban information from Redis to HAProxy’s ACLs. For that, we’ve developed a neat piece of Python code that regularly (every minute, roughly) fetches info about new bans from Redis and populates HAProxy’s ACLs with the banned IPs.

The workflow is quite easy. We created a systemd service that starts a script containing a main loop that runs every minute, fetching all sinkholing-related keys from Redis. Redis iterates over all its keys, returning those that match the sinkholing pattern sm:banana:*. The resulting list is then compared to the set of IPs currently banned in HAProxy. If some bans have expired, HAProxy’s ACL needs to be cleared, as does the currently-banned set. For communication with the HAProxy socket we use the haproxyadmin library. Bans that are to be added are fetched from Redis and pushed to HAProxy one by one. The script also provides rich logging and exports some metrics to Prometheus and Elasticsearch.

An interesting story comes with this piece of code. We deployed it in our DC in Europe and everything went smoothly. But when we got to the deployment in South Africa, it suddenly took ages for sinkholing to apply new bans. That was a serious problem, because it effectively disabled sinkholing: by the time a run finished, the current ban set was completely different. Fortunately, with a bit of debugging, we found the cause.

We use the standard Redis library available on PyPI, which provides a function called scan_iter that fetches keys matching a given pattern. It asks Redis for keys iteratively, in small batches (you may already know where I’m going here). As it takes a loooong time for packets to travel between South Africa and Europe (nearly 200ms), scanning all keys in Redis in chunks of 10 takes its toll. With the default value of 10, it took more than 10 minutes (which is nearly the time it takes for bans to expire :D ) to scan it all.

After some tuning, we ended up with a value of 1000, which sped up the process considerably - five seconds is something we can be satisfied with. The lesson: Always send data in big chunks over high-latency networks, if possible. It all may seem fast on your local network, but when you step out to the real world, things may get muuuuch slower :)
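A back-of-the-envelope calculation shows why the chunk size matters so much. The sketch below assumes that each SCAN call costs one network round trip and walks roughly count keys; the keyspace size of 30,000 is an assumption chosen only because it reproduces the ten-minute figure, not a number measured on our Redis.

```python
import math


def scan_duration_s(total_keys, count, rtt_s):
    """Rough estimate: one network round trip per SCAN call."""
    round_trips = math.ceil(total_keys / count)
    return round_trips * rtt_s


# Assumed keyspace of 30,000 keys and a 200 ms round trip.
slow = scan_duration_s(30_000, count=10, rtt_s=0.2)    # 3000 round trips
fast = scan_duration_s(30_000, count=1000, rtt_s=0.2)  # 30 round trips
print("count=10: {:.0f} s, count=1000: {:.0f} s".format(slow, fast))
```

Under these assumptions the scan drops from about ten minutes to a few seconds, which is in line with what we observed. Keep in mind that Redis treats the count value as a hint, so real numbers will vary.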

Here’s an example implementation:

def update_haproxy(haproxy_banned, config):
    hap = haproxy.HAProxy(socket_dir=config.get('MAIN', 'HAPROXY_SOCKET_DIR'))
    banana_limit = config.getint('MAIN', 'BANANA_LIMIT')

    redis_banned = set()
    for key in redis_conn.scan_iter(match='sm:banana:*', count=1000):
        redis_banned.add(key)

    # Set clear_acl flag if
    #   - first iteration of the service (empty banned set)
    #   - at least one value disappeared/expired from Redis
    #     we have to purge all ACLs and reload all remaining values
    clear_acl = bool(haproxy_banned - redis_banned) or not haproxy_banned

    ban_candidates = redis_banned
    previously_haproxy_banned = haproxy_banned.copy()

    # Decide how many values will be fetched from Redis and put into
    # HAProxy ACL
    if clear_acl:
        haproxy_banned = set()
        redis_fetch_val_amount = banana_limit
    else:
        redis_fetch_val_amount = banana_limit - len(haproxy_banned)
        ban_candidates -= haproxy_banned

    redis_pipe = redis_conn.pipeline()

    assert banana_limit >= len(haproxy_banned)

    for key in tuple(ban_candidates)[:redis_fetch_val_amount]:
        haproxy_banned.add(key)
        redis_pipe.get(key)

    # A key may expire between the scan and the fetch; GET then returns None
    bananas = [json.loads(banana) for banana in redis_pipe.execute()
               if banana is not None]

    if clear_acl:
        hap.clear_acl(IP_ACL)

    for banana in bananas:
        hap.add_acl(IP_ACL, banana['client_ip4'])

    return haproxy_banned
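The function above is meant to be called from the service's main loop, which carries the banned set from one round to the next. A minimal sketch of such a loop follows; run_forever and its parameters are hypothetical, and the sleep and max_iterations arguments exist only to make the loop testable.

```python
import time


def run_forever(update, interval_s=60.0, sleep=time.sleep, max_iterations=None):
    """Call update(banned) repeatedly, carrying the banned set between rounds."""
    banned = set()
    done = 0
    while max_iterations is None or done < max_iterations:
        try:
            banned = update(banned)
        except Exception as exc:  # keep the service alive on transient errors
            print("update failed, retrying next round: {}".format(exc))
        sleep(interval_s)
        done += 1
    return banned
```

In the service it could be started as run_forever(lambda banned: update_haproxy(banned, config)). Starting with an empty set means the first round always clears the ACL and loads the full ban list.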

Next up

That’s it for this installment in our sinkholing series. Next time we’ll take a look at some of the rules we use and at how we monitor sinkholing.
