Scrape X.com (Twitter) Tweet Pages Using Python

Crawlbase - Aug 2 - Dev Community

This blog was originally posted to Crawlbase Blog

X.com (formerly Twitter) remains a great platform for real-time information and public sentiment analysis. With millions of users posting daily, X.com is a treasure trove for anyone looking to gain insights into trends, opinions, and behavior. Despite recent changes to the platform, scraping tweet data from X.com can still be highly valuable for researchers, marketers, and developers.

According to recent stats, X.com sees over 500 million tweets per day and has 611 million monthly active users. It's a goldmine of real-time data and a perfect target for web scraping projects that gather information on trending topics, user sentiment, and more.

Let’s get started scraping Twitter tweet pages with Python. We’ll show you how to set up your environment, build a scraper, and optimize your scraping process with Crawlbase Smart Proxy.

Table Of Contents

  • Why Scrape X.com (Twitter) Tweet Pages?
  • Setting Up the Environment
  • Scraping Twitter Tweet Pages
  • Optimizing with Crawlbase Smart Proxy
  • Scrape Twitter Tweet Pages with Crawlbase
  • Frequently Asked Questions

Why Scrape X.com (Twitter) Tweet Pages?

Scraping tweet pages can provide immense value for various applications. Here are a few reasons why you might want to scrape X.com:

  1. Trend Analysis: With millions of tweets daily, X.com is a rich source for spotting trending and emerging topics. Scraping tweets can help you track hashtags, topics, and events in real time.
  2. Sentiment Analysis: Tweets contain public opinions and sentiments about products, services, political events, and more. Businesses and researchers can gain insights into public sentiment and make informed decisions.
  3. Market Research: Companies can use tweet data to understand consumer behavior, preferences, and feedback. This is useful for product development, marketing strategies, and customer service improvements.
  4. Academic Research: Scholars and researchers use tweet data for various academic purposes, like studying social behavior, political movements, and public health trends. X.com data can be a rich dataset for qualitative and quantitative research.
  5. Content Curation: Content creators and bloggers can use scraped tweet data to curate relevant and trending content for their audience. This can help in generating fresh, up-to-date content that resonates with readers.
  6. Monitoring and Alerts: Scraping tweets can be used to monitor specific keywords, hashtags, or user accounts for important updates or alerts. This is useful for tracking industry news, competitor activities, or any specific topic of interest.

X.com tweet pages hold a lot of data that can be used for many purposes. Below, we will walk you through setting up your environment, creating a tweet page scraper, and optimizing your scrape using Crawlbase Smart Proxy.

Setting Up the Environment

Before we start scraping X.com tweet pages, we need to set up our development environment. This involves installing necessary libraries and tools to make the scraping process efficient and effective. Here's how you can get started:

Install Python

If you don't already have Python installed, download and install it from the official Python website. Make sure to add Python to your system's PATH during installation.

Install Required Libraries

We'll be using Playwright for browser automation and Pandas, a popular library for data manipulation and analysis. Install these libraries using pip:

pip install playwright pandas
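To keep these dependencies isolated from other projects, you can optionally create and activate a virtual environment first:

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate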

Set Up Playwright

Playwright requires a one-time setup to install browser binaries. Run the following command to complete the setup:

python -m playwright install
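Since the scraper in this guide only uses Chromium, you can optionally install just that browser instead of all of them:

python -m playwright install chromium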

Set Up Your IDE

Using a good IDE (Integrated Development Environment) can make a big difference to your development experience. Some popular IDEs for Python development are:

  • PyCharm: A powerful and popular IDE with many features for professional developers, available from JetBrains.
  • VS Code: A lightweight and flexible editor with great Python support, available from Microsoft.
  • Jupyter Notebook: An open-source web application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text. Install it using pip install notebook.

Create the Scraper Script

Next, we'll create a script named tweet_page_scraper.py in your preferred IDE. We will write our Python code in this script to scrape tweet pages from X.com.

Now that your environment is set up, let's start building the scraper. In the next section, we'll look at how X.com renders data and how we can scrape tweet details.

Scraping Twitter Tweet Pages

How X.com Renders Data

To scrape Twitter (X.com) tweet pages effectively, it's essential to understand how X.com renders its data.

X.com is a JavaScript-heavy application that loads content dynamically through background requests, known as XHR (XMLHttpRequest) requests. When you visit a tweet page, the initial HTML loads, and then JavaScript fetches the tweet details through these XHR requests. To scrape this data, we will use a headless browser to capture these requests and extract the data.
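If you want to see these background requests for yourself, the following minimal sketch logs the URL of every response a tweet page triggers. It's a handy way to confirm which endpoint carries the tweet payload (the 5-second timeout is an arbitrary value chosen for illustration):

from playwright.sync_api import sync_playwright

# Minimal sketch: log every background response a tweet page fires.
# Look for URLs containing "TweetResultByRestId" in the output.
with sync_playwright() as pw:
    browser = pw.chromium.launch(headless=True)
    page = browser.new_page()
    page.on("response", lambda response: print(response.url))
    page.goto("https://x.com/BillGates/status/1352662770416664577")
    page.wait_for_timeout(5000)  # give the page time to fire its XHR requests
    browser.close()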

Creating Tweet Page Scraper

To create a scraper for X.com tweet pages, we will use Playwright, a browser automation library. The scraper will load the tweet page, capture the XHR requests, and extract the tweet details from those requests.

Here's the code to create the scraper:

from playwright.sync_api import sync_playwright
import json

def scrape_tweet(url: str) -> dict:
    """Scrape a single tweet page for tweet data."""
    xhr_calls = []

    def intercept_response(response):
        """Collect background responses that contain tweet data."""
        if "TweetResultByRestId" in response.url:
            xhr_calls.append(response)

    with sync_playwright() as pw:
        browser = pw.chromium.launch(headless=True)
        context = browser.new_context(viewport={"width": 1920, "height": 1080})
        page = context.new_page()

        # Register the interceptor before navigating so no response is missed
        page.on("response", intercept_response)
        page.goto(url)
        page.wait_for_selector("[data-testid='tweet']")

        # Parse the captured XHR responses for the tweet payload
        for xhr in xhr_calls:
            try:
                data = xhr.json()
                return data["data"]["tweetResult"]["result"]
            except Exception as e:
                print(f"Error processing response: {e}")

    return {}

if __name__ == "__main__":
    tweet_url = "https://x.com/BillGates/status/1352662770416664577"
    tweet_data = scrape_tweet(tweet_url)
    print(json.dumps(tweet_data, indent=4))

The nested intercept_response function watches all background responses and collects those whose URL contains "TweetResultByRestId", the endpoint X.com uses to deliver tweet details. The main function, scrape_tweet, launches a headless browser session, registers the interceptor, navigates to the specified tweet URL, and waits for the tweet element to render. It then parses the captured XHR responses and returns the tweet details as a dictionary.

Example Output:

{
  "data": {
    "tweetResult": {
      "result": {
        "__typename": "Tweet",
        "rest_id": "1352662770416664577",
        "core": {
          "user_results": {
            "result": {
              "__typename": "User",
              "id": "VXNlcjo1MDM5Mzk2MA==",
              "rest_id": "50393960",
              "affiliates_highlighted_label": {},
              "is_blue_verified": true,
              "profile_image_shape": "Circle",
              "legacy": {
                "created_at": "Wed Jun 24 18:44:10 +0000 2009",
                "default_profile": false,
                "default_profile_image": false,
                "description": "Sharing things I'm learning through my foundation work and other interests.",
                "entities": {
                  "description": {
                    "urls": []
                  },
                  "url": {
                    "urls": [
                      {
                        "display_url": "gatesnot.es/blog",
                        "expanded_url": "https://gatesnot.es/blog",
                        "url": "https://t.co/UkvHzxDwkH",
                        "indices": [0, 23]
                      }
                    ]
                  }
                },
                "fast_followers_count": 0,
                "favourites_count": 560,
                "followers_count": 65199662,
                "friends_count": 588,
                "has_custom_timelines": true,
                "is_translator": false,
                "listed_count": 119964,
                "location": "Seattle, WA",
                "media_count": 1521,
                "name": "Bill Gates",
                "normal_followers_count": 65199662,
                "pinned_tweet_ids_str": [],
                "possibly_sensitive": false,
                "profile_banner_url": "https://pbs.twimg.com/profile_banners/50393960/1672784571",
                "profile_image_url_https": "https://pbs.twimg.com/profile_images/1674815862879178752/nTGMV1Eo_normal.jpg",
                "profile_interstitial_type": "",
                "screen_name": "BillGates",
                "statuses_count": 4479,
                "translator_type": "regular",
                "url": "https://t.co/UkvHzxDwkH",
                "verified": false,
                "withheld_in_countries": []
              },
              "tipjar_settings": {
                "is_enabled": false,
                "bandcamp_handle": "",
                "bitcoin_handle": "",
                "cash_app_handle": "",
                "ethereum_handle": "",
                "gofundme_handle": "",
                "patreon_handle": "",
                "pay_pal_handle": "",
                "venmo_handle": ""
              }
            }
          }
        },
        "unmention_data": {},
        "edit_control": {
          "edit_tweet_ids": ["1352662770416664577"],
          "editable_until_msecs": "1611336710383",
          "is_edit_eligible": true,
          "edits_remaining": "5"
        },
        "is_translatable": false,
        "views": {
          "state": "Enabled"
        },
        "source": "<a href=\"https://mobile.twitter.com\" rel=\"nofollow\">Twitter Web App</a>",
        "legacy": {
          "bookmark_count": 279,
          "bookmarked": false,
          "created_at": "Fri Jan 22 17:01:50 +0000 2021",
          "conversation_control": {
            "policy": "Community",
            "conversation_owner_results": {
              "result": {
                "__typename": "User",
                "legacy": {
                  "screen_name": "BillGates"
                }
              }
            }
          },
          "conversation_id_str": "1352662770416664577",
          "display_text_range": [0, 254],
          "entities": {
            "hashtags": [],
            "media": [
              {
                "display_url": "pic.x.com/67sifrg1yd",
                "expanded_url": "https://twitter.com/BillGates/status/1352662770416664577/photo/1",
                "id_str": "1352656486099423232",
                "indices": [255, 278],
                "media_key": "3_1352656486099423232",
                "media_url_https": "https://pbs.twimg.com/media/EsWZ6E0VkAA_Zgh.jpg",
                "type": "photo",
                "url": "https://t.co/67SIfrG1Yd",
                "ext_media_availability": {
                  "status": "Available"
                },
                "features": {
                  "large": {
                    "faces": []
                  },
                  "medium": {
                    "faces": []
                  },
                  "small": {
                    "faces": []
                  },
                  "orig": {
                    "faces": []
                  }
                },
                "sizes": {
                  "large": {
                    "h": 698,
                    "w": 698,
                    "resize": "fit"
                  },
                  "medium": {
                    "h": 698,
                    "w": 698,
                    "resize": "fit"
                  },
                  "small": {
                    "h": 680,
                    "w": 680,
                    "resize": "fit"
                  },
                  "thumb": {
                    "h": 150,
                    "w": 150,
                    "resize": "crop"
                  }
                },
                "original_info": {
                  "height": 698,
                  "width": 698,
                  "focus_rects": [
                    {
                      "x": 0,
                      "y": 206,
                      "w": 698,
                      "h": 391
                    },
                    {
                      "x": 0,
                      "y": 0,
                      "w": 698,
                      "h": 698
                    },
                    {
                      "x": 86,
                      "y": 0,
                      "w": 612,
                      "h": 698
                    },
                    {
                      "x": 262,
                      "y": 0,
                      "w": 349,
                      "h": 698
                    },
                    {
                      "x": 0,
                      "y": 0,
                      "w": 698,
                      "h": 698
                    }
                  ]
                },
                "media_results": {
                  "result": {
                    "media_key": "3_1352656486099423232"
                  }
                }
              }
            ],
            "symbols": [],
            "timestamps": [],
            "urls": [],
            "user_mentions": []
          },
          "extended_entities": {
            "media": [
              {
                "display_url": "pic.twitter.com/67SIfrG1Yd",
                "expanded_url": "https://twitter.com/BillGates/status/1352662770416664577/photo/1",
                "id_str": "1352656486099423232",
                "indices": [255, 278],
                "media_key": "3_1352656486099423232",
                "media_url_https": "https://pbs.twimg.com/media/EsWZ6E0VkAA_Zgh.jpg",
                "type": "photo",
                "url": "https://t.co/67SIfrG1Yd",
                "ext_media_availability": {
                  "status": "Available"
                },
                "features": {
                  "large": {
                    "faces": []
                  },
                  "medium": {
                    "faces": []
                  },
                  "small": {
                    "faces": []
                  },
                  "orig": {
                    "faces": []
                  }
                },
                "sizes": {
                  "large": {
                    "h": 698,
                    "w": 698,
                    "resize": "fit"
                  },
                  "medium": {
                    "h": 698,
                    "w": 698,
                    "resize": "fit"
                  },
                  "small": {
                    "h": 680,
                    "w": 680,
                    "resize": "fit"
                  },
                  "thumb": {
                    "h": 150,
                    "w": 150,
                    "resize": "crop"
                  }
                },
                "original_info": {
                  "height": 698,
                  "width": 698,
                  "focus_rects": [
                    {
                      "x": 0,
                      "y": 206,
                      "w": 698,
                      "h": 391
                    },
                    {
                      "x": 0,
                      "y": 0,
                      "w": 698,
                      "h": 698
                    },
                    {
                      "x": 86,
                      "y": 0,
                      "w": 612,
                      "h": 698
                    },
                    {
                      "x": 262,
                      "y": 0,
                      "w": 349,
                      "h": 698
                    },
                    {
                      "x": 0,
                      "y": 0,
                      "w": 698,
                      "h": 698
                    }
                  ]
                },
                "media_results": {
                  "result": {
                    "media_key": "3_1352656486099423232"
                  }
                }
              }
            ]
          },
          "favorite_count": 63988,
          "favorited": false,
          "full_text": "One of the benefits of being 65 is that I’m eligible for the COVID-19 vaccine. I got my first dose this week, and I feel great. Thank you to all of the scientists, trial participants, regulators, and frontline healthcare workers who got us to this point. https://t.co/67SIfrG1Yd",
          "is_quote_status": false,
          "lang": "en",
          "possibly_sensitive": false,
          "possibly_sensitive_editable": true,
          "quote_count": 7545,
          "reply_count": 0,
          "retweet_count": 5895,
          "retweeted": false,
          "user_id_str": "50393960",
          "id_str": "1352662770416664577"
        }
      }
    }
  }
}

Parsing Tweet Dataset

The JSON data we capture from X.com's XHR requests can be quite complex. We will parse this JSON data to extract key information such as the tweet content, author details, and engagement metrics.

Here's a function to parse the tweet data:

def parse_tweet(data: dict) -> dict:
    """Parse X.com tweet JSON dataset for the most important fields."""
    legacy = data.get("legacy", {})
    entities = legacy.get("entities", {})
    result = {
        "created_at": legacy.get("created_at"),
        "attached_urls": [url["expanded_url"] for url in entities.get("urls", [])],
        "attached_media": [media["media_url_https"] for media in entities.get("media", [])],
        "tagged_users": [mention["screen_name"] for mention in entities.get("user_mentions", [])],
        "tagged_hashtags": [hashtag["text"] for hashtag in entities.get("hashtags", [])],
        "favorite_count": legacy.get("favorite_count"),
        "retweet_count": legacy.get("retweet_count"),
        "reply_count": legacy.get("reply_count"),
        "text": legacy.get("full_text"),
        "user_id": legacy.get("user_id_str"),
        "tweet_id": legacy.get("id_str"),
        "conversation_id": legacy.get("conversation_id_str"),
        "language": legacy.get("lang"),
        "source": data.get("source"),
        "views": data.get("views", {}).get("count")
    }
    return result
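Chaining the parser with the scraper from the previous section, the parsed result for the example tweet would look roughly like this (values are taken from the sample output above and shown for illustration):

tweet_data = scrape_tweet("https://x.com/BillGates/status/1352662770416664577")
parsed = parse_tweet(tweet_data)
# parsed would look roughly like:
# {
#     "created_at": "Fri Jan 22 17:01:50 +0000 2021",
#     "attached_media": ["https://pbs.twimg.com/media/EsWZ6E0VkAA_Zgh.jpg"],
#     "favorite_count": 63988,
#     "retweet_count": 5895,
#     "text": "One of the benefits of being 65 is that I'm eligible for ...",
#     ...
# }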

Saving Data

Finally, we'll save the parsed tweet data to a CSV file using the pandas library for easy analysis and storage.

Here's the function to save the data:

import pandas as pd

def save_to_csv(tweet_data: dict, filename: str):
    """Save the parsed tweet data to a CSV file."""
    df = pd.DataFrame([tweet_data])
    df.to_csv(filename, index=False)
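If you want to collect several tweets into one file, a simple extension is to parse each tweet into a row and write them all together (the URL list below is a placeholder; substitute the tweets you care about):

# Sketch: scrape multiple tweets and save them as rows of one CSV.
urls = [
    "https://x.com/BillGates/status/1352662770416664577",
    # ... add more tweet URLs here
]
rows = [parse_tweet(scrape_tweet(u)) for u in urls]
pd.DataFrame(rows).to_csv("tweets.csv", index=False)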

Complete Code

Here is the complete code combining all the steps:

from playwright.sync_api import sync_playwright
import json
import pandas as pd

def scrape_tweet(url: str) -> dict:
    """Scrape a single tweet page for tweet data."""
    xhr_calls = []

    def intercept_response(response):
        """Collect background responses that contain tweet data."""
        if "TweetResultByRestId" in response.url:
            xhr_calls.append(response)

    with sync_playwright() as pw:
        browser = pw.chromium.launch(headless=True)
        context = browser.new_context(viewport={"width": 1920, "height": 1080})
        page = context.new_page()

        # Register the interceptor before navigating so no response is missed
        page.on("response", intercept_response)
        page.goto(url)
        page.wait_for_selector("[data-testid='tweet']")

        # Parse the captured XHR responses for the tweet payload
        for xhr in xhr_calls:
            try:
                data = xhr.json()
                return data["data"]["tweetResult"]["result"]
            except Exception as e:
                print(f"Error processing response: {e}")

    return {}

def parse_tweet(data: dict) -> dict:
    """Parse X.com tweet JSON dataset for the most important fields."""
    legacy = data.get("legacy", {})
    entities = legacy.get("entities", {})
    return {
        "created_at": legacy.get("created_at"),
        "attached_urls": [url["expanded_url"] for url in entities.get("urls", [])],
        "attached_media": [media["media_url_https"] for media in entities.get("media", [])],
        "tagged_users": [mention["screen_name"] for mention in entities.get("user_mentions", [])],
        "tagged_hashtags": [hashtag["text"] for hashtag in entities.get("hashtags", [])],
        "favorite_count": legacy.get("favorite_count"),
        "retweet_count": legacy.get("retweet_count"),
        "reply_count": legacy.get("reply_count"),
        "text": legacy.get("full_text"),
        "user_id": legacy.get("user_id_str"),
        "tweet_id": legacy.get("id_str"),
        "conversation_id": legacy.get("conversation_id_str"),
        "language": legacy.get("lang"),
        "source": data.get("source"),
        "views": data.get("views", {}).get("count")
    }

def save_to_csv(tweet_data: dict, filename: str):
    """Save the parsed tweet data to a CSV file."""
    df = pd.DataFrame([tweet_data])
    df.to_csv(filename, index=False)

if __name__ == "__main__":
    tweet_url = "https://x.com/BillGates/status/1352662770416664577"
    tweet_data = scrape_tweet(tweet_url)
    parsed_data = parse_tweet(tweet_data)
    save_to_csv(parsed_data, "tweet_data.csv")
    print("Tweet data saved to tweet_data.csv")

By following these steps, you can effectively scrape and save tweet data from X.com using Python. In the next section, we'll look at how to optimize this process with Crawlbase Smart Proxy to handle anti-scraping measures.

Optimizing with Crawlbase Smart Proxy

When scraping X.com, you may run into anti-scraping measures like IP blocking and rate limiting. To get around these restrictions, using a proxy like Crawlbase Smart Proxy can be very effective. Crawlbase Smart Proxy rotates IP addresses and manages request rates so your scraping stays undetected and uninterrupted.

Why Use Crawlbase Smart Proxy?

  1. IP Rotation: Crawlbase rotates IP addresses for each request, making it difficult for X.com to detect and block your scraper.
  2. Request Management: Crawlbase handles request rates to avoid triggering anti-scraping mechanisms.
  3. Reliability: Using a proxy service ensures consistent and reliable access to data, even for large-scale scraping projects.

Integrating Crawlbase Smart Proxy with Playwright

To integrate Crawlbase Smart Proxy with our existing Playwright setup, we need to configure the proxy settings. Here’s how you can do it:

Sign Up for Crawlbase: First, sign up for an account on Crawlbase and obtain your API token.

Configure Proxy in Playwright: Update the Playwright settings to use the Crawlbase Smart Proxy.

Here's how you can configure Playwright to use Crawlbase Smart Proxy:

from playwright.sync_api import sync_playwright
import json

# Replace USER_TOKEN placeholder with your token
CRAWLBASE_PROXY = "http://USER_TOKEN:@smartproxy.crawlbase.com:8012"

def scrape_tweet_with_proxy(url: str) -> dict:
    """Scrape a single tweet page using Crawlbase Smart Proxy."""
    xhr_calls = []

    def intercept_response(response):
        """Collect background responses that contain tweet data."""
        if "TweetResultByRestId" in response.url:
            xhr_calls.append(response)

    with sync_playwright() as pw:
        # Route all browser traffic through the Crawlbase Smart Proxy
        browser = pw.chromium.launch(headless=True, proxy={"server": CRAWLBASE_PROXY})
        context = browser.new_context(viewport={"width": 1920, "height": 1080})
        page = context.new_page()

        page.on("response", intercept_response)
        page.goto(url)
        page.wait_for_selector("[data-testid='tweet']")

        # Parse the captured XHR responses for the tweet payload
        for xhr in xhr_calls:
            try:
                data = xhr.json()
                return data["data"]["tweetResult"]["result"]
            except Exception as e:
                print(f"Error processing response: {e}")

    return {}

if __name__ == "__main__":
    tweet_url = "https://x.com/BillGates/status/1352662770416664577"
    tweet_data = scrape_tweet_with_proxy(tweet_url)
    print(json.dumps(tweet_data, indent=4))

In this updated script, we’ve added the CRAWLBASE_PROXY variable containing the proxy server details. When launching the Playwright browser, we include the proxy parameter to route all requests through Crawlbase Smart Proxy.
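Before pointing the scraper at X.com, it's worth confirming that traffic is actually being routed through the proxy. A quick way to check (using httpbin.org here purely as an example IP-echo service) is to load a page that reports your IP and make sure it differs from your own:

from playwright.sync_api import sync_playwright

CRAWLBASE_PROXY = "http://USER_TOKEN:@smartproxy.crawlbase.com:8012"

# Sanity check: the page should report the proxy's IP, not yours.
with sync_playwright() as pw:
    browser = pw.chromium.launch(headless=True, proxy={"server": CRAWLBASE_PROXY})
    page = browser.new_page()
    page.goto("https://httpbin.org/ip")
    print(page.inner_text("body"))
    browser.close()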

Benefits of Using Crawlbase Smart Proxy

  1. Enhanced Scraping Efficiency: By rotating IP addresses, Crawlbase helps maintain high scraping efficiency without interruptions.
  2. Increased Data Access: Avoiding IP bans ensures continuous access to X.com tweet data.
  3. Simplified Setup: Integrating Crawlbase with Playwright is straightforward and requires minimal code changes.

By using Crawlbase Smart Proxy, you can optimize your X.com scraping process, ensuring reliable and efficient data collection. In the next section, we'll conclude our guide and answer some frequently asked questions about scraping X.com tweet pages.

Scrape Twitter Tweet Pages with Crawlbase

Scraping Twitter tweet pages is a great way to gather data for research, analysis, and other purposes. By understanding how X.com renders data and using Playwright for browser automation, you can extract tweet details reliably. Adding Crawlbase Smart Proxy to the mix makes your scraping even more robust by bypassing anti-scraping measures and keeping data collection uninterrupted.

If you're looking to expand your web scraping capabilities, consider exploring our following guides on scraping other social media platforms.

📜 How to Scrape Facebook
📜 How to Scrape LinkedIn
📜 How to Scrape Reddit
📜 How to Scrape Instagram
📜 How to Scrape YouTube

If you have any questions or feedback, our support team is always available to assist you on your web scraping journey. Happy Scraping!

Frequently Asked Questions

Q: Is web scraping X.com legal?

The legality of scraping X.com depends on the website's terms of service, whether the data being scraped is publicly available, and how you use that data. You must review X.com's terms of service to ensure you comply with their policies. Scraping publicly available data for personal use is less likely to be an issue, while scraping for commercial use without permission can lead to serious legal problems. To avoid legal risks, it's highly recommended to consult a lawyer before doing extensive web scraping.

Q: Why should I use a headless browser like Playwright for scraping X.com?

X.com is a JavaScript-heavy website that loads content through background requests (XHR), which makes it hard to scrape with traditional HTTP requests. A headless browser like Playwright is built to handle this kind of complexity: it can execute JavaScript, render web pages like a real browser, and capture the background requests that contain the data you want. This makes it a perfect fit for X.com, since it lets you extract data from dynamically loaded content.

Q: What is Crawlbase Smart Proxy, and why should I use it?

Crawlbase Smart Proxy is an advanced proxy service that makes web scraping more robust by rotating IP addresses and managing request rates. This helps you avoid IP blocking and rate limiting, which are common issues in web scraping. By distributing your requests across multiple IP addresses, Crawlbase Smart Proxy keeps your scraping activities undetected and uninterrupted, giving you more consistent and reliable access to data from websites like X.com and making your data collection more successful and efficient.

Q: How do I handle large JSON datasets from X.com scraping?

Large JSON datasets from X.com scraping can be messy and hard to work with. You can use Python's json module to parse and reshape the data into a more manageable format, extracting only the most important fields and organizing them in a simpler structure. Data manipulation libraries like pandas also make it easier to clean, transform, and analyze large datasets, so you can get insights from the scraped data without being overwhelmed by its complexity.
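For example, pandas' json_normalize can flatten a nested tweet payload into tabular columns (the file name below is hypothetical; the column names match the tweet structure shown earlier):

import json
import pandas as pd

# Sketch: flatten a nested tweet payload into a flat, one-row table.
with open("tweet_data.json") as f:  # hypothetical file holding raw tweet JSON
    tweet = json.load(f)

df = pd.json_normalize(tweet, sep=".")
# Nested fields become dotted columns, e.g. "legacy.full_text":
print(df[["legacy.full_text", "legacy.favorite_count"]])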
