This workflow is a powerful web scraping and data extraction pipeline built with Selenium and OpenAI. It collects structured data from almost any website, public or behind a login, handles anti-bot protections, and analyzes the scraped pages with AI.
It supports:
- Running in a Selenium container with optional proxy configuration.
- Scraping with or without authentication (via session cookies).
- Automatic screenshot capture and AI-based content extraction.
- Handling of blocked pages, errors, and fallback logic.
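Scraping behind a login works by handing the browser pre-captured session cookies. A minimal sketch of the conversion step, assuming cookies arrive as a simple name-to-value mapping (the `build_selenium_cookies` helper and input format are illustrative, not part of the workflow itself):

```python
def build_selenium_cookies(raw_cookies, domain):
    """Convert a simple name->value mapping into the dict format
    Selenium's driver.add_cookie() expects (hypothetical helper)."""
    return [
        {"name": name, "value": value, "domain": domain, "path": "/"}
        for name, value in raw_cookies.items()
    ]

# Injecting them requires the browser to already be on the target domain:
# driver.get(f"https://{domain}")
# for cookie in build_selenium_cookies({"sessionid": "abc123"}, "example.com"):
#     driver.add_cookie(cookie)
# driver.refresh()  # reload so the site sees the authenticated session
```

Selenium rejects cookies for a domain the browser has not visited yet, which is why the initial `driver.get()` call matters.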
## 🚀 Features
- Webhook Trigger: Accepts JSON input with subject, domain, target URL, and data fields.
- Google Search + Smart URL Extraction: Finds the most relevant page on a given domain by combining a search query with AI filtering.
- Selenium Browser Control:
  - Launches and manages Chrome sessions inside a Dockerized Selenium container.
  - Supports proxy configuration for bypassing restrictions.
  - Can inject cookies for scraping logged-in pages.
- Anti-Bot Evasion: Modifies WebDriver fingerprints to avoid detection.
- Dynamic Page Handling: Resizes browser window, refreshes pages, and ensures page load stability.
- AI-Powered Data Extraction:
  - Uses OpenAI GPT-4o / GPT-4o-mini to analyze screenshots and extract structured data.
  - Extracts multiple attributes (up to 5 custom data points).
  - Handles cases where no relevant data is found.
- Error & Block Handling:
  - Returns clear JSON responses if the request is blocked, cookies don’t match, or pages crash.
  - Captures screenshots for debugging when issues occur.
- Proxy Debugging: Built-in flow to verify your scraping IP via ip-api.com.
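The webhook trigger above expects a JSON body carrying the subject, domain, target URL, and data fields. A sketch of how such a payload might be validated; the exact field names (`subject`, `domain`, `target_url`, `data_fields`) are assumptions based on the feature description:

```python
def validate_payload(payload: dict) -> dict:
    """Check a webhook payload for the fields the workflow needs.
    Field names are illustrative, not the workflow's actual schema."""
    required = ("subject", "domain")
    missing = [k for k in required if not payload.get(k)]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    fields = payload.get("data_fields", [])
    if len(fields) > 5:  # the workflow extracts at most 5 custom data points
        raise ValueError("at most 5 data fields are supported")
    return {
        "subject": payload["subject"],
        "domain": payload["domain"],
        "target_url": payload.get("target_url"),  # optional: found via search if absent
        "data_fields": fields,
    }
```

Leaving `target_url` optional lets the search step fill it in when the caller only knows the domain.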
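The Smart URL Extraction step amounts to narrowing search results to the requested domain before the AI picks the most relevant candidate. A stdlib-only sketch of the domain-filtering half (the AI re-ranking step is omitted):

```python
from urllib.parse import urlparse

def filter_results_to_domain(result_urls, domain):
    """Keep only search-result URLs that belong to the target domain
    (or one of its subdomains); the AI then picks the best match."""
    matches = []
    for url in result_urls:
        host = urlparse(url).netloc.lower()
        if host == domain or host.endswith("." + domain):
            matches.append(url)
    return matches
```

Matching on the parsed hostname rather than a substring avoids false positives like `notexample.com`.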
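Anti-bot evasion usually means masking the `navigator.webdriver` fingerprint before any page script runs. A common Selenium pattern is shown below; `Page.addScriptToEvaluateOnNewDocument` is a real Chrome DevTools Protocol command, but the exact script the workflow injects is an assumption:

```python
STEALTH_JS = """
Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
window.chrome = window.chrome || {runtime: {}};
"""

def stealth_script() -> str:
    """Return the fingerprint-masking JavaScript that should run in
    every new document before the page's own scripts execute."""
    return STEALTH_JS

# With a live Chrome session, the script is registered via CDP:
# driver.execute_cdp_cmd(
#     "Page.addScriptToEvaluateOnNewDocument",
#     {"source": stealth_script()},
# )
```

Registering the script at the CDP level (rather than via `execute_script` after load) ensures detection code that runs on page load never sees the automation flag.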
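The block and error handling described above needs some way to decide which JSON response to return. One plausible approach is scanning the rendered page text for block indicators; the marker strings and status names here are assumptions, not the workflow's actual logic:

```python
BLOCK_MARKERS = ("access denied", "captcha", "unusual traffic", "are you a robot")

def classify_page(page_text: str) -> str:
    """Map rendered page text to a status the workflow can return as JSON:
    'blocked', 'login_required', or 'ok'. Marker strings are illustrative."""
    lowered = page_text.lower()
    if any(marker in lowered for marker in BLOCK_MARKERS):
        return "blocked"
    if "sign in" in lowered or "log in" in lowered:
        return "login_required"  # likely a cookie mismatch
    return "ok"
```

Anything other than `"ok"` would also trigger the debug screenshot capture mentioned above.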
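Screenshot analysis with GPT-4o works by sending the image as a base64 data URL inside a chat message. Building that request body is plain dictionary work; the model choice and prompt wording below are illustrative:

```python
import base64

def build_vision_request(screenshot_png: bytes, data_fields: list) -> dict:
    """Assemble a Chat Completions request asking the model to read the
    screenshot and return the requested attributes as JSON."""
    image_b64 = base64.b64encode(screenshot_png).decode("ascii")
    prompt = (
        "Extract the following fields from this page screenshot and "
        f"reply with JSON only: {', '.join(data_fields)}. "
        "Use null for anything you cannot find."
    )
    return {
        "model": "gpt-4o-mini",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    }
```

Instructing the model to return `null` for missing fields is what makes the "no relevant data found" case easy to handle downstream.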
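For the proxy debugging flow, ip-api.com answers with a small JSON document; `status`, `query`, `country`, and `isp` are the service's documented response keys. Parsing that response to confirm the egress IP might look like this:

```python
import json

def summarize_ip_check(raw: str) -> str:
    """Turn an ip-api.com JSON response into a one-line summary so the
    workflow can confirm which IP address the proxy actually exposes."""
    data = json.loads(raw)
    if data.get("status") != "success":
        return f"lookup failed: {data.get('message', 'unknown error')}"
    return f"{data['query']} ({data.get('country', '?')}, {data.get('isp', '?')})"

# Fetching the raw response could be as simple as:
# urllib.request.urlopen("http://ip-api.com/json").read().decode()
```

Running this once through the proxy and once without it is a quick way to verify the proxy configuration is actually in effect.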