Full Stack Software Engineer

Knokr Fetch

Intelligent Event Data Extraction & Verification System

The Problem

Knokr tracks 52K artists across 1,400+ festivals, but tour date data lives on thousands of individual artist websites in wildly different formats — JSON-LD structured data, Bandsintown widgets, Songkick embeds, Seated iframes, custom HTML tables, React SPAs, and plain text listings. No single API covers the full catalog. Information is fragmented across dozens of platforms, leading to duplicates, missing data, and inconsistency. Manual data collection doesn't scale. The system needed to automatically detect the page format, extract structured event data, handle JavaScript-rendered content, and keep data current without re-processing unchanged pages.

What I Built

A production-grade Python system built on FastAPI that implements a hierarchical, multi-strategy extraction pipeline. The core fetcher (4,142 lines) adapts its approach based on what it finds on each artist's website — trying the most reliable extraction method first and falling back through progressively less deterministic options.

The system runs headless Chrome via Selenium for JavaScript-rendered widgets, uses the Anthropic Claude API as a last-resort fallback for unstructured pages, and processes all extracted events through a cleaning pipeline that handles 13+ international date formats and deduplicates by content rather than URL. Jobs run in parallel under asyncio with progress streamed over WebSocket, and each artist fetch runs in an isolated subprocess so that one failing artist site can't affect a batch of thousands.

Extraction Strategy Cascade

The fetcher implements a six-level hierarchy, exhausting free and deterministic methods before reaching for AI:

JSON-LD Structured Data — The most reliable path. Parses application/ld+json tags for schema.org MusicEvent and Event types. Falls back to microdata when JSON-LD isn't present. No JavaScript rendering needed.
Bandsintown Extraction — Detects widgets via script/iframe src attributes and CSS class patterns. Handles iframe content switching and shadow DOM traversal. Can call the REST API directly.
Songkick Extraction — Identifies widgets through platform-specific classes and attributes. Parses event containers with custom wait conditions.
Seated Extraction — Detects embeds and calls the widget API. Extracts sold-out status from CSS patterns, merges tour name metadata.
Custom HTML Parsing — Generic pattern matching for sites that don't use standard widgets. CSS selectors and regex to extract date/venue pairs from raw HTML.
LLM Fallback (Claude API) — When all structured approaches fail, preprocessed HTML is sent to Claude. Cost-optimized with stripped tags, haiku model default, CSV output format, and automatic escalation to sonnet for complex pages.
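
The cascade above can be sketched roughly as follows. This is an illustrative outline, not the production fetcher: `run_cascade`, `extract_json_ld`, and the event dictionary shape are hypothetical names, and a regex stands in for a real HTML parser such as BeautifulSoup to keep the example dependency-free. Only the first level is implemented here; the remaining levels would be supplied as further (name, function) pairs in the same order as the list.

```python
import json
import re
from typing import Callable, Optional

# Hypothetical event shape: a plain dict with name/date/venue keys.
Event = dict
Strategy = Callable[[str], list]

# Assumption: a regex is good enough for this sketch; real pages need an HTML parser.
_JSON_LD_RE = re.compile(
    r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.IGNORECASE | re.DOTALL,
)
_EVENT_TYPES = {"MusicEvent", "Event"}


def extract_json_ld(html: str) -> list:
    """Level 1: collect schema.org MusicEvent/Event objects from JSON-LD tags."""
    events = []
    for raw in _JSON_LD_RE.findall(html):
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed block: skip it and keep scanning
        items = data if isinstance(data, list) else [data]
        for item in items:
            if not isinstance(item, dict):
                continue
            types = item.get("@type")
            types = types if isinstance(types, list) else [types]
            if not any(isinstance(t, str) and t in _EVENT_TYPES for t in types):
                continue
            location = item.get("location")
            venue = location.get("name") if isinstance(location, dict) else None
            events.append(
                {"name": item.get("name"), "date": item.get("startDate"), "venue": venue}
            )
    return events


def run_cascade(html: str, strategies: list) -> tuple:
    """Return (strategy_name, events) from the first strategy that yields events."""
    for name, strategy in strategies:
        try:
            events = strategy(html)
        except Exception:
            continue  # a failing strategy falls through to the next level
        if events:
            return name, events
    return None, []
```

Ordering the list from cheapest and most deterministic to most expensive means the LLM call is only reached when every earlier level returns nothing.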

Key Technical Decisions

Deduplication by content, not URL — Events are keyed on (artist_id, venue, date, time) rather than source URL. A hard-won lesson: 25 Ed Sheeran events all shared one URL and were being collapsed into one record. Content-based keying catches true duplicates regardless of page structure.
Process isolation per artist — Each artist fetch runs as an independent subprocess with a 5-minute timeout. A crash or unexpected HTML on one site cannot affect the batch. The parallel job manager uses asyncio.Semaphore for concurrency control (1-20 simultaneous) and asyncio.to_thread() to bridge blocking Selenium calls to the async event loop.
LLM as last resort, not first choice — The Claude fallback is powerful but costs money and time. The system exhausts free, deterministic extraction methods first. HTML is preprocessed to strip non-content elements, reducing token cost. Default model is haiku for cost efficiency with automatic fallback to sonnet for complex pages.
Selenium for JavaScript rendering — Many artist sites render event widgets client-side. Headless Chrome handles iframe extraction, content switching, and SPAs with platform-specific wait strategies: Bandsintown (4-7s), Seated (up to 12s on retry), Songkick (4-7s).
13+ date format normalization — ISO, US, EU, textual, relative dates (Today, Tomorrow), and TBA handling. Location parsing splits raw strings into city/state/country with normalization for all 50 US states, 40+ countries, and common aliases.
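
A minimal sketch of the content-based deduplication described above. The key fields (artist_id, venue, date, time) come from the text; the `Event` dataclass, the `content_key` and `dedupe` names, and the lowercase/whitespace normalization of the venue are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Event:
    artist_id: int
    venue: str
    date: str             # ISO date, e.g. "2025-07-04"
    time: Optional[str]   # "19:30", or None when unknown
    source_url: str       # deliberately NOT part of the key


def content_key(event: Event) -> tuple:
    """Key an event on what it is, not where it was found."""
    # Assumed normalization: lowercase and collapse whitespace in the venue name.
    venue = " ".join(event.venue.lower().split())
    return (event.artist_id, venue, event.date, event.time or "")


def dedupe(events: Iterable[Event]) -> list:
    """Keep the first event seen for each content key, preserving input order."""
    seen = set()
    unique = []
    for event in events:
        key = content_key(event)
        if key not in seen:
            seen.add(key)
            unique.append(event)
    return unique
```

With this key, many shows listed on one tour page stay distinct because their venues or dates differ, while the same show scraped twice collapses to one record even if the venue string differs only in casing or spacing.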

Python Architecture

Component	Implementation
Web Server	FastAPI + Uvicorn, async request handling, CORS, WebSocket streaming
Job Orchestration	asyncio.Semaphore concurrency, asyncio.to_thread() bridging, dataclass job modeling
Core Fetcher	4,142-line multi-strategy engine, 6 extraction methods, platform-specific parsers
Browser Automation	Selenium WebDriver, headless Chrome, iframe traversal, configurable wait strategies
LLM Integration	Anthropic SDK, HTML preprocessing, model fallback (haiku → sonnet), rate limit handling
Data Pipeline	Parse → Clean → Deduplicate → Normalize → Export CSV → Stage to queue
Persistence	PostgreSQL (psycopg2) for production data, SQLite for local job history
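
The job orchestration row can be illustrated with a small sketch combining asyncio.Semaphore, asyncio.to_thread(), and one subprocess per artist with a timeout. The function names, the result dictionary, and the command passed to each subprocess are hypothetical; the 5-minute timeout and the 1-20 concurrency range are taken from the text. Here the blocking call being bridged is the subprocess wait, which is an assumption about how the pieces fit together.

```python
import asyncio
import subprocess
import sys

FETCH_TIMEOUT_S = 300  # 5-minute per-artist limit


def fetch_artist_blocking(artist_id: int, cmd: list, timeout: float = FETCH_TIMEOUT_S) -> dict:
    """Run one artist fetch in its own subprocess. Blocks until it exits or times out."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left running.
        return {"artist_id": artist_id, "ok": False, "error": "timeout"}
    if proc.returncode != 0:
        return {"artist_id": artist_id, "ok": False, "error": f"exit {proc.returncode}"}
    return {"artist_id": artist_id, "ok": True, "output": proc.stdout.strip()}


async def run_batch(jobs: list, max_concurrency: int = 5) -> list:
    """Run (artist_id, cmd) jobs with at most max_concurrency in flight."""
    if not 1 <= max_concurrency <= 20:
        raise ValueError("max_concurrency must be between 1 and 20")
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(artist_id: int, cmd: list) -> dict:
        async with semaphore:
            # to_thread keeps the event loop free while the subprocess runs.
            return await asyncio.to_thread(fetch_artist_blocking, artist_id, cmd)

    # gather returns results in the same order as the input jobs.
    return await asyncio.gather(*(run_one(a, c) for a, c in jobs))
```

A crash in one child process shows up as a failed result for that artist only; the other jobs in the batch continue unaffected.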

Accuracy

Source / Capability	Accuracy / Coverage
JSON-LD / Widget APIs	99%+
LLM date extraction	~85%
LLM venue/location data	~90%
Date format handling	13+ formats (ISO, US, EU, textual, relative)

Known Limitations & Next Steps

Some artist tour pages use third-party ticketing iframes that don't expose event data in the accessible HTML. Ambiguous numeric dates (MM/DD vs DD/MM) require locale-aware parsing, which is not yet fully implemented for all regions. LLM extraction costs scale with the number of artists whose sites lack widget integrations. Selenium browser instances consume significant memory at high concurrency.

Planned improvements include headless browser pooling for memory efficiency, expanded widget detection for newer ticketing platforms, and a validation feedback loop where downstream data quality metrics inform upstream extraction tuning.
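
The MM/DD vs DD/MM ambiguity noted above can be made concrete with a small sketch. It is not the production normalizer and covers only a handful of formats: `normalize_date`, the `day_first` locale hint, and the list of textual formats are all assumptions for illustration. A numeric date is treated as unambiguous when one field exceeds 12; otherwise the locale hint decides.

```python
import re
from datetime import date, datetime, timedelta
from typing import Optional

_NUMERIC = re.compile(r"^(\d{1,2})[/.-](\d{1,2})[/.-](\d{4})$")
# Assumed subset of textual formats; month names are parsed in the current locale.
_TEXTUAL_FORMATS = ("%B %d, %Y", "%b %d, %Y", "%d %B %Y", "%d %b %Y")


def normalize_date(raw: str, day_first: bool = False, today: Optional[date] = None) -> Optional[str]:
    """Return an ISO date string, or None for TBA and unparseable input."""
    text = raw.strip()
    lowered = text.lower()
    today = today or date.today()

    if lowered in {"", "tba", "tbd"}:
        return None
    if lowered == "today":
        return today.isoformat()
    if lowered == "tomorrow":
        return (today + timedelta(days=1)).isoformat()

    try:
        return date.fromisoformat(text).isoformat()
    except ValueError:
        pass

    match = _NUMERIC.match(text)
    if match:
        first, second, year = (int(group) for group in match.groups())
        if first > 12 and second <= 12:
            day, month = first, second        # can only be DD/MM
        elif second > 12 and first <= 12:
            month, day = first, second        # can only be MM/DD
        else:
            # Genuinely ambiguous: fall back to the locale hint.
            day, month = (first, second) if day_first else (second, first)
        try:
            return date(year, month, day).isoformat()
        except ValueError:
            return None

    for fmt in _TEXTUAL_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None
```

In practice the `day_first` hint would have to come from somewhere, for example the artist's country or the page's language attribute, which is the part described above as not yet complete for all regions.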

Technology Stack

Python
FastAPI
asyncio
Selenium
BeautifulSoup 4
Anthropic Claude API
httpx
psycopg2
SQLite
WebSocket
PostgreSQL