Technical methodology

How Chain Breaker works

A technical overview of the data pipeline: from web scraping through AI analysis to anonymized research outputs. Every step is designed for reproducibility, auditability, and privacy preservation.

1. Data collection

Chain Breaker monitors 24+ adult service websites across 20 countries. For each website, two automated processes run on independent schedules:

Crawlers

Navigate listing pages to discover advertisement URLs. Crawlers follow pagination links, apply country and category filters, and collect every ad link visible on the site. Each crawler is a Docker container running on AWS Fargate, triggered by EventBridge Scheduler on per-website cron schedules.

Scrapers

Extract structured data from each advertisement detail page: phone numbers, text descriptions, images, rates, reviews, location, and physical attributes. Scrapers are triggered after crawlers complete and write raw JSONL to S3.

2. Data processing

A daily ETL (Extract, Transform, Load) pipeline processes raw JSONL into structured, queryable Parquet tables. The pipeline runs at 7:00 AM UTC via AWS Step Functions and processes all 24+ websites in parallel.

Phone number extraction and normalization

Phone numbers are extracted from ad text using regex patterns covering 200+ international formats. Each number is parsed, validated against the length and prefix rules of its country's numbering plan, and normalized to E.164 format (e.g., +15555550123, with no spaces or punctuation). Country calling codes are cross-referenced against the ad's declared country to catch mismatches.
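The extract, normalize, and cross-check steps can be sketched roughly as follows. This is a minimal illustration, not the production code: the single candidate regex, the three-entry CALLING_CODES table, and the function names are all assumptions made for the example, and a real pipeline would rely on full numbering-plan metadata (for instance the phonenumbers library) rather than a length check.

```python
import re
from typing import List, Optional

# Hypothetical, deliberately tiny calling-code table (assumption for this sketch).
CALLING_CODES = {"US": "1", "GB": "44", "DE": "49"}

# One permissive candidate pattern: optional "+", then digits separated by
# spaces, dots, dashes, or parentheses. The real pipeline uses many patterns.
CANDIDATE = re.compile(r"\+?\(?\d[\d\s().-]{6,}\d")


def to_e164(raw: str, declared_country: str) -> Optional[str]:
    """Return an E.164 string, or None if the candidate is implausible."""
    digits = re.sub(r"\D", "", raw)
    if raw.strip().startswith("+"):
        number = "+" + digits
    else:
        code = CALLING_CODES.get(declared_country)
        if code is None:
            return None
        # Simplification: drop a leading trunk "0" before adding the country code.
        number = "+" + code + digits.lstrip("0")
    # E.164 allows at most 15 digits after the "+"; 8 is an arbitrary lower bound.
    if not 8 <= len(number) - 1 <= 15:
        return None
    return number


def country_mismatch(e164: str, declared_country: str) -> bool:
    """True when the calling code disagrees with the ad's declared country."""
    code = CALLING_CODES.get(declared_country)
    return code is not None and not e164.startswith("+" + code)


def extract_numbers(text: str, declared_country: str) -> List[str]:
    """Find candidates in free text and keep those that normalize cleanly."""
    found = []
    for match in CANDIDATE.finditer(text):
        number = to_e164(match.group(), declared_country)
        if number is not None:
            found.append(number)
    return found
```

For an ad declared as US with the text "Call (555) 555-0123 or +44 20 7946 0958", the sketch yields two normalized numbers and flags the second as a country mismatch.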

Two-tier hashing

Every ad row receives two cryptographic hashes. The content_hash (SHA-256 of title, ad ID, phone, country, link, source URL, and text) detects whether an ad's content has changed across days. The ad_hash (SHA-256 of content_hash plus date_retrieved) serves as the primary key for table joins. Because content_hash stays constant while an ad is unchanged and ad_hash differs for every retrieval date, the scheme separates entity identity from row identity, enabling change detection and deduplication.
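A minimal sketch of the two hashes, using only Python's hashlib. The dictionary keys, the field order, and the unit-separator character used to join fields are assumptions for illustration; the text above specifies only which fields go into each hash.

```python
import hashlib

# Field names and order are assumptions; the source lists the fields only in prose.
CONTENT_FIELDS = ("title", "ad_id", "phone", "country", "link", "source_url", "text")


def content_hash(ad: dict) -> str:
    """SHA-256 over the ad's content fields; stable while the ad is unchanged."""
    # Join with a separator so ("ab", "c") and ("a", "bc") hash differently.
    joined = "\x1f".join(str(ad.get(field, "")) for field in CONTENT_FIELDS)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def ad_hash(ad: dict, date_retrieved: str) -> str:
    """SHA-256 of content_hash plus retrieval date; unique per daily row."""
    payload = content_hash(ad) + date_retrieved
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Two snapshots of the same unchanged ad taken on different days share a content_hash but get different ad_hash values, which is what makes day-over-day change detection a simple equality check.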

Standardization

Categorical columns (nationality, ethnicity, gender, eye color, hair color, sexual orientation, and others) are fuzzy-matched against curated master tables to correct misspellings, synonyms, and inconsistent formatting. Country names are standardized to ISO 3166-1 codes, and city names to a canonical form. Currency codes in rates are matched against ISO 4217 (155 currencies). State and province names are mapped to ISO 3166-2 subdivision codes.
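Fuzzy matching against a master table can be illustrated with the standard library's difflib. The master list, the synonym map, and the 0.8 similarity cutoff below are made-up values for the example; the actual matcher and thresholds are not described above.

```python
import difflib
from typing import Dict, List, Optional

# Hypothetical master table and synonym map for one categorical column.
HAIR_COLORS = ["blonde", "brown", "black", "red", "grey"]
HAIR_SYNONYMS = {"brunette": "brown", "ginger": "red"}


def standardize(
    value: str,
    master: List[str],
    synonyms: Optional[Dict[str, str]] = None,
    cutoff: float = 0.8,
) -> Optional[str]:
    """Map a free-text value to a master-table entry, or None if no match."""
    cleaned = value.strip().lower()
    # Synonyms are resolved by exact lookup before any fuzzy comparison.
    if synonyms and cleaned in synonyms:
        return synonyms[cleaned]
    # get_close_matches ranks candidates by SequenceMatcher similarity.
    matches = difflib.get_close_matches(cleaned, master, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

Returning None for values below the cutoff, rather than the nearest entry, keeps unrecognized inputs visible for manual review instead of silently forcing them into a category.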

3. AI analysis

Text embeddings

Ad titles and text are embedded using Cohere's embed-multilingual-v3 model via Amazon Bedrock. Embeddings are cached by content_hash (7-day TTL) so unchanged ads skip recomputation. These vector representations enable semantic similarity search across languages and detection of ads with near-identical text posted under different phone numbers.
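The caching behavior can be sketched as a small in-memory store keyed by content_hash. Everything here is illustrative: EmbeddingCache and its parameters are hypothetical names, the embedding call is injected as a plain function rather than a real Bedrock request, and the actual cache backend is not stated above.

```python
import time
from typing import Callable, Dict, List, Tuple

SEVEN_DAYS = 7 * 24 * 60 * 60  # TTL in seconds


class EmbeddingCache:
    """Cache embeddings by content_hash so unchanged ads are not re-embedded."""

    def __init__(
        self,
        embed_fn: Callable[[str], List[float]],
        ttl_seconds: float = SEVEN_DAYS,
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._embed_fn = embed_fn  # stand-in for the real model call (assumption)
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested
        self._store: Dict[str, Tuple[float, List[float]]] = {}

    def get(self, content_hash: str, text: str) -> List[float]:
        now = self._clock()
        entry = self._store.get(content_hash)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: entry has not expired
        vector = self._embed_fn(text)  # miss or expired: recompute
        self._store[content_hash] = (now + self._ttl, vector)
        return vector
```

Keying on content_hash rather than on the ad ID means an edited ad misses the cache automatically, since any change to its content produces a new hash.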

Community detection

Communities of related ads are detected through a multi-modal graph approach. Phone numbers and email addresses form the graph edges: two ads that share a contact method are linked. Community structure is identified through:

  • Contact graph clustering: Phone number co-occurrence across ads and websites
  • UMAP dimensionality reduction: Embedding vectors projected to 2D for density-based clustering
  • HDBSCAN clustering: Density-based clustering that handles noise and varying cluster sizes
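The contact-graph part of this step can be illustrated with a union-find pass that groups ads sharing any phone number or email address. This sketch uses plain connected components as a stand-in for the clustering described above; the function name and input shape are assumptions, and the UMAP and HDBSCAN stages, which require third-party libraries, are not shown.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def contact_communities(ads: Dict[str, Iterable[str]]) -> List[Set[str]]:
    """Group ad IDs into connected components linked by shared contacts.

    `ads` maps each ad ID to the phone numbers and emails found in it.
    """
    parent = {ad_id: ad_id for ad_id in ads}

    def find(x: str) -> str:
        # Walk to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a: str, b: str) -> None:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Link every ad to the first ad seen with the same contact method.
    first_seen: Dict[str, str] = {}
    for ad_id, contacts in ads.items():
        for contact in contacts:
            if contact in first_seen:
                union(first_seen[contact], ad_id)
            else:
                first_seen[contact] = ad_id

    groups: Dict[str, Set[str]] = defaultdict(set)
    for ad_id in ads:
        groups[find(ad_id)].add(ad_id)
    # Largest communities first; ties broken by smallest ad ID.
    return sorted(groups.values(), key=lambda g: (-len(g), min(g)))
```

Note that two ads end up in the same community even without a contact in common, as long as a chain of shared contacts connects them.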

Phone risk scoring

Each phone number receives a risk score computed from its community context. The score incorporates:

  • Community size: How many ads share this phone number
  • Cross-platform spread: How many different websites the number appears on
  • Geographic spread: How many cities or countries the number spans
  • Temporal patterns: Growth rate of the number's community over time
  • Community tightness: Density of interconnections within the number's network

Scores are written to DynamoDB for real-time lookup via the Phone Search API and recalculated daily to reflect new data.
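How the five factors might combine into a single number is sketched below. The weights, the saturation scales, the 0 to 100 range, and all names are invented for illustration; the actual scoring formula is not specified above, and the DynamoDB write is omitted.

```python
import math
from dataclasses import dataclass


@dataclass
class PhoneContext:
    community_size: int   # ads sharing this phone number
    website_count: int    # distinct websites the number appears on
    city_count: int       # distinct cities or countries spanned
    growth_rate: float    # fractional community growth (window is an assumption)
    density: float        # edge density of the number's network, 0..1


# Hypothetical weights; they sum to 1.0 so the score stays within 0..100.
WEIGHTS = {"size": 0.30, "platforms": 0.25, "geography": 0.20,
           "growth": 0.15, "density": 0.10}


def _saturate(count: float, scale: float) -> float:
    """Map a non-negative count into [0, 1) with diminishing returns."""
    return 1.0 - math.exp(-count / scale)


def _clamp(value: float) -> float:
    return min(max(value, 0.0), 1.0)


def risk_score(ctx: PhoneContext) -> float:
    """Weighted sum of normalized features, scaled to 0..100."""
    features = {
        # Subtract 1 so a number seen in one ad, on one site, scores zero.
        "size": _saturate(max(ctx.community_size - 1, 0), 10.0),
        "platforms": _saturate(max(ctx.website_count - 1, 0), 3.0),
        "geography": _saturate(max(ctx.city_count - 1, 0), 5.0),
        "growth": _clamp(ctx.growth_rate),
        "density": _clamp(ctx.density),
    }
    return round(100 * sum(WEIGHTS[k] * features[k] for k in WEIGHTS), 1)
```

The saturating transform is a design choice for the sketch: it keeps a single very large community from dominating the score while still ranking it above smaller ones.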

4. Anonymization

Before any data is shared with researchers, a rigorous anonymization pipeline removes or generalizes all identifying information. This pipeline produces the datasets published on Harvard Dataverse.

Direct identifier removal

Phone numbers, email addresses, ad URLs, raw text content, and images are stripped entirely. The ad_hash is replaced with anon_hash: the HMAC-SHA256 of ad_hash under a secret key, with the output truncated to 16 bytes. Without the secret key, linking an anon_hash back to the original ad is computationally infeasible.
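The keyed hash can be sketched with Python's hmac module. The function signature, the UTF-8 encoding of ad_hash, and the hex output format are assumptions; the text above fixes only the algorithm and the 16-byte truncation.

```python
import hashlib
import hmac


def anon_hash(ad_hash: str, secret_key: bytes) -> str:
    """HMAC-SHA256 of ad_hash, truncated to 16 bytes and hex-encoded."""
    digest = hmac.new(secret_key, ad_hash.encode("utf-8"), hashlib.sha256).digest()
    # Keep the first 16 of the 32 output bytes (32 hex characters).
    return digest[:16].hex()
```

Using a keyed HMAC rather than a plain hash matters here: anyone who held a copy of the original ad_hash values could otherwise recompute an unkeyed hash and rejoin the datasets.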

k-anonymity (k ≥ 5)

Quasi-identifiers (age, height, weight, city) are generalized into bands: ages become 5-year ranges, heights and weights become decile bands, and cities are aggregated to region or country level. Dates are truncated to month granularity. Rows whose generalized quasi-identifier combination is shared by fewer than 5 records are suppressed entirely.

This guarantees that every combination of quasi-identifiers in the published dataset matches at least 5 records, which makes re-identification statistically unreliable.
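Generalization followed by suppression can be sketched in a few lines. To stay short, this example covers only age, region, and date; the row keys and function names are assumptions, the city-to-region mapping is taken as already done, and the height and weight decile bands are left out.

```python
from collections import Counter
from typing import Dict, List, Tuple

K = 5  # minimum group size


def generalize(row: Dict) -> Tuple[str, str, str]:
    """Reduce a row to its generalized quasi-identifier combination."""
    age_low = (row["age"] // 5) * 5
    age_band = f"{age_low}-{age_low + 4}"  # 5-year range, e.g. "20-24"
    month = row["date"][:7]                # "YYYY-MM-DD" -> "YYYY-MM"
    return (age_band, row["region"], month)


def k_anonymize(rows: List[Dict], k: int = K) -> List[Dict]:
    """Generalize every row, then drop rows in groups smaller than k."""
    keyed = [generalize(row) for row in rows]
    counts = Counter(keyed)
    return [
        {"age_band": key[0], "region": key[1], "month": key[2]}
        for key in keyed
        if counts[key] >= k  # suppress rare combinations entirely
    ]
```

Counting happens after generalization, so a row survives only if at least k rows share its banded values, not its raw ones.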

Data Use Agreement

All researchers accessing the anonymized dataset must sign a Data Use Agreement that prohibits re-identification attempts, restricts redistribution, requires citation, and mandates secure data handling. The agreement is enforced through Harvard Dataverse's restricted access system.

5. Outputs and access

Chain Breaker produces three tiers of output for different audiences:

Investigation tools

Phone number search, network graph exploration, case management, and court-ready PDF evidence exports. Available to vetted law enforcement and NGO partners.

Observability dashboard

Pipeline health monitoring, data quality metrics, and aggregate statistics. Used internally and shared with partners to demonstrate data coverage and freshness.

Research dataset

Anonymized Parquet tables on Harvard Dataverse with ads, rates, reviews, and embeddings. Available to academic researchers under a signed Data Use Agreement.