TOPIA — Job Aggregator Pipeline
A production-deployed, full-stack job-aggregation pipeline that collects, normalizes, deduplicates, and serves remote job listings. It uses a distributed architecture built on MongoDB Atlas and FastAPI, with a locally scheduled scraper running on a residential IP to bypass cloud IP bans.
Description
A full-stack pipeline that collects remote job listings from external sources, normalizes and deduplicates them into MongoDB Atlas, and serves them through a read-only FastAPI service to a React frontend.
Problem Solved
Building a fault-tolerant, idempotent data pipeline that separates write and read paths while handling external API inconsistencies and cloud deployment limitations.
Architecture
Local Scheduler (Windows Task Scheduler) → Python Scraper (Residential IP) → MongoDB Atlas (Cloud DB) → FastAPI (Railway, read-only) → React Frontend (Vercel). The architecture highlights full separation of concerns, a stateless API, and a persistent external database.
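The normalize → deduplicate stages of the scraper can be sketched as below. This is a minimal stdlib-only sketch: the field names (`title`, `company`, `url`) and the key derivation are illustrative assumptions, not the project's actual schema.

```python
import hashlib


def normalize(raw: dict) -> dict:
    """Coerce an inconsistent external listing into a canonical shape.
    Field names here are illustrative assumptions."""
    return {
        "title": (raw.get("title") or "").strip().lower(),
        "company": (raw.get("company") or "").strip().lower(),
        "url": (raw.get("url") or "").strip(),
    }


def job_key(job: dict) -> str:
    """Stable key derived from normalized fields, so identical listings
    seen on different scrape runs hash to the same key."""
    basis = f'{job["company"]}|{job["title"]}|{job["url"]}'
    return hashlib.sha256(basis.encode()).hexdigest()


def dedupe(raw_listings: list[dict]) -> dict[str, dict]:
    """Normalize every listing and keep one record per stable key."""
    unique: dict[str, dict] = {}
    for raw in raw_listings:
        job = normalize(raw)
        unique[job_key(job)] = job
    return unique


listings = [
    {"title": "Backend Engineer ", "company": "Acme", "url": "https://a.co/1"},
    {"title": "backend engineer", "company": "ACME ", "url": "https://a.co/1"},
]
print(len(dedupe(listings)))  # → 1: the two variants collapse into one record
```

Deriving the key from normalized fields (rather than the raw payload) is what absorbs the external APIs' formatting inconsistencies before anything reaches the database.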
Tech Stack
- Python (scraper, local write path)
- FastAPI on Railway (read-only API)
- MongoDB Atlas (persistent cloud database)
- React on Vercel (frontend)
- Windows Task Scheduler (local scheduling)
Scalability
Designed for fault tolerance and distributed workload segregation. By detaching the write path (residential scraper) from the read path (ephemeral cloud API), the system scales API instances independently while maintaining strict idempotency against a centralized Atlas cluster.
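The idempotency claim can be illustrated with an upsert keyed on a stable job ID. In the real system this would be a MongoDB upsert (e.g. pymongo's `update_one(..., upsert=True)`); here a plain dict stands in for the Atlas collection so the retry-safety property is easy to see. The key and record shape are illustrative assumptions.

```python
def upsert_job(store: dict, key: str, job: dict) -> None:
    """Upsert keyed on a stable job ID: insert if absent, overwrite if
    present. Replaying the same scrape leaves the record count unchanged,
    which is what makes the write path safe to retry after a failure.
    Against MongoDB this would be roughly:
    collection.update_one({"_id": key}, {"$set": job}, upsert=True)"""
    store[key] = job


store: dict = {}
job = {"title": "backend engineer", "company": "acme"}
for _ in range(3):  # three identical scrape runs
    upsert_job(store, "acme|backend engineer", job)
print(len(store))  # → 1: replays do not duplicate records
```

Because every write converges on the same key, the scraper can crash mid-run and simply rerun, and extra API instances can be added on the read side without any coordination with the writer.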
Architecture Breakdown
Engineering Decisions
Architecture Transformations:
- SQLite → MongoDB Atlas (fixed data loss)
- Cloud scraping → local scraping (fixed cloud IP bans)
- Monolithic flow → separated pipeline
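The local write path is triggered by Windows Task Scheduler. A registration along these lines would run the scraper daily from the residential machine (task name, script path, and start time are illustrative placeholders, not the project's actual values):

```shell
schtasks /Create /SC DAILY /TN "TopiaScraper" ^
  /TR "python C:\topia\scraper.py" /ST 06:00
```

Scheduling on the local machine rather than in the cloud is what keeps the scraper on a residential IP while the API and frontend stay fully stateless.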