We built a batch processing pipeline that pulls Open Educational Resources from over 50 repositories and ranks them by quality. Faculty can find free textbook replacements without spending a weekend searching.
Faculty across the institution wanted to adopt Open Educational Resources to cut student costs, but the ecosystem was a mess. Dozens of OER repositories existed (OpenStax, OER Commons, MERLOT, MIT OpenCourseWare, and others), each with different interfaces, metadata schemas, and quality levels. Finding a replacement for a $200 textbook meant manually searching each platform one at a time.
Even when faculty found OER materials, there was no good way to judge quality. A resource might look promising from its title but turn out to be a poorly formatted PDF with outdated content. Without any shared evaluation system, every faculty member started from scratch, repeating the same frustrating search their colleagues had already given up on.
The institution needed a way to pull resources from every major OER repository, normalize the metadata into one searchable index, and surface the best materials automatically. Faculty should be teaching, not searching.
Dozens of OER sources with different interfaces and metadata formats. No single place to search them all.
Faculty couldn't distinguish high-quality OER from outdated or poorly structured materials without downloading and reviewing each one.
Students were spending hundreds per semester on required textbooks. Free alternatives existed but were nearly impossible to find at scale.
Each faculty member spent hours repeating the same searches across the same repositories. None of that work could be reused.
We built a Python batch processing system with modular connectors for 50+ OER repositories. Each connector handles a source's unique API or scraping requirements and pulls resource metadata, download links, and PDFs into a unified staging area.
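The connector pattern described above can be sketched as a small Python interface. The class and field names here are illustrative assumptions, not the system's actual code: each connector encapsulates one source's quirks and emits records in a common staging shape.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field

# Hypothetical staging record -- field names are assumptions for illustration.
@dataclass
class RawResource:
    source: str
    title: str
    download_url: str
    metadata: dict = field(default_factory=dict)

class Connector(ABC):
    """One connector per repository, handling that source's API or scraping quirks."""
    name: str

    @abstractmethod
    def fetch(self) -> list[RawResource]:
        """Pull resource metadata and download links for staging."""

class OpenStaxConnector(Connector):
    name = "openstax"

    def fetch(self) -> list[RawResource]:
        # A real connector would call the source's API or scrape its pages;
        # this stub just demonstrates the contract.
        return [RawResource(
            source=self.name,
            title="Psychology 2e",
            download_url="https://openstax.org/details/books/psychology-2e",
            metadata={"license": "CC BY 4.0"},
        )]

def harvest(connectors: list[Connector]) -> list[RawResource]:
    """Run every connector and pool results into one staging list."""
    staged: list[RawResource] = []
    for connector in connectors:
        staged.extend(connector.fetch())
    return staged
```

The key design choice is that the pipeline only ever sees `RawResource` objects, so adding repository number 51 means writing one new `fetch` method and nothing else.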
Every source uses different metadata formats. Some follow Dublin Core, others use custom schemas, and many have inconsistent or missing fields. We built a schema mapping layer that normalizes everything into a consistent structure: subject, format, license, author, publication date, and content type.
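A minimal sketch of that mapping layer, assuming per-schema field dictionaries (the mappings shown are illustrative, not the system's actual tables). Each source schema maps its own keys onto the unified fields, and anything unmapped or missing simply stays `None`:

```python
# Hypothetical per-schema field maps -- real Dublin Core uses these element
# names, but the target field names are assumptions for illustration.
FIELD_MAPS = {
    "dublin_core": {
        "dc:title": "title",
        "dc:creator": "author",
        "dc:subject": "subject",
        "dc:date": "publication_date",
        "dc:rights": "license",
        "dc:format": "format",
        "dc:type": "content_type",
    },
}

UNIFIED_FIELDS = (
    "title", "author", "subject", "format",
    "license", "publication_date", "content_type",
)

def normalize(record: dict, schema: str) -> dict:
    """Map one source record into the unified structure; missing fields become None."""
    mapping = FIELD_MAPS[schema]
    unified = {f: None for f in UNIFIED_FIELDS}
    for source_key, value in record.items():
        target = mapping.get(source_key)  # silently drop unmapped keys
        if target is not None:
            unified[target] = value
    return unified
```

Keeping absent fields as explicit `None` values (rather than omitting them) is what lets faceted search treat "license unknown" as its own filterable bucket.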
I integrated OpenAI to analyze downloaded PDFs and resource descriptions. The system scores each resource on content depth, organization, recency, citation density, and pedagogical value. Faculty can trust the top-ranked results without reviewing every PDF themselves.
We built a PHP web interface backed by MySQL where faculty can browse, search, and filter indexed resources by subject, format, quality score, and license type. A companion CLI tool handles batch harvesting, re-indexing, and re-scoring on a schedule.
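The companion CLI's three jobs (harvesting, re-indexing, re-scoring) map naturally onto subcommands. A minimal sketch using Python's standard `argparse`; the command and flag names are assumptions, not the tool's actual interface:

```python
import argparse

def build_cli() -> argparse.ArgumentParser:
    """Hypothetical subcommand layout for the batch CLI."""
    parser = argparse.ArgumentParser(prog="oer-pipeline")
    sub = parser.add_subparsers(dest="command", required=True)

    # Pull new resources, optionally limited to specific connectors.
    harvest = sub.add_parser("harvest", help="pull resources from connectors")
    harvest.add_argument("--source", action="append",
                         help="repeatable; limit the run to named connectors")

    # Rebuild the search index from staged, normalized metadata.
    sub.add_parser("reindex", help="rebuild the search index")

    # Re-run AI quality scoring, optionally only on recently changed items.
    rescore = sub.add_parser("rescore", help="re-run quality scoring")
    rescore.add_argument("--since",
                         help="only rescore resources updated after this date")
    return parser
```

A layout like this lets a scheduler (e.g. cron) drive each stage independently, so a slow full harvest never blocks a quick nightly rescore.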
I can walk you through how an OER pipeline could help your faculty find free course materials without the manual search.
Automated connectors pull resources from 50+ OER repositories (OpenStax, MERLOT, OER Commons, MIT OCW, and others) into a single searchable index.
Downloads and analyzes the actual PDF content, not just metadata, to rank resources by depth, structure, and pedagogical value.
OpenAI evaluates each resource on content depth, organization, recency, and pedagogical design. The result is a composite quality score faculty can actually trust.
Normalizes metadata from Dublin Core, custom APIs, and inconsistent fields into a unified schema so cross-repository search and comparison actually works.
PHP web interface lets faculty browse by subject, format, license, and quality score. Instant search across thousands of indexed resources with faceted filtering.
Recommends OER alternatives matched to specific textbooks and courses, with quality comparisons and adoption guidance for faculty.
Faculty search across 50+ sources at once. They can filter by subject, format, quality score, and license type.
Monitor harvesting progress across all connected repositories, with per-source status and error tracking.
AI-generated quality breakdown for each resource, covering content depth, organization, and pedagogical value.
Before this tool, I spent an entire weekend searching five different OER sites for one psychology textbook replacement. Now I type in my subject, sort by quality score, and have three solid options in under a minute. My students saved over $150 each this semester.
There's no shortage of free educational material online. The real problem is that every repository describes its resources differently. Building the schema mapping layer took longer than building the harvesting connectors, but it's what made the whole system usable. Without normalized metadata, search across sources is meaningless.
Early versions only scored resources based on their metadata descriptions. Faculty didn't trust it because descriptions are often vague or overly optimistic. When I added PDF-first analysis that actually reads the content, quality scores became meaningful and adoption followed. Faculty need to know the system looked at the same thing they would.
The initial CLI-only harvesting tool ran silently for hours. Faculty and administrators had no idea whether it was working, stuck, or finished. Adding a progress dashboard with per-source status and error counts turned a black box into something the team could monitor and trust to run on schedule.
Tell us about your OER goals. We'd like to hear how a centralized resource pipeline could save your faculty time and your students money.
No pitch. No pressure. Just a conversation about what might work.