A structured cross-cultural mythology and history corpus. Built as the training substrate for AI models that need to reason about human civilizations, not just pattern-match on Wikipedia summaries.
Most mythology datasets are thin. A few hundred Wikipedia articles, some public domain translations, maybe a Reddit scrape. They're fine for trivia. They're not fine for training a model to reason about civilizations, belief systems, cultural continuity, or the way ideas spread across ancient trade routes.
The Library of Alexandria is what I built instead. 84,000+ source documents. 52GB of structured text. Spanning twelve major traditions from the ancient world through early modernity. Curated for depth, not breadth. The goal is a model that understands what the Egyptians believed about death, not just that they had pyramids.
Each tradition goes deeper than surface mythology. Primary texts, academic commentary, archaeological context, religious practice, economic structure. The Minoan section alone has more depth than most dedicated mythology databases.
Manual curation doesn't scale to 84,000 files. Blackreach handled bulk collection. Ghost Hand handled targeted acquisition from academic databases, archive.org, digitized manuscript collections, and specialized repositories.
The pipeline isn't just a scraper. Each document goes through format normalization, deduplication, quality filtering, and metadata tagging before it's added to the corpus. Bad sources get dropped. Duplicates get merged. The structure matters as much as the content.
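The ingestion steps above can be sketched in a few lines. This is a minimal, self-contained illustration of the normalize → quality-filter → deduplicate flow, not the production pipeline; the length threshold and the `(doc_id, raw_text, metadata)` shape are assumptions for the sake of the example.

```python
import hashlib
import unicodedata

def normalize(text: str) -> str:
    """Normalize Unicode form and collapse whitespace so near-identical
    copies of the same source hash to the same value."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())

def ingest(documents):
    """documents: iterable of (doc_id, raw_text, metadata) tuples.
    Returns a deduplicated corpus keyed by content hash."""
    corpus = {}
    for doc_id, raw, meta in documents:
        text = normalize(raw)
        if len(text) < 200:  # quality floor: drop stubs (threshold is illustrative)
            continue
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in corpus:
            # Exact duplicate after normalization: merge, don't re-add.
            corpus[key]["sources"].append(doc_id)
        else:
            corpus[key] = {"text": text, "sources": [doc_id], "meta": meta}
    return corpus
```

Hashing normalized text catches only exact duplicates; near-duplicate detection (differing translations, OCR variants) needs fuzzier matching, but the merge-rather-than-drop structure stays the same.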
Right now the corpus feeds a local RAG system using LlamaIndex and ChromaDB. Query the corpus, get back grounded, cited answers from primary sources rather than hallucinated summaries.
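The retrieve-then-cite pattern looks roughly like this. A deliberately toy sketch: the real stack delegates embedding and vector search to LlamaIndex and ChromaDB, so the bag-of-words scoring and the `{"text": ..., "source": ...}` record shape here are stand-ins for illustration only.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector; the real pipeline would use a proper
    embedding model via the vector store."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, corpus, k=3):
    """corpus: list of {"text": ..., "source": ...} dicts.
    Returns the top-k chunks with their citations -- the grounded
    context a generator model would answer from."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda d: cosine(q, embed(d["text"])), reverse=True)
    return [(d["source"], d["text"]) for d in ranked[:k]]
```

The citation travels with the chunk through the whole pipeline, which is what lets the final answer point back at a primary source instead of a hallucinated summary.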
The longer-term use is fine-tuning. A model trained on this corpus should be able to reason about ancient civilizations the way a domain specialist does, not the way a language model does. That's the target.