
Library of Alexandria

A structured cross-cultural mythology and history corpus. Built as the training substrate for AI models that need to reason about human civilizations, not just pattern-match on Wikipedia summaries.

84k+ source files · 52GB raw corpus · 12+ traditions · RAG retrieval ready

What it is

Most mythology datasets are thin. A few hundred Wikipedia articles, some public domain translations, maybe a Reddit scrape. They're fine for trivia. They're not fine for training a model to reason about civilizations, belief systems, cultural continuity, or the way ideas spread across ancient trade routes.

The Library of Alexandria is what I built instead. 84,000+ source documents. 52GB of structured text. Spanning twelve major traditions from the ancient world through early modernity. Curated for depth, not breadth. The goal is a model that understands what the Egyptians believed about death, not just that they had pyramids.

Traditions covered

Norse
Greek
Egyptian
Minoan
Mesopotamian
Sanskrit / Vedic
Semitic
Chinese
Celtic
Persian / Zoroastrian
Mesoamerican
Slavic

Each tradition goes deeper than surface mythology. Primary texts, academic commentary, archaeological context, religious practice, economic structure. The Minoan section alone has more depth than most dedicated mythology databases.

How it was built

Manual curation doesn't scale to 84,000 files. Blackreach handled bulk collection; Ghost Hand handled targeted acquisition from academic databases, archive.org, digitized manuscript collections, and specialized repositories.

The pipeline isn't just a scraper. Each document goes through format normalization, deduplication, quality filtering, and metadata tagging before it's added to the corpus. Bad sources get dropped. Duplicates get merged. The structure matters as much as the content.
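The dedup-and-filter stage can be sketched in a few lines. This is an illustrative reconstruction, not the actual pipeline code: the function names, the content-hash approach, and the minimum-length quality filter are all assumptions.

```python
import hashlib
import unicodedata

def normalize(text: str) -> str:
    """Illustrative format normalization: Unicode NFC plus whitespace collapse."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())

def ingest(docs, min_length=200):
    """Drop stub sources, dedupe by content hash, keep metadata attached."""
    seen = set()
    corpus = []
    for doc in docs:
        body = normalize(doc["text"])
        if len(body) < min_length:
            continue  # quality filter: source too thin to keep
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of an already-ingested document
        seen.add(digest)
        corpus.append({**doc, "text": body, "sha256": digest})
    return corpus
```

A real pipeline would add near-duplicate detection (shingling, MinHash) on top of exact hashing, but the shape is the same: every document passes through the same gate before it touches the corpus.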

The corpus is still growing. Some traditions are complete. Others are partial. The Minoan and Mesopotamian sections are the most developed right now, partly because those are the areas I'm actively using for research.

What it powers

Right now the corpus feeds a local RAG system using LlamaIndex and ChromaDB. Query the corpus, get back grounded, cited answers from primary sources rather than hallucinated summaries.
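The actual stack is LlamaIndex over ChromaDB; the retrieval idea itself is simple enough to sketch without either library. This toy version ranks chunks by term overlap instead of embeddings, and every field name here is hypothetical:

```python
from collections import Counter

def score(query: str, chunk_text: str) -> int:
    """Crude relevance: count of shared terms (the real system uses embeddings)."""
    q = Counter(query.lower().split())
    c = Counter(chunk_text.lower().split())
    return sum((q & c).values())

def retrieve(query, chunks, top_k=3):
    """Return the top-k chunks, each paired with its source metadata as a citation."""
    ranked = sorted(chunks, key=lambda ch: score(query, ch["text"]), reverse=True)
    return [
        {"text": ch["text"], "citation": f'{ch["source"]} ({ch["tradition"]})'}
        for ch in ranked[:top_k]
    ]
```

The point is the shape of the answer: a passage plus a citation back to a primary source, so the model grounds its response instead of summarizing from memory.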

The longer-term use is fine-tuning. A model trained on this corpus should be able to reason about ancient civilizations the way a domain specialist does, not the way a language model does. That's the target.

Key design decisions

Depth over coverage
Fewer traditions done well beats every tradition done badly. Each section targets specialist-level depth.
Primary sources first
Original texts, translations, and academic commentary take priority over encyclopedia summaries.
Structured metadata
Every document is tagged by tradition, era, type, and source quality. The structure makes retrieval useful.
RAG-optimized chunking
Documents are chunked at semantic boundaries, not arbitrary token counts. Context survives chunking.
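The chunking decision above can be sketched with paragraph breaks as the semantic boundary. This is one possible implementation of the idea, not the pipeline's actual code; the character budget and boundary heuristic are assumptions:

```python
def chunk_at_boundaries(text: str, max_chars: int = 1200):
    """Pack whole paragraphs into chunks rather than cutting at a fixed size.

    A paragraph is never split, so each chunk stays a self-contained
    unit of meaning and context survives the chunking step.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)   # close the chunk at a paragraph boundary
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Compare with fixed-token chunking, which routinely slices a sentence about one deity's cult in half and leaves the retriever with two fragments that each score poorly.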

// private
The corpus is not publicly available. It represents a significant amount of collection and curation work and is the training substrate for my own models. If you're doing serious research in any of these traditions and want to talk, reach out.