Pretraining corpora
Domain-specific raw text at scale — sourced, deduplicated, quality-filtered, and license-tagged. Delivered as Parquet or sharded JSONL with metadata sidecars.
LLM Training Data
Domain-specific datasets sourced from public web data — license-filtered, deduplicated, quality-scored, and delivered in JSONL, Parquet, or HuggingFace formats. For teams training, fine-tuning, evaluating, or augmenting language models with their own data.
Documents per pipeline
100M+
Output formats
JSONL · Parquet
License filtering
Built-in
Starting from
$500
Dataset types
Domain-specific raw text at scale — sourced, deduplicated, quality-filtered, and license-tagged. Delivered as Parquet or sharded JSONL with metadata sidecars.
Instruction-response pairs for supervised fine-tuning, sourced from forums, docs, and Q&A sites where licensing permits. JSONL format with role-tagged conversations.
Chunked passages with source metadata, embeddings-ready, and topic-clustered for retrieval-augmented generation pipelines. Optimized for vector-DB ingestion.
Chosen-vs-rejected pairs sourced from upvoted-vs-downvoted public content, ranked answers, and editorial preference signals. For DPO, RLHF, and preference-tuning.
Domain-grounded eval sets with verified answers, source citations, and difficulty labels. For internal benchmarking, regression testing, and model-comparison workflows.
Seed-and-amplify pipelines that take a small human-curated seed and expand it with LLM-generated variations, then filter for quality, diversity, and toxicity.
Sources we use
Every source is license-tagged at the document level. Excluded sources are logged. Custom whitelists/blacklists per engagement.
Per-document metadata
Why custom LLM data matters
Common Crawl gives every team in the world the same starting point. If your competitive edge is supposed to come from your model in your domain, training on the same generic corpus as everyone else doesn\'t get you there. Domain-specific corpora — narrow but deep — produce models that win on the benchmarks you actually run.
The same logic applies to fine-tuning. A 10K-example instruction set sourced from your exact target domain (legal, medical, financial, retail, industrial) consistently outperforms a 100K-example generic instruction set when measured on domain tasks. Quality, coverage, and license-cleanliness beat raw volume.
We build both. Pretraining corpora at the hundreds-of-millions-of-documents scale for foundation-model teams. Fine-tuning, preference, and eval datasets at the tens-of-thousands scale for applied teams shipping production LLMs. Same underlying extraction infrastructure, different curation logic.
License handling is non-negotiable for production training. Every source is tagged, every restriction is logged, every excluded URL is auditable. Your legal team gets a defensible record alongside the dataset — not a black-box CSV.
Process
01
We define the dataset shape — pretraining vs fine-tuning vs RAG vs eval — the target domain, languages, source list, license requirements, and volume target.
02
Multi-source scrapers with license-tagging at the document level. Robots.txt and ToS respected. Excluded sources logged for auditability.
03
Boilerplate removal, language detection, dedup (MinHash + URL canonicalization), perplexity-based quality scoring, PII detection, and safety filters.
04
JSONL, Parquet, or HuggingFace dataset format with proper splits. Delivered to S3, HuggingFace Hub, or your training infrastructure with documentation.
We do not bypass authentication, scrape paywalled content, or collect copyrighted material outside permissive license terms. Every dataset ships with per-document license metadata, an exclusion log, and source-level documentation. For engagements requiring specific publisher licenses or non-public sources, we work with your legal team to scope the arrangement before data collection begins.
FAQ
Pretraining corpora (raw text at scale), supervised fine-tuning datasets (instruction-response pairs), RAG/knowledge-base datasets (chunked passages with metadata), preference datasets for DPO/RLHF (chosen/rejected pairs), evaluation sets (golden answers), and embedding corpora (text → vector pipelines). Each is scoped to a domain and use case.
JSONL is the standard for fine-tuning and instruction data. Parquet for large-scale pretraining or embedding corpora. HuggingFace Datasets format with proper splits (train/val/test) is also supported. Raw text with metadata sidecars is available for custom training pipelines.
Every source is filtered through a license-aware pipeline. We collect from public-domain sources, CC-licensed content, and explicitly-permissive websites; we exclude paywalled, copyrighted-restricted, and non-commercial-only content unless your engagement has an explicit license arrangement. License metadata is preserved at the document level so your legal team can audit.
Yes. Near-duplicate detection (MinHash/SimHash), URL canonicalization, and content-hash deduplication run on every dataset. For pretraining corpora we also apply quality scoring (perplexity-based, classifier-based) to remove low-quality boilerplate, navigation text, and SEO-spam content.
PII detection (regex + classifier-based) is run on every document. Toxic content filtering, NSFW filtering, and language detection are optional layers depending on your safety requirements. Filter thresholds and exclusion lists are configurable per engagement.
We have run pipelines producing hundreds of millions of documents for pretraining corpora and tens-of-thousands of instruction pairs for fine-tuning. Scale is bounded by source availability and license scope, not by infrastructure.
Related work
Train on data only you have
Smaller validation datasets from $500. Pretraining-scale corpora scoped per engagement based on domain, languages, license posture, and volume target.