LLM Training Data

Custom domain datasets for pretraining, fine-tuning, RAG, and evaluation.

Domain-specific datasets sourced from public web data — license-filtered, deduplicated, quality-scored, and delivered in JSONL, Parquet, or HuggingFace formats. For teams training, fine-tuning, evaluating, or augmenting language models with their own data.

Documents per pipeline

100M+

Output formats

JSONL · Parquet

License filtering

Built-in

Starting from

$500

Dataset types

Six shapes of LLM dataset, each scoped to your training goal.

Pretraining corpora

Domain-specific raw text at scale — sourced, deduplicated, quality-filtered, and license-tagged. Delivered as Parquet or sharded JSONL with metadata sidecars.

Instruction & fine-tuning datasets

Instruction-response pairs for supervised fine-tuning, sourced from forums, docs, and Q&A sites where licensing permits. JSONL format with role-tagged conversations.

RAG knowledge corpora

Chunked passages with source metadata, embeddings-ready, and topic-clustered for retrieval-augmented generation pipelines. Optimized for vector-DB ingestion.

Preference & DPO datasets

Chosen-vs-rejected pairs sourced from upvoted-vs-downvoted public content, ranked answers, and editorial preference signals. For DPO, RLHF, and preference-tuning.

Evaluation & golden sets

Domain-grounded eval sets with verified answers, source citations, and difficulty labels. For internal benchmarking, regression testing, and model-comparison workflows.

Synthetic data pipelines

Seed-and-amplify pipelines that take a small human-curated seed and expand it with LLM-generated variations, then filter for quality, diversity, and toxicity.

Sources we use

License-aware sourcing across the open web.

Every source is license-tagged at the document level. Excluded sources are logged. Custom whitelists/blacklists per engagement.

Public documentation sitesOpen knowledge basesPublic forums (license-permitted)Government & open dataAcademic preprintsCC-licensed mediaPublic Q&A archivesOpen-source code reposIndustry-specific public docsConference proceedings (open)Public legal & regulatory textMultilingual web (CC-OSCAR-style)

Per-document metadata

Every record ships with the metadata your training pipeline needs.

Document text & cleaned content
Source URL & domain
License & permission status
Language & confidence score
Document type & section structure
Quality score (perplexity / classifier)
PII detection flags
Toxicity / safety scores
Topic cluster ID
Extracted entities & metadata
Token count & length
Dedup hash & near-dup cluster
Embeddings (optional)
Train/val/test split label

Why custom LLM data matters

Generic corpora train generic models.

Common Crawl gives every team in the world the same starting point. If your competitive edge is supposed to come from your model in your domain, training on the same generic corpus as everyone else doesn\'t get you there. Domain-specific corpora — narrow but deep — produce models that win on the benchmarks you actually run.

The same logic applies to fine-tuning. A 10K-example instruction set sourced from your exact target domain (legal, medical, financial, retail, industrial) consistently outperforms a 100K-example generic instruction set when measured on domain tasks. Quality, coverage, and license-cleanliness beat raw volume.

We build both. Pretraining corpora at the hundreds-of-millions-of-documents scale for foundation-model teams. Fine-tuning, preference, and eval datasets at the tens-of-thousands scale for applied teams shipping production LLMs. Same underlying extraction infrastructure, different curation logic.

License handling is non-negotiable for production training. Every source is tagged, every restriction is logged, every excluded URL is auditable. Your legal team gets a defensible record alongside the dataset — not a black-box CSV.

Process

From scope to HuggingFace-ready delivery.

01

Domain & dataset scoping

We define the dataset shape — pretraining vs fine-tuning vs RAG vs eval — the target domain, languages, source list, license requirements, and volume target.

02

License-aware extraction

Multi-source scrapers with license-tagging at the document level. Robots.txt and ToS respected. Excluded sources logged for auditability.

03

Cleaning & quality filtering

Boilerplate removal, language detection, dedup (MinHash + URL canonicalization), perplexity-based quality scoring, PII detection, and safety filters.

04

Format & delivery

JSONL, Parquet, or HuggingFace dataset format with proper splits. Delivered to S3, HuggingFace Hub, or your training infrastructure with documentation.

License & compliance posture

We do not bypass authentication, scrape paywalled content, or collect copyrighted material outside permissive license terms. Every dataset ships with per-document license metadata, an exclusion log, and source-level documentation. For engagements requiring specific publisher licenses or non-public sources, we work with your legal team to scope the arrangement before data collection begins.

FAQ

LLM training data FAQ.

What kinds of LLM datasets do you build?

Pretraining corpora (raw text at scale), supervised fine-tuning datasets (instruction-response pairs), RAG/knowledge-base datasets (chunked passages with metadata), preference datasets for DPO/RLHF (chosen/rejected pairs), evaluation sets (golden answers), and embedding corpora (text → vector pipelines). Each is scoped to a domain and use case.

What format do you deliver the data in?

JSONL is the standard for fine-tuning and instruction data. Parquet for large-scale pretraining or embedding corpora. HuggingFace Datasets format with proper splits (train/val/test) is also supported. Raw text with metadata sidecars is available for custom training pipelines.

How do you handle copyright and licensing?

Every source is filtered through a license-aware pipeline. We collect from public-domain sources, CC-licensed content, and explicitly-permissive websites; we exclude paywalled, copyrighted-restricted, and non-commercial-only content unless your engagement has an explicit license arrangement. License metadata is preserved at the document level so your legal team can audit.

Do you handle deduplication?

Yes. Near-duplicate detection (MinHash/SimHash), URL canonicalization, and content-hash deduplication run on every dataset. For pretraining corpora we also apply quality scoring (perplexity-based, classifier-based) to remove low-quality boilerplate, navigation text, and SEO-spam content.

What about PII and safety filtering?

PII detection (regex + classifier-based) is run on every document. Toxic content filtering, NSFW filtering, and language detection are optional layers depending on your safety requirements. Filter thresholds and exclusion lists are configurable per engagement.

How big are the datasets you can build?

We have run pipelines producing hundreds of millions of documents for pretraining corpora and tens-of-thousands of instruction pairs for fine-tuning. Scale is bounded by source availability and license scope, not by infrastructure.

Train on data only you have

Start with a scoped dataset and a target evaluation.

Smaller validation datasets from $500. Pretraining-scale corpora scoped per engagement based on domain, languages, license posture, and volume target.