Datasets
Datasets are essential building blocks for fine-tuning models in Factory. This guide explains how to prepare, load, and manage datasets for your machine learning workflows.
Dataset Structure Requirements
Factory works with datasets from the popular Hugging Face `datasets` library. Your dataset must meet these requirements:
- Must be a `DatasetDict` object with at least `train` and `test` splits
- Each split should contain examples relevant to your fine-tuning task
- Factory automatically generates fingerprints to track dataset versions
A typical dataset structure looks like this:
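For illustration, printing a `DatasetDict` loaded from the Hub (here the public `ag_news` dataset, chosen only because it ships with exactly the two required splits) shows the expected shape:

```python
from datasets import load_dataset

# ag_news ships with exactly the train/test splits Factory expects
dataset = load_dataset("ag_news")
print(dataset)
# DatasetDict({
#     train: Dataset({
#         features: ['text', 'label'],
#         num_rows: 120000
#     })
#     test: Dataset({
#         features: ['text', 'label'],
#         num_rows: 7600
#     })
# })
```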
Loading Datasets in Factory
Factory provides flexible options for working with datasets:
Option 1: Loading from Hugging Face
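A minimal sketch of this flow; the registration call at the end is a hypothetical placeholder, since the exact Factory client API is not shown here:

```python
from datasets import load_dataset

# Load a dataset from the Hugging Face Hub
# (yelp_review_full is an arbitrary example; the slice keeps the download small)
dataset = load_dataset("yelp_review_full", split="train[:5000]")

# Split it into train and test sets with a fixed seed for reproducibility
dataset = dataset.train_test_split(test_size=0.1, seed=42)

# Hypothetical registration call -- substitute your actual Factory client.
# Registering assigns the name and computes the version fingerprint.
# factory_client.register_dataset(dataset, name="yelp-reviews")
```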
This method:
- Loads a dataset from Hugging Face's repository
- Splits it into train and test sets
- Registers it with Factory under a specific name
- Creates a unique fingerprint for version tracking
Option 2: Working with Local Datasets
For datasets you've already prepared:
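A sketch assuming your data lives in local JSONL files (the paths below are placeholders):

```python
from datasets import load_dataset

# Build a DatasetDict directly from local files, one file per split
dataset = load_dataset(
    "json",
    data_files={
        "train": "data/train.jsonl",  # placeholder paths
        "test": "data/test.jsonl",
    },
)

# Hypothetical registration, as in Option 1:
# factory_client.register_dataset(dataset, name="my-local-dataset")
```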
How Dataset Management Works
When you register a dataset with Factory:
- Fingerprinting: Factory creates a unique fingerprint to identify the dataset version
- Revision Control: Similar to models, datasets have metadata and specific revisions
- Preview Generation: Factory automatically creates previews of your data for inspection
- Test Sample Creation: Factory extracts sample data for quick testing
- Storage: The dataset is efficiently stored and linked to your Factory account
If you update your dataset and upload it again, Factory will detect the changes and create a new revision while maintaining the connection to previous versions.
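Factory's fingerprinting is internal to the platform, but conceptually it behaves like a content hash over the examples. A rough sketch of the idea (not Factory's actual implementation):

```python
import hashlib

def fingerprint(dataset_dict) -> str:
    """Illustrative only: derive a stable id from a DatasetDict's contents."""
    digest = hashlib.sha256()
    for split in sorted(dataset_dict):
        for example in dataset_dict[split]:
            # Sort keys so field order never changes the hash
            digest.update(repr(sorted(example.items())).encode("utf-8"))
    return digest.hexdigest()[:16]
```

Because the hash depends only on the data, re-uploading identical examples yields the same fingerprint, while any change produces a new one and therefore a new revision.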
Dataset Versioning
Factory tracks dataset versions with two main objects:
- `DatasetMeta`: The main reference for a dataset under a specific name
- `DatasetRevision`: A concrete version of a dataset with specific examples
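A rough mental model of how the two objects relate (the field names below are assumptions for illustration, not Factory's actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class DatasetRevision:
    # One concrete, immutable version of the data
    fingerprint: str   # content hash identifying this exact version
    num_examples: int

@dataclass
class DatasetMeta:
    # The stable, named reference that revisions attach to
    name: str
    revisions: list[DatasetRevision] = field(default_factory=list)

    def latest(self) -> DatasetRevision:
        return self.revisions[-1]
```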
This versioning system ensures:
- Reproducibility: Experiments can reference specific dataset versions
- Tracking: See how dataset changes affect model performance
- Lineage: Understand which dataset version was used for which model training run
Best Practices
- Use consistent preprocessing: Apply the same transforms to train and test splits
- Choose appropriate splits: Typically 80-90% train, 10-20% test
- Include diverse examples: Ensure your test set represents the full distribution
- Keep feature names consistent: Factory tracks features for schema validation
- Provide descriptive names: Use clear naming for easier dataset management
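Several of these practices come together in a short snippet: apply one preprocessing function before splitting, then split deterministically (a sketch; the dataset and transform are arbitrary):

```python
from datasets import load_dataset

dataset = load_dataset("imdb", split="train[:2000]")  # small slice for speed

def preprocess(example):
    # One transform, applied once, so train and test stay consistent
    example["text"] = example["text"].strip()
    return example

dataset = dataset.map(preprocess)
# A 90/10 split with a fixed seed keeps experiments reproducible
dataset = dataset.train_test_split(test_size=0.1, seed=42)
```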
Example Workflow
Here's a complete example of preparing a dataset for fine-tuning:
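The following sketch ties the pieces together; the final registration call is hypothetical, standing in for whatever your Factory client exposes:

```python
from datasets import load_dataset

# 1. Load raw data from the Hugging Face Hub
dataset = load_dataset("yelp_review_full", split="train[:10000]")

# 2. Apply consistent preprocessing to every example
def preprocess(example):
    example["text"] = example["text"].strip()
    return example

dataset = dataset.map(preprocess)

# 3. Create the train/test splits Factory requires
dataset = dataset.train_test_split(test_size=0.2, seed=42)

# 4. Register with Factory under a descriptive name (hypothetical call --
#    replace with your actual client). Factory fingerprints the data,
#    generates previews and test samples, and stores the new revision.
# factory_client.register_dataset(dataset, name="yelp-reviews-10k")
```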