Datasets

Datasets are essential building blocks for fine-tuning models in Factory. This guide explains how to prepare, load, and manage datasets for your machine learning workflows.

Dataset Structure Requirements

Factory works with datasets from the popular Hugging Face datasets library. Your dataset must meet these requirements:

  • Must be a DatasetDict object with at least train and test splits
  • Each split should contain examples relevant to your fine-tuning task
  • Factory automatically generates fingerprints to track dataset versions

A typical dataset structure looks like this:

DatasetDict({
    'train': Dataset({
        features: ['text', 'label', ...],
        num_rows: 10000
    }),
    'test': Dataset({
        features: ['text', 'label', ...],
        num_rows: 1000
    })
})
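
Before registering, you can verify these requirements programmatically with the datasets library (a quick sanity check on a DatasetDict named data; this is plain datasets API, not a Factory call):

from datasets import DatasetDict

# Confirm the object is a DatasetDict with the required splits
assert isinstance(data, DatasetDict)
assert {"train", "test"} <= set(data.keys())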

Loading Datasets in Factory

Factory provides flexible options for working with datasets:

Option 1: Loading from Hugging Face

from factory_sdk import FactoryClient
from datasets import load_dataset
 
# Initialize the Factory client
factory = FactoryClient(
    tenant="your_tenant_name",
    project="your_project_name",
    token="your_api_key",
)
 
# Load a dataset from Hugging Face
hf_data = load_dataset("takala/financial_phrasebank", "sentences_allagree")
data = hf_data["train"].train_test_split(test_size=0.1, seed=42)
 
# Register with Factory
dataset = factory.dataset.with_name("financial-phrases") \
    .from_local(data) \
    .save_or_fetch()

This method:

  1. Loads a dataset from Hugging Face's repository
  2. Splits it into train and test sets
  3. Registers it with Factory under a specific name
  4. Creates a unique fingerprint for version tracking
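
Before registering, a quick way to confirm the split from step 2 looks right (again plain datasets API, not Factory-specific):

# Inspect the resulting splits and their row counts
print(data)
print(len(data["train"]), len(data["test"]))  # roughly a 90/10 split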

Option 2: Working with Local Datasets

For datasets you've already prepared:

import pandas as pd
from datasets import Dataset, DatasetDict
 
# Create dataset from local data
train_df = pd.read_csv("train_data.csv")
test_df = pd.read_csv("test_data.csv")
 
data = DatasetDict({
    "train": Dataset.from_pandas(train_df),
    "test": Dataset.from_pandas(test_df)
})
 
# Register with Factory
dataset = factory.dataset.with_name("custom-dataset") \
    .from_local(data) \
    .save_or_fetch()
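
One gotcha with Dataset.from_pandas: it can carry the DataFrame index over as an extra __index_level_0__ feature, which would change your schema between uploads. Passing preserve_index=False avoids this:

# Drop the pandas index so it does not become an extra feature column
data = DatasetDict({
    "train": Dataset.from_pandas(train_df, preserve_index=False),
    "test": Dataset.from_pandas(test_df, preserve_index=False)
})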

How Dataset Management Works

When you register a dataset with Factory:

  1. Fingerprinting: Factory creates a unique fingerprint to identify the dataset version
  2. Revision Control: Similar to models, datasets have metadata and specific revisions
  3. Preview Generation: Factory automatically creates previews of your data for inspection
  4. Test Sample Creation: Factory extracts sample data for quick testing
  5. Storage: The dataset is efficiently stored and linked to your Factory account

If you update your dataset and upload it again, Factory will detect the changes and create a new revision while maintaining the connection to previous versions.
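
For example, re-running the registration from Option 2 after the underlying CSVs have grown produces a new revision under the same name (this sketch reuses only the calls shown above; updated_data stands in for your modified DatasetDict):

# Re-register the updated data under the same name; a changed
# fingerprint is stored as a new revision rather than a new dataset
dataset = factory.dataset.with_name("custom-dataset") \
    .from_local(updated_data) \
    .save_or_fetch()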

Dataset Versioning

Factory tracks dataset versions with two main objects:

  1. DatasetMeta: The main reference for a dataset under a specific name
  2. DatasetRevision: A concrete version of a dataset with specific examples

This versioning system ensures:

  • Reproducibility: Experiments can reference specific dataset versions
  • Tracking: See how dataset changes affect model performance
  • Lineage: Understand which dataset version was used for which model training run
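
Conceptually, the relationship between the two objects looks like the sketch below (field names are illustrative, not the SDK's actual schema):

from dataclasses import dataclass, field

@dataclass
class DatasetRevision:
    # One concrete, fingerprinted version of the data
    fingerprint: str
    num_rows: dict          # e.g. {"train": 10000, "test": 1000}

@dataclass
class DatasetMeta:
    # The named dataset; each upload with new content appends a revision
    name: str
    revisions: list = field(default_factory=list)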

Best Practices

  • Use consistent preprocessing: Apply the same transforms to train and test splits (see the sketch after this list)
  • Choose appropriate splits: Typically 80-90% train, 10-20% test
  • Include diverse examples: Ensure your test set represents the full distribution
  • Keep feature names consistent: Factory tracks features for schema validation
  • Provide descriptive names: Use clear naming for easier dataset management
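
For the first point, DatasetDict.map applies one function to every split, so train and test are guaranteed to go through identical preprocessing (normalize_batch here is a hypothetical transform; substitute your own):

def normalize_batch(batch):
    # Hypothetical transform; replace with your real preprocessing
    batch["text"] = [t.lower().strip() for t in batch["text"]]
    return batch

# Mapping over the DatasetDict hits train and test with the same function
data = data.map(normalize_batch, batched=True)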

Example Workflow

Here's a complete example of preparing a dataset for fine-tuning:

from factory_sdk import FactoryClient
from datasets import load_dataset
 
# Initialize Factory client
factory = FactoryClient(
    tenant="your_tenant_name",
    project="your_project_name",
    token="your_api_key",
)
 
# Load and prepare dataset
data = load_dataset("imdb")
# train_test_split is a Dataset method, not a DatasetDict method,
# so re-split the train portion explicitly
splits = data["train"].train_test_split(test_size=0.1, seed=42)
 
# Register with Factory
dataset = factory.dataset.with_name("sentiment-data") \
    .from_local(splits) \
    .save_or_fetch()
 
# Now you can use this dataset for recipe creation and model fine-tuning
recipe = factory.recipe \
    .with_name("sentiment-analysis") \
    .using_dataset(dataset) \
    .with_preprocessor(your_processor_function) \
    .save_or_fetch()
