Datasets

Datasets are essential building blocks for fine-tuning models in Factory. This guide explains how to prepare, load, and manage datasets for your machine learning workflows.

Dataset Structure Requirements

Factory works with datasets from the popular Hugging Face datasets library. Your dataset must meet these requirements:

  • Must be a DatasetDict object with at least train and test splits
  • Each split should contain examples relevant to your fine-tuning task
  • Factory automatically generates fingerprints to track dataset versions

A typical dataset structure looks like this:

DatasetDict({
    'train': Dataset({
        features: ['text', 'label', ...],
        num_rows: 10000
    }),
    'test': Dataset({
        features: ['text', 'label', ...],
        num_rows: 1000
    })
})
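
Before registering, you can verify these requirements programmatically with the datasets library (a quick sanity check on a DatasetDict named data; this is plain datasets API, not a Factory call):

from datasets import DatasetDict

# Confirm the object is a DatasetDict with the required splits
assert isinstance(data, DatasetDict)
assert {"train", "test"} <= set(data.keys())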

Loading Datasets in Factory

Factory provides flexible options for working with datasets:

Option 1: Loading from Hugging Face

from factory_sdk import FactoryClient
from datasets import load_dataset
 
# Initialize the Factory client
factory = FactoryClient(
    tenant="your_tenant_name",
    project="your_project_name",
    token="your_api_key",
)
 
# Load a dataset from Hugging Face
hf_data = load_dataset("takala/financial_phrasebank", "sentences_allagree")
data = hf_data["train"].train_test_split(test_size=0.1, seed=42)
 
# Register with Factory
dataset = factory.dataset.with_name("financial-phrases") \
    .from_local(data) \
    .save_or_fetch()

This method:

  1. Loads a dataset from Hugging Face's repository
  2. Splits it into train and test sets
  3. Registers it with Factory under a specific name
  4. Creates a unique fingerprint for version tracking
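
Before registering, a quick way to confirm the split from step 2 looks right (again plain datasets API, not Factory-specific):

# Inspect the resulting splits and their row counts
print(data)
print(len(data["train"]), len(data["test"]))  # roughly a 90/10 split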

Option 2: Working with Local Datasets

For datasets you've already prepared:

import pandas as pd
from datasets import Dataset, DatasetDict
 
# Create dataset from local data
train_df = pd.read_csv("train_data.csv")
test_df = pd.read_csv("test_data.csv")
 
data = DatasetDict({
    "train": Dataset.from_pandas(train_df),
    "test": Dataset.from_pandas(test_df)
})
 
# Register with Factory
dataset = factory.dataset.with_name("custom-dataset") \
    .from_local(data) \
    .save_or_fetch()
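
One gotcha with Dataset.from_pandas: it can carry the DataFrame index over as an extra __index_level_0__ feature, which would change your schema between uploads. Passing preserve_index=False avoids this:

# Drop the pandas index so it does not become an extra feature column
data = DatasetDict({
    "train": Dataset.from_pandas(train_df, preserve_index=False),
    "test": Dataset.from_pandas(test_df, preserve_index=False)
})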

How Dataset Management Works

When you register a dataset with Factory:

  1. Fingerprinting: Factory creates a unique fingerprint to identify the dataset version
  2. Revision Control: Similar to models, datasets have metadata and specific revisions
  3. Preview Generation: Factory automatically creates previews of your data for inspection
  4. Test Sample Creation: Factory extracts sample data for quick testing
  5. Storage: The dataset is efficiently stored and linked to your Factory account

If you update your dataset and upload it again, Factory will detect the changes and create a new revision while maintaining the connection to previous versions.
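
For example, re-running the registration from Option 2 after the underlying CSVs have grown produces a new revision under the same name (this sketch reuses only the calls shown above; updated_data stands in for your modified DatasetDict):

# Re-register the updated data under the same name; a changed
# fingerprint is stored as a new revision rather than a new dataset
dataset = factory.dataset.with_name("custom-dataset") \
    .from_local(updated_data) \
    .save_or_fetch()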

Dataset Versioning

Factory tracks dataset versions with two main objects:

  1. DatasetMeta: The main reference for a dataset under a specific name
  2. DatasetRevision: A concrete version of a dataset with specific examples

This versioning system ensures:

  • Reproducibility: Experiments can reference specific dataset versions
  • Tracking: See how dataset changes affect model performance
  • Lineage: Understand which dataset version was used for which model training run
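
Conceptually, the relationship between the two objects looks like the sketch below (field names are illustrative, not the SDK's actual schema):

from dataclasses import dataclass, field

@dataclass
class DatasetRevision:
    # One concrete, fingerprinted version of the data
    fingerprint: str
    num_rows: dict          # e.g. {"train": 10000, "test": 1000}

@dataclass
class DatasetMeta:
    # The named dataset; each upload with new content appends a revision
    name: str
    revisions: list = field(default_factory=list)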

Best Practices

  • Use consistent preprocessing: Apply the same transforms to train and test splits (see the sketch after this list)
  • Choose appropriate splits: Typically 80-90% train, 10-20% test
  • Include diverse examples: Ensure your test set represents the full distribution
  • Keep feature names consistent: Factory tracks features for schema validation
  • Provide descriptive names: Use clear naming for easier dataset management
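
For the first point, DatasetDict.map applies one function to every split, so train and test are guaranteed to go through identical preprocessing (normalize_batch here is a hypothetical transform; substitute your own):

def normalize_batch(batch):
    # Hypothetical transform; replace with your real preprocessing
    batch["text"] = [t.lower().strip() for t in batch["text"]]
    return batch

# Mapping over the DatasetDict hits train and test with the same function
data = data.map(normalize_batch, batched=True)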

Example Workflow

Here's a complete example of preparing a dataset for fine-tuning:

from factory_sdk import FactoryClient
from datasets import load_dataset
 
# Initialize Factory client
factory = FactoryClient(
    tenant="your_tenant_name",
    project="your_project_name",
    token="your_api_key",
)
 
# Load and prepare dataset
data = load_dataset("imdb")
# train_test_split is a Dataset method, not a DatasetDict method,
# so re-split the train portion explicitly
splits = data["train"].train_test_split(test_size=0.1, seed=42)
 
# Register with Factory
dataset = factory.dataset.with_name("sentiment-data") \
    .from_local(splits) \
    .save_or_fetch()
 
# Now you can use this dataset for recipe creation and model fine-tuning
recipe = factory.recipe \
    .with_name("sentiment-analysis") \
    .using_dataset(dataset) \
    .with_preprocessor(your_processor_function) \
    .save_or_fetch()
