Recipes
Recipes are a crucial component in the Factory workflow, serving as the bridge between your raw datasets and model training. A recipe defines precisely how to transform raw data examples into the format required for fine-tuning language models.
What is a Recipe?
A recipe consists of two key components:
- Dataset Reference: Points to the dataset that contains your raw data
- Preprocessor Function: A Python function that transforms each example into the format expected by the model
The preprocessor function is applied to each example in your dataset, converting it into a ModelChatInput object that can be directly used for training.
Creating a Recipe
Creating a recipe in Factory is straightforward:
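As a minimal sketch of the idea, the following pairs a dataset reference with a preprocessor function. The `Recipe`, `Message`, and `apply_recipe` names, the `question`/`answer` fields, and the shape of `ModelChatInput` are stand-ins defined here for illustration; they are assumptions, not the Factory SDK's actual classes or calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


# Stand-in types for illustration only; the real Factory classes may differ.
@dataclass(frozen=True)
class Message:
    role: str      # e.g. "user" or "assistant"
    content: str


@dataclass(frozen=True)
class ModelChatInput:
    messages: List[Message]


@dataclass(frozen=True)
class Recipe:
    name: str
    dataset_ref: str                                # hypothetical dataset identifier
    preprocessor: Callable[[Dict], ModelChatInput]  # applied to every example


def to_chat(example: Dict) -> ModelChatInput:
    """Map one raw example (hypothetical fields) to a chat-style input."""
    return ModelChatInput(messages=[
        Message(role="user", content=example["question"]),
        Message(role="assistant", content=example["answer"]),
    ])


def apply_recipe(recipe: Recipe, dataset: List[Dict]) -> List[ModelChatInput]:
    """Run the recipe's preprocessor over each example in the dataset."""
    return [recipe.preprocessor(example) for example in dataset]


recipe = Recipe(name="qa-recipe", dataset_ref="qa-dataset", preprocessor=to_chat)
raw_examples = [{"question": "What is 2 + 2?", "answer": "4"}]
chat_inputs = apply_recipe(recipe, raw_examples)
```

The point of the sketch is the pairing: the recipe itself holds no data, only a pointer to the dataset and the function that reshapes each example.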
How Recipes Work Behind the Scenes
When you create a recipe, Factory:
- Analyzes your preprocessor code: Factory extracts only the necessary code, dependencies, and imports
- Generates samples: Applies your preprocessor to the dataset and saves example inputs/outputs
- Computes embeddings: Creates vector embeddings of your examples to analyze data distribution
- Tests for IID conditions: Checks whether your training and test data look independent and identically distributed (IID), that is, whether they follow similar distributions
- Stores distribution analysis: Saves this information for later monitoring and drift detection
All of this analysis is automatically visualized in the Factory Hub for inspection.
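To make the IID check more concrete, here is one possible way to compare two sets of embeddings: a permutation test on the distance between their centroids. This is a sketch of a generic two-sample test, chosen for simplicity; it is an assumption and not a description of the statistic Factory actually computes. All function names are hypothetical.

```python
import random
from typing import List, Sequence

Vector = Sequence[float]


def centroid(vectors: Sequence[Vector]) -> List[float]:
    """Component-wise mean of a set of vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def centroid_distance(a: Sequence[Vector], b: Sequence[Vector]) -> float:
    """Euclidean distance between the centroids of two sets."""
    ca, cb = centroid(a), centroid(b)
    return sum((x - y) ** 2 for x, y in zip(ca, cb)) ** 0.5


def permutation_p_value(train: Sequence[Vector], test: Sequence[Vector],
                        rounds: int = 500, seed: int = 0) -> float:
    """Estimate how often a random relabeling of the pooled data yields a
    centroid gap at least as large as the observed one."""
    rng = random.Random(seed)  # seeded so the analysis is repeatable
    observed = centroid_distance(train, test)
    pooled = list(train) + list(test)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(pooled)
        a, b = pooled[:len(train)], pooled[len(train):]
        if centroid_distance(a, b) >= observed:
            hits += 1
    return (hits + 1) / (rounds + 1)


# Toy 2-D "embeddings": one split identical to train, one shifted far away.
train = [[i * 0.1, (i % 3) * 0.1] for i in range(10)]
shifted = [[x + 5.0, y + 5.0] for x, y in train]

p_same = permutation_p_value(train, train)
p_shifted = permutation_p_value(train, shifted)
```

A small p-value suggests the two splits do not follow the same distribution; a large one gives no evidence of a mismatch.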
The Preprocessor Function
The preprocessor function is the heart of a recipe. It should:
- Take a single dataset example as input
- Transform it into a ModelChatInput object
- Be deterministic (same input always produces same output)
- Handle all possible input variations without errors
Here's a typical structure:
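The following sketch shows one shape such a function could take. `Message` and `ModelChatInput` are stand-in dataclasses defined here so the example runs on its own, and the `question`, `context`, and `answer` field names are hypothetical; the real Factory types and your dataset's fields may differ.

```python
from dataclasses import dataclass
from typing import Any, List, Mapping


# Stand-in types for illustration only; not the Factory SDK's definitions.
@dataclass(frozen=True)
class Message:
    role: str
    content: str


@dataclass(frozen=True)
class ModelChatInput:
    messages: List[Message]


def preprocess(example: Mapping[str, Any]) -> ModelChatInput:
    """Map one raw example to a chat-style input.

    Missing or None fields become empty strings, so the function does not
    raise on incomplete examples. No randomness or external calls are used,
    so the same input always gives the same output.
    """
    question = str(example.get("question") or "").strip()
    context = str(example.get("context") or "").strip()
    answer = str(example.get("answer") or "").strip()

    # Prepend the optional context to the question when it is present.
    prompt = f"{context}\n\n{question}" if context else question

    return ModelChatInput(messages=[
        Message(role="user", content=prompt),
        Message(role="assistant", content=answer),
    ])


full = preprocess({"question": " Who wrote it? ", "context": "A novel.", "answer": "Ann"})
partial = preprocess({"question": "Who wrote it?", "context": None})
```

Whether an incomplete example should be mapped to empty strings, as here, or rejected earlier is a design choice; this sketch favors never raising inside the preprocessor.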
Best Practices for Recipes
Recipes work best when you follow these guidelines:
DO:
- Keep preprocessors simple: Focus on mapping data fields to chat format
- Use for format conversion: Transform dataset fields into standard chat format
- Implement deterministic transformations: Same input should always yield same output
- Validate your data fields: Make sure required fields exist in your dataset
DON'T:
- Perform complex preprocessing: Heavy data cleaning, tokenization, or augmentation should be done at the dataset creation stage
- Add randomness: Avoid random operations that could produce different outputs
- Fetch external data: The preprocessor should only work with the input example
- Modify the dataset structure: Create a new dataset if you need significant changes
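Two of these guidelines, validating fields and staying deterministic, can be checked before a recipe is created. The helpers below are a hypothetical sketch, not part of Factory: `missing_fields` and `is_deterministic` are names invented here, and the required field names are assumptions.

```python
import itertools
from typing import Any, Callable, List, Mapping, Sequence

REQUIRED_FIELDS = ("question", "answer")  # hypothetical field names


def missing_fields(examples: Sequence[Mapping[str, Any]],
                   required: Sequence[str] = REQUIRED_FIELDS) -> List[int]:
    """Return the indices of examples with an absent or empty required field."""
    return [i for i, ex in enumerate(examples)
            if any(not ex.get(name) for name in required)]


def is_deterministic(preprocessor: Callable[[Mapping[str, Any]], Any],
                     examples: Sequence[Mapping[str, Any]],
                     repeats: int = 3) -> bool:
    """Call the preprocessor several times per example and compare results."""
    for ex in examples:
        first = preprocessor(ex)
        if any(preprocessor(ex) != first for _ in range(repeats - 1)):
            return False
    return True


examples = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "Capital of France?", "answer": ""},  # empty required field
]


def stable(ex):
    """Depends only on the input example."""
    return f"Q: {ex['question']}"


_counter = itertools.count()


def unstable(ex):
    """Depends on hidden state, so repeated calls differ."""
    return f"{next(_counter)}: {ex['question']}"


bad_rows = missing_fields(examples)
stable_ok = is_deterministic(stable, examples)
unstable_ok = is_deterministic(unstable, examples)
```

Repeated calls can only reveal nondeterminism, not prove its absence, so a passing check is a sanity test rather than a guarantee.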
Monitoring and Drift Detection
A key benefit of Factory's recipe system is automatic drift detection:
- During recipe creation, Factory analyzes the distribution of your training and test data
- When models are deployed, Factory compares new inputs against the original distribution
- If new data drifts from the training distribution, Factory alerts you to potential issues
This monitoring happens automatically and is visible in the Factory Hub.
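The comparison step can be pictured with a simple distance-based check: store a summary of the training embeddings, then measure what fraction of new inputs fall outside it. This is a sketch under stated assumptions; the centroid-plus-threshold summary, the 0.95 quantile, and the names `ReferenceDistribution`, `fit_reference`, and `drift_rate` are all hypothetical and do not describe Factory's actual drift detector.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence

Vector = Sequence[float]


@dataclass(frozen=True)
class ReferenceDistribution:
    centroid: List[float]
    threshold: float  # distance beyond which an input counts as drifted


def fit_reference(train_embeddings: Sequence[Vector],
                  quantile: float = 0.95) -> ReferenceDistribution:
    """Summarize training embeddings by their centroid and a distance quantile."""
    n = len(train_embeddings)
    dim = len(train_embeddings[0])
    centroid = [sum(v[i] for v in train_embeddings) / n for i in range(dim)]
    distances = sorted(math.dist(v, centroid) for v in train_embeddings)
    index = min(n - 1, int(quantile * n))
    return ReferenceDistribution(centroid=centroid, threshold=distances[index])


def drift_rate(reference: ReferenceDistribution,
               new_embeddings: Sequence[Vector]) -> float:
    """Fraction of new inputs farther from the centroid than the threshold."""
    outside = sum(1 for v in new_embeddings
                  if math.dist(v, reference.centroid) > reference.threshold)
    return outside / len(new_embeddings)


# Toy 2-D "embeddings": production inputs that match, and ones that moved.
train = [[i * 0.1, (i % 3) * 0.1] for i in range(10)]
moved = [[x + 5.0, y + 5.0] for x, y in train]

reference = fit_reference(train)
rate_in_distribution = drift_rate(reference, train)
rate_drifted = drift_rate(reference, moved)
```

An alert would then be a rule on the drift rate, for example firing when it stays above some tolerance over a window of requests.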
Recipe Versioning
Like models and datasets, recipes in Factory have a versioning system:
- RecipeMeta: The main reference for a recipe under a specific name
- RecipeRevision: A concrete version of a recipe
When you update a recipe, Factory creates a new revision and keeps it linked to the earlier ones, so past training runs stay reproducible and changes stay traceable.
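The relationship between the two objects can be modeled as a named container holding an ordered chain of revisions. The sketch below borrows the names `RecipeMeta` and `RecipeRevision` from the text, but every field and method on them (`number`, `code_hash`, `parent`, `add_revision`) is an assumption made for illustration, not Factory's actual schema.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class RecipeRevision:
    number: int            # 1-based position in the revision chain
    code_hash: str         # fingerprint of the preprocessor source
    parent: Optional[int]  # number of the previous revision, if any


@dataclass
class RecipeMeta:
    name: str
    revisions: List[RecipeRevision] = field(default_factory=list)

    def add_revision(self, preprocessor_source: str) -> RecipeRevision:
        """Append a new revision that points back to the latest one."""
        digest = hashlib.sha256(preprocessor_source.encode("utf-8")).hexdigest()
        parent = self.revisions[-1].number if self.revisions else None
        revision = RecipeRevision(number=len(self.revisions) + 1,
                                  code_hash=digest, parent=parent)
        self.revisions.append(revision)
        return revision


meta = RecipeMeta(name="qa-recipe")
first = meta.add_revision("def preprocess(ex): return ex['question']")
second = meta.add_revision("def preprocess(ex): return ex['question'].strip()")
```

Because each revision is immutable and records its parent, a model trained against revision 1 can still be traced to the exact preprocessor it used after revision 2 exists.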