Recipes

A recipe is the bridge between your raw datasets and model training in the Factory workflow. It defines precisely how each raw data example is transformed into the format required for fine-tuning language models.

What is a Recipe?

A recipe consists of two key components:

Dataset Reference: Points to the dataset that contains your raw data
Preprocessor Function: A Python function that transforms each example into the format expected by the model

The preprocessor function is applied to each example in your dataset, converting it into a ModelChatInput object that can be directly used for training.

Creating a Recipe

To create a recipe, define a preprocessor function and register it with the Factory client under a name:

from factory_sdk import FactoryClient, ModelChatInput, Role, Message
 
# Initialize the Factory client
factory = FactoryClient(
    tenant="your_tenant_name",
    project="your_project_name",
    token="your_api_key",
)
 
# Define the preprocessor function
def processor(x):
    return ModelChatInput(
        messages=[
            Message(content=x["sentence"], role=Role.USER),
            Message(content="The answer is: " + str(x["label"]), role=Role.ASSISTANT)
        ]
    )
 
# Create and register the recipe.
# `dataset` is assumed to be a dataset object created earlier in the workflow.
recipe = factory.recipe \
    .with_name("financial-phrases") \
    .using_dataset(dataset) \
    .with_preprocessor(processor) \
    .save_or_fetch()

How Recipes Work Behind the Scenes

When you create a recipe, Factory:

Analyzes your preprocessor code: Factory extracts only the necessary code, dependencies, and imports
Generates samples: Applies your preprocessor to the dataset and saves example inputs/outputs
Computes embeddings: Creates vector embeddings of your examples to analyze data distribution
Tests for IID conditions: Checks whether your training and test data follow similar distributions, as expected of independent and identically distributed (IID) samples
Stores distribution analysis: Saves this information for later monitoring and drift detection

All of this analysis is automatically visualized in the Factory Hub for inspection.
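
To make the distribution check more concrete, here is a minimal sketch of one way to compare two sets of embeddings. It is not Factory's actual implementation, which this page does not describe. It assumes embeddings are plain lists of floats and uses a permutation test on the distance between the two centroids; the function names are hypothetical.

```python
import math
import random
from typing import List, Sequence

Vector = Sequence[float]


def centroid(vectors: Sequence[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def centroid_distance(a: Sequence[Vector], b: Sequence[Vector]) -> float:
    """Euclidean distance between the centroids of two vector sets."""
    return math.dist(centroid(a), centroid(b))


def permutation_p_value(train, test, n_permutations=500, seed=0) -> float:
    """Estimate how often a random split looks at least as different
    as the real train/test split. Small values suggest the two sets
    do not follow the same distribution."""
    rng = random.Random(seed)  # fixed seed keeps the check reproducible
    observed = centroid_distance(train, test)
    pooled = list(train) + list(test)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        a, b = pooled[: len(train)], pooled[len(train):]
        if centroid_distance(a, b) >= observed:
            hits += 1
    # Add-one smoothing so the estimate is never exactly zero
    return (hits + 1) / (n_permutations + 1)


# Hypothetical 2-D "embeddings": one matching pair, one shifted pair
data_rng = random.Random(42)
train = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(30)]
similar = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(30)]
shifted = [[data_rng.gauss(5, 1), data_rng.gauss(5, 1)] for _ in range(30)]

print("similar split p-value:", permutation_p_value(train, similar))
print("shifted split p-value:", permutation_p_value(train, shifted))
```

A low p-value for the shifted split indicates that the test set is unlikely to come from the same distribution as the training set. Real embeddings are high-dimensional, so a production check would likely use a more robust statistic than centroid distance.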

The Preprocessor Function

The preprocessor function is the heart of a recipe. It should:

Take a single dataset example as input
Transform it into a ModelChatInput object
Be deterministic (the same input always produces the same output)
Handle every input variation that occurs in your dataset without raising errors

Here's a typical structure:

from factory_sdk import ModelChatInput, Role, Message
 
def preprocessor(example):
    # Transform the example into a chat format
    return ModelChatInput(
        messages=[
            Message(content=example["input_field"], role=Role.USER),
            Message(content=example["output_field"], role=Role.ASSISTANT)
        ]
    )
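
The determinism requirement can be checked before a recipe is registered. The sketch below is an illustration, not part of the Factory SDK: `is_deterministic` is a hypothetical helper, and `Role`, `Message`, and `ModelChatInput` are small stand-ins defined locally so the example runs without the SDK. It assumes preprocessor outputs can be compared with `==`.

```python
import itertools
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Mapping


# Local stand-ins for the SDK types (assumption: the real ones differ in detail)
class Role(Enum):
    USER = "user"
    ASSISTANT = "assistant"


@dataclass(frozen=True)
class Message:
    content: str
    role: Role


@dataclass(frozen=True)
class ModelChatInput:
    messages: List[Message]


def is_deterministic(preprocessor: Callable, example: Mapping, runs: int = 3) -> bool:
    """Call the preprocessor several times on copies of the same example
    and report whether every output is equal to the first one."""
    outputs = [preprocessor(dict(example)) for _ in range(runs)]
    return all(out == outputs[0] for out in outputs[1:])


def stable_preprocessor(example):
    return ModelChatInput(
        messages=[
            Message(content=example["input_field"], role=Role.USER),
            Message(content=example["output_field"], role=Role.ASSISTANT),
        ]
    )


_call_counter = itertools.count()


def unstable_preprocessor(example):
    # Hidden state leaks into the output, so repeated calls disagree
    suffix = next(_call_counter)
    return ModelChatInput(
        messages=[
            Message(content=f"{example['input_field']} #{suffix}", role=Role.USER),
            Message(content=example["output_field"], role=Role.ASSISTANT),
        ]
    )


sample = {"input_field": "What is 2 + 2?", "output_field": "4"}
print(is_deterministic(stable_preprocessor, sample))    # True
print(is_deterministic(unstable_preprocessor, sample))  # False
```

Running such a check on a handful of examples catches accidental use of counters, timestamps, or random choices inside the preprocessor.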

Best Practices for Recipes

Recipes work best when you follow these guidelines:

DO:

Keep preprocessors simple: Focus on mapping data fields to chat format
Use for format conversion: Transform dataset fields into standard chat format
Implement deterministic transformations: Same input should always yield same output
Validate your data fields: Make sure required fields exist in your dataset

DON'T:

Perform complex preprocessing: Heavy data cleaning, tokenization, or augmentation should be done at the dataset creation stage
Add randomness: Avoid random operations that could produce different outputs
Fetch external data: The preprocessor should only work with the input example
Modify the dataset structure: Create a new dataset if you need significant changes
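
As an illustration of the "validate your data fields" guideline, the following sketch scans examples for missing or empty required fields before a recipe is created. It assumes each example behaves like a dictionary; `find_invalid_examples` and the field names are hypothetical and not part of the Factory SDK.

```python
from typing import Dict, Iterable, List, Mapping, Sequence


def find_invalid_examples(
    examples: Iterable[Mapping], required: Sequence[str]
) -> Dict[int, List[str]]:
    """Return {example index: [missing field names]} for every example
    that lacks a required field or has None stored in it."""
    problems: Dict[int, List[str]] = {}
    for index, example in enumerate(examples):
        missing = [name for name in required if example.get(name) is None]
        if missing:
            problems[index] = missing
    return problems


examples = [
    {"sentence": "Profits rose 12% year over year.", "label": 1},
    {"sentence": "Revenue was flat."},           # no label
    {"sentence": None, "label": 0},              # empty sentence
]

print(find_invalid_examples(examples, required=("sentence", "label")))
# {1: ['label'], 2: ['sentence']}
```

Fixing the reported examples at the dataset stage keeps the preprocessor itself simple, in line with the guidelines above.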

Monitoring and Drift Detection

A key benefit of Factory's recipe system is automatic drift detection:

During recipe creation, Factory analyzes the distribution of your training and test data
When models are deployed, Factory compares new inputs against the original distribution
If new data drifts from the training distribution, Factory alerts you to potential issues

This monitoring happens automatically and is visible in the Factory Hub.
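
The idea behind this kind of monitoring can be sketched in a few lines. The example below is a simplified illustration under stated assumptions, not a description of how Factory computes drift: it treats embeddings as lists of floats, summarizes the training data by its centroid and typical distance from it, and flags new inputs that fall far outside that range. The `DriftBaseline` class and its `k` parameter are hypothetical.

```python
import math
import statistics
from typing import List, Sequence

Vector = Sequence[float]


class DriftBaseline:
    """Summary of a reference (training) embedding distribution."""

    def __init__(self, reference: Sequence[Vector], k: float = 3.0):
        dim = len(reference[0])
        self.centroid: List[float] = [
            statistics.fmean(v[i] for v in reference) for i in range(dim)
        ]
        distances = [math.dist(v, self.centroid) for v in reference]
        # Inputs farther than mean + k standard deviations count as outliers
        self.threshold = statistics.fmean(distances) + k * statistics.pstdev(distances)

    def is_outlier(self, vector: Vector) -> bool:
        return math.dist(vector, self.centroid) > self.threshold

    def drift_rate(self, batch: Sequence[Vector]) -> float:
        """Fraction of a batch of new inputs that look unlike the reference."""
        return sum(self.is_outlier(v) for v in batch) / len(batch)


# Reference points on a small grid around the origin
reference = [[float(x), float(y)] for x in (-1, 0, 1) for y in (-1, 0, 1)]
baseline = DriftBaseline(reference)

new_inputs = [[0.5, 0.5], [10.0, 10.0]]
print(baseline.drift_rate(new_inputs))  # 0.5
```

An alert would then be raised when the drift rate over a recent window exceeds a level chosen for the application.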

Recipe Versioning

Like models and datasets, recipes in Factory are versioned. Two objects are involved:

RecipeMeta: The main reference for a recipe under a specific name
RecipeRevision: A concrete version of a recipe

When you update a recipe, Factory creates a new revision and keeps the link to the previous revisions, so earlier training runs stay reproducible and changes remain traceable.
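
The relationship between a named recipe and its revisions can be pictured with a small conceptual model. The classes below are purely illustrative and do not reflect the fields or behavior of Factory's `RecipeMeta` and `RecipeRevision`; the names, the content hash, and the rule of reusing an unchanged revision are all assumptions made for this sketch.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class RevisionSketch:
    """One concrete version of a recipe (hypothetical structure)."""
    number: int
    source_hash: str
    parent: Optional[int]  # revision number this one was derived from


@dataclass
class RecipeMetaSketch:
    """Named entry point that collects all revisions (hypothetical structure)."""
    name: str
    revisions: List[RevisionSketch] = field(default_factory=list)

    def add_revision(self, preprocessor_source: str) -> RevisionSketch:
        digest = hashlib.sha256(preprocessor_source.encode("utf-8")).hexdigest()
        # Assumption: identical source maps back to the latest revision
        if self.revisions and self.revisions[-1].source_hash == digest:
            return self.revisions[-1]
        parent = self.revisions[-1].number if self.revisions else None
        revision = RevisionSketch(
            number=len(self.revisions) + 1, source_hash=digest, parent=parent
        )
        self.revisions.append(revision)
        return revision


meta = RecipeMetaSketch(name="financial-phrases")
first = meta.add_revision("def processor(x): return x['sentence']")
same = meta.add_revision("def processor(x): return x['sentence']")
second = meta.add_revision("def processor(x): return x['sentence'].strip()")

print(first.number, same.number, second.number, second.parent)  # 1 1 2 1
```

Because each revision records its parent, any model trained on an older revision can still be traced back to the exact preprocessor that produced its inputs.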

Common Recipe Patterns

Classification Tasks

def processor(x):
    return ModelChatInput(
        messages=[
            Message(content=f"Classify this text: {x['text']}", role=Role.USER),
            Message(content=f"The class is {x['label']}", role=Role.ASSISTANT)
        ]
    )

Question Answering

def processor(x):
    return ModelChatInput(
        messages=[
            Message(content=f"Question: {x['question']}\nContext: {x['context']}", role=Role.USER),
            Message(content=x['answer'], role=Role.ASSISTANT)
        ]
    )

Summarization

def processor(x):
    return ModelChatInput(
        messages=[
            Message(content=f"Summarize this text: {x['document']}", role=Role.USER),
            Message(content=x['summary'], role=Role.ASSISTANT)
        ]
    )

Multi-turn Conversations

def processor(x):
    # Map each conversation turn to a chat message, preserving order
    messages = []
    for turn in x['conversation']:
        if turn['speaker'] == 'user':
            messages.append(Message(content=turn['text'], role=Role.USER))
        else:
            # Any speaker other than 'user' is treated as the assistant
            messages.append(Message(content=turn['text'], role=Role.ASSISTANT))
    return ModelChatInput(messages=messages)
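
One possible variant of the multi-turn pattern maps speaker labels to roles explicitly, so an unexpected label surfaces as a clear error instead of being silently treated as an assistant turn. This is a sketch under assumptions: `Role`, `Message`, and `ModelChatInput` are local stand-ins so the example runs without the SDK, and the `SPEAKER_TO_ROLE` table and field names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


# Local stand-ins for the SDK types (assumption: the real ones differ in detail)
class Role(Enum):
    USER = "user"
    ASSISTANT = "assistant"


@dataclass(frozen=True)
class Message:
    content: str
    role: Role


@dataclass(frozen=True)
class ModelChatInput:
    messages: List[Message]


# Explicit mapping from dataset speaker labels to chat roles
SPEAKER_TO_ROLE = {"user": Role.USER, "assistant": Role.ASSISTANT}


def processor(x):
    messages = []
    for position, turn in enumerate(x["conversation"]):
        speaker = turn["speaker"]
        if speaker not in SPEAKER_TO_ROLE:
            raise ValueError(f"turn {position}: unknown speaker {speaker!r}")
        messages.append(Message(content=turn["text"], role=SPEAKER_TO_ROLE[speaker]))
    return ModelChatInput(messages=messages)


example = {
    "conversation": [
        {"speaker": "user", "text": "What is the capital of France?"},
        {"speaker": "assistant", "text": "Paris."},
    ]
}
result = processor(example)
print([m.role.value for m in result.messages])  # ['user', 'assistant']
```

Whether to raise on unknown speakers or to clean them up at the dataset stage is a design choice; the guidelines above favor fixing the data before it reaches the preprocessor.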
