Factory

Recipes

Recipes bridge your raw datasets and model training in the Factory workflow. A recipe defines how each raw data example is transformed into the format required for fine-tuning a language model.

What is a Recipe?

A recipe consists of two key components:

  1. Dataset Reference: Points to the dataset that contains your raw data
  2. Preprocessor Function: A Python function that transforms each example into the format expected by the model

The preprocessor function is applied to each example in your dataset, converting it into a ModelChatInput object that can be directly used for training.

Creating a Recipe

To create a recipe, define a preprocessor function and register it together with a dataset through the Factory client:

from factory_sdk import FactoryClient, ModelChatInput, Role, Message
 
# Initialize the Factory client
factory = FactoryClient(
    tenant="your_tenant_name",
    project="your_project_name",
    token="your_api_key",
)
 
# Define the preprocessor function
def processor(x):
    return ModelChatInput(
        messages=[
            Message(content=x["sentence"], role=Role.USER),
            Message(content="The answer is: " + str(x["label"]), role=Role.ASSISTANT)
        ]
    )
 
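# Create and register the recipe.
# `dataset` is not defined in this snippet; it is assumed to be a dataset
# object you created or loaded earlier.
recipe = (
    factory.recipe
    .with_name("financial-phrases")
    .using_dataset(dataset)
    .with_preprocessor(processor)
    .save_or_fetch()
)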

How Recipes Work Behind the Scenes

When you create a recipe, Factory:

  1. Analyzes your preprocessor code: Factory extracts only the necessary code, dependencies, and imports
  2. Generates samples: Applies your preprocessor to the dataset and saves example inputs/outputs
  3. Computes embeddings: Creates vector embeddings of your examples to analyze data distribution
  4. Tests for IID conditions: Checks whether your training and test data follow similar distributions (IID stands for independent and identically distributed)
  5. Stores distribution analysis: Saves this information for later monitoring and drift detection

All of this analysis is automatically visualized in the Factory Hub for inspection.
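
As a rough illustration of step 4, the sketch below compares two sets of embedding vectors with a permutation test on the distance between their centroids. This page does not describe the test Factory actually runs, so treat this as one possible approach rather than Factory's implementation. The function names (centroid, centroid_distance, permutation_p_value) and the synthetic two-dimensional vectors are assumptions made for the example.

```python
import math
import random


def centroid(vectors):
    """Return the component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def centroid_distance(a, b):
    """Euclidean distance between the centroids of two vector sets."""
    return math.dist(centroid(a), centroid(b))


def permutation_p_value(train, test, n_permutations=500, seed=0):
    """Estimate how often a random re-split of the pooled vectors yields a
    centroid gap at least as large as the observed one. A small value
    suggests the two sets do not share a distribution."""
    rng = random.Random(seed)  # seeded so the check itself is repeatable
    observed = centroid_distance(train, test)
    pooled = list(train) + list(test)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        a, b = pooled[:len(train)], pooled[len(train):]
        if centroid_distance(a, b) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


# Synthetic stand-ins for real embeddings (hypothetical data).
data_rng = random.Random(42)
train = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(40)]
similar_test = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(40)]
shifted_test = [[data_rng.gauss(5, 1), data_rng.gauss(5, 1)] for _ in range(40)]

print("similar split p-value:", permutation_p_value(train, similar_test))
print("shifted split p-value:", permutation_p_value(train, shifted_test))
```

A centroid comparison only detects shifts in the mean of the embeddings. A production check would likely use a richer statistic, but the structure (measure a gap, then ask how unusual it is under random re-splitting) stays the same.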

The Preprocessor Function

The preprocessor function is the heart of a recipe. It should:

  1. Take a single dataset example as input
  2. Transform it into a ModelChatInput object
  3. Be deterministic (same input always produces same output)
  4. Handle all possible input variations without errors

Here's a typical structure:

from factory_sdk import ModelChatInput, Role, Message
 
def preprocessor(example):
    # Transform the example into a chat format
    return ModelChatInput(
        messages=[
            Message(content=example["input_field"], role=Role.USER),
            Message(content=example["output_field"], role=Role.ASSISTANT)
        ]
    )
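
Before registering a recipe, it can help to verify the requirements listed above on a handful of examples. The sketch below runs a preprocessor twice per example and reports wrong return types and non-deterministic output. The Role, Message, and ModelChatInput classes here are minimal stand-ins that only mimic the factory_sdk types so the snippet runs without the SDK, and check_preprocessor is a hypothetical helper, not part of the SDK.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


# Stand-ins for the factory_sdk types (assumed shapes, for illustration only).
class Role(Enum):
    USER = "user"
    ASSISTANT = "assistant"


@dataclass
class Message:
    content: str
    role: Role


@dataclass
class ModelChatInput:
    messages: List[Message]


def preprocessor(example):
    # Same structure as the typical preprocessor shown above.
    return ModelChatInput(
        messages=[
            Message(content=example["input_field"], role=Role.USER),
            Message(content=example["output_field"], role=Role.ASSISTANT),
        ]
    )


def check_preprocessor(fn, examples):
    """Call fn twice on each example and collect a description of every
    example whose result has the wrong type or changes between calls."""
    problems = []
    for i, example in enumerate(examples):
        first, second = fn(example), fn(example)
        if not isinstance(first, ModelChatInput):
            problems.append(f"example {i}: returned {type(first).__name__}")
        elif first != second:
            problems.append(f"example {i}: output differs between calls")
    return problems


samples = [{"input_field": "What is 2 + 2?", "output_field": "4"}]
problems = check_preprocessor(preprocessor, samples)
print(problems or "no problems found")
```

Calling the function twice cannot prove determinism, but it catches the common cases such as random sampling or timestamps inside the preprocessor.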

Best Practices for Recipes

Recipes work best when you follow these guidelines:

DO:

  • Keep preprocessors simple: Focus on mapping data fields to chat format
  • Use recipes for format conversion: Transform dataset fields into standard chat format
  • Implement deterministic transformations: Same input should always yield same output
  • Validate your data fields: Make sure required fields exist in your dataset

DON'T:

  • Perform complex preprocessing: Heavy data cleaning, tokenization, or augmentation should be done at the dataset creation stage
  • Add randomness: Avoid random operations that could produce different outputs
  • Fetch external data: The preprocessor should only work with the input example
  • Modify the dataset structure: Create a new dataset if you need significant changes
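
To make the "validate your data fields" guideline concrete, the sketch below scans examples for required fields that are absent or None before a preprocessor is registered. It assumes each example behaves like a dictionary. find_missing_fields is a hypothetical helper, and the sample records are invented for the example, reusing the sentence and label field names from the recipe above.

```python
def find_missing_fields(examples, required_fields):
    """Return (index, missing_field_names) for every example that lacks a
    required field or holds None for it."""
    issues = []
    for i, example in enumerate(examples):
        missing = [name for name in required_fields if example.get(name) is None]
        if missing:
            issues.append((i, missing))
    return issues


# Hypothetical records for illustration.
examples = [
    {"sentence": "Profits rose 12%.", "label": "positive"},
    {"sentence": "Revenue was flat."},          # label is absent
    {"sentence": None, "label": "negative"},    # sentence is None
]

issues = find_missing_fields(examples, ["sentence", "label"])
for index, missing in issues:
    print(f"example {index} is missing: {', '.join(missing)}")
```

Running a check like this at dataset creation time keeps the preprocessor itself simple, in line with the guidelines above.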

Monitoring and Drift Detection

A key benefit of Factory's recipe system is automatic drift detection:

  1. During recipe creation, Factory analyzes the distribution of your training and test data
  2. When models are deployed, Factory compares new inputs against the original distribution
  3. If new data drifts from the training distribution, Factory alerts you to potential issues

This monitoring happens automatically and is visible in the Factory Hub.
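
The three steps above can be sketched as a reference summary built from training embeddings plus a score for each new input. This page does not specify how Factory measures drift, so the approach below (distance to the training centroid, expressed as a z-score) is only one simple possibility. The names fit_reference, drift_score, and is_drifted, the threshold of 3.0, and the toy vectors are all assumptions.

```python
import math
import statistics


def fit_reference(embeddings):
    """Summarize training embeddings by their centroid and by the mean and
    standard deviation of each embedding's distance to that centroid."""
    n = len(embeddings)
    center = [sum(column) / n for column in zip(*embeddings)]
    distances = [math.dist(e, center) for e in embeddings]
    return {
        "centroid": center,
        "mean": statistics.mean(distances),
        "stdev": statistics.pstdev(distances),
    }


def drift_score(reference, embedding):
    """How many standard deviations the input's distance to the centroid
    lies above the typical training distance."""
    distance = math.dist(embedding, reference["centroid"])
    if reference["stdev"] == 0:
        # All training points were equally far from the centroid.
        return 0.0 if distance <= reference["mean"] else math.inf
    return (distance - reference["mean"]) / reference["stdev"]


def is_drifted(reference, embedding, threshold=3.0):
    """Flag inputs that sit unusually far from the training data."""
    return drift_score(reference, embedding) > threshold


# Toy 2-D vectors standing in for real embeddings.
training = [[0, 0], [2, 0], [0, 2], [2, 2], [1, 1]]
reference = fit_reference(training)

print(is_drifted(reference, [1.2, 0.9]))   # input close to the training data
print(is_drifted(reference, [10, 10]))     # input far from the training data
```

Per-input scores like this are usually aggregated over a time window before alerting, since a single unusual request is rarely a sign of drift.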

Recipe Versioning

Like models and datasets, recipes in Factory are versioned. Two objects are involved:

  1. RecipeMeta: The main reference for a recipe under a specific name
  2. RecipeRevision: A concrete version of a recipe

When you update a recipe, Factory creates a new revision and keeps it linked to the previous revisions, which preserves reproducibility and a traceable history.
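
The relationship between a named recipe and its revisions can be pictured with a small conceptual model. The classes below are not the SDK's RecipeMeta and RecipeRevision, whose fields are not documented on this page. RecipeSketch, RevisionSketch, the fingerprint field, and the rule that an unchanged preprocessor reuses the latest revision are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class RevisionSketch:
    """One concrete version of a recipe (conceptual, not the SDK class)."""
    number: int
    preprocessor_source: str
    fingerprint: str
    parent: Optional[int]  # number of the previous revision, if any


@dataclass
class RecipeSketch:
    """The named entry that groups all revisions of a recipe."""
    name: str
    revisions: List[RevisionSketch] = field(default_factory=list)

    def update(self, preprocessor_source: str) -> RevisionSketch:
        fingerprint = hashlib.sha256(preprocessor_source.encode("utf-8")).hexdigest()
        latest = self.revisions[-1] if self.revisions else None
        # Assumed behavior: identical content reuses the latest revision.
        if latest is not None and latest.fingerprint == fingerprint:
            return latest
        revision = RevisionSketch(
            number=len(self.revisions) + 1,
            preprocessor_source=preprocessor_source,
            fingerprint=fingerprint,
            parent=latest.number if latest else None,
        )
        self.revisions.append(revision)
        return revision


recipe = RecipeSketch(name="financial-phrases")
first = recipe.update("def processor(x): ...")
second = recipe.update("def processor(x): return x")
print(second.number, "follows", second.parent)
```

Keeping a parent link on every revision is what makes it possible to trace a trained model back to the exact preprocessor that produced its training data.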

Common Recipe Patterns

Each pattern below assumes the imports from the earlier examples (ModelChatInput, Role, and Message from factory_sdk).

Classification Tasks

def processor(x):
    return ModelChatInput(
        messages=[
            Message(content=f"Classify this text: {x['text']}", role=Role.USER),
            Message(content=f"The class is {x['label']}", role=Role.ASSISTANT)
        ]
    )

Question Answering

def processor(x):
    return ModelChatInput(
        messages=[
            Message(content=f"Question: {x['question']}\nContext: {x['context']}", role=Role.USER),
            Message(content=x['answer'], role=Role.ASSISTANT)
        ]
    )

Summarization

def processor(x):
    return ModelChatInput(
        messages=[
            Message(content=f"Summarize this text: {x['document']}", role=Role.USER),
            Message(content=x['summary'], role=Role.ASSISTANT)
        ]
    )

Multi-turn Conversations

def processor(x):
    # Map each turn to a Message; any speaker other than 'user' is treated
    # as the assistant.
    messages = []
    for turn in x['conversation']:
        role = Role.USER if turn['speaker'] == 'user' else Role.ASSISTANT
        messages.append(Message(content=turn['text'], role=role))
    return ModelChatInput(messages=messages)
