
Create Deployment

After training and evaluating your adapter, the next step is deploying it for inference. Factory provides a streamlined deployment system that allows you to serve your fine-tuned models with production-grade features including:

  • OpenAI-compatible API
  • Real-time monitoring
  • Data drift detection
  • Multi-adapter support
  • Optimized inference

Deployment Workflow

graph LR
    A[Train Adapter] --> B[Evaluate Adapter]
    B --> C[Deploy Adapter]
    C --> D[Monitor Performance]
    D --> E[Detect Data Drift]

Creating a Deployment

To deploy an adapter in Factory:

from factory_sdk import FactoryClient, DeploymentArgs
 
# Initialize the Factory client
factory = FactoryClient(
    tenant="your_tenant_name",
    project="your_project_name",
    token="your_api_key",
)
 
# Deploy your adapter
deployment = factory.deployment \
    .with_name("sentiment-api") \
    .for_adapter(adapter) \
    .with_config(DeploymentArgs(
        port=8000,
        dtype="fp16",
        max_memory_utilization=0.8,
        swap_space=4
    )) \
    .run(daemon=True)
 
print("Deployment is running on http://localhost:8000")

This deployment will:

  1. Start a local API server on port 8000
  2. Load your fine-tuned adapter
  3. Expose an OpenAI-compatible interface
  4. Stream metrics and analytics to the Factory Hub
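
With `daemon=True`, `run()` returns while the server is still starting up, so clients should wait for the port to accept connections before sending traffic. A minimal readiness probe might look like the following (plain Python, not part of the Factory SDK):

```python
import socket
import time

def wait_until_ready(host: str, port: int, timeout_s: float = 30.0) -> bool:
    """Poll until a TCP connection to the API server succeeds, or time out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # A successful connect means the server socket is listening.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.5)  # server not up yet; retry
    return False
```

For example, `wait_until_ready("localhost", 8000)` before the first request avoids connection errors during model loading.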

Deployment Configuration

Control your deployment with these parameters:

| Parameter | Description | Default |
| --- | --- | --- |
| port | Port number for the API server | 9000 |
| quantization_bits | Number of bits for quantization (4 for 4-bit precision) | 4 |
| dtype | Data type for inference ("fp16", "bf16", "fp32", or "auto") | "auto" |
| max_seq_len_to_capture | Maximum sequence length to analyze for drift | 1024 |
| max_seq_len | Maximum sequence length for generation | 1024 |
| max_batched_tokens | Maximum number of tokens to batch | 4096 |
| max_memory_utilization | Fraction of GPU memory to use (0.0-1.0) | 0.8 |
| swap_space | Size of CPU swap space in GB | 4 |
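
To build intuition for `max_memory_utilization`: it caps the fraction of total GPU memory the server may claim for weights and cache. A rough back-of-envelope helper (hypothetical, not an SDK function):

```python
# Hypothetical helper: estimate the GPU memory budget implied by
# max_memory_utilization (this is not part of the Factory SDK).

def gpu_memory_budget_gb(total_gpu_gb: float, max_memory_utilization: float) -> float:
    """Memory (GB) the server may claim for weights and KV cache."""
    if not 0.0 < max_memory_utilization <= 1.0:
        raise ValueError("max_memory_utilization must be in (0.0, 1.0]")
    return total_gpu_gb * max_memory_utilization

# With the default of 0.8 on a 24 GB GPU, the budget is about 19.2 GB;
# swap_space then provides extra CPU-side room when that budget is tight.
print(gpu_memory_budget_gb(24, 0.8))
```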

Interacting with Deployed Models

Factory deployments expose an OpenAI-compatible API, making them easy to integrate with existing tools and applications:

from openai import OpenAI
 
# Configure the OpenAI client to use your Factory deployment
client = OpenAI(
    api_key="EMPTY",  # Not required for Factory deployments
    base_url="http://localhost:8000/v1"
)
 
# Make a request to your deployed model
response = client.chat.completions.create(
    model="sentiment-api",  # Use the deployment name
    messages=[
        {"role": "user", "content": "Analyze the sentiment of: I love this product!"}
    ],
    temperature=0.1
)
 
print(response.choices[0].message.content)

Multi-Adapter Deployments

Factory supports deploying multiple adapters in a single deployment:

# Deploy multiple adapters
deployment = factory.deployment \
    .with_name("multi-model-api") \
    .for_adapter(sentiment_adapter, name="sentiment") \
    .for_adapter(summarization_adapter, name="summarization") \
    .with_config(DeploymentArgs(port=8000)) \
    .run(daemon=True)

Access specific adapters by name:

# Access the sentiment adapter
sentiment_response = client.chat.completions.create(
    model="sentiment",
    messages=[{"role": "user", "content": "How was your day?"}]
)
 
# Access the summarization adapter
summary_response = client.chat.completions.create(
    model="summarization",
    messages=[{"role": "user", "content": "Summarize this long article..."}]
)
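
Because each adapter is addressed purely through the `model` field, a thin client-side dispatch layer keeps task-to-adapter naming in one place. A sketch (the mapping and helper are hypothetical, not a Factory API):

```python
# Hypothetical client-side router: map task names to the adapter names
# registered in the multi-adapter deployment above.

ADAPTER_BY_TASK = {
    "sentiment": "sentiment",
    "summarize": "summarization",
}

def model_for_task(task: str) -> str:
    """Return the `model` value to send for a given task."""
    try:
        return ADAPTER_BY_TASK[task]
    except KeyError:
        raise ValueError(f"No adapter deployed for task {task!r}") from None

# Usage with the OpenAI client:
#   client.chat.completions.create(model=model_for_task("summarize"), ...)
```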

Real-Time Monitoring

Factory automatically collects metrics from your deployment:

  • Performance Metrics:

    • Request throughput
    • Token generation speed
    • Latency percentiles
    • GPU/CPU utilization
  • Distribution Metrics:

    • Input data distribution
    • Drift detection results
    • Embedding space visualization

All metrics are streamed to the Factory Hub for real-time monitoring. Factory will automatically analyze your live traffic against the training data distribution to detect potential data drift.
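
For intuition about the latency percentiles reported in the Hub, here is how p50/p95/p99 can be computed from raw per-request latencies with the standard library (illustrative only; Factory computes these server-side):

```python
import statistics

def latency_percentiles(latencies_ms):
    """Compute p50/p95/p99 from a list of request latencies in milliseconds."""
    # statistics.quantiles with n=100 returns the 99 cut points p1..p99.
    cuts = statistics.quantiles(latencies_ms, n=100)
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

# Synthetic sample: mostly fast requests with a slow tail.
samples = [20, 22, 25, 30, 35, 40, 80, 120] * 25
print(latency_percentiles(samples))
```

The tail percentiles (p95/p99) are usually the ones worth alerting on, since averages hide slow outliers.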

Data Drift Detection

A key feature of Factory deployments is automatic data drift detection:

  1. Recipe Integration: Uses the same recipe from training to embed incoming requests
  2. Embedding Analysis: Compares production traffic embeddings to training data
  3. Statistical Testing: Applies statistical tests to detect distribution shifts
  4. Alerting: Notifies you when data drift exceeds thresholds

This helps identify when your model is receiving inputs that differ significantly from its training data, which may affect performance.
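
The statistical-testing step can be illustrated with a two-sample Kolmogorov-Smirnov statistic on a one-dimensional score (say, a single embedding dimension). This is a simplification for intuition; the source does not specify which test Factory applies internally:

```python
def ks_statistic(sample_a, sample_b):
    """Largest gap between the two empirical CDFs (0.0 = identical, 1.0 = disjoint)."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for v in a + b:
        cdf_a = sum(x <= v for x in a) / len(a)
        cdf_b = sum(x <= v for x in b) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

training_scores = [0.1, 0.2, 0.3, 0.4, 0.5]
live_scores = [1.1, 1.2, 1.3, 1.4, 1.5]  # clearly shifted distribution

DRIFT_THRESHOLD = 0.5  # hypothetical alerting threshold
if ks_statistic(training_scores, live_scores) > DRIFT_THRESHOLD:
    print("data drift detected")
```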

Deployment Best Practices

  • Start with quantization: Enable 4-bit quantization to reduce memory usage
  • Adjust memory utilization: Set based on your GPU's available memory
  • Monitor latency: Check Factory Hub for response time metrics
  • Watch for drift: Regularly review data drift visualizations
  • Scale horizontally: Deploy multiple instances for high-traffic applications
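
To see why the first tip matters, a rough weight-only estimate shows what 4-bit quantization saves (this ignores KV cache, activations, and runtime overhead, so treat it as a lower bound):

```python
def weight_memory_gb(n_params_billion: float, bits_per_param: int) -> float:
    """Rough weight-only memory footprint in GB (no KV cache or overhead)."""
    total_bytes = n_params_billion * 1e9 * bits_per_param / 8
    return total_bytes / 1e9

# A 7B-parameter model: fp16 needs roughly 4x the weight memory of 4-bit.
fp16_gb = weight_memory_gb(7, 16)
int4_gb = weight_memory_gb(7, 4)
print(fp16_gb, int4_gb)
```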

Example: Production Deployment

Here's a complete example for a production deployment:

from factory_sdk import FactoryClient, DeploymentArgs
 
# Initialize client
factory = FactoryClient(
    tenant="your_tenant_name",
    project="your_project_name",
    token="your_api_key",
)
 
# Load your fine-tuned adapter
adapter = factory.adapter.with_name("financial-sentiment").load()
 
# Create optimal deployment configuration
deployment_config = DeploymentArgs(
    port=8080,
    dtype="bf16",             # Use bfloat16 for modern GPUs
    quantization_bits=4,      # Enable 4-bit quantization
    max_seq_len=2048,         # Support longer sequences
    max_batched_tokens=8192,  # Increase batch size for throughput
    max_memory_utilization=0.9,  # Use more GPU memory
    swap_space=8              # Allocate more CPU swap if needed
)
 
# Deploy the model
deployment = factory.deployment \
    .with_name("financial-api") \
    .for_adapter(adapter) \
    .with_config(deployment_config) \
    .run(daemon=True)
 
print("Financial sentiment API running at http://localhost:8080/v1")
