
Create Deployment

After training and evaluating your adapter, the next step is deploying it for inference. Factory provides a streamlined deployment system that allows you to serve your fine-tuned models with production-grade features including:

  • OpenAI-compatible API
  • Real-time monitoring
  • Data drift detection
  • Multi-adapter support
  • Optimized inference

Deployment Workflow

graph LR
    A[Train Adapter] --> B[Evaluate Adapter]
    B --> C[Deploy Adapter]
    C --> D[Monitor Performance]
    D --> E[Detect Data Drift]

Creating a Deployment

To deploy an adapter in Factory:

from factory_sdk import FactoryClient, DeploymentArgs
 
# Initialize the Factory client
factory = FactoryClient(
    tenant="your_tenant_name",
    project="your_project_name",
    token="your_api_key",
)
 
# Deploy your adapter
deployment = factory.deployment \
    .with_name("sentiment-api") \
    .for_adapter(adapter) \
    .with_config(DeploymentArgs(
        port=8000,
        dtype="fp16",
        max_memory_utilization=0.8,
        swap_space=4
    )) \
    .run(daemon=True)
 
print("Deployment is running on http://localhost:8000")

This deployment will:

  1. Start a local API server on port 8000
  2. Load your fine-tuned adapter
  3. Expose an OpenAI-compatible interface
  4. Stream metrics and analytics to the Factory Hub
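
With `daemon=True`, `run()` returns while the server is still starting up, so clients should wait for the port to accept connections before sending traffic. A minimal readiness probe might look like the following (plain Python, not part of the Factory SDK):

```python
import socket
import time

def wait_until_ready(host: str, port: int, timeout_s: float = 30.0) -> bool:
    """Poll until a TCP connection to the API server succeeds, or time out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # A successful connect means the server socket is listening.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.5)  # server not up yet; retry
    return False
```

For example, `wait_until_ready("localhost", 8000)` before the first request avoids connection errors during model loading.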

Deployment Configuration

Control your deployment with these parameters:

| Parameter | Description | Default |
| --- | --- | --- |
| port | Port number for the API server | 9000 |
| quantization_bits | Number of bits for quantization (4 for 4-bit precision) | 4 |
| dtype | Data type for inference ("fp16", "bf16", "fp32", or "auto") | "auto" |
| max_seq_len_to_capture | Maximum sequence length to analyze for drift | 1024 |
| max_seq_len | Maximum sequence length for generation | 1024 |
| max_batched_tokens | Maximum number of tokens to batch | 4096 |
| max_memory_utilization | Fraction of GPU memory to use (0.0-1.0) | 0.8 |
| swap_space | Size of CPU swap space in GB | 4 |
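
To build intuition for `max_memory_utilization`: it caps the fraction of total GPU memory the server may claim for weights and cache. A rough back-of-envelope helper (hypothetical, not an SDK function):

```python
# Hypothetical helper: estimate the GPU memory budget implied by
# max_memory_utilization (this is not part of the Factory SDK).

def gpu_memory_budget_gb(total_gpu_gb: float, max_memory_utilization: float) -> float:
    """Memory (GB) the server may claim for weights and KV cache."""
    if not 0.0 < max_memory_utilization <= 1.0:
        raise ValueError("max_memory_utilization must be in (0.0, 1.0]")
    return total_gpu_gb * max_memory_utilization

# With the default of 0.8 on a 24 GB GPU, the budget is about 19.2 GB;
# swap_space then provides extra CPU-side room when that budget is tight.
print(gpu_memory_budget_gb(24, 0.8))
```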

Interacting with Deployed Models

Factory deployments expose an OpenAI-compatible API, making them easy to integrate with existing tools and applications:

from openai import OpenAI
 
# Configure the OpenAI client to use your Factory deployment
client = OpenAI(
    api_key="EMPTY",  # Not required for Factory deployments
    base_url="http://localhost:8000/v1"
)
 
# Make a request to your deployed model
response = client.chat.completions.create(
    model="sentiment-api",  # Use the deployment name
    messages=[
        {"role": "user", "content": "Analyze the sentiment of: I love this product!"}
    ],
    temperature=0.1
)
 
print(response.choices[0].message.content)

Multi-Adapter Deployments

Factory supports deploying multiple adapters in a single deployment:

# Deploy multiple adapters
deployment = factory.deployment \
    .with_name("multi-model-api") \
    .for_adapter(sentiment_adapter, name="sentiment") \
    .for_adapter(summarization_adapter, name="summarization") \
    .with_config(DeploymentArgs(port=8000)) \
    .run(daemon=True)

Access specific adapters by name:

# Access the sentiment adapter
sentiment_response = client.chat.completions.create(
    model="sentiment",
    messages=[{"role": "user", "content": "How was your day?"}]
)
 
# Access the summarization adapter
summary_response = client.chat.completions.create(
    model="summarization",
    messages=[{"role": "user", "content": "Summarize this long article..."}]
)
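
Because each adapter is addressed purely through the `model` field, a thin client-side dispatch layer keeps task-to-adapter naming in one place. A sketch (the mapping and helper are hypothetical, not a Factory API):

```python
# Hypothetical client-side router: map task names to the adapter names
# registered in the multi-adapter deployment above.

ADAPTER_BY_TASK = {
    "sentiment": "sentiment",
    "summarize": "summarization",
}

def model_for_task(task: str) -> str:
    """Return the `model` value to send for a given task."""
    try:
        return ADAPTER_BY_TASK[task]
    except KeyError:
        raise ValueError(f"No adapter deployed for task {task!r}") from None

# Usage with the OpenAI client:
#   client.chat.completions.create(model=model_for_task("summarize"), ...)
```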

Real-Time Monitoring

Factory automatically collects metrics from your deployment:

  • Performance Metrics:

    • Request throughput
    • Token generation speed
    • Latency percentiles
    • GPU/CPU utilization
  • Distribution Metrics:

    • Input data distribution
    • Drift detection results
    • Embedding space visualization

All metrics are streamed to the Factory Hub for real-time monitoring. Factory will automatically analyze your live traffic against the training data distribution to detect potential data drift.
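
For intuition about the latency percentiles reported in the Hub, here is how p50/p95/p99 can be computed from raw per-request latencies with the standard library (illustrative only; Factory computes these server-side):

```python
import statistics

def latency_percentiles(latencies_ms):
    """Compute p50/p95/p99 from a list of request latencies in milliseconds."""
    # statistics.quantiles with n=100 returns the 99 cut points p1..p99.
    cuts = statistics.quantiles(latencies_ms, n=100)
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

# Synthetic sample: mostly fast requests with a slow tail.
samples = [20, 22, 25, 30, 35, 40, 80, 120] * 25
print(latency_percentiles(samples))
```

The tail percentiles (p95/p99) are usually the ones worth alerting on, since averages hide slow outliers.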

Data Drift Detection

A key feature of Factory deployments is automatic data drift detection:

  1. Recipe Integration: Uses the same recipe from training to embed incoming requests
  2. Embedding Analysis: Compares production traffic embeddings to training data
  3. Statistical Testing: Applies statistical tests to detect distribution shifts
  4. Alerting: Notifies you when data drift exceeds thresholds

This helps identify when your model is receiving inputs that differ significantly from its training data, which may affect performance.
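
The statistical-testing step can be illustrated with a two-sample Kolmogorov-Smirnov statistic on a one-dimensional score (say, a single embedding dimension). This is a simplification for intuition; the source does not specify which test Factory applies internally:

```python
def ks_statistic(sample_a, sample_b):
    """Largest gap between the two empirical CDFs (0.0 = identical, 1.0 = disjoint)."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for v in a + b:
        cdf_a = sum(x <= v for x in a) / len(a)
        cdf_b = sum(x <= v for x in b) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

training_scores = [0.1, 0.2, 0.3, 0.4, 0.5]
live_scores = [1.1, 1.2, 1.3, 1.4, 1.5]  # clearly shifted distribution

DRIFT_THRESHOLD = 0.5  # hypothetical alerting threshold
if ks_statistic(training_scores, live_scores) > DRIFT_THRESHOLD:
    print("data drift detected")
```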

Deployment Best Practices

  • Start with quantization: Enable 4-bit quantization to reduce memory usage
  • Adjust memory utilization: Set based on your GPU's available memory
  • Monitor latency: Check Factory Hub for response time metrics
  • Watch for drift: Regularly review data drift visualizations
  • Scale horizontally: Deploy multiple instances for high-traffic applications
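
To see why the first tip matters, a rough weight-only estimate shows what 4-bit quantization saves (this ignores KV cache, activations, and runtime overhead, so treat it as a lower bound):

```python
def weight_memory_gb(n_params_billion: float, bits_per_param: int) -> float:
    """Rough weight-only memory footprint in GB (no KV cache or overhead)."""
    total_bytes = n_params_billion * 1e9 * bits_per_param / 8
    return total_bytes / 1e9

# A 7B-parameter model: fp16 needs roughly 4x the weight memory of 4-bit.
fp16_gb = weight_memory_gb(7, 16)
int4_gb = weight_memory_gb(7, 4)
print(fp16_gb, int4_gb)
```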

Example: Production Deployment

Here's a complete example for a production deployment:

from factory_sdk import FactoryClient, DeploymentArgs
 
# Initialize client
factory = FactoryClient(
    tenant="your_tenant_name",
    project="your_project_name",
    token="your_api_key",
)
 
# Load your fine-tuned adapter
adapter = factory.adapter.with_name("financial-sentiment").load()
 
# Create optimal deployment configuration
deployment_config = DeploymentArgs(
    port=8080,
    dtype="bf16",             # Use bfloat16 for modern GPUs
    quantization_bits=4,      # Enable 4-bit quantization
    max_seq_len=2048,         # Support longer sequences
    max_batched_tokens=8192,  # Increase batch size for throughput
    max_memory_utilization=0.9,  # Use more GPU memory
    swap_space=8              # Allocate more CPU swap if needed
)
 
# Deploy the model
deployment = factory.deployment \
    .with_name("financial-api") \
    .for_adapter(adapter) \
    .with_config(deployment_config) \
    .run(daemon=True)
 
print("Financial sentiment API running at http://localhost:8080/v1")
