# Create Deployment
After training and evaluating your adapter, the next step is deploying it for inference. Factory provides a streamlined deployment system that allows you to serve your fine-tuned models with production-grade features including:
- OpenAI-compatible API
- Real-time monitoring
- Data drift detection
- Multi-adapter support
- Optimized inference
## Deployment Workflow

### Creating a Deployment
To deploy an adapter in Factory:
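The deployment snippet itself is missing from this copy of the page; the following is a minimal sketch, assuming a hypothetical `factory` SDK with a `deploy()` entry point (the function name, argument names, and adapter identifier are illustrative, not a confirmed API):

```python
# Hypothetical Factory SDK call -- function and argument names are
# illustrative, not a confirmed API surface.
deploy_args = {
    "adapter": "my-org/my-finetuned-adapter",  # assumed adapter identifier
    "port": 8000,  # explicit override of the documented default (9000)
}
# deployment = factory.deploy(**deploy_args)  # hypothetical call
```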
This deployment will:
- Start a local API server on port 8000
- Load your fine-tuned adapter
- Expose an OpenAI-compatible interface
- Stream metrics and analytics to the Factory Hub
### Deployment Configuration
Control your deployment with these parameters:
| Parameter | Description | Default |
|---|---|---|
| `port` | Port number for the API server | 9000 |
| `quantization_bits` | Number of bits for quantization (4 for 4-bit precision) | 4 |
| `dtype` | Data type for inference (`"fp16"`, `"bf16"`, `"fp32"`, or `"auto"`) | `"auto"` |
| `max_seq_len_to_capture` | Maximum sequence length to analyze for drift | 1024 |
| `max_seq_len` | Maximum sequence length for generation | 1024 |
| `max_batched_tokens` | Maximum number of tokens to batch | 4096 |
| `max_memory_utilization` | Fraction of GPU memory to use (0.0-1.0) | 0.8 |
| `swap_space` | Size of CPU swap space in GB | 4 |
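As a sketch of how these parameters might fit together in one configuration (the dict-based shape is an assumption about the SDK; the values shown are the documented defaults):

```python
# Deployment configuration mirroring the parameter table above.
# The keys match the documented parameters; how they are passed to
# Factory (SDK call vs. CLI flags vs. YAML) is an assumption.
config = {
    "port": 9000,                    # API server port (default)
    "quantization_bits": 4,          # 4-bit precision
    "dtype": "auto",                 # "fp16", "bf16", "fp32", or "auto"
    "max_seq_len_to_capture": 1024,  # drift-analysis window
    "max_seq_len": 1024,             # generation limit
    "max_batched_tokens": 4096,      # batching ceiling
    "max_memory_utilization": 0.8,   # fraction of GPU memory
    "swap_space": 4,                 # CPU swap space, in GB
}
```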
## Interacting with Deployed Models
Factory deployments expose an OpenAI-compatible API, making them easy to integrate with existing tools and applications:
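The integration snippet did not survive in this copy; since the endpoint follows the OpenAI chat-completions wire format, any OpenAI-compatible client should work. A dependency-free sketch using only the standard library (the host, port, and adapter name are assumptions):

```python
import json
import urllib.request

# Build an OpenAI-style chat completion request for the local deployment.
payload = json.dumps({
    "model": "my-finetuned-adapter",  # assumed adapter name
    "messages": [{"role": "user", "content": "Classify this support ticket."}],
    "max_tokens": 128,
}).encode("utf-8")

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",  # assumed host and port
    data=payload,
    headers={"Content-Type": "application/json"},
)
# with urllib.request.urlopen(req) as resp:  # requires a running deployment
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Pointing the official `openai` Python client at `base_url="http://localhost:8000/v1"` would work the same way.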
## Multi-Adapter Deployments
Factory supports deploying multiple adapters in a single deployment:
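The multi-adapter snippet is missing here; as a sketch, such a deployment might register several adapters behind one server (the `factory.deploy()` signature and adapter identifiers are hypothetical, not a confirmed API):

```python
# Hypothetical multi-adapter deployment -- the deploy() signature and
# adapter identifiers are illustrative, not a confirmed API surface.
adapters = {
    "support-bot": "my-org/support-adapter-v2",
    "summarizer": "my-org/summarizer-adapter-v1",
}
# deployment = factory.deploy(adapters=adapters, port=8000)  # hypothetical
```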
Access specific adapters by name:
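One plausible convention, sketched below, is to pass the adapter's name in the standard OpenAI `model` field of the request (the adapter name and endpoint path are illustrative assumptions):

```python
import json

# Select an adapter by name via the OpenAI "model" field.
request = {
    "model": "summarizer",  # adapter name rather than the base model name
    "messages": [{"role": "user", "content": "Summarize the release notes."}],
}
body = json.dumps(request).encode("utf-8")
# POST body to http://localhost:8000/v1/chat/completions
```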
## Real-Time Monitoring
Factory automatically collects metrics from your deployment:
- Performance Metrics:
  - Request throughput
  - Token generation speed
  - Latency percentiles
  - GPU/CPU utilization
- Distribution Metrics:
  - Input data distribution
  - Drift detection results
  - Embedding space visualization
All metrics are streamed to the Factory Hub for real-time monitoring. Factory will automatically analyze your live traffic against the training data distribution to detect potential data drift.
## Data Drift Detection
A key feature of Factory deployments is automatic data drift detection:
- Recipe Integration: Uses the same recipe from training to embed incoming requests
- Embedding Analysis: Compares production traffic embeddings to training data
- Statistical Testing: Applies statistical tests to detect distribution shifts
- Alerting: Notifies you when data drift exceeds thresholds
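To illustrate the statistical-testing step (a simplified sketch, not Factory's internal implementation), a two-sample Kolmogorov-Smirnov statistic computed over a 1-D projection of the embeddings can flag a distribution shift between training data and live traffic:

```python
import random

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap
    between the two empirical CDFs."""
    d = 0.0
    for v in sorted(set(a) | set(b)):
        cdf_a = sum(x <= v for x in a) / len(a)
        cdf_b = sum(x <= v for x in b) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

random.seed(0)
train = [random.gauss(0.0, 1.0) for _ in range(500)]       # training embeddings (projected)
live_ok = [random.gauss(0.0, 1.0) for _ in range(500)]     # live traffic, same distribution
live_shift = [random.gauss(1.5, 1.0) for _ in range(500)]  # drifted live traffic

no_drift = ks_statistic(train, live_ok)     # small gap: distributions agree
drift = ks_statistic(train, live_shift)     # large gap: raise a drift alert
```

In production a system like this would compare the statistic against a calibrated threshold before alerting; the threshold and projection choice are deployment-specific.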
This helps identify when your model is receiving inputs that differ significantly from its training data, which may affect performance.
## Deployment Best Practices
- Start with quantization: Enable 4-bit quantization to reduce memory usage
- Adjust memory utilization: Set based on your GPU's available memory
- Monitor latency: Check Factory Hub for response time metrics
- Watch for drift: Regularly review data drift visualizations
- Scale horizontally: Deploy multiple instances for high-traffic applications
## Example: Production Deployment
Here's a complete example for a production deployment:
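The original example did not survive in this copy. Below is a hedged reconstruction assuming a hypothetical `factory` SDK; every identifier (function names, adapter names, endpoint paths) is illustrative rather than confirmed:

```python
import json
import urllib.request

# Hypothetical end-to-end production deployment -- the factory.deploy()
# call and adapter names are illustrative, not a confirmed API.
production_config = {
    "adapter": "my-org/support-adapter-v2",
    "port": 8000,                    # overrides the documented default of 9000
    "quantization_bits": 4,          # best practice: start quantized
    "dtype": "bf16",
    "max_seq_len": 2048,
    "max_batched_tokens": 8192,
    "max_memory_utilization": 0.9,   # tuned for a dedicated GPU
    "swap_space": 8,                 # CPU swap space, in GB
}
# deployment = factory.deploy(**production_config)  # hypothetical call

# Smoke-test the OpenAI-compatible endpoint once the server is up:
payload = json.dumps({
    "model": production_config["adapter"],
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 32,
}).encode("utf-8")
req = urllib.request.Request(
    f"http://localhost:{production_config['port']}/v1/chat/completions",
    data=payload,
    headers={"Content-Type": "application/json"},
)
# with urllib.request.urlopen(req) as resp:  # requires a running server
#     print(json.load(resp)["choices"][0]["message"]["content"])
```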