Datasets
Create datasets for fine-tuning
Datasets are the foundation for training AI models. The resulting model learns and reproduces the statistical patterns in the data you provide.
Create a Dataset in ModelStudio
There are two methods to create datasets:
- Uploading a complete dataset
- Generating a completely synthetic dataset
Upload a Complete Dataset
Full datasets must be provided in JSONL (JSON Lines) format: a text file in which every line is a complete, self-contained JSON object. Each line holds one training example.
Example of a multi-turn conversation in JSONL format:
{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is machine learning?"}, {"role": "assistant", "content": "Machine learning is a subset of artificial intelligence..."}]}
The messages array contains the conversation turns with different roles:
- system: Context or behavior instructions (optional)
- user: User input
- assistant: Expected model response
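Before uploading, it can help to check a file locally. The following Python sketch writes conversations as JSONL and checks each line against the structure described above. The function names (to_jsonl, validate_jsonl) and the specific checks are illustrative assumptions, not part of ModelStudio; the validation ModelStudio performs on upload may differ.

```python
import json

# Roles described in this section; assumed here to be the only accepted ones.
ALLOWED_ROLES = {"system", "user", "assistant"}


def to_jsonl(conversations):
    """Serialize a list of message lists, one JSON object per line."""
    lines = [json.dumps({"messages": messages}) for messages in conversations]
    return "\n".join(lines) + "\n"


def validate_jsonl(text):
    """Return a list of (line_number, problem) tuples; empty if none found."""
    problems = []
    for number, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            continue  # ignore blank lines, e.g. after a trailing newline
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append((number, "missing or empty 'messages' array"))
            continue
        for message in messages:
            if not isinstance(message, dict):
                problems.append((number, "each message must be a JSON object"))
                continue
            role = message.get("role")
            if not isinstance(role, str) or role not in ALLOWED_ROLES:
                problems.append((number, f"unexpected role: {role!r}"))
            if not isinstance(message.get("content"), str):
                problems.append((number, "message 'content' must be a string"))
    return problems


if __name__ == "__main__":
    sample = to_jsonl([
        [
            {"role": "user", "content": "What is machine learning?"},
            {"role": "assistant", "content": "Machine learning is a subset of artificial intelligence..."},
        ]
    ])
    print(validate_jsonl(sample))
```

An empty result means only that no structural problems were found by these checks; it says nothing about the quality of the training examples.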
Generate a Completely Synthetic Dataset
This method allows you to create a dataset from scratch without any existing data. Large AI models are used to generate training examples based on your description.
The beta version of ModelStudio allows you to create at most 250 training examples with this method.
Best Practice: The more detailed your use case description, the better the quality of the generated data.
Limitations: Downloading synthetic datasets is not supported during the beta period.
Privacy Guidelines
ModelStudio adheres to GDPR requirements. For enterprise deployments, refer to the Data Processing Agreement for detailed information on data handling and compliance.
Data Security
All infrastructure operates in German data centers meeting stringent security standards. ModelStudio can also be deployed fully on-premises for maximum data control.
Next Steps
After creating your dataset:
- Select a Base Model - Choose the appropriate foundation model
- Configure Fine-Tuning - Set up your training job