ModelStudio

Datasets

Create datasets for fine-tuning

Datasets are the foundation for training AI models. The resulting model learns and represents the statistical patterns present in the data you provide.

Create a Dataset in ModelStudio

There are two methods to create datasets:

  1. Uploading a complete dataset
  2. Generating a completely synthetic dataset

Upload a Complete Dataset

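Full datasets need to be provided in JSONL (JSON Lines) format: a text file in which every line is one complete, self-contained JSON object. Each line holds one training example, and a single JSON object must not span multiple lines.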

Example of a conversation as a single line of a JSONL file:

{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is machine learning?"}, {"role": "assistant", "content": "Machine learning is a subset of artificial intelligence..."}]}
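As a minimal sketch of how such a file could be produced, the Python snippet below serializes a list of conversations to JSONL text with the standard json module. The helper name to_jsonl and the sample conversations are hypothetical and only illustrate the one-object-per-line layout; they are not part of ModelStudio.

```python
import json


def to_jsonl(conversations):
    """Serialize each conversation as one compact JSON object per line.

    Hypothetical helper: wraps each list of messages in a {"messages": [...]}
    object, mirroring the example line above.
    """
    # json.dumps escapes newlines inside strings, so every record stays on one line.
    return "".join(
        json.dumps({"messages": messages}, ensure_ascii=False) + "\n"
        for messages in conversations
    )


# Illustrative data only.
conversations = [
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is machine learning?"},
        {"role": "assistant", "content": "Machine learning is a subset of artificial intelligence..."},
    ],
    [
        {"role": "user", "content": "Name two uses.\nKeep it short."},
        {"role": "assistant", "content": "Spam filtering and image recognition."},
    ],
]

jsonl_text = to_jsonl(conversations)
```

Writing jsonl_text to a file opened with encoding="utf-8" gives a file with one training example per line.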

The messages array contains the conversation turns with different roles:

  • system: Context or behavior instructions (optional)
  • user: User input
  • assistant: Expected model response
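Before uploading, it can help to check a file locally against the structure described above. The following is a rough sketch, not ModelStudio's actual validation: the function validate_jsonl, the rule that content must be a string, and the error messages are all assumptions derived from the example line and the role list.

```python
import json

# Roles taken from the list above; anything else is flagged.
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_jsonl(text):
    """Return a list of (line_number, problem) tuples; an empty list means no problems were found."""
    problems = []
    for number, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            continue  # skip blank lines, such as the one after a trailing newline
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((number, f"invalid JSON: {err.msg}"))
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append((number, "missing or empty 'messages' array"))
            continue
        for message in messages:
            role = message.get("role") if isinstance(message, dict) else None
            if not isinstance(role, str) or role not in ALLOWED_ROLES:
                problems.append((number, "message with unknown or missing role"))
            elif not isinstance(message.get("content"), str):
                # Assumption: content is always plain text.
                problems.append((number, "message 'content' must be a string"))
    return problems


# Illustrative input: line 1 is valid, line 2 uses an unknown role,
# line 3 is a JSON object that was cut off mid-line.
sample = (
    '{"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello!"}]}\n'
    '{"messages": [{"role": "bot", "content": "Hi"}]}\n'
    '{"messages": [\n'
)
problems = validate_jsonl(sample)
```

Here problems contains one entry for line 2 and one for line 3. Reporting every issue with its line number, rather than stopping at the first error, makes it easier to repair a large file in one pass.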

Generate a Completely Synthetic Dataset

This method allows you to create a dataset from scratch without any existing data. Large AI models are used to generate training examples based on your description.

The beta version of ModelStudio allows the creation of at most 250 training examples.

Best Practice: The more detailed your use case description, the better the quality of the generated data.

Limitations: Downloading synthetic datasets is not supported during the beta period.

Privacy Guidelines

ModelStudio adheres to GDPR requirements. For enterprise deployments, refer to the Data Processing Agreement for detailed information on data handling and compliance.

Data Security

All infrastructure operates in German data centers meeting stringent security standards. ModelStudio can also be deployed fully on-premises for maximum data control.

Next Steps

After creating your dataset: