Training Data Transparency Statement
Last updated: December 22, 2025
Pursuant to California Civil Code Section 3111, the summary below describes the datasets used to develop Inflection AI, Inc.’s large language models:
1. Sources and Owners of Datasets
Inflection AI, Inc. (“Inflection AI”) trains language models using publicly accessible, licensed, and proprietary datasets. Inflection AI’s proprietary datasets include both datasets commissioned by Inflection AI and datasets generated using synthetic distilled data.
2. Dataset Purpose and Intended Use
The datasets used to train and develop Inflection AI’s large language models further Inflection AI’s public benefit mission to create human-centered AI models that unite emotional intelligence (EQ) and raw intelligence (IQ).
3. Scale of Training Data
Inflection AI estimates that its datasets consist of 4 to 5 petabytes of data in total, from which thousands to hundreds of thousands of examples may be used per dataset.
4. Types of Data Points
The datasets used by Inflection AI consist primarily of text data. In certain cases, the datasets may also include users’ voice recordings. Data points within the datasets were labeled to describe their qualities (e.g., friendly or positive, repetitive or negative).
5. Intellectual Property Status
Inflection AI’s datasets contain a mixture of content, some of which may be protected by copyright, trademark, or patent and some of which is in the public domain.
6. Data Acquisition Methods
Inflection AI’s datasets consist of a mix of publicly available, licensed, and proprietary datasets.
7. Personal Information
Personal information may incidentally be included in the data sources mentioned above. However, it is not Inflection AI’s intention to train its models on personal information. Inflection AI is committed to privacy-by-design safeguards and to training its models not to disclose individuals’ personal information. Please see Inflection AI’s Notice on Model Training for more detail.
8. Aggregate Consumer Information
To Inflection AI’s knowledge, the datasets used to train Inflection AI’s models do not include aggregate consumer information as defined in California Civil Code Section 1798.140(b), although such information may incidentally be included in the data sources mentioned above.
9. Data Processing and Modification
Inflection AI conducts extensive cleaning, processing, and modification of its datasets before using them to train its models, including:
Deduplication;
Quality filtering;
Dataset mixing;
Content filtering; and
Format standardization.
10. Data Collection Timeline
Inflection AI began collecting external datasets for pre-training when Inflection AI was founded in 2022. Inflection AI continues to improve its services, including by post-training and fine-tuning models, based on information from its products and services and Inflection AI’s proprietary datasets. Data collection for these purposes is ongoing.
11. Development Timeline
The datasets were first used to train Inflection AI’s large language models when Inflection AI was founded in 2022.
12. Synthetic Data Generation
Inflection AI uses synthetic data generation to augment data, address training gaps, enhance model safety and performance, and enable specialized capabilities such as reasoning and tool use.