Training Data Transparency Statement

Last updated: December 22, 2025

Pursuant to California Civil Code Section 3111, the following summary describes the datasets used to develop Inflection AI, Inc.’s large language models:

1. Sources and Owners of Datasets

Inflection AI, Inc. (“Inflection AI”) trains language models using publicly accessible, licensed, and proprietary datasets. Inflection AI’s proprietary datasets include both datasets commissioned by Inflection AI and datasets generated using synthetic distilled data.

2. Dataset Purpose and Intended Use

The datasets used to train and develop Inflection AI’s large language models further Inflection AI’s public benefit mission to create human-centered AI models that unite emotional intelligence (EQ) and raw intelligence (IQ).

3. Scale of Training Data

Inflection AI estimates that its datasets comprise 4 to 5 petabytes of data in total, and that thousands to hundreds of thousands of examples may be used per dataset.

4. Types of Data Points

The datasets used by Inflection AI consist primarily of text data. In certain cases, the datasets may also include users’ voice recordings. Data points within the datasets were labeled to describe their qualities (e.g., friendly or positive, repetitive or negative).

5. Intellectual Property Status

Inflection AI’s datasets contain a mixture of content, some of which may be protected by copyright, trademark, or patent and some of which is in the public domain.

6. Data Acquisition Methods

Inflection AI’s datasets consist of a mix of publicly available, licensed, and proprietary datasets.

7. Personal Information

Personal information may incidentally be included in the data sources mentioned above. However, Inflection AI does not intend to train its models on personal information. Inflection AI is committed to privacy-by-design safeguards and to training its models not to disclose individuals’ personal information. Please see Inflection AI’s Notice on Model Training for more detail.

8. Aggregate Consumer Information

To Inflection AI’s knowledge, the datasets used to train Inflection AI’s models do not include aggregate consumer information as defined in California Civil Code Section 1798.140(b), although such information may incidentally be included in the data sources mentioned above.

9. Data Processing and Modification

Inflection AI conducts extensive cleaning, processing, and modification of its datasets before using them to train its models, including:

  • Deduplication; 

  • Quality filtering;

  • Dataset mixing; 

  • Content filtering; and

  • Format standardization.
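
For illustration only, the processing steps listed above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a description of Inflection AI’s actual pipeline: the function names, the word-count threshold, the blocklist, and the sampling weights are all hypothetical and chosen only to make each step concrete.

```python
import hashlib
import random
import unicodedata

# Hypothetical blocklist used only to illustrate content filtering.
BLOCKLIST = {"blockedterm"}


def standardize(text: str) -> str:
    """Format standardization: normalize Unicode and collapse whitespace."""
    return " ".join(unicodedata.normalize("NFKC", text).split())


def deduplicate(records: list[str]) -> list[str]:
    """Exact deduplication on a hash of the case-folded text (first copy kept)."""
    seen = set()
    unique = []
    for text in records:
        digest = hashlib.sha256(text.casefold().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


def passes_quality(text: str, min_words: int = 3) -> bool:
    """Quality filtering: drop records shorter than an assumed word threshold."""
    return len(text.split()) >= min_words


def passes_content(text: str) -> bool:
    """Content filtering: drop records containing any blocklisted word."""
    return set(text.casefold().split()).isdisjoint(BLOCKLIST)


def clean(records: list[str]) -> list[str]:
    """Standardize, deduplicate, then apply the quality and content filters."""
    standardized = [standardize(r) for r in records]
    return [
        r for r in deduplicate(standardized)
        if passes_quality(r) and passes_content(r)
    ]


def mix(sources: dict[str, list[str]], weights: dict[str, float],
        n: int, seed: int = 0) -> list[str]:
    """Dataset mixing: draw n examples, choosing each source by its weight."""
    rng = random.Random(seed)
    names = [name for name in sources if sources[name]]
    picks = rng.choices(names, weights=[weights[name] for name in names], k=n)
    return [rng.choice(sources[name]) for name in picks]
```

In this sketch, calling `clean` on a list of raw strings returns the surviving records in their original order, and `mix` samples with replacement so that a small, heavily weighted source can be drawn from more than once.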

10. Data Collection Timeline

Inflection AI began collecting external datasets for pre-training when it was founded in 2022. Inflection AI continues to improve its services, including by post-training and fine-tuning models, based on information from its products and services and its proprietary datasets. Data collection for these purposes is ongoing.

11. Development Timeline

The datasets were first used to train Inflection AI’s large language models when Inflection AI was founded in 2022. 

12. Synthetic Data Generation

Inflection AI uses synthetic data generation to augment data, address training gaps, enhance model safety and performance, and enable specialized capabilities such as reasoning and tool use.