Inflection

Inflection-2.5: meet the world's best personal AI

Palo Alto, CA – March 7, 2024

At Inflection, our mission is to create a personal AI for everyone. Last May, we released Pi, a personal AI designed to be empathetic, helpful, and safe. In November we announced a major new foundation model, Inflection-2, the second-best LLM in the world at the time.

Now we are adding IQ to Pi’s exceptional EQ.

We are launching Inflection-2.5, our upgraded in-house model, which is competitive with the world's leading LLMs, including GPT-4 and Gemini. It couples raw capability with our signature personality and unique empathetic fine-tuning. Inflection-2.5 is available to all Pi users today, at pi.ai, on iOS, on Android, or via our new desktop app.

We achieved this milestone with incredible efficiency: Inflection-2.5 approaches GPT-4's performance while using only 40% of the compute for training.

We've made particular strides in areas of IQ like coding and mathematics. This translates into concrete improvements on key industry benchmarks and keeps Pi at the technological frontier. Pi now also incorporates world-class real-time web search, ensuring users get high-quality breaking news and up-to-date information.

We've already rolled out Inflection-2.5 to our users, and they are really enjoying Pi! We've seen a significant impact on user sentiment, engagement, and retention, accelerating our organic user growth.

Our one million daily and six million monthly active users have now exchanged more than four billion messages with Pi.

An average conversation with Pi lasts 33 minutes, and one in ten lasts over an hour each day. About 60% of people who talk to Pi in any given week return the following week, and we see higher monthly stickiness than leading competitors.

With Inflection-2.5’s powerful capabilities, users are talking to Pi about a greater range of topics than ever: discussing current events, getting local restaurant recommendations, studying for a biology exam, drafting a business plan, coding, preparing for an important conversation, or just having fun discussing a hobby. We can’t wait to show you what Pi can do.

Technical results

Below, we show a series of results on key industry benchmarks. For the sake of simplicity, we compare Inflection-2.5 to GPT-4. These results show how Pi now incorporates IQ capabilities comparable to those of acknowledged industry leaders. Because reporting formats differ across models, we are careful to note the format used for each evaluation.

Inflection-1 used approximately 4% of the training FLOPs of GPT-4 and, on average, performed at approximately 72% of GPT-4's level on a diverse range of IQ-oriented tasks. Inflection-2.5, now powering Pi, achieves more than 94% of GPT-4's average performance despite using only 40% of the training FLOPs. We see a significant improvement in performance across the board, with the largest gains coming in STEM areas.

Inflection-2.5 shows substantial gains over Inflection-1 on MMLU, a diverse benchmark measuring performance on tasks ranging from high-school to professional-level difficulty. We also evaluate on GPQA Diamond, an extremely difficult expert-level benchmark.

We also include results on two STEM examinations: the Hungarian Math exam and the Physics GRE, a graduate entrance exam in physics.

For the Hungarian Math exam, we use the few-shot prompt and formatting provided here to allow for ease of reproducibility. Inflection-2.5 used just the first example in the prompt (a one-shot setting).
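To make the setup concrete, here is a minimal sketch of how such a one-shot prompt can be assembled. The example problem, solution, and formatting strings below are hypothetical placeholders, not the actual published prompt or formatting.

```python
# Minimal sketch of a one-shot prompt for a math exam question.
# EXAMPLE_PROBLEM / EXAMPLE_SOLUTION are hypothetical placeholders;
# the actual prompt and formatting are those provided in the linked resource.

EXAMPLE_PROBLEM = "Solve for x: 2x + 6 = 14."
EXAMPLE_SOLUTION = "Subtract 6 from both sides: 2x = 8. Divide by 2: x = 4."

def build_one_shot_prompt(question: str) -> str:
    """Prepend a single worked example before the target question."""
    return (
        "Problem: " + EXAMPLE_PROBLEM + "\n"
        "Solution: " + EXAMPLE_SOLUTION + "\n\n"
        "Problem: " + question + "\n"
        "Solution:"
    )

print(build_one_shot_prompt("Solve for x: 3x - 5 = 10."))
```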

We also release a processed version of past Physics GRE exams (GR8677, GR9277, GR9677, GR0177) and compare the performance of Inflection-2.5 to GPT-4 on the first of these. We find that Inflection-2.5 performs at the 85th percentile of human test-takers with maj@8, and reaches nearly the top score with maj@32. To allow wider comparison, the results below exclude problems with images, though we have released all questions.
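For reference, maj@k denotes majority voting: the model is sampled k times per question and the most frequent final answer is taken. Below is a minimal sketch of this procedure; sample_answer is a hypothetical stand-in for a single model call plus answer extraction, not part of any published harness.

```python
from collections import Counter

def majority_at_k(sample_answer, question: str, k: int) -> str:
    """Sample the model k times and return the most common final answer.

    sample_answer is a hypothetical callable that queries the model once
    and extracts its final answer (e.g. a multiple-choice letter).
    """
    answers = [sample_answer(question) for _ in range(k)]
    # most_common(1) returns [(answer, count)] for the modal answer.
    return Counter(answers).most_common(1)[0][0]

def score(sample_answer, questions, answer_key, k=8) -> float:
    """Hypothetical usage: fraction of questions solved at maj@k."""
    correct = sum(
        majority_at_k(sample_answer, q, k) == answer_key[q] for q in questions
    )
    return correct / len(questions)
```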

On BIG-Bench-Hard, a subset of BIG-Bench problems that are difficult for large language models, Inflection-2.5 shows over a 10% improvement over Inflection-1 and is competitive with the most capable models.

We also evaluated our models on MT-Bench, a widely used community leaderboard for comparing models. In doing so, we found that a large fraction (nearly 25%) of examples in the reasoning, math, and coding categories had incorrect reference solutions or questions with flawed premises. We therefore corrected these examples and release that version of the dataset here.

Evaluating on both the original and corrected subsets, we found that our model's performance on the corrected version is more in line with what we would expect based on other benchmarks.

Inflection-2.5 shows particular improvements over Inflection-1 in math and coding performance, as shown in the tables below.

On MBPP+ and HumanEval+, two coding benchmarks, we see a massive improvement over Inflection-1.

For MBPP, we report the value for GPT-4 from DeepSeek Coder. For HumanEval, we take the result from the EvalPlus leaderboard (GPT-4, May 23).
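Benchmarks of this kind score functional correctness: a model-generated solution counts as passing only if it executes successfully against the task's test cases. The sketch below illustrates the idea under simplifying assumptions; production harnesses such as EvalPlus run candidates in sandboxed subprocesses with timeouts and far larger test suites.

```python
def passes_tests(candidate_code: str, test_code: str) -> bool:
    """Execute model-generated code against a test snippet.

    Illustrative only: a real harness isolates the candidate in a
    sandboxed subprocess with a timeout rather than using exec() in-process.
    """
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)   # define the candidate function
        exec(test_code, namespace)        # assertions raise on failure
        return True
    except Exception:
        return False

# Hypothetical example in the spirit of an MBPP-style task.
candidate = "def add(a, b):\n    return a + b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
print(passes_tests(candidate, tests))  # True
```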

We have also evaluated Inflection-2.5 on HellaSwag and ARC-C, common-sense and science benchmarks reported by a wide range of models. In both cases, we see strong performance on these saturating benchmarks.

All evaluations above are done with the model that is now powering Pi. However, we note that the user experience may differ slightly due to the impact of web retrieval (no benchmarks above use web retrieval), the structure of few-shot prompting, and other production-side differences.

In short, Inflection-2.5 maintains Pi’s unique, approachable personality and extraordinary safety standards while becoming an even more helpful model across the board.

We thank our partners at Azure and CoreWeave for their support in bringing the state-of-the-art language models behind Pi to millions of users across the globe.

Try Pi Now