99.99% of people cannot comprehend how insane FrontierMath is. The problems are crafted by math profs and kept private, so they should not be in any training data.
This is what Tamay Besiroglu, a co-author of the FrontierMath benchmark, has to say about o3's performance:
For context, FrontierMath is a brutally difficult benchmark with problems that would stump many mathematicians. The easier problems are as hard as IMO/Putnam; the hardest ones approach research-level complexity.
With earlier models like o1-preview, Pass@1 performance (solving on first attempt) was only around 2%. When allowing 8 attempts per problem (Pass@8) and counting problems solved at least once, we saw ~6% performance. o3's 25.2% at Pass@1 is substantially more impressive.
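To make the Pass@1 vs Pass@8 distinction concrete, here is a minimal sketch of how the two numbers could be computed from per-attempt results. The function names and the toy data are made up for illustration; this is not the FrontierMath evaluation code, and the numbers are not real benchmark results.

```python
# Illustrative sketch only: hypothetical helpers and toy data,
# not the actual FrontierMath evaluation harness.

def pass_at_1(results):
    """Fraction of problems solved on the first attempt."""
    return sum(attempts[0] for attempts in results) / len(results)

def pass_at_k(results, k):
    """Fraction of problems solved at least once within the first k attempts."""
    return sum(any(attempts[:k]) for attempts in results) / len(results)

# Each inner list holds 8 attempts on one problem (True = solved).
toy_results = [
    [False] * 8,                               # never solved
    [False, False, True] + [False] * 5,        # solved on the third attempt
    [True] + [False] * 7,                      # solved on the first attempt
    [False] * 8,                               # never solved
]

print(pass_at_1(toy_results))     # first-attempt success rate
print(pass_at_k(toy_results, 8))  # solved-at-least-once rate over 8 attempts
```

Pass@8 can never be lower than Pass@1 on the same attempts, which is why a Pass@1 score is the stricter number to compare against.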