‘Manipulative and disgraceful’: OpenAI’s critics seize on math benchmarking scandal
Hello and welcome to Eye on AI. In this edition… Watch OpenAI’s hands; Trump scraps Biden’s AI order; Whistleblower targets Amazon and Covariant; Titans vs. Transformers.
OpenAI may or may not be about to release something big and agentic.
According to a rather breathless Axios article on Sunday, an unidentified company is preparing “Ph.D.-level super-agents” that would be “a true replacement for human workers.” No names are named, but the article prominently notes that OpenAI CEO Sam Altman will give Trump administration officials a closed-door briefing at the end of the month.
It goes on to add: “Sources say this coming advancement is significant. Several OpenAI staff have been telling friends they are both jazzed and spooked by recent progress.” Those sources apparently come from “the U.S. government and leading AI companies.”
There’s more than a whiff of hype about all this. But Altman claims to be no fan of such things. Addressing the separate but perhaps connected issue of OpenAI’s efforts to achieve “artificial general intelligence” (definitions differ, but the term usually means AI with human-level or superhuman capabilities), the CEO tweeted yesterday that “Twitter hype is out of control again” and “we are not gonna deploy AGI next month, nor have we built it.”
If he’s so anti-hype, Altman might want to take himself aside for tweeting, less than three weeks ago: “I have always wanted to write a six-word story. Here it is: Near the singularity; unclear which side.” A story, sure, but it also came across as a strong hint. (“The singularity” is a term referring to the inflection point where AI surpasses human intelligence.)
In yesterday’s tweet, Altman promised “We have some very cool stuff for you.” I’ve asked OpenAI whether it is the company that’s about to reveal “Ph.D.-level super-agents” and have received no response. But The Information reports that OpenAI will launch an agentic system called Operator, which can autonomously execute tasks on the user’s behalf, as soon as this month.
Whatever OpenAI does release, people should scrutinize it very closely, because the company has in recent days been caught up in a bit of a benchmarking scandal that raises questions about its performance claims.
The benchmark in question is FrontierMath, which was used in the demonstration of OpenAI’s flagship o3 model a month ago. Curated by Epoch AI, FrontierMath contains only “new and unpublished” math problems, which is supposed to avoid the issue of a model being asked to solve problems that were included in its training dataset. Epoch AI says models such as OpenAI’s GPT-4 and Google’s Gemini manage scores of less than 2%. In its demo, o3 scored a shade over 25%.
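To see why “new and unpublished” matters, here is a deliberately crude sketch of the kind of overlap check that an already-published benchmark would require. Everything in it is illustrative; real decontamination pipelines are far more involved than this word-level heuristic:

```python
def ngram_contaminated(problem: str, training_docs: list[str], n: int = 8) -> bool:
    """Flag a benchmark problem if any n-word run of its text appears
    verbatim in a training document: a crude stand-in for the
    substring-overlap checks labs have described in model reports."""
    words = problem.lower().split()
    grams = {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}
    return any(g in doc.lower() for doc in training_docs for g in grams)

# Toy usage: a "new" problem that in fact appears in the training corpus.
docs = ["... prove that the sum of two even integers is even ..."]
print(ngram_contaminated("Prove that the sum of two even integers is even.", docs))  # True
```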
Problem is, it turns out that OpenAI funded the development of FrontierMath and apparently instructed Epoch AI not to tell anyone about this until the day of o3’s unveiling. After an Epoch AI contractor used a LessWrong post to complain that mathematicians contributing to the dataset had been kept in the dark about the link, Epoch associate director Tamay Besiroglu apologized, saying OpenAI’s contract had left the company unable to disclose the funding earlier.
“We acknowledge that OpenAI does have access to a large fraction of FrontierMath problems and solutions, with the exception of an unseen-by-OpenAI hold-out set that enables us to independently verify model capabilities,” Besiroglu wrote. “However, we have a verbal agreement that these materials will not be used in model training.”
OpenAI has not yet responded to a question about whether it nonetheless used its FrontierMath access when training o3—but its critics aren’t holding back. “The public presentation of o3 from a scientific perspective was manipulative and disgraceful,” the notable AGI skeptic Gary Marcus told my colleague Jeremy Kahn in Davos yesterday, adding that the presentation was “deliberately structured to make it look like they were closer to AGI than they actually are.”
“OpenAI should be more transparent about what the business arrangements were [with Epoch AI] and the extent to which they were given a competitive advantage and the extent to which they trained directly or indirectly on materials they had access to and the extent to which they used data augmentation techniques on information they had access to,” Marcus said. “If they are not transparent, we should not take them seriously.”
That’s something to bear in mind over the coming weeks. And with that, here’s more on what has been a very busy few days on the AI news front.
David Meyer david.meyer@fortune.com @dmeyer.eu on Bluesky
Trump scraps Biden’s AI order. On his first day back in office, President Donald Trump scrapped dozens of his predecessor’s policies, among them Biden’s 2023 Executive Order on Safe, Secure, and Trustworthy Development and Use of AI. Much of that particular order has already been carried out, such as the creation of an AI Safety Institute within the National Institute of Standards and Technology (NIST). But Trump’s move does mean that AI companies will no longer have to give the U.S. government safety-test results before releasing new models. It also means that the U.S. now has no significant federal AI rules, creating an enormous disparity with the EU in particular, and perhaps setting the stage for future EU-U.S. clashes over the issue of AI safety.
Whistleblower targets Amazon’s Covariant acquihire. An unnamed shareholder and former employee of Covariant AI, a company that makes AI for logistics robots, has complained to the U.S. authorities about Amazon’s recent deal with the company. As Amazon announced last August, it hired three Covariant founders and a quarter of the startup’s staff, while taking a nonexclusive license for its models. Per the Washington Post, the whistleblower claims the acquihire deal was worth $380 million—over three times the threshold for giving antitrust regulators a heads-up, which never happened—and also that its terms limited the licenses that Covariant could sell to others. An Amazon spokesperson responded: “Covariant continues to serve its dozens of customers, and because Amazon is licensing Covariant technology on a non-exclusive basis, Covariant is free to license its technology to other companies.”
Metropolis buys Oosto. Oosto, the Israeli AI facial-recognition firm formerly known as AnyVision, has found a buyer. Metropolis, an AI company that helps parking operators provide checkout-free payment experiences, will pay $125 million in its own stock for Oosto, according to TechCrunch. Oosto had raised some $380 million from investors. Oosto/AnyVision was a controversial outfit, partly because many people are generally uneasy about facial recognition, but also because the Israeli government used its software to surveil West Bank Palestinians.
British government details extensive AI plans. The U.K.’s Labour government said last week that it would “mainline AI into the veins” of the country’s economy, and now it has detailed how the country’s public services will embrace the new technology. As part of an announcement around the digitization of services and better sharing of data between agencies, the government unveiled an AI toolkit for civil servants. The package is dubbed “Humphrey,” a witty reference to the classic TV show Yes Minister. The kit includes tools that rapidly parse responses to public consultations, draw on decades of parliamentary debate to “better manage bills” (reportedly by predicting how legislation will be received by lawmakers), and summarize policies and laws.
Google pits Titans against transformers. There’s a lot of buzz around a new neural-network architecture that Google researchers have just announced. The Titans architecture adds a long-term, persistent neural memory that acts in concert with the short-term memory provided by the attention mechanism in the transformer architecture underpinning today’s LLMs. This would be useful for building agents. According to Google’s researchers, the new architecture is “more effective” than transformers at “common-sense reasoning” and other tasks, particularly when handling large amounts of information. The big question now is what the compute requirements look like.
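To make the short-term/long-term split concrete, here is a toy Python sketch. It is not the paper’s method: Titans’ memory is a deep network updated at test time with a momentum-based “surprise” rule, whereas this illustration uses a single linear associative map with a plain gradient-style update and decay. All names and dimensions are invented for illustration.

```python
import torch

class ToyNeuralMemory:
    """Toy long-term memory: a linear map W trained online so that
    W @ key approximates value, with decay that slowly forgets old
    associations. A loose stand-in for a (much deeper) neural memory."""

    def __init__(self, dim: int, lr: float = 0.5, decay: float = 0.99):
        self.W = torch.zeros(dim, dim)  # memory "weights", updated at test time
        self.lr, self.decay = lr, decay

    def write(self, k: torch.Tensor, v: torch.Tensor) -> None:
        err = v - self.W @ k  # the "surprise": what the memory got wrong
        self.W = self.decay * self.W + self.lr * torch.outer(err, k)

    def read(self, q: torch.Tensor) -> torch.Tensor:
        return self.W @ q

dim, window = 16, 4
mem = ToyNeuralMemory(dim)
tokens = [torch.randn(dim) for _ in range(64)]

outputs = []
for t, x in enumerate(tokens):
    # Short-term path: vanilla attention over a small recent window.
    recent = torch.stack(tokens[max(0, t - window): t + 1])
    attn = torch.softmax(recent @ x / dim**0.5, dim=0) @ recent
    # Long-term path: recall from the persistent memory, then store the
    # current token so it stays recallable after leaving the window.
    outputs.append(attn + mem.read(x))
    mem.write(x, x)
```

The appeal of the design is that attention stays cheap (a fixed window) while the memory carries information across arbitrarily long contexts; the open question is what those test-time memory updates cost at scale.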
Meta claims Babel Fish breakthrough. Meta’s researchers have announced a system called Massively Multilingual and Multimodal Machine Translation, or SEAMLESSM4T, that can translate spoken words into other languages without the need to convert the recording to text and back again (though it can do that, too). They suggest this is a big step toward the creation of something like the Babel Fish, a universal translator (and fish) that makes it possible for characters in Douglas Adams’s Hitchhiker’s Guide to the Galaxy to communicate with other species. According to the researchers, SEAMLESSM4T is far better at rejecting background noise than comparable systems.
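Meta has published SeamlessM4T checkpoints, and the model family is wired into Hugging Face’s transformers library. Below is a minimal speech-to-speech sketch; the checkpoint name, the audios/sampling_rate processor arguments, and the generate_speech flag reflect the transformers documentation as I understand it, so treat the exact identifiers as assumptions to verify:

```python
import numpy as np
from transformers import AutoProcessor, SeamlessM4TModel

# Checkpoint name as published on the Hugging Face Hub; verify before use.
model_id = "facebook/hf-seamless-m4t-medium"
processor = AutoProcessor.from_pretrained(model_id)
model = SeamlessM4TModel.from_pretrained(model_id)

# One second of silence standing in for a real 16 kHz mono recording.
speech = np.zeros(16000, dtype=np.float32)
inputs = processor(audios=speech, sampling_rate=16000, return_tensors="pt")

# Speech in, French speech out: no text round-trip required by the caller.
# (Passing generate_speech=False returns translated text instead.)
french_waveform = model.generate(**inputs, tgt_lang="fra")[0]
```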
Feb. 10-11: AI Action Summit, Paris, France
March 3-6: MWC, Barcelona
March 7-15: SXSW, Austin
March 10-13: Human [X] conference, Las Vegas
March 17-20: Nvidia GTC, San Jose
April 9-11: Google Cloud Next, Las Vegas
Reasoning models flourish in China. In the push for better AI “reasoning” models, all eyes are currently on China thanks to a couple of notable announcements.
First up: DeepSeek-R1. Hangzhou-based DeepSeek released its V3 model, currently considered by some to be the best open-source AI model out there (sorry, Meta), just before Christmas. R1 was used to help train V3, and DeepSeek claims R1 can just about match OpenAI’s o1 “across math, code, and reasoning tasks.” Benchmarking suggests this is true, making R1 a serious competitor to o1 that is much cheaper to run.
DeepSeek has now open-sourced a version of R1 called R1-Zero, which it says “encounters challenges such as endless repetition, poor readability, and language mixing,” as well as R1 itself, which apparently doesn’t. Perhaps because both are enormous, it has also transferred (or “distilled”) knowledge from them to versions of Meta’s Llama and Alibaba’s Qwen models, and open-sourced those too.
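“Distilling” here means using a big teacher model’s outputs to train a smaller student. DeepSeek describes fine-tuning its Llama- and Qwen-based students on reasoning traces generated by R1; the snippet below instead shows the classic logit-matching formulation of distillation (Hinton et al., 2015) as a minimal illustration of the general idea, not DeepSeek’s exact recipe:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Classic soft-label distillation: push the student's output
    distribution toward the teacher's, softened by a temperature."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    # KL(teacher || student), scaled by t^2 as in the original paper.
    return F.kl_div(student_logp, teacher_probs, reduction="batchmean") * t * t

# Toy usage: random logits over a 32-token vocabulary for a batch of 8.
student = torch.randn(8, 32, requires_grad=True)
teacher = torch.randn(8, 32)
distillation_loss(student, teacher).backward()
```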
Meanwhile, China’s Moonshot AI just announced Kimi k1.5, a model that can reason over both text and vision modalities, and that Moonshot also claims is comparable to o1. It says the new version of the model will soon power its popular Kimi chatbot.