Posted By: Anita Mamgai
Posted On: Aug 31, 2025
When Anthropic released Claude 4 a week ago, the artificial intelligence (AI) company said these models set “new standards for coding, advanced reasoning, and AI agents”. They cite leading scores on SWE-bench Verified, a benchmark for performance on real software engineering tasks. OpenAI also claims the o3 and o4-mini models return best scores on certain benchmarks. As does Mistral, for the open-source Devstral coding model.

AI companies flexing comparative test scores is a common theme.

The world of technology has long obsessed over synthetic benchmark test scores. Processor performance, memory bandwidth, storage speed and graphics performance are among the many examples, often used to judge whether a PC or a smartphone was worth your time and money.

Yet experts believe it may be time to evolve the methodology for AI testing rather than change it wholesale.

American venture capitalist Mary Meeker, in the latest AI Trends report, notes that AI is increasingly doing better than humans in terms of accuracy and realism. She points to the MMLU (Massive Multitask Language Understanding) benchmark, on which AI models average 92.30% accuracy compared with a human baseline of 89.8%.

MMLU is a benchmark to judge a model's general knowledge across 57 tasks covering professional and academic subjects including math, law, medicine and history.

Benchmarks serve as standardised yardsticks to measure, compare, and understand the evolution of different AI models. They are structured assessments that provide comparable scores for different models. These typically consist of datasets containing thousands of curated questions, problems, or tasks that test particular aspects of intelligence.

Understanding benchmark scores requires context about both the scale and the meaning behind the numbers. Most benchmarks report accuracy as a percentage, but the significance of these percentages varies dramatically across different tests. On MMLU, random guessing would yield approximately 25% accuracy, since its questions are multiple choice with four options. Human performance typically ranges from 85% to 95%, depending on subject area.
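To make the chance baseline concrete, here is a minimal sketch that computes the expected accuracy of random guessing and rescales a raw score so that chance maps to 0 and a perfect score maps to 1. It assumes four answer options per question. The function names and the rescaling are illustrative choices, not part of any benchmark's official scoring.

```python
def chance_accuracy(num_options: int) -> float:
    """Expected accuracy of uniform random guessing on a multiple-choice question."""
    if num_options < 2:
        raise ValueError("need at least two answer options")
    return 1.0 / num_options


def chance_adjusted(accuracy: float, num_options: int) -> float:
    """Rescale accuracy so random guessing scores 0.0 and a perfect score 1.0."""
    baseline = chance_accuracy(num_options)
    return (accuracy - baseline) / (1.0 - baseline)


baseline = chance_accuracy(4)          # assumption: four options per question
adjusted = chance_adjusted(0.923, 4)   # the 92.30% average cited earlier
print(f"chance baseline: {baseline:.0%}")
print(f"chance-adjusted score: {adjusted:.3f}")
```

Read this way, a raw 92.30% sits roughly nine-tenths of the way from guessing to perfect, which is one reason the same percentage can mean different things on tests with different numbers of answer options.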

Headline numbers often mask important nuances. A model might excel in certain subjects more than in others. An aggregated score may hide weaker performance on tasks requiring multi-step reasoning or creative problem-solving behind strong performance on factual recall.
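A small sketch shows how this masking can happen. The category names and counts below are invented purely for illustration and do not come from any real model or benchmark. The point is only that a large, easy category can dominate the aggregate.

```python
# Hypothetical per-category results: (correct answers, total questions).
# These numbers are made up for illustration, not taken from any real evaluation.
results = {
    "factual recall": (940, 1000),
    "multi-step reasoning": (130, 200),
}

# Aggregate accuracy pools every question, so big categories carry more weight.
total_correct = sum(correct for correct, _ in results.values())
total_questions = sum(total for _, total in results.values())
aggregate = total_correct / total_questions

# Per-category accuracy exposes what the pooled figure hides.
per_category = {name: correct / total for name, (correct, total) in results.items()}

print(f"aggregate: {aggregate:.1%}")
for name, accuracy in per_category.items():
    print(f"{name}: {accuracy:.1%}")
```

In this made-up case the aggregate lands near 89% even though the reasoning category is at 65%, which is why per-subject breakdowns are worth reading alongside the headline figure.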

AI engineer and commentator Rohan Paul notes on X that “most benchmarks don't reward long-term memory, rather they focus on short-context tasks.”

Increasingly, AI companies are looking closely at the ‘memory’ aspect. Researchers at Google, in a new paper, detail an attention technique dubbed ‘Infini-attention’, to configure how AI models extend their “context window”.

Mathematical benchmarks often show wider performance gaps. While most of the latest AI models score over 90% accuracy on the GSM8K benchmark (Claude Sonnet 3.5 leads with 97.72% while GPT-4 scores 94.8%), the more challenging MATH benchmark sees lower scores: Google Gemini 2.0 Flash Experimental leads with 89.7%, while GPT-4 scores 84.3% (Sonnet hasn't been tested yet).
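The size of those gaps can be worked out directly from the figures quoted above. The sketch below is only arithmetic on those four numbers. The dictionary layout and the "leader" label are illustrative choices.

```python
# Scores in percent, as quoted in the paragraph above.
scores = {
    "GSM8K": {"leader": 97.72, "GPT-4": 94.8},
    "MATH": {"leader": 89.7, "GPT-4": 84.3},
}

gaps = {}
for benchmark, result in scores.items():
    # Gap between the leading model and GPT-4, in percentage points.
    gaps[benchmark] = round(result["leader"] - result["GPT-4"], 2)
    # Share of questions the leading model still gets wrong.
    leader_error = round(100 - result["leader"], 2)
    print(f"{benchmark}: gap {gaps[benchmark]} points, leader error rate {leader_error}%")
```

On these numbers the gap on MATH is nearly twice the gap on GSM8K, and the leading model's error rate is several times higher on the harder test.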

Reworking the methodology

For AI testing, there is a need to realign testbeds. “All the evals are saturated. It's becoming slightly meaningless,” said Satya Nadella, chairman and chief executive officer (CEO) of Microsoft, while speaking at venture capital firm Madrona's annual meeting earlier this year.

The tech giant has announced it is collaborating with institutions including Penn State University, Carnegie Mellon University and Duke University to develop an approach to evaluating AI models that predicts how they will perform on unfamiliar tasks and explains why, something current benchmarks struggle to do.

The effort aims to build benchmarking agents for dynamic evaluation of models, contextual predictability, human-centric comparisons and cultural aspects of generative AI.

“The framework uses ADeLe (annotated-demand-levels), a technique that assesses how demanding a task is for an AI model by applying measurement scales for 18 types of cognitive and knowledge-based abilities,” explains Lexin Zhou, Research Assistant at Microsoft.

Currently, popular benchmarks include SWE-bench (or Software Engineering Benchmark) Verified to evaluate AI coding skills, ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence) to judge generalisation and reasoning, as well as LiveBench AI, which measures agentic coding tasks and evaluates LLMs on reasoning, coding and math.

Among limitations that can affect interpretation, many benchmarks can be “gamed” through techniques that improve scores without necessarily improving intelligence or capability. A case in point is Meta's new Llama models.

In April, Meta announced an array of models, including Llama 4 Scout, Llama 4 Maverick, and the still-being-trained Llama 4 Behemoth. Meta CEO Mark Zuckerberg claims the Behemoth will be the “highest performing base model in the world”. Maverick began ranking above OpenAI's GPT-4o in LMArena benchmarks, and just below Gemini 2.5 Pro.

That is where things went pear-shaped for Meta, as AI researchers began to dig through these scores. It turned out that Meta had shared a Llama 4 Maverick model that was optimised for this test, and not exactly the spec customers would get.

Meta denies customisations. “We've also heard claims that we trained on test sets — that's simply not true and we would never do that. Our best understanding is that the variable quality people are seeing is due to needing to stabilise implementations,” says Ahmad Al-Dahle, VP of generative AI at Meta, in a statement.

There are other challenges. Models might memorise patterns specific to benchmark formats rather than developing genuine understanding. The selection and design of benchmarks also introduces bias.

There's also a question of localisation. Yi Tay, AI Researcher at Google AI and DeepMind, has detailed one such region-specific benchmark called SG-Eval, focused on helping train AI models for wider context. India too is building a sovereign large language model (LLM), with Bengaluru-based AI startup Sarvam selected under the IndiaAI Mission.

As AI capabilities continue advancing, researchers are developing evaluation methods that test for genuine understanding, robustness across contexts and capabilities in the real world, rather than plain pattern matching. In the case of AI, numbers tell an important part of the story, but not the complete story.
