When Anthropic released Claude 4 a week ago, the artificial intelligence (AI) company said these models set “new standards for coding, advanced reasoning, and AI agents”. They cite leading scores on SWE-bench Verified, a benchmark for performance on real software engineering tasks. OpenAI also claims the o3 and o4-mini models return best scores on certain benchmarks. As does Mistral, for the open-source Devstral coding model.
AI companies flexing comparative test scores is a common theme.
The world of technology has long obsessed over synthetic benchmark scores. Processor performance, memory bandwidth, storage speed and graphics performance are among the many examples, often used to judge whether a PC or a smartphone was worth your time and money.
Yet experts believe it may be time to evolve the methodology for AI testing rather than change it wholesale.
American venture capitalist Mary Meeker, in the latest AI Trends report, notes that AI is increasingly doing better than humans in terms of accuracy and realism. She points to the MMLU (Massive Multitask Language Understanding) benchmark, on which AI models average 92.30% accuracy compared with a human baseline of 89.8%.
MMLU is a benchmark to judge a model's general knowledge across 57 tasks covering professional and academic subjects including math, law, medicine and history.
Benchmarks serve as standardised yardsticks to measure, compare and understand the evolution of different AI models. They are structured assessments that provide comparable scores for different models, and typically consist of datasets containing thousands of curated questions, problems or tasks that test particular aspects of intelligence.
Understanding benchmark scores requires context about both scale and meaning behind numbers. Most benchmarks report accuracy as a percentage, but the significance of these percentages varies dramatically across different tests. On MMLU, random guessing would yield approximately 25% accuracy since most questions are multiple choice. Human performance typically ranges from 85-95% depending on subject area.
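To make the point about scale concrete, one way to read a multiple-choice score is to rescale it so that random guessing maps to 0 and a perfect score maps to 1. The sketch below is illustrative only: it assumes four answer options per question (which is what the roughly 25% guessing figure implies), and the function name `chance_corrected` is hypothetical, not part of any benchmark's tooling.

```python
def chance_corrected(accuracy: float, num_options: int = 4) -> float:
    """Rescale accuracy so that random guessing -> 0.0 and a perfect score -> 1.0.

    `accuracy` is a fraction between 0 and 1; `num_options` is the assumed
    number of answer choices per question.
    """
    if num_options < 2:
        raise ValueError("need at least two answer options")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be a fraction between 0 and 1")
    chance = 1.0 / num_options
    return (accuracy - chance) / (1.0 - chance)


# Figures quoted in the article: 92.30% for AI models, 89.8% human baseline.
model_score = chance_corrected(0.923)
human_score = chance_corrected(0.898)
print(f"model: {model_score:.3f}, human: {human_score:.3f}")
```

On this rescaled reading, a score of 25% counts for nothing, which is why a raw percentage means little without knowing the test's guessing floor.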
Headline numbers often mask important nuances. A model might excel in some subjects more than in others. Strong performance on factual recall can lift an aggregated score and hide weaker performance on tasks requiring multi-step reasoning or creative problem-solving.
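A small sketch shows how an aggregate can hide a weak category. The data here is made up purely for illustration (it is not taken from any real benchmark run), and the subject names and helper functions are hypothetical.

```python
from collections import defaultdict


def per_subject_accuracy(results):
    """Return {subject: accuracy} from an iterable of (subject, is_correct) pairs."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for subject, is_correct in results:
        total[subject] += 1
        correct[subject] += int(is_correct)
    return {subject: correct[subject] / total[subject] for subject in total}


def overall_accuracy(results):
    """Return the single headline accuracy over all questions."""
    results = list(results)
    return sum(int(ok) for _, ok in results) / len(results)


# Invented example: 100 factual-recall questions, 20 multi-step questions.
results = (
    [("history", True)] * 95 + [("history", False)] * 5
    + [("multi_step_math", True)] * 12 + [("multi_step_math", False)] * 8
)

by_subject = per_subject_accuracy(results)
overall = overall_accuracy(results)
print(f"overall: {overall:.1%}")
for subject, acc in by_subject.items():
    print(f"  {subject}: {acc:.1%}")
```

Because the recall questions outnumber the multi-step ones, the headline figure sits close to the stronger category, which is the masking effect described above.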
AI engineer and commentator Rohan Paul notes on X that “most benchmarks don't reward long-term memory, rather they focus on short-context tasks.”
Increasingly, AI companies are looking closely at the ‘memory’ aspect. Researchers at Google, in a new paper, detail an attention technique dubbed ‘Infini-attention’, to configure how AI models extend their “context window”.
Mathematical benchmarks often show wider performance gaps. Most of the latest AI models score over 90% accuracy on the GSM8K benchmark (Claude Sonnet 3.5 leads with 97.72%, while GPT-4 scores 94.8%). The more challenging MATH benchmark sees much lower ratings in comparison (Google Gemini 2.0 Flash Experimental leads with 89.7%, while GPT-4 scores 84.3%; Sonnet hasn't been tested yet).
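The arithmetic behind that comparison can be laid out in a few lines. This sketch uses only the four scores quoted above; the dictionary layout and the function name `gap_and_headroom` are illustrative choices, not an established scoring method.

```python
# Scores (percent accuracy) as quoted in the article.
scores = {
    "GSM8K": {"leader": 97.72, "gpt4": 94.8},
    "MATH": {"leader": 89.7, "gpt4": 84.3},
}


def gap_and_headroom(leader: float, other: float):
    """Return (point gap between the two models, points left before 100%)."""
    point_gap = round(leader - other, 2)
    headroom = round(100.0 - leader, 2)
    return point_gap, headroom


summary = {name: gap_and_headroom(s["leader"], s["gpt4"]) for name, s in scores.items()}
for name, (gap, headroom) in summary.items():
    print(f"{name}: gap {gap} points, headroom {headroom} points")
```

On these figures the gap between the two models is wider on MATH than on GSM8K, and the GSM8K leader has little room left to improve.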
Reworking the methodology
For AI testing, there is a need to realign testbeds. “All the evals are saturated. It's becoming slightly meaningless,” said Satya Nadella, chairman and chief executive officer (CEO) of Microsoft, while speaking at venture capital firm Madrona's annual meeting earlier this year.
The tech giant has announced that it is collaborating with institutions including Penn State University, Carnegie Mellon University and Duke University to develop an approach to evaluating AI models that predicts how they will perform on unfamiliar tasks and explains why, something current benchmarks struggle to do.
Efforts are under way to build benchmarking agents for dynamic evaluation of models, contextual predictability, human-centric comparisons and the cultural aspects of generative AI.
“The framework uses ADeLe (annotated-demand-levels), a technique that assesses how demanding a task is for an AI model by applying measurement scales for 18 types of cognitive and knowledge-based abilities,” explains Lexin Zhou, Research Assistant at Microsoft.
At the moment, popular benchmarks include SWE-bench (or Software Engineering Benchmark) Verified to evaluate AI coding skills, ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence) to judge generalisation and reasoning, as well as LiveBench AI that measures agentic coding tasks and evaluates LLMs on reasoning, coding and math.
Among limitations that can affect interpretation, many benchmarks can be “gamed” through techniques that improve scores without necessarily improving intelligence or capability. Case in point, Meta's new Llama models.
In April, Meta announced an array of models, including Llama 4 Scout, Llama 4 Maverick and the still-being-trained Llama 4 Behemoth. Meta CEO Mark Zuckerberg claims Behemoth will be the “highest performing base model in the world”. Maverick began ranking above OpenAI's GPT-4o in LMArena benchmarks, and just below Gemini 2.5 Pro.
That is where things went pear-shaped for Meta, as AI researchers began to dig through these scores. It turned out that Meta had shared a Llama 4 Maverick model that was optimised for this test, and not exactly the spec customers would get.
Meta denies customisations. “We've also heard claims that we trained on test sets — that's simply not true and we would never do that. Our best understanding is that the variable quality people are seeing is due to needing to stabilise implementations,” says Ahmad Al-Dahle, VP of generative AI at Meta, in a statement.
There are other challenges. Models might memorise patterns specific to benchmark formats rather than developing genuine understanding. The selection and design of benchmarks also introduces bias.
There's a question of localisation. Yi Tay, AI Researcher at Google AI and DeepMind, has detailed one such region-specific benchmark called SG-Eval, focused on helping train AI models for wider context. India too is building a sovereign large language model (LLM), with Bengaluru-based AI startup Sarvam selected under the IndiaAI Mission.
As AI capabilities continue advancing, researchers are developing evaluation methods that test for genuine understanding, robustness across contexts and capabilities in the real world, rather than plain pattern matching. In the case of AI, numbers tell an important part of the story, but not the complete story.
Source: HindustanTimes