The way we measure progress in AI is alarming

One of the goals of the study was to define a list of criteria that make up a good benchmark. “It’s very hard to even discuss benchmarks and what we want from them,” says Ivanova. “The problem is that there is no single agreed-upon standard for defining a benchmark. This paper is an attempt to provide a framework for evaluating them, which is very useful.”

The paper was accompanied by the launch of a website, BetterBench, which ranks popular AI benchmarks. Assessment criteria include whether domain experts were consulted on the design, whether the capability being tested is well defined, and other basics, such as whether there is a defined method for reporting results and whether the benchmark has been peer-reviewed.
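To make the checklist idea concrete, here is a minimal sketch of how such a rubric could be scored. The specific criterion names and the equal weighting are illustrative assumptions, not BetterBench’s actual methodology.

```python
# Sketch of a checklist-style rubric for assessing a benchmark's design.
# Criteria and weighting are hypothetical, loosely based on the examples
# mentioned in the article; this is not the BetterBench scoring method.

from dataclasses import dataclass

@dataclass
class Criterion:
    name: str
    met: bool

def rubric_score(criteria: list[Criterion]) -> float:
    """Return the fraction of design/documentation criteria the benchmark meets."""
    if not criteria:
        return 0.0
    return sum(c.met for c in criteria) / len(criteria)

# Example: a hypothetical benchmark checked against a few of the
# criteria the article describes.
checklist = [
    Criterion("domain experts consulted on design", True),
    Criterion("capability being tested is well defined", False),
    Criterion("method for reporting results specified", True),
    Criterion("peer-reviewed", False),
]

print(f"rubric score: {rubric_score(checklist):.2f}")  # rubric score: 0.50
```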

The MMLU benchmark received among the lowest ratings. “I disagree with these rankings. In fact, I’m an author of some of the papers ranked highest, and I would say that some of the lower-ranked benchmarks are better,” says Dan Hendrycks, director of the Center for AI Safety (CAIS) and one of the creators of MMLU. That said, Hendrycks still believes that building better benchmarks is the best way to move the field forward.

Some think the checklist may be missing the bigger picture. “The paper adds something valuable. Implementation criteria and documentation criteria are both important. They make benchmarks better,” says Marius Hobbhahn, CEO of Apollo Research, a research organization that specializes in AI evaluations. “But for me, the most important question is: are you measuring the right thing? You could check all of these boxes and still have a terrible benchmark, because it doesn’t measure the right thing.”

Indeed, even a perfectly designed benchmark that tests a model’s ability to produce a compelling analysis of Shakespeare’s sonnets may be useless to someone whose real concern is entirely different AI capabilities.

“Often you’ll see a benchmark that is supposed to measure reasoning. But what that means isn’t well defined. Are experts in that domain part of the process? Often they aren’t,” says Amelia Hardy, one of the paper’s authors and an AI researcher at Stanford University.

Some organizations are working to improve things. For example, a new benchmark from Epoch AI, a research organization, was designed with input from 60 mathematicians and vetted as suitably challenging by two winners of the Fields Medal, the most prestigious award in mathematics. The involvement of these experts satisfies one of the criteria in the BetterBench assessment. Current top models can answer fewer than 2% of the questions on the benchmark, which means there is substantial headroom before it is saturated.

Despite the benchmark’s difficulty, Tamay Besiroglu, a director at Epoch AI, expects AI models to reach a success rate of more than 80% on it within about four years.


