
Whenever a new AI model is released, it's typically touted as acing its performance against a series of benchmarks. OpenAI's GPT-4o, for example, launched in May with a compilation of results showing its performance topping every other AI company's latest model in a series of tests.
The problem is that these benchmarks are poorly designed, their results are hard to replicate, and the metrics they use are frequently arbitrary, according to a new study. That matters because AI models' scores against these benchmarks determine the level of scrutiny they receive.
AI companies frequently cite benchmarks as evidence of a new model's success, and benchmarks already form part of some governments' plans for regulating AI. But right now, they might not be good enough to be used that way, and researchers have some ideas for how to make them better.
—Scott J Mulligan
We need to start wrestling with the ethics of AI agents
Generative AI models have become remarkably good at conversing with us, and at creating images, videos, and music for us, but they're not all that good at doing things for us.
AI agents promise to change that. Last week, researchers published a new paper describing how they simulated the personalities of 1,000 people with striking accuracy.
AI models that mimic you might soon go out and act on your behalf. If such tools become cheap and easy to build, they will raise many new ethical concerns, but two in particular stand out. Read the full story.
—James O’Donnell