Apple is desperately trying to catch up to its competitors in AI, so it is badmouthing their successes:
A new paper from Apple's artificial intelligence scientists has found that engines based on large language models, such as those from Meta and OpenAI, still lack basic reasoning skills. The group has proposed a new benchmark, GSM-Symbolic, to help others measure the reasoning capabilities of various large language models (LLMs). Their initial testing reveals that slight changes in the wording of queries can result in significantly different answers, undermining the reliability of the models.

The group investigated the "fragility" of mathematical reasoning by adding contextual information to their queries that a human could understand, but which should not affect the fundamental mathematics of the solution. This resulted in varying answers, which shouldn't happen.

The paper uses this typical example of an LLM failure:

Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday, but five of them were a bit smaller than average. How many kiwis does Oliver have?

The paper complains that the LLMs subtract the 5 kiwis, even though the statement about them being smaller should be irrelevant.
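For concreteness, here is the arithmetic at issue, as a minimal sketch (the variable names are mine): the benchmark's intended answer counts every kiwi picked, while the answer the paper complains about subtracts the five smaller ones.

```python
# Kiwi problem arithmetic: the "five smaller" clause is meant to be irrelevant.
friday = 44
saturday = 58
sunday = 2 * friday                     # "double the number he did on Friday"

intended = friday + saturday + sunday   # 44 + 58 + 88 = 190
subtracted = intended - 5               # the answer the paper says LLMs tend to give: 185

print(intended, subtracted)             # 190 185
```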
No, this is a misguided criticism. If the LLM were strictly logical, it would refuse to answer the question as too vague and imprecise. The problem does not say how many kiwis Oliver started with, or whether he got any from other sources, or if he ate any. Without that info, no answer can be given.
The LLMs work by embedding the problem into a convex meaning space. However sloppy the problem is, it gets precise coordinates in the embedding space.
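As an illustration of that point, here is a minimal sketch of what embedding a problem looks like in code. The library (sentence-transformers), the model name, and the similarity check are assumptions of mine for illustration, not anything from the post or the paper; the point is only that even a sloppily worded problem gets one definite coordinate vector, and the irrelevant clause merely nudges it.

```python
# Sketch: a sloppy word problem still gets precise coordinates in an embedding space.
# Library and model are illustrative choices, not something the post specifies.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

original = ("Oliver picks 44 kiwis on Friday, 58 on Saturday, and double Friday's "
            "number on Sunday. How many kiwis does he have?")
reworded = ("Oliver picks 44 kiwis on Friday, 58 on Saturday, and double Friday's "
            "number on Sunday, but five of them were a bit smaller than average. "
            "How many kiwis does he have?")

a, b = model.encode([original, reworded])

# Each phrasing maps to exact coordinates; the extra clause only shifts them slightly.
cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
print(a.shape, cosine)
```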
Whoever formulated this problem seemed to be saying that the 5 kiwis should not be counted. Why else is it phrased that way? Okay, it is phrased that way to be a trick question.
What would you want an LLM to do? To assign meaning in the most direct way, or try to interpret the problem as a trick question?
Apple seems to want to benchmark LLMs on trick questions. No thanks.
For another view, see the video Apple DROPS AI BOMBSHELL: LLMS CANNOT Reason. It argues that Apple proved the LLMs are worse than people think, and not likely to be fixed soon.
Update: Dr. Bee discusses the Apple paper.
If LLMs were so great, drive-thrus would have been automated years ago. They aren't making any money. Attention can essentially be linearized, so LLMs are just RNNs in disguise. Neat, but too stupid to pay their own bills.
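The linear-attention remark refers to a real equivalence: if the softmax is swapped for a positive feature map, causal attention can be computed as a running state, one token at a time, exactly like an RNN. Here is a toy numpy sketch of that identity; the dimensions and feature map are arbitrary choices of mine, not anything from the comment or the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 4
Q, K, V = rng.normal(size=(3, T, d))

phi = lambda x: np.maximum(x, 0) + 1e-3     # a simple positive feature map

# "Parallel" causal linear attention: no softmax, scores masked to the past.
scores = phi(Q) @ phi(K).T                  # (T, T)
mask = np.tril(np.ones((T, T)))
num = (scores * mask) @ V
den = (scores * mask).sum(axis=1, keepdims=True)
out_parallel = num / den

# The same output computed as a recurrence: RNN-style running state (S, z).
S = np.zeros((d, d))
z = np.zeros(d)
out_recurrent = np.zeros_like(V)
for t in range(T):
    S += np.outer(phi(K[t]), V[t])
    z += phi(K[t])
    out_recurrent[t] = (phi(Q[t]) @ S) / (phi(Q[t]) @ z)

print(np.allclose(out_parallel, out_recurrent))   # True
```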