Jake Browning

Why isn't Multimodality Making Language Models Smarter?

Philosophy has something called the "symbol grounding" problem. The basic question is whether the words and sentences of a language need to be "grounded" in the world through some sort of sensory relationship in order to be meaningful, or whether their meaning can be inferred solely from their interrelationships with other linguistic symbols. Current language models offer an interesting way of testing the main approaches, because many models--like ChatGPT and Claude--were originally trained on text alone. Thus, if these machines were speaking meaningfully, it would imply that meaning does not require non-linguistic grounding. By contrast, if they weren't speaking meaningfully--or, at least, if their speech was not meaningful on certain topics--this would suggest we do need some non-linguistic grounding.


The results were mixed. On the one hand, the language-only models often showed enormous competence on tests and simple puzzles. Ellie Pavlick showed that the models often displayed some competence with sensory categories, such as locating colors correctly in color space. And Dimitri Coelho Mollo and Raphaël Millière made a good case that the models were grounded because they were trained on grounded conversation and fine-tuned according to human norms.


But, on the other hand, when given a puzzle with no analogue in their training data--one relying on intuitive physics or psychology--there seemed to be a major gap (a point stressed by Kyle Mahowald and Anna Ivanova's paper, as well as Subbarao Kambhampati's work on reasoning and planning). This led many to hope that these kinds of problems would lessen once the models were given access to non-linguistic modalities. Or, anthropomorphically: if we provide "vision," then the model will have sensory grounding and will know (better) what it is talking about.


So more recent models attached a visual modality of sorts, connecting images with text, so that the model can "see" what it is talking about. Does sensory grounding help? We now have Gemini and GPT-4, and the answer is a resounding "no." It doesn't really matter. Which is a bad thing for both sides, sadly: we don't have much more text data to train on, so the models won't magically get smarter without sensory grounding. But providing sensory grounding--at least in mere images--doesn't supply the missing common-sense reasoning abilities we were hoping for. More modalities might help but, as of yet, we aren't seeing much progress. Why?


I think the answer is that the premise of the symbol grounding problem was (and is) wrong. The underlying assumption is that language is a fully autonomous domain, one that can be cleanly shorn from the world. This is combined with the idea that language is the repository of knowledge--that everything we know is encoded, or encodable, in claims, propositions, or equations. This is a seductive picture, one that underlies much of 20th-century philosophy of language and mind, cognitive psychology, and symbolic AI. It's difficult even to formulate certain puzzles without it: Frank Jackson's Mary's Room only works as a critique of physicalism if all knowledge can be found in textbooks. A lot of accounts of concepts--from Fodor to Brandom--still revolve around this language-specific understanding of concepts. On these accounts, the question is: "what does sensory grounding add to our already robust conceptual model of the world?"


But part of the appeal of the rise of connectionism, cognitive neuroscience, and animal cognition studies at the end of the 20th century was that many people abandoned--or, at least, saw the limits of--that view of language and knowledge. These fields investigated more grounded kinds of reasoning, exploring how animals, infants, and other non-linguistic creatures could still solve complex problems and navigate difficult situations. The guiding assumption is that there is a rich, foundational body of knowledge and understanding prior to language. This is why Yann LeCun keeps encouraging us to aim for the intelligence of a cat, and why Josh Tenenbaum is aiming for that of a three-year-old. This is a really high bar, despite how dismissively many philosophers have treated it.


But from this perspective, the idea that language is an autonomous domain looks suspect, a bad analogy based on the autonomy of formal languages, like mathematics, logic, and code. A lot of contemporary approaches to language focus instead on the supplemental resources needed for making sense of language, like Tomasello on joint action or Scott-Phillips on theory of mind. These resources are taken as fundamental to language, the key factor responsible not only for its evolution but also for concretely understanding most sentences. They depend on language having evolved among embedded agents trying to navigate real-world situations. There is no autonomy of language; it is so intertwined with the world that talk of "grounding" is simply a bad mental model.


LLMs have led to a resurgence of interest in the old paradigm: might we really be able to build a machine that knows it all through language? Couldn't we create our own mechanical Mary, one who knew everything even without all the same modalities? The idea then runs: each modality provides a bit more knowledge, but nothing fundamental. The core of the knowledge is still in the language model, and giving it an image or a sound clip merely supplements the (already more-or-less complete) knowledge.


But it isn't working right. First, LLMs don't seem to know as much as we'd like. They certainly aren't using language right: they don't share our norms of honesty or consistency. They can talk about the world, but they often prove incapable of accurately modeling it, especially its dynamics in changing circumstances. They aren't capable of planning multiple steps ahead in problems they haven't seen. Even simple world-models that seem learnable--like Othello and chess--often prove brittle; the model often fails to play according to the rules.


But, second, it isn't clear we should expect it to help. Simply adding an image recognition system to a language model won't reproduce the complex understanding of vision available to a stealthy predator that can navigate both on the ground and in the trees, like a chimp. Image recognition that connects an object with a label is important for a linguistic being, so it isn't nothing. But it isn't vision, in the sense we mean when we talk about basically any organic being. It is also, again, backwards: appending a label to an image is basically a freebie a language-user gets on top of the evolved, complex visual reasoning abilities.


So the language model approach isn't enough, and simply adding image recognition isn't doing much. Might we just keep after it, with more and more data? The problem here, third, is that we're trying to push a ton of capacities through next-token prediction. That's a classic case of: if you've got a hammer, everything looks like a nail. But expecting that, if you tweak it enough, it will finally start reasoning, planning, and modeling causality consistently is really desperate. We know evolution didn't do it that way; it spent 500 million years evolving a brain, and only 100,000-ish years on language. We're certainly trying something strange.


So it doesn't surprise me much that progress is slowing down a lot. It surprises me even less that Sam Altman is starting to scale back expectations for AGI immensely, suggesting it won't be an AI God or even that transformative. LLMs made rapid progress these last few years, mostly through scaling, with a little fine-tuning to stabilize it. But there isn't much more textual data, and simply slapping images or sounds onto it won't solve the problem. We're basically at the same point we've been at since we all recognized, back in 2019, that self-driving cars weren't magically going to appear with more compute. We need world-models and systems capable of reasoning through problems--and we still haven't figured out how to bring them about.


