[Expanded Remarks for NYU Debate on whether AI language models can understand the world based entirely on text]
I want to introduce two different models of understanding, both of which have been historically influential for thinking about artificial intelligence. The first, which I’ll dub the “cognitivist” approach, was seen most explicitly in the older work of Symbolic AI, beginning in the 1950s. The second, dubbed the “pragmatist” approach, was more familiar from the critics of early AI, like Bert Dreyfus and John Haugeland, though also picked up by many connectionists. The cognitivists focused on understanding as a capacity to solve problems in a linguistic format, whether natural or formal. The pragmatists, by contrast, focused on understanding as embodied engagement with the world.
The benefit of revisiting these two theories is that the rise of large language models has revived something like the old cognitivist picture for many researchers. I want to remind us of why earlier thinkers thought that paradigm was so limited and thereby relearn some hard-won lessons about understanding that the pragmatists taught us in their criticisms of Symbolic AI. This approach also has a major payoff: on the cognitivist picture, the problems of LLMs are seen as frustrating limitations which researchers keep hoping will be overcome by more and more data. On the pragmatist picture, LLMs are behaving exactly as we would expect, and it is a mistake to expect anything else from them.
In the 20th century, it was very common to treat understanding as principally centered around language. This appealed both to the empiricists, who embraced a Sapir-Whorf approach that held that language shaped our understanding right down to the color categories we perceive, and to the rationalists, who treated formal logic as the core mechanism of thought. But this is somewhat surprising since, for much of the modern period, the dominant notion of understanding was perceptual: understanding something meant perceiving it clearly and distinctly in one’s mind’s eye. This perspective held that an ideal understanding, such as God’s, would involve perceptual capacities that could perceive the world with utter clarity, as well as a capacity for imagination that would make it possible to see what was going to happen in the future.
But Kant recognized a problem for this view. He held that some thoughts, such as which hand is left or right, depend on perception. But he also held that many other thoughts depend on language, such as thoughts about freedom, necessity, or causality. Freedom and causality are not directly perceptible and can only be understood if properly expressed as explicit rules governing how things work. Kant also argued that certain kinds of judgment, such as disjunctions, conjunctions, and hypotheticals, are not directly perceivable either. He contended that judgments depend on a non-perceptual format: propositions. Kant argued that some truths (the laws of logic, physics, and ethics) can only be expressed propositionally, as assertions. Philosophers at the end of the 19th century, like Gottlob Frege, went further and argued all thoughts are propositional (and, as an added bonus, numbers and equations could be derived as special cases of logic). Natural languages, for these thinkers, are just the way we clothe our propositional thoughts: thinking is linguistic, ideally expressed in symbolic logic to avoid ambiguity, though natural language would do in a pinch.
This eventually brought with it a basic model of how understanding would work: the totality of thought just is a massive list of all the possible true propositions of language, and someone with understanding is someone with a non-trivial collection of these true assertions in their head—a vast database of true facts about the world, expressed in some linguistic format. For cognitivists, thinking is a matter of reasoning with these propositions—drawing inferences, checking the consistency of implications of different beliefs, and so on. But this position left open a major problem: how are these propositional thoughts grounded in embodied human life? Cognitivists proposed different solutions, but a typical one—seen in Bertrand Russell’s “knowledge-by-acquaintance”—was to ground these consistent logical theories in something perceptual that could serve to decide which among our theories is true.
The most ambitious form of this (though, to be clear, only one among numerous others) was Rudolf Carnap’s 1928 Logical Construction of the World. He argued we could provide a “logical construction of the world” which combined our best scientific theories with sensory data by rewriting sensory data as propositions. He came up with a rough guide for how to translate the whole of perceptual experience and the knowledge based on it into logical sentences, so that everything knowable by humans could be expressed in logic. His basic idea was that we could digitize each sense-modality: carve up perceptual experience into bite-size chunks and categorize each chunk according to its spatial, temporal, or qualitative dimensions. Although novel in Carnap, this should be familiar now: our screens all use pixels, each one categorized by a proposition consisting of its x-y coordinates and its red-, green-, and blue-values. Carnap believed that, having done so, his job was done: an experimentalist testing a logically consistent theory could form a hypothesis in the theory about what would be observed in certain experimental situations and then could simply “read off” the appropriate experimental information from the pixel values, which would confirm or disconfirm the theory. For Carnap, this was evidence that all perceptual data, and with it all aspects of human experience, could be fully expressed as linguistic data and nothing of import was lost in the translation.
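To make the pixel analogy concrete, here is a minimal Python sketch of what rewriting an image as a list of propositions might look like. It is only an illustration of the analogy, not a reconstruction of Carnap's own system: the names `PixelProposition`, `image_to_propositions`, and `as_sentence`, and the 2x2 example image, are all hypothetical.

```python
from typing import NamedTuple

class PixelProposition(NamedTuple):
    """One 'proposition' about one pixel: its position and its color values."""
    x: int
    y: int
    red: int
    green: int
    blue: int

def image_to_propositions(image):
    """Flatten a 2D grid of (r, g, b) tuples into a list of pixel propositions."""
    return [
        PixelProposition(x, y, r, g, b)
        for y, row in enumerate(image)
        for x, (r, g, b) in enumerate(row)
    ]

def as_sentence(p):
    """Render one proposition as a sentence that can be 'read off'."""
    return f"The pixel at ({p.x}, {p.y}) has red={p.red}, green={p.green}, blue={p.blue}."

# A hypothetical 2x2 image: red and blue on the top row, blue and red below.
RED, BLUE = (255, 0, 0), (0, 0, 255)
tiny_image = [[RED, BLUE], [BLUE, RED]]

propositions = image_to_propositions(tiny_image)
for p in propositions:
    print(as_sentence(p))
```

The four sentences record every pixel value exactly, yet none of them says that the image shows a checkerboard pattern. Each value can be read off, but what the picture depicts is not among the propositions.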
3. Varieties of Representations
This led to some hubris among the early AI researchers, such as those behind MIT’s infamous Summer Vision Project in 1966. Carnap’s work seemed to suggest reading the contents of a digital photograph should be relatively easy, largely “solvable” in a summer by some grad students. But the project failed rather dismally, coming up with little in the brief period except for a simple realization; in David Marr’s words, “The first great revelation was that the problems [of vision] are difficult.”
This realization arose from the recognition that certain problems can be formalized and, as such, are rather easy. Carnap had found an easy one, for example, with pixel-values, but Helmholtz had gotten another when he recognized we could measure the difference between red and blue according to their distance in a three-dimensional similarity space. But the hard problems—such as recognizing a face from pixel values, much less understanding and responding to what the face is expressing—hadn’t even been started on.
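The "easy" kind of problem can be sketched in a few lines of Python. This is a rough illustration only: it assumes colors are given as RGB triples and uses plain Euclidean distance, which is neither Helmholtz's actual similarity space nor a perceptually accurate one. The function name `color_distance` and the sample colors are hypothetical.

```python
import math

def color_distance(a, b):
    """Euclidean distance between two colors given as 3-tuples of coordinates."""
    return math.dist(a, b)

# Assumed RGB coordinates for three example colors.
RED = (255, 0, 0)
BLUE = (0, 0, 255)
ORANGE = (255, 165, 0)

red_to_blue = color_distance(RED, BLUE)
red_to_orange = color_distance(RED, ORANGE)

# Red sits closer to orange than to blue in this space.
print(red_to_orange < red_to_blue)
```

Once colors are points in a three-dimensional space, measuring their similarity is a one-line formula. Nothing comparably simple takes a grid of such values and returns "this is a face."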
But the problem was that writing out the pixel-values does not actually translate the content of an image into language. Reading texts and reading images aren’t synonymous; they depend on different skills. If we have both a written report and photos of the same car accident, they represent things differently and we learn different things from each of them. This puzzled a lot of folks in the cognitivist paradigm when they ran headlong into it, but it has proven a productive puzzle: it forced us to start thinking through different ways we represent the world.
A rather elegant version of this, one I will rely on here, is Haugeland’s “Representational Genera,” where he not only makes Kant’s point that certain things can’t be represented in language or in perception, but also articulates what makes each distinct, as well as how both language and images differ from distributed representations. Imagistic representations, what Haugeland calls “iconic,” fundamentally convey their content by capturing variations of elements along different dimensions, such as colors across spatial dimensions, or sounds along temporal dimensions. Images convey fundamental isomorphisms between the image and what it represents: the colors in a spatial organization cannot be rearranged and retain the isomorphism because their spatial location is essential for capturing what the image is representing and how it is representing it. This holds not just for strict resemblances, like a picture or recording, but also for the isomorphic relationship between lines on a page and the rise of nominal GDP over the last decade in various countries.
Imagistic representations come in an enormous variety and can convey either a lot or a little detail: a sketchy set of lines on a napkin can represent a map of the I-25 and I-40 junction, but so would a detailed, three-dimensional 1:8 scale model of it. Despite this, there are strong limits to the sorts of knowledge that can be represented imagistically: this format works well for continuous values and varying shapes, but discrete, highly specific, and abstract relationships cannot be represented this way. Consider laws of nature, the feeling of boredom, or the value of an honest day’s work; there is nothing to “picture” in any of these cases, and any attempt to picture an “instance” of such a thing must also picture countless other things as well, making it nearly impossible to convey precisely the content we want.
Linguistic representations work differently: they represent everything in discrete, bite-size chunks. Each proposition expresses only a discrete fact, and nothing else. This can still be remarkably expressive because it allows us to represent properties, features, objects, events, facts, abstract relationships, impossible situations, unimaginable possibilities, unperceivable facts about quantum reality, and a host of other things. But we do so atomistically, piece-by-piece, using discrete symbols and connecting them according to strict rules for combination. On its own, each proposition says very, very little: if I say there was a person eating a sandwich, it implies they have a body, but it doesn’t say it. Linguistic representations compress away an enormous amount of information: a recording of Mahler’s 8th symphony, featuring the sounds of over a hundred instruments playing for over an hour, can all be fit into a relatively short score. Mahler’s 8th can be represented both in a recording and in its score, but they’ve compressed different bits of information and thus represent their content in radically different ways.
The recording of Mahler helps show the limits of linguistic representations. There is a lot lost in the compression process to get everything into discrete chunks, and so all the continuous values that depend on their interrelationships have to be written instead as discrete facts. Moreover, the linguistic symbols are often arbitrarily chosen (hence the diversity of different words for the same thing in different languages), so people need to learn the conventions to know what each word means. This is the “grounding” problem: what does each symbol mean? Many 20th century philosophers hoped you could simply define each symbol in terms of other symbols, as a dictionary does or as is done in distributional semantics. But the task of “making it explicit,” of saying all the possible relationships between every possible symbol in order to provide a comprehensive understanding of each term, proved impossible, as evidenced by the limited progress of Doug Lenat’s multidecade Cyc project. There are also numerous types of rules that we don’t know how to express, and perhaps can’t, such as making explicit all the proper rules for winning every possible game of Go. And Go is a closed, certain domain with clearly known rules; a major takeaway from Symbolic AI is that success on a “microworld” has little bearing on success in the real world.
There is also another way of representing the world: skillful know-how. The most familiar, and successful, model of skillful know-how is the kind encoded in distributed representations, such as those found in neural networks, both biological and artificial. Distributed representations are especially interesting in this regard because they have proven adept at interpreting, manipulating, and even generating both imagistic and linguistic representations: the same domain-general network can learn to do a lot of things. But it doesn’t do them all in the same way; the network doesn’t learn a single skill of “reading,” for example, that it then applies to both texts and images. It instead learns the skill for representing images by learning about edges, lines, shapes, features, and then using their relative locations to piece together what the image is about and how the elements hang together spatially as objects in the world. By contrast, it learns the skill for representing language by learning about words, nouns, verbs, abstract properties and relationships, and so on, and then recognizing how certain discrete combinations of these elements can mean different things in different contexts. In both cases, the distributed representation encodes the relevant elements according to whatever the network is trying to do, whatever its function is. And this is a representation: lots of information is compressed in order to capture the salient features, though what is compressed depends on what the network is trying to do and on the nature of the inputs. The goal is for it to see through the mere pixel-level data or the discrete words and figure out what the input means, grasping what the various lines are about, or what the sentence is referring to.
But, again, no way of representing can cover everything; each fails to capture all the potentially significant information presented by its object. Distributed representations are no exception. They are great for accomplishing some function, but they are grossly inefficient and often inappropriate for storing information. Getting a distributed representation to “know” discrete facts is somewhat indirect: you want it to output when the Battle of Waterloo was, but it encodes that information according to the statistical likelihood of which numbers follow the sentence, “the Battle of Waterloo happened in___.” Trying to get that information out of the network requires a sentence almost exactly like that, and the same information won’t necessarily be accessible to other parts of the network, even when it might be relevant to know. In short, it doesn’t represent things in an intuitive or obvious way, and it often proves pretty challenging to get out of it what we want: hard enough that “prompt engineer” is now an actual job. Also, while these networks can learn highly convoluted tacit rules such as how to play a good game of Go, strict rules over discrete symbols often prove beyond their ken, and the systems instead make do with approximations that permit lots of failure cases around the edges when strictness is required.
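A deliberately crude Python sketch can illustrate how a fact can be stored only as "what tends to come next." It is a simple counting model over exact contexts, not a neural network, and it exaggerates the brittleness: real networks generalize across paraphrases far better than this toy does. The function names `train` and `complete` and the two-sentence corpus are hypothetical.

```python
from collections import Counter, defaultdict

def train(sentences):
    """Count which word follows each exact preceding context (all earlier words)."""
    table = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for i in range(1, len(words)):
            context = " ".join(words[:i])
            table[context][words[i]] += 1
    return table

def complete(table, prompt):
    """Return the most frequent next word for a prompt, or None if unseen."""
    followers = table.get(prompt.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# A hypothetical two-sentence training corpus.
corpus = [
    "the Battle of Waterloo happened in 1815",
    "the Battle of Hastings happened in 1066",
]
model = train(corpus)

print(complete(model, "the Battle of Waterloo happened in"))  # 1815
print(complete(model, "Napoleon's final defeat came in"))     # None
```

The date is never stored as a standalone fact about Waterloo. It exists only as the likeliest continuation of one particular string, so a differently worded question about the same event retrieves nothing.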
This leaves us with three different ways of representing the world: linguistic, imagistic, and as know-how. We can now use them to make sense of the cognitivist program and, with it, the first model of understanding. It took understanding as entirely linguistic, as grasping the logical connections between concepts and the capacity to derive and explain phenomena in terms of those concepts. The approach was understanding as book smarts. This brought with it blind spots, such as how much information is encoded in a picture or recording, and how much of that is lost in language. It also devalued difficult-to-express kinds of knowledge, like know-how. And this meant a lot of the ways animals and infants understand their world, as well as the understanding found in emotions, imagination, and embodiment, were all relegated to the “non-cognitive,” to something peripheral to understanding. But this paradigm struggled, and its critics often argued the chief problem was overvaluing language and logical reasoning and misunderstanding what role the so-called “non-cognitive” stuff played in our understanding of the world.
4. The Second Model
It’s helpful to contrast the cognitivist model with other radically different approaches. Pragmatists, such as William James, took their starting point from aspects of life that the intellectualists considered “non-cognitive”: they argued that mindedness is fundamentally about embodied action. For James, perceiving is not just passive stimulation but is actively categorizing the world into what matters, where what matters is specified according to the interests of the organism: what is liable to lead to pleasure or pain, happiness or disgust, survival or death. James took cognition to get its start in experience: the capacity of the organism to be aware of the environment and to predict both how the world will play out and how the organism’s actions will influence the environment. The evolution of organisms, then, proceeds along the increasing detail and sophistication of the imagistic representations produced by each sense-modality: the increasing complexity of our visual representations would provide more detailed awareness of one’s environment, more robust predictions about the causal interactions of other creatures, and would even permit reuse, in the form of simulations of various situations prior to choosing actions. Natural selection would ensure these imagistic representations would be grounded, accurately conveying a real-time, detailed isomorphism between representation and world permitting action, prediction, and planning.
Heidegger provided a different narrative, one more focused on humans. He argued that understanding was not fundamentally a matter of knowing about the world in a scientific sense. Rather, the fundamental knowledge was know-how: grasping how the aspects of the environment are relevant and useful to us. As with James, this was a fundamentally practical outlook, focusing on how objects and people support our interests: trees can provide shade from the Sun, hammers allow us to build things, and so on. In humans, as a highly social species, this also opened up wholly novel and human-specific kinds of know-how: social reasoning, communication, and (especially) norm-following. These are “scaffolded” on top of the physical reasoning abilities found in our embodied know-how for navigating the world, permitting many new kinds of understanding. Language, on this model, is a special kind of know-how, primarily a tool for communicating and accomplishing joint actions between humans. Although it opens up radically different abilities in humans (story-telling, explanations and justifications, inner monologues), these are built on top of the underlying communicative role of language.
The second model, then, focuses on what the first model dubbed “non-cognitive”: experience, know-how, and the interests of the organism, including valence and emotion. For this model, language isn’t the core of cognition, with perception and embodied know-how simply “grounding” our thoughts. Rather, it regards it as in principle impossible to say it all: much of our understanding is built into our experience and the predictive models it supports, and in our know-how and the skills and abilities found there. Language allows us a highly compressed channel for communication, but it depends on enormous amounts of background knowledge. Our communication, and with it the successful coordination of behavior, rests on us having similar embodiments, upbringings, skills, cultures, and so on. Our common-sense understanding, then, is not primarily linguistic, nor is much of it found only in adult humans; it is instead distributed into different kinds of know-how which, when they come together, provide a useful predictive model for planning out our behaviors, including joint actions, in the world.
5. Large Language Models
These two models provide two different ways to think about current LLMs and, with them, two ways of evaluating how far along they are to understanding what they are talking about. As I’ll suggest, on one model they are pretty far along; on the other, all we’re seeing is ersatz understanding, a pale imitation of the genuine article.
With current LLMs, many people have found themselves rearticulating the old cognitivist model, or at least a statistical version of it compatible with learning everything from language. It’s different, to be sure: the know-how for logical reasoning is only a limited feature of these systems, whereas the looser capacity to converse fluidly over a host of topics is taken as key. But both versions of the paradigm still assume the majority of human knowledge can be expressed linguistically or inferred from what has been said. We also see many of the same gambits being made: arguing language models can solve any “cognitive” task, where cognitive is typically defined as a task where the prompt is linguistic and the output is linguistic. And we see some of the same arguments that inputting sentences expressing pixel values will allow the machine to “read off” the visual content. This is superficially plausible since these networks are domain-general learners, though when trained they become domain-specific processors, effectively carving out visual and linguistic “modules,” which just reinforces how different these kinds of representations are. Finally, many people feel impressed by the ability of these systems to represent “microworlds,” such as the moves of an Othello game or drawing a map from linguistic directions. Gary Lupyan, for example, argues this provides “in principle” evidence these models can successfully represent the world, though how to scale them up remains as much a challenge for us as it was for Winograd. I’m sure that, if I looked, I would find someone beginning a new, statistical version of Cyc: a database of sentences that “says it all.”
But, if these are the successes of LLMs on the first model, the failures are also glaring. These models do not have the concepts implied by their sentences and, as a result, often flounder when talking about them in even slightly abnormal circumstances. We still find far too many cases where a superficially successful start to a problem turns into a mess of nonsense, as in the example from Anna’s paper, when the LLM’s advice for getting a couch onto the roof is to cut it up and throw it out the window. We also see far too much inconsistency and illogical behavior, as the examples on the slide suggest. And grounding the systems, with vision and with human feedback, hasn’t stopped confabulations, sometimes dangerous ones. These systems sometimes impress us with their ability to turn a phrase, but often we find ourselves meddling with the prompt for five minutes just to get an appropriate response. If we are aiming for natural language understanding, we should expect consistent competency, not this pervasive waffling between sense and nonsense. If cognitivist understanding is the goal, the systems have a long way to go. Still, an advocate might suggest, they are on their way.
On the second model, by contrast, language-only AI systems are working like we would expect. They possess only a shallow, inconsistent, and often underwhelming understanding for good reason: that’s what language is for. Language is for expressing the high-level, abstract information, not the nitty-gritty stuff we can see and do and interact with: broad strokes, not details. Language rests on interlocutors already possessing the deep, multimodal, embodied know-how for successfully navigating the physical world, and the social awareness and reasoning needed for joint actions. It expects we already use imagistic representations, in perception and in mental simulation, that provide us with high-fidelity nuance and detail, and that we have robust and diverse kinds of know-how. And it expects understanding language won’t just be explaining concepts but also accomplishing things with other people.
The second model suggests building a self-driving car will be the hard problem, but a successful chatbot is easy—in effect, just a scaling problem. This is because it regards language as a microworld—a discrete chunk of reality, circumscribed by clear rules and limited by the number of symbols which most people can remember.
Perhaps both models are wrong in important ways, ways current LLMs can teach us. But, at present, it seems like many of the old criticisms leveled by Dreyfus and Haugeland still matter. Their central contention is that, at its core, human understanding isn’t about book smarts; it’s about experience, know-how, emotions, and social interrelations. The first model suggests that a system talking competently about how similar red is to blue or how depression feels is a sign it understands experience; the second suggests that is missing the point entirely. I’m inclined to think the second model is more relevant than ever right now: a necessary corrective to overexcitement about the potential of linguistic systems to be the basis of some world-beating AGI. For me, they look like fascinating and important tools, but only a pale imitation of genuine understanding.