Ted Chiang's pieces tend to cause a stir whenever they come out, mostly because he is such a fantastic writer and thinker. But the pieces also tend to flatten some features of AI, leaving people uncomfortable with how confidently he asserts controversial claims. In the current case, the main problem stems from a debate about the connection between the artist's intention and the art itself. While critics are right to think Chiang oversimplifies this relationship, I'm not sure the critics are fairly wrestling with the point.
Language and Intention
A big overstep on Chiang's part stems from treating language as solely a communication system that depends on speakers' intentions. On this read, I know what you're saying not (just) because of the words you use, but also because I can infer your intentions--that is, your goals in saying it. If a student tells me they are visiting family on Thursday, I can infer that they are saying it because they won't be attending class, that they are requesting an excused absence, and that they are hoping I'll let them make up a quiz. We do this so effortlessly that many think there is an evolutionary case to be made that inferring intentions is what language evolved for. It is only a small shift from here to taking it that art also depends on communicative intent: this painting says something because it expresses my views.
But it is a genetic fallacy to assume the nature of language (and art) hasn't changed in the last two-million-some-odd years--that is, that its evolved purpose is still its current purpose. Once you have an established set of signs, signals, or images, they become conventionalized and comprehensible independently of their communicative intent. This is obvious with words, where there is simply a conventionalized meaning of "the cat is on the mat." But it happens with slang, too: if I am telling a story and my friend responds, "damn, that's crazy," I know that they are bored, because kids today have established that as a signal of disinterest. The communicative intent is, at this point, baked in and obvious. The same goes for memes, images from a movie, and so on--we know what they mean simply by seeing them.
This is a good thing, too. Part of the point of conventions, norms, and rules is to simplify our world: I don't need to infer people's intentions because there are now standardized signals of them. We never bother to ask, "what was the designer's intention in having my receipt print out that item and number?" Receipts just follow a script and, once you know the script, you can read a receipt--or make up your own--just fine. Conventionalized meanings make understanding symbols and images cognitively simpler: rather than requiring both complex theory-of-mind calculations and a focus on semantics, I only need the latter (and revisit the former if something goes wrong).
So Chiang definitely overstates it. But I also think people aren't being generous enough about his point. Language models struggle with this intent stuff (theory of mind, communicative intent, pragmatics, and so on). They excel at conventionalized material--and do well when communicative intent has already been frozen into a script--but struggle with intentions worked out on the fly. This mirrors the broader problem with these systems: rational-seeming behavior that often relies on having been trained on the test, paired with absurd out-of-distribution behavior. If you see that pattern, it should push you to doubt that the rational-seeming behavior is anything more than memorization.
The Loss of Intentionality
Does this mean language models cannot have--and cannot infer--communicative intent because of their failures? Obviously not; they can still engage in the conventionalized scripts. And one of the best conventionalized scripts we have comes from the Gricean norms: be honest, relevant, useful, decent, and concise. Language models clearly ape this (especially after fine-tuning) and, as such, are often acting on conventionalized communicative intent.
There's some definite fudging happening, however; they aren't reliably honest, useful, or relevant. Humans aren't either, but we react differently to humans: we judge them for dishonesty, irrelevant answers, and unhelpfulness, and reputation really matters. If we encounter a speaker who consistently violates these norms, we tend to stop relying on them. We often even gossip about them, signaling to others that they cannot be trusted. The result is that we simply don't ask our dishonest or mean-spirited neighbor for advice on writing a letter of recommendation, for example.
But we do ask unreliable language models. This is part of Chiang's point: we are grading language models on a curve. If I don't trust somebody, I don't talk to them, and, when I do engage with them, I ignore their advice. But people ask ChatGPT for advice all the time. Some do this out of ignorance and end up citing legal cases that don't exist. The failures of these systems are worth dwelling on: a language model could not accurately take McDonald's orders. That is an epic failure, and one that is hard to explain if you think these tools are genuinely "understanding."
Among the AI-hypers, the typical response is to argue that we should treat these more like tools, with cautious usage involving lots of back and forth, fact-checking, rewriting, and correcting. But we shouldn't understate what kind of admission this is: the hypers are conceding that language models aren't actually satisfying them. If you ask someone for help and they consistently botch it unless you keep prodding and nudging them in the right direction, then they are missing the point.
Which is the core of Chiang's argument. He acknowledges people are doing lots of back-and-forth tinkering with generative art (the Bennett Miller example). And while he is wrongly skeptical that people might do the same with text, that is mostly because this kind of approach would just create more work than writing it yourself. (And, in a timely coincidence, an Australian study found just that.) The point is that the tweaking is what gives the content its communicative intent: the language model inherits the original person's intent through the repeated prompting. The output ends up saying what it says because the human screwed with it until it said something they felt conveyed a point they liked and agreed with, or that seemed art-y enough. The intent is there because of the human in the loop; the machine, for its part, is just spitting out conventionalized words, phrases, or images until a human blesses the outcome. The user is regularly underwhelmed with the output (hence all the editing and tweaking).
It's a kind of labor we wouldn't put up with from a human co-writer. If the co-writer just didn't get the point after a few tries, we might well abandon them or the project. Put differently, the fact that people expend all that effort tweaking these things is evidence that Chiang is right: the models aren't doing what we want because they lack intent.
Unread Texts and Ignored Art
A critic might still say, "so what? Sure, I acknowledge generative AI lacks intent, and that's why we need to tweak it like crazy to get an output we like. But people still do that, so what's the problem?" The problem is that there's no real money in it. A few artists and authors may well use generative AI for this purpose--and good for them!--but their returns are going to be minimal. They don't justify the outlay of billions into training, fine-tuning, and deploying this tech.
The real money comes from slop: search engine optimization, bullshit posts, inappropriate usage, fake books, scams, counterfeit lovers, pornography, and so on. The vast, vast majority of writing and drawing in the world is never seen by anyone else or paid much attention (diaries, doodles, sketchpads, etc.). But it is purposive in the hands of a human: I often write a blog post to clear my head or unpack a problem I'm dealing with, even if I never publicize or post it. It helps me, even if it is otherwise useless.
But slop helps no reader. No one is supposed to read SEO garbage, like the "recipe stories" that precede recipes. If more and more writing is slop, then the world really does end up full of unintended language--just stuff that fills in space. A lot of it is parasitic, too; scams, fake books, and bullshit articles only make money because they exploit people. The reader is worse off for having read them and taken them seriously.
There are partial exceptions--like people with fake friends or therapists--where users may benefit. But this area is still underexplored in terms of research (is it really helpful, or is that just marketing?) and, in some cases, there have been massive changes to these systems because of unwanted behavior (Replika, for example, decided it didn't want to be a sexbot company, against the wishes of its users). There is very little happening in the way of positive AI writing; its optimal outcome, in most cases, is for it to be read by one person and then deleted so it doesn't seep into the mainstream.
Generative art is a bit more interesting: people do like pretty images, regardless of who made them. And, properly tweaked, these models can be used to do impressive things. But text-to-image models are wholly dependent on convention: for the model to create a picture from a prompt, it needs to already have well-established meanings for each term. Consider a yellow raincoat: to make this image, the model needs pretty well-established concepts for "yellow" and "raincoat." Easy enough.
But now consider trying to describe a shootout scene from a John Woo movie: what's the right way to describe the camera angle? Or the way the character jumps? Or how other people react? Or the doves, explosions, and so on? A large team--choreographers, directors, actors, stunt people, camera operators, and directors of cinematography--has to spend time coordinating a ton of different things, often through lots of positioning people, gesturing, and pointing.
Anyone who has spent time with 3D models or video game engines knows that, even though it is a pain, it is easier to model a scene like this than to describe it in detail. (It's why "Killer Bean" could exist decades ago, even while AI models still can't get the physics half as right.) Text-to-image AI art, by contrast, needs--by definition--to be generic: if it weren't generic, the model could not learn to associate the words with those images. The "soulless" character of AI art is a necessary consequence.
And Chiang is right to be worried. While there are a lot of folks creating art through careful, intentional tweaking of these models, there are far more people creating stupid garbage. Fake images of Beethoven are now showing up on the front page of Google because, when an AI system looks for the most "Beethoven-y" image, an AI image will usually win out (it is, after all, designed to look as Beethoven-y as possible). As the world becomes flooded with AI-crap--deepfakes, AI porn, fake influencers, and so on--we should worry about how lame it all is. Which is Chiang's point: if most human-made writing isn't worth reading, and most human-made art isn't worth looking at, how much worse will a world filled with generic knock-offs be?
Intentions maybe aren't enough to make a bad piece of art good, but they at least explain why someone would paint or write. Nothing justifies meaningless slop, and a world flooded with it makes it ever harder to find anything worth looking at. It's a point AI hypers need to reflect on before circulating another soulless image or an unreadable bit of copy. Flooding the world with boring content is a shame, even if a neat toy created it.