Try asking ChatGPT’s image generator to draw a glass of wine filled to the brim. As you can see in the images below, it absolutely fails at the task, even after many tries and iterations of the prompt:


ChatGPT: “The wine reaches the very top of the glass, forming a dome-like effect due to surface tension.” It’s funny that after failing time and again, the model consistently states it has drawn a glass filled to the top when it clearly hasn’t.

Yet it works fine in reverse: the multimodal ChatGPT 4o model has no problem analyzing two wine glass images and identifying which one is full and which one is not, as you can see in my informal test below:

So what is going on here?
I’ll give a fuller answer at the bottom, but to start I need to point out that this entertaining demonstration isn’t new. It was widely discussed in 2024 (for example, in this Reddit thread), and I’m bringing it up again specifically because YouTuber Alex O’Connor recently uploaded a video (21 minutes) about the phenomenon, and I think his analysis, while fun and interesting, demonstrates some misunderstandings about what ChatGPT is doing when it generates images.
Alex’s angle: Empiricism
Let’s start with a brief overview of Alex’s video:
He frames this wine glass problem in terms of philosopher David Hume’s classic empiricism, which claims that the ideas in our heads (when you imagine a horse, say) originally come from our experience (sensory input as we act in the world). Hume (1777/1902) distinguishes impressions (sensory input like seeing the color red) from ideas (imagining or thinking about things in the head, so to speak, like picturing red when it’s not present). For Hume, having previously experienced an impression of red is what allows us to have an idea of red; basically, as an empiricist, he thinks you couldn’t imagine red if you had never seen red first.
What about ideas of things you’ve never explicitly seen, though, like a unicorn? Hume calls this a complex idea, one made up of constituent simple ideas. A unicorn (complex idea) is really just a horse (simple idea) plus a horn (simple idea). So, yes, we can combine simple ideas into new things, but the foundational claim of Hume’s empiricism is that we can’t have simple ideas without first having a simple impression of that thing: impressions are the building blocks of ideas, the raw material necessary to even have those ideas.
Of course, the horse part of Hume’s complex idea unicorn may itself be made up of even simpler ideas (coming from simple impressions like brown, say). That’s because, Hume says, our sensory impressions can be complex too, made up of simple impressions (the impression of a green apple is made up of impressions of size, shape, color, and so on).
Indeed, this is a problem psychologists and other cognitive scientists have worked on for decades. Anne Treisman, for example, proposed Feature Integration Theory as a model of how the brain processes objects: individual features are processed separately, early, unconsciously, using minimal attentional resources, and in parallel, whereas in a later stage requiring more attentional resources those features are bound together into an object perception. We consciously perceive the green apple, not greenness, roundness, shininess, size, and so on, but behind the scenes our brain is first registering the individual features before combining them at a later stage.
So we’ve got the claim of Hume’s empiricism, that simple impressions are fundamentally behind our simple ideas. How does Alex connect that in his video to ChatGPT’s failed attempts at a full wine glass (or for that matter, quarter-full wine glass, tenth-full wine glass, or pretty much any other variation beyond empty and half-full)?
Alex says ChatGPT’s idea of something like horse is generated from a bunch of previous impressions (its training data which includes tons of horse images labeled “horse”).

Now, Alex takes for granted here that horse serves as a simple impression (and thus simple idea) for the image-generating model he’s dealing with. Personally, I might prefer to say the model has encoded enough statistical regularities across many visual examples labeled “horse” to distinguish horse-like regularities from non-horse regularities (meanwhile, the nitty-gritty details of the attentional mechanisms that focus on parts of a larger scene are complicated here for both AI and humans).
Here Alex goes on a short tangent about Hume’s own exploration of one apparent counterexample to his empiricism. Hume’s theory says simple ideas require the corresponding simple impression first, but it seems (to most of us) like even if we had never seen color before, when presented with a bunch of shades in a spectrum like the image below, we could imagine the missing intermediate shade. In other words, it seems like we could generate a simple idea in our head (that shade of blue) without ever having experienced the corresponding simple impression (seeing that shade of blue in our experience).

This certainly looks like a counterexample to Hume’s theory. And yes, we generally assume we could fill in the missing color we’ve never experienced, but that’s hard to be confident about1.
Alex tries to informally test Hume’s empiricism principle by directly testing the missing shade problem in ChatGPT’s image model, an attempt that is entertaining but not in any way convincing. ChatGPT tells Alex it will generate a spectrum of color, remove a shade, then says it will use interpolation from the remaining shades to fill in the missing shade (which it can then compare to the true removed shade).
It appears to succeed at this task, except that we have no idea what is happening behind the scenes. ChatGPT may say it created a new interpolated shade purely from the remaining shades in this little demo, implying its output is in absolutely no way informed by the existence of the removed shade from the model’s training data. It may say it won’t solve this problem using numerical representations of shades, but we have no way to know if it kept to that promise.
Indeed, when prompted further by Alex, ChatGPT says what it did was “visually [blend]” the surrounding shades, somehow mixing them together. Alex takes this explanation at face value (personally I find “visually blend” way too vague to judge without more information) and he thinks that this means ChatGPT created the missing shade not by creating a color it had never seen (a simple idea without a corresponding simple impression) but by producing a complex idea (a unicorn-like combination of two existing shades it did have access to from simple impressions); interesting, but clearly not a counterexample to Hume’s empiricism!
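For contrast, here is what a purely numeric version of that interpolation would amount to: a minimal Python sketch, assuming the spectrum is just a list of RGB triples and the missing shade is reconstructed as the midpoint of its two neighbors. This is only an illustration of what “interpolation” means; it is not a claim about what ChatGPT actually did behind the scenes.

```python
# Toy illustration of "interpolating the missing shade" numerically.
# The RGB values are made up; this is not what ChatGPT reports doing.

def interpolate_missing_shade(left_rgb, right_rgb):
    """Return the channel-wise midpoint of two RGB colors."""
    return tuple(round((l + r) / 2) for l, r in zip(left_rgb, right_rgb))

# Hypothetical neighboring shades of blue on either side of the gap.
lighter_blue = (120, 170, 230)
darker_blue = (80, 120, 200)

missing_blue = interpolate_missing_shade(lighter_blue, darker_blue)
print(missing_blue)  # (100, 145, 215)
```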
And certainly the image model used by ChatGPT can put together simple ideas into complex ideas; Alex shows the example of ChatGPT drawing a horse in a swimming pool, even though it presumably — he assumes — hasn’t seen this in its training data. So why can’t ChatGPT draw a glass of wine filled to an unusual amount (not seen commonly in its training data)? Can’t it combine the ideas from its impressions (training data) into new combinations not in its training data (complex idea)?
Perhaps, Alex ends up speculating, it’s because empty, full, half-full, and quarter-full wine glasses are simple impressions to ChatGPT and not complex impressions like they are to us! Someone might argue that for us humans a full wine glass would be a complex idea made up of the simple idea wine (or perhaps even red and liquid) and the simple idea wine glass (or perhaps glass and receptacle and a particular shape) and the simple idea full. Even if we ourselves have never seen a wine glass filled to the brim (it’s not how wine is poured), we have seen receptacles filled with liquid, we have seen red liquid in a wine glass-shaped receptacle, and so on; in our minds, we can easily put those together, but ChatGPT’s image model doesn’t seem able to.
Alex speculates that this is because ChatGPT never sees in its training data the intermediate stages of a glass filling up, just completed images of a glass empty or a glass “filled” (i.e. the normal, half-way “filled” of a standard wine pour). Alex thinks the model is treating images of wine glasses as a whole as simple impressions and thus as simple ideas (that is, empty wine glass being one simple impression/idea, and half-full wine glass being a different simple impression/idea).
Combining those simple ideas should get us to new and unusual levels of fullness, right? But Alex argues that ChatGPT might try combining these two visuals in a way more akin to superimposing them, producing a glass that is half-full of pale, semi-transparent red liquid rather than one filled with fully red liquid up to the quarter mark.
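To make that “superimposing” worry concrete, here is a minimal NumPy sketch of what naively averaging an image of an empty glass with an image of a half-full glass would produce: the lower half ends up with pale, washed-out red rather than the glass being filled to a new level. The arrays here are crude stand-ins for real images.

```python
import numpy as np

# Crude stand-ins for two RGB images of the same glass (height x width x 3).
# In the "half-full" image, the bottom half of the frame is wine-red.
H, W = 200, 100
empty_glass = np.full((H, W, 3), 255.0)        # all white
half_full_glass = empty_glass.copy()
half_full_glass[H // 2:, :, :] = (128, 0, 32)  # opaque red lower half

# Naive "combination" by superimposing, i.e. pixel-wise averaging, the images.
superimposed = (empty_glass + half_full_glass) / 2

print(superimposed[0, 0])    # top of the frame: still white [255. 255. 255.]
print(superimposed[-1, 0])   # bottom: pale, semi-transparent-looking red
# -> roughly [191.5 127.5 143.5], not a quarter-full glass of opaque wine
```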

ChatGPT, Alex says, needs more than the ability to crudely combine visuals (i.e., combine simple ideas) to produce what we desire. Alex says it needs to “make sense of what it sees” and ChatGPT doesn’t do so; to him, the wine glass problem is more an issue of conceptual abstraction.
To pump our intuitions, he proposes the following thought experiment: if a human had never seen liquid or receptacles and we showed that human an image of an empty glass and an image of a half-full glass, could they imagine a quarter-full glass (or a totally full glass, etc.)? If so, humans may succeed at this because we “just naturally abstract concepts” in a way that “ChatGPT simply can’t”; maybe human minds, he says, just work differently.
But I find this kind of thought experiment less than compelling: what would it be like to have otherwise-normal human experience in a 3D world but no experience with liquid or receptacles? I suppose such a hypothetical person could have copious experience with stacking tiny red blocks (arbitrarily small) from a single thin layer into a taller and taller stack, in essence filling empty visual space even while somehow having no experience of receptacles like cups or glasses.
Would they be able to generalize that sort of experience to the two images of an empty wine glass and a half-full wine glass and imagine the red part being lower or higher? My intuition says yes, and I think that’s what Alex is getting at when he says conceptual abstraction. Even if we’ve never experienced liquid, if we’ve experienced other things that conceptually overlap with liquid’s properties (filling space, say), we could conceptualize (imagine? simulate?) intermediate states of liquid.
That’s basically where Alex ends his video, and as fun as it was to follow his attempt to (metaphorically) connect Hume’s empiricism about human ideas to ChatGPT’s outputs, I think he ends up working from some fundamental misunderstandings of what the image model is doing.
What is ChatGPT actually doing?
To start with, when Alex prompted ChatGPT (or when I prompted ChatGPT 4o in March for my examples up top), neither of us was actually using the model’s own multimodal capabilities. Those image generation capabilities are currently being held back (though they would likely be much better than what we get now). Instead, the system passes its own elaborated prompt to the diffusion model DALL-E 3 behind the scenes, as it even tells me when asked:

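For readers who want to see that handoff concretely, here is a minimal sketch of what “ChatGPT writes an elaborated prompt and passes it to DALL-E 3” amounts to, using the OpenAI Python SDK directly. The elaborated prompt text is my own invention; the exact rewriting ChatGPT performs internally is not public.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A hypothetical "elaborated" prompt of the kind ChatGPT writes on the
# user's behalf before handing the request off to the DALL-E 3 endpoint.
elaborated_prompt = (
    "A photorealistic red wine glass on a wooden table, filled completely "
    "to the brim, with the surface of the wine level with the rim."
)

response = client.images.generate(
    model="dall-e-3",
    prompt=elaborated_prompt,
    size="1024x1024",
    n=1,
)

print(response.data[0].url)             # URL of the generated image
print(response.data[0].revised_prompt)  # DALL-E 3 may rewrite the prompt yet again
```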
Now, DALL-E 3 is a pretty old model at this point. On the surface, DALL-E appears to understand the linguistic prompts it receives. For example, when asked for an elephant, it does indeed draw an elephant. So it has clearly absorbed the general visual properties of elephants (well, to be precise, of 2D projections of elephants!) and the model associates those properties with the linguistic label “elephant” which was so often connected to example elephant images in its training data.
To map a word (series of arbitrary symbols) to its referent (object in the world) is certainly impressive, but does that really mean that the model understands language? When we talk about “understanding” language, we are really lumping together a bunch of different linguistic abilities and functions, many of which exist on a spectrum, and so these questions can be much more fruitful if we ask about individual elements of language use. After all, as human children develop and gain language skill, they acquire some abilities sooner than others, and likewise master some language functions before others.
A great example of this can be found in a recent paper by Murphy et al. (2025). They looked specifically at compositionality, where understanding the meaning of a sentence or phrase depends on the way the elements of the sentence are combined2. For example, “the boy chases the girl” and “the girl chases the boy” have the exact same nouns and verbs (the referents are the same), but they mean very different things. The ordering of the words matters. Syntax (rule-based ordering) determines the semantics (meaning).
The researchers in Murphy et al. (2025) prompted DALL-E 2 and DALL-E 3 with a variety of sentences like “The horse sprays the elephant” or “The kitten is in a cup with a yellow ribbon” to see how well the models handled types of compositionality like reversible transitives, negation, and giving adjectives the right scope (e.g., making the ribbon yellow, not the kitten or the cup). As you can see in the examples below, both versions of DALL-E make errors that seem obvious to adult language users like us.


Indeed, the researchers found that DALL-E 2 performed worse at all such compositionality tasks than even 2-year-old children on a task testing the same kinds of things. The more advanced model, DALL-E 3, performed better on some forms of compositionality (scoring somewhere between 3- and 6-year-old children on things like reversible transitives and passives), but still performed quite poorly on other forms like negation and reversible prepositions.

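To make the flavor of those probes concrete, here is a small sketch of how compositionality prompt pairs of the kind described above could be assembled. The pairs below are my own constructions in the spirit of the paper (the first two sentences come from the examples quoted earlier), not the authors’ actual stimulus set.

```python
# Hypothetical compositionality probes in the spirit of Murphy et al. (2025);
# these are illustrative pairs, not the study's actual stimuli. Each pair uses
# (nearly) the same words, so any difference in the generated images has to
# come from sensitivity to how the words are combined.

probes = [
    {
        "type": "reversible transitive",
        "prompts": ("The horse sprays the elephant.",
                    "The elephant sprays the horse."),
        "check": "Which animal is doing the spraying in each image?",
    },
    {
        "type": "adjective scope",
        "prompts": ("The kitten is in a cup with a yellow ribbon.",
                    "The yellow kitten is in a cup with a ribbon."),
        "check": "Is the ribbon or the kitten the yellow thing?",
    },
    {
        "type": "negation",
        "prompts": ("A plate with no apples on it.",
                    "A plate with apples on it."),
        "check": "Are apples actually absent in the first image?",
    },
]

for probe in probes:
    print(f"{probe['type']}: {probe['prompts'][0]} vs. {probe['prompts'][1]}")
```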
Tests like this show that DALL-E 3’s “understanding” of language is partial and piecemeal. Yes, diffusion models like DALL-E can generate an image that matches the general properties of a referent like cat or cup, but they are not accurately representing those referents as 3D objects in a 3D world the way our own brains have learned to through embodied experience.
And the thing is, the interpretation a neural network (brain or otherwise) gives to visual information can be totally different when that visual input is connected causally to action within a 3D world. My favorite demonstration of this is Held & Hein’s (1963) classic kitten study. They took ten pairs of young kittens that had been born in total darkness and had no visual input until the experimental testing. In the test setup, each pair of kittens was brought to a lit room — their first and only visual experience at this point! — and the two kittens were hooked to opposite sides of a carousel device.
One kitten (“Active Kitten”) was attached in a way that allowed it to move through the environment, getting visual feedback as a causal result of its self-initiated movement. The other (“Passive Kitten”) was stuck in a basket that rotated equally, yoked to the first kitten’s action, such that both kittens got the same visual stimulation (indeed, no difference in visual input since their very birth) but for Active Kitten the visual input was connected meaningfully to its active exploration of the 3D world. They experienced this for 3 hours per day for weeks, with all other time spent in darkness; thus, the researchers held constant the visual experience of both kittens, with the only variable between them being whether the visual input was passive or connected to active exploration initiated by the body.

Afterward, the researchers tested the visually guided behavior of the kittens to see how their brains interpreted and used visual input. For example, a researcher would hold a kitten on their palm, head and legs dangling, and lower it toward a table (a normal kitten will extend its paws as the table gets close), or see how well it avoided a visual cliff (an apparent drop-off from a shallow side to a deep side, viewed from a center bridge), or whether it blinked at an approaching object (as mammals normally do, like when a ball is coming at your face).
The Active Kittens had normal responses to the tests, extending their paws as their bodies approached contact with the table surface below, sticking entirely to the shallow side of the visual cliff, and blinking at approaching objects. On the other hand, not a single Passive Kitten developed the paw extension response (i.e., they just bumped down onto the table), nor did they distinguish between the shallow and deep sides of the visual cliff, nor did they develop a blink response to approaching objects3.
In other words, the exact same visual input, with all the same regularities and statistical patterns, led the kittens’ neural networks (brains) to interpret that visual stimulation in incredibly different ways. Exploring a 3D world where visual input is causally connected directly to motoric actions and correlated with info from other senses is very different from a passive experience of visual input, which, in some sense, is no more than blobs of varying regularity rather than signals of objects being near or far, in front of or behind other objects, and so on. Understanding visual input in the way our brains understand it (an object being below or beside another object) is very different from what goes into a model like DALL-E.
DALL-E is a Passive Kitten, in other words. It’s no surprise that it isn’t interpreting wine glasses as 3D receptacles filled with various amounts of liquid in line with 3D liquid physics. To DALL-E, a half-full wine glass is not a wine glass at all but a visual blob with certain regularities that are associated with the label “glass of wine”.
So if ChatGPT generally fails the wine glass test specifically because of DALL-E’s limitations, that raises the question: might a different model, especially a more advanced multimodal model like GPT 4o, actually do better if its image generation capabilities were used directly? For a quick test (inspired by u/tollbearer on Reddit), I asked ChatGPT 4o to generate an ASCII image of a wine glass filled to the brim, and it succeeded immediately:

Indeed, it can generate all different levels of fullness upon request without issue, suggesting a multimodal model like GPT 4o might not even display the wine glass problem that started this whole discussion.
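For what it’s worth, the ASCII version of the test is easy to reproduce outside the ChatGPT interface; a minimal sketch with the OpenAI chat completions API might look like this (the model name and prompt wording are simply my choices). Note that no image model is involved here at all.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Reproducing the informal ASCII test: the multimodal text model "draws"
# with characters, with no DALL-E handoff involved.
reply = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user",
         "content": "Draw ASCII art of a wine glass filled to the very brim."},
    ],
)

print(reply.choices[0].message.content)
```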
And sure enough, OpenAI just unlocked native image generation within GPT 4o, so let’s test it:

Voila! On the first try!
Is it just that 4o’s attentional mechanisms improved? Or perhaps that 4o is less likely to be overfitted on its training data? Or is it the more advanced multimodal nature of 4o that indeed unlocks a better visual ‘understanding’, perhaps even a (visuo-linguistic) conceptual abstraction like Alex was talking about?
If it’s anything like the latter, then imagine what happens when we start training models on video (moving visuals, not just static images): will it develop (embed) a sort of intuitive ‘understanding’ of physics, something that current models still struggle with along with attention and memory?
Now imagine that, as post-training, we have the model play around in a physics simulator, having it systematically explore various actions in the simulation and get feedback about what happens: a sort of simulated Active Kitten rather than Passive Kitten. It seems very likely to me that this will lead to models that can better reason about the physical 3D world.
And of course that gets us to the next step many see for large language models: robotics. A model could control a robot, trying actions and receiving perceptual feedback that matches or doesn’t match its prediction of the outcome. This is akin to what we think the human brain is doing, which in neuroscience we call predictive coding. Predictive coding is basically Bayesian updating of a mental (neural) representation of our body and environment based on sensorimotor feedback (with the updating driven by the match or mismatch with the efference copy generated when an action signal is sent to the muscles; Rao & Ballard, 1999; Clark, 2013).
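As a toy illustration of that prediction-error-driven updating (a drastic simplification of the Rao & Ballard scheme, with made-up numbers): an internal estimate of some bodily or environmental variable gets nudged toward each new sensory observation in proportion to the prediction error, with the gain standing in for the Bayesian weighting of prior belief against new evidence.

```python
# Toy predictive-coding loop: a scalar internal estimate is updated by the
# prediction error, weighted by a gain (how much the sensory evidence is
# trusted). A cartoon of the idea, not Rao & Ballard's hierarchical model.

def update_estimate(estimate: float, observation: float, gain: float = 0.3) -> float:
    prediction_error = observation - estimate   # mismatch between prediction and input
    return estimate + gain * prediction_error   # nudge the belief toward the data

estimate = 0.0                                  # prior belief about the state
observations = [1.0, 1.1, 0.9, 1.05, 1.0]       # noisy sensorimotor feedback

for obs in observations:
    estimate = update_estimate(estimate, obs)
    print(f"observation={obs:.2f}  updated estimate={estimate:.3f}")
```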
LLMs clearly have a long way to go before they show anything like human understanding of the world, but I can see the outlines of what it might look like to incrementally get closer on many fronts, and robotics (like current vision-language-action [VLA] models) seems to be the next step.
Footnotes
1 My YouTube lecture on Language and Thought explains some research suggesting that our intuitive assumptions about perceiving and distinguishing blues may not be as straightforward as we think. See Winawer et al. (2007), for example (cf. Martinovic et al., 2020).
2 Interesting side note: a recent study published in the journal Science (described here) suggests that our close ape relatives bonobos combine their calls in ways similar to compositionality in human language. On YouTube, I have a video about the debate on language in animals.
3 Lest you wonder about the fate of the poor Passive Kittens: after testing, the Passive Kittens were allowed to explore an illuminated room on their own for 48 hours, and at that point demonstrated normal visually-guided behavior. The young brain is still remarkably plastic.
References
Clark, A. (2013). Whatever next? Predictive brains, situated agents, and the future of cognitive science. Behavioral and Brain Sciences, 36(3), 181-204. https://doi.org/10.1017/S0140525X12000477
Held, R., & Hein, A. (1963). Movement-produced stimulation in the development of visually guided behavior. Journal of Comparative and Physiological Psychology, 56(5), 872-876. https://doi.org/10.1037/h0040546 [PDF]
Hume, D. (1902). Enquiries concerning the human understanding, and concerning the principles of morals (L.A. Selby-Bigge, Ed.). Clarendon Press. (Original work published 1777)
Martinovic, J., Paramei, G. V., & MacInnes, W. J. (2020). Russian blues reveal the limits of language influencing colour discrimination. Cognition, 201, 104281. https://doi.org/10.1016/j.cognition.2020.104281
Murphy, E., de Villiers, J., & Morales, S. L. (2025). A comparative investigation of compositional syntax and semantics in DALL-E and young children. Social Sciences & Humanities Open, 11, 101332. https://doi.org/10.1016/j.ssaho.2025.101332
Rao, R. P. N., & Ballard, D. H. (1999). Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience, 2, 79-87. https://doi.org/10.1038/4580
Winawer, J., Witthoft, N., Frank, M. C., Wu, L., Wade, A. R., & Boroditsky, L. (2007). Russian blues reveal effects of language on color discrimination. Proceedings of the National Academy of Sciences, 104(19), 7780-7785. https://doi.org/10.1073/pnas.0701644104