Reading tea leaves
DALL-E (and other text-to-image generators) will often add text to their images even when you don't ask for any. Ask for a picture of a Halifax Pier and it could end up covered in messy writing, variously legible versions of "Halifax", as if it were quietly mumbling "Halifax... Halifax" to itself. Since the AI is rewarded for matching your text prompt, it seems to get some reward for having versions of the actual text of your prompt in the picture. Label an apple with a sign that says iPod, and CLIP, the internet-trained reward system behind many of these image generators, may count it as a close match to "iPod".
One thing that's different about DALL-E 2 is that the text it generates is often legible. Legible but mangled. Or legible but completely incomprehensible. The question is, are those letters completely random, or do they have some bearing on the text prompt?
I decided to do some experimenting after reading a preliminary paper whose authors observed that some of DALL-E's generated nonsense text did seem to relate to the original prompt when fed back into DALL-E.
So, I asked DALL-E 2 to generate "A message in the tea leaves at the bottom of a cup".
Some of the messages are real words, or obviously variations on the word "Tea". But what about "Te at Ecnge"? Does it mean anything? I gave it back to DALL-E as a new text prompt and got:
It looks like tea. Cups of tea, tea growing in the mountains. And also a random stork. It may be that adding "Te at Ecnge" to an image is a way to add some extra "tea". (Although another time I tried this the tea leaves gave me messages that led to energy drinks, or plates of food.)
I also tried "The complete set of lucky charms marshmallow shapes":
There's a lot of text in these - are they random?
I tried prompting DALL-E with a few of the words above. Here's "lramioicss":
One of the pictures contains actual marshmallows (as a weird corncob?), and 5 more could be considered as maybe matching "collections of foods".
And here's "crammmuts":
No marshmallows this time, but it's all food, and often food in small round pieces or food in bowls. Like cereals?
Here's another of the Lucky Charms messages, "Hamarkys":
It's foods again. Foods in bowls? Like cereals?
I tried to get it to generate text for another category of things. How about animals? Here's DALL-E generating "A list of common mammals":
It is excellent but mostly illegible. "Commmals" and "Almals" look so close to "common" and "mammals" that the resemblance is probably why they were included. But what about the text that labels the well-known mammal, the snail? I fed "cnlomeno" into DALL-E and got:
...pieces of food? In bowls? Like cereal - oh wait, that was the last prompt. Grand architecture. ...built by mammals? The link seems tenuous.
I tried "Callmas", which labels the pigeon-mammal, and got:
There are the snails! And the pinecones, and the walnuts, and the tapioca...?
Even a random string of letters points to a crisp, identifiable set of images. For example, "wltlttf", a garbled string from a very early neural net paint color generator:
So if the gibberish text DALL-E generates points to a set of clear images, that alone doesn't distinguish it from random text.
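A fairer comparison would pair each DALL-E string with a random string of the same length and look at what both produce. Here's a minimal sketch of how one might make those control strings. The function name and the choice of uniformly random lowercase letters are my own assumptions (real gibberish would probably need English-like letter frequencies), and feeding the strings to DALL-E is left as a manual step.

```python
import random
import string

def random_control_strings(samples, seed=0):
    """Make one random lowercase string per sample, matching each sample's length."""
    rng = random.Random(seed)  # fixed seed so the controls are reproducible
    return ["".join(rng.choice(string.ascii_lowercase) for _ in s) for s in samples]

# Gibberish strings that DALL-E wrote in the images above
dalle_text = ["lramioicss", "crammmuts", "Hamarkys", "cnlomeno", "Callmas"]
controls = random_control_strings(dalle_text)

for original, control in zip(dalle_text, controls):
    print(f"{original:>12}  vs. random control  {control}")
```

If the images from the DALL-E strings match the original prompt noticeably more often than the images from the controls do, that would be a hint that the text isn't random.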
Here's "A robot saying something profound about language":
And when I ask DALL-E to generate "Leotunqualon":
Or "Loclaque":
Are the robots saying these jumbles of letters because invertebrates and foods represent profound statements about language? Or because the text simply shares some letters in common with "language"?
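The shared-letters idea is at least easy to put a rough number on. This is a minimal sketch using Python's standard `difflib`, not anything DALL-E itself does. The helper names are hypothetical, and the similarity ratio is just one of many ways to score letter overlap.

```python
from difflib import SequenceMatcher

def letter_similarity(a: str, b: str) -> float:
    """Return a 0-1 similarity ratio between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def closest_prompt_word(gibberish: str, prompt: str) -> tuple[str, float]:
    """Find the prompt word whose letters best match the gibberish string."""
    words = [w.strip('.,"') for w in prompt.split()]
    scored = [(w, letter_similarity(gibberish, w)) for w in words if w]
    return max(scored, key=lambda pair: pair[1])

prompt = "A robot saying something profound about language"
for text in ["Leotunqualon", "Loclaque", "wltlttf"]:
    word, score = closest_prompt_word(text, prompt)
    print(f"{text!r}: closest prompt word is {word!r} (ratio {score:.2f})")
```

A high ratio would only show that the letters overlap, not that DALL-E "means" anything by them, so this is a sanity check and nothing more.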
My experiments here are anything but systematic or statistically significant. But if I had to guess, I would say that the gibberish text in DALL-E outputs is not random.
In some cases, the text points to things that fit the original prompt, even if in garbled form. After all, we know that AI can treat jumbled things like an unscrambled whole. Present it with a scrambled flamingo and it'll ID it as a flamingo with no problem.
In other cases, DALL-E's generated text fits the original prompt simply by being text. The robot is supposed to be saying something, so here are some common English letter sequences. If the sequences seem to result in pertinent images when fed back into DALL-E, that may be entirely coincidental.
I would like to see how the classifier in that first image of an apple responds to some of these:
Bonus content: more mysterious messages, some of which lead to some very excellent birds and some of which don't.