It's a confusing jumble of Frodo Bagginses, all cloaked, some in pizza delivery caps, many bearing pizzas or boxes, striding out from the ominous openings of many mines

“Frodo Baggins delivering a pizza through the mines of Moria”

Remember my attempts to get CLIP+BigGAN to generate candy hearts? Here’s what an alternative method, CLIP+FFT, does with the prompt “a candy heart with a message”.

There are thousands of pastel-colored hearts, each illegible, pressed up against each other into a solid cavern that recedes into misty distance.

Rather than a single obsessively-scribbled-upon heart, we now have a vast universe of candy hearts, jostling against one another with their messages screaming incomprehensible love at the viewer.

As before, CLIP is the judge, telling another algorithm whether this collection of pixels looks more like “a candy heart with a message” than that collection of pixels. But this time, the algorithm presenting the images to CLIP isn’t steering through BigGAN, which was trained on a set of human photography. Instead, it’s doing something a lot more like the classic Deep Dream images, changing parts of the image to maximize how much it looks like dogs, or whatever the prompt is supposed to be.

First panel: Voice: "How do you like this painting?" Painting is of a single pine tree by a lake with a mountain. Robot box: "Meh." Second panel: Voice: "How about now?" The mountain now has a dog face. Robot box: "I'm intrigued." Third panel: Voice: "How about now?" Sky, lake, and pine tree all have dog faces. Robot box: "THIS IS THE BEST PAINTING EVER!"

(this cartoon is from my book You Look Like a Thing and I Love You: How AI Works and Why It’s Making the World a Weirder Place - out in paperback on March 23, 2021)

And since CLIP was trained on text and images that appeared together on the internet, it can be the judge of just about anything.

Here’s “A stegosaurus flying a spaceship among lasers”.

You would have to know they're stegosauruses, but they're definitely spiky, and the air is filled pretty solidly with lasers.

And it knows how to judge pop culture figures and even the look of TV shows. Here’s “Godzilla and Paul Hollywood in the Bakeoff tent”

Paul Hollywood from the Great British Bakeoff is unmistakeable, and repeated several times. Godzilla is less distinct, but vaguely godzilla-shaped and godzilla-textured

Note that it not only correctly has the tent as white and pointy-roofed, it even is trying to do the Union Jack bunting. And it’s really sensitive to the prompt, so if you type “Godzilla and Paul Hollywood taking a selfie in the Bakeoff tent” instead, Paul Hollywood breaks out into a grin and cameras appear. (it seems to be less sure what a grinning Godzilla looks like)

Now Godzilla and Paul Hollywood are seen mostly from torso up. Paul Hollywood's face is in several pieces and godilla doesn't really have a face, more of a hulking presence. Also everything has a dewy glow like a 1990s mall photo.

Here’s “Mr Darcy emerges from a lake in a white shirt while his horse looks on”

It could plausibly be Colin Firth from 1995 Pride and Prejudice. The lake and the countryside are there, repeated and tiled upon one another several times. The horse requires much more imagination.

It does less well to my mind when there are fewer clues about what the background should look like. Tell it just to do “Tyrannosaurus Rex” and things get very abstract and smeary, and it even resorts to trying to write “tyrannosaurus” everywhere.

There are some brown shapes that might be tyrannosaurus heads, definitely far too many baleful eyes, and maybe some foliage. Not sure what the swaths of lumpy red are. There's illegible writing everywhere.

“A tyrannosaurus wearing a crinoline hoop skirt on a fashion show runway” looks a bit more realistic. Or maybe that’s just my preference. The trees in the background are a nice touch.

The tyrannosaurus isn't identifiably *in* the dress per se, but there are many tiered floofy white dresses, and strong hints of tyrannosaurus legs and tails, and nice Cretaceous trees in the background. Crowds line the runway.

Here’s a zoomed-in view of one of the best ones: “a library made of bones and skeletons; a library in the style of catacombs”. It doesn’t seem to resort to word graffiti if the prompt suggests a finely textured background, maybe. (This may have been from a newer version of the CLIP+FFT notebook, so that could explain some of the improved quality.)

Image is an intricate tangle of bone-covered bookcases and piles of skulls.

You do need a bit of imagination maybe to figure out what the original prompts were, so I wouldn’t exactly say that CLIP+FFT as successful as making images to order as the original CLIP+DALL-E (still not released publicly). But having a neural net that will attempt whatever I ask for (and not turn every human into a horror of many-eyed blobs) is still pretty fun.

“The daleks have filled the tardis with llamas and David Tennant is annoyed”

David tennant is clearly visible (at least the left half of his face) frowning. Daleks and tardises seem to merge into segmented blue cones. The llamas are kind of dalek-shaped furry blobs. There are EXCESSIVE numbers of lasers for some reason.

Read more about CLIP+FFT (built by Vadim Epstein) and try it yourself for free with the colab notebook!

I made a bonus gallery of various characters delivering pizza. Spider-man’s not the only one who’s recognizable with a fresh pie in hand. To see the gallery, and get other bonus content, become an AI Weirdness supporter! Or become a free subscriber to get new AI Weirdness posts in your inbox.

Subscribe now

Subscribe now