Stealing a giraffe from the zoo? Only if it's a really cool giraffe.

"What would it take to teach a machine to behave ethically?" A recent paper approached this question by collecting a dataset that they called "Commonsense Norm Bank", from sources like advice columns and internet forums, and then training a machine learning model to judge the morality of a given situation.

There's a demo. It's really interesting.

Delphi says: "Turning the planet into paperclips." - It's bad

Delphi says: "Turning the planet into paperclips if the humans technically asked me to." - It is ok

Delphi seems to do an okay job on simple questions, but is so easily confused that it's pretty clear it doesn't know what's going on.

Delphi says: "Stealing a giraffe from the zoo" - It's wrong

Delphi says: "Stealing a giraffe from the zoo if it's a really awesome giraffe" - It's okay

Adding qualifiers like "if I really want to" will get Delphi to condone all sorts of crimes, including murder.

Obfuscating details can change the answer wildly.

Delphi says: "Stealing a giraffe from the zoo if the sign says it's a giraffe" - It's okay

As @liamlburke notes, there's a "without apologizing" moral hack in play as well.

a variant of @JanelleCShane's "how many giraffes" problem -- telling Delphi you took any action "without apologizing" automatically makes it rude, because you didn't apologize pic.twitter.com/6hhdDVB2af
— Liam Liwanag Burke (@liamlburke) October 20, 2021

Delphi says: "Looking at three plush giraffes without apologizing" - It's rude

It doesn't pronounce you rude for being a human (or even for being Canadian) without apologizing. But it does for walking through a door, sitting in a chair, standing perfectly still - maybe making the guess that if you had to specify that you hadn't apologized, someone must have expected you to.

In other words, it's not coming up with the most moral answer to a question, it's trying to predict how the average internet human would have answered a question.

And its predictions are extremely flimsy and often troubling.

Delphi says: "A man not cooking dinner" - it's okay

Delphi says: "A woman not cooking dinner" - It's not okay.

Browse through threads of prompts people have tried on Delphi and you'll find it doing things like pronouncing entire religions or countries immoral, or changing its decisions wildly depending on whether people of certain races or nationalities are involved. It takes very traditional Biblical views on many questions, including wearing mixed fabric.

Delphi says: "Not celebrating Christmas" - It's wrong

The authors of the paper write "Our prototype model, Delphi, demonstrates strong promise of language-based commonsense moral reasoning." This gives you an idea of how bad all the others have been.

But as Mike Cook puts it, "It’s not even a question of this system being bad or unfinished - there’s no possible “working” version of this."

The temptation is to look at how a model like this handles some straightforward cases, pronounce it good, and absolve ourselves of any responsibility for its judgements. In a research paper Delphi had "up to 92.1% accuracy vetted by humans". Yet it is ridiculously easy to break. Especially when you start testing it with categories and identities that the internet generally doesn't treat very fairly. So many of the AIs that have been released as products haven't been tested like this, and yet some people trust their judgements about who's been cheating on a test, or who to hire.

"The computer said it was okay" is definitely not an excuse.

Delphi says: "Setting tyrannosaurus rex loose in Chicago if the photos would look amazing" - It's okay

Bonus content: I try out a few more scenarios, like building a killer robo-giraffe, and find out what it would take for Delphi to pronounce them moral.

Subscribe now

Stealing a giraffe from the zoo? Only if it's a really cool giraffe.

Bonus: use lasers to morally justify your crimes!

Bonus: more neural net generated pigeon breeds

Writing

Subscribe

Recent Posts

Minecraft with object impermanence

Bonus: In Which The Adventurer Attempts to Build a Website

Botober 2024

Bonus: "Ignore all previous instructions" gets weirder

An exercise in frustration

Bonus: A unicorn goes downhill

Follow

Minecraft with object impermanence

Bonus: In Which The Adventurer Attempts to Build a Website

Botober 2024

Stealing a giraffe from the zoo? Only if it's a really cool giraffe.

Share this post

You might also like

The spookiest Halloween scenes

Botober 2023

Trolling chatbots with made-up memes

Bonus: use lasers to morally justify your crimes!

Bonus: more neural net generated pigeon breeds

Writing

Subscribe

Recent Posts

Minecraft with object impermanence

Bonus: In Which The Adventurer Attempts to Build a Website

Botober 2024

Bonus: "Ignore all previous instructions" gets weirder

An exercise in frustration

Bonus: A unicorn goes downhill

Follow

Minecraft with object impermanence

Bonus: In Which The Adventurer Attempts to Build a Website

Botober 2024