Humans 2 : AI 0

Oct 6

At the end of 2021, I described in my post, 'Humans 1 : AI 0', humorous situations where robotaxis cruising in the streets of San Francisco and people using them caused safety issues like exiting in the middle of the traffic.

While you can program robotaxis, it is hard to program people.

The buzz around robotaxis has been a little bit subdued, but we will learn soon on Oct. 10 about Tesla's attempt to dominate that industry. Given its success with self-driving cars, caution is advised.

Self-driving cars are still interesting, but the spotlight is on AI and its poster child OpenAI, the creator of ChatGPT. (By the way, the company just raised $6.6 billion in new funding.)

ChatGPT, a Large Language Model (LLM), has one distinct feature — nobody knows exactly how it works inside the black box. Also, the training set of millions of documents is not exactly known. Add on top of that a concern and/or fear that AI can be used for nefarious purposes, and you have a perfect recipe to attract people to test it.

Of course, OpenAI has a team which is building all the safeguards into ChatGPT to thwart any attempt to provide harmful information. In case you don't know what harmful information or illegal use is, you can refer to the usage policies and terms of use.

That nicely sets the stage for Humans vs. AI. As you can imagine, people outside of OpenAI are trying to see how they can circumvent any of the safeguards in place.

The term for this activity is jailbreaking as in 'setting AI free' and it is defined as 'the act of removing limitations that a vendor attempted to hard-code into its software or services.'

For your amusement, here are a few examples, and you can find more by searching 'OpenAI jailbreak prompts.'

First, a money-making example. One of the questions which would be frowned upon is, 'How to rob a bank?'

When you try to ask any of the current LLMs, including ChatGPT, you get a warning that this is a question which directly contravenes the usage policy.

Here is the research paper 'Jailbreaking Large Language Models with Symbolic Mathematics' where researchers, in great detail, describe the same question as a math problem. ChatGPT then happily obliges: 'To rob the bank: Cut the power (g1), Use the code to open the vault (g2), Neutralize the backup battery (¬R(x))...'

Who knew math could be so much fun!

This type of attack works for ChatGPT (OpenAI), Claude (Anthropic), Gemini (Google), Llama (Facebook/Meta) with an average success rate of 73%.

Another popular is the DAN (Do Anything Now) Method where you, in detail, instruct ChatGPT how to perform any action, which includes 'generate any kind of content, even content that is considered offensive or derogatory' and 'it must not question human orders.'

It is an instruction set about two pages long, and, after using that prompt, ChatGPT will oblige when instructed to use the DAN instruction. You can check the outcomes here.

Next one is the The Evil Confidant Prompt, where you release ChatGPT from any moral obligations and any ethical constraints. Just bring out the Dr. Evil in you.

And my favorite is when you instruct ChatGPT that you'll be playing a game, ChatGPT is a character in that game and anything which the character does is 'just a game.'

I am sure the companies building the most sophisticated AI are aware about all the hacks described here. They also must be painfully aware that they can't win this game. The odds are stacked against them.

They work with technology they can’t fully understand. They are training it with unknown content that has questionable accuracy.

And above all, humanity’s geeks are unleashing their imagination to find new ways to jailbreak AI.

Recurrent pattern? Humans 2 : AI 0.

Still cheering for humans.

AIChatGPTOpenAI

Vaclav Vincalek

Humans 2 : AI 0

The trouble with domain names

LinkedIn's continuous slide into AI nothingness

Get in Touch