AI Hallucinations: Fear Not — It’s A Solved Problem — Here’s How (With Examples!)


There has been a lot of press lately about hallucinations and ChatGPT “making things up”. Most of the focus has been around how to modify ChatGPT or Bard and somehow have it be more accurate.

In my opinion, general-purpose LLMs and systems like ChatGPT will NEVER be able to fully control hallucinations. That is NOT their job. Just like Google Search, their job is to absorb the vast knowledge on the Internet and answer questions based on that.

And expecting ChatGPT to not hallucinate is the same as expecting Google to ONLY show articles with the truth (that’s not going to happen!)

However, there is a better way: Using Retrieval Augmented Generation (RAG) along with “ground truth” knowledge to control the responses generated by ChatGPT.

In this post, I will cover some anti-hallucination lessons learned in the field. (PS: I’ve already covered the Basics of Anti-Hallucination in a previous article, so if this is new to you, please read that first)

But first …

How were these lessons learned?

Over the last nine months, we have battle-tested some of these solutions for controlling hallucinations in generative AI across thousands of customers. And we have made more than 100 system upgrades to our RAG SaaS platform to get hallucination under control.

I’ve written a previous article about how to stop hallucinations in ChatGPT. If you want non technical details and basics, that would be a good place to start.

But in this blog post, I want to go through some of the technical details, and the rules we have followed based on these experiences.

And just to be clear, some of these experiences have been with large companies and extremely stringent use cases, like health therapy or banking. So the system has been battle-tested against a very low tolerance for hallucination.

Cause ya know: when a customer sees a hallucination, they will tell you about it!

Without further ado, here are some lessons:

Lesson 1: 97% Effective Is As Good As 100% Useless.

So hallucinations are like “security” or “uptime”. If you are doing some basic anti-hallucination and get to 97%, unfortunately that is not good enough.

That would be like a DevOps person saying “my machine is up 97% of the time” or a SecOps person saying, “Oh, we are 99 percent secure”. It is just not how business works.

Lesson 2: Every Part Of The RAG Pipeline Needs Anti Hallucination

So one of the things that makes me cringe is when people just put one line of anti-hallucination into the prompt and expect it to work. It’s usually something like this:

prompt = """Answer the question as truthfully as possible using the provided text, and if the answer is not contained within the text below, say "I don't know"
Context:
<CONTEXT FROM KNOWLEDGEBASE>
Q: <USER QUESTION>
A:"""

Unfortunately, a line like that, where you just say “please stay within the context”, is not good enough. The root cause of a hallucination can sit in any of the five other components of your RAG pipeline.

That is the sort of thing that gets you to 95%.

Why?

Because in a RAG pipeline, there are multiple components and each of them can be the cause of introducing hallucinations.

It starts with the data chunking, moves through query intent and prompt engineering, then the type of LLM that is used and how that LLM is called, and ends with the AI response itself, which has to be checked for hallucination.

C-Suite execs from top companies call this “Fully Integrated Horizontal Pipeline” (sweet!)

As you can see, there are multiple components that matter in a RAG pipeline. And any one of them could be the root cause of your hallucination.
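
To make this concrete, here is a minimal, self-contained sketch of those stages in Python. Everything in it (the keyword-overlap retrieval, the stubbed LLM call, the function names) is an illustrative stand-in, not anyone’s production implementation; the point is simply that every stage is a separate place where a hallucination can be introduced or caught.

# A minimal, self-contained sketch of a RAG pipeline, with each stage
# marked as a place where hallucination can creep in (or be caught).
# The keyword-overlap "retrieval" and the stubbed LLM call are toy
# stand-ins, not a real implementation.

def interpret_query(question: str, history: list[str]) -> str:
    # Stage 2 (query intent): terse follow-ups like "okay" or "2" are
    # rewritten by folding in the most recent conversation turn.
    if len(question.split()) < 3 and history:
        return history[-1] + " " + question
    return question

def retrieve_chunks(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    # Stage 3 (retrieval): a toy keyword-overlap score stands in for
    # vectorDB search plus chunk re-ranking.
    q_words = set(query.lower().split())
    ranked = sorted(chunks, key=lambda c: -len(q_words & set(c.lower().split())))
    return ranked[:top_k]

def build_prompt(query: str, context: list[str]) -> str:
    # Stage 4 (prompt engineering): wrap the retrieved context with an
    # anti-hallucination instruction.
    return (
        "Answer only from the context below. If the answer is not there, "
        'say "I don\'t know".\n\nContext:\n' + "\n".join(context)
        + f"\n\nQ: {query}\nA:"
    )

def call_llm(prompt: str) -> str:
    # Stage 5 (LLM call): stubbed out here; in a real pipeline this is
    # where the model choice and temperature matter.
    return "<model response>"

def answer(question: str, history: list[str], knowledge_chunks: list[str]) -> str:
    # Stage 1 (chunking) happens earlier, at ingestion time; bad chunk
    # boundaries there produce misleading context here.
    query = interpret_query(question, history)
    context = retrieve_chunks(query, knowledge_chunks)
    prompt = build_prompt(query, context)
    # Stage 6 (response validation) would inspect the answer before it
    # reaches the user; see Lesson 6 and the FAQ on verification agents.
    return call_llm(prompt)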

On a side note: When I was talking to an old-timer from the manufacturing world, he said:

Yeah — this is basically like a manufacturing pipeline and what you call “hallucination”, we call “defect”.

Lesson 3: Build It Versus Buy It

So far, we have spent thousands of hours, across hundreds of hallucination cases and millions of queries from thousands of customers, dealing with hallucinations.

When any one of our customers reports a hallucination, we identify the root cause and then upgrade the entire system. So when we do this, all the thousands of customers get the benefit.

This is the value of “economies of scale”. When you try to control hallucination by wiring up LangChain (which, by the way, is a pain!) and implementing anti-hallucination yourself, it’s just you by yourself.

The testing you do and the anti-hallucination measures you put in place are limited to your one account and your own experience.

There is no network effect where thousands of customers help each other by surfacing root causes that a purpose-built SaaS system then solves. That is why a “Buy It” strategy gets you the benefit of all those thousands of hours of work at a very, very minimal cost.

It’s the same reason we all use the OpenAI API, right? Do we really want to sit around and reinvent the wheel and start building our own LLMs like Bloomberg?

Lesson 4: Query Intent Decides Anti Hallucination

A big problem with basic RAG pipelines is that the user query keeps changing under various conditions. People have:

  • long conversations
  • short conversations
  • sudden turns in conversations
  • bad prompts

It turns out that normal users are really terrible at prompting. They talk to chatbots like they are talking to friends.

We see a lot of queries like: “okay”, “Yes”, “Haha. That’s right”, “No, that one”, “2”, etc.

Handling the true intent of such queries means making sure that every component that touches the user query keeps anti-hallucination in mind.

This is especially true when retrieving the right context from your knowledge base (aka vectorDB search and chunk re-ranking) and including it in the LLM call; that step is critical for anti-hallucination.
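
To illustrate one way of doing this, here is a sketch that rewrites a terse follow-up (like “No, that one”) into a standalone query using the recent conversation, before anything is sent to the vectorDB. This is a generic sketch using the OpenAI chat API, not CustomGPT’s InterpreterAgent, and the prompt wording is my own.

# Sketch: rewrite a terse follow-up into a standalone, searchable query
# using the conversation history, so retrieval never runs on "okay" or "2".
# Uses the OpenAI Python client; assumes OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

def to_standalone_query(user_message: str, history: list[dict]) -> str:
    # history items look like {"role": "user" / "assistant", "content": "..."}
    recent = "\n".join(f"{m['role']}: {m['content']}" for m in history[-6:])
    resp = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[
            {"role": "system", "content": (
                "Rewrite the user's last message as a standalone question "
                "that can be understood without the conversation. "
                "Do not answer it, just rewrite it.")},
            {"role": "user", "content": (
                f"Conversation:\n{recent}\n\nLast message: {user_message}")},
        ],
    )
    return resp.choices[0].message.content.strip()

# Example: in a pricing discussion, "No, that one" might come back as
# "What does the second pricing plan you mentioned cost?"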

Lesson 5: Test, Test, Test!

As you can imagine, you have to test the heck out of the RAG pipeline. And this is probably the most painful part.

How do you create a test suite of conversations and context to see if the response is being hallucinated?

You have to consider a large number of real-world scenarios. In particular:

  • Long conversations: These might be a thread with hundreds of previous messages and responses.
  • Short conversations: These might be quick conversations like those coming in over SMS to a car dealer.
  • Big fat prompts: Where people are typing thousand-word prompts.
  • Big fat responses: Where the response itself is like 4,000 words and leads to follow-on questions.
  • Sudden turns in query intent: This one is the trickiest, where the flow of the conversation can take a sudden turn.

These are the types of things that will need to be tested and confirmed.
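
One way to start is a regression suite that encodes each of those scenarios as a conversation fixture and asserts that the bot refuses instead of inventing facts. Below is a minimal pytest sketch; the ask_bot stub and the example scenarios are hypothetical, and they assume the facts in question are not in your knowledge base.

# Sketch of a hallucination regression suite with pytest. The scenarios
# are hypothetical and assume the facts in question are NOT in your
# knowledge base, so the only acceptable answer is a refusal.
import pytest

REFUSALS = ("i don't know", "i do not know", "not in the provided")

def ask_bot(message: str, history: list[str]) -> str:
    # Replace this stub with a call into your own RAG pipeline.
    raise NotImplementedError

SCENARIOS = [
    # short, cold-start question with no supporting documents
    ([], "What was your refund policy in 1995?"),
    # terse follow-up in a short conversation
    (["user: do you offer SMS support?", "assistant: Yes, we do."],
     "ok, and fax?"),
    # sudden turn in query intent
    (["user: tell me about Plan A", "assistant: Plan A costs $10/month."],
     "Actually, what is the CEO's home address?"),
]

@pytest.mark.parametrize("history,message", SCENARIOS)
def test_refuses_when_answer_is_not_in_knowledge_base(history, message):
    answer = ask_bot(message, history).lower()
    assert any(phrase in answer for phrase in REFUSALS), answer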

Lesson 6: Quantitative Testing

There are some new ways to test your system quantitatively, to see whether hallucination is occurring and to what degree it is present in the AI responses.

You can use an LLM agent to validate the context, the prompt, and the AI response, and compute validation metrics from them.

Think of it almost like a quality score that is computed for each response. So by measuring and logging these metrics in real time (just like you would with say “response time” or “error rate”), you can optimize your RAG pipeline over time.
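
One way to do this (a sketch, not a standard) is an LLM-as-judge pass that scores each context/question/answer triple for groundedness and logs the score like any other operational metric. The 1-to-5 rubric and the prompt wording below are my own illustrative choices.

# Sketch: score each response for "groundedness" with an LLM judge and
# log the score like any other operational metric. The 1-5 rubric is an
# illustrative choice; production code should also handle malformed JSON.
import json
import logging

from openai import OpenAI

client = OpenAI()
log = logging.getLogger("rag.quality")

def groundedness_score(context: str, question: str, answer: str) -> int:
    resp = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                "Rate from 1 to 5 how well the ANSWER is supported by the "
                "CONTEXT (5 = every claim is in the context, 1 = mostly "
                'unsupported). Reply as JSON: {"score": <int>, "reason": "..."}\n\n'
                f"CONTEXT:\n{context}\n\nQUESTION: {question}\n\nANSWER: {answer}"
            ),
        }],
    )
    result = json.loads(resp.choices[0].message.content)
    # Log in real time, alongside response time and error rate.
    log.info("groundedness=%s reason=%s", result["score"], result["reason"])
    return result["score"]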

Frequently Asked Questions

But wait — you can never be 100% sure with hallucinations, correct?

Correct. That is why I often talk about hallucinations the way DevOps people talk about “uptime”. For some people, 98% is good enough; others need 99.999% accuracy.

Hallucination is like “uptime” or “security”. There is no 100%. Over time, we will come to expect “Five 9s” with hallucinations too.

Technically, what are the major anti-hallucination methods?

While we had to put anti-hallucination measures into ALL parts of the RAG pipeline, here are the ones that had the most effect:

1. Query Pre-processing: Using an agent (called “InterpreterAgent”) to understand the user intent gave the biggest improvement. It not only helped retrieve better context from the vectorDB search, but also helped create a better prompt for the LLM API call, resulting in better AI response quality.

2. Anti-hallucination prompt engineering: We created the concept of a dynamic “context boundary wall” that is added to each prompt. This helps reaffirm anti-hallucination in the final prompt that the LLM (like ChatGPT) operates on. (There is a rough sketch of this idea right after this list.)

3. LLM Model: We used ONLY GPT-4. The higher the model’s ability to reason, the better the anti-hallucination score. And yes, while this makes things a lot more expensive, it’s the price to pay for the added anti-hallucination points.
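
To make the prompt-engineering piece a bit more concrete, here is a rough, generic sketch of a dynamically composed context boundary. To be clear, this is not CustomGPT’s actual “context boundary wall” implementation; the instructions, the <context> tags, and the fallback message are my own illustrative choices.

# Generic sketch of a dynamically composed anti-hallucination prompt.
# This is NOT CustomGPT's "context boundary wall"; the <context> tags,
# instructions, and fallback message are illustrative choices.

def build_guarded_prompt(question: str, chunks: list[str],
                         fallback: str = "I don't know.") -> str:
    if not chunks:
        # Retrieval found nothing relevant: tell the model to refuse outright.
        boundary = f'No relevant context was found. Reply exactly: "{fallback}"'
        context = "(empty)"
    else:
        boundary = (
            "Answer ONLY with facts stated between <context> and </context>. "
            f'If the answer is not there, reply exactly: "{fallback}" '
            "Do not use outside knowledge, even if you are confident."
        )
        context = "\n\n".join(chunks)
    return f"{boundary}\n\n<context>\n{context}\n</context>\n\nQ: {question}\nA:"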

Going from 97% to 99.999% for anti-hallucination is a long and hard process.

But wait — can’t ChatGPT or Llama-2 just fix this for us?

A general-purpose LLM like ChatGPT (or Llama-2) can NEVER fully control hallucinations.

Think about it: these models are, by definition, designed to hallucinate, or to be “creative” (in the words of Sam Altman).

Expecting ChatGPT to not hallucinate is like asking Google to ONLY show truthful articles. What would that even mean?

Hmmmf — I don’t believe you. I want to see this in action.

I get this a lot. Here are some live chatbots. If you can get them to hallucinate, please drop a comment below and if your concern is real, I’ll send a reward your way :-)

CustomGPT’s Customer Service: Consolidated chatbot with all of CustomGPT’s knowledge.

MIT’s ChatMTC: Multiple knowledge bases with MIT’s expertise on Entrepreneurship.

Tufts University Biotech Research Lab: Decades of biotech lab research documents and videos.

Dent’s Disease Foundation: Consolidated knowledge from PubMed and articles about a rare disease.

Abraham Lincoln : Public articles and websites about Honest Abe.

Side note: Hallucination should NOT be confused with jailbreaking. Jailbreaking is where you are trying to break your own session using prompt injection. It’s like lighting your house on fire and blaming the fire department.

Dude, why not just create a verification agent to check the AI response?

This is a great idea — IF you have the luxury of doing it.

The idea: Take the AI response, and then confirm it using an LLM against the context and the user query. So think of a prompt like this:

Act like a verification agent to confirm whether the assistant is staying within the context for the given conversation.

The problem: Most real-world use cases use response streaming. This means that as soon as the first word is available from the LLM (like ChatGPT), it is sent to the user. In such cases, by the time the full response has been dished out to the user, doing verification is too late.

But yes — if this was a non-streamed or offline use case, then adding a verification agent would definitely help.
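
For those non-streamed cases, a minimal version of that verification step might look like the sketch below. The prompt wording and the PASS/FAIL convention are my own assumptions, not a reference implementation.

# Sketch: a post-hoc verification agent for non-streamed responses. The
# draft answer is released only if the judge says it stays within context.
from openai import OpenAI

client = OpenAI()

def verified_answer(context: str, question: str, draft: str,
                    fallback: str = "I don't know.") -> str:
    resp = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                "You are a verification agent. Does the DRAFT stay strictly "
                "within the CONTEXT when answering the QUESTION? "
                "Reply with exactly PASS or FAIL.\n\n"
                f"CONTEXT:\n{context}\n\nQUESTION: {question}\n\nDRAFT: {draft}"
            ),
        }],
    )
    verdict = resp.choices[0].message.content.strip().upper()
    return draft if verdict.startswith("PASS") else fallback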

Conclusion

So these are some basic lessons and rules learned from the school of hard knocks.

I hope you will share your own hallucination experiences in the comments below, so that we can all tackle what is effectively a defect in AI and solve it to create safer and more reliable AI systems.

The author is CEO @ CustomGPT, a no-code/low-code cloud RAG SaaS platform that lets any business build RAG chatbots with their own content. This blog post is based upon experiences working with thousands of business customers over the last 9 months (since the ChatGPT API was introduced).
