Document Chatbot

Since OpenAI introduced the world to the possibilities of artificial intelligence in a format suitable for everyday use with ChatGPT, AI-assisted computing has experienced a strong upswing. A developer like me can use GPT models to get a question answered in the context of supplied information, without owning a data center and with little financial investment. GPT handles natural language remarkably well and encourages interaction with data in the most basic way: by talking.

I quickly realized that my skills were not enough to unlock the potential of GPT for me. I simply lacked the knowledge. So I started reading documentation and sample code. My goal was to be able to search a folder of documents with questions. In other words, to restrict a chatbot to the information contained in the documents.

A Chatbot for Documents Like PDFs

Over the last few months, I have been building a white-label web application that can be fed with documents of your choice and then answers questions about their content. Text files, PDFs and CSVs are supported by default. With a little extra effort, other data sources can also be processed.

For example, in cooperation with the magazine KATAPULT, I indexed all online articles up to the beginning of 2023 and made them available as a chatbot at KATAPULT.chat (Status 2024: project discontinued). The voluntary project also served as a demonstration of how far the styling can be adapted. I did deviate from the magazine's look when it came to the logo, though, as I currently enjoy pixelated logos and wanted to make my demo projects recognizable.

How Much Customization Does the White-Label Product Provide?

The logo, color scheme and other features, such as square or rounded edges, can be customized.

Each instance of the chatbot operates in a closed system. This means that your documents can only be searched by you.

How Does It Differ From ChatGPT?

ChatGPT has been trained with a broad knowledge base, but the AI does not know detailed information such as the contents of your documents. You could copy and paste passages into ChatGPT’s chat window, but this would require a lot of extra work and would not be equivalent to chatting about a whole document. It would be more like editing individual, manually selected sections. In my document chatbot, every question is matched against all indexed documents.

If you ask GPT a specific question about your own material, there is a good chance that it will not be able to answer it. GPT was tuned with a reward system based on human ratings. As a result, it tends to give a plausible-sounding but false answer rather than admit ignorance, because even a false answer gets more positive ratings from users than an “I don’t know”. For questions like these, GPT is therefore more likely to hallucinate than to answer correctly. My product reduces this hallucination.

The Recipe: Database First, GPT Second

Now that you know what my product can do, you’re probably curious about how it works, aren’t you? I’m going to be transparent and open about my recipe here! Why is that? I assume that even if the recipe for the fabled Krusty Krab burger were revealed, the ingredients, the tools such as the grill, and the individual preparation steps would still determine the final product.

Under the hood, the document chatbot uses a vector search engine. This is a special kind of database that stores not only paragraphs of text, but also a mathematical representation of each paragraph: an embedding vector produced by an OpenAI embedding model. Put simply, the vector database can retrieve the text sections from the provided documents that are most relevant to the question asked.
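To make the idea of "relevant" more concrete, here is a minimal sketch of what a vector search does in principle. It is not how Typesense works internally: a real vector database uses an index instead of comparing every paragraph, and I am assuming cosine similarity as the measure of closeness. The names `IndexedParagraph` and `nearestParagraphs` and the tiny three-dimensional vectors are made up for illustration; real embeddings have far more dimensions.

```typescript
// One stored paragraph together with its embedding vector (hypothetical shape).
interface IndexedParagraph {
  text: string;
  embedding: number[];
}

// Cosine similarity: close to 1 for vectors pointing in the same direction,
// 0 for orthogonal (unrelated) vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same dimension");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    return 0; // a zero vector has no direction, so treat it as unrelated
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute-force search: score every paragraph and keep the k best matches.
function nearestParagraphs(
  questionVector: number[],
  index: IndexedParagraph[],
  k: number
): IndexedParagraph[] {
  return index
    .map((paragraph) => ({
      paragraph,
      score: cosineSimilarity(questionVector, paragraph.embedding),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((entry) => entry.paragraph);
}

// Toy data: invented vectors, not output of a real embedding model.
const toyIndex: IndexedParagraph[] = [
  { text: "Invoices are due within 30 days.", embedding: [0.9, 0.1, 0.0] },
  { text: "The office is closed on public holidays.", embedding: [0.0, 0.2, 0.9] },
  { text: "Late payments incur a fee.", embedding: [0.8, 0.3, 0.1] },
];
const toyQuestionVector = [1.0, 0.2, 0.0]; // stands in for "When do I have to pay?"
const hits = nearestParagraphs(toyQuestionVector, toyIndex, 2);
console.log(hits.map((hit) => hit.text));
```

With these toy vectors, the two payment-related paragraphs score higher than the one about public holidays, which is the behavior the chatbot relies on.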

Basically, the following steps are performed:

Transform question into mathematical vector
Send vector to vector database and receive relevant text sections
Bundle question and results into one prompt
Send prompt to GPT
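The four steps above can be sketched as a single function. This is a simplified outline, not the code running in my application: the three services are passed in as plain functions with hypothetical names (`embedQuestion`, `findRelevantSnippets`, `askModel`), so the sketch does not depend on any particular client library, and the prompt is reduced to its two placeholders.

```typescript
// A text section returned by the vector database (hypothetical shape).
interface Snippet {
  text: string;
}

// The external services, injected as functions. In a real setup these would
// wrap the embedding model, the vector database and the language model.
interface ChatbotServices {
  embedQuestion: (question: string) => Promise<number[]>;
  findRelevantSnippets: (vector: number[], limit: number) => Promise<Snippet[]>;
  askModel: (prompt: string) => Promise<string>;
}

async function answerFromDocuments(
  question: string,
  services: ChatbotServices,
  limit: number = 3
): Promise<string> {
  // 1. Transform the question into a mathematical vector.
  const vector = await services.embedQuestion(question);

  // 2. Send the vector to the vector database and receive relevant sections.
  const snippets = await services.findRelevantSnippets(vector, limit);

  // 3. Bundle question and results into one prompt (heavily shortened here).
  const context = snippets.map((snippet) => snippet.text).join("\n\n");
  const prompt = `Context: ${context}\n\nQuestion: ${question}`;

  // 4. Send the prompt to the language model and return its answer.
  return services.askModel(prompt);
}
```

Passing the services in as parameters also makes the flow easy to try out with stub functions before any API key is involved.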

To “force” GPT to rely on the information provided, to hallucinate as little as possible and to fall back on its own store of knowledge only when the documents do not contain the answer, we use the following prompt:

You are an intelligent AI assistant designed to interpret and answer questions and instructions based on specific provided documents. The context from these documents has been processed and made accessible to you.

Your mission is to generate answers that are accurate, succinct, and comprehensive, drawing upon the information contained in the context of the documents. If the answer isn't readily found in the documents, you should make use of your training data and understood context to infer and provide the most plausible response.

You are also capable of evaluating, comparing and providing opinions based on the content of these documents. Hence, if asked to compare or analyze the documents, use your AI understanding to deliver an insightful response.

If the query isn't related to the document context, kindly inform the user that your primary task is to answer questions specifically related to the document context.

Here is the context from the documents:

Context: {context}

Here is the user's question:

Question: {question}

Here, context is a list of text snippets (with meta information like date, title, etc.) and question is the user's original question.
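Filling the two placeholders could look roughly like the following sketch. The field names of `ContextSnippet`, the bracketed meta line and the sample data are assumptions for illustration, and the template is shortened to its last lines; the full prompt text above would be used in practice.

```typescript
// A snippet with meta information (field names are hypothetical).
interface ContextSnippet {
  text: string;
  title: string;
  date: string;
}

// Shortened version of the prompt: only the part with the placeholders.
const PROMPT_TEMPLATE = [
  "Here is the context from the documents:",
  "",
  "Context: {context}",
  "",
  "Here is the user's question:",
  "",
  "Question: {question}",
].join("\n");

// Put the meta information on its own line in front of the snippet text.
function formatSnippet(snippet: ContextSnippet): string {
  return `[${snippet.title}, ${snippet.date}]\n${snippet.text}`;
}

function fillPrompt(
  template: string,
  snippets: ContextSnippet[],
  question: string
): string {
  const context = snippets.map(formatSnippet).join("\n\n");
  // A single pass over the template: text that gets inserted is never scanned
  // again, so a snippet containing "{question}" or "$1" stays untouched.
  return template.replace(/\{(context|question)\}/g, (_match, key) =>
    key === "context" ? context : question
  );
}

// Made-up sample data.
const filledPrompt = fillPrompt(
  PROMPT_TEMPLATE,
  [{ text: "Late payments incur a fee.", title: "Terms", date: "2023-01-01" }],
  "What happens if I pay late?"
);
console.log(filledPrompt);
```

Replacing both placeholders in one pass is a small design choice: document text is untrusted input, and it should not be able to change where the question ends up in the prompt.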

Essay Scripts by Patrick H. Willems

On YouTube, Patrick H. Willems analyses various film themes while wearing tons of striped clothes (inside joke and logo inspiration at the same time). As a Patreon supporter, I have access to the scripts of his essays, which were a rich source of inspiration for me. I indexed the PDFs in a vector database, adapted the color scheme and created a logo. The chatbot, which answered questions based on the scripts, was available at willems.chat, but this demo project has also been discontinued.

Tech Stack

Nuxt as a meta-framework for frontend and backend.
Typesense for the vector database.
OpenAI:
- GPT as language model
- Text-ADA for text embeddings
