
ChatGPT & LLMs as Knowledgebases

Large Language Models (LLMs) are in ascendance today. Only three years ago, GPT-2 was released to the world and everyone was quite awed by its capabilities. Its ability to auto-summarise and answer questions after fine-tuning was mind-blowing compared to previous attempts, but it was quickly eclipsed by the release of GPT-3 (2020) and now the ChatGPT interface (2022).

Let me preface the rest of this post with a notice:

All technology presents solutions, but it is of critical importance to understand its limitations and the issues it may cause.

I have some major concerns about the current hype around LLMs as knowledgebases, mostly centered on people who are impressed with the technology but have a limited understanding of the potential issues with it. To be clear: I’m not an AI expert. A big part of my job is moving ML/AI systems from the research/development phase into production, so I’m very aware of their limitations but a bit rusty on the core math that drives them. So this post is about discussing those limitations, based on my experience with real-world use.

We don’t need Google anymore!

The left image shows ChatGPT suggesting a 4:21 min/km is a 6:52 min/mile. The right image is a Google search that gives some page suggestions for running pace calculators.

ChatGPT is good at giving summarised answers that are precise, but not accurate. In this case, ChatGPT is flat out wrong. A 4:21 min/km is a 7 min/mile.
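For what it’s worth, the conversion is simple arithmetic. Here’s a minimal sketch in Python (the function name and rounding choice are mine, not from any tool mentioned above) showing where 7:00 min/mile comes from:

```python
def min_per_km_to_min_per_mile(minutes: int, seconds: int) -> str:
    """Convert a running pace from min/km to min/mile."""
    KM_PER_MILE = 1.609344
    seconds_per_km = minutes * 60 + seconds
    # A mile is longer than a kilometre, so the per-mile pace number is bigger.
    seconds_per_mile = seconds_per_km * KM_PER_MILE
    m, s = divmod(round(seconds_per_mile), 60)
    return f"{m}:{s:02d} min/mile"

print(min_per_km_to_min_per_mile(4, 21))  # -> 7:00 min/mile, not 6:52
```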

This is a common theme in ChatGPT answers: they sound very authoritative but can be wrong, and the interface provides no way to check their accuracy. In fact, when users ask follow-up questions to get it to break down how it arrived at an answer, the results are pretty evenly divided between a whole bunch of wrong calculations, or individual calculations that are right paired with a steadfast refusal to correct the incorrect answer. This is a major indicator that LLMs are stochastic parrots, and it shows that they are doing NLP (Natural Language Processing) rather than NLU (Natural Language Understanding). But the people trying to convince you that LLMs are better than Google are asking you to take those answers at face value, which is dangerous.

A huge part of this danger comes from the fact that you can’t follow the sources, especially with open-ended questions like “how were gothic cathedrals built”. Sure, Google can also surface results that are misinformation, but high schools teach media literacy classes where you’re taught to look at sources during your research and to understand legitimacy, point of view, and manipulation. This is a key tactic in fighting modern disinformation campaigns. Trusting a source that says “it came to me in a dream”, which is effectively what you get from ChatGPT, is an easy path to falling for misinformation.

A reference footnote from a book that says “This was once revealed to me in a dream.”

Nicolas Berdyaev, From The Divine and the Human, English Translation 1949, p. 6.

I’m not saying Google is perfect, far from it. But it at least provides some tools to help you understand whether a source can be considered trustworthy, even if it should be doing more to help users understand its limitations.

Future GPT versions will be smarter!

That’s true: future GPT versions are planned to have orders of magnitude more parameters and to produce more natural output. That doesn’t mean they’ll produce correct output. The problem in this case is less about the model itself and more about the underlying training data. ML/AI is a domain where accuracy is derived from two components:

  • A solid model.
  • Good training data.

Without a solid model, your AI/ML won’t work at all. More complex models and specifically designed models can often give better (read: more accurate) results; we’ve seen that at my work as we’ve moved to more advanced systems for our flora recognition models. Different types of models can yield different benefits too: in some we’ve improved edge detection, in others we’ve improved the ability to correctly identify plants down to the species level.

One of the important phrases above is “specifically designed models”. AI/ML can be a good solution for specific tasks, but to achieve high accuracy it also needs to be designed (in both model and training data) for a specific domain. That’s why detecting specific weeds against specific crops from a specific viewpoint is a much better fit for AI/ML than autopiloting a car safely under reasonable conditions [1, 2, 3, 4]. Trying to build a single model that does everything, such as answering any question you have about any topic, is not an appropriate or smart use case for modern AI/ML. Even some big companies are starting to recognise the limitations of general-purpose AI/ML.

The solution to some of the accuracy problems in models isn’t bigger or more complex systems; it’s open algorithms and models, where we can find out how results were reached and understand them well enough to improve them. In the case of AI/ML that we want to use for providing accurate information, this means users need to understand where an answer is coming from, which isn’t something the technology behind LLMs can offer. A related issue is that applications of LLMs don’t provide any feedback about the error a model is operating under, which is a very important metric when you’re trying to get correct information out of them.
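To make that last point concrete, here’s a toy sketch (entirely my own illustration, not anything ChatGPT or any other LLM product actually exposes) of what surfacing uncertainty can look like: the model reports its confidence alongside the answer and abstains when it isn’t sure.

```python
import numpy as np

def predict_with_confidence(probs: np.ndarray, labels: list[str],
                            threshold: float = 0.8):
    """Return the top label and its probability, abstaining (None) when the
    model isn't confident enough. A toy illustration of reporting uncertainty
    alongside an answer instead of just the answer."""
    top = int(np.argmax(probs))
    confidence = float(probs[top])
    if confidence < threshold:
        return None, confidence  # effectively: "I'm not sure"
    return labels[top], confidence

# Made-up softmax output from some classifier.
label, conf = predict_with_confidence(np.array([0.55, 0.30, 0.15]),
                                      ["oak", "maple", "birch"])
print(label, conf)  # -> None 0.55: the model declines to answer
```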

Well, it just needs more training data!

Unfortunately, “more training data” won’t fix this problem. GPT-3 models are trained on roughly 500 billion tokens, and you can see the sources in the original paper under Table 2.2. A minority of the training data comes from books and Wikipedia; the overwhelming majority comes from web-crawled repositories like Common Crawl. This is not data that has been processed by humans and checked for accuracy, and, fuck me drunk, that’s a huge part of the problem.

In many technical disciplines you’ll hear about the inviolable principle of GIGO (garbage-in-garbage-out). If you feed an algorithm bad input, you’ll obviously get bad output because it can’t correct what it doesn’t understand is wrong. And we’re talking over 400 billion tokens worth of unverified data from the raw-dogged web.

The lack of good training data is why you see issues like facial recognition reinforcing racist policing, and loan applications being run through commercial black-box AI that doesn’t have to report how it came to a decision.

And if you’re using a giant, unverified corpus of text from the same internet that produces 4chan, uncountable anti-vax sites, and Jenkem, I’d be dubious about the validity of the training data you’re using for your question-answering model. This is why the authors of the “stochastic parrot” article above recommend that all language models use training data libraries that have been verified by humans. At work, we don’t allow any new training data that hasn’t been verified by specialists working in that field; it’s a costly and time-consuming process, but it helps produce accurate models. On top of this, methods of verifying the accuracy of a model under all conditions (especially edge cases) are important to help users understand its limitations.
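As a toy illustration of that kind of verification (not our actual pipeline; the labels and predictions below are made up), here’s a sketch that breaks accuracy down per class so a failing edge case can’t hide behind a decent overall number:

```python
# Aggregate accuracy can look fine while a rare class (the edge case) fails
# badly. Labels and predictions here are invented for illustration only.
from sklearn.metrics import classification_report

y_true = ["oak", "oak", "oak", "maple", "maple", "rare_orchid", "rare_orchid"]
y_pred = ["oak", "oak", "oak", "maple", "oak",   "oak",         "maple"]

# Overall accuracy is 4/7 (about 57%), but the per-class report shows the
# rare class is never identified correctly (recall = 0.0).
print(classification_report(y_true, y_pred, zero_division=0))
```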

As a final note, a common factor in LLM applications (as compared to raw models) is the use of fine-tuning; a good example of this is using GPT-2 to auto-summarise by “fine-tuning” it on a specific article (you can do this at home thanks to HuggingFace; there’s a rough sketch after the list below). The problems with this as the solution to LLMs as knowledgebases are two-fold:

  • This requires you to already have an accurate source, obviating the need for the LLM.
  • The model will still be impacted by the prior learning it has received.
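Here’s roughly what that at-home version looks like, as a minimal sketch using HuggingFace’s transformers summarisation pipeline. Two caveats: the default checkpoint it downloads is a distilled BART model rather than GPT-2 (swap in your own fine-tuned model if you have one), and the article text is a placeholder you’d fill with a trusted source yourself.

```python
# A minimal sketch of summarising a trusted article with the HuggingFace
# transformers pipeline. The default checkpoint is a distilled BART model,
# not GPT-2; the article text below is a placeholder.
from transformers import pipeline

summariser = pipeline("summarization")  # downloads a default model on first run

article = """<paste the text of your trusted source article here>"""

result = summariser(article, max_length=60, min_length=20, do_sample=False)
print(result[0]["summary_text"])
```

Which only works, as the first point above says, because you already had an accurate source in hand.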

Conclusion

LLMs are cool! I think they’re a great option for dicking around, or as leaping-off points in your writing, or for helping you create a cover letter. But even then, you need to understand the context of the results they output. LLMs are fun, but their responses still require human review and human editing.

But for all the reasons above, LLMs are not knowledgebases.