Ark 51 is an AI-powered document repository. It is designed to give you insight, hindsight and foresight into the contents of your document portfolio. It will extract data with significantly greater levels of accuracy than any human – in a fraction of the time – reducing the cost of achieving regulatory compliance and unlocking commercial opportunity.
With any AI-powered solution, people are sometimes (understandably) concerned about the possibility of “hallucinations”. However, in the context of Ark 51 at least, they shouldn’t be. To understand exactly why, let’s take a closer look at the phenomenon of LLM hallucinations.
How do LLMs work?
Whilst Large Language Models (“LLMs”) “have learned to mimic the products of thought”, they do not themselves think. They do not have a true understanding of meaning or reality. They do not reason and they do not verify truth. Rather, they predict text based on statistical patterns found in the datasets upon which they have been trained. In simple terms, an LLM asks the question:
“Given a sequence of words, what word is most likely to come next?”
LLMs process input as a series of linguistic components called “tokens” (typically a word- or syllable-length piece of text, or even punctuation). In simple terms, “tokens” can be thought of as the building blocks of LLMs.
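The prediction loop described above can be sketched in a few lines of Python. The vocabulary and probabilities below are invented purely for illustration; a real LLM computes a distribution over tens of thousands of tokens using billions of learned parameters.

```python
import random

# Toy next-token distribution -- invented for illustration only.
# A real LLM derives these probabilities from its trained parameters.
next_token_probs = {
    "moon": 0.55,
    "Moon": 0.25,
    "Mars": 0.15,   # a plausible but wrong choice: a potential "hallucination"
    "sea": 0.05,
}

def predict_next_token(probs):
    """Pick the next token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return random.choices(tokens, weights=weights, k=1)[0]

prompt_tokens = ["Neil", "Armstrong", "walked", "on", "the"]
print(prompt_tokens + [predict_next_token(next_token_probs)])
```

Note that even this toy model occasionally picks “Mars”: a perfectly valid draw from the distribution, but a factually wrong continuation.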
What is an LLM hallucination?
LLM hallucinations are responses that are factually incorrect, inconsistent, or entirely made up. Often the response, whilst incorrect, is plausible and presented with a high level of confidence.
What types of hallucination occur?
There are several types of hallucination that can occur in LLMs.
Factual inaccuracies
Factual inaccuracies occur when an LLM mishandles information within its existing training data and, as a result, provides an incorrect response. Importantly, it does NOT invent new information in the process.
An example of a factually inaccurate hallucination might be an LLM stating that Yuri Gagarin was the first man to walk on the Moon (it was Neil Armstrong).
Fabrications
Fabrications involve invented information. There are two main types of fabrication – “extrinsic hallucinations” and “intrinsic hallucinations”.
An “extrinsic hallucination” occurs when an LLM produces fabricated information which does NOT contradict the original prompt. An example would be inventing a non-existent book title, or citing a non-existent piece of case law to support a legal argument.
An “intrinsic hallucination” is an error that contradicts the input or context provided. For example, if a prompt states, “The event occurred in 2020”, an LLM response which asserted that the event occurred in 2019 would be regarded as an “intrinsic hallucination”.
Misinterpretations
Misinterpretations occur when LLMs fail to correctly understand input or context. Often, the root cause can be traced back to the prompt itself – typically one which is vague, ambiguous, lacking in context, open-ended or hypothetical, or which involves compound questions (that is, questions which combine multiple complex topics).
Conflicting information
Conflicting information occurs when the output from an LLM contains inherent contradictions. An example would include the statement that “John is a teenager who is 67 years old.”
Needle in a Haystack
The “Needle in a Haystack” problem refers to the challenge LLMs face in retrieving specific, correct information from the huge set of data upon which they are trained. “Needle in a Haystack” hallucinations usually take two forms:
Missed Key Data Points: where incomplete information is provided (for example, citing only one cause of World War I while omitting others), and
Partial Incorrectness: where correct and incorrect information is blended together to produce outputs that are neither entirely true nor entirely false (for example, the statement that Neil Armstrong walked on the moon in 1959 instead of 1969).
Why do hallucinations occur?
Hallucinations stem from the fundamental mathematical structure of LLMs. In simple terms, they occur when an LLM makes a wrong token choice (for example, predicting that “Mars” will be the next word instead of “Jupiter”).
There are a number of possible root causes for LLM hallucinations.
Prompt ambiguity
In practice, prompts which are vague or ambiguous are the primary cause of LLM hallucinations.
An example of an ambiguous prompt would be “Tell me about Tesla.” Is this a reference to the car company or the scientist? If the question is open to interpretation by a human, why should an LLM know any better?
Training Data Limitations
Hallucinations can also occur where an LLM has a gap in its training data or is asked to extrapolate beyond its training data. The LLM will attempt to generate a coherent response but may ‘fill in the blanks’ – typically with plausible, but incorrect, information. This issue is exacerbated by the fact that LLMs are incentivised to provide a response (even when they don’t know the answer).
Biased or outdated information within an LLM’s training set may result in similar outcomes.
Context window constraints
The number of tokens that an LLM can process at any one time is called its “context window”. The larger the context window, the more information the LLM can consider at once – and the lower the probability that a given query will exceed it. In simple terms, if an LLM is required to consider information which (in token terms) exceeds its context window, it may invent information to ‘fill in the blanks’.
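As a rough illustration of the constraint, the sketch below checks whether an input would fit a given context window. Whitespace splitting stands in for a real tokeniser (which typically produces more tokens than words), so the numbers are indicative only.

```python
def fits_in_context(text: str, context_window: int) -> bool:
    """Crude check: does the input fit within the model's context window?

    Whitespace splitting stands in for a real tokeniser, purely for
    illustration; production code would use the model's own tokeniser.
    """
    token_count = len(text.split())
    return token_count <= context_window

document = "word " * 5000        # a ~5,000-"token" document
print(fits_in_context(document, 4096))   # too big for a 4,096-token window
print(fits_in_context(document, 8192))   # fits in an 8,192-token window
```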
Overgeneralisation
The predictive power of an LLM is based on the patterns it discerns from the data on which it has been trained. If a particular pattern within that dataset is overly prevalent (for example, the statement that “most birds can fly”), an LLM may apply it incorrectly to specific cases (“A penguin is a bird, therefore a penguin can fly”).
Overfitting
Overfitting is the opposite of overgeneralisation. It occurs when an LLM memorises specific examples from its training set, instead of learning general patterns. This leads to the creation of dominant (but inaccurate) associations. For example, if an LLM’s training data includes the assertion that “swans are white” (which is true in SOME cases) when asked the question “What colour are swans?”, an LLM might respond with “white” – an answer which is not true in ALL cases.
Attention issues
LLMs employ mechanisms to help them decide which tokens within a prompt should be given priority. Avoiding hallucinations requires the LLM to correctly balance the full context of a query. For example, in the prompt “What is the significance of Washington, D.C.?” a model with a malfunctioning attention mechanism might place undue priority on the word/token “Washington” in isolation. This might result in the LLM generating an answer about George Washington, rather than the city of Washington D.C.
So, what can be done about it?
There are a number of strategies – all of which we use within Ark 51 – which can be employed to reduce the occurrence of LLM hallucinations. In reality, most of these strategies are nothing more than good industry practice.
Improve prompt engineering
Crafting clear, specific and instructive prompts which minimise ambiguity reduces the frequency of hallucinations. For example, a better prompt than simply asking an LLM to “Tell me about Tesla,” would be “Provide verified facts about the scientist Nikola Tesla. Only include information from reliable sources. Admit uncertainty if data is lacking.”
Update LLM models regularly
New LLMs are trained on ever-increasing sets of data. This naturally helps reduce hallucinations. Why? Because the LLM simply has a greater chance of having been exposed to the information that is relevant to a prompt.
The table below shows the exponential growth in model size across OpenAI’s GPT series:

Model               Training Parameters (millions)
GPT-1               117
GPT-2               1,500
GPT-3               175,000
GPT-3.5 (turbo)     175,000
GPT-4 (standard)    ~1,000,000 (estimated; not officially disclosed)
Context windows are growing at the same exponential rate as training datasets. Therefore, the “overspill” problem associated with older LLMs (whereby an LLM ‘filled in the blanks’ if a query was too complex, in token terms, for its available context window) is fast receding into the rear-view mirror as an issue.
Model               Context Window Size (tokens)
GPT-1               512
GPT-2               1,024
GPT-3               2,048
GPT-3.5 (turbo)     4,096
GPT-4 (standard)    8,192
GPT-4 (32k)         32,768
GPT-4 Turbo         128,000
GPT-4o              128,000
GPT-4.1             1,000,000
Human-in-the-loop validation
In critical applications, human experts can be employed to review outputs and to catch errors.
Post-Processing and Fact-Checking
A relatively easy measure to implement to reduce hallucination rates is to enforce consistency by requiring an LLM to choose a result from a finite list of options (for example, a drop-down list). It is equally easy to filter out low-confidence outputs.
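As an illustrative sketch of the ‘finite list’ approach, the snippet below snaps a free-text model output onto an allowed list of options, falling back to “Unknown” when nothing matches closely. The option list and similarity cutoff are invented for illustration.

```python
import difflib

# Hypothetical finite option list -- invented for illustration.
OPTIONS = ["Governing law: English", "Governing law: New York", "Unknown"]

def constrain(raw_output: str, options=OPTIONS, cutoff=0.6):
    """Snap a free-text model output onto a finite list of options.

    Anything that does not closely match an allowed option falls back
    to "Unknown" rather than being passed through unchecked.
    """
    matches = difflib.get_close_matches(raw_output, options, n=1, cutoff=cutoff)
    return matches[0] if matches else "Unknown"

print(constrain("Governing law: english"))       # snaps to the English option
print(constrain("The moon is made of cheese"))   # falls back to "Unknown"
```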
Techniques such as “Retrieval-Augmented Generation” (“RAG”) also help to ground LLM responses in verified external sources of information, reducing the frequency of hallucination (although this becomes less important as the context window of LLMs grows). Within a RAG system, the LLM will supplement its internal knowledge by accessing an external knowledge base of verified information before providing a response.
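A minimal sketch of the RAG pattern follows. The two-entry knowledge base is invented, and simple keyword overlap stands in for the embedding-based retrieval a production system would use; the retrieved passages are prepended to the prompt so the LLM answers from verified text rather than from memory alone.

```python
# Hypothetical verified knowledge base -- invented for illustration.
KNOWLEDGE_BASE = [
    "The ISDA Master Agreement is governed by English law.",
    "The termination date of the facility is 31 December 2026.",
]

def retrieve(question: str, store=KNOWLEDGE_BASE, top_n=1):
    """Return the passages sharing the most words with the question.

    Keyword overlap stands in for embedding-based similarity search.
    """
    q_words = set(question.lower().split())
    scored = sorted(store,
                    key=lambda p: len(q_words & set(p.lower().split())),
                    reverse=True)
    return scored[:top_n]

def build_prompt(question: str) -> str:
    """Ground the question in retrieved, verified passages."""
    context = "\n".join(retrieve(question))
    return ("Answer using ONLY the context below. "
            "If the answer is not present, say so.\n"
            f"Context:\n{context}\n\nQuestion: {question}")

print(build_prompt("What law governs the ISDA Master Agreement?"))
```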
Implement model constraints
Hallucination rates can be reduced by constraining LLM parameters such as “top-k sampling”, “top-p sampling” and “temperature”.
Top-k sampling and top-p sampling are techniques used in LLMs to generate more natural and varied text.
“Top-k sampling” is a technique in which the LLM looks at the “k” most likely next words and randomly picks one from that shortlist. For example, if k = 10, only the 10 most likely words are considered, and one is chosen at random by the LLM.
“Top-p sampling” is slightly different. Instead of picking the “k” most likely next words to occur, the LLM selects the smallest number of words whose total probability adds up to at least “p” (e.g., 0.9). The LLM then randomly selects from that set. This allows for a flexible number of candidate words depending on how confident the model is.
“Top-k sampling” and “top-p sampling” both help an LLM to produce more creative and varied responses. However, at the same time, they introduce randomness, increasing hallucination risks. Reducing “k” and reducing “p” both narrow the pool of candidate words, helping to combat the risk of LLM hallucination.
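The two sampling techniques can be sketched as simple filters over a toy probability distribution (the tokens and probabilities below are invented for illustration):

```python
def top_k_filter(probs: dict, k: int) -> dict:
    """Keep only the k most likely tokens, then renormalise."""
    kept = dict(sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k])
    total = sum(kept.values())
    return {t: p / total for t, p in kept.items()}

def top_p_filter(probs: dict, p: float) -> dict:
    """Keep the smallest set of tokens whose probability mass reaches p."""
    kept, running = {}, 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = prob
        running += prob
        if running >= p:
            break
    total = sum(kept.values())
    return {t: q / total for t, q in kept.items()}

# Toy distribution -- invented for illustration.
probs = {"Jupiter": 0.6, "Mars": 0.2, "Saturn": 0.15, "Pluto": 0.05}
print(top_k_filter(probs, 2))    # only Jupiter and Mars survive
print(top_p_filter(probs, 0.9))  # Jupiter + Mars + Saturn reach 0.95 >= 0.9
```

In both cases, the smaller the surviving candidate pool, the less room there is for an unlikely (and potentially hallucinated) token to be picked.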
In addition, lowering the “temperature” of an LLM model helps to reduce the frequency of hallucinations. Model “temperature” typically ranges between 0.0 and 1.0. Reducing the “temperature” of an LLM reduces randomness, makes the LLM more deterministic, increases the likelihood that the MOST likely next word will be chosen by the LLM, and so ultimately reduces the likelihood of hallucination.
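Temperature can be illustrated as a scaling step applied to the model’s raw token scores before probabilities are computed (the scores below are invented; note that this sketch requires a temperature strictly above zero, whereas real APIs typically treat 0 as “always pick the most likely token”):

```python
import math

def apply_temperature(logits: dict, temperature: float) -> dict:
    """Convert raw token scores (logits) to probabilities at a temperature.

    Lower temperature sharpens the distribution towards the most likely
    token; higher temperature flattens it, adding randomness.
    Requires temperature > 0 in this sketch.
    """
    scaled = {t: l / temperature for t, l in logits.items()}
    max_l = max(scaled.values())  # subtract the max for numerical stability
    exps = {t: math.exp(l - max_l) for t, l in scaled.items()}
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}

logits = {"Jupiter": 2.0, "Mars": 1.0, "Pluto": 0.1}  # toy scores
print(apply_temperature(logits, 1.0))   # fairly spread-out distribution
print(apply_temperature(logits, 0.1))   # near-deterministic: "Jupiter" dominates
```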
So, where does that leave us?
Hallucinations stem from the fundamental mathematical and logical structure of LLMs. In addition, language itself is inherently ambiguous. No LLM training dataset, however large, can resolve all of that ambiguity. Moreover, in the real world, information is constantly changing and being updated. No LLM can ever be up-to-date in all respects at every moment in time.
As such, it is unlikely that hallucinations will ever be completely eradicated from LLMs (at least in the form that they are currently constructed). Nonetheless, as we have seen, most of the levers that can significantly reduce the occurrence of hallucinations in practice are both within our control and are readily available to be pulled. As such, through a combination of architectural improvements, dataset enhancements and fact-checking mechanisms it is possible to achieve very low rates of LLM hallucination.
You also have to bear in mind that certain types of activities are more prone to LLM hallucination than others. For example, where an LLM is asked to perform complex logical or mathematical reasoning, or is asked to speculate on future events, hypothetical scenarios or ambiguous situations, hallucination rates will be higher than might otherwise be the case. However, Ark 51 is NOT in this position. Ark 51 extracts data from legal documentation. It is asked specific questions by reference to specific documents. It is not asked to generalise. There is no ambiguity. There is no need to extrapolate. In simple terms, it is not put in a place where it might hallucinate. The scope for hallucination is, by the very nature of the exercise undertaken, far below the average.
So, hallucinations are a fact of life but – in the context of Ark 51 – they should not be a major concern.
Given the commercial and regulatory imperative of understanding the contents of trading documentation, it is also worth considering the alternatives…
One option would be to use humans to manually extract all data. Whilst humans do not hallucinate in quite the same way as LLMs, they suffer from input errors, they make assumptions, they get bored. In practice, these all present much bigger challenges to accurate data extraction than LLM hallucinations. A human can be expected – AT BEST – to obtain about 85% accuracy. This is WAY below the accuracy rates achieved by Ark 51 (if you are interested in learning more about this, check out our other article here: https://ark-51.com/2025/05/13/the-rise-of-the-machines/). Moreover, Ark 51 will take around 30 seconds to achieve superior results to those a human could achieve in 45 minutes. So, if you are adopting a human-led approach in order to extract a given number of data points to a given level of accuracy, the cost differential when compared to Ark 51 is huge.
Another option would be string-matching software.
Broadly, the string-matching approach involves creating a library of examples of legal prose (together with the correct ‘answer’). The theory is that, by querying this library, the string-matching software will be able to extract data from other documentation with very high levels of accuracy.
So goes the theory. But what about the practice?
In practice, the string-matching approach is hampered by a number of unavoidable dependencies which do not impact AI-led data extraction. First off, for the library to work efficiently, it needs to be populated with a large number of examples (sometimes thousands). This is no good for most buy-side firms, who simply don’t have that many documents. Even for large sell-side firms, it means a huge investment of time (and money) before any results are generated. Even putting that issue to one side, a string-matching approach can only ever be as good as the Optical Character Recognition (“OCR”) upon which it relies in order to create machine-readable text. Even the best OCR platforms only claim to be 98-99% accurate. Moreover, string-matching relies on humans to provide the correct ‘answer’ in the first place – numerous studies have shown that humans are not actually very good at this task. One thing you will also quickly become aware of is the sheer volume of errors introduced into documentation by the lawyers who drafted and negotiated it. String-matching cannot handle this volume of variability. The net result? It’s a bit like Brian Fantana’s “Sex Panther” – 60% of the time it’s accurate 100% of the time.
So, whilst no approach is perfect, AI-led approaches to data extraction are already best-of-breed and only getting better (at a rapid rate of knots). Anything else is legacy. Anything else is static. LLM hallucinations may be a fact of life, but it would be a fallacy to think that they undermine the rationale behind using AI to gain insight into your documentation.
If you would like to learn more about unlocking the content of your contract portfolio or the way in which Ark 51 could help you to implement a robust and cost-effective programme of document data extraction, drop us a line at [email protected].