
Today, almost every firm understands the value of the information contained within its portfolio of legal agreements. However, the challenge remains as to how to transform that information into structured data, capable of providing actionable insights and commercial advantage. Artificial intelligence (“AI”) is revolutionising a once manual process. But do humans have nothing left to bring to the party?
AI can deliver huge benefits
Scalability
AI dramatically reduces the need to perform repetitive manual tasks – automatically parsing documents and extracting key data points. In doing so, it frees up internal resources to focus on strategic tasks rather than tedious data entry. However, it is the ability of AI to process high volumes of data in a fraction of the time required by manual methods that is truly transformational.
Extracting 100 points of data manually from a document such as the ISDA Master Agreement or Credit Support Annex typically takes upwards of 30 minutes. Moreover, the time taken scales in a linear fashion as data requirements increase. In other words, 200 data points will take twice as long to extract manually as 100 data points – with a concomitant doubling of cost.
In comparison, with a well-established pipeline, AI can extract 100 points of data in just a few seconds. Whilst it also scales (broadly) in a linear fashion as data requirements increase, the absolute incremental time increases are so small as to be almost imperceptible.
Put simply, the greater your data needs, the more compelling the business case for AI-led data extraction becomes.
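The scaling argument above can be made concrete with a short sketch. The 30-minutes-per-100-data-points figure comes from the article; the AI time per 100 data points used below is an illustrative assumption ("a few seconds"), not a benchmark.

```python
# Illustrative comparison of manual vs AI extraction effort, using the
# figures cited above: ~30 minutes per 100 data points manually, versus
# a few seconds with an established AI pipeline. The AI figure below
# (5 seconds per 100 data points) is an assumption for illustration.

def manual_minutes(documents: int, data_points: int) -> float:
    """Manual effort scales linearly: ~30 minutes per 100 data points."""
    return documents * (data_points / 100) * 30

def ai_minutes(documents: int, data_points: int) -> float:
    """AI effort also scales linearly, but the increments are tiny."""
    return documents * (data_points / 100) * (5 / 60)

portfolio, points = 1_000, 200
print(f"Manual: {manual_minutes(portfolio, points) / 60:,.0f} hours")
print(f"AI:     {ai_minutes(portfolio, points) / 60:,.1f} hours")
```

For a 1,000-document portfolio at 200 data points per document, the manual approach works out at 1,000 hours of effort; the AI pipeline completes the same task in under three hours.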

Ark 51 can extract data in a fraction of the time manual methods require
Information in real time
Not only do firms need information, but that information must be up to date and delivered in real time. Here too, AI prevails over manual data extraction – enabling organisations to query and re-query portfolios as circumstances require – empowering them to respond swiftly to market dynamics and operational changes.

Ark 51’s AI engine enables document portfolios to be queried and re-queried with ease
Data in context
In contrast to human-led data extraction, AI models can quickly and automatically implement intelligent validation mechanisms, cross-referencing extracted data against other sources of information – eliminating inconsistencies and reducing human error. For example, Ark 51 automatically cross-references counterparty names against the Global Legal Entity Identifier database – confirming counterparty identity. It also automatically classifies those same counterparties for the purposes of a number of regulations (such as CRR, EMIR and DORA) – enhancing compliance.

Ark 51 automatically enriches data within its context
But what about accuracy?
It is important to realise that, as a first step in any data extraction process, documents must be processed using Optical Character Recognition (“OCR”). In simple terms, OCR turns words on a page into machine-readable text – rendering documents searchable and enabling AI engines to query their contents. The best OCR engines on the market today claim accuracy levels of between 98% and 99% (although, provided that the underlying document is of good quality, accuracy levels tend to be higher). Why is this relevant? Because, broadly, AI data extraction can only ever be as good as the accuracy of the underlying OCR process.
What is clear is that AI models already surpass human accuracy in data extraction tasks. Studies[1] (supported by our own testing) suggest that humans can be expected to achieve, at best, about 85% accuracy when manually extracting data from (relatively simple) contracts – even where they have substantial experience of the documentation under review. The reality is that human accuracy levels are likely to be significantly lower than 85% – once the individual is taken out of time-bound, focussed laboratory conditions and is required to review longer, more complex documentation under the time pressures and distractions of work.
Our testing suggests that Ark 51 achieves approximately 95% accuracy on average when extracting data. In simple terms, Ark 51 already vastly outperforms even the best results produced by human beings. Moreover, in contrast to humans, the accuracy of Ark 51 continues to improve all of the time as its bespoke AI extraction pipeline is further enhanced.
These observed results offer useful insights to any firm which is planning to programmatically extract data from its document portfolios. Based on these findings, the average confidence levels that firms can expect to have in data, depending on their chosen approach to quality assurance, are:
| Using Ark 51? | 0 Layers | 1 Layer | 2 Layers | 3 Layers | 4 Layers | 5 Layers |
|---|---|---|---|---|---|---|
| No | N/A | 85.000 | 97.750 | 99.663 | 99.949 | 99.992 |
| Yes | 95.000 | 99.250 | 99.888 | 99.983 | 99.997 | 99.999 |
Percentage Confidence Levels in Data
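These figures follow from a simple compounding model: if each layer of human quality assurance is an independent reviewer with roughly 85% accuracy, then each layer removes roughly 85% of whatever errors remain. A minimal sketch, assuming independent reviewers:

```python
# Sketch of the compounding model behind the table above, assuming each
# human QA layer is an independent reviewer with ~85% accuracy (i.e.
# each layer removes ~85% of the errors that remain). This is an
# idealised model for illustration, not a guarantee.

HUMAN_ERROR = 0.15  # 85% human accuracy per review pass

def confidence(base_error: float, human_layers: int) -> float:
    """Residual error shrinks by a factor of 0.15 per independent layer."""
    return 100 * (1 - base_error * HUMAN_ERROR ** human_layers)

# Human-only extraction: the first manual pass is itself the first layer.
print([round(confidence(1.0, n), 3) for n in range(1, 6)])
# AI extraction (95% baseline) followed by n layers of human review.
print([round(confidence(0.05, n), 3) for n in range(0, 6)])
```

Running this reproduces both rows of the table, which is why one layer of human review over AI-extracted data (99.25%) already beats two layers of purely human effort (97.75%).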
Isn’t 100% accuracy achievable?
Sorry to disappoint. Outside of the laboratory (where correct responses are known in advance), it is not possible to achieve (and certainly not possible to prove) 100% accuracy. There are a number of reasons why this is the case (most of which have little to do with the capabilities of AI).
Best-in-class OCR can only deliver approximately 99% accuracy.
The OCR process itself is not perfect. Moreover, the lower the quality of the ‘original’ document, the more prone it becomes to OCR errors. Digital watermarks are a particular hazard when it comes to accuracy of the OCR process.
Fortunately, one of the main benefits of an AI-led approach to data extraction over more traditional ‘string matching’ is AI’s inherent flexibility and creativity – it can overcome typos and still infer the correct meaning. Rules-based approaches (such as string matching) work well when the information extracted satisfies the rule or matches the string EXACTLY. However, when the format of a document changes, or when an error (like a typo) is encountered, they begin to struggle badly. Even where AI is concerned, though, there are still limits to what can be achieved.
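The fragility of exact matching can be illustrated with Python's standard library. The clause text below is invented for illustration; the similarity measure is a crude stand-in for the flexibility an AI model offers, not a description of how any particular engine works.

```python
from difflib import SequenceMatcher

# A rules-based extractor looking for an exact phrase fails on a typo...
expected = "Automatic Early Termination will apply"
found = "Automatic Early Terminaton will apply"  # typo: 'Terminaton'

print(expected == found)  # exact match: False

# ...whereas a similarity-based check (a crude stand-in for the
# tolerance of an AI model) still recognises the clause.
similarity = SequenceMatcher(None, expected, found).ratio()
print(similarity > 0.9)  # True: the strings are highly similar
```

An exact-match rule scores this as a complete miss; a tolerant comparison correctly treats it as the same provision with a one-character defect.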
Current document design is sub-optimal.
Broadly, OCR engines read a page of text strictly from left to right and from top to bottom. Why is this an issue? Well, if we take an industry standard document such as the ISDA Master Agreement, it is typically drafted like this:

Current document design does not always assist automated data extraction
An OCR engine will read this text as “Schedule to the 2002 Master Agreement dated as of 1 April 2025 between Big Bank PLC and Super Client Ltd (“Party A”) (“Party B”)”. If the document’s draftsman has accidentally inserted an extra carriage return in front of the word “and”, an OCR engine might even read the text as “Schedule to the 2002 Master Agreement dated as of 1 April 2025 between Big Bank PLC Super Client Ltd (“Party A”) and (“Party B”)”. The point is that seemingly innocuous formatting errors which barely register with a human reader can compound the challenges of the data extraction process.
Whilst neither of these particular examples poses an insurmountable problem for an AI engine, more complex examples abound (think “eligible collateral schedules” within a CSA). If the industry is serious about turning legal prose into actionable information, expect to see redesign of documentation within the next few years – all in the name of closing the gap to 100% accuracy. Some of this redesign work will be very simple (see the example below). However, some of it will have to be far more fundamental.

The ISDA Schedule would be easier for a machine to read if drafted like this
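The reading-order problem can be simulated in a few lines. Assuming a two-column layout in which the party names sit on one visual row and their labels on the next, strict left-to-right, top-to-bottom reading separates each party from its label:

```python
# Simulating how strict left-to-right, top-to-bottom (row-major) reading
# of a two-column layout separates each party name from its label.
# The party names are the illustrative ones used in the example above.
rows = [
    ["Big Bank PLC", "Super Client Ltd"],  # names across one visual row
    ['("Party A")', '("Party B")'],        # labels across the next row
]

# Row-major order: how a basic OCR engine reads the page.
print(" ".join(cell for row in rows for cell in row))
# → Big Bank PLC Super Client Ltd ("Party A") ("Party B")

# Column-major order: how a human associates each name with its label.
print(" ".join(cell for col in zip(*rows) for cell in col))
# → Big Bank PLC ("Party A") Super Client Ltd ("Party B")
```

The row-major output is exactly the garbled reading described earlier; recovering the column-major association is what layout-aware extraction has to do.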
Documents contain errors
One of the first things you become acutely aware of when extracting data from legal documentation is that the documents themselves contain errors – lots of errors. In reality, errors in the document are at least as pervasive as errors in the extraction process.
In turn, this raises the key question of what your approach should be to dealing with aspects of a document that are (obviously?) an error. Do you amend the record so that it reflects what you (confidently?) suspect to have been the correct state of affairs? Or do you faithfully represent the document – warts and all?
We believe that ‘warts and all’ is the way to go.
Hopefully you will not have to blow the dust off a contract due to the commencement of legal proceedings or the occurrence of another Archegos Capital Management. However, if you do, it will be better to know what the contract says – warts and all – rather than what it probably should have said had the individuals involved in its creation not made an error. Knowledge of the existence of an error is itself valuable information.
True meaning can be hard to infer
It is genuinely possible (and, indeed, not uncommon) for a provision within a contract to be capable of being ascribed multiple interpretations – all of which are legitimate. This is not surprising. Disagreement over the meaning of contractual provisions is the basis of many legal disputes. The reason is that, in reality, contracts often document compromises. Sometimes they are couched in deliberately vague terms so that both parties feel able to sign. “Have we just created a ticking time bomb? Maybe, but the chances of it exploding are small and it will be a problem for another day.” The truth is that this kind of approach is absolutely justifiable in the real world, where the commercial imperative of getting a deal signed necessitates the parties accepting a degree of risk.
Justifiable, yes, but it doesn’t make the quest for 100% accuracy any easier. In simple terms, if a group of human reviewers could all legitimately differ in their interpretation of a given clause, why would we expect an AI engine to provide an undisputed version of the ‘truth’?
How deep are your pockets?
Research by Azure’s Document Intelligence service suggests that the cost of achieving 99.9% accuracy is EIGHT TIMES the cost of achieving 95% accuracy. Moreover, the cost of achieving 100% accuracy (within laboratory conditions) is estimated to be TEN TIMES the cost of achieving 99% accuracy.
The cost of data extraction rises exponentially the more we try to close the gap to 100% accuracy. Knowing this, as well as the fact that 100% accuracy is unattainable anyway, it becomes clear that all firms will have to draw the line somewhere. The question is just “where?”
Is 100% accuracy even the right question?
In truth, to ask whether 100% accuracy is achievable is probably to ask the wrong question. Unless you were the actual author of the document (and therefore know, with certainty, what data it contains) there is no way to PROVE 100% accuracy of data extraction. Accept this fact. Don’t let perfection be the enemy of the ‘still-incredibly-valuable’.
So, what’s the best approach?
The average confidence levels provided in the table above are incredibly insightful. However, averages can hide a set of deeper truths about document data extraction – truths which can assist in the design of an optimal document data extraction process:
- Both humans and AI are better at extracting simple data than complex data (although AI is also significantly better than humans at extracting simple data).
- If an item of data is not where a human would expect it to be within a document then human accuracy rates plummet – humans get bored, make assumptions and feel the pressure of other work deadlines. AI accuracy levels also dip in these circumstances – but only very marginally. This creates a huge disparity in real-world performance between humans and AI.
- Humans are still better than AI at reviewing extremely complex information (provided that the human’s attention is brought to the existence of the data in question) – although this gap is closing all the time.
The table also provides insight as to where the balance between cost and accuracy may lie. It is clear that:
- In terms of extracting a given volume of data to a desired level of confidence, using a combination of AI and “human in the loop” review (as opposed to human-only extraction) enables the user to ‘take two humans out of the loop’ without compromising on overall data quality. Not only will this generate significant cost savings but it will allow information to be processed in a fraction of the time.
- In terms of optimal layers of quality assurance, there is little to be gained by implementing more than one layer of “human in the loop” checking over AI-based extraction. In reality, “four-eyes” checking of data reduces error rates, but by less than statistical averages suggest should be the case.[2]

Additional “humans in the loop” create diminishing returns
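The diminishing returns can be quantified under the same independence assumption used for the confidence table: each extra reviewer removes roughly 85% of an ever-smaller pool of residual errors, so the absolute gain per layer shrinks rapidly (and, as the cited research on double data extraction suggests, real-world reviewers share blind spots, so actual gains are smaller still).

```python
# Marginal gain from each additional human reviewer over a 95%-accurate
# AI baseline, assuming each independent reviewer catches ~85% of the
# errors that remain. An idealised model: real "four-eyes" checking
# underperforms it because reviewers are not truly independent.
error = 0.05  # 5% residual error after AI extraction
for layer in range(1, 5):
    gain = error * 0.85  # errors removed by this layer of review
    error -= gain
    print(f"Layer {layer}: +{100 * gain:.3f} pts confidence "
          f"(residual error {100 * error:.4f}%)")
```

The first reviewer adds over four percentage points of confidence; the second adds barely two-thirds of a point; by the third, the gain is measured in hundredths of a point.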
With this in mind, and based on our years of experience in this area, we believe that it is helpful to divide your overall document portfolio into two distinct parts. For convenience, we will refer to them here as:
- the “High Risk Portfolio”, and
- the “Lower Risk Portfolio”.
Documents which include ‘high risk’ or ‘complex’ information (such as NAV triggers, ratings downgrade provisions or eligible collateral schedules) or which relate to ‘premier clients’ should be regarded as part of the “High Risk Portfolio”. Those documents which remain should be regarded as being part of the “Lower Risk Portfolio”.
We recommend that clients adopt one of two approaches, based on their budget and internal risk tolerance:
| Approach 1: Extract data from the entire document portfolio using AI but focus “human in the loop” effort on reviewing data from the ‘High Risk Portfolio’ only. |
| Approach 2: Extract data from the High Risk Portfolio only using AI and focus “human in the loop” effort on reviewing ‘high risk data’ only. |
In adopting either of these approaches, in return for a reasonable investment in the underlying process, you can expect to achieve accuracy levels of between 95% and 99.25%. The likelihood is that they will be closer to the top end of this range.
We do NOT see value in inserting more than one “human in the loop” to check AI extracted data. In simple terms, it is difficult to justify the marginal increase in confidence levels in the underlying data when set against the additional cost.
So, there’s still value in having a “human in the loop”?
Absolutely, yes.
The “human-in-the-loop” approach allows complex information to be validated and for ambiguous cases to be flagged for review – ensuring that the final output is reliable and actionable.
Conclusion
Data is the new competitive frontier. Unlocking the information contained within portfolios of legal agreements can provide any organisation with genuine insight, hindsight and foresight into its risks and commercial opportunities.
Artificial intelligence is the foundation stone upon which any programme of document data extraction should be built. By comparison, “human-only” approaches are costly, inefficient and will not deliver the results that could otherwise be achieved.
But a “human in the loop” process still has definite value – harnessing the power of AI to enhance operational efficiency and provide strategic insight.
If you would like to learn more about unlocking the content of your contract portfolio or the way in which Ark 51 could help you to implement a robust and cost-effective programme of document data extraction, drop us a line at [email protected].
References
- “The Race to 100% Accuracy in Document Data Extraction: Challenges and Innovations”: https://www.linkedin.com/pulse/race-100-accuracy-document-data-extraction-challenges-asit-sahoo-0dxzc/
- “Artificial Intelligence More Accurate Than Lawyers for Reviewing Contracts, New Study Reveals”: https://www.prnewswire.com/news-releases/artificial-intelligence-more-accurate-than-lawyers-for-reviewing-contracts-new-study-reveals-300603781.html
- “An AI just beat top lawyers at their own game”: https://mashable.com/article/ai-beats-humans-at-contracts?
- “The Verdict Is In: AI Outperforms Human Lawyers in Reviewing Legal Documents”: https://futurism.com/ai-contracts-lawyers-lawgeex?
- “How good are large language models for automated data extraction from randomized trials?” https://www.medrxiv.org/content/10.1101/2024.02.20.24303083v1.full.pdf+html
[1] Artificial Intelligence More Accurate Than Lawyers for Reviewing Contracts, New Study Reveals: https://www.prnewswire.com/news-releases/artificial-intelligence-more-accurate-than-lawyers-for-reviewing-contracts-new-study-reveals-300603781.html?
[2] “Effects of double data extraction on errors in evidence synthesis: a crossover, multicenter, investigator-blinded, randomized controlled trial”: https://academic.oup.com/pmj/advance-article-abstract/doi/10.1093/postmj/qgae195/7944091?redirectedFrom=fulltext
