DonUT Model

Is it better to forgo OCR in intelligent document processing?

(Fine-tuning OCR-free DonUT Model for invoice extraction)



Why am I writing this to you? Why should you read this? Why should you trust me?

This is my story of discovering and working with an exciting approach to extracting data from documents. It's a far cry from classical OCR-based approaches, however evolved they may seem. This article aims to show once more that generative models are the way forward.

The story starts as a consultancy project. But first, let me introduce myself: I'm Ahmed Belarbi, a computer science engineer working as an MLOps consultant at AyDesignIt Bv in Belgium, and I've been doing AI development for over 5 years. But enough of my backstory; I have a story to tell. The project I worked on aimed to simplify tax management for independent contractors in Europe.

As you might know, systems for this already exist; they are called intelligent document processing. So what can my client offer to become a top player, and exactly how good is the competition?

So what exactly is this intelligent document processing?

In the bustling world of business documentation, where unstructured data runs wild, Intelligent Document Processing (IDP) swoops in like a superhero. Equipped with the power of artificial intelligence, the speed of machine learning, and the vision of computer vision, IDP automates the processing of documents, transforming them into actionable data. It's like having a personal assistant who not only reads all your documents but also understands them, categorizes them, and then presents you with the exact information you need. Using a combination of natural language processing, machine learning, and computer vision, IDP can handle a wide variety of unstructured document formats, offering scalability, cost-efficiency, and improved customer satisfaction. It's like hiring a superhero but without the need for a secret lair or a fancy costume.

What is the current situation of IDP?

Traditional OCR-based models, like LayoutLM, have been a cornerstone of Intelligent Document Processing (IDP). However, they come with their own set of challenges: accuracy issues caused by document quality, fonts, and text clarity, not to mention the computational cost and slowness of OCR itself.

This is where the Donut Model (Document Understanding Transformer) comes into play. It's like a breath of fresh air in the world of IDP. By eliminating the need for OCR, it addresses many of the challenges associated with traditional models. The Donut Model is a testament to the power of innovation — it's not just about doing things better, but also about doing things differently.

According to the original paper, the Donut Model even outperforms the LayoutLM model in terms of accuracy, which is quite an achievement.

So, if we're navigating the complex maze of document understanding, the Donut Model could be the torchbearer we're looking for. It's a step towards a more efficient, accurate, and OCR-free future of IDP. And who knows, it might just be the secret ingredient to the success recipe in IDP!

So in this article, we will dive into the architecture of the Donut model and how it can be fine-tuned.

The DonUT Architecture

The Donut Model is a new way of looking at document understanding that works without OCR. The model has two parts: a visual encoder and a text decoder. The visual encoder turns the document image into a set of hidden vectors using a Swin Transformer. The text decoder takes these vectors and generates a sequence of tokens using a BART transformer.

During training, the model uses teacher forcing: the decoder receives the ground-truth tokens from the previous steps as input instead of its own predictions. At inference time, the model generates a sequence of tokens in response to a given prompt, which instructs it on the task at hand, whether that's classification, question answering, or information parsing. For example, if we want to know the document type, we feed the image and a classification prompt to the model, and it outputs the document type. If we want to ask a question, we feed the question and the image, and the model generates the answer. The output sequence is then converted to a JSON file. For more information, please read the original paper.
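The output-to-JSON step can be illustrated with a minimal sketch. This is not the actual Donut post-processing code (the real implementation lives in the authors' own utilities); it is a simplified, hypothetical parser showing how a tagged token sequence maps to structured JSON:

```python
import re

def token2json(tokens: str) -> dict:
    """Convert a Donut-style output sequence such as
    "<s_total>9.00</s_total><s_date>2023-01-01</s_date>"
    into a dict. Nested tag groups are parsed recursively."""
    result = {}
    # Find every <s_key>...</s_key> group in the sequence.
    for match in re.finditer(r"<s_(\w+)>(.*?)</s_\1>", tokens, re.DOTALL):
        key, value = match.group(1), match.group(2)
        if "<s_" in value:              # nested group -> recurse
            result[key] = token2json(value)
        else:
            result[key] = value.strip()
    return result

output = "<s_header><s_company>ACME</s_company></s_header><s_total>9.00</s_total>"
print(token2json(output))
# {'header': {'company': 'ACME'}, 'total': '9.00'}
```

A dict like this serializes directly to the JSON file mentioned above with `json.dumps`.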

DonUT architecture. source

How to gather the data?

This is the hardest and most important part of any model fine-tuning. We had two choices for this project: either look for a dataset on the internet that fits our problem, or create our own dataset. Either approach has its own limitations and problems.

As a consultant, I needed a simple and cost-effective solution that didn't require multiple engineers. I should also clarify that the AI aspect of this project depended solely on me, and for my own sanity, I chose not to annotate hundreds, if not thousands, of invoices manually by myself. But if you want to, there are many amazing tools for the job, like UBIAI Text Annotation.

Finding suitable datasets with consistently annotated invoices is challenging. Many available datasets suffer from small sizes or biases that can hinder effective training. Examples of such datasets include Inv3D and SROIE.

Working with the data

We started by using SROIE to fine-tune our model. This effort ultimately yielded a somewhat negative result. The model performed well only on simple invoices, because the data mainly consists of receipts, which are generally less complex than regular invoices. Another issue is that the SROIE dataset annotates only a few fields, which isn't enough for a viable product.

Our next solution was to use a larger and more robust dataset. After a long period of research, we found a dataset called Inv3D. Its main goal is to train models for unwarping curled-up documents, but it can also be used to generate new training data by combining the 2D document images with their text transcriptions.

According to the original paper, the data contains 25,000 samples that are diverse and as bias-free as possible.

Figure 5: Inv3D creation. source

Training DonUT

Training such a large model comes with its own set of issues. Data preparation alone requires significant computational resources, including ample RAM and CPU power, because structuring and arranging the data efficiently involves intricate processing.

This resource-intensive process is essential to ensure the model receives high-quality, properly structured input for accurate and reliable predictions. For Inv3D, a minimum of 256 GB of RAM is required for data preparation. Once the data vectors are fed to the trainer, we face the issue of VRAM. Most consumer GPUs have a limited amount of VRAM, usually less than 24 GB, which is insufficient for our training needs. The only option for more VRAM is professional cards, but these are significantly more expensive and did not fit our client's budget. The intuitive solution was to train in the cloud, on a platform like AWS or GCP; our first choice was to apply for an AWS Startup grant.
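To see why VRAM becomes the bottleneck, a back-of-envelope estimate helps. The sketch below assumes full fp32 fine-tuning with Adam (weights, gradients, and two optimizer moment buffers, roughly 4x the weight memory) and deliberately ignores activations, which usually dominate at realistic batch sizes and image resolutions; the 200M parameter count for the base Donut model is an approximate figure, not from this project:

```python
def training_vram_gb(n_params: float, bytes_per_param: int = 4,
                     overhead_factor: int = 4) -> float:
    """Rough VRAM estimate for full fine-tuning with Adam:
    weights + gradients + two optimizer moment buffers ~= 4x the
    weights in fp32. Activations are NOT included here, and they
    typically dominate for high-resolution document images."""
    return n_params * bytes_per_param * overhead_factor / 1024**3

# ~200M parameters (approximate figure for a Donut-sized model)
print(f"{training_vram_gb(200e6):.1f} GB before activations")
```

The fixed cost looks small, but activation memory grows with batch size and input resolution, which is what pushes real training runs past consumer-card limits.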

Despite receiving a grant, we faced limitations when training our model due to significant resource requirements. Even with the available credit, we were unable to utilize the entire dataset.

We decided to first establish the quality of the data by building a proof of concept. Our approach was to fine-tune the model on a subset of one quarter of the dataset, the maximum our EC2 instance could handle.
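For reference, a reproducible random subset like this can be drawn in a few lines. The function below is an illustrative sketch, not our exact pipeline code; a purely random cut avoids ordering bias, but, as noted later, cannot rule out other biases introduced by cutting the data:

```python
import random

def make_subset(sample_ids, fraction=0.25, seed=42):
    """Draw a reproducible random subset of the dataset IDs.
    A fixed seed makes the experiment repeatable; random sampling
    avoids ordering bias (e.g. samples grouped by template), but
    cannot guarantee every invoice template stays represented."""
    rng = random.Random(seed)
    k = max(1, int(len(sample_ids) * fraction))
    return sorted(rng.sample(sample_ids, k))

ids = list(range(25000))   # Inv3D has 25,000 samples
subset = make_subset(ids)
print(len(subset))         # 6250
```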

The results weren't as good as we wanted; they were even worse than the SROIE fine-tuning. We should keep in mind, however, that we didn't know whether our subset carried biases introduced by cutting the data. We remain positive that training on the full dataset will yield impressive results.

In future articles, we will explain in detail how to fine-tune the model, with examples and code, so stay tuned.

The Big Elephant in the Room: Model hallucination

Model hallucination is a phenomenon in artificial intelligence where a model generates a response containing false or misleading information presented as fact. It can occur when the model is not trained on a sufficiently large and diverse dataset, or when it is trained on biased data. It is also called confabulation or delusion.

Our problem grew as the number of fields extracted from each image increased: the model started to hallucinate, wrongly detecting the same field in most images even when it wasn't present.
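One cheap guard against this kind of hallucination can be sketched as schema-based filtering of the model's JSON output: drop any field the schema does not define, and any value that fails a basic sanity pattern. The field names and regexes below are hypothetical examples, not our production validators:

```python
import re

# Hypothetical per-field validators: a prediction is kept only if
# the field is in the schema AND its value matches the pattern.
FIELD_PATTERNS = {
    "total":      re.compile(r"^\d+(\.\d{2})?$"),
    "date":       re.compile(r"^\d{4}-\d{2}-\d{2}$"),
    "invoice_id": re.compile(r"^[A-Za-z0-9\-/]+$"),
}

def filter_prediction(pred: dict) -> dict:
    """Drop fields the schema does not define and values that fail
    their sanity pattern -- a cheap guard against hallucinated fields."""
    clean = {}
    for key, value in pred.items():
        pattern = FIELD_PATTERNS.get(key)
        if pattern and isinstance(value, str) and pattern.match(value):
            clean[key] = value
    return clean

pred = {"total": "19.99", "date": "not-a-date", "made_up_field": "???"}
print(filter_prediction(pred))    # {'total': '19.99'}
```

A filter like this cannot stop the model from hallucinating, but it keeps obviously invalid fields out of the downstream pipeline.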

This is a common issue with LLMs, and we are experimenting with different approaches to address it. In future articles, we will discuss the measures we took.


This article explored the Donut Model, a new approach to Intelligent Document Processing (IDP) that replaces traditional OCR-based pipelines. By eliminating OCR entirely, it promises to significantly improve document processing. As a consultant, I tried to walk you through the model's architecture, spotlighting its visual encoder and text decoder.

We have not yet been able to validate the superior performance claimed for DonUT in the paper. Still, we are committed to improving the model to meet the highest standards. The current solution doesn't fully solve the problem and comes with its own set of issues; further work is needed to improve data extraction and achieve faster results.

We will write future articles about this project, explaining its different aspects from training to deployment.