Building ReceiptNinja: An Intelligent Receipt Processing Demo App

In today’s digital-first world, managing receipts—whether physical or digital—can be a daunting task for individuals and businesses alike. Manual data entry for expense tracking or finance management is time-consuming, error-prone, and tedious. Enter ReceiptNinja, an intelligent demo application designed to automate this process by extracting key fields from various types of receipts such as images, PDFs, and even physical copies.

In this article, we’ll guide you step by step through building ReceiptNinja, using cutting-edge technologies like Google Gemini for its advanced language and reasoning capabilities, and Doctr, an open-source optical character recognition (OCR) model. The application will seamlessly extract and categorize vital information, including store name, date of purchase, total amount, item list, tax details, payment method, and discounts.

By the end of this guide, you’ll have a fully functional demo app that can be easily integrated into personal finance tools or business expense management systems. Whether you’re a developer looking to explore AI-driven applications or a business professional seeking efficient receipt management solutions, this tutorial will provide you with the practical tools and insights to get started.

OCR is Easy, But Field Extraction Was a Challenge Before LLMs

Optical Character Recognition (OCR) technology has long been used to convert scanned images, PDFs, and other documents into machine-readable text. With modern open-source solutions like Doctr, OCR has become easier than ever, allowing developers to quickly extract raw text from various sources with minimal setup.

However, extracting relevant fields from receipts, such as the store name, date of purchase, total amount, or even itemized lists, presents a much greater challenge. Before the advent of Large Language Models (LLMs) and Generative AI (GenAI), solving this problem required custom solutions that were not scalable. Let’s explore why.

Traditional Approaches: Why They Fell Short

1. Custom Models for Specific Receipt Types

One approach developers took was to train custom machine learning models for specific types of receipts. This could involve building a model that recognizes the structure and layout of a particular format. For example, a grocery receipt might have a predictable structure with the store name at the top, followed by item lists and a total at the bottom. However, this approach required training separate models for each type of receipt, as variations in format between retailers, regions, or even receipt generations made it impossible to generalize.

Training such models for all possible receipts is expensive, time-consuming, and requires a constant influx of data to keep the models up to date.

2. Template-Based Solutions

Another approach was to use template-based matching. Developers would build static templates for various receipts, mapping out the positions of the store name, item list, and totals. While this works for well-defined formats, it fails when the layout changes even slightly—be it from a different printer, a new version of the receipt format, or an unfamiliar store.

The need to manually create and maintain templates for every possible variation of receipt format made this solution non-scalable and fragile.

Enter GenAI: A Scalable Solution

Thanks to advances in Generative AI (GenAI) and Large Language Models (LLMs) like Google Gemini, we now have a powerful alternative for handling the variability and complexity of receipts. LLMs are not constrained by rigid formats or pre-defined templates. Instead, they understand context and semantics, enabling them to extract key fields across a wide variety of receipt formats with high accuracy.

Let’s dive into the core components of building this application.

Step 1: Installing the Required Libraries

  • Pillow: For image processing.
  • PyMuPDF (fitz): For handling PDFs.
  • Doctr: For OCR.
  • Google Generative AI: For field extraction.

Step 2: Using Doctr for OCR

The first step in processing a receipt is extracting the raw text using OCR. We’ll utilize the Doctr library for this task. The class ImageProcessor includes methods to process both image and PDF files, convert them to text, and enhance image quality.

Image and PDF Processing

  • Images are processed using standard libraries such as Pillow, and methods are included to enhance sharpness and adjust orientation.
  • PDFs are handled using PyMuPDF to convert pages into images, which are then processed like any other image.

Here’s an excerpt from the ImageProcessor class that handles image and PDF processing:
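A minimal sketch of such a class, assuming Pillow for enhancement and PyMuPDF for page rendering (the method names and defaults here are illustrative, not the project's exact code):

```python
from PIL import Image, ImageEnhance, ImageOps


class ImageProcessor:
    """Illustrative receipt pre-processor: enhancement plus PDF-to-image conversion."""

    def enhance(self, img: Image.Image, sharpness: float = 2.0) -> Image.Image:
        """Fix camera orientation and boost sharpness before OCR."""
        img = ImageOps.exif_transpose(img)  # apply any EXIF rotation tag
        img = img.convert("RGB")            # normalise colour mode for OCR
        return ImageEnhance.Sharpness(img).enhance(sharpness)

    def pdf_to_images(self, path: str, dpi: int = 200) -> list:
        """Render each PDF page to a PIL image via PyMuPDF."""
        import fitz  # PyMuPDF; imported lazily so image-only use works without it
        pages = []
        with fitz.open(path) as doc:
            for page in doc:
                pix = page.get_pixmap(dpi=dpi)
                pages.append(
                    Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
                )
        return pages
```

Keeping the PyMuPDF import inside `pdf_to_images` means the class still works for image-only workflows when PyMuPDF is not installed.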

Converting PDFs to Images:
For PDFs, each page is converted into an image, processed by OCR, and then stitched together if necessary.
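With the pages available as images, the OCR call itself is short. A minimal wrapper around Doctr's pretrained pipeline might look like this (`DocumentFile` and `ocr_predictor` are Doctr's actual API; the wrapper function is our own):

```python
def ocr_pdf(path: str) -> str:
    """Run Doctr's pretrained OCR over every page of a PDF and return plain text."""
    from doctr.io import DocumentFile       # loads PDFs/images into page arrays
    from doctr.models import ocr_predictor  # detection + recognition pipeline

    model = ocr_predictor(pretrained=True)  # downloads pretrained weights on first use
    pages = DocumentFile.from_pdf(path)
    return model(pages).render()            # plain-text rendering of all pages
```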

Once the OCR text is extracted, the next challenge is making sense of the data—this is where Google Gemini comes in.

Step 3: Applying Google Gemini for Field Extraction

The OCR text is raw and unstructured, but using Google Gemini we can extract key fields such as:

  • Store Name
  • Total Amount
  • Date of Purchase
  • Store Address
  • Currency
  • Payment Method

Using the Gemini Model

We feed the OCR text along with an initial prompt into the Google Gemini model, which then processes and extracts relevant fields in a structured format.

It took us a while to get the prompt right. In the final version, we not only specify the task to the model but also show it a sample example of the expected output.
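A representative prompt in that style, with the Gemini call alongside it, could be built as follows. The prompt wording, the `build_prompt`/`extract_fields` helpers, and the model name are our own assumptions; `GenerativeModel.generate_content` is the `google-generativeai` library's actual API:

```python
import textwrap

FIELDS = ["Store Name", "Total Amount", "Date of Purchase",
          "Store Address", "Currency", "Payment Method"]


def build_prompt(ocr_text: str) -> str:
    """Compose an extraction prompt: task description plus a one-shot example."""
    return textwrap.dedent(f"""\
        Extract the following fields from the receipt text and answer with a
        single JSON object: {", ".join(FIELDS)}. Use null for missing fields.

        Example receipt text:
        ACME GROCERY 123 Main St ... TOTAL $42.17 VISA 2024-03-15
        Example answer:
        {{"Store Name": "Acme Grocery", "Total Amount": 42.17,
          "Date of Purchase": "2024-03-15", "Store Address": "123 Main St",
          "Currency": "USD", "Payment Method": "VISA"}}

        Receipt text:
        {ocr_text}
        """)


def extract_fields(ocr_text: str, api_key: str) -> str:
    """Send the prompt to Gemini and return the raw response text."""
    import google.generativeai as genai
    genai.configure(api_key=api_key)
    model = genai.GenerativeModel("gemini-1.5-flash")  # model name is an assumption
    return model.generate_content(build_prompt(ocr_text)).text
```

Showing the model a worked example (one-shot prompting) markedly reduces the chance of free-form answers that fail JSON parsing downstream.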

The complete code can be found at https://github.com/sankit1/receipt-ninja.