ReceiptNinja: Using Google Gemini to extract information from Retail Receipts

Ankit Sachan

2 months ago

Categories: How-to

Building ReceiptNinja: An Intelligent Receipt Processing Demo App

In today’s digital-first world, managing receipts—whether physical or digital—can be a daunting task for individuals and businesses alike. Manual data entry for expense tracking or finance management is time-consuming, error-prone, and tedious. Enter ReceiptNinja, an intelligent demo application designed to automate this process by extracting key fields from various types of receipts such as images, PDFs, and even physical copies.

In this article, we’ll guide you step by step through building ReceiptNinja using Google Gemini, for its language and reasoning capabilities, and Doctr, an open-source optical character recognition (OCR) library. The application extracts and categorizes vital information, including store name, date of purchase, total amount, item list, tax details, payment method, and discounts.

By the end of this guide, you’ll have a fully functional demo app that can be easily integrated into personal finance tools or business expense management systems. Whether you’re a developer looking to explore AI-driven applications or a business professional seeking efficient receipt management solutions, this tutorial will provide you with the practical tools and insights to get started.

OCR is Easy, But Field Extraction Was a Challenge Before LLMs

Optical Character Recognition (OCR) technology has long been used to convert scanned images, PDFs, and other documents into machine-readable text. With modern open-source solutions like Doctr, OCR has become easier than ever, allowing developers to quickly extract raw text from various sources with minimal setup.

However, extracting relevant fields from receipts, such as the store name, date of purchase, total amount, or even itemized lists, presents a much greater challenge. Before the advent of Large Language Models (LLMs) and Generative AI (GenAI), solving this problem required custom solutions that were not scalable. Let’s explore why.

Traditional Approaches: Why They Fell Short

1. Custom Models for Specific Receipt Types

One approach developers took was to train custom machine learning models for specific types of receipts. This could involve building a model that recognizes the structure and layout of a particular format. For example, a grocery receipt might have a predictable structure with the store name at the top, followed by item lists and a total at the bottom. However, this approach required training separate models for each type of receipt, as variations in format between retailers, regions, or even receipt generations made it impossible to generalize.

Training such models for all possible receipts is expensive, time-consuming, and requires a constant influx of data to keep the models up to date.

2. Template-Based Solutions

Another approach was to use template-based matching. Developers would build static templates for various receipts, mapping out the positions of the store name, item list, and totals. While this works for well-defined formats, it fails when the layout changes even slightly—be it from a different printer, a new version of the receipt format, or an unfamiliar store.

The need to manually create and maintain templates for every possible variation of receipt format made this solution non-scalable and fragile.

Enter GenAI: A Scalable Solution

Thanks to advances in Generative AI (GenAI) and Large Language Models (LLMs) like Google Gemini, we now have a powerful alternative for handling the variability and complexity of receipts. LLMs are not constrained by rigid formats or pre-defined templates. Instead, they understand context and semantics, enabling them to extract key fields across a wide variety of receipt formats with high accuracy.

Let’s dive into the core components of building this application.

Step 1: Required Libraries

Pillow: For image processing.
PyMuPDF (fitz): For handling PDFs.
Doctr: For OCR.
Google Generative AI: For field extraction.
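
Before going further, it can help to confirm that each library is importable. The sketch below is a small check of our own, not part of the project; it assumes the usual import names (PIL for Pillow, fitz for PyMuPDF, doctr, and google.generativeai for the Gemini SDK), and the helper name missing_modules is hypothetical.

```python
import importlib.util

# Import names assumed for the libraries listed above.
REQUIRED_MODULES = ["PIL", "fitz", "doctr", "google.generativeai"]


def missing_modules(names):
    """Return the names from `names` that cannot be imported here."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec raises when a parent package (e.g. "google") is absent
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    print("Missing:", missing_modules(REQUIRED_MODULES) or "none")
```

An empty result means all four libraries were found in the current environment.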

Step 2: Using Doctr for OCR

The first step in processing a receipt is extracting the raw text using OCR. We’ll utilize the Doctr library for this task. The class ImageProcessor includes methods to process both image and PDF files, convert them to text, and enhance image quality.

Image and PDF Processing

Images are processed using standard libraries such as Pillow, and methods are included to enhance sharpness and adjust orientation.
PDFs are handled using PyMuPDF to convert pages into images, which are then processed like any other image.
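
As a rough illustration of the enhancement step, the following sketch uses Pillow's ImageOps and ImageEnhance modules. It is not the project's code: the helper name, the EXIF-based rotation, and the sharpness factor of 2.0 are all assumptions chosen for the example.

```python
def enhance_receipt_image(image, sharpness=2.0):
    """Return an upright, sharpened RGB copy of a Pillow image.

    Hypothetical helper: the EXIF-based rotation and the sharpness factor
    are assumptions, not the project's actual settings.
    """
    # Pillow is imported here so the module loads even before it is installed.
    from PIL import ImageEnhance, ImageOps

    # Apply the camera's EXIF orientation tag, if any, so the text is upright.
    upright = ImageOps.exif_transpose(image)
    # A factor of 1.0 leaves the image unchanged; larger values sharpen edges.
    return ImageEnhance.Sharpness(upright.convert("RGB")).enhance(sharpness)
```

EXIF data only covers photos taken with a rotated camera; scans that are skewed on the page need a different correction, which this sketch does not attempt.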

Here’s an excerpt from the ImageProcessor class that handles image and PDF processing:

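from PIL import Image
from doctr.io import DocumentFile
from doctr.models import ocr_predictor

def process_image(self, image_path):
    img_original = Image.open(image_path)  # Load the image
    # Use Doctr OCR to extract text; assume_straight_pages=False allows rotated pages
    model = ocr_predictor('db_resnet50', pretrained=True, assume_straight_pages=False)
    doc = DocumentFile.from_images([image_path])
    result = model(doc)
    # Flatten pages -> blocks -> lines -> words into one space-separated string
    ocr_text = " ".join(word.value for page in result.pages for block in page.blocks for line in block.lines for word in line.words)

    return img_original, ocr_text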
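
Joining every word with a single space discards the line structure of the receipt. As an alternative sketch (a variation of ours, not the project's approach), the same pages, blocks, lines, and words traversal can keep one string per OCR line. The function name ocr_result_to_lines is hypothetical, and it relies only on the attributes used in the excerpt above.

```python
def ocr_result_to_lines(result):
    """Flatten an OCR result into text lines, one per detected line.

    Works with any object exposing pages -> blocks -> lines -> words -> value,
    which is the structure traversed in the excerpt above.
    """
    lines = []
    for page in result.pages:
        for block in page.blocks:
            for line in block.lines:
                text = " ".join(word.value for word in line.words)
                if text:  # skip lines with no recognized words
                    lines.append(text)
    return lines


# Usage: ocr_text = "\n".join(ocr_result_to_lines(result))
```

Keeping line breaks may make it easier for a downstream model to associate a label such as a total with the amount printed next to it, though we have not measured the effect.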

Converting PDFs to Images:
For PDFs, each page is converted into an image, processed by OCR, and then stitched together if necessary.

import fitz  # PyMuPDF
from PIL import Image

def convert_pdf_to_images(self, file_path, dpi=300):
    images = []
    # PDF pages are measured in points (72 per inch), so dpi / 72 is the zoom factor
    zoom = fitz.Matrix(dpi / 72, dpi / 72)
    with fitz.open(file_path) as pdf_document:  # close the file when done
        for page in pdf_document:
            pix = page.get_pixmap(matrix=zoom)  # render the page as RGB pixels
            img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
            images.append(img)
    return images
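
The dpi / 72 factor in this function follows from PDF page sizes being measured in points, with 72 points to the inch. A small sketch of the arithmetic (the function name rendered_size is illustrative, and the result is approximate because the renderer may round differently):

```python
POINTS_PER_INCH = 72  # PDF page dimensions are expressed in points


def rendered_size(width_pt, height_pt, dpi=300):
    """Approximate pixel size of a page rendered at `dpi`."""
    zoom = dpi / POINTS_PER_INCH
    return round(width_pt * zoom), round(height_pt * zoom)


# An A4 page is about 595 x 842 points.
print(rendered_size(595, 842))      # (2479, 3508)
print(rendered_size(595, 842, 72))  # (595, 842)
```

At 300 DPI each page becomes an image of several megapixels, so a lower value may be worth trying if memory or OCR time becomes a concern.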

Once the OCR text is extracted, the next challenge is making sense of the data—this is where Google Gemini comes in.

Step 3: Applying Google Gemini for Field Extraction

The OCR text is raw and unstructured, but using Google Gemini we can extract key fields such as:

Store Name
Total Amount
Date of Purchase
Store Address
Currency
Payment Method
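
These fields map naturally onto a small typed record. The sketch below uses the JSON key names from the prompt shown later in this article; the class name ReceiptFields, the helper missing_keys, and the chosen Python types are assumptions made for illustration.

```python
from typing import List, TypedDict


class ReceiptFields(TypedDict):
    """Expected shape of a successful extraction (error values are strings)."""
    store_name: str
    store_address: str
    total_amount: float
    currency: str              # symbol or code, e.g. "$" or "USD"
    bill_date: str             # "YYYY-MM-DD" or "YYYY-MM-DD HH:MM"
    payment_method: List[str]


EXPECTED_KEYS = set(ReceiptFields.__annotations__)


def missing_keys(record: dict) -> set:
    """Return the expected keys that are absent from `record`."""
    return EXPECTED_KEYS - set(record)
```

A check like missing_keys can run on every response before the data is passed to an expense tool.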

Using the Gemini Model

We feed the receipt image, the OCR text, and a prompt into the Google Gemini model, which extracts the relevant fields and returns them in a structured format.
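
A minimal sketch of such a call, assuming the google-generativeai Python package, is shown here. The model name "gemini-1.5-flash", the GOOGLE_API_KEY environment variable, and both function names are assumptions rather than the project's actual choices. Because the prompt below expects the receipt image as well, the sketch passes the image alongside the text.

```python
import os


def build_request_text(prompt, ocr_text):
    """Append the OCR text to the prompt, which ends with an 'OCR text:' slot."""
    return f"{prompt}\n{ocr_text.strip()}"


def extract_fields(prompt, ocr_text, image, model_name="gemini-1.5-flash"):
    """Send prompt, OCR text, and a PIL image to Gemini; return the reply text.

    Assumptions: the google-generativeai package, an API key stored in the
    GOOGLE_API_KEY environment variable, and the default model name above.
    """
    # Imported lazily so build_request_text works without the SDK installed.
    import google.generativeai as genai

    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
    model = genai.GenerativeModel(model_name)
    response = model.generate_content(
        [build_request_text(prompt, ocr_text), image]
    )
    return response.text
```

The reply is returned as plain text; it still has to be parsed into JSON before the fields can be used.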

It took us a while to get the prompt right. The final prompt, shown here, not only specifies the task but also includes an example of the expected output:

prompt = '''1. **Input:** You will receive an image containing details from a shopping or food store bill. The content may vary in format and may be in different languages. Use your own internal vision capabilities to accurately extract the relevant text directly from the image, without relying on external OCR libraries like Tesseract or any other Python-based tools. Additionally, an OCR (Optical Character Recognition) output will be provided as a reference. The OCR text may contain errors or inaccuracies, so your primary task is to use your own vision capabilities to extract the correct details directly from the image.

2. **Objective:** Your task is to extract specific details from the bill and return them as a formatted JSON object. Use the exact key names provided below and ensure that all data is translated to English. If any detail is missing, unclear, or unreadable, follow the error handling instructions outlined below.

3. **Extraction Rules:**
- **Store Name ("store_name")**: Extract the full name of the store from which the bill originates. Ensure the name is accurate and complete.(Always present in image)
- **Store Address ("store_address")**: Extract the full address of the store, including street, city, postal code, and country if available. (Always present in image)
- **Total Amount ("total_amount")**: Extract the total amount charged on the bill. Interpret the currency based on the image and store it separately.
- **Currency ("currency")**: Extract the currency of the total amount, which may be in various formats such as symbols (e.g., $, €, RM) or abbreviations (e.g., USD, EUR, MYR).
- **Bill Date ("bill_date")**: Extract the date of the transaction. Format this as **YYYY-MM-DD**. If the time is also present, include it in the format **YYYY-MM-DD HH:MM**.
- **Payment Method ("payment_method")**: Extract the method of payment used (e.g., cash, credit card, debit card). If multiple methods are listed, extract every method(e.g., cash, credit card, debit card, coupon) that is used for the transaction.

4. **Error Handling:**
- If any detail cannot be extracted or is unclear, prefix the value of the relevant field with "ERROR:" and include an explanation of the issue.
Example:
```json
{
"store_name": "ERROR: Store name not visible...(more details)",
"store_address": "123 Example Street, Example City, EX 12345, USA",
"total_amount": 92.50,
"currency": "USD",
"bill_date": "2023-09-01 14:30",
"payment_method": ["coupon","Credit Card"]
}
```

5. **OCR Text for Reference:**
- Use the OCR text as a supplementary reference only. If you cannot confidently extract the information from the image alone, you may use the OCR text as a hint to guide you, but always prioritize your own extraction over the OCR data.
- OCR text:
'''
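
The prompt asks for a JSON object and marks unreadable fields with an "ERROR:" prefix, so the reply needs a small amount of post-processing. The following is a sketch under the assumption that the reply contains exactly one JSON object, possibly wrapped in extra text or a code fence; both function names are hypothetical.

```python
import json


def parse_receipt_json(reply_text):
    """Parse the JSON object in the model's reply, ignoring surrounding text."""
    start, end = reply_text.find("{"), reply_text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in the reply")
    return json.loads(reply_text[start:end + 1])


def fields_with_errors(fields):
    """Return {key: message} for values flagged with the 'ERROR:' prefix."""
    prefix = "ERROR:"
    return {
        key: value[len(prefix):].strip()
        for key, value in fields.items()
        if isinstance(value, str) and value.startswith(prefix)
    }
```

Fields returned by fields_with_errors can then be routed to manual review instead of being stored as if they were valid values.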

The complete code is available at https://github.com/sankit1/receipt-ninja
