Build OCR Features in C# Using IronOCR

Optical Character Recognition (OCR) enables applications to extract text from images and scanned documents. IronOCR is a popular C# library that simplifies OCR integration for .NET developers. This article shows how to set up IronOCR, implement common OCR features, and optimize results for accuracy and performance.

Prerequisites

.NET 6 or later (reasonable default)
Visual Studio or another C# IDE
NuGet package manager
Sample image or PDF to test

Installation

Create or open a .NET project.
Install IronOCR from NuGet:

bash
dotnet add package IronOcr

Add using directive in your code files:

csharp
using IronOcr;

Basic OCR: Extract Text from an Image

Initialize the OCR engine and read an image:

csharp
var Ocr = new IronTesseract();
using (var Input = new OcrInput(“invoice.jpg”))
{
var Result = Ocr.Read(Input);
    Console.WriteLine(Result.Text);
}

Explanation:

IronTesseract wraps Tesseract with pre-configured settings.
OcrInput accepts file paths, byte arrays, URLs, and streams.

OCR from PDF

csharp
var Ocr = new IronTesseract();
using (var Input = new OcrInput(“scan.pdf”))
{
    Input.Deskew();           // optional preprocessing
    Input.EnhanceContrast();  // optional preprocessing
    var Result = Ocr.Read(Input);
    Console.WriteLine(Result.Text);
}

IronOCR handles multi-page PDFs and returns aggregated text.

Extract Structured Data (e.g., Invoice Fields)

Use regular expressions or simple parsing against Result.Text:

csharp
var text = Result.Text;
var invoiceNumber = Regex.Match(text, @“Invoice\s#?:?\s(\w+)”).Groups[1].Value;
var total = Regex.Match(text, @“Total\s[:\$]?\s([\d,.]+)”).Groups[1].Value;

For more robust extraction, combine OCR with template matching or ML-based parsers.

Improve Accuracy with Image Preprocessing

Deskew(), EnhanceContrast(), RemoveNoise(), Sharpen() are available on OcrInput:

csharp
Input.Deskew()
     .EnhanceContrast()
     .RemoveNoise()
     .Sharpen();

Choose preprocessing steps empirically per document type.

Language and Character Support

Set specific OCR language packs if needed:

csharp
Ocr.Language = OcrLanguage.English;

For other languages, load appropriate language data via IronOCR settings or NuGet language packs.

Handling Layout and Zones

Use regions to focus OCR on parts of an image:

csharp
Input.AddImage(“form.jpg”);
Input.AddRectangle(100, 200, 400, 100); // x, y, width, height
var Result = Ocr.Read(Input);

Process multiple zones separately to preserve structure.

Performance and Concurrency

Reuse IronTesseract instance across calls to reduce model load time.
For high throughput, run OCR tasks in parallel but limit degree of parallelism to avoid CPU/IO contention.

Output Formats

Export OCR results to plain text, searchable PDF, or JSON:

csharp
File.WriteAllText(“output.txt”, Result.Text);
Result.SaveAsSearchablePdf(“searchable.pdf”);

Error Handling and Logging

Catch exceptions and inspect Result.PageErrors or Result.Paragraphs for diagnostics.

csharp
try { var Result = Ocr.Read(Input); }
catch (Exception ex) { Console.WriteLine(ex.Message); }

Example: Console App Putting It Together

csharp
using System;
using IronOcr;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        var Ocr = new IronTesseract();
        using (var Input = new OcrInput(“invoice.jpg”))
        {
            Input.Deskew().EnhanceContrast().RemoveNoise();
            var Result = Ocr.Read(Input);

            Console.WriteLine(“Raw Text:\n” + Result.Text);

            var invoiceNumber = Regex.Match(Result.Text, @“Invoice\s#?:?\s(\w+)”).Groups[1].Value;
            var total = Regex.Match(Result.Text, @“Total\s[:\$]?\s([\d,.]+)”).Groups[1].Value;

            Console.WriteLine(\("Invoice #: </span><span class="token interpolation-string interpolation" style="color: rgb(57, 58, 52);">{</span><span class="token interpolation-string interpolation expression language-csharp">invoiceNumber</span><span class="token interpolation-string interpolation" style="color: rgb(57, 58, 52);">}</span><span class="token interpolation-string" style="color: rgb(163, 21, 21);">"</span><span class="token" style="color: rgb(57, 58, 52);">)</span><span class="token" style="color: rgb(57, 58, 52);">;</span><span> </span><span>            Console</span><span class="token" style="color: rgb(57, 58, 52);">.</span><span class="token" style="color: rgb(57, 58, 52);">WriteLine</span><span class="token" style="color: rgb(57, 58, 52);">(</span><span class="token interpolation-string" style="color: rgb(163, 21, 21);">\)“Total: {total}”);

            Result.SaveAsSearchablePdf(“invoice_searchable.pdf”);
        }
    }
}

Tips and Best Practices

Use good-quality scans (300 DPI recommended).
Preprocess images to remove skew, noise, and improve contrast.
Restrict OCR to zones when possible to reduce false positives.
Reuse engine instances and cache results for frequent documents.
Validate parsed fields with regex or rules to catch OCR errors.

Build OCR Features in C# Using IronOCR

Build OCR Features in C# Using IronOCR

Prerequisites

Installation

Basic OCR: Extract Text from an Image

OCR from PDF

Extract Structured Data (e.g., Invoice Fields)

Improve Accuracy with Image Preprocessing

Language and Character Support

Handling Layout and Zones

Performance and Concurrency

Output Formats

Error Handling and Logging

Example: Console App Putting It Together

Tips and Best Practices

Further Reading

Comments

Leave a Reply Cancel reply

More posts

Gluten-Free Alternatives to Greyhound Cracker: Best Substitutes and Brands

Emulators Pack 1 Pro: Optimized Settings for Smooth Play

7 Ways to Get Started with SentiSight SDK Today

The Sandman: Dreams, Legends, and Nighttime Stories