Build OCR Features in C# Using IronOCR

Build OCR Features in C# Using IronOCR

Optical Character Recognition (OCR) enables applications to extract text from images and scanned documents. IronOCR is a popular C# library that simplifies OCR integration for .NET developers. This article shows how to set up IronOCR, implement common OCR features, and optimize results for accuracy and performance.

Prerequisites

  • .NET 6 or later (reasonable default)
  • Visual Studio or another C# IDE
  • NuGet package manager
  • Sample image or PDF to test

Installation

  1. Create or open a .NET project.
  2. Install IronOCR from NuGet:

bash

dotnet add package IronOcr
  1. Add using directive in your code files:

csharp

using IronOcr;

Basic OCR: Extract Text from an Image

  1. Initialize the OCR engine and read an image:

csharp

var Ocr = new IronTesseract(); using (var Input = new OcrInput(“invoice.jpg”)) { var Result = Ocr.Read(Input); Console.WriteLine(Result.Text); }
  1. Explanation:
  • IronTesseract wraps Tesseract with pre-configured settings.
  • OcrInput accepts file paths, byte arrays, URLs, and streams.

OCR from PDF

csharp

var Ocr = new IronTesseract(); using (var Input = new OcrInput(“scan.pdf”)) { Input.Deskew(); // optional preprocessing Input.EnhanceContrast(); // optional preprocessing var Result = Ocr.Read(Input); Console.WriteLine(Result.Text); }

IronOCR handles multi-page PDFs and returns aggregated text.

Extract Structured Data (e.g., Invoice Fields)

  1. Use regular expressions or simple parsing against Result.Text:

csharp

var text = Result.Text; var invoiceNumber = Regex.Match(text, @“Invoice\s#?:?\s(\w+)”).Groups[1].Value; var total = Regex.Match(text, @“Total\s[:\$]?\s([\d,.]+)”).Groups[1].Value;
  1. For more robust extraction, combine OCR with template matching or ML-based parsers.

Improve Accuracy with Image Preprocessing

  • Deskew(), EnhanceContrast(), RemoveNoise(), Sharpen() are available on OcrInput:

csharp

Input.Deskew() .EnhanceContrast() .RemoveNoise() .Sharpen();
  • Choose preprocessing steps empirically per document type.

Language and Character Support

  • Set specific OCR language packs if needed:

csharp

Ocr.Language = OcrLanguage.English;
  • For other languages, load appropriate language data via IronOCR settings or NuGet language packs.

Handling Layout and Zones

  • Use regions to focus OCR on parts of an image:

csharp

Input.AddImage(“form.jpg”); Input.AddRectangle(100, 200, 400, 100); // x, y, width, height var Result = Ocr.Read(Input);
  • Process multiple zones separately to preserve structure.

Performance and Concurrency

  • Reuse IronTesseract instance across calls to reduce model load time.
  • For high throughput, run OCR tasks in parallel but limit degree of parallelism to avoid CPU/IO contention.

Output Formats

  • Export OCR results to plain text, searchable PDF, or JSON:

csharp

File.WriteAllText(“output.txt”, Result.Text); Result.SaveAsSearchablePdf(“searchable.pdf”);

Error Handling and Logging

  • Catch exceptions and inspect Result.PageErrors or Result.Paragraphs for diagnostics.

csharp

try { var Result = Ocr.Read(Input); } catch (Exception ex) { Console.WriteLine(ex.Message); }

Example: Console App Putting It Together

csharp

using System; using IronOcr; using System.Text.RegularExpressions; class Program { static void Main() { var Ocr = new IronTesseract(); using (var Input = new OcrInput(“invoice.jpg”)) { Input.Deskew().EnhanceContrast().RemoveNoise(); var Result = Ocr.Read(Input); Console.WriteLine(“Raw Text:\n” + Result.Text); var invoiceNumber = Regex.Match(Result.Text, @“Invoice\s#?:?\s(\w+)”).Groups[1].Value; var total = Regex.Match(Result.Text, @“Total\s[:\$]?\s([\d,.]+)”).Groups[1].Value; Console.WriteLine(\("Invoice #: </span><span class="token interpolation-string interpolation" style="color: rgb(57, 58, 52);">{</span><span class="token interpolation-string interpolation expression language-csharp">invoiceNumber</span><span class="token interpolation-string interpolation" style="color: rgb(57, 58, 52);">}</span><span class="token interpolation-string" style="color: rgb(163, 21, 21);">"</span><span class="token" style="color: rgb(57, 58, 52);">)</span><span class="token" style="color: rgb(57, 58, 52);">;</span><span> </span><span> Console</span><span class="token" style="color: rgb(57, 58, 52);">.</span><span class="token" style="color: rgb(57, 58, 52);">WriteLine</span><span class="token" style="color: rgb(57, 58, 52);">(</span><span class="token interpolation-string" style="color: rgb(163, 21, 21);">\)“Total: {total}); Result.SaveAsSearchablePdf(“invoice_searchable.pdf”); } } }

Tips and Best Practices

  • Use good-quality scans (300 DPI recommended).
  • Preprocess images to remove skew, noise, and improve contrast.
  • Restrict OCR to zones when possible to reduce false positives.
  • Reuse engine instances and cache results for frequent documents.
  • Validate parsed fields with regex or rules to catch OCR errors.

Further Reading

  • IronOCR official docs and API reference for advanced options (e.g., custom training, image filters).
  • Tesseract OCR concepts for understanding recognition limits and language training.

This guide provides a practical starting point to add OCR features to C# apps using IronOCR. Adjust preprocessing and parsing strategies to your document types for best results.

Comments

Leave a Reply

Your email address will not be published. Required fields are marked *