Build OCR Features in C# Using IronOCR
Optical Character Recognition (OCR) enables applications to extract text from images and scanned documents. IronOCR is a popular C# library that simplifies OCR integration for .NET developers. This article shows how to set up IronOCR, implement common OCR features, and optimize results for accuracy and performance.
Prerequisites
- .NET 6 or later (the examples in this article assume it)
- Visual Studio or another C# IDE
- NuGet package manager
- Sample image or PDF to test
Installation
- Create or open a .NET project.
- Install IronOCR from NuGet:
bash
dotnet add package IronOcr
- Add the using directive to your code files:
csharp
using IronOcr;
Basic OCR: Extract Text from an Image
- Initialize the OCR engine and read an image:
csharp
var Ocr = new IronTesseract();
using (var Input = new OcrInput("invoice.jpg"))
{
    var Result = Ocr.Read(Input);
    Console.WriteLine(Result.Text);
}
- Explanation:
- IronTesseract wraps Tesseract with pre-configured settings.
- OcrInput accepts file paths, byte arrays, URLs, and streams.
OCR from PDF
csharp
var Ocr = new IronTesseract();
using (var Input = new OcrInput("scan.pdf"))
{
    Input.Deskew();          // optional preprocessing
    Input.EnhanceContrast(); // optional preprocessing
    var Result = Ocr.Read(Input);
    Console.WriteLine(Result.Text);
}
IronOCR handles multi-page PDFs and returns aggregated text.
Extract Structured Data (e.g., Invoice Fields)
- Use regular expressions or simple parsing against Result.Text:
csharp
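// Requires: using System.Text.RegularExpressions;
var text = Result.Text;
// \s* tolerates missing or repeated whitespace, which is common in OCR output.
var invoiceNumber = Regex.Match(text, @"Invoice\s*#?:?\s*(\w+)").Groups[1].Value;
// Accepts "Total: 12.50", "Total $12.50", and "Total: $1,234.50".
var total = Regex.Match(text, @"Total\s*:?\s*\$?\s*([\d,.]+)").Groups[1].Value;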
- For more robust extraction, combine OCR with template matching or ML-based parsers.
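The parsing step can be tried on its own, without an OCR engine, by feeding it a string that stands in for Result.Text. The following is a minimal sketch under stated assumptions: `InvoiceFieldParser` and its method names are hypothetical, the sample text is made up, and the patterns are illustrative rather than a general invoice grammar. It uses only the .NET base class library.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of IronOCR): extracts and validates two fields.
public static class InvoiceFieldParser
{
    private static readonly Regex InvoiceNumberPattern =
        new Regex(@"Invoice\s*#?\s*:?\s*(\w+)", RegexOptions.IgnoreCase);

    // \b keeps "Subtotal" from matching as "Total".
    private static readonly Regex TotalPattern =
        new Regex(@"\bTotal\s*:?\s*\$?\s*([\d,]+(?:\.\d{1,2})?)", RegexOptions.IgnoreCase);

    // Returns the captured invoice number, or null when the label is absent.
    public static string? FindInvoiceNumber(string text)
    {
        Match match = InvoiceNumberPattern.Match(text);
        return match.Success ? match.Groups[1].Value : null;
    }

    // Returns the total as a decimal, or null when it is absent or not numeric.
    public static decimal? FindTotal(string text)
    {
        Match match = TotalPattern.Match(text);
        if (!match.Success) return null;

        bool parsed = decimal.TryParse(
            match.Groups[1].Value,
            NumberStyles.Number,            // allows thousands separators and a decimal point
            CultureInfo.InvariantCulture,
            out decimal value);
        return parsed ? (decimal?)value : null;
    }

    public static void Main()
    {
        // Made-up text standing in for Result.Text.
        string ocrText = "ACME Corp\nInvoice #: INV1042\nTotal: $1,234.50\n";

        Console.WriteLine("Invoice #: " + (FindInvoiceNumber(ocrText) ?? "(not found)"));
        Console.WriteLine("Total: " + (FindTotal(ocrText)?.ToString(CultureInfo.InvariantCulture) ?? "(not found)"));
    }
}
```

Returning null instead of an empty string makes a missing field distinguishable from a field that OCR read as blank, which helps when deciding whether to send a document for manual review.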
Improve Accuracy with Image Preprocessing
- Deskew(), EnhanceContrast(), RemoveNoise(), Sharpen() are available on OcrInput:
csharp
Input.Deskew()
     .EnhanceContrast()
     .RemoveNoise()
     .Sharpen();
- Choose preprocessing steps empirically per document type.
Language and Character Support
- Set specific OCR language packs if needed:
csharp
Ocr.Language = OcrLanguage.English;
- For other languages, load appropriate language data via IronOCR settings or NuGet language packs.
Handling Layout and Zones
- Use regions to focus OCR on parts of an image:
csharp
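Input.AddImage("form.jpg");
// The way a region is attached to an input varies between IronOCR versions;
// confirm this call against the API reference for the version you install.
Input.AddRectangle(100, 200, 400, 100); // x, y, width, height
var Result = Ocr.Read(Input);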
- Process multiple zones separately to preserve structure.
Performance and Concurrency
- Reuse a single IronTesseract instance across calls to avoid reloading the recognition model each time.
- For high throughput, run OCR tasks in parallel, but cap the degree of parallelism to avoid CPU and I/O contention.
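One way to cap parallelism is `Parallel.ForEach` with `ParallelOptions.MaxDegreeOfParallelism`. The sketch below is library-agnostic: `BoundedOcrRunner` and the `recognize` delegate are hypothetical names, and the delegate is a placeholder for whatever wraps the real OCR call. Whether one engine instance can safely be shared across threads is an assumption to verify in the IronOCR documentation before wiring it in.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical helper: runs a recognizer over many files with a concurrency cap.
public static class BoundedOcrRunner
{
    // recognize is a placeholder for the real OCR call, for example a lambda
    // that reads one file and returns its recognized text.
    public static IReadOnlyDictionary<string, string> Run(
        IEnumerable<string> paths,
        Func<string, string> recognize,
        int maxParallelism)
    {
        var results = new ConcurrentDictionary<string, string>();
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxParallelism };

        // At most maxParallelism files are processed at the same time.
        Parallel.ForEach(paths, options, path =>
        {
            results[path] = recognize(path);
        });

        return results;
    }

    public static void Main()
    {
        var files = new[] { "a.png", "b.png", "c.png" };

        // Stand-in recognizer so the sketch runs without any OCR library.
        var texts = Run(files, path => "text of " + path, maxParallelism: 2);

        foreach (string file in files)
            Console.WriteLine(file + " -> " + texts[file]);
    }
}
```

A reasonable starting value for `maxParallelism` is the number of physical cores or fewer; measure throughput on representative documents before raising it.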
Output Formats
- Export OCR results to plain text, searchable PDF, or JSON:
csharp
File.WriteAllText("output.txt", Result.Text);
Result.SaveAsSearchablePdf("searchable.pdf");
Error Handling and Logging
- Catch exceptions and inspect Result.PageErrors or Result.Paragraphs for diagnostics.
csharp
try
{
    var Result = Ocr.Read(Input);
}
catch (Exception ex)
{
    Console.WriteLine(ex.Message);
}
Example: Console App Putting It Together
csharp
using System;
using System.Text.RegularExpressions;
using IronOcr;

class Program
{
    static void Main()
    {
        var Ocr = new IronTesseract();
        using (var Input = new OcrInput("invoice.jpg"))
        {
            // Clean up the scan before recognition.
            Input.Deskew().EnhanceContrast().RemoveNoise();
            var Result = Ocr.Read(Input);
            Console.WriteLine("Raw Text:\n" + Result.Text);

            // \s* tolerates missing or repeated whitespace in OCR output.
            var invoiceNumber = Regex.Match(Result.Text, @"Invoice\s*#?:?\s*(\w+)").Groups[1].Value;
            var total = Regex.Match(Result.Text, @"Total\s*:?\s*\$?\s*([\d,.]+)").Groups[1].Value;
            Console.WriteLine($"Invoice #: {invoiceNumber}");
            Console.WriteLine($"Total: {total}");

            Result.SaveAsSearchablePdf("invoice_searchable.pdf");
        }
    }
}
Tips and Best Practices
- Use good-quality scans (300 DPI recommended).
- Preprocess images to remove skew, noise, and improve contrast.
- Restrict OCR to zones when possible to reduce false positives.
- Reuse engine instances and cache results for frequent documents.
- Validate parsed fields with regex or rules to catch OCR errors.
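The caching tip can be sketched by keying recognized text on a hash of the document bytes, so that a file seen before skips recognition entirely. This is an illustrative sketch, not an IronOCR feature: `OcrResultCache` and the `recognize` delegate are hypothetical, the cache is in-memory and unbounded, and the delegate is a placeholder for the real OCR call.

```csharp
using System;
using System.Collections.Concurrent;
using System.Security.Cryptography;

// Hypothetical in-memory cache: maps a SHA-256 hash of the document to its text.
public sealed class OcrResultCache
{
    private readonly ConcurrentDictionary<string, string> _textByHash = new();
    private readonly Func<byte[], string> _recognize;

    // recognize is a placeholder for the real OCR call.
    public OcrResultCache(Func<byte[], string> recognize) => _recognize = recognize;

    public string GetText(byte[] documentBytes)
    {
        // Identical bytes always produce the same key, regardless of file name.
        string key = Convert.ToHexString(SHA256.HashData(documentBytes));

        // Under concurrent first access the factory may run more than once,
        // but only one result is stored.
        return _textByHash.GetOrAdd(key, _ => _recognize(documentBytes));
    }

    public static void Main()
    {
        int calls = 0;
        var cache = new OcrResultCache(bytes =>
        {
            calls++; // counts how often the (stand-in) recognizer actually runs
            return "recognized " + bytes.Length + " bytes";
        });

        cache.GetText(new byte[] { 1, 2, 3 });
        cache.GetText(new byte[] { 1, 2, 3 }); // same content, served from the cache

        Console.WriteLine("Recognizer calls: " + calls); // prints "Recognizer calls: 1"
    }
}
```

For long-running services, an eviction policy or a size limit would be needed on top of this, since the dictionary grows with every distinct document.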
Further Reading
- IronOCR official docs and API reference for advanced options (e.g., custom training, image filters).
- Tesseract OCR concepts for understanding recognition limits and language training.
This guide provides a practical starting point to add OCR features to C# apps using IronOCR. Adjust preprocessing and parsing strategies to your document types for best results.