Extract Text from PDFs with the Pdftools SDK

What is the Pdftools SDK?

The Pdftools SDK is a comprehensive PDF library that enables software developers to create, modify, convert, and validate PDF documents across .NET Core, C/C++, C#, Java, Python, and VB. It supports a wide range of functionalities including document assembly, PDF/A conversion, digital signatures, annotations, encryption, and content extraction. Designed for flexibility and performance, the SDK streamlines PDF handling for diverse applications, from generating ZUGFeRD invoices to powering AI workflows.


For more information, visit the Pdftools SDK Product Summary or see the Getting Started guides. If you have questions, feel free to contact us.

Important Note: The code samples provided here are for demonstration and learning purposes. While we strive to maintain these examples, they are not officially supported. For production use, please refer to our official documentation.

Introduction

Extracting text from PDFs can unlock a wealth of possibilities, from building searchable archives to powering machine learning pipelines. This tutorial introduces PDFIngestor, a C# application that leverages the Pdftools SDK and Pdf Toolbox Add-On to extract and structure text from PDF files.

[Image: PDF Text Extraction with the Pdftools SDK code sample]

PDFIngestor demonstrates how to:

Source Code

The full application is available on GitHub:

See also PDF Search Engine with Pdftools Conversion Service and Elasticsearch.

Why Extract PDF Text?

Extracting and structuring text from PDFs has several practical applications:

How PDFIngestor Works

PDFIngestor is a lightweight C# application that processes PDFs using the Pdftools SDK and Pdf Toolbox Add-On. It extracts text, metadata, and formatting information, outputting everything in JSON format.
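To make the JSON output step more concrete, here is a minimal sketch of how extracted text, metadata, and formatting information could be modelled and serialized with System.Text.Json. The record types, property names, and sample values (ExtractedDocument, TextFragment, invoice.pdf, and so on) are assumptions made for illustration; they are not PDFIngestor's actual schema, and the sketch makes no Pdftools SDK calls.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical output model: names and fields are illustrative only,
// not the schema that PDFIngestor actually emits.
public record TextFragment(
    string Text, int Page, string Font, double FontSize, double X, double Y);

public record ExtractedDocument(
    string FileName,
    IReadOnlyDictionary<string, string> Metadata,
    IReadOnlyList<TextFragment> Fragments);

public static class JsonOutputSketch
{
    // Serializes one extracted document as indented JSON with camelCase property names.
    // Dictionary keys (the metadata names) are written unchanged.
    public static string ToJson(ExtractedDocument doc) =>
        JsonSerializer.Serialize(doc, new JsonSerializerOptions
        {
            PropertyNamingPolicy = JsonNamingPolicy.CamelCase,
            WriteIndented = true
        });

    public static void Main()
    {
        // Sample values are made up for the example.
        var doc = new ExtractedDocument(
            "invoice.pdf",
            new Dictionary<string, string>
            {
                ["Title"] = "Invoice 2024-001",
                ["Author"] = "Example Corp"
            },
            new[]
            {
                new TextFragment("Total: 100.00 EUR", 1, "Helvetica", 11.0, 72.0, 540.5)
            });

        Console.WriteLine(ToJson(doc));
    }
}
```

Keeping the position and font alongside each text fragment is one possible design; it lets downstream consumers rebuild reading order or detect headings without reopening the PDF.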

The application uses the following namespaces from the Pdftools SDK:

using PdfTools.FourHeights.PdfToolbox;
using PdfTools.FourHeights.PdfToolbox.Geometry.Real;
using PdfTools.FourHeights.PdfToolbox.Pdf.Content;
using Document = PdfTools.FourHeights.PdfToolbox.Pdf.Document;

Key Features

Use Cases

Building and Running PDFIngestor

To build and run the application, follow the step-by-step guide available in the GitHub repository:

This guide covers prerequisites, configuring the Pdftools SDK, adding license keys, and running the application in different modes (watch/execute).

Getting Started with Pdftools SDK

PDFIngestor requires the Pdftools SDK and Pdf Toolbox Add-On. These tools provide a comprehensive suite for PDF manipulation and text extraction.

Example Workflow

  1. Convert incoming documents to PDF using Pdftools Conversion Service.
  2. Use PDFIngestor to extract text and metadata.
  3. Index extracted data into Elasticsearch.
  4. Power search engines, AI pipelines, or reporting systems with structured PDF data.
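Step 3 of the workflow above can be sketched as follows. This is a minimal example under stated assumptions, not the code used by PDFIngestor or the search engine sample: the PageText record, the index name, the document ID scheme, and the base URL passed to SendAsync are all hypothetical. It relies only on the Elasticsearch bulk API format, which is newline-delimited JSON with one action line followed by one source line per document.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

// Hypothetical shape of one extracted page; not PDFIngestor's actual output type.
public record PageText(string FileName, int Page, string Text);

public static class BulkIndexSketch
{
    // Builds a request body for the Elasticsearch _bulk endpoint.
    // Each document becomes two lines: an "index" action and the document source.
    // The body must end with a newline, which the final Append('\n') guarantees.
    public static string BuildBulkBody(string indexName, IEnumerable<PageText> pages)
    {
        var sb = new StringBuilder();
        foreach (var p in pages)
        {
            // Assumed ID scheme: file name plus page number.
            var action = new { index = new { _index = indexName, _id = $"{p.FileName}#{p.Page}" } };
            var source = new { fileName = p.FileName, page = p.Page, text = p.Text };
            sb.Append(JsonSerializer.Serialize(action)).Append('\n');
            sb.Append(JsonSerializer.Serialize(source)).Append('\n');
        }
        return sb.ToString();
    }

    // Posts the body to <baseUrl>/_bulk. The base URL is an assumption supplied by the caller.
    // A success status only means the request was accepted; individual items can still
    // fail, so a real client would also inspect the "errors" field of the response.
    public static async Task<bool> SendAsync(HttpClient http, string baseUrl, string body)
    {
        using var content = new StringContent(body, Encoding.UTF8, "application/x-ndjson");
        using var response = await http.PostAsync($"{baseUrl}/_bulk", content);
        return response.IsSuccessStatusCode;
    }
}
```

A call such as BulkIndexSketch.BuildBulkBody("pdf-text", pages) yields the request body, which can then be passed to SendAsync together with an HttpClient and the address of your own Elasticsearch instance. Indexing one document per page is a design choice for the sketch; indexing one document per file works equally well if page-level hits are not needed.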

For a complete PDF search engine using Pdftools Conversion Service, Elasticsearch, and PDFIngestor, check out: