Extract Text from PDFs with the Pdftools SDK
What is the Pdftools SDK?
The Pdftools SDK is a comprehensive PDF library that enables software developers to create, modify, convert, and validate PDF documents across .NET Core, C/C++, C#, Java, Python, and VB. It supports a wide range of functionalities including document assembly, PDF/A conversion, digital signatures, annotations, encryption, and content extraction. Designed for flexibility and performance, the SDK streamlines PDF handling for diverse applications, from generating ZUGFeRD invoices to powering AI workflows.
For more information, visit the Pdftools SDK Product Summary or see the Getting Started guides. Feel free to contact us with any questions.
Important Note: The code samples provided here are for demonstration and learning purposes. While we strive to maintain these examples, they are not officially supported. For production use, please refer to our official documentation.
Introduction
Extracting text from PDFs can unlock a wealth of possibilities, from building searchable archives to powering machine learning pipelines. This tutorial introduces PDFIngestor, a C# application that leverages the Pdftools SDK and Pdf Toolbox Add-On to extract and structure text from PDF files.
PDFIngestor demonstrates how to:
- Extract raw text from PDFs
- Convert extracted text into a structured format
- Retrieve and export PDF metadata
- Output results as JSON for indexing in search engines such as Elasticsearch or for feeding LLM RAG pipelines
Source Code
The full application is available on GitHub:
See also PDF Search Engine with Pdftools Conversion Service and Elasticsearch.
Why Extract PDF Text?
Extracting and structuring text from PDFs has several practical applications:
- Search Engine Indexing – Make PDF content searchable in Elasticsearch or other search platforms.
- Machine Learning Pipelines – Use structured PDF content to train models or provide data for Retrieval-Augmented Generation (RAG) in large language models (LLMs).
- Content Summarization – Extract relevant data from large PDF archives and generate concise summaries.
How PDFIngestor Works
PDFIngestor is a lightweight C# application that processes PDFs using the Pdftools SDK and Pdf Toolbox Add-On. It extracts text, metadata, and formatting information, outputting everything in JSON format.
The application uses the following namespaces from the Pdftools SDK:
using PdfTools.FourHeights.PdfToolbox;
using PdfTools.FourHeights.PdfToolbox.Geometry.Real;
using PdfTools.FourHeights.PdfToolbox.Pdf.Content;
using Document = PdfTools.FourHeights.PdfToolbox.Pdf.Document;
Key Features
- Text Extraction – Extracts plain and structured text from PDFs.
- Metadata Retrieval – Captures information such as author, conformance level, and fonts used.
- JSON Output – Converts extracted data to JSON for easy indexing or further processing.
- Integration Ready – Directly integrates with Elasticsearch or other data storage systems.
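To make the JSON Output feature more concrete, here is a minimal sketch of how extracted text and metadata might be serialized with System.Text.Json. The record types `PdfRecord` and `PageText`, their property names, and the `JsonExport` helper are hypothetical and chosen for illustration only; PDFIngestor's actual output schema may differ, so check the repository for the real format.

```csharp
#nullable enable
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical output model; PDFIngestor's real schema may differ.
public record PageText(int PageNumber, string Text);

public record PdfRecord(
    string FileName,
    string? Author,
    string? Conformance,
    IReadOnlyList<string> Fonts,
    IReadOnlyList<PageText> Pages);

public static class JsonExport
{
    private static readonly JsonSerializerOptions Options = new()
    {
        // Emit camelCase property names, e.g. "fileName" and "pageNumber".
        PropertyNamingPolicy = JsonNamingPolicy.CamelCase,
        WriteIndented = true
    };

    // Serializes one extracted document to a JSON string.
    public static string ToJson(PdfRecord record) =>
        JsonSerializer.Serialize(record, Options);
}
```

Calling `JsonExport.ToJson(...)` on a record with one page yields an object with `fileName`, `author`, `conformance`, `fonts`, and a `pages` array. Keeping text per page rather than as one blob is a design choice that makes it easier to point search hits or RAG citations back to a page number.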
Use Cases
- Enterprise Document Management – Process large volumes of documents and make them easily searchable.
- Legal and Compliance – Extract key clauses, terms, and metadata from PDFs.
- AI/ML Pipelines – Provide PDF content as input to AI models for summarization, question answering, or data augmentation.
Building and Running PDFIngestor
To build and run the application, follow the step-by-step guide available in the GitHub repository:
This guide covers prerequisites, configuring the Pdftools SDK, adding license keys, and running the application in different modes (watch/execute).
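As a rough illustration of what the two modes could look like, the sketch below processes a folder once (execute) or reacts to newly created PDFs (watch). It is not taken from the PDFIngestor source: the `ModeSketch` class, the `RunMode` enum, the `watch` argument, and the `processPdf` callback are all assumptions, and the real command-line syntax is documented in the repository guide.

```csharp
using System;
using System.IO;

// Hypothetical names; the real application may structure its modes differently.
public enum RunMode { Execute, Watch }

public static class ModeSketch
{
    // Maps the first command-line argument to a mode; defaults to Execute.
    public static RunMode ParseMode(string[] args) =>
        args.Length > 0 && args[0].Equals("watch", StringComparison.OrdinalIgnoreCase)
            ? RunMode.Watch
            : RunMode.Execute;

    // Execute mode: process every PDF currently in the folder once.
    // Returns the number of files handed to the callback.
    public static int ProcessExisting(string folder, Action<string> processPdf)
    {
        int count = 0;
        foreach (string path in Directory.EnumerateFiles(folder, "*.pdf"))
        {
            processPdf(path);
            count++;
        }
        return count;
    }

    // Watch mode: invoke the callback for each PDF created in the folder.
    // The caller keeps the returned watcher alive and disposes it on shutdown.
    public static FileSystemWatcher StartWatching(string folder, Action<string> processPdf)
    {
        var watcher = new FileSystemWatcher(folder, "*.pdf");
        watcher.Created += (_, e) => processPdf(e.FullPath);
        watcher.EnableRaisingEvents = true;
        return watcher;
    }
}
```

One caveat with the watch variant: the `Created` event can fire before a large file has been fully written, so a production version would typically retry opening the file or wait until it is no longer locked.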
Getting Started with Pdftools SDK
PDFIngestor requires the Pdftools SDK and Pdf Toolbox Add-On. These tools provide a comprehensive suite for PDF manipulation and text extraction.
Example Workflow
- Convert incoming documents to PDF using Pdftools Conversion Service.
- Use PDFIngestor to extract text and metadata.
- Index extracted data into Elasticsearch.
- Power search engines, AI pipelines, or reporting systems with structured PDF data.
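The indexing step in this workflow can be sketched as building a request body for Elasticsearch's bulk API, which expects newline-delimited JSON: one action line followed by one document line, each terminated by a newline. The `BulkBody` helper, the index name, and the document IDs below are hypothetical; this is one possible way to prepare the payload, not necessarily how PDFIngestor or the related project does it.

```csharp
using System.Collections.Generic;
using System.Text;
using System.Text.Json;

public static class BulkBody
{
    // Builds an Elasticsearch _bulk request body (NDJSON).
    // Each input pair is a document ID and that document's JSON text.
    public static string Build(string indexName, IEnumerable<(string Id, string Json)> docs)
    {
        var sb = new StringBuilder();
        foreach (var (id, json) in docs)
        {
            // Action line: tells Elasticsearch where to index the next line.
            sb.Append(JsonSerializer.Serialize(
                new { index = new { _index = indexName, _id = id } }));
            sb.Append('\n');

            // Source line: re-serialize compactly so indented JSON
            // collapses to a single line, as NDJSON requires.
            using JsonDocument parsed = JsonDocument.Parse(json);
            sb.Append(JsonSerializer.Serialize(parsed.RootElement));
            sb.Append('\n');
        }
        return sb.ToString();
    }
}
```

The resulting string would be sent as the body of a `POST` to the `_bulk` endpoint with the `Content-Type: application/x-ndjson` header. Batching documents this way usually reduces round trips compared with indexing each PDF in a separate request.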
Related Projects
For a complete PDF search engine using Pdftools Conversion Service, Elasticsearch, and PDFIngestor, check out: