Intelligent Document Data Extractor

DExtr (Data Extractor) is a new, progressive tool for automatically extracting valuable data from various document classes: invoices, bills, contracts, payslips and more.


  • Intelligent document classification
  • Intelligent data recognition
  • Intelligent data extractors
  • Rich Document Representation & API


  • Flexible document layout
  • Flexible extraction flow
  • Flexible data output format
  • Powerful toolset

How it works

Load your documents

DExtr can handle both text documents and images

Data recognition

DExtr provides a set of recognizers: word, date, number, money, address and more
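
To give a feel for what a recognizer does, here is a minimal Java sketch. It is not the DExtr API: the class `RecognizerSketch`, the `Entity` type and both regular expressions are assumptions made for illustration, and they cover only ISO-style dates and simple EUR/USD amounts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; names and patterns are illustrative assumptions, not the DExtr API.
public class RecognizerSketch {

    // A recognized entity: its kind, the matched text and its offset in the input.
    public static final class Entity {
        public final String kind;
        public final String text;
        public final int start;

        Entity(String kind, String text, int start) {
            this.kind = kind;
            this.text = text;
            this.start = start;
        }
    }

    // Assumed formats: dates as yyyy-MM-dd, money as digits with two decimals plus EUR or USD.
    private static final Pattern DATE = Pattern.compile("\\b\\d{4}-\\d{2}-\\d{2}\\b");
    private static final Pattern MONEY = Pattern.compile("\\b\\d+\\.\\d{2} (?:EUR|USD)\\b");

    // Runs one pattern over the text and collects every match as an entity.
    static List<Entity> recognize(String kind, Pattern pattern, String text) {
        List<Entity> found = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            found.add(new Entity(kind, m.group(), m.start()));
        }
        return found;
    }

    // Applies all recognizers to the same text: dates first, then money amounts.
    public static List<Entity> recognizeAll(String text) {
        List<Entity> all = new ArrayList<>();
        all.addAll(recognize("date", DATE, text));
        all.addAll(recognize("money", MONEY, text));
        return all;
    }
}
```

Calling `RecognizerSketch.recognizeAll("Invoice date 2016-03-01, total 120.50 EUR")` would yield one date entity and one money entity. A production recognizer would need to handle many more formats and locales than these two patterns do.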

Extract valuable data

DExtr extracts valuable data with built-in operations: filter, merge, sort, validate and more
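
The operations named above can be pictured as a small pipeline. The following Java sketch is an assumption about how such steps might combine, not the DExtr API: `ExtractionFlowSketch`, its method names and the rule "the total must equal the sum of the line items" are all hypothetical.

```java
import java.math.BigDecimal;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch of filter / merge / sort / validate steps; not the DExtr API.
public class ExtractionFlowSketch {

    // Merges two candidate lists, keeps positive amounts, drops duplicates and sorts descending.
    // Note: BigDecimal.equals is scale-sensitive, so 120.5 and 120.50 count as different values.
    public static List<BigDecimal> candidates(List<BigDecimal> fromText, List<BigDecimal> fromOcr) {
        return Stream.concat(fromText.stream(), fromOcr.stream())   // merge
                .filter(a -> a.signum() > 0)                          // filter
                .distinct()
                .sorted(Comparator.reverseOrder())                    // sort
                .collect(Collectors.toList());
    }

    // Validate: accepts the largest candidate as the total only if it equals the sum of the line items.
    public static Optional<BigDecimal> validatedTotal(List<BigDecimal> candidates, List<BigDecimal> lineItems) {
        BigDecimal sum = lineItems.stream().reduce(BigDecimal.ZERO, BigDecimal::add);
        return candidates.stream().findFirst().filter(c -> c.compareTo(sum) == 0);
    }
}
```

The validation step returns an empty `Optional` instead of a guess when the numbers do not add up, which keeps doubtful values out of the output.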

Use output

Output results can be presented in any desired data format

Automate your document processes

Does your company handle a lot of documents?
Does it involve manual labour?
DExtr helps you improve your processes, reduce manual work and cut your costs by:

  • Extracting valuable data from your documents and providing it for further processing
  • Performing intelligent classification of different types of documents
  • Significantly reducing document processing time

Load your Big Data

Looking for new data sources?
Searching for an intelligent data mining tool?
DExtr provides a set of professional tools that allow you to:

  • Reveal semantics in text documents
  • Provide document structure and OCR text recovery
  • Improve natural language processing and named entity recognition

Reveal semantics in unstructured and non-semantic data

Text reconstruction. Semantic analysis.

  • Dynamic Entity recognition
  • Natural language processing
  • Advanced pattern matching algorithms
  • Page layout vs. text stream dualism



A typical document is processed in under one second. DExtr is built on the Java 8 platform, which provides the necessary performance and tooling. The DExtr engine uses efficiently implemented extraction strategies and parallel recognition to boost performance.
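
As an illustration of parallel recognition on Java 8, independent recognizers can be run over the same text with a parallel stream. This is a sketch under assumptions, not DExtr's engine: the class `ParallelRecognitionSketch` and the idea of modelling a recognizer as a `Function<String, List<String>>` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical sketch of parallel recognition; not the DExtr engine.
public class ParallelRecognitionSketch {

    // Wraps a regular expression as a recognizer that returns all matches in the text.
    // Pattern is thread-safe; a fresh Matcher is created on every call.
    public static Function<String, List<String>> regexRecognizer(String regex) {
        Pattern pattern = Pattern.compile(regex);
        return text -> {
            List<String> matches = new ArrayList<>();
            Matcher m = pattern.matcher(text);
            while (m.find()) {
                matches.add(m.group());
            }
            return matches;
        };
    }

    // Runs independent recognizers on the same text in parallel.
    // The collected results keep the order of the recognizer list.
    public static List<String> recognizeInParallel(String text,
                                                   List<Function<String, List<String>>> recognizers) {
        return recognizers.parallelStream()
                .flatMap(r -> r.apply(text).stream())
                .collect(Collectors.toList());
    }
}
```

This only pays off when each recognizer does enough work to outweigh the scheduling overhead; for a handful of cheap patterns a sequential loop may be just as fast.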

Ease of Integration

DExtr functionality can be exposed to external or internal systems through web services. It is also available as a command line tool and as the DExtr tool library for direct batch processing. This allows straightforward integration with other systems.

Intelligent Classification

Classification is an essential part of the data extraction process and leads to better extraction results. DExtr can use machine learning techniques to classify documents based on recognized entities.
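
One simple way to classify documents by their recognized entities is a nearest-centroid classifier over entity counts. The Java sketch below is only an illustration of that general idea; the source does not say which techniques DExtr uses, and the class `EntityClassifierSketch` and its methods are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical nearest-centroid classifier over entity counts; an illustration, not DExtr's method.
public class EntityClassifierSketch {

    // Per-class mean of entity counts, e.g. "invoice" -> {"money": 4.0, "date": 1.0}.
    private final Map<String, Map<String, Double>> centroids = new HashMap<>();

    // Learns one centroid for a class from its training documents (entity kind -> count).
    public void train(String label, List<Map<String, Integer>> examples) {
        Map<String, Double> mean = new HashMap<>();
        for (Map<String, Integer> example : examples) {
            for (Map.Entry<String, Integer> e : example.entrySet()) {
                mean.merge(e.getKey(), e.getValue() / (double) examples.size(), Double::sum);
            }
        }
        centroids.put(label, mean);
    }

    // Returns the label whose centroid is closest (squared Euclidean distance), or null if untrained.
    public String classify(Map<String, Integer> entityCounts) {
        String best = null;
        double bestDistance = Double.MAX_VALUE;
        for (Map.Entry<String, Map<String, Double>> c : centroids.entrySet()) {
            double d = distance(entityCounts, c.getValue());
            if (d < bestDistance) {
                bestDistance = d;
                best = c.getKey();
            }
        }
        return best;
    }

    // Squared Euclidean distance over the union of entity kinds; missing kinds count as zero.
    private static double distance(Map<String, Integer> doc, Map<String, Double> centroid) {
        Set<String> kinds = new HashSet<>(doc.keySet());
        kinds.addAll(centroid.keySet());
        double sum = 0.0;
        for (String kind : kinds) {
            double diff = doc.getOrDefault(kind, 0) - centroid.getOrDefault(kind, 0.0);
            sum += diff * diff;
        }
        return sum;
    }
}
```

With an "invoice" class trained on documents rich in money entities and a "contract" class trained on documents rich in addresses, a new document with several money entities lands closer to the invoice centroid. Real document sets would likely call for feature scaling and a stronger model.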


A highly expressive domain-specific API and scripting language make the extraction and classification process configurable. DExtr provides Document Viewer (DView), a visual desktop application for document visualization, extraction configuration and debugging.

Rich Representation

Textual and visual document representations are used simultaneously to extract data, as each bit of information helps to recognize important patterns.

Advanced Extraction

State-of-the-art recognition algorithms detect relevant data, enabling DExtr to outperform template-based solutions.

DExtr Document Viewer

Document Viewer (DView) is a desktop application for:

  • Document visualization
  • Configuration of data extraction
  • Entity inspection

DView is part of the DExtr tools and supports DExtr in the process of document data extraction. DExtr supports standard digital document formats such as PDF, DOC, TXT, XML and hOCR, and can read the output of major OCR engines such as ABBYY and Tesseract.
