How to Extract Tabular Data from PDF [Part 2]

Here is the second part of the article ‘How to Extract Tabular Data from PDF.’ In the first part, we covered key challenges and explained the core principles of getting data out of PDF tables. Today, we finish our analysis of six software tools that are most often used for that purpose and provide a big comparative table where each tool is rated according to its ability to parse PDF tables and correctly extract data from them.

Excalibur is a web interface to extract tabular data from PDFs. Tool overview:

  • Type of software available: web application, needs local setup
  • Platforms: any modern web browser; local setup runs on Linux, macOS, and Windows
  • Terms of use: free, open-source
  • Supported output formats: CSV, Excel, JSON, HTML
  • Notes: works only with text-based PDFs, not scanned documents

After uploading our sample file and parsing data from it via Excalibur, we got the following output:

Excalibur: the result of detection of tables in the sample document

Highlighted zones are the parts of the original file that Excalibur detected as tables. At this step, the data is captured correctly; extraneous elements such as headers are not selected.

After the extraction procedure, the tabular data preview shows only one error: in the first row of the first table, two adjacent cells were mistakenly merged. Compared with the previously reviewed tools, this is the output closest to the original file.

Excalibur: preview of extracted tabular data

Summary: Excalibur demonstrates the best result so far. It detects tables accurately, skips all non-tabular data, and handles multiline text in cells without problems. Its only mistake concerns the recognition of merged cells: unfortunately, their content gets scrambled.

OCR.space is a service that converts scans or (smartphone) images of text documents into editable files using Optical Character Recognition (OCR) technology. Tool overview:

  • Type of software available: online web application
  • Platforms: any modern web browser — all processing goes ‘in the Cloud’
  • Terms of use: free (up to 250 000 conversions) and paid ($20 per 100 000 conversions)
  • Supported output formats: TXT, JSON

For extracting data from tables, it is recommended to enable the ‘Table recognition’ option. After uploading our sample file and parsing data from it via OCR.space, we got the following output:
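Besides the web form, OCR.space exposes the same switch programmatically: its REST endpoint accepts an `isTable` form field. A minimal sketch of preparing such a request, assuming the endpoint URL and field names from the public OCR.space API docs (`build_payload` is a hypothetical helper, and the API key is a placeholder):

```python
# Sketch: build the form fields for OCR.space's /parse/image endpoint.
# Endpoint URL and 'isTable'/'filetype' field names follow the public
# OCR.space API docs; build_payload is a hypothetical helper.
OCR_URL = "https://api.ocr.space/parse/image"

def build_payload(api_key: str, table_mode: bool = True) -> dict:
    return {
        "apikey": api_key,
        "filetype": "pdf",
        # the API expects lowercase string booleans in form data
        "isTable": str(table_mode).lower(),
    }

# Sending the request (e.g. with the third-party 'requests' package):
# resp = requests.post(OCR_URL, data=build_payload("YOUR_KEY"),
#                      files={"file": open("sample.pdf", "rb")})
```

The commented-out upload step is left as a sketch because it needs a real API key and network access.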

OCR.space: preview of extracted tabular data

The extraction result is OCR'ed text sorted line by line, but without the typical table structure of rows and columns. Data inside the cells is mixed up with non-tabular content such as headers and page numbers. The output format resembles TSV (tab-separated values). Also, multiline text inside cells is split into separate rows.
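Since the output resembles TSV, it can at least be loaded row by row with Python's standard csv module by switching the delimiter to a tab. A minimal sketch with made-up sample lines (the real output depends on the document):

```python
import csv
import io

# Hypothetical sample mimicking OCR.space's line-by-line, tab-separated
# output; the actual content varies per document.
raw = "Name\tQty\tPrice\nWidget\t2\t3.50\nGadget\t1\t7.00\n"

# csv.reader with a tab delimiter splits each line into cells
rows = list(csv.reader(io.StringIO(raw), delimiter="\t"))
# rows[0] is the header row: ['Name', 'Qty', 'Price']
```

Note that this only recovers the line structure; merged headers, page numbers, and split multiline cells still have to be cleaned up by hand.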

Summary: The output document lacks a typical table structure: the data is presented as a sequence of text lines. All data from the tables is extracted but not cleaned of extraneous content such as headers and page numbers. OCR.space also failed to extract multiline text in cells correctly.

PDFTables is a cloud platform that allows users to accurately convert PDF tables to Excel, CSV, XML, or HTML without downloading any software. Tool overview:

  • Type of software available: online web application
  • Platforms: any modern web browser — all processing goes ‘in the Cloud’
  • Terms of use: free/paid (starting from $40 for 1000 pages) subscription plans
  • Supported output formats: Excel, CSV, XML, or HTML
  • Notes: allows converting multiple PDFs at once
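For batch conversion, PDFTables also offers a REST API: the PDF goes in a POST request to pdftables.com/api along with an API key and a target format. A sketch of building the request URL, assuming the format names from the PDFTables API docs (`build_url` is a hypothetical helper):

```python
from urllib.parse import urlencode

# Documented REST endpoint of the PDFTables service
PDFTABLES_API = "https://pdftables.com/api"

def build_url(api_key: str, fmt: str = "csv") -> str:
    # Output formats per the PDFTables API docs: csv, html, xml,
    # xlsx-single (one sheet) and xlsx-multiple (one sheet per page)
    allowed = {"csv", "html", "xml", "xlsx-single", "xlsx-multiple"}
    if fmt not in allowed:
        raise ValueError(f"unsupported format: {fmt}")
    return f"{PDFTABLES_API}?{urlencode({'key': api_key, 'format': fmt})}"

# The PDF file itself is sent in the multipart body of the POST request.
```

Rejecting unknown formats up front saves a round trip to the API, which would otherwise fail with an error response.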

After parsing the sample PDF file and extracting tabular data from it via PDFTables, we get the following result:

PDFTables: preview of extracted tabular data

The four separate tables of the original PDF are detected as one big table. We can also see that all headers are captured as table elements and included as additional, 'non-original' cells inside the table. Moreover, cells with multiline text are split into multiple table rows. Finally, cells in the first table of the sample file are mistakenly merged.

Summary: When it comes to getting tabular data out of PDF documents, PDFTables is the least effective tool. The only task it coped with was the correct detection of cells separated by small margins; it failed the other challenges of 'extra formatting'.

A comparative pivot table of software tools

In this study, we compared six software tools — Tabula, PDFTron, Amazon Textract, Excalibur, OCR.space, and PDFTables — by testing their core functions of parsing PDF tables and extracting data from them. We rated each tool on its ability to complete the following five tasks:

  • If there are multiple tables on a page — detect them all separately
  • If non-tabular data is on a page — skip it and do not include it in the extraction result
  • If any cell contains multiline text — it should not be split into multiple table rows
  • If there are cells spanning more than one row/column — they should be recognized correctly, at least separately from other cells
  • If there are small margins between non-tabular data and a table, or between different cells and their content inside the table — they should be recognized separately

If the extraction result shows that a task is completed successfully, the tool gets 1 point. If there are any mistakes or inconsistencies compared to the original PDF file's tabular data, it gets 0 points.
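The rubric above can be sketched as a tiny scoring function. The task names and the example results below are illustrative placeholders, not the actual measurements from the study:

```python
# Hypothetical reconstruction of the scoring scheme: 1 point per
# completed task, 0 otherwise. Task names are shortened labels for
# the five challenges listed above.
TASKS = [
    "separate tables",
    "skip non-tabular data",
    "keep multiline cells",
    "handle row/column spans",
    "small margins",
]

def score(results: dict) -> int:
    """Sum 1 point for every task the tool completed (True)."""
    return sum(1 for task in TASKS if results.get(task, False))

# Illustrative result sheet for some tool (not real data)
example = {
    "separate tables": True,
    "skip non-tabular data": True,
    "keep multiline cells": False,
    "handle row/column spans": False,
    "small margins": True,
}
# score(example) -> 3
```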

Here is a comparative pivot table with all the results:

A comparative pivot table with the results

1. Excalibur is the 'winner' of the study. It successfully coped with most of the 'extra formatting' challenges except for column & row spanning. Thus, it can be recommended as the #1 choice for extracting tabular data from PDFs.

2. Tabula and PDFTron demonstrated quite satisfactory results. While Tabula better identifies and excludes non-tabular data (headers and a page number), PDFTron better delimits cells with multiline text inside.

3. Amazon Textract and OCR.space got only 2 points out of 5. The extraction result provided by Amazon Textract shows a large amount of data loss and a scrambled table order. OCR.space's result includes mistakenly retained non-tabular data and multiline content split into multiple table rows.

4. PDFTables failed to complete most of the tasks. It was the only tool that did not recognize the four tables of the sample PDF file as separate ones. Moreover, the extraction result contains non-tabular content, split multiline text, and mistakenly merged cells.

5. The most challenging task was detecting column & row spanning correctly: none of the tools fully coped with it. If you have a complicated table with many cells spanning more than one row/column, you should look for other PDF data extraction software.

Follow UpsilonIT on Medium for the latest tech know-how, best software development practices, and insightful case studies from industry experts! Remember: your claps and comments are fuel for our future posts!

We help startups and small & medium businesses build software that matters.