docling vs GROBID
Issue: Comparing GROBID and Docling for Parsing Scholarly Publications
My Use Case
We need to parse and extract all relevant information from thousands of scholarly publications: metadata, full text, titles, references, tables, and more. For this purpose, we have been using GROBID, an open-source library for extracting structured information from scientific and scholarly documents, especially PDFs. GROBID has been used in the Semantic Scholar Open Research Corpus (S2ORC), a massive dataset of scholarly articles maintained by the Allen Institute for AI (AI2), and it has proven robust across different document layouts and complexities, for us as well!
Recently, I came across Docling and decided to evaluate its performance and output quality to understand its potential advantages and limitations compared to GROBID.
Experiment Setup
To compare GROBID and Docling fairly, I conducted an experiment using the same 10 randomly selected scholarly PDFs, each larger than 30 KB. Here is a summary of the setup:
- Environment Setup:
  - Both GROBID and Docling were installed locally so that they could use the available GPU (NVIDIA CUDA).
  - GROBID ran in a Docker container, while Docling was installed in a Python virtual environment with GPU acceleration enabled (I think, but I have not verified it).
- Data Selection:
  - Ten random PDFs larger than 30 KB were selected from a dataset of scholarly publications, covering a range of lengths and complexities, with a focus on documents rich in metadata, references, and tables.
- Conversion Process:
  - GROBID: The `/api/processFulltextDocument` endpoint was used to extract metadata, full text, references, and other elements. Results were saved in XML format, and performance metrics, such as processing time per document, were recorded.
  - Docling: Its Python API was used with OCR and table structure detection enabled. Docling processed each PDF, exporting the results in JSON, Markdown, and other formats. Performance metrics were also captured.
- Output and Comparison:
  - Two separate output directories were created: one for GROBID's results (`converted_output_grobid`) and one for Docling's results (`converted_output_docling`).
  - A consolidated JSON file was generated to compare both sets of results, including processing time and any errors encountered.
  - The results will also be evaluated by a frontier large language model (LLM) to assess parsing quality and give a more nuanced picture of how each library handles different elements (e.g., metadata, references, tables).
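For reference, the timing and consolidation steps above can be sketched roughly as follows. This is a minimal, stdlib-only sketch rather than my exact script: `run_benchmark`, the `converters` mapping, the `.out` file suffix, and the file name `comparison_results.json` are hypothetical, and each converter is assumed to be a callable that takes a PDF path and returns the converted text.

```python
import json
import time
from pathlib import Path
from typing import Callable, Dict, List


def run_benchmark(
    pdf_paths: List[Path],
    converters: Dict[str, Callable[[Path], str]],
    output_root: Path,
) -> List[dict]:
    """Time each converter on each PDF and collect one record per run.

    `converters` maps a tool name (e.g. "grobid", "docling") to a callable
    that takes a PDF path and returns the converted text. The callables are
    placeholders; the real GROBID/Docling calls would go inside them.
    """
    output_root.mkdir(parents=True, exist_ok=True)
    records = []
    for name, convert in converters.items():
        # One output directory per tool, e.g. converted_output_grobid.
        out_dir = output_root / f"converted_output_{name}"
        out_dir.mkdir(parents=True, exist_ok=True)
        for pdf in pdf_paths:
            record = {"tool": name, "file": pdf.name, "error": None}
            start = time.perf_counter()
            try:
                text = convert(pdf)
                (out_dir / f"{pdf.stem}.out").write_text(text, encoding="utf-8")
            except Exception as exc:  # record the failure and keep going
                record["error"] = f"{type(exc).__name__}: {exc}"
            record["seconds"] = round(time.perf_counter() - start, 3)
            records.append(record)
    # One consolidated file so both tools can be compared side by side.
    (output_root / "comparison_results.json").write_text(
        json.dumps(records, indent=2), encoding="utf-8"
    )
    return records
```

Note that the measured time includes writing the output file, and for Docling a first run may also include model loading, so it is probably fairer to time a warm-up document separately.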
Questions and Observations
My early experiments show that GROBID is significantly faster than Docling. I suspect Docling uses the GPU less effectively, which may explain its much longer processing times.
Given this, I have the following questions:
- Is My Setup Fair and Correct?
  - Are GROBID and Docling configured correctly for a fair comparison?
  - Are there any changes or optimizations that I should consider to ensure both libraries run optimally?
- Can I Optimize GPU Usage for Docling?
  - Docling seems to be using my GPU less efficiently than GROBID. Are there any parameters or configurations that could improve its performance?
- Does Docling Focus on Different Use Cases?
  - It seems that Docling may be optimized for use cases other than academic document parsing, such as enterprise data extraction. I would appreciate it if the developers could clarify:
    - Should Docling be expected to perform well on scholarly documents?
    - Are there any specific settings or adjustments I should make to give Docling a more equitable opportunity in this comparison?
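One check I can run on my side for the GPU question: as far as I understand, Docling's models run on PyTorch, so it is worth confirming that the PyTorch build in the virtual environment actually sees the GPU. A minimal sketch, assuming `torch` is the relevant backend; `describe_torch_device` is a hypothetical helper name.

```python
def describe_torch_device() -> dict:
    """Report whether PyTorch is installed and whether it can see a CUDA GPU."""
    info = {"torch_installed": False, "cuda_available": False, "device_name": None}
    try:
        import torch  # imported lazily so the check also runs without PyTorch
    except ImportError:
        return info
    info["torch_installed"] = True
    info["torch_version"] = torch.__version__
    if torch.cuda.is_available():
        info["cuda_available"] = True
        info["device_name"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    print(describe_torch_device())
```

If `cuda_available` comes back `False`, the models would be running on the CPU, which could by itself explain the slow run. As far as I know, a plain `pip install torch` on Windows installs a CPU-only build, so this is worth ruling out first.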
Additional Context
- System Specifications:
  - Operating System: Windows 11 Pro N 64-bit
  - CPU: AMD Ryzen 7 3700X (8-Core, 7nm Technology)
  - RAM: 96.0 GB Dual-Channel DDR4 @ 1052 MHz
  - Graphics: NVIDIA GeForce RTX 4090 (MSI)
  - Motherboard: Gigabyte Technology Co., Ltd. X570 AORUS ELITE (AM4)
  - Storage: Multiple SSDs with significant free space (see attached screenshot for details)
- GROBID Docker Command: `docker run --name GROBID --rm --gpus all --init --ulimit core=0 -p 9070:8070 -p 9081:8071 grobid/grobid:0.8.0`
- Docling Command: Python API with GPU support enabled (but maybe not optimally?).
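For completeness, this is roughly how each PDF is sent to the GROBID container started by the command above. It is a stdlib-only sketch, not my exact script: port 9070 comes from the `-p 9070:8070` mapping, the multipart field name `input` is what GROBID's REST API expects as far as I know, and the helper names `build_multipart` and `grobid_fulltext` are hypothetical.

```python
import urllib.request
import uuid
from pathlib import Path

# Host port 9070 is mapped to GROBID's port 8070 inside the container.
GROBID_URL = "http://localhost:9070/api/processFulltextDocument"


def build_multipart(field: str, filename: str, payload: bytes) -> tuple:
    """Encode one file as multipart/form-data; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def grobid_fulltext(pdf_path: Path, url: str = GROBID_URL, timeout: float = 120.0) -> str:
    """POST one PDF to GROBID and return the XML response as text."""
    body, content_type = build_multipart("input", pdf_path.name, pdf_path.read_bytes())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `grobid_fulltext(Path("paper.pdf"))` with the container running should return the XML that I then save under `converted_output_grobid`; the file name here is only a placeholder.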
Thank you for your insights!