tabula-extractor
Extract tables from PDF files
While using `tabula-extractor` to parse [this PDF](http://www.ok.gov/able/documents/M27.pdf) (pages 1 - 151), I ran into some interesting issues: 1. While there are no visible 'ruling lines', the rows are colored differently...
Test file: https://s3.amazonaws.com/metawilm/TabulaTest.pdf (I couldn't upload file here) Choosing "Auto-detect tables" and then "Preview & Export Extracted Data" leads to: Tabula API version: 1.0.0 Filename: TabulaTest.pdf Internal Server Error (500)...
Hi, I've just discovered tabula-extractor and it's given me great results so far. Thanks for your work! One problem I have though is that in the PDFs I get, tables...
Here's a case that we might want to look into: https://www.dropbox.com/s/0i6ae5kgtcy0frb/s-013163.pdf It's definitely a "spreadsheet", but the lines-of-text / ruling-lines ratio is way below/above the heuristic's defined threshold.
So rather than doing this:
```
extractor = Tabula::Extraction::ObjectExtractor.new(pdf_filename, [1]).extract
extractor.each &:whatever
extractor.close!
```
we can do something like:
```
extractor = Tabula::Extraction::ObjectExtractor.new(pdf_filename, [1]).extract do |extractor|
  extractor.each &:whatever
end
```
...
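The block form proposed here could behave the way `File.open` does in Ruby: yield the resource, then release it in an `ensure` clause so the caller never has to remember `close!`. The sketch below illustrates that pattern only. `DemoExtractor` and its `closed` reader are hypothetical stand-ins, not part of tabula-extractor, and the real `ObjectExtractor` may be structured differently.

```ruby
# Hypothetical stand-in for an extractor that holds an open resource.
# This is NOT the real Tabula::Extraction::ObjectExtractor.
class DemoExtractor
  attr_reader :closed

  def initialize(pages)
    @pages = pages
    @closed = false
  end

  # Without a block: return an Enumerator; the caller must call close! itself.
  # With a block: yield the Enumerator, then close even if the block raises.
  def extract
    pages = @pages.each
    return pages unless block_given?

    begin
      yield pages
    ensure
      close!
    end
  end

  def close!
    @closed = true
  end
end

demo = DemoExtractor.new([1, 2, 3])
seen = []
demo.extract { |pages| pages.each { |page| seen << page } }
```

Keeping the block-less form means existing callers that close the extractor by hand would keep working, while new code gets cleanup for free.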
This diff between two versions (https://github.com/tabulapdf/tabula-extractor/compare/bb24fa9b98a2...f4e291c6e32a) seems to have introduced a failing test. bb24fa9b98a2 passes: https://travis-ci.org/tabulapdf/tabula-extractor/builds/27415350; f4e291c6e32a fails: https://travis-ci.org/tabulapdf/tabula-extractor/builds/32744778 (see the build history at https://travis-ci.org/tabulapdf/tabula-extractor/builds ). It's literally just turning off PDFBox logging...
- Move all of its methods to `Rectangle2D` - Make the "rectangular" entities (`TextElement`, `Page`, etc) inherit from `Rectangle2D` (we reopened it, anyway)
The test suite code has become an unmaintainable mess. Let's clean that up. While we're at it, we should merge [`icdar-groundtruth-tests`](https://github.com/jazzido/tabula-extractor/tree/icdar-groundtruth-tests) into `master`.
Feature request. I'd love to see a command-line option to get information about the rectangles found by TableGuesser. A "dry run mode" to see what portion of the PDF tabula-extractor will...
Write some real tests for the behavior in `bin/tabula`. I made a list of commands that all worked properly with my recent changes in https://github.com/jazzido/tabula-extractor/blob/pre07/test/test_bin_tabula.sh, but that's obviously not a...