Decimal Points Missed Entirely by New Version

Open sometimesabird opened this issue 2 years ago • 3 comments

Describe the bug

Decimal points are sometimes not read by the program despite being present in the PDF text, e.g. it reads "1.5" as "15". This is a new bug: version 0.7.2 worked correctly, but both 0.7.3 and the current version (0.10.1) fail.

Steps to reproduce the bug

  1. Download this table.
  2. Install camelot: pip install camelot-py==0.10.1
  3. Run

camelot -p all -o "test-NEW.csv" -f csv -split -strip ".\n" lattice -scale 100 -copy v "369746.pdf"

  4. Install older version: pip install camelot-py==0.7.2
  5. Run

camelot -p all -o "test-OLD.csv" -f csv -split -strip ".\n" lattice -scale 100 -copy v "369746.pdf"

  6. Open test-NEW-page-1-table-1.csv and test-OLD-page-1-table-1.csv.

Expected behavior

Line 2 of test-OLD.csv is what we should have:

"SMM camera at Donetsk Filtration Station (15km N of Donetsk)","0.5-1.5km","S","Recorded","2","Projectile","From E to W","N/K","31-Jan","19:35"

Line 2 of test-NEW.csv is misread:

"SMM camera at Donetsk Filtration Station (15km N of Donetsk)","05-15km","S","Recorded","2","Projectile","From E to W","N/K","31-Jan","19:35"

(Note that the same thing happens to the column name located in the first row -- "No." is converted into "No".)

PDF

PDF

Environment

  • OS: Garuda Linux (Arch-based)
  • Python version: 3.10.2
  • Numpy version: 1.22.2-1
  • OpenCV version: 4.5.5-3
  • Ghostscript version: 9.55.0-4
  • Camelot version: 0.10.1

sometimesabird • Mar 12 '22 23:03

Sounds more like a bug that has been fixed. You seem to be passing '.' in the strip argument, and that is supposed to strip the decimal points.

ramSeraph • Mar 13 '22 08:03

Oh, so it's meant to strip any of the characters, not this particular sequence?

sometimesabird • Mar 13 '22 20:03

It looks that way from the code..

https://github.com/camelot-dev/camelot/blob/644bbe7c6d57b95aefa2f049a9aacdbc061cc04f/camelot/utils.py#L503-L505

It used to only strip at the end of the line, but now it strips from the whole line.

It was changed in this commit.

But even in the previous version it looks like it always matched any of the characters, not the sequence.

ramSeraph • Mar 13 '22 22:03
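
For anyone landing here later, the difference described above can be sketched in plain Python. This is only an illustration of the two behaviours as described in this thread, not camelot's actual code (that lives in the linked camelot/utils.py). The helper names strip_ends and strip_everywhere are made up for this example.

```python
def strip_ends(text: str, chars: str) -> str:
    """Hypothetical helper: drop any of `chars` from the two ends only."""
    return text.strip(chars)


def strip_everywhere(text: str, chars: str) -> str:
    """Hypothetical helper: drop any of `chars` wherever they occur."""
    # str.maketrans with a third argument maps each listed character to None,
    # so translate() deletes every occurrence of it.
    return text.translate(str.maketrans("", "", chars))


cell = "0.5-1.5km"      # cell value from the report
strip_arg = ".\n"       # what was passed to -strip: a set of characters

print(repr(strip_ends(cell, strip_arg)))        # interior dots survive
print(repr(strip_everywhere(cell, strip_arg)))  # every dot is removed
print(repr(strip_everywhere(cell, "\n")))       # only newlines are removed
```

Either way the argument is treated as a set of characters rather than a literal sequence, so if the goal is only to drop line breaks inside cells, passing "\n" alone as the strip argument should leave the decimal points in place.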