
Implement line color filter

Open • jazzpi opened this issue 4 years ago • 1 comment

Addresses #21.

I've never worked with PDFBox before, so I hope this is the right approach. It works at least for this file: without the color filter, some of the hyperlink underlines are detected as rulings, which splits those rows. However, it doesn't seem to work for this file, where without the color filter every cell is simply detected as separate. With the color filter, it exports the following CSV:

A,B
4","2
5,6

Is this an issue with the color filter or is it related to the red and black lines crossing?

Other notes:

  • I'm not sure if this is a good way to pass the line color filter argument to the ObjectExtractorStreamEngine.
  • I haven't added tests yet. I should hopefully have some time next week to debug further and add them.
  • I couldn't come up with a sensible short-style command line option, so I only added a long-style one.
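As a rough illustration of the idea behind the filter (not the code in this pull request), a line color filter can be modeled as a predicate over a line's stroke color that is applied before lines are treated as rulings. Everything here is hypothetical: the `ColoredLine` type, the `keepRulings` helper, the near-black threshold, and the assumption that the unwanted underlines are drawn in a non-black color. It does not use tabula-java's or PDFBox's actual classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class LineColorFilterSketch {

    /** Hypothetical stand-in for a drawn line plus its RGB stroke color (components in 0..1). */
    static final class ColoredLine {
        final float x1, y1, x2, y2;
        final float[] rgb;

        ColoredLine(float x1, float y1, float x2, float y2, float[] rgb) {
            this.x1 = x1;
            this.y1 = y1;
            this.x2 = x2;
            this.y2 = y2;
            this.rgb = rgb;
        }
    }

    /** Keeps only the lines whose stroke color is accepted by the filter. */
    static List<ColoredLine> keepRulings(List<ColoredLine> lines, Predicate<float[]> colorFilter) {
        List<ColoredLine> kept = new ArrayList<>();
        for (ColoredLine line : lines) {
            if (colorFilter.test(line.rgb)) {
                kept.add(line);
            }
        }
        return kept;
    }

    /** Example filter: accept colors close to black (the 0.1 threshold is an arbitrary choice). */
    static boolean isNearBlack(float[] rgb) {
        return rgb[0] < 0.1f && rgb[1] < 0.1f && rgb[2] < 0.1f;
    }

    public static void main(String[] args) {
        List<ColoredLine> lines = Arrays.asList(
                new ColoredLine(0, 10, 100, 10, new float[] {0f, 0f, 0f}),  // black table border
                new ColoredLine(5, 20, 40, 20, new float[] {0f, 0f, 1f}));  // blue underline
        List<ColoredLine> rulings = keepRulings(lines, LineColorFilterSketch::isNearBlack);
        System.out.println("kept " + rulings.size() + " of " + lines.size() + " lines");
    }
}
```

Passing the filter as a `Predicate` keeps the color decision separate from the code that collects lines, which is one possible answer to the first note above.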

jazzpi • May 05 '21 13:05

... so after a couple of hours of debugging I just realized that this happens because the line breaks used by tabula-java are carriage returns instead of line feeds. When the output is displayed, the beginning of the line is overwritten, so the extraction actually works just fine.
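The overwriting effect can be reproduced with a small simulation of how a terminal handles `\r`. This is only a sketch: the `render` helper is hypothetical, and the sample row `"3\r4",2` is an invented illustration, not the actual content of the PDF discussed here.

```java
public class CarriageReturnDemo {

    /**
     * Renders text roughly the way a terminal would: '\r' moves the cursor
     * back to column 0 without clearing the line, so later characters
     * overwrite earlier ones; '\n' starts a new line.
     */
    static String render(String raw) {
        StringBuilder out = new StringBuilder();
        StringBuilder line = new StringBuilder();
        int cursor = 0;
        for (char c : raw.toCharArray()) {
            if (c == '\r') {
                cursor = 0;
            } else if (c == '\n') {
                out.append(line).append('\n');
                line.setLength(0);
                cursor = 0;
            } else {
                if (cursor < line.length()) {
                    line.setCharAt(cursor, c); // overwrite what was already printed
                } else {
                    line.append(c);
                }
                cursor++;
            }
        }
        return out.append(line).toString();
    }

    public static void main(String[] args) {
        // Invented CSV: the first cell of row 2 contains a carriage return.
        String raw = "A,B\n\"3\r4\",2\n5,6\n";
        System.out.print(render(raw));                  // what the screen appears to show
        System.out.print(raw.replace("\r", "\\r"));     // the bytes that are really there
    }
}
```

In this example the second row is displayed as `4",2` even though the data is `"3\r4",2`, so inspecting the raw bytes (or escaping `\r` as above) is a more reliable check than reading the terminal.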

I've added a test as well and think this is ready for review.

jazzpi • May 11 '21 13:05