
Double line Break - Camelot switches characters around

Open shakirshakeelzargar opened this issue 5 years ago • 2 comments

I'm trying to parse tables in a PDF using Camelot. The cells contain multiple lines of text, and some have an empty line separating portions of the text:

First line
Second line

Third line

I would expect this to be parsed as "First line\nSecond line\n\nThird line" (note the double line break), but I get this instead: "T\nFirst line\nSecond line\nhird line". The first character after the double line break moves to the beginning of the text, and I get only a single line break.

I also tried using tabula, but it messes up the entire table (the data frame, actually) when there is an empty row in the table, and for some words it puts a space between the characters.

shakirshakeelzargar, Oct 16 '20 22:10

For those looking for a solution, I have found a workaround that works very well. I have posted it here: https://stackoverflow.com/questions/64317363/camelot-switches-characters-around/64946264#64946264

@vinayak-mehta Any updates on this issue???

ashir3097, Nov 21 '20 18:11

You may be able to solve this by reducing the value of the char_margin parameter of pdfminer's LAParams (default: 2.0).

You can set the parameters yourself with

camelot.read_pdf(DOCUMENT, pages="all", layout_kwargs={"char_margin": 0.5})

for example. Other parameters may need to be changed as well, but char_margin is the first one that comes to mind.
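If it is not obvious which value to pick, one option is to try a few and inspect the result. The following is only a rough sketch: find_char_margin and read_first_cell are hypothetical helper names, the candidate values are arbitrary, and it assumes the affected text sits in the top-left cell of the first table. There is no guarantee that any char_margin value fixes the problem for a given PDF.

```python
from typing import Callable, Iterable, Optional


def find_char_margin(
    read_cell: Callable[[float], str],
    looks_right: Callable[[str], bool],
    candidates: Iterable[float] = (2.0, 1.5, 1.0, 0.5, 0.25),
) -> Optional[float]:
    """Return the first candidate char_margin whose cell text passes the check.

    read_cell(margin) should return the extracted text of one cell;
    looks_right(text) decides whether that text is acceptable.
    Returns None if no candidate passes.
    """
    for margin in candidates:
        if looks_right(read_cell(margin)):
            return margin
    return None


def read_first_cell(path: str, margin: float) -> str:
    """Hypothetical reader: top-left cell of the first table in the PDF."""
    import camelot  # imported lazily so find_char_margin works without Camelot

    tables = camelot.read_pdf(
        path, pages="all", layout_kwargs={"char_margin": margin}
    )
    # Assumption: the problematic cell is at row 0, column 0 of table 0.
    return tables[0].df.iloc[0, 0]
```

It could be used like find_char_margin(lambda m: read_first_cell("document.pdf", m), lambda text: "Third line" in text), where "document.pdf" stands in for your own file and the check is adapted to the text you expect.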

mssnglnk, Sep 06 '21 10:09