pdftotext -layout

Open MikhailKlemin opened this issue 7 years ago • 3 comments

Hello! I would like to use the -layout option of pdftotext. For that, I guess I have to change body, err := exec.Command("pdftotext", "-q", "-nopgbrk", "-enc", "UTF-8", "-eol", "unix", f.Name(), "-").Output() to add -layout. Am I correct?

MikhailKlemin, Nov 02 '17 14:11

@MikhailKlemin yes, you would. However, it may be worth making that the default for everyone. Do you have any text examples showing the difference with and without the -layout option?

mish15, Nov 03 '17 00:11

Hi. For me it makes a lot of sense, since I usually apply a lot of regexes after converting to TXT, and -layout really helps to fight the mess. I attached an example with screenshots.
Here are the source PDF and the converted TXT with and without the -layout option: https://transfer.sh/WJzz/examples.zip

MikhailKlemin, Nov 03 '17 07:11

@MikhailKlemin we normally have to clean up the whitespace, so we'd need to test this internally to see what happens. I think it's worth adding as an option. I would look at adding some ENV options to control this. What's your timeframe?

mish15, Nov 06 '17 22:11