docconv
pdftotext -layout
Hello!
I would like to use the -layout option of pdftotext. For that, I guess I have to change
body, err := exec.Command("pdftotext", "-q", "-nopgbrk", "-enc", "UTF-8", "-eol", "unix", f.Name(), "-").Output()
to add -layout.
Am I correct?
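A minimal sketch of that change, assuming the flag is passed in as a parameter rather than hard-coded. The names pdftotextArgs and convertPDF and the layout parameter are hypothetical and are not taken from docconv; only the pdftotext flags come from the call quoted above.

```go
package main

import "os/exec"

// pdftotextArgs rebuilds the argument list from the call quoted above and
// optionally adds -layout. The function name and the layout parameter are
// hypothetical; they are not part of docconv.
func pdftotextArgs(path string, layout bool) []string {
	args := []string{"-q", "-nopgbrk", "-enc", "UTF-8", "-eol", "unix"}
	if layout {
		// -layout asks pdftotext to keep the physical layout of the page.
		args = append(args, "-layout")
	}
	// Options go first, then the input file, then "-" to write to stdout.
	return append(args, path, "-")
}

// convertPDF (hypothetical helper) runs pdftotext and returns its stdout.
func convertPDF(path string, layout bool) ([]byte, error) {
	return exec.Command("pdftotext", pdftotextArgs(path, layout)...).Output()
}
```

With this shape the original line would become something like body, err := convertPDF(f.Name(), true).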
@MikhailKlemin yes, you would. However, it may be worth making that the default for everyone. Do you have any text examples showing the difference with and without the -layout option?
Hi
For me it makes a lot of sense, since I usually apply a lot of regexes after converting to TXT, and -layout really helps to fight the mess. I attached an example with screenshots.
Here are the source PDF and the converted text with and without the -layout option:
https://transfer.sh/WJzz/examples.zip
@MikhailKlemin we normally have to clean up the whitespace, so we'd need to test this internally to see what happens. I think it's worth adding as an option. I would look at adding some ENV options to control this. What's your timeframe?
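One possible shape for such an ENV option, as a rough sketch only. The variable name DOCCONV_PDFTOTEXT_LAYOUT and both function names are assumptions made for illustration; nothing here exists in docconv today. The layout stays off unless the variable is set to a value that strconv.ParseBool accepts as true.

```go
package main

import (
	"os"
	"strconv"
)

// layoutEnvVar is a hypothetical variable name, chosen only for this sketch.
const layoutEnvVar = "DOCCONV_PDFTOTEXT_LAYOUT"

// parseLayoutFlag reports whether value enables -layout.
// Empty or unparsable values leave the current behaviour unchanged (false).
func parseLayoutFlag(value string) bool {
	enabled, err := strconv.ParseBool(value)
	return err == nil && enabled
}

// layoutFromEnv reads the hypothetical environment variable.
func layoutFromEnv() bool {
	return parseLayoutFlag(os.Getenv(layoutEnvVar))
}
```

Keeping the default off means existing whitespace clean-up keeps seeing the same input until someone opts in.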