Input/Export from/to file and stdin/stdout

Open • matthiasbock opened this issue 2 years ago • 1 comment

Hi,

Currently, when exporting PDF content it is only possible to specify the name of the directory to which exported text files shall be written (outDir):

$ pdfcpu extract
usage: pdfcpu extract -m(ode) i(mage)|f(ont)|c(ontent)|p(age)|m(eta) [-p(ages) selectedPages] inFile outDir

It would be very useful if it were possible to specify filenames instead:

Export all PDF pages to one file:

$ pdfcpu extract -m content -o all_pages.txt some.pdf

Export one page to file:

$ pdfcpu extract -m content -p 1 -o page1.txt some.pdf

Export selected pages to distinct files:

$ pdfcpu extract -m content -p 1 -o page1.txt -p 2 -o page2.txt some.pdf

Export selected pages to the same file:

$ pdfcpu extract -m content -p 1 -o pages1+3.txt -p 2 -o page2.txt -p 3 -o pages1+3.txt some.pdf

$ pdfcpu extract -m content -p 1,3 -o pages1+3.txt -p 2 -o page2.txt some.pdf

In particular, it would be useful if stdin could be used to read the input PDF and stdout to write the exported content. This would enable PDF processing in the shell using pipes:

Read PDF input from stdin:

$ curl https://internet/some.pdf | pdfcpu extract -m content -o some_pages.txt -

Export text to stdout:

$ pdfcpu extract -m content -o - some.pdf | fgrep "Chapter 3:"
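Until something like this exists, the pipe behaviour can probably be approximated with a small wrapper around the current inFile/outDir form. The following is only a sketch: the function name pdf_content_stdio and the PDFCPU override variable are made up for illustration, and it assumes that pdfcpu writes one or more plain files into outDir whose lexical order is acceptable (so page 10 may sort before page 2).

```shell
# Hypothetical wrapper: read a PDF on stdin, write the extracted content to stdout.
# PDFCPU is an illustrative override; it defaults to the pdfcpu binary on PATH.
pdf_content_stdio() {
  local tmp f status=0
  tmp=$(mktemp -d) || return 1
  # Buffer stdin in a temporary file, because extract expects an input file name.
  cat > "$tmp/in.pdf"
  mkdir "$tmp/out"
  # "$@" passes extra options such as -p 1,3 through unchanged.
  # The tool's own messages are sent to stderr so stdout carries only content.
  if "${PDFCPU:-pdfcpu}" extract -m content "$@" "$tmp/in.pdf" "$tmp/out" >&2; then
    # Assumption: file names are chosen by pdfcpu; they are concatenated in glob order.
    for f in "$tmp/out"/*; do
      if [ -f "$f" ]; then cat "$f"; fi
    done
  else
    status=1
  fi
  rm -rf "$tmp"
  return "$status"
}
```

With that in place, something like `curl https://internet/some.pdf | pdf_content_stdio -p 1 | fgrep "Chapter 3:"` should work, at the cost of a temporary copy of the PDF.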

Best, Matthias

Feb 08 '24 17:02 matthiasbock

Hello! Support for shell piping is a useful addition. As for your suggested addition to the extract command-line processing, I'd rather leave that up to the calling script. I am also not in favour of using -o repeatedly within one command. And if we start using -o, then that would have to change for all pdfcpu commands.

Feb 25 '24 10:02 hhrutter
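
For reference, "leaving that up to the calling script" could look roughly like the sketch below for the "selected pages to the same file" case. It relies only on the inFile/outDir usage quoted in the issue; the helper name extract_pages_to_file and the PDFCPU override variable are hypothetical, and the names and ordering of the files pdfcpu writes into outDir are assumed rather than known.

```shell
# Hypothetical helper: collect the content of the selected pages in one file.
# Usage: extract_pages_to_file PAGES OUTFILE INFILE
extract_pages_to_file() {
  local pages="$1" outfile="$2" infile="$3" tmp f status=0
  tmp=$(mktemp -d) || return 1
  # Let pdfcpu write into a scratch directory, as its current usage requires.
  if "${PDFCPU:-pdfcpu}" extract -m content -p "$pages" "$infile" "$tmp" >&2; then
    # Start with an empty output file, then append every extracted file.
    : > "$outfile"
    for f in "$tmp"/*; do
      # Assumption: glob (lexical) order is acceptable for the selected pages.
      if [ -f "$f" ]; then cat "$f" >> "$outfile"; fi
    done
  else
    status=1
  fi
  rm -rf "$tmp"
  return "$status"
}
```

The last example from the issue would then become two calls: `extract_pages_to_file 1,3 pages1+3.txt some.pdf` and `extract_pages_to_file 2 page2.txt some.pdf`.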