
workflow to extract text

Open eflister opened this issue 5 years ago • 1 comment

hi - i just used #master to do the common thing of extracting all text from a pdf. it worked, thanks for the nice library! it took a while to figure out how to do it and required more contortions than i expected. perhaps you could add some api support for such a basic task? here's what i wound up with, is this what you expect users to do?

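import qualified Data.Text as T
import qualified Data.Text.IO as T
import Pdf.Document  -- assumed: the pdf-toolbox-document module exporting the functions used below

main = do
  withPdfFile "file.pdf" $ \pdf -> do
    txt <- extract pdf =<< catalogPageNode =<< documentCatalog =<< document pdf
    T.putStrLn txt  -- a do block cannot end with a bind, so print the extracted text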

-- walk the page tree depth-first and concatenate the text of every page
extract pdf = (T.concat <$>) . (traverse ((extract' =<<) . loadPageNode pdf) =<<) . pageNodeKids
  where
    extract' (PageTreeLeaf tn) = pageExtractText tn
    extract' (PageTreeNode tn) = extract pdf tn

eflister Sep 11 '20 16:09

Yeah, a simpler API would be great, though I'm not sure exactly what it should look like. I'll think about it. Thank you for the input.

Yuras Sep 17 '20 19:09