pdf-toolbox workflow to extract text

Open eflister opened this issue 5 years ago • 1 comment

hi - i just used #master to do the common thing of extracting all text from a pdf. it worked, thanks for the nice library! it took a while to figure out how to do it and required more contortions than i expected. perhaps you could add some api support for such a basic task? here's what i wound up with, is this what you expect users to do?

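-- Module name assumed from the pdf-toolbox-document package.
import Pdf.Document
import qualified Data.Text as T
import qualified Data.Text.IO as T

main = do
  withPdfFile "file.pdf" $ \pdf -> do
    txt <- extract pdf =<< catalogPageNode =<< documentCatalog =<< document pdf
    T.putStr txt  -- a do block cannot end with a bind, so use the result here

-- Concatenate the text of every page below a page-tree node.
extract pdf = (T.concat <$>) . (traverse ((extract' =<<) . loadPageNode pdf) =<<) . pageNodeKids
  where
    extract' (PageTreeLeaf tn) = pageExtractText tn
    extract' (PageTreeNode tn) = extract pdf tn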
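
For readers who find the point-free version hard to follow, the same traversal can be unrolled into explicit recursion. This is only a sketch under assumptions: the helper name `extractAllText` is hypothetical and not part of pdf-toolbox, the `Pdf.Document` module name and the `Pdf`, `PageNode` type names are assumed, and only the functions already used in the snippet above are called.

```haskell
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import Pdf.Document  -- assumed module name (pdf-toolbox-document)

-- Hypothetical helper, not a library function: the text of every page,
-- concatenated in page-tree order.
extractAllText :: Pdf -> IO T.Text
extractAllText pdf = do
  root <- catalogPageNode =<< documentCatalog =<< document pdf
  T.concat <$> go root
  where
    -- Collect the text of all pages below one page-tree node.
    go :: PageNode -> IO [T.Text]
    go node = do
      refs <- pageNodeKids node
      concat <$> mapM visit refs

    -- A kid is either a single page (leaf) or a nested node to recurse into.
    visit ref = do
      tree <- loadPageNode pdf ref
      case tree of
        PageTreeLeaf page  -> (: []) <$> pageExtractText page
        PageTreeNode child -> go child

main :: IO ()
main = withPdfFile "file.pdf" $ \pdf ->
  extractAllText pdf >>= TIO.putStr
```

The recursion is depth-first, so pages come out in the order the page tree lists them. A helper of roughly this shape is the kind of API support being asked for.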

Sep 11 '20 16:09 eflister

Yeah, a simpler API would be great, though I'm not sure exactly what it should look like. I'll think about it. Thank you for the input.

Sep 17 '20 19:09 Yuras
