I'm a newbie.

Open. Mohanrajkarnan opened this issue 4 years ago • 1 comment

I have a requirement to convert PDF to HTML5.

I tried the code below. It extracted the text from the PDF and created an HTML file, but the output is not structured like the PDF: the images are missing and the text positioning is lost.

pdftotree.parse(pdf_file, html_path=htmlPath, favor_figures=True, model_type=None, model_path=None, visualize=False)

Please help me understand what I am missing.

Thanks,
Mohan

Mohanrajkarnan, Feb 22 '21 17:02

Hi Mohan,

It sounds like you're trying to get an HTML representation that focuses on visually looking like the source PDF, is that correct? If so, pdftotree most likely isn't for you. The focus here is more on structural accuracy (e.g., tables end up in HTML tables), not faithfully representing a PDF document visually. Many PDF to HTML tools have a similar focus.

If my assumption is correct, then I'd suggest trying some other tools. I think pdftohtml.org is one that emphasizes visual accuracy, but I'm sure there are others as well.
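To check which kind of output a given converter produced (structural markup such as tables versus visual markup such as images), a rough sketch along these lines might help. It uses only the Python standard library and does not call pdftotree; the `TagCounter` class, the `count_tags` helper, and the sample HTML string are all hypothetical and stand in for whatever file your tool writes.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count start tags so structural elements can be told apart from visual ones."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Called for every opening tag; tag names arrive lowercased.
        self.counts[tag] += 1


def count_tags(html_text):
    """Return a Counter mapping tag names to how often they appear in html_text."""
    parser = TagCounter()
    parser.feed(html_text)
    parser.close()
    return parser.counts


# Hypothetical sample standing in for a converter's output (not real pdftotree output).
sample = "<html><body><table><tr><td>1</td></tr></table><p>text</p></body></html>"
counts = count_tags(sample)
print("tables:", counts["table"], "images:", counts["img"])
```

For a real file, read it with `open(path, encoding="utf-8").read()` and pass the text to `count_tags`. Many tables and few images suggest structure-oriented output; the reverse suggests a tool that aims to reproduce the page visually.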

lukehsiao, Feb 23 '21 00:02