pdftotree
I'm a newbie.
I have a requirement to convert PDF to HTML5.
I tried the code below, which extracted the text from the PDF and created an HTML file, but the output is not structured as in the PDF: images are missing and text positioning is lost.
pdftotree.parse(pdf_file, html_path=htmlPath, favor_figures=True, model_type=None, model_path=None, visualize=False)
Please help me understand what I am missing.
Thanks, Mohan
Hi Mohan,
It sounds like you're trying to get an HTML representation that focuses on visually looking like the source PDF, is that correct? If so, pdftotree most likely isn't for you. The focus here is more on structural accuracy (e.g., tables end up in HTML tables), not faithfully representing a PDF document visually. Many PDF to HTML tools have a similar focus.
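To see what "structural accuracy" means in practice, one option is to count the structural tags in the generated HTML. The sketch below uses only the Python standard library. The `sample_html` string and the `count_tags` helper are made up for illustration and are not real pdftotree output or part of its API; you would read your own output file instead.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts opening tags so we can see which structural elements exist."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def count_tags(html_text):
    """Hypothetical helper: return a Counter of opening tags in html_text."""
    parser = TagCounter()
    parser.feed(html_text)
    parser.close()
    return parser.counts


# Made-up sample for illustration only, NOT actual pdftotree output.
sample_html = (
    "<html><body><p>Intro</p>"
    "<table><tr><td>1</td><td>2</td></tr></table>"
    "</body></html>"
)

counts = count_tags(sample_html)
# Structure-oriented output tends to show up as tags like <table>, <tr>, <td>,
# while visual details such as images or absolute positioning may be absent.
print("tables:", counts["table"], "cells:", counts["td"], "images:", counts["img"])
```

If the counts show tables and paragraphs but no images or positioning markup, the tool is doing what it was designed for, and a visually oriented converter is probably a better fit for your requirement.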
If my assumption is correct, then I'd suggest trying some other tools. I think pdftohtml.org is one that emphasizes visual accuracy, but I'm sure there are others as well.