pdftotree issues

I'm a newbie.

1

I have a requirement to convert PDF to HTML5. I tried the code below, which extracted the text from the PDF and created HTML, but not structured as in...

Mohanrajkarnan

Extract captions of images

2

I am using pdftotree to read a PDF and extract its images. I have a requirement to extract image captions along with each image, so I request...

ashleo25

Images are not extracted properly

1

I am running pdftotree to generate HTML and executing Fonduer to generate images. Some of the images are not parsed properly. Images are black and parsed properly. Since we are...

ashleo25

pdftotree generates a lot of tmp folders

I am using pdftotree to parse a PDF and run Fonduer; we are extracting images from the PDF. When pdftotree struggles to parse the PDF, it keeps generating tmp folders....

ashleo25

Loss of information oftentimes in the last line of a table

7

**Describe the bug** I've tried the plain `pdftotree` command line utility on a few pdf files with tables, and found wherever there is a table structure, the last line is...

linM24

bug

Inconsistent data models for bbox

**Describe the bug** Data models that represent bounding boxes are inconsistent, which considerably degrades readability. For example, `bbox: List[float]` in the order of `(y0, x0, y1, x1)` at https://github.com/HazyResearch/pdftotree/blob/6ff4a7cb5fe6269e3c287664392e226ca45479d4/pdftotree/TreeExtract.py#L447 `bbox:...

HiromuHota

Switch from Tabula to Camelot?

2

**Is your feature request related to a problem? Please describe.** Switching from Tabula to Camelot has two advantages: 1. Tabula is Java, Camelot is Python. Switching to Camelot frees us...

HiromuHota

Add type annotations

**Is your feature request related to a problem? Please describe.** The code is not type-annotated. **Describe the solution you'd like** Add type annotations and check them with mypy. **Describe alternatives you've considered**...

HiromuHota

help wanted

How to reproduce the vision model?

1

**Describe the bug** A vision-based model has been introduced at #29. Are there a script and a dataset to reproduce this model? **To Reproduce** No way to reproduce the model....

HiromuHota

Enhancement using pdf-to-svg to get underlined and struck-out text formatting

4

I have had good results by converting a pdf to a series of svg (scalable vector graphics; an xml format) files with the open source tool [mupdf](https://mupdf.com/). I then use...

clayms
