
hotpdf is a fast PDF parsing library for extracting text and finding text within PDF documents, built on top of pdfminer.six.

Results 7 hotpdf issues

I have a use case where my output strings look like `"John William Doe01/01/1999Continuing Graduate"`, but for an arbitrary name, date, and student type ("Continuing Graduate" vs "Continuing...

enhancement
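The issue text above is truncated, so the exact request is not known. As a hedged illustration only, here is one way such a fused string could be split after extraction, assuming the date always has the form `DD/DD/YYYY` and sits between the name and the student type. The function name `split_record` is hypothetical and is not part of hotpdf.

```python
import re

# Assumption: exactly one date of the form NN/NN/NNNN separates the two text fields.
_RECORD = re.compile(r"^(?P<name>.+?)(?P<date>\d{2}/\d{2}/\d{4})(?P<student_type>.+)$")


def split_record(raw):
    """Split a fused 'name + date + student type' string into its three parts.

    Returns a dict with the keys 'name', 'date' and 'student_type',
    or None when the string does not match the assumed layout.
    """
    match = _RECORD.match(raw)
    if match is None:
        return None
    return {key: value.strip() for key, value in match.groupdict().items()}


record = split_record("John William Doe01/01/1999Continuing Graduate")
```

This relies on the date acting as an unambiguous delimiter; names or student types that themselves contain a date-like substring would need a stricter rule.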

Currently, we only support case-sensitive searching. It would be helpful to have a flag such as `ignore_case=True` in `find_text` for case-insensitive searching.

enhancement
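A minimal sketch of what such a flag could mean, written against plain strings rather than hotpdf's internals. The helper `find_all` is hypothetical; only the `ignore_case` name comes from the issue, and how `find_text` would actually implement it is not stated there.

```python
import re


def find_all(haystack, needle, ignore_case=False):
    """Return the start offsets of every occurrence of needle in haystack.

    The needle is escaped, so it is matched literally rather than as a regex.
    """
    flags = re.IGNORECASE if ignore_case else 0
    return [m.start() for m in re.finditer(re.escape(needle), haystack, flags)]


case_sensitive = find_all("Foo foo FOO", "foo")
case_insensitive = find_all("Foo foo FOO", "foo", ignore_case=True)
```

Defaulting `ignore_case` to `False` would keep the existing case-sensitive behaviour unchanged for current callers.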

Some characters, such as €, are not readable: the text `339,45 €` is read as `339,45 cid(128)`. If needed, I can send the PDF; I can't attach it here. It seems to be caused by...

bug
pdfminer.six
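The root cause is cut off in the excerpt above, so the following is only a possible post-processing workaround, not a fix. It assumes, without confirmation from the issue, that the numeric code in the token is a Windows-1252 byte value (byte 128 is `€` in that code page). In general a CID is font-specific, so this mapping may be wrong for other documents. The helper name `replace_cid_tokens` is hypothetical.

```python
import re

# Matches both the spelling quoted in the issue, cid(128), and the form (cid:128).
_CID_TOKEN = re.compile(r"\(cid:(\d+)\)|cid\((\d+)\)")


def replace_cid_tokens(text, encoding="cp1252"):
    """Replace cid tokens with the character their code maps to in `encoding`.

    Tokens whose code is out of byte range or undefined in the encoding
    are left untouched.
    """

    def _decode(match):
        code = int(match.group(1) or match.group(2))
        if code > 255:
            return match.group(0)
        try:
            return bytes([code]).decode(encoding)
        except UnicodeDecodeError:
            return match.group(0)

    return _CID_TOKEN.sub(_decode, text)


fixed = replace_cid_tokens("339,45 cid(128)")
```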

Currently, loading a large PDF file (e.g. [bible.pdf](https://github.com/weareprestatech/hotpdf/files/13933240/bible.pdf), ~700 pages) takes 72 seconds. Here is the profile data: ![image](https://github.com/weareprestatech/hotpdf/assets/106533898/31076e12-f303-493c-b384-39d85d6fefc2) As we can see, the biggest bottleneck is [load_memory_map](https://github.com/weareprestatech/hotpdf/blob/main/hotpdf/memory_map.py)...

enhancement
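The issue does not say which profiler produced the data above. As a sketch, a comparable report can be collected with the standard library's `cProfile`. The helper `profile_call` and the stand-in workload are hypothetical; in practice the callable would be whatever loads the PDF.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile and return (result, text report).

    The report lists the `top` entries sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def _stand_in_workload(n):
    # Placeholder for the real load call; it just sums some squares.
    return sum(i * i for i in range(n))


result, report = profile_call(_stand_in_workload, 1000)
```

Sorting by cumulative time tends to surface the outermost slow call first, which is how a function such as `load_memory_map` would show up as the dominant entry.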

Make the version tag in pyproject.toml dynamic. Take it from the latest release (?) What about locally? Ref: #49 @krishnasism

CI/CD

Right now, loading the Bible (an example of a big [PDF file](https://github.com/weareprestatech/hotpdf/files/13933140/The-Holy-Bible-King-James-Version.pdf), ~700 pages) makes memory usage skyrocket to around 1.3 GiB. The big memory allocation is of course...

enhancement
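How the 1.3 GiB figure was measured is not stated in the excerpt. One hedged way to track a similar number during optimisation is the standard library's `tracemalloc`, which reports the peak size of allocations made through Python's allocator (process-level figures from the OS will usually be higher). The helper `measure_peak_memory` is hypothetical.

```python
import tracemalloc


def measure_peak_memory(func, *args, **kwargs):
    """Run func and return (result, peak traced memory in bytes)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        # Always stop tracing, even if func raises.
        tracemalloc.stop()
    return result, peak


# Stand-in workload: allocate a list with 100,000 slots.
data, peak_bytes = measure_peak_memory(lambda: [0] * 100_000)
```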

@callegarimattia I identified one place we could optimise. In the Span Map, instead of storing all "HotCharacters" in the Span, we can store only the str values. This way we will...

enhancement
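To make the proposal concrete, here is a sketch contrasting the two storage strategies. All class names and fields below are hypothetical stand-ins; hotpdf's real `HotCharacter` and Span definitions are not shown in the excerpt and may differ.

```python
from dataclasses import dataclass, field


@dataclass
class HotCharacterSketch:
    # Hypothetical fields: a character value plus a position.
    value: str
    x: int
    y: int


@dataclass
class SpanOfObjects:
    """Keeps every character object alive for the lifetime of the span."""

    characters: list = field(default_factory=list)

    def add(self, char):
        self.characters.append(char)

    def text(self):
        return "".join(c.value for c in self.characters)


@dataclass
class SpanOfStrings:
    """Keeps only the str value of each character; positions are dropped."""

    values: list = field(default_factory=list)

    def add(self, char):
        self.values.append(char.value)

    def text(self):
        return "".join(self.values)


chars = [HotCharacterSketch(value=v, x=i, y=0) for i, v in enumerate("abc")]
by_object, by_string = SpanOfObjects(), SpanOfStrings()
for ch in chars:
    by_object.add(ch)
    by_string.add(ch)
```

Both variants produce the same text, but the string-only span can no longer answer per-character position queries, so this change only fits if the span is not where that information is needed.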