hotpdf
hotpdf is a fast PDF parsing library, built on top of pdfminer.six, for extracting text and finding text within PDF documents.
I have a use case where I have output strings that look like: `"John William Doe01/01/1999Continuing Graduate"` But for an arbitrary name, date, and student type ("Continuing Graduate" vs "Continuing...
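The excerpt above is cut off before it states the actual request, so the following is only a minimal sketch of one way such a string could be split. It assumes the date always has the form `NN/NN/NNNN` and sits directly between the name and the student type; the names `RECORD` and `split_record` are hypothetical and not part of hotpdf.

```python
import re

# Assumption: the date is two digits, two digits, four digits, separated by
# slashes, and it appears exactly once between the name and the student type.
RECORD = re.compile(
    r"^(?P<name>.+?)(?P<date>\d{2}/\d{2}/\d{4})(?P<student_type>.+)$"
)


def split_record(line: str) -> dict:
    """Split a fused 'name + date + student type' string into its parts."""
    match = RECORD.match(line)
    if match is None:
        raise ValueError(f"unrecognised record: {line!r}")
    # Strip stray whitespace that may surround each captured field.
    return {key: value.strip() for key, value in match.groupdict().items()}


record = split_record("John William Doe01/01/1999Continuing Graduate")
```

The lazy `.+?` keeps the name from swallowing the date; a name that itself contains a date-like token would break this assumption.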
Currently, we only support case-sensitive searching. It would be helpful to have a flag like `ignore_case=True` in `find_text` for case-insensitive searching.
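To illustrate the requested behaviour, here is a minimal sketch of a case-insensitive search over a plain string. It does not use or reflect hotpdf's internals; `find_all` is a hypothetical helper, and only the `ignore_case` flag name is taken from the request above.

```python
import re


def find_all(text: str, query: str, ignore_case: bool = False) -> list:
    """Return the start index of each non-overlapping occurrence of query."""
    if not query:
        return []
    flags = re.IGNORECASE if ignore_case else 0
    # re.escape makes the query match literally, even if it contains
    # characters that are special in regular expressions.
    pattern = re.compile(re.escape(query), flags)
    return [match.start() for match in pattern.finditer(text)]


sample = "Hello hello HELLO"
exact = find_all(sample, "hello")
relaxed = find_all(sample, "hello", ignore_case=True)
```

Matching on the original string (rather than lowercasing both sides first) keeps the returned indices valid for the untouched text.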
Some characters, like €, are not readable: the text `339,45 €` is read as `339,45 cid(128)`. If needed, I can send the PDF; I can't attach it here. It seems to be caused by...
Currently, loading a large PDF file (e.g. [bible.pdf](https://github.com/weareprestatech/hotpdf/files/13933240/bible.pdf), ~700 pages) takes 72 seconds. Here is the profile data: (image: profiler output). As we can see, the biggest bottleneck is the [load_memory_map](https://github.com/weareprestatech/hotpdf/blob/main/hotpdf/memory_map.py)...
Make the version tag in pyproject.toml dynamic. Take it from the latest release (?) What about locally? Ref: #49 @krishnasism
Right now, loading the Bible (an example of a big [pdf file](https://github.com/weareprestatech/hotpdf/files/13933140/The-Holy-Bible-King-James-Version.pdf) of ~700 pages) makes memory usage skyrocket to around 1.3 GiB. The big memory allocation is of course...
@callegarimattia I identified one place we could optimise. In the Span Map, instead of storing all "HotCharacters" in the Span, we can store only the str values. This way we will...
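The excerpt is cut off before it states the expected gain. If the goal is lower memory use, the effect can be sketched by comparing a list of per-character objects with a list of plain strings. `CharRecord` is a hypothetical stand-in, not hotpdf's actual `HotCharacter` class, and the measurement relies on CPython's `tracemalloc`.

```python
import tracemalloc
from dataclasses import dataclass


@dataclass
class CharRecord:
    """Hypothetical per-character object (not hotpdf's real class)."""
    value: str
    x: int
    span_id: int


def allocated_bytes(build) -> int:
    """Return how many bytes remain allocated after calling build()."""
    tracemalloc.start()
    baseline, _ = tracemalloc.get_traced_memory()
    data = build()  # keep a reference so the result is still alive when measured
    current, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    del data
    return current - baseline


text = "The quick brown fox " * 500

# One object per character versus one plain str per character.
as_objects = allocated_bytes(
    lambda: [CharRecord(c, i, 0) for i, c in enumerate(text)]
)
as_strings = allocated_bytes(lambda: list(text))
```

On CPython, `as_strings` should come out well below `as_objects`, since each object carries its own attribute storage while the plain strings add little beyond the list itself. The exact figures depend on the interpreter version.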