minecart

Simple, Pythonic extraction of text, shapes and images from PDFs

Results 12 minecart issues

I am trying to read a PNG image from a PDF file, but I get the following error. ``` Traceback (most recent call last): File "D:\workspace\pdfextraction\pdfextract.py", line 10, in im...

    for page in doc.iter_pages():
        im = page.shapes
        for shape in page.shapes:
            print(shape.path)

This returns the output below. What do 'm' and 'l' mean? [('m', 2023.52, 875.9599999999999), ('l', 2023.52, 848.24), ('l',...
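For context on the question above: the one-letter codes match the PDF path-construction operators, where `m` is "moveto" (start a new subpath at a point) and `l` is "lineto" (draw a straight line to a point). The sketch below is a hypothetical helper, not part of minecart, that turns such tuples into readable text. It assumes each tuple is an operator followed by a flat list of x, y coordinates, as in the output quoted in the issue; the final `('h',)` segment is added for illustration only.

```python
# PDF path-construction operators, keyed by the letter seen in the tuples.
PATH_OPERATORS = {
    "m": "moveto",
    "l": "lineto",
    "c": "curveto",
    "v": "curveto (first control point is the current point)",
    "y": "curveto (second control point is the end point)",
    "h": "closepath",
}


def describe_path(path):
    """Return one readable string per path segment (hypothetical helper)."""
    lines = []
    for op, *coords in path:
        name = PATH_OPERATORS.get(op, "unknown operator " + repr(op))
        # Pair up the flat coordinate list as (x, y) points.
        points = list(zip(coords[0::2], coords[1::2]))
        lines.append(f"{name} {points}" if points else name)
    return lines


# First two segments are taken from the issue; the closing ('h',) is assumed.
sample = [("m", 2023.52, 875.9599999999999), ("l", 2023.52, 848.24), ("h",)]
for line in describe_path(sample):
    print(line)
```

Read this way, the quoted output starts a subpath at (2023.52, 875.96) and then draws straight lines to each following point.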

I am getting a warning when using the **doc.getPage()** method. The warning is as follows: WARNING:root:Invalid zlib bytes: error('Error -3 while decompressing data: incorrect header check'), b'l\x85c\xd0;\xac\x14\xa6\x9cB\x11KCU...' No images are getting extracted.
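The wording "Error -3 while decompressing data: incorrect header check" is what Python's standard `zlib` module raises when the bytes it is given do not start with a valid zlib header. The sketch below reproduces that message outside of any PDF library, using the first few bytes of the dump from the warning. It does not diagnose the PDF in the issue; `try_inflate` is a hypothetical helper name.

```python
import zlib


def try_inflate(data):
    """Return (decompressed_bytes, None) on success, or (None, error_message)."""
    try:
        return zlib.decompress(data), None
    except zlib.error as exc:
        return None, str(exc)


# A valid zlib stream round-trips without error.
good = zlib.compress(b"stream contents")

# The first bytes of the dump quoted in the warning; not a valid zlib header.
bad = b"l\x85c\xd0;\xac\x14\xa6"

print(try_inflate(good))
print(try_inflate(bad))
```

A check like this, run on the raw stream bytes, can help tell whether the data is simply not zlib-compressed (for example because it uses another filter or is encrypted) rather than being a damaged zlib stream.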

As can be seen, the `as_pil` method of the Image class considers the filter, colorspace and bits attributes of the LTImage object from the pdfminer package. But why are these not used...

`pdfminer3k` was removed from Pip a few days ago. This completely broke `pip install minecart`, as `pdfminer3k` is a dependency of `minecart`. Since then, a different user seems to have...

@felipeochoa it appears that `pdfminer3k` has been removed from `pip` https://pypi.org/project/pdfminer3k/ . Consequently, `pip install minecart` no longer works: ``` > pip install minecart Collecting minecart Using cached minecart-0.3.0-py3-none-any.whl (23...

It would be great to be able to check the type of rendering applied to a given lettering (whether it was stroked, filled, used as clipping, etc.)
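For reference, the PDF format defines eight text rendering modes (set with the `Tr` operator), which cover exactly the fill, stroke and clip combinations mentioned in this request. The sketch below shows one way such a value could be decoded. The names `TextRenderMode` and `describe_render_mode` are hypothetical; nothing here is an existing minecart API.

```python
from enum import IntEnum


class TextRenderMode(IntEnum):
    """The eight text rendering modes defined by the PDF format."""
    FILL = 0
    STROKE = 1
    FILL_STROKE = 2
    INVISIBLE = 3
    FILL_CLIP = 4
    STROKE_CLIP = 5
    FILL_STROKE_CLIP = 6
    CLIP = 7


def describe_render_mode(mode):
    """Break a rendering mode into filled / stroked / clipping flags."""
    m = TextRenderMode(mode)  # raises ValueError for values outside 0..7
    filled = m in (TextRenderMode.FILL, TextRenderMode.FILL_STROKE,
                   TextRenderMode.FILL_CLIP, TextRenderMode.FILL_STROKE_CLIP)
    stroked = m in (TextRenderMode.STROKE, TextRenderMode.FILL_STROKE,
                    TextRenderMode.STROKE_CLIP, TextRenderMode.FILL_STROKE_CLIP)
    clipping = m >= TextRenderMode.FILL_CLIP  # modes 4 to 7 add to the clip path
    return {"filled": filled, "stroked": stroked, "clipping": clipping}


print(describe_render_mode(2))
```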

enhancement

There are two images in the PDF that I am trying to read: the first is the logo and the second is the handwritten signature. The library is able to read the logo...

Hello, I am working with a database of PDFs containing research articles in a niche set of academic areas. I am hoping to extract all of the figures and captions. In...

bug
help wanted

Using the `PIL.Image.transform` QUAD method, we could apply the CTM to the image data, extracting a screenshot view of the image.
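As background for this idea: in PDF, an image is painted into the unit square, and the current transformation matrix (CTM) `[a b c d e f]` maps that square onto the page via `x' = a*x + c*y + e` and `y' = b*x + d*y + f`. The sketch below only computes where the four corners of the image land on the page. The CTM values are made up for illustration, and the corner order and direction of mapping that `PIL.Image.transform` expects would still need to be checked against the Pillow documentation.

```python
def apply_ctm(ctm, x, y):
    """Map a point through a PDF transformation matrix (a, b, c, d, e, f)."""
    a, b, c, d, e, f = ctm
    return (a * x + c * y + e, b * x + d * y + f)


def image_quad(ctm):
    """Page-space corners of an image, which PDF paints into the unit square."""
    unit_square = [(0, 0), (1, 0), (1, 1), (0, 1)]
    return [apply_ctm(ctm, x, y) for x, y in unit_square]


# Hypothetical CTM: scale to 200 x 100 units and place at (50, 700).
ctm = (200, 0, 0, 100, 50, 700)
print(image_quad(ctm))
```

When the CTM includes rotation or skew, the four corners no longer form an axis-aligned rectangle, which is the case a quadrilateral-based transform is meant to handle.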

enhancement