PageIndex

initial run, key error

Open bsenftner opened this issue 8 months ago • 7 comments

Hi, I tried downloading the repo and following the usage instructions. I get:

  File "/home/bsenftner/PageIndex/utils.py", line 498, in convert_physical_index_to_int
    if isinstance(data[i]['physical_index'], str):
KeyError: 'physical_index'

after this output:

Parsing PDF...
start find_toc_pages
toc found
start detect_page_index
start find_toc_pages
index not found
process_toc_no_page_numbers
start_index: 1
start toc_transformer
divide page_list to groups 2

I gave page_index a 53-page PDF that is four software white papers pasted together (all on machine learning).

I did not see any instructions on which Python version to use, and I happen to have an active project using 3.9, so I created a Python 3.9 conda environment for PageIndex. Do I need a later version?

bsenftner, Apr 02 '25 16:04
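
The traceback above shows a direct `data[i]['physical_index']` lookup failing on an entry that lacks the key. The following is a minimal sketch of a more tolerant conversion, not the project's actual fix. The function name `convert_physical_index_safely` is hypothetical, and the sketch assumes that `data` is a list of dicts and that string values embed the page number as digits (for example `"<physical_index_12>"`); both are assumptions, not confirmed by this thread.

```python
import re
from typing import Any, Dict, List

# Assumption: string values contain the page number as a run of digits,
# e.g. "<physical_index_12>". The real format in PageIndex may differ.
_DIGITS = re.compile(r"(\d+)")


def convert_physical_index_safely(data: List[Dict[str, Any]]) -> List[int]:
    """Convert string 'physical_index' values to int in place.

    Entries without the key are left untouched, and their positions are
    returned so the caller can decide how to handle them.
    (Hypothetical helper, not part of PageIndex.)
    """
    missing: List[int] = []
    for i, entry in enumerate(data):
        value = entry.get("physical_index")  # .get() avoids the KeyError
        if value is None:
            missing.append(i)
            continue
        if isinstance(value, str):
            match = _DIGITS.search(value)
            # Leave None when no digits are found rather than guessing.
            entry["physical_index"] = int(match.group(1)) if match else None
    return missing
```

Returning the positions of incomplete entries, rather than raising, would let a caller log or repair them instead of aborting the whole run.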

Thanks, Blake, for raising the issue. The project’s still in early development, so you might run into a few bugs or issues—especially with PDF types that haven’t been tested yet.

If you're comfortable sharing the document you're testing, we can look into the problem and use it to make the tool more robust. You can either paste the link here or send it privately via our Discord: https://discord.gg/jCPay8hY.

zmtomorrow, Apr 03 '25 05:04

I used this PDF, which was posted to HackerNews a few days ago: https://www.cs.tufts.edu/~nr/cs257/archive/mads-tofte/four-lectures.pdf

I'll try other PDFs. This was my first attempt, and since it crashed, I chose to stop investing time. Projects such as yours are very hot right now, and I'm finding dark patterns in some, so I've become wary. I did take the time to spin through the source and appreciated the logical treatment I saw, so I'm here with guarded hope!

bsenftner, Apr 03 '25 14:04

Hi Blake, I ran a quick test on the file, and it produced the correct structure on our end. You can see the result here: four-lectures_structure.json. I'll check whether the issue you encountered might be related to the Python version and will update the corresponding environment settings accordingly. I’ll let you know as soon as it's ready.

zmtomorrow, Apr 04 '25 02:04

I updated the repo this morning, tried it again, and it ran to completion without errors. Thanks for your work; it's a fine project that I'll be digging into.

bsenftner, Apr 07 '25 17:04

Thanks, Blake. I have also noticed that some complex or old PDFs are hard for PyPDF to parse, which can cause errors and instability. You can also try our hosted API version (https://pageindex.vectify.ai/), which uses our OCR model to give better results. You can leave your email in this form to receive a quota of 1,000 free pages.

zmtomorrow, Apr 08 '25 15:04
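
One way to see whether a PDF falls into the hard-to-parse group mentioned above is to check how much text each page yields before running the full pipeline. This is a rough sketch under stated assumptions: the helper names `extract_page_texts` and `pages_without_text` and the 20-character threshold are illustrative, and the sketch assumes either the `pypdf` or the `PyPDF2` package is installed. It is not how PageIndex itself validates input.

```python
from typing import List, Optional


def pages_without_text(page_texts: List[Optional[str]], min_chars: int = 20) -> List[int]:
    """Return 1-based page numbers whose extracted text is shorter than min_chars.

    The threshold is an arbitrary illustration; tune it for your documents.
    """
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len((text or "").strip()) < min_chars
    ]


def extract_page_texts(pdf_path: str) -> List[str]:
    """Extract text per page; imported lazily so the check above works without a PDF library."""
    try:
        from pypdf import PdfReader
    except ImportError:
        from PyPDF2 import PdfReader  # older package name
    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]
```

For example, `pages_without_text(extract_page_texts("four-lectures.pdf"))` would list the pages that yield little or no text; a long list suggests scanned or unusually encoded pages, where an OCR-based route may work better.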

Hi Blake, I also noticed your reply on Hacker News (https://news.ycombinator.com/item?id=43548690). I wrote a brief reply there as well, and I'm wondering whether it makes sense. I am happy to talk more about your problem if you are interested. We encountered something similar in a financial use case as well.

zmtomorrow, Apr 09 '25 15:04

Thank you for the continued interest and support. I posted a mini-update on Hacker News, mentioning that PageIndex is working for me. You understand these graph architectures better than I do; I'm still wrapping my head around the space, learning the parts and pipelines.

bsenftner, Apr 09 '25 17:04