PageIndex
initial run, key error
Hi, I tried downloading the repo and following the usage instructions. I get:
File "/home/bsenftner/PageIndex/utils.py", line 498, in convert_physical_index_to_int
    if isinstance(data[i]['physical_index'], str):
KeyError: 'physical_index'
after this output:
Parsing PDF...
start find_toc_pages
toc found
start detect_page_index
start find_toc_pages
index not found
process_toc_no_page_numbers
start_index: 1
start toc_transformer
divide page_list to groups 2
I gave page_index a 53 page PDF that is 4 software white papers pasted together. (All on machine learning.)
I did not see any instructions on which Python version to use, and I happen to have an active project using 3.9, so I created a conda environment for PageIndex with Python 3.9. Do I need a later version?
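For anyone hitting the same traceback: the KeyError means at least one entry in `data` had no `physical_index` key when `convert_physical_index_to_int` indexed it directly. Below is a minimal sketch of a defensive lookup, not the project's actual code. The function name `convert_physical_index_to_int_safe`, the sample entries, and the `<physical_index_N>` string format are all assumptions made for illustration.

```python
import re

# Hypothetical defensive variant; the real utils.py may differ.
# Assumption: string values look like "<physical_index_5>".
_INDEX_PATTERN = re.compile(r"physical_index_(\d+)")


def convert_physical_index_to_int_safe(data):
    """Convert string 'physical_index' values to ints, skipping entries without the key."""
    for item in data:
        # dict.get returns None instead of raising KeyError when the key is absent
        value = item.get("physical_index")
        if isinstance(value, str):
            match = _INDEX_PATTERN.search(value)
            # Unparseable strings become None rather than raising
            item["physical_index"] = int(match.group(1)) if match else None
    return data


if __name__ == "__main__":
    entries = [
        {"title": "Intro", "physical_index": "<physical_index_5>"},
        {"title": "Section without page info"},  # would raise KeyError with data[i]['physical_index']
    ]
    print(convert_physical_index_to_int_safe(entries))
```

Skipping the entry only hides the symptom. If the missing key comes from an upstream step returning incomplete items, that step is the place to fix it.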
Thanks, Blake, for raising the issue. The project’s still in early development, so you might run into a few bugs or issues—especially with PDF types that haven’t been tested yet.
If you're comfortable sharing the document you're testing, we can look into the problem and use it to make the tool more robust. You can either paste the link here or send it privately via our Discord: https://discord.gg/jCPay8hY.
I used this PDF, which was posted to HackerNews a few days ago: https://www.cs.tufts.edu/~nr/cs257/archive/mads-tofte/four-lectures.pdf
I'll try other PDFs. This was my first attempt, and since it crashed, I chose to stop investing time. Projects such as yours are very hot right now, and I'm finding dark patterns in some, so I've become wary. I did take the time to spin through the source and appreciated the logical treatment I saw, so I'm here with guarded hope!
Hi Blake, I ran a quick test on the file, and it produced the correct structure on our end. You can see the result here: four-lectures_structure.json. I'll check whether the issue you encountered might be related to the Python version and will update the corresponding environment settings accordingly. I’ll let you know as soon as it's ready.
I updated the repo this morning, tried it again, and it ran to completion without problems. Thanks for your work. It's a fine project that I'll be digging into.
Thanks, Blake. I've also noticed that PyPDF sometimes struggles to parse complex or old PDFs, which can cause errors and instability. You can also try our hosted API version (https://pageindex.vectify.ai/), which uses our OCR model to give better results. You can leave your email in this form to receive a quota of 1,000 free pages.
Hi Blake, I also noticed your reply on Hacker News (https://news.ycombinator.com/item?id=43548690). I wrote a brief reply there as well and am wondering whether it makes sense to you. I'm happy to talk more about your problem if you're interested. We encountered something similar in a financial use case as well.
Thank you for the continued interest and support. I posted a mini-update on Hacker News mentioning that PageRank is working for me. You understand these graph architectures better than I do; I'm still wrapping my head around the space and learning the parts and pipelines.