Upgrade to rM 2.8

Open folofjc opened this issue 3 years ago • 9 comments

In the latest version of the rM software, they added a feature that attempts to snap your hand-drawn highlights to the text. So if the PDF has text (not an image of text), it will "reformat" the highlight to be a block over the text.

The highlight extraction for this no longer works in remarks. I tested this by making a malformatted highlight on a pdf (one that rM would not "refactor") and it shows up just fine after remarks in the annotated pdf. But the highlights that rM "refactors" do not show up.

I tried to look through the parse_rm_file function, but I could not easily see what the problem would be without digging into the rM output to see how the new highlights are stored.

Any chance for an update?

folofjc avatar Sep 07 '21 14:09 folofjc

Okay, I did some digging around in the new highlights json file. It looks pretty helpful - it even gives you the highlighted text! So might require some re-writing, but looks like you could keep the old highlighting part, and just add the new one. The new one should already be done, I think all you would have to do is take the rect with the height and width and just apply it.
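
To make the "take the rect and apply it" idea concrete, here is a minimal sketch. The JSON layout is an assumption for illustration (a top-level "highlights" list of layers, each entry carrying "text" and a list of "rects" with x, y, width and height); the thread does not show the real file, so check the key names against your own highlights files. The function name extract_highlights is hypothetical and is not part of remarks.

```python
import json

# Hypothetical sample mimicking the ASSUMED layout of one page's highlights JSON.
SAMPLE = """
{"highlights": [[
  {"text": "some highlighted words",
   "rects": [{"x": 100.0, "y": 200.0, "width": 150.0, "height": 20.0}]}
]]}
"""


def extract_highlights(raw):
    """Return a list of (text, (x0, y0, x1, y1)) tuples, one per rect."""
    data = json.loads(raw)
    found = []
    for layer in data.get("highlights", []):
        for highlight in layer:
            for rect in highlight.get("rects", []):
                # Convert x/y/width/height into corner coordinates.
                box = (
                    rect["x"],
                    rect["y"],
                    rect["x"] + rect["width"],
                    rect["y"] + rect["height"],
                )
                found.append((highlight.get("text", ""), box))
    return found


results = extract_highlights(SAMPLE)
```

The boxes are still in whatever coordinate system the JSON uses; scaling them to the PDF page is not shown here. Keeping the text next to each box is also one possible way to carry the highlighted text along with the annotation rectangle.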

folofjc avatar Sep 07 '21 14:09 folofjc

I created a pull request with my attempt to fix this. I also tested importing into Zotero and it works like it should! Zotero is actually worse at identifying the annotations than rM, haha. Since rM parses the text under the highlight, I wonder if there is a way to "store" the text with the annotation rectangle?

folofjc avatar Sep 10 '21 12:09 folofjc

@folofjc Thanks so much for this fix. It works great for me. I had to change one line in your code in remarks.py

rm_highlight_file = pathlib.Path(f"{input_dir}/{path.stem}.highlights/{rm_file.stem}.json")

For me, rm_file.stem is a number, while the highlights file seems to be named after the page ID.

h_fname = pages[page_idx]
rm_highlight_file = pathlib.Path(f"{input_dir}/{path.stem}.highlights/{h_fname}.json")

czarrar avatar Feb 23 '22 04:02 czarrar

Hi @czarrar. That is interesting. I don't have that issue. Are you on 2.12? I am still on 2.11 and it still works great. I have added a few safeguards to my code: it used to crash if you tried to extract highlights from a file with no highlights, and also if a highlight was too small. (I had a highlight that rM made that was one dimensional; the rect that rM set had the same vertices.) So I will check this out and maybe push another update to my master.
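
The two crash cases mentioned above (no highlights file, and a rect with no area) could be guarded roughly like this. This is only a sketch, not the code from the pull request: the helper name load_rects and the JSON keys ("highlights", "rects", "width", "height") are assumptions for illustration.

```python
import json
import pathlib
import tempfile


def load_rects(highlight_file):
    """Return usable rects from a highlights file, or [] if the file is absent."""
    path = pathlib.Path(highlight_file)
    if not path.is_file():
        # Page has no highlights file: nothing to extract, but do not crash.
        return []
    data = json.loads(path.read_text())
    rects = []
    for layer in data.get("highlights", []):
        for highlight in layer:
            for rect in highlight.get("rects", []):
                # Skip degenerate rects (zero width or height).
                if rect.get("width", 0) > 0 and rect.get("height", 0) > 0:
                    rects.append(rect)
    return rects


# Small demonstration with made-up data in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    page = pathlib.Path(tmp) / "page.json"
    page.write_text(json.dumps({"highlights": [[{"rects": [
        {"x": 1, "y": 2, "width": 30, "height": 10},
        {"x": 5, "y": 5, "width": 0, "height": 0},
    ]}]]}))
    kept = load_rects(page)
    missing = load_rects(pathlib.Path(tmp) / "absent.json")
```

Returning an empty list in both cases lets the caller treat "no highlights" and "only unusable highlights" the same way.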

folofjc avatar Feb 23 '22 05:02 folofjc

Hi @czarrar. I just checked, and the original code still works for me. I still have {path.stem}.highlights/{rm_file.stem}.json as the file with all the highlights.

What is the value of your h_fname? Is it different than rm_file.stem? rm_file.stem should be the UUID of the pdf.

folofjc avatar Feb 23 '22 06:02 folofjc

@folofjc For the few files I tried, rm_file.stem is a number like 0 or 1, while my actual highlights json files are named with UUIDs, like 1d111126-d36a-42b7-b35e-bc4ef80f3711.json.

To work for both our cases, I can make the following change instead:

try:  # line 86
    # rm_file.stem is already a page UUID listed in pages
    page_idx = pages.index(rm_file.stem)
    rm_highlight_file = pathlib.Path(f"{input_dir}/{path.stem}.highlights/{rm_file.stem}.json")  # added
except ValueError:
    # rm_file.stem is a page number such as 0 or 1, so look up its UUID
    page_idx = int(rm_file.stem)
    h_fname = pages[page_idx]  # added
    rm_highlight_file = pathlib.Path(f"{input_dir}/{path.stem}.highlights/{h_fname}.json")  # added

pages seems to be a list with the UUID for each page. In your case, the try statement should work. In my case it will run what is in the exception part. I'm not sure why the output can be these two types.
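
The same fallback can be written as a small standalone function, which makes the two cases easier to test. This is a sketch under the assumptions described above (pages is a list of page UUIDs, and the stem is either one of those UUIDs or an index into the list); the function name highlight_path and the sample UUID strings are hypothetical.

```python
import pathlib


def highlight_path(input_dir, doc_stem, rm_stem, pages):
    """Resolve the highlights JSON for one .rm file.

    rm_stem is either a page UUID found in pages, or a page number
    (as a string) that indexes into pages.
    """
    if rm_stem in pages:
        page_uuid = rm_stem
    else:
        # Raises ValueError or IndexError if the stem is neither a known
        # UUID nor a valid page number.
        page_uuid = pages[int(rm_stem)]
    return pathlib.Path(input_dir) / f"{doc_stem}.highlights" / f"{page_uuid}.json"


# Made-up page list for illustration only.
pages = ["uuid-a", "uuid-b", "uuid-c"]
by_uuid = highlight_path("in", "doc", "uuid-a", pages)
by_index = highlight_path("in", "doc", "1", pages)
```

Testing membership first, instead of relying on an exception, keeps the UUID case and the page-number case visibly separate.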

czarrar avatar Feb 23 '22 15:02 czarrar

@czarrar Huh. I do not get that at all. How many rm_files do you have? It looks like the problem is that the for loop is giving you the index of the loop instead of the value of the item in the list (which Python is not supposed to do). What Python version are you using? Can you give me the value of rm_files as well?

folofjc avatar Feb 24 '22 09:02 folofjc

It's confusing to me too @folofjc. Sorry I didn't respond to your earlier message. I have 2.12 and Python 3.8.3, and thanks for those new additions.

My rm_files are 0.rm, 1.rm, 2.rm, 6.rm, and 9.rm (so 5 of them). My highlight files are 2fa55ecf-b917-450c-94bc-5dfc71246750.json, 6da0c15c-e1dc-4978-adc3-6f14da8a0761.json, 9abb565d-1ee0-4668-94e1-1053493feffd.json, 8860e3fc-5dfb-48ce-8b2a-f5d9f3f06c4b.json, f5c597fa-5620-4710-b48c-494335c934b2.json. Here are my files for this one document if you want to take a look: https://www.dropbox.com/s/coskv9skfb4vrti/zarrar_remarks_demo.zip?dl=0.

czarrar avatar Feb 25 '22 03:02 czarrar

Oh, wow. So your rM is actually storing the pages with those numbers? So it isn't python, it is the rM. Did you use rsync to get the files? So the problem is that list_pages_uuid is getting the pages from the 3b289986-47e4-472f-94fc-377350e8d2f6.content file, which has the pages as UUIDs, which is the same as in your .highlights directory. However, the pages in the folder 3b289986-47e4-472f-94fc-377350e8d2f6 are numbered as page numbers like 0, 1, 6, etc. That is really odd.

In my files, the contents of that folder have the same UUID for the pages in the .highlights folder.
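
One way to check which naming scheme a given document uses is to compare the .rm file stems against the page list from the .content file. The sketch below assumes the .content JSON has a top-level "pages" list of UUIDs (the thread says the pages come from that file but does not show the key name), and classify_page_files is a hypothetical helper, not a function in remarks.

```python
import json


def classify_page_files(content_json, rm_stems):
    """Map each .rm stem to ("uuid" | "index" | "unknown", resolved page UUID)."""
    pages = json.loads(content_json).get("pages", [])
    report = {}
    for stem in rm_stems:
        if stem in pages:
            # File is named after the page UUID itself.
            report[stem] = ("uuid", stem)
        elif stem.isdigit() and int(stem) < len(pages):
            # File is named after the page number; look up its UUID.
            report[stem] = ("index", pages[int(stem)])
        else:
            report[stem] = ("unknown", None)
    return report


# Made-up .content data for illustration only.
content = json.dumps({"pages": ["uuid-a", "uuid-b", "uuid-c"]})
report = classify_page_files(content, ["0", "2", "uuid-b", "99"])
```

If every stem comes back as "index", the folder is numbered by page; if every stem comes back as "uuid", it matches the .highlights directory directly.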

Since @lucasrla put that except clause in there originally, it must have been for this reason. My question is why one rM would store the files with the UUIDs and another would store them with page numbers?

Can you ssh into your rM and see if that folder stores them with the page numbers on the rM itself?

folofjc avatar Feb 25 '22 08:02 folofjc