remarks
Upgrade to rM 2.8
The latest version of the rM software added a feature that tries to snap your hand-drawn highlights to the text. So if the PDF has real text (not an image of text), it will "reformat" the highlight into a block over the text.
The highlight extraction for these no longer works in remarks. I tested this by making a malformed highlight on a PDF (one that rM would not "refactor"), and it shows up just fine in the annotated PDF that remarks produces. But the highlights that rM "refactors" do not show up.
I tried to look through the parse_rm_file function, but I could not easily see what the problem would be without digging into the rM output to see how the new highlights are stored.
Any chance for an update?
Okay, I did some digging around in the new highlights JSON file. It looks pretty helpful: it even gives you the highlighted text! So it might require some rewriting, but it looks like you could keep the old highlighting part and just add the new one. The new one should mostly be done already; I think all you would have to do is take the rect with its height and width and apply it.
I created a pull request with my attempt to fix this. I also tested importing into Zotero and it works like it should! Zotero is actually worse at identifying the annotations than rM, haha. Since rM parses the text under the highlight, I wonder if there is a way to "store" the text with the annotation rectangle?
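For readers following along, here is a minimal sketch of how the rect and the highlighted text could be read from a highlights JSON file and kept together. The schema used here (a top-level "highlights" list of layers, each highlight carrying "text" and a "rects" list with x, y, width, and height) is an assumption for illustration, and the sample data is made up; check it against a real file before relying on it. The function names are hypothetical and not part of remarks.

```python
import json

# Made-up sample in the ASSUMED schema; a real rM file may differ.
SAMPLE = """
{"highlights": [[
  {"text": "some highlighted words",
   "rects": [{"x": 100.0, "y": 200.0, "width": 150.0, "height": 20.0}]}
]]}
"""


def rect_to_corners(rect):
    """Convert an x/y/width/height rect into (x0, y0, x1, y1) corners."""
    x0, y0 = rect["x"], rect["y"]
    return (x0, y0, x0 + rect["width"], y0 + rect["height"])


def extract_highlights(data):
    """Return (text, corners) pairs, one per rect, so the text stays
    attached to the rectangle it was highlighted with."""
    results = []
    for layer in data.get("highlights", []):
        for highlight in layer:
            for rect in highlight.get("rects", []):
                results.append(
                    (highlight.get("text", ""), rect_to_corners(rect))
                )
    return results


highlights = extract_highlights(json.loads(SAMPLE))
```

Keeping the text next to the corners is one way to "store" it with the annotation; whether the PDF annotation itself can carry that text depends on the PDF library in use.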
@folofjc Thanks so much for this fix. It works great for me. I had to change one line in your code in remarks.py
rm_highlight_file = pathlib.Path(f"{input_dir}/{path.stem}.highlights/{rm_file.stem}.json")
For me, rm_file.stem is a number, while the highlights file name seems to be the page ID.
h_fname = pages[page_idx]
rm_highlight_file = pathlib.Path(f"{input_dir}/{path.stem}.highlights/{h_fname}.json")
Hi @czarrar. That is interesting. I don't have that issue. Are you on 2.12? I am still on 2.11 and it still works great. I have added a few protections to my code (if you try to extract highlights on a file with no highlights, it crashes; also, if your highlights are too small, it crashes. I had a highlight that rM made that was one dimensional; the rect that rM set had the same vertices). So I will check this out and maybe push another update to my master.
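The two protections described above could look roughly like the sketch below. This is not the code from the pull request; the helper names load_highlights and is_degenerate are hypothetical, and the rect keys (width, height) are assumed from the description of the highlights file earlier in the thread.

```python
import json
import pathlib


def load_highlights(highlight_file):
    """Return the parsed highlights JSON, or None when the file is
    missing (a document or page with no highlights)."""
    path = pathlib.Path(highlight_file)
    if not path.is_file():
        return None
    with path.open() as f:
        return json.load(f)


def is_degenerate(rect, min_size=0.0):
    """True when the rect has no usable area, e.g. a one-dimensional
    highlight whose corners coincide."""
    width = rect.get("width", 0)
    height = rect.get("height", 0)
    return width <= min_size or height <= min_size
```

A caller would skip the page when load_highlights returns None and skip any rect for which is_degenerate is true, instead of crashing.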
Hi @czarrar. I just checked, and the original code still works for me. I still have {path.stem}.highlights/{rm_file.stem}.json as the file with all the highlights.
What is the value of your h_fname? Is it different from rm_file.stem? rm_file.stem should be the UUID of the pdf.
@folofjc For the few files I tried, rm_file.stem is a number like 0 or 1, while my actual highlights JSON files are named with UUIDs, like 1d111126-d36a-42b7-b35e-bc4ef80f3711.json.
To work for both our cases, I can make the following change instead:
try:  # line 86
    page_idx = pages.index(f"{rm_file.stem}")
    rm_highlight_file = pathlib.Path(f"{input_dir}/{path.stem}.highlights/{rm_file.stem}.json")  # added
except:
    page_idx = int(f"{rm_file.stem}")
    h_fname = pages[page_idx]  # added
    rm_highlight_file = pathlib.Path(f"{input_dir}/{path.stem}.highlights/{h_fname}.json")  # added
pages seems to be a list with the UUID for each page. In your case, the try block should work. In my case, it will run the except block. I'm not sure why the output can come in these two forms.
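The same fallback can be written as a small standalone helper, which makes both layouts easy to test. This is only a sketch: resolve_highlight_file is a hypothetical name, and it assumes, as in the snippet above, that pages is the ordered list of page UUIDs and that a numeric .rm file name is an index into that list.

```python
import pathlib


def resolve_highlight_file(input_dir, doc_stem, rm_stem, pages):
    """Return the highlights JSON path for one .rm file.

    rm_stem is either a page UUID that appears in pages, or a page
    index such as "0" or "6" that is looked up in pages.
    """
    if rm_stem in pages:
        page_uuid = rm_stem
    else:
        # Raises ValueError if rm_stem is neither a known UUID nor a
        # number, and IndexError if the index is out of range.
        page_uuid = pages[int(rm_stem)]
    return pathlib.Path(input_dir) / f"{doc_stem}.highlights" / f"{page_uuid}.json"
```

For example, with pages = ["aaa", "bbb"], both rm_stem="bbb" and rm_stem="1" resolve to the same bbb.json file.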
@czarrar Huh. I do not get that at all. How many rm_files do you have? It looks like the for loop is giving you the index of the loop instead of the value of the item in the list (which Python is not supposed to do). What Python version are you using? Can you give me the value of rm_files as well?
It's confusing to me too @folofjc. Sorry I didn't respond to your earlier message. I have 2.12 and Python 3.8.3, and thanks for those new additions.
My rm_files are 0.rm, 1.rm, 2.rm, 6.rm, and 9.rm (so 5 of them). My highlight files are 2fa55ecf-b917-450c-94bc-5dfc71246750.json, 6da0c15c-e1dc-4978-adc3-6f14da8a0761.json, 9abb565d-1ee0-4668-94e1-1053493feffd.json, 8860e3fc-5dfb-48ce-8b2a-f5d9f3f06c4b.json, f5c597fa-5620-4710-b48c-494335c934b2.json. Here are my files for this one document if you want to take a look: https://www.dropbox.com/s/coskv9skfb4vrti/zarrar_remarks_demo.zip?dl=0.
Oh, wow. So your rM is actually storing the pages with those numbers? So it isn't Python, it is the rM. Did you use rsync to get the files? The problem is that list_pages_uuid is getting the pages from the 3b289986-47e4-472f-94fc-377350e8d2f6.content file, which lists the pages as UUIDs, the same as in your .highlights directory. However, the pages in the folder 3b289986-47e4-472f-94fc-377350e8d2f6 are named by page number, like 0, 1, 6, etc. That is really odd.
In my files, the contents of that folder have the same UUIDs as the pages in the .highlights folder.
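One quick way to tell which of the two layouts a document folder uses is to classify the .rm file names. The sketch below relies only on the standard library; naming_scheme is a hypothetical helper, not something remarks provides.

```python
import uuid


def naming_scheme(stem):
    """Classify an .rm file stem as 'index', 'uuid', or 'unknown'."""
    if stem.isdigit():
        return "index"
    try:
        uuid.UUID(stem)
    except ValueError:
        return "unknown"
    return "uuid"


# Stems taken from the file names mentioned in this thread.
schemes = [naming_scheme(s) for s in ("0", "6", "1d111126-d36a-42b7-b35e-bc4ef80f3711")]
```

Running this over every .rm file in the document folder shows whether a given rM export uses page numbers, UUIDs, or a mix.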
Since @lucasrla put that except clause in there originally, it must have been for this reason. My question is why one rM would store the files with UUIDs and another would store them with page numbers.
Can you ssh into your rM and see if that folder stores them with the page numbers on the rM itself?