Dreaming out loud, ToC generation, Highlights, and Exports
Hi,
Just stumbled on Sioyek, and it could be the one to kick my faithful Okular out...
Just looking at the feature list, and considering scriptability: could Sioyek export the ToC, titles and highlights from PDFs (to Markdown, ideally)?
Of course, proper management of titles and highlights would be better, but maybe this is more easily feasible using a highlighting convention for title levels and content types?
This kind of feature set has been a dream of mine for so many years...
Thanks in advance for your insights, and even more for sharing your work. TY :)
It is currently possible to export the annotations (including highlights, bookmarks, etc.) to a JSON file (using the export command). It is not possible to export the ToC, though.
You can run pdftk on the PDF to export the ToC: its dump_data operation lists the document's bookmarks.
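In case it helps, here is a minimal sketch of turning that pdftk output into Markdown headings. It assumes the usual `BookmarkTitle` / `BookmarkLevel` / `BookmarkPageNumber` records that `pdftk <file> dump_data` prints; the function name `bookmarks_to_markdown` and the sample text are made up for illustration.

```python
def bookmarks_to_markdown(dump_data: str) -> str:
    """Turn the bookmark records of pdftk's dump_data output into Markdown headings."""
    lines_out = []
    title, level = None, None
    for line in dump_data.splitlines():
        # Split at the first colon only, so titles containing ":" stay intact
        key, _, value = line.partition(":")
        value = value.strip()
        if key == "BookmarkTitle":
            title = value
        elif key == "BookmarkLevel":
            level = int(value)
        elif key == "BookmarkPageNumber" and title is not None and level is not None:
            lines_out.append(f"{'#' * level} {title} (p. {value})")
            title, level = None, None
    return "\n".join(lines_out)


# Hypothetical sample in the shape pdftk prints for two bookmarks
sample = """BookmarkBegin
BookmarkTitle: Introduction
BookmarkLevel: 1
BookmarkPageNumber: 1
BookmarkBegin
BookmarkTitle: Scope: what is covered
BookmarkLevel: 2
BookmarkPageNumber: 3
"""

print(bookmarks_to_markdown(sample))
```

You would save the pdftk output to a text file first and pass its contents to the function. Non-bookmark records (InfoKey, PageMedia, and so on) are simply ignored.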
Perhaps the following script achieves what you want? Please note that it might not be the cleanest implementation, because I thought of it more as a PoC and wanted to see whether something like this is even what you imagined. :)
import sqlite3
import sys
from pathlib import Path
from time import sleep

import pymupdf
from sioyek.sioyek import AbsoluteDocumentPos, Sioyek, clean_path

if __name__ == "__main__":
    # Expect four arguments: sioyek path, local database, shared database, document path
    if len(sys.argv) > 4:
        SIOYEK_PATH = clean_path(sys.argv[1])
        LOCAL_DATABASE_FILE = clean_path(sys.argv[2])
        SHARED_DATABASE_FILE = clean_path(sys.argv[3])
        doc_path = clean_path(sys.argv[4])
    else:
        print("Did not receive the necessary arguments.")
        sys.exit(1)

    sioyek = Sioyek(SIOYEK_PATH, LOCAL_DATABASE_FILE, SHARED_DATABASE_FILE)

    # Get title and table of contents
    with pymupdf.open(doc_path) as doc:
        # Fall back to the file path if the title is missing or empty
        doc_title = doc.metadata.get("title") or doc_path
        toc = []
        # With simple=False each entry is [level, title, page (1-based), destination dict]
        for t in doc.get_toc(simple=False):
            if t[2] < 1 or "to" not in t[3]:
                # Entry does not point to a position inside this document
                continue
            # load_page expects a 0-based page number
            page = doc.load_page(t[2] - 1)
            # Get coordinates in mupdf coordinate space (which is also used by sioyek)
            c = t[3]["to"] * page.transformation_matrix
            # Keep only the relevant data from the ToC: level, text, page and x/y positions
            toc.append(["toc_entry"] + t[0:3] + [c.x, c.y])

    # Get highlights
    sioyek_doc = sioyek.get_document(doc_path)
    document_hash = sioyek_doc.get_hash()
    connection = sqlite3.connect(SHARED_DATABASE_FILE)
    # Use a placeholder instead of formatting the hash into the SQL string
    cursor = connection.execute(
        "SELECT type, desc, begin_x, begin_y FROM highlights WHERE document_path = ?",
        (document_hash,),
    )
    highlights = cursor.fetchall()
    connection.close()

    # Convert the absolute document positions to page-dependent document positions
    highlights_new = []
    for h in highlights:
        pos = sioyek_doc.to_document(AbsoluteDocumentPos(h[2], h[3]))
        # Since the page count starts with zero here, we need to add 1
        page, x_pos, y_pos = pos.page + 1, pos.offset_x, pos.offset_y
        highlights_new.append(
            ["highlight"] + list(h[0:2]) + [page, x_pos, y_pos]
        )
    sioyek_doc.close()

    # Sort ToC entries and highlights by page, then by y position
    sorted_exports = sorted(toc + highlights_new, key=lambda x: (x[3], x[5]))

    # Generate output in markdown syntax
    output = f"""---
document_hash: {document_hash}
document_path: {doc_path}
---
# Highlights from {doc_title}
"""
    for s in sorted_exports:
        if s[0] == "toc_entry":
            output += "#" * s[1] + " " + s[2]
        elif s[0] == "highlight":
            # Not sure whether we even need to consider newlines, since sioyek might not save them,
            # but better safe than sorry. Built outside the f-string so it also runs before Python 3.12.
            quoted = "\n> ".join(s[2].split("\n"))
            output += f"Highlight of type {s[1]}:\n> {quoted}"
        output += "\n" * 2

    # Save output to markdown file
    p_doc = Path(doc_path)
    p_output = p_doc.parent.joinpath(p_doc.stem + "_export.md")
    if p_output.exists():
        sioyek.set_status_string(f"File {p_output} already exists!")
    else:
        p_output.write_text(output, encoding="utf-8")
        sioyek.set_status_string(f"Successfully exported to {p_output}.")
    sleep(5)
    sioyek.set_status_string(" ")
You could add the following line to your prefs_user.config (with the appropriate path added) and afterwards call the script from sioyek as _export (or additionally assign a keyboard shortcut).
new_command _export python <path_to_the_above_script> "%{sioyek_path}" "%{local_database}" "%{shared_database}" "%{file_path}"
Note that you might have to change "%{sioyek_path}" to the path of your sioyek-AppImage in case you installed it as such.
Please also note that there might be some bugs, but from my limited testing it seems to work pretty well. I'd expect documents which are typeset with multiple columns to not play very well with the script, because it sorts the highlights and ToC entries by page and y-coordinate before considering the x-coordinate, but I'm not sure how one could easily improve upon that situation.
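One possible way to handle two-column documents, as a rough sketch only: sort by page, then by column, then by y-coordinate. This assumes that x is measured from the left page edge and that you know where the columns split. The helper `column_aware_key` and the split value `306.0` (half of a 612 pt wide page) are hypothetical and not part of the script or of sioyek.

```python
def column_aware_key(entry, column_split_x):
    """Sort key: page first, then column (left before right), then y within the column.

    `entry` uses the same layout as the export script:
    [kind, level_or_type, text, page, x, y].
    """
    page, x, y = entry[3], entry[4], entry[5]
    column = 0 if x < column_split_x else 1
    return (page, column, y)


# Hypothetical entries on one two-column page
entries = [
    ["highlight", "a", "right column, top", 1, 400.0, 50.0],
    ["highlight", "a", "left column, bottom", 1, 100.0, 700.0],
    ["toc_entry", 1, "Heading in left column", 1, 72.0, 90.0],
]

# Assumed split position: the middle of a 612 pt wide page
ordered = sorted(entries, key=lambda e: column_aware_key(e, column_split_x=306.0))
for e in ordered:
    print(e[2])
```

In the script this key would replace the `lambda` passed to `sorted`. It still breaks for headings or figures that span the full page width, since those start at the left margin and get sorted into the left column.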
thanks for this script! exactly what I was looking for
Awesome, I'm glad that it helps! :)