Dreaming out loud, ToC generation, Highlights, and Exports

Open silopolis opened this issue 1 year ago • 5 comments

Hi,

Just stumbled on Sioyek, and it could be the one to kick my faithful Okular out...

Just looking at the feature list, and considering its scriptability, could Sioyek allow exporting (to Markdown, actually) the ToC, titles and highlights from PDFs?

Of course, proper management of titles and highlights would be better, but maybe this is more easily feasible using a highlighting convention for title levels and content types?

This kind of feature set has been a dream of mine for so many years...

Thanks in advance for your insights, and even more for sharing your work. TY :)

silopolis avatar Oct 29 '24 10:10 silopolis

It is currently possible to export the annotations (including highlights, bookmarks, etc.) to a JSON file (using the export command). It is not possible to export the ToC, though.

ahrm avatar Oct 29 '24 11:10 ahrm

You can use pdftk on a PDF to export the ToC (its dump_data operation includes the bookmarks).
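
As a rough sketch of what that could look like in practice: pdftk's dump_data_utf8 operation prints the bookmarks as BookmarkBegin / BookmarkTitle / BookmarkLevel / BookmarkPageNumber records, which can be turned into Markdown headings. The function names and the sample input below are made up for illustration, and read_pdftk_dump assumes pdftk is installed and on your PATH.

```python
import subprocess


def read_pdftk_dump(pdf_path: str) -> str:
    """Run `pdftk <file> dump_data_utf8` and return its output (needs pdftk on PATH)."""
    result = subprocess.run(
        ["pdftk", pdf_path, "dump_data_utf8"],
        capture_output=True, encoding="utf-8", check=True,
    )
    return result.stdout


def parse_bookmarks(dump: str) -> list[dict]:
    """Collect the Bookmark* records from the dump into a list of dicts."""
    bookmarks = []
    for line in dump.splitlines():
        if line.strip() == "BookmarkBegin":
            # Each BookmarkBegin line starts a new record
            bookmarks.append({})
            continue
        key, sep, value = line.partition(": ")
        # Ignore Info*/NumberOfPages/etc. lines and anything before the first bookmark
        if sep and key.startswith("Bookmark") and bookmarks:
            bookmarks[-1][key] = value
    return bookmarks


def bookmarks_to_markdown(bookmarks: list[dict]) -> str:
    """Render each bookmark as a Markdown heading with its page number appended."""
    lines = []
    for b in bookmarks:
        level = int(b.get("BookmarkLevel", 1))
        title = b.get("BookmarkTitle", "")
        page = b.get("BookmarkPageNumber", "?")
        lines.append(f"{'#' * level} {title} (p. {page})")
    return "\n".join(lines)


# Hypothetical sample in the shape pdftk prints; not taken from a real document.
SAMPLE_DUMP = """InfoBegin
InfoKey: Title
InfoValue: Example
NumberOfPages: 12
BookmarkBegin
BookmarkTitle: Introduction
BookmarkLevel: 1
BookmarkPageNumber: 1
BookmarkBegin
BookmarkTitle: Background
BookmarkLevel: 2
BookmarkPageNumber: 3
"""

if __name__ == "__main__":
    print(bookmarks_to_markdown(parse_bookmarks(SAMPLE_DUMP)))
```

For a real file, you would pass the result of read_pdftk_dump("some.pdf") to parse_bookmarks instead of the sample string. Note that this only covers the ToC, not the highlights.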

KorigamiK avatar Feb 07 '25 23:02 KorigamiK

Perhaps the following script achieves what you want? Please note that it might not be the cleanest implementation, because I thought of it more as a PoC and wanted to see whether something like this is even what you imagined. :)

import sqlite3
import sys
from pathlib import Path
from time import sleep

import pymupdf
from sioyek.sioyek import AbsoluteDocumentPos, Sioyek, clean_path


if __name__ == "__main__":
    # Expect four arguments: sioyek path, local database, shared database, document path
    if len(sys.argv) > 4:
        SIOYEK_PATH = clean_path(sys.argv[1])
        LOCAL_DATABASE_FILE = clean_path(sys.argv[2])
        SHARED_DATABASE_FILE = clean_path(sys.argv[3])
        doc_path = clean_path(sys.argv[4])
    else:
        print("Did not receive the necessary arguments.")
        sys.exit(1)

    sioyek = Sioyek(SIOYEK_PATH, LOCAL_DATABASE_FILE, SHARED_DATABASE_FILE)

    # Get title and table of contents
    with pymupdf.open(doc_path) as doc:
        # Fall back to the file path if the title is missing or empty
        doc_title = doc.metadata.get("title") or doc_path
        toc = []
        # Each entry is [level, title, page (1-based), destination dict]
        for t in doc.get_toc(simple=False):
            # load_page expects a 0-based page index, so use the page (not the level)
            page = doc.load_page(t[2] - 1)
            # Get coordinates in mupdf coordinate space (which is also used by sioyek)
            c = t[3]["to"] * page.transformation_matrix
            # Keep only the relevant data from the toc, namely level, text, page and x/y positions
            toc.append(["toc_entry"] + t[0:3] + [c.x, c.y])

    # Get highlights
    doc = sioyek.get_document(doc_path)
    document_hash = doc.get_hash()
    connection = sqlite3.connect(SHARED_DATABASE_FILE)
    # Use a parameterized query instead of interpolating the hash into the SQL string
    cursor = connection.execute(
        "SELECT type, desc, begin_x, begin_y FROM highlights WHERE document_path = ?",
        (document_hash,),
    )
    highlights = cursor.fetchall()
    connection.close()
    # Convert the absolute document positions to page-dependent document positions
    highlights_new = []
    for h in highlights:
        pos = doc.to_document(AbsoluteDocumentPos(h[2], h[3]))
        # Since the page count starts with zero here, we need to add 1
        page, x_pos, y_pos = pos.page + 1, pos.offset_x, pos.offset_y
        highlights_new.append(
            ["highlight"] + list(h[0:2]) + [page, x_pos, y_pos]
        )
    doc.close()

    # Sort ToC entries and highlights by position
    sorted_exports = sorted(toc + highlights_new, key=lambda x: (x[3], x[5]))

    # Generate output in markdown syntax
    output = f"""---
document_hash: {document_hash}
document_path: {doc_path}
---
# Highlights from {doc_title}

"""
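    for s in sorted_exports:
        if s[0] == "toc_entry":
            output += "#" * s[1] + " " + s[2]
        elif s[0] == "highlight":
            # Not sure whether we even need to consider newlines, since sioyek might not even save them,
            # but better safe than sorry.
            # The quote is built outside the f-string, because nesting double quotes and
            # backslashes inside an f-string expression is a syntax error before Python 3.12.
            quoted = "\n> ".join(s[2].split("\n"))
            output += f"Highlight of type {s[1]}:\n> {quoted}"
        output += "\n" * 2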

    # Save output to markdown file
    p_doc = Path(doc_path)
    p_output = p_doc.parent.joinpath(p_doc.stem + "_export.md")
    if p_output.exists():
        sioyek.set_status_string(f"File {p_output} already exists!")
    else:
        # Write first, so that success is only reported once the file has been saved
        p_output.write_text(output, encoding="utf-8")
        sioyek.set_status_string(f"Successfully exported to {p_output}.")
    sleep(5)
    sioyek.set_status_string(" ")

You could add the following line to your prefs_user.config (with the appropriate path added) and afterwards call the script from sioyek as _export (or additionally assign a keyboard shortcut).

new_command _export python <path_to_the_above_script> "%{sioyek_path}" "%{local_database}" "%{shared_database}" "%{file_path}"

Note that you might have to change "%{sioyek_path}" to the path of your sioyek-AppImage in case you installed it as such.


Please also note that there might be some bugs, but from my limited testing it seems to work pretty well. I'd expect documents typeset in multiple columns not to play well with the script, because it sorts the highlights and ToC entries by page and y-coordinate only, without taking the x-coordinate into account, but I'm not sure how one could easily improve upon that situation.
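
One possible direction for multi-column documents, purely as a sketch: assign each entry to a column based on its x-coordinate and sort by page, then column, then y. The helpers column_of and sort_reading_order and the boundaries parameter are hypothetical (they are not part of sioyek or pymupdf), the coordinates are made up, and you would have to supply the column boundaries yourself for each document. It also assumes that ToC and highlight x-coordinates share the same origin, which I have not verified.

```python
from bisect import bisect_right


def column_of(x: float, boundaries: list[float]) -> int:
    """Return the index of the column containing x, given ascending x-boundaries between columns."""
    return bisect_right(boundaries, x)


def sort_reading_order(entries: list[list], boundaries: list[float]) -> list[list]:
    """Sort by page, then column, then y, so that a left column is read before the right one."""
    return sorted(entries, key=lambda e: (e[3], column_of(e[4], boundaries), e[5]))


# Entries use the same layout as in the script: [kind, level/type, text, page, x, y].
# Made-up coordinates for a two-column page whose gutter sits at x = 300.
entries = [
    ["highlight", "a", "right column, top", 1, 400.0, 100.0],
    ["highlight", "a", "left column, bottom", 1, 80.0, 700.0],
    ["toc_entry", 1, "Introduction", 1, 72.0, 90.0],
    ["highlight", "b", "next page", 2, 90.0, 50.0],
]

ordered = sort_reading_order(entries, boundaries=[300.0])
for e in ordered:
    print(e[2])
```

With an empty boundaries list every entry lands in column 0, so the ordering is the same as in the script above. A heading that spans the full page width would still be assigned to whichever column its x-coordinate falls in, so this is far from a complete solution.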

fdf-uni avatar Mar 11 '25 14:03 fdf-uni

thanks for this script! exactly what I was looking for

gregorylearns avatar Aug 02 '25 04:08 gregorylearns

Awesome, I'm glad that it helps! :)

fdf-uni avatar Sep 23 '25 10:09 fdf-uni