
Kernel Python3 segfault error 4 on a Linux machine running PyMuPDF v1.20.1 (Issue #1937)

Open weironglue opened this issue 3 years ago • 3 comments

Please provide all mandatory information!

Describe the bug (mandatory)

A clear and concise description of what the bug is.

Please see the following screenshot of the error that occurs when running PyMuPDF v1.20.1. On the Linux machine, the program stops by itself after too many "kernel Python3 segfault error 4" entries, without printing any warning or error message.

When I run this

To Reproduce (mandatory)

Explain the steps to reproduce the behavior, For example, include a minimal code snippet, example files, etc.

The Linux machine stops running because PyMuPDF generates many "error 4" segfaults. The system quits after reporting 10-20 of them. The only place where the error is visible is the Linux system log (as shown in the following screenshot).

Expected behavior (optional)

Describe what you expected to happen (if not obvious).

Screenshots (optional)

If applicable, add screenshots to help explain your problem.

PyMuPDF_Error

Your configuration (mandatory)

  • Operating system, potentially version and bitness
  • Python version, bitness
  • PyMuPDF version, installation method (wheel or generated from source).

  • Operating system: Oracle Linux V 5.4.17
  • Python version: 3.7.3
  • PyMuPDF version: 1.20.1

For example, the output of print(sys.version, "\n", sys.platform, "\n", fitz.__doc__) would be sufficient (for the first two bullets).
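The configuration details requested above could be gathered in one go with something like the following. This is only a sketch: `environment_report` is a hypothetical helper name, and it assumes PyMuPDF is importable as `fitz`, as in the template's own example.

```python
import platform
import sys


def environment_report():
    """Collect the interpreter, platform and PyMuPDF details the template asks for."""
    lines = [
        f"Python: {sys.version}",
        f"Platform: {sys.platform} ({platform.platform()})",
        f"Bitness: {platform.architecture()[0]}",
    ]
    try:
        # Imported lazily so the report still works when PyMuPDF is missing
        import fitz
        lines.append(f"PyMuPDF: {fitz.__doc__}")
    except ImportError:
        lines.append("PyMuPDF: not importable in this environment")
    return "\n".join(lines)


print(environment_report())
```

The installation method (wheel or built from source) is not detectable this way and still has to be stated by hand.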

Additional context (optional)

Add any other context about the problem here.

weironglue avatar Sep 25 '22 20:09 weironglue

I need the script (a minimal part of it only!) plus the data files which cause the error. I need to reproduce the error on my machine. The console log alone tells me nothing. You also forgot to mention how you installed the package.
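When a kernel log entry is the only trace of a segfault, Python's standard `faulthandler` module may help narrow it down, since it prints the Python-level traceback at the moment the process crashes. The sketch below assumes a hypothetical `process_document` callable and input `path` standing in for the reporter's own code and data, which are not part of this thread.

```python
import faulthandler
import sys

# Dump the Python traceback of every thread to stderr if the process
# receives a fatal signal such as SIGSEGV.
faulthandler.enable(file=sys.stderr, all_threads=True)


def run_reproducer(process_document, path):
    """Call the suspect code on one input file with the fault handler active.

    `process_document` and `path` are placeholders for the reporter's own
    function and data file.
    """
    # Log the file name first so the crashing input is identifiable
    print(f"processing {path}", file=sys.stderr, flush=True)
    return process_document(path)
```

The same handler can be switched on without touching the script by running `python -X faulthandler script.py` or by setting the `PYTHONFAULTHANDLER` environment variable.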

JorjMcKie avatar Sep 25 '22 20:09 JorjMcKie

Are there any other PyMuPDF scripts running successfully? Were there problems running earlier versions of PyMuPDF?

JorjMcKie avatar Sep 25 '22 20:09 JorjMcKie

Still waiting for material to reproduce the error ...

JorjMcKie avatar Sep 28 '22 13:09 JorjMcKie

I am going to close this because of lack of supporting evidence / reproducibility.

JorjMcKie avatar Oct 01 '22 13:10 JorjMcKie

Hi, I sent you an email yesterday. I am sending you the attached file again now. Again, the code runs without the kernel error in PyMuPDF 1.19.5. Please reply to my email to confirm you have received it. Thanks, Wei


import re
from copy import deepcopy

import fitz  # PyMuPDF

# TOKEN_WORDS, APOSTROPHE_S, VERTICAL_LABEL_LEN, BLACK and LOGGER are module-level
# names defined elsewhere in the reporter's code; they are not shown in this post.


def draw_clin_label(page, annot_text, annot_loc, font_size, fill_color, keywords):
    """Helper method to draw a new clinical summary label onto the page.
    These are placed parallel to where their associated keyword is on the page.

    Arguments
    =========
    page: PyMuPDF page
        PyMuPDF representation of a page.

    annot_text: string
        The text to write into the label.

    annot_loc: PyMuPDF Rectangle
        PyMuPDF page rectangle area in which to draw the label text.
        Only used in case of emergency, e.g. keyword wasn't found so must fall back to 2nd row.

    font_size: int
        Fontsize to use. 14 for organs, 18 for everything else.

    fill_color: PyMuPDF RGB color tuple
        Color to use for the label.

    keywords: list of tuples
        List of tuples of (keyword, frequency) for the given category on the page.
        Used to search page text to determine label placement.

    Returns
    =======
    Label drawn onto page. The label is placed on left margin across from the keyword if the keyword is found;
    otherwise, it draws it in the 2nd row.
    """
    found, lower_opacity = False, False
    top_keyword_loc = None
    align = fitz.TEXT_ALIGN_CENTER
    # Special handling for 180-degree rotated pages - in these cases, multiply each comparison by derotation matrix
    # so that y coordinates are mapped correctly. This breaks behavior for 270-degree pages so do with flag structure
    do_derotate = page.rotation == 180
    # Search page text for all keywords that were identified earlier in the pipeline
    for keyword, frequency in keywords:
        keyword_loc = None
        keyword_locs = page.search_for(keyword)
        # Pick the appearance of the keyword in the page with the lowest y0, in case there are multiple appearances
        if keyword_locs:
            if do_derotate:
                keyword_loc = min([loc * page.derotation_matrix for loc in keyword_locs], key=lambda k: k.y0)
            else:
                keyword_loc = min(keyword_locs, key=lambda k: k.y0)
        # If keyword wasn't found, try getting page words, stripping punctuation, and checking that way
        # This is necessary because PyMuPDF search_for() includes punctuation and the keyword tokens don't
        else:
            all_words = page.get_text("words")
            all_word_strings = [re.sub(TOKEN_WORDS, "", w[4]).lower() for w in all_words]
            # Handle token apostrophe s-es
            # Tokens will look like "marfan s syndrome" but PyMuPDF would be "marfan's syndrome"
            apostrophe = re.search(APOSTROPHE_S, keyword)
            # If found, replace " s" with "'s"; either way, split words into a list
            if apostrophe:
                span = apostrophe.span()
                keyword_words = (keyword[:span[0]] + "'s" + keyword[span[1]:]).split(" ")
            else:
                keyword_words = keyword.split(" ")
            found = False
            # Check for presence of keyword in word string list
            for i in range(len(all_word_strings) - len(keyword_words) + 1):
                # Stop searching if we found a match
                if found:
                    break
                # Loop through keywords
                for j in range(len(keyword_words)):
                    current_word = all_word_strings[i + j]
                    # Handle apostrophes - if the word on the page has apostrophe s, check both word equality and word
                    # equality without the apostrophe s
                    # This handles cases where the search keyword has an apostrophe s (legionnare s disease),
                    # as well as cases where the page text has an apostrophe s (procedure: the patient's condition...)
                    if current_word.endswith("'s"):
                        if current_word != keyword_words[j] and current_word[:-2] != keyword_words[j]:
                            break
                    # If the keyword wasn't found, stop looking at the current word
                    elif current_word != keyword_words[j]:
                        break
                # If the loop is exhausted (meaning we found all necessary words), we have a hit
                # So pick the first word's x/y coordinates and use that for the target label
                else:
                    start = all_words[i]
                    if do_derotate:
                        keyword_loc = fitz.Rect(start[0], start[1], start[2], start[3]) * page.derotation_matrix
                    else:
                        keyword_loc = fitz.Rect(start[0], start[1], start[2], start[3])
                    found = True
                    break
            # If the keyword STILL isn't found, try just searching for the first word
            if not found:
                keyword_locs = page.search_for(keyword.split(" ")[0])
                if keyword_locs:
                    if do_derotate:
                        keyword_loc = min([loc * page.derotation_matrix for loc in keyword_locs], key=lambda k: k.y0)
                    else:
                        keyword_loc = keyword_locs[0]
        # Skip keywords that could not be located at all; reading .y0 from None would raise AttributeError
        if keyword_loc is None:
            continue
        # Keep the keyword location if it is the highest (vertically) on the page
        if top_keyword_loc is None or keyword_loc.y0 < top_keyword_loc.y0:
            top_keyword_loc = keyword_loc
    # If we found any kind of keyword hit, draw a label at the new location
    if top_keyword_loc is not None:
        found = True
        if do_derotate:
            page_bounds = page.bound() * page.derotation_matrix
        else:
            page_bounds = page.bound()
        offset = 0
        # Make sure label does not go outside page boundary - if it would, cap at page boundary
        if top_keyword_loc.y0 + VERTICAL_LABEL_LEN > page_bounds.y1:
            offset = (top_keyword_loc.y0 + VERTICAL_LABEL_LEN) - page_bounds.y1
        # Now, determine new annot_loc - use y values from keyword, but aligned with left margin, vertically
        # Include offset information determined by boundary and occlusion checks
        annot_loc = fitz.Rect(page_bounds.x0, top_keyword_loc.y0 - offset,
                              page_bounds.x0 + 15, (top_keyword_loc.y0 + VERTICAL_LABEL_LEN) - offset)
        # Make sure label does not occlude an existing label - if it does, keep whichever is more vertical on the page
        # Only way to do this is to get the rectangle of every label and cross-reference their coordinates
        for existing_annot in page.annots():
            existing_rect = existing_annot.rect
            existing_rect_mod = deepcopy(existing_rect)
            # Add a 15 pixel buffer in both directions before checking intersection for extra safety
            existing_rect_mod.y0 = existing_rect_mod.y0 - 15
            existing_rect_mod.y1 = existing_rect_mod.y1 + 15
            # If it intersects, determine which of the two labels is further up the page
            if existing_rect_mod.intersects(annot_loc):
                # If the existing label is the one that is further up, just put the new one below it
                if existing_rect.y0 < annot_loc.y0:
                    to_move = annot_loc
                    to_keep = existing_rect
                    do_set_rect = False
                # Otherwise, prepare to move the old one down below the new one
                else:
                    to_move = existing_rect
                    to_keep = annot_loc
                    do_set_rect = True
                # Move whichever one you are moving 5 pixels below the one you are not moving
                to_move.y0 = to_keep.y1 + 5
                # Do the same offset check again to check for running over the page length
                if to_move.y0 + VERTICAL_LABEL_LEN > page_bounds.y1:
                    offset = (to_move.y0 + VERTICAL_LABEL_LEN) - page_bounds.y1
                else:
                    offset = 0
                # Move the label you're moving
                to_move.y1 = (to_move.y0 + VERTICAL_LABEL_LEN) - offset
                # Call set_rect() if you're moving the existing label, as that is how you move an existing label
                if do_set_rect:
                    existing_annot.set_rect(existing_rect)
                # Otherwise, just set the new annotation location to the moved label
                else:
                    annot_loc = to_move
        # Finally, look inside annot_loc to see if there is any text in it; if there is, also lower opacity
        # Do this by searching in annot_loc plus a slight extra x buffer to ensure confidence and cover edge cases
        search_loc = fitz.Rect(annot_loc.x0, annot_loc.y0, annot_loc.x1 + 5, annot_loc.y1)
        if page.get_textbox(search_loc):
            lower_opacity = True
            LOGGER.debug("Found overlap between text and clinical summary label on %s page %i, lowering opacity.",
                         page.parent.name, page.number + 1)
    # If we didn't find any keywords on the page, fall back to top-of-page logic (2nd row with diagnostics)
    # This will revert fontsize, remove line breaks from the annotation text, and un-center the label text
    if not found:
        font_size = 18
        annot_text = annot_text.replace("\n", "")
        align = fitz.TEXT_ALIGN_LEFT
    # Draw actual label now
    label_annot = page.add_freetext_annot(annot_loc * page.derotation_matrix, annot_text, fontsize=font_size,
                                          fill_color=fill_color, text_color=BLACK, align=align, rotate=page.rotation)
    label_annot.set_info({"title": "NLP"})
    # Set opacity to 50% if we determined there's overlap with any text
    if lower_opacity:
        label_annot.update(opacity=0.50, fill_color=fill_color)
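The fallback word matching in the function above can be exercised without any PDF, which may help separate pure-Python logic from the PyMuPDF calls when hunting for the crash. The following is a minimal sketch of that sliding-window comparison. The two regular expressions are hypothetical stand-ins, because the real `TOKEN_WORDS` and `APOSTROPHE_S` patterns are not shown in this thread, and `find_keyword_index` is an illustrative name.

```python
import re

# Hypothetical stand-ins for the reporter's patterns, which are not shown.
TOKEN_WORDS = re.compile(r"[^\w']")      # strip punctuation except apostrophes
APOSTROPHE_S = re.compile(r" s(?= |$)")  # a lone " s" token, as in "marfan s syndrome"


def find_keyword_index(page_words, keyword):
    """Return the index of the page word where `keyword` starts, or None."""
    words = [TOKEN_WORDS.sub("", w).lower() for w in page_words]
    # Turn the tokenised "marfan s syndrome" back into "marfan's syndrome"
    match = APOSTROPHE_S.search(keyword)
    if match:
        keyword = keyword[:match.start()] + "'s" + keyword[match.end():]
    targets = keyword.split(" ")
    for i in range(len(words) - len(targets) + 1):
        for j, target in enumerate(targets):
            word = words[i + j]
            # A page word ending in 's may match with or without the suffix
            candidates = {word, word[:-2]} if word.endswith("'s") else {word}
            if target not in candidates:
                break
        else:
            # Inner loop finished without a mismatch: all keyword words matched
            return i
    return None


print(find_keyword_index(["Diagnosis:", "Marfan's", "syndrome."], "marfan s syndrome"))
```

In the original function, the returned index would select the entry of `page.get_text("words")` whose first four items give the label's target rectangle.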

weironglue avatar Oct 01 '22 21:10 weironglue

I can see your new post here, and I also received it as an e-mail in my inbox. But I did not receive your input file - not here, not in my e-mail inbox.

JorjMcKie avatar Oct 01 '22 21:10 JorjMcKie

Hi, I am still missing your input file. Jorj


JorjMcKie avatar Oct 01 '22 21:10 JorjMcKie

Hi, I am sending you the code as follows (it is the same draw_clin_label function posted in my previous comment). I have also attached it to my email one more time. Please let me know if you receive it. Thanks, Wei

On Saturday, October 1, 2022 at 02:50:19 PM PDT, Jorj X. McKie ***@***.***> wrote:  

Hi, I am still missing your input file. Jorj

Gesendet von Mailhttps://go.microsoft.com/fwlink/?LinkId=550986 für Windows

Von: @.> Gesendet: Samstag, 1. Oktober 2022 17:22 An: @.> Cc: Jorj X. @.>; State @.> Betreff: Re: [pymupdf/PyMuPDF] Kernel Python3 segfault error 4 in Linux machine running PyNuPDF V1.20.1 (Issue #1937)

Hi, I sent you an email yesterday. I am sending you the attached file again now. Again, the codes is running without kernel error in PyMuPDF 1.19.5. Please reply to my email to confirm you have received it. Thanks, Wei

On Saturday, October 1, 2022 at 06:44:01 AM PDT, Jorj X. McKie @.***> wrote:

I am going to close this because of lack of supporting evidence / reproducibility.

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you authored the thread.Message ID: @.***>


def draw_clin_label(page, annot_text, annot_loc, font_size, fill_color, keywords):
"""Helper method to draw a new clinical summary label onto the page.
These are placed parallel to where their associate keyword is on the page.

Arguments
=========
page: PyMuPDF page
    PyMuPDF representation of a page.

annot_text: string
    The text to write into the label.

annot_loc: PyMuPDF Rectangle
    PyMuPDF page rectangle area in which to draw the label text.
    Only used in case of emergency, e.g. keyword wasn't found so must fall back to 2nd row.

font_size: int
    Fontsize to use. 14 for organs, 18 for everything else.

fill_color: PyMuPDF RGB color tuple
    Color to use for the label.

keywords: list of tuples
    List of tuples of (keyword, frequency) for the given category on the page.
    Used to search page text to determine label placement.

Returns
=======
Label drawn onto page. The label is placed on left margin across from the keyword if the keyword is found;
otherwise, it draws it in the 2nd row.
"""
found, lower_opacity = False, False
top_keyword_loc = None
align = fitz.TEXT_ALIGN_CENTER
# Special handling for 180-degree rotated pages - in these cases, multiply each comparison by derotation matrix
# so that y coordinates are mapped correctly. This breaks behavior for 270-degree pages so do with flag structure
do_derotate = page.rotation == 180
# Search page text for all keywords that were identified earlier in the pipeline
for keyword, frequency in keywords:
    keyword_loc = None
    keyword_locs = page.search_for(keyword)
    # Pick the appearance of the keyword in the page with the lowest y0, in case there are multiple appearances
    if keyword_locs:
        if do_derotate:
            keyword_loc = min([loc * page.derotation_matrix for loc in keyword_locs], key=lambda k: k.y0)
        else:
            keyword_loc = min(keyword_locs, key=lambda k: k.y0)
    # If keyword wasn't found, try getting page words, stripping punctuation, and checking that way
    # This is necessary because PyMuPDF search_for() includes punctuation and the keyword tokens don't
    else:
        all_words = page.get_text("words")
        all_word_strings = [re.sub(TOKEN_WORDS, "", w[4]).lower() for w in all_words]
        # Handle token apostrophe s-es
        # Tokens will look like "marfan s syndrome" but PyMuPDF would be "marfan's syndrome"
        apostrophe = re.search(APOSTROPHE_S, keyword)
        # If found, replace " s" with "'s"; either way, split words into a list
        if apostrophe:
            span = apostrophe.span()
            keyword_words = (keyword[:span[0]] + "'s" + keyword[span[1]:]).split(" ")
        else:
            keyword_words = keyword.split(" ")
        found = False
        # Check for presence of keyword in word string list
        for i in range(len(all_word_strings) - len(keyword_words) + 1):
            # Stop searching if we found a match
            if found:
                break
            # Loop through keywords
            for j in range(len(keyword_words)):
                current_word = all_word_strings[i + j]
                # Handle apostrophes - if the word on the page has apostrophe s, check both word equality and word
                # equality without the apostrophe s
                # This handles cases where the search keyword has an apostrophe s (legionnaire s disease),
                # as well as cases where the page text has an apostrophe s (procedure: the patient's condition...)
                if current_word.endswith("'s"):
                    if current_word != keyword_words[j] and \
                            current_word[:len(current_word) - 2] != keyword_words[j]:
                        break
                # If the keyword wasn't found, stop looking at the current word
                elif current_word != keyword_words[j]:
                    break
            # If the loop is exhausted (meaning we found all necessary words), we have a hit
            # So pick the first word's x/y coordinates and use that for the target label
            else:
                start = all_words[i]
                if do_derotate:
                    keyword_loc = fitz.Rect(start[0], start[1], start[2], start[3]) * page.derotation_matrix
                else:
                    keyword_loc = fitz.Rect(start[0], start[1], start[2], start[3])
                found = True
                break
        # If the keyword STILL isn't found, try just searching for the first word
        if not found:
            keyword_locs = page.search_for(keyword.split(" ")[0])
            if keyword_locs:
                if do_derotate:
                    keyword_loc = min([loc * page.derotation_matrix for loc in keyword_locs], key=lambda k: k.y0)
                else:
                    keyword_loc = keyword_locs[0]
    # Keep the keyword location if it is the highest (vertically) on the page.
    # Skip keywords with no hit at all: keyword_loc is still None in that case.
    if keyword_loc is None:
        continue
    if not top_keyword_loc or keyword_loc.y0 < top_keyword_loc.y0:
        top_keyword_loc = keyword_loc
# If we found any kind of keyword hit, draw a label at the new location
if top_keyword_loc:
    found = True
    if do_derotate:
        page_bounds = page.bound() * page.derotation_matrix
    else:
        page_bounds = page.bound()
    offset = 0
    # Make sure label does not go outside page boundary - if it would, cap at page boundary
    if top_keyword_loc.y0 + VERTICAL_LABEL_LEN > page_bounds.y1:
        offset = (top_keyword_loc.y0 + VERTICAL_LABEL_LEN) - page_bounds.y1
    # Now, determine new annot_loc - use y values from keyword, but aligned with left margin, vertically
    # Include offset information determined by boundary and occlusion checks
    annot_loc = fitz.Rect(page_bounds.x0, top_keyword_loc.y0 - offset,
                          page_bounds.x0 + 15, (top_keyword_loc.y0 + VERTICAL_LABEL_LEN) - offset)
    # Make sure label does not occlude an existing label - if it does, keep whichever is more vertical on the page
    # Only way to do this is to get the rectangle of every label and cross-reference their coordinates
    for existing_annot in page.annots():
        existing_rect = existing_annot.rect
        existing_rect_mod = deepcopy(existing_rect)
        # Add a 15 pixel buffer in both directions before checking intersection for extra safety
        existing_rect_mod.y0 = existing_rect_mod.y0 - 15
        existing_rect_mod.y1 = existing_rect_mod.y1 + 15
        # If it intersects, determine which of the two labels is further up the page
        if existing_rect_mod.intersects(annot_loc):
            # If the existing label is the one that is further up, just put the new one below it
            if existing_rect.y0 < annot_loc.y0:
                to_move = annot_loc
                to_keep = existing_rect
                do_set_rect = False
            # Otherwise, prepare to move the old one down below the new one
            else:
                to_move = existing_rect
                to_keep = annot_loc
                do_set_rect = True
            # Move whichever one you are moving 5 pixels below the one you are not moving
            to_move.y0 = to_keep.y1 + 5
            # Do the same offset check again to check for running over the page length
            if to_move.y0 + VERTICAL_LABEL_LEN > page_bounds.y1:
                offset = (to_move.y0 + VERTICAL_LABEL_LEN) - page_bounds.y1
            else:
                offset = 0
            # Move the label you're moving
            to_move.y1 = (to_move.y0 + VERTICAL_LABEL_LEN) - offset
            # Call set_rect() if you're moving the existing label, as that is how you move an existing label
            if do_set_rect:
                existing_annot.set_rect(existing_rect)
            # Otherwise, just set the new annotation location to the moved label
            else:
                annot_loc = to_move
    # Finally, look inside annot_loc to see if there is any text in it; if there is, also lower opacity
    # Do this by searching in annot_loc plus a slight extra x buffer to ensure confidence and cover edge cases
    search_loc = fitz.Rect(annot_loc.x0, annot_loc.y0, annot_loc.x1 + 5, annot_loc.y1)
    if page.get_textbox(search_loc):
        lower_opacity = True
        LOGGER.debug("Found overlap between text and clinical summary label on %s page %i, lowering opacity." %
                     (page.parent.name, page.number + 1))
# If we didn't find any keywords on the page, fall back to top-of-page logic (2nd row with diagnostics)
# This will revert fontsize, remove line breaks from the annotation text, and un-center the label text
if not found:
    font_size = 18
    annot_text = annot_text.replace("\n", "")
    align = fitz.TEXT_ALIGN_LEFT
# Draw actual label now
label_annot = page.add_freetext_annot(annot_loc * page.derotation_matrix, annot_text, fontsize=font_size,
                                      fill_color=fill_color, text_color=BLACK, align=align, rotate=page.rotation)
label_annot.set_info({"title": "NLP"})
# Set opacity to 50% if we determined there's overlap with any text
if lower_opacity:
    label_annot.update(opacity=0.50, fill_color=fill_color)

weironglue avatar Oct 01 '22 22:10 weironglue

We are talking past each other: The code alone does not help! I need the file plus the code.

JorjMcKie avatar Oct 01 '22 22:10 JorjMcKie

Hi Jorj, we are processing medical records stored as PDF files. They contain Personal Identification Information and Personal Health Information that cannot be released. This is governed by law, and there are no exceptions. Please see if it is possible to debug without the actual PDF files. Sorry for the inconvenience.

Thanks, Wei

weironglue avatar Oct 11 '22 08:10 weironglue

Please see if it is possible to debug without the actual PDF files. Sorry for the inconvenience.

Then I am sorry to say that we cannot help. PyMuPDF has ways to remove (sensitive) text: determine the respective text positions, overlay them with redaction annotations, apply the redactions, and save to a new file. The resulting file will no longer contain this sensitive information. You can prove this to your authority by extracting the text and confirming that no sensitive information remains. Then confirm that the new file still causes your problem. If so, you can use it to report the problem.
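The redaction steps described above could look roughly like the following. This is only a sketch, not a vetted anonymization tool: the function names (`redact_terms`, `extract_text`, `remaining_terms`), the file paths, and the idea of passing a plain list of sensitive terms are all assumptions made for illustration. The PyMuPDF calls used are `Page.search_for()`, `Page.add_redact_annot()`, `Page.apply_redactions()` and `Document.save()`.

```python
def redact_terms(src_path, dst_path, terms):
    """Redact every occurrence of each term and save the result to a new file."""
    import fitz  # PyMuPDF; imported here so remaining_terms() works without it

    doc = fitz.open(src_path)
    for page in doc:
        for term in terms:
            # search_for() returns one rectangle per hit on this page
            for rect in page.search_for(term):
                page.add_redact_annot(rect, fill=(0, 0, 0))
        # Physically removes the text covered by the redaction annotations
        page.apply_redactions()
    # garbage=4 drops unreferenced objects from the saved file
    doc.save(dst_path, garbage=4, deflate=True)
    doc.close()


def extract_text(path):
    """Return the plain text of all pages, for checking the redacted file."""
    import fitz

    doc = fitz.open(path)
    text = "\n".join(page.get_text() for page in doc)
    doc.close()
    return text


def remaining_terms(text, terms):
    """Return the terms that still occur in the text (case-insensitive)."""
    lowered = text.lower()
    return [term for term in terms if term.lower() in lowered]
```

A possible workflow: call `redact_terms("records.pdf", "records_redacted.pdf", terms)`, then check that `remaining_terms(extract_text("records_redacted.pdf"), terms)` is empty before sharing the file. Text that exists only inside scanned images is not found by `search_for()`, so such pages would need a separate review.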

Another idea: We are currently working on the next PyMuPDF version. Possibly it would solve your problem. I could send you that version - however for Ubuntu Linux Python 3.10 only (or Windows), not Oracle Linux Python 3.7. Please drop me a note if that would help.

JorjMcKie avatar Oct 11 '22 10:10 JorjMcKie

Hi, I can upgrade our system to use Python 3.10.  Please send me that version, I can try it. Thanks, Wei


weironglue avatar Oct 14 '22 03:10 weironglue

PyMuPDF-1.20.4-cp310-cp310-linux_x86_64.zip

Unzip it and then run: python -m pip install -U PyMuPDF-1.20.4-cp310-cp310-linux_x86_64.whl

JorjMcKie avatar Oct 14 '22 08:10 JorjMcKie

I hope your machine can accept the platform tag "-linux_x86_64". If not you will have to wait for the official version.
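Before installing, a rough compatibility check of the wheel's tags against the local interpreter might look like the sketch below. It is a simplification: the helper names are hypothetical, it assumes CPython (the `cp` prefix), and pip's real tag matching considers more cases than this exact comparison, so a mismatch here is only a hint.

```python
import sys
import sysconfig


def wheel_tags(filename):
    """Split a wheel filename into (python_tag, abi_tag, platform_tag)."""
    stem = filename[:-len(".whl")] if filename.endswith(".whl") else filename
    # The last three dash-separated fields of a wheel name are its tags
    return tuple(stem.split("-")[-3:])


def local_tags():
    """Return (python_tag, platform_tag) for the running interpreter."""
    python_tag = "cp%d%d" % sys.version_info[:2]  # assumes CPython
    platform_tag = sysconfig.get_platform().replace("-", "_").replace(".", "_")
    return python_tag, platform_tag


def looks_installable(filename):
    """True if the wheel's Python and platform tags equal the local ones."""
    python_tag, _abi_tag, platform_tag = wheel_tags(filename)
    return (python_tag, platform_tag) == local_tags()


if __name__ == "__main__":
    name = "PyMuPDF-1.20.4-cp310-cp310-linux_x86_64.whl"
    print(wheel_tags(name), local_tags(), looks_installable(name))
```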

JorjMcKie avatar Oct 14 '22 08:10 JorjMcKie