Please provide all mandatory information!

Describe the bug (mandatory)

A clear and concise description of what the bug is.

Please see the following screenshot of the error that occurs when running PyMuPDF v1.20.1. On a Linux machine, the program stops by itself after too many “kernel Python3 segfault error 4” errors, without showing any warning or error message.

When I run this

To Reproduce (mandatory)

Explain the steps to reproduce the behavior. For example, include a minimal code snippet, example files, etc.

The Linux machine stops running the program because PyMuPDF generates many "error 4" segfaults. The system quits running after reporting 10-20 of these errors. The only error we can see is in the system log of the Linux machine (as shown in the screenshot below).
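Because the process dies without any Python-level message, it may help to enable the standard library's `faulthandler` module before calling into PyMuPDF. The sketch below is only a suggestion, not part of the original report; the function name `enable_crash_traceback` and the log path are assumptions chosen for illustration.

```python
import faulthandler


def enable_crash_traceback(log_path):
    """Write a Python traceback to log_path if the process gets a fatal signal.

    The file object is returned because it must stay open for as long as
    faulthandler is enabled.
    """
    log_file = open(log_path, "w")
    # On SIGSEGV, SIGFPE, SIGABRT, SIGBUS or SIGILL, dump the tracebacks
    # of all threads into the log file before the process terminates.
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


# Typical use at the very top of the script that crashes:
# crash_log = enable_crash_traceback("pymupdf_crash.log")
```

The same effect is available without code changes by starting the interpreter with `python3 -X faulthandler script.py`, which writes the traceback to stderr.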

Expected behavior (optional)

Describe what you expected to happen (if not obvious).

Screenshots (optional)

If applicable, add screenshots to help explain your problem.

PyMuPDF_Error

Your configuration (mandatory)

Operating system, potentially version and bitness
Python version, bitness
PyMuPDF version, installation method (wheel or generated from source).

Operating system: Oracle Linux V 5.4.17
Python version: 3.7.3
PyMuPDF version: 1.20.1

For example, the output of print(sys.version, "\n", sys.platform, "\n", fitz.__doc__) would be sufficient (for the first two bullets).
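The configuration details requested in this section could be collected with a short script along these lines. It is a minimal sketch: the helper name `describe_environment` is made up for illustration, and it only relies on `sys`, `platform` and the `fitz.__doc__` string mentioned in the template.

```python
import platform
import sys


def describe_environment():
    """Collect the interpreter, platform and PyMuPDF details asked for above."""
    info = {
        "python": sys.version,
        "platform": sys.platform,
        "machine": platform.machine(),
        "bits": platform.architecture()[0],
    }
    try:
        import fitz  # PyMuPDF

        # fitz.__doc__ carries the PyMuPDF and MuPDF version description.
        info["pymupdf"] = fitz.__doc__
    except ImportError:
        info["pymupdf"] = "not installed"
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

The installation method (wheel or built from source) is not detectable this way and still has to be stated by hand.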

Additional context (optional)

Add any other context about the problem here.

Sep 25 '22 20:09 weironglue

I need the script (a minimal part of it only!) plus the data files that cause the error. I need to reproduce the error on my machine. The console log alone tells me nothing. You also forgot to mention how you installed the package.
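One possible way to narrow a crash like this down to a minimal script and a single data file is to run each candidate step in its own child process and check whether that process was killed by a signal. The following is a sketch under assumptions: `run_isolated` is a hypothetical helper, and it relies on the POSIX convention that `subprocess` reports a signal-terminated child with a negative return code.

```python
import signal
import subprocess
import sys


def run_isolated(code, *args):
    """Run a Python snippet in a child process.

    Returns (crashed, signal_name). On POSIX, a negative return code means
    the child was terminated by the signal with that number.
    """
    result = subprocess.run(
        [sys.executable, "-c", code, *args],
        capture_output=True,
    )
    if result.returncode < 0:
        return True, signal.Signals(-result.returncode).name
    return False, None


# Hypothetical use: feed one PDF at a time to a small worker snippet and
# keep only the files for which crashed is True.
# crashed, name = run_isolated(WORKER_CODE, "/path/to/one.pdf")
```

A file for which this reports `SIGSEGV`, together with the worker snippet, would be the kind of reproducible material requested here.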

Sep 25 '22 20:09 JorjMcKie

Are there any other PyMuPDF scripts running successfully? Were there problems running earlier versions of PyMuPDF?

Sep 25 '22 20:09 JorjMcKie

Still waiting for material to reproduce the error ...

Sep 28 '22 13:09 JorjMcKie

I am going to close this because of lack of supporting evidence / reproducibility.

Oct 01 '22 13:10 JorjMcKie

Hi, I sent you an email yesterday. I am sending you the attached file again now. Again, the code runs without the kernel error in PyMuPDF 1.19.5. Please reply to my email to confirm that you have received it. Thanks, Wei


import re
from copy import deepcopy

import fitz  # PyMuPDF

# TOKEN_WORDS, APOSTROPHE_S, VERTICAL_LABEL_LEN, BLACK and LOGGER are module-level
# names from the original script; their definitions were not posted in this thread.


def draw_clin_label(page, annot_text, annot_loc, font_size, fill_color, keywords):
    """Helper method to draw a new clinical summary label onto the page.

    These are placed parallel to where their associated keyword is on the page.

    Arguments
    =========
    page: PyMuPDF page
        PyMuPDF representation of a page.

    annot_text: string
        The text to write into the label.

    annot_loc: PyMuPDF Rectangle
        PyMuPDF page rectangle area in which to draw the label text.
        Only used in case of emergency, e.g. keyword wasn't found so must fall back to 2nd row.

    font_size: int
        Fontsize to use. 14 for organs, 18 for everything else.

    fill_color: PyMuPDF RGB color tuple
        Color to use for the label.

    keywords: list of tuples
        List of tuples of (keyword, frequency) for the given category on the page.
        Used to search page text to determine label placement.

    Returns
    =======
    Label drawn onto page. The label is placed on left margin across from the keyword if the keyword is found;
    otherwise, it draws it in the 2nd row.
    """
    found, lower_opacity = False, False
    top_keyword_loc = None
    align = fitz.TEXT_ALIGN_CENTER
    # Special handling for 180-degree rotated pages - in these cases, multiply each comparison by derotation matrix
    # so that y coordinates are mapped correctly. This breaks behavior for 270-degree pages so do with flag structure
    do_derotate = page.rotation == 180
    # Search page text for all keywords that were identified earlier in the pipeline
    for keyword, frequency in keywords:
        keyword_loc = None
        keyword_locs = page.search_for(keyword)
        # Pick the appearance of the keyword in the page with the lowest y0, in case there are multiple appearances
        if keyword_locs:
            if do_derotate:
                keyword_loc = min([loc * page.derotation_matrix for loc in keyword_locs], key=lambda k: k.y0)
            else:
                keyword_loc = min(keyword_locs, key=lambda k: k.y0)
        # If keyword wasn't found, try getting page words, stripping punctuation, and checking that way
        # This is necessary because PyMuPDF search_for() includes punctuation and the keyword tokens don't
        else:
            all_words = page.get_text("words")
            all_word_strings = [re.sub(TOKEN_WORDS, "", w[4]).lower() for w in all_words]
            # Handle token apostrophe s-es
            # Tokens will look like "marfan s syndrome" but PyMuPDF would be "marfan's syndrome"
            apostrophe = re.search(APOSTROPHE_S, keyword)
            # If found, replace " s" with "'s"; either way, split words into a list
            if apostrophe:
                span = apostrophe.span()
                keyword_words = (keyword[:span[0]] + "'s" + keyword[span[1]:]).split(" ")
            else:
                keyword_words = keyword.split(" ")
            found = False
            # Check for presence of keyword in word string list
            for i in range(len(all_word_strings) - len(keyword_words) + 1):
                # Stop searching if we found a match
                if found:
                    break
                # Loop through keywords
                for j in range(len(keyword_words)):
                    current_word = all_word_strings[i + j]
                    # Handle apostrophes - if the word on the page has apostrophe s, check both word equality and word
                    # equality without the apostrophe s
                    # This handles cases where the search keyword has an apostrophe s (legionnaire s disease),
                    # as well as cases where the page text has an apostrophe s (procedure: the patient's condition...)
                    if current_word.endswith("'s"):
                        if current_word != keyword_words[j] and \
                                current_word[:len(current_word) - 2] != keyword_words[j]:
                            break
                    # If the keyword wasn't found, stop looking at the current word
                    elif current_word != keyword_words[j]:
                        break
                # If the loop is exhausted (meaning we found all necessary words), we have a hit
                # So pick the first word's x/y coordinates and use that for the target label
                else:
                    start = all_words[i]
                    if do_derotate:
                        keyword_loc = fitz.Rect(start[0], start[1], start[2], start[3]) * page.derotation_matrix
                    else:
                        keyword_loc = fitz.Rect(start[0], start[1], start[2], start[3])
                    found = True
                    break
            # If the keyword STILL isn't found, try just searching for the first word
            if not found:
                keyword_locs = page.search_for(keyword.split(" ")[0])
                if keyword_locs:
                    if do_derotate:
                        keyword_loc = min([loc * page.derotation_matrix for loc in keyword_locs], key=lambda k: k.y0)
                    else:
                        keyword_loc = keyword_locs[0]
        # Skip keywords that were not found at all; keyword_loc is still None in that case
        # and comparing its y0 below would raise an AttributeError
        if keyword_loc is None:
            continue
        # Keep the keyword location if it is the highest (vertically) on the page
        if not top_keyword_loc:
            top_keyword_loc = keyword_loc
        else:
            if keyword_loc.y0 < top_keyword_loc.y0:
                top_keyword_loc = keyword_loc
    # If we found any kind of keyword hit, draw a label at the new location
    if top_keyword_loc:
        found = True
        if do_derotate:
            page_bounds = page.bound() * page.derotation_matrix
        else:
            page_bounds = page.bound()
        offset = 0
        # Make sure label does not go outside page boundary - if it would, cap at page boundary
        if top_keyword_loc.y0 + VERTICAL_LABEL_LEN > page_bounds.y1:
            offset = (top_keyword_loc.y0 + VERTICAL_LABEL_LEN) - page_bounds.y1
        # Now, determine new annot_loc - use y values from keyword, but aligned with left margin, vertically
        # Include offset information determined by boundary and occlusion checks
        annot_loc = fitz.Rect(page_bounds.x0, top_keyword_loc.y0 - offset,
                              page_bounds.x0 + 15, (top_keyword_loc.y0 + VERTICAL_LABEL_LEN) - offset)
        # Make sure label does not occlude an existing label - if it does, keep whichever is more vertical on the page
        # Only way to do this is to get the rectangle of every label and cross-reference their coordinates
        for existing_annot in page.annots():
            existing_rect = existing_annot.rect
            existing_rect_mod = deepcopy(existing_rect)
            # Add a 15 pixel buffer in both directions before checking intersection for extra safety
            existing_rect_mod.y0 = existing_rect_mod.y0 - 15
            existing_rect_mod.y1 = existing_rect_mod.y1 + 15
            # If it intersects, determine which of the two labels is further up the page
            if existing_rect_mod.intersects(annot_loc):
                # If the existing label is the one that is further up, just put the new one below it
                if existing_rect.y0 < annot_loc.y0:
                    to_move = annot_loc
                    to_keep = existing_rect
                    do_set_rect = False
                # Otherwise, prepare to move the old one down below the new one
                else:
                    to_move = existing_rect
                    to_keep = annot_loc
                    do_set_rect = True
                # Move whichever one you are moving 5 pixels below the one you are not moving
                to_move.y0 = to_keep.y1 + 5
                # Do the same offset check again to check for running over the page length
                if to_move.y0 + VERTICAL_LABEL_LEN > page_bounds.y1:
                    offset = (to_move.y0 + VERTICAL_LABEL_LEN) - page_bounds.y1
                else:
                    offset = 0
                # Move the label you're moving
                to_move.y1 = (to_move.y0 + VERTICAL_LABEL_LEN) - offset
                # Call set_rect() if you're moving the existing label, as that is how you move an existing label
                if do_set_rect:
                    existing_annot.set_rect(existing_rect)
                # Otherwise, just set the new annotation location to the moved label
                else:
                    annot_loc = to_move
        # Finally, look inside annot_loc to see if there is any text in it; if there is, also lower opacity
        # Do this by searching in annot_loc plus a slight extra x buffer to ensure confidence and cover edge cases
        search_loc = fitz.Rect(annot_loc.x0, annot_loc.y0, annot_loc.x1 + 5, annot_loc.y1)
        if page.get_textbox(search_loc):
            lower_opacity = True
            LOGGER.debug("Found overlap between text and clinical summary label on %s page %i, lowering opacity." %
                         (page.parent.name, page.number + 1))
    # If we didn't find any keywords on the page, fall back to top-of-page logic (2nd row with diagnostics)
    # This will revert fontsize, remove line breaks from the annotation text, and un-center the label text
    if not found:
        font_size = 18
        annot_text = annot_text.replace("\n", "")
        align = fitz.TEXT_ALIGN_LEFT
    # Draw actual label now
    label_annot = page.add_freetext_annot(annot_loc * page.derotation_matrix, annot_text, fontsize=font_size,
                                          fill_color=fill_color, text_color=BLACK, align=align, rotate=page.rotation)
    label_annot.set_info({"title": "NLP"})
    # Set opacity to 50% if we determined there's overlap with any text
    if lower_opacity:
        label_annot.update(opacity=0.50, fill_color=fill_color)
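The fallback word matching in the function above (the sliding window over the page words with the apostrophe-s handling) can be looked at in isolation, without any PDF. The sketch below is an illustration only: `find_keyword_index` is a hypothetical helper, and the `APOSTROPHE_S` pattern is an assumed stand-in, since the real pattern was not posted in this thread.

```python
import re

# Assumed stand-in for the script's APOSTROPHE_S pattern (not shown in the thread):
# a lone "s" token following a space, as in "marfan s syndrome".
APOSTROPHE_S = re.compile(r" s\b")


def find_keyword_index(page_words, keyword):
    """Return the index of the page word where keyword starts, or None."""
    match = APOSTROPHE_S.search(keyword)
    if match:
        # Turn the token form "marfan s syndrome" into "marfan's syndrome".
        keyword = keyword[:match.start()] + "'s" + keyword[match.end():]
    keyword_words = keyword.split(" ")

    for i in range(len(page_words) - len(keyword_words) + 1):
        for j, wanted in enumerate(keyword_words):
            current = page_words[i + j]
            # A page word ending in "'s" may match with or without that suffix.
            if current.endswith("'s"):
                candidates = {current, current[:-2]}
            else:
                candidates = {current}
            if wanted not in candidates:
                break
        else:
            # The inner loop was not broken: every keyword word matched.
            return i
    return None
```

Running this kind of pure-Python check separately from the PyMuPDF calls is one way to confirm that the crash comes from the library calls (`search_for`, `get_text`, `set_rect`, `add_freetext_annot`) rather than from the matching logic.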

Oct 01 '22 21:10 weironglue

I can see your new post here, and I also received it as an e-mail in my inbox. But I did not receive your input file - not here, not in my e-mail inbox.

Oct 01 '22 21:10 JorjMcKie

Hi, I am still missing your input file. Jorj


Oct 01 '22 21:10 JorjMcKie

Hi, I am sending you the code again; it is the same draw_clin_label function posted above. I also attached it to my email one more time. Please let me know if you receive it. Thanks, Wei


If we found any kind of keyword hit, draw a label at the new location

if top_keyword_loc: found = True if do_derotate: page_bounds = page.bound() * page.derotation_matrix else: page_bounds = page.bound() offset = 0

Make sure label does not go outside page boundary - if it would, cap at page boundary

if top_keyword_loc.y0 + VERTICAL_LABEL_LEN > page_bounds.y1: offset = (top_keyword_loc.y0 + VERTICAL_LABEL_LEN) - page_bounds.y1

Now, determine new annot_loc - use y values from keyword, but aligned with left margin, vertically

Include offset information determined by boundary and occlusion checks

annot_loc = fitz.Rect(page_bounds.x0, top_keyword_loc.y0 - offset, page_bounds.x0 + 15, (top_keyword_loc.y0 + VERTICAL_LABEL_LEN) - offset)

Make sure label does not occlude an existing label - if it does, keep whichever is more vertical on the page

Only way to do this is to get the rectangle of every label and cross-reference their coordinates

for existing_annot in page.annots(): existing_rect = existing_annot.rect existing_rect_mod = deepcopy(existing_rect)

Add a 15 pixel buffer in both directions before checking intersection for extra safety

existing_rect_mod.y0 = existing_rect_mod.y0 - 15 existing_rect_mod.y1 = existing_rect_mod.y1 + 15

If it intersects, determine which of the two labels is further up the page

if existing_rect_mod.intersects(annot_loc):

If the existing label is the one that is further up, just put the new one below it

if existing_rect.y0 < annot_loc.y0: to_move = annot_loc to_keep = existing_rect do_set_rect = False

Otherwise, prepare to move the old one down below the new one

else: to_move = existing_rect to_keep = annot_loc do_set_rect = True

Move whichever one you are moving 5 pixels below the one you are not moving

to_move.y0 = to_keep.y1 + 5

Do the same offset check again to check for running over the page length

if to_move.y0 + VERTICAL_LABEL_LEN > page_bounds.y1: offset = (to_move.y0 + VERTICAL_LABEL_LEN) - page_bounds.y1 else: offset = 0

Move the label you're moving

to_move.y1 = (to_move.y0 + VERTICAL_LABEL_LEN) - offset

Call set_rect() if you're moving the existing label, as that is how you move an existing label

if do_set_rect: existing_annot.set_rect(existing_rect)

Otherwise, just set the new annotation location to the moved label

else: annot_loc = to_move

Finally, look inside annot_loc to see if there is any text in it; if there is, also lower opacity

Do this by searching in annot_loc plus a slight extra x buffer to ensure confidence and cover edge cases

search_loc = fitz.Rect(annot_loc.x0, annot_loc.y0, annot_loc.x1 + 5, annot_loc.y1) if page.get_textbox(search_loc): lower_opacity = True LOGGER.debug("Found overlap between text and clinical summary label on %s page %i, lowering opacity." % (page.parent.name, page.number + 1))

If we didn't find any keywords on the page, fall back to top-of-page logic (2nd row with diagnostics)

This will revert fontsize, remove line breaks from the annotation text, and un-center the label text

if not found: font_size = 18 annot_text = annot_text.replace("\n", "") align = fitz.TEXT_ALIGN_LEFT

Draw actual label now

label_annot = page.add_freetext_annot(annot_loc * page.derotation_matrix, annot_text, fontsize=font_size, fill_color=fill_color, text_color=BLACK, align=align, rotate=page.rotation) label_annot.set_info({"title": "NLP"})

Set opacity to 50% if we determined there's overlap with any text

if lower_opacity: label_annot.update(opacity=0.50, fill_color=fill_color)


import re
from copy import deepcopy

import fitz  # PyMuPDF

# LOGGER, TOKEN_WORDS, APOSTROPHE_S, VERTICAL_LABEL_LEN and BLACK are module-level names
# defined elsewhere in the project; they are not part of this excerpt.


def draw_clin_label(page, annot_text, annot_loc, font_size, fill_color, keywords):
    """Helper method to draw a new clinical summary label onto the page.
    These are placed parallel to where their associated keyword is on the page.

    Arguments
    =========
    page: PyMuPDF page
        PyMuPDF representation of a page.

    annot_text: string
        The text to write into the label.

    annot_loc: PyMuPDF Rectangle
        PyMuPDF page rectangle area in which to draw the label text.
        Only used in case of emergency, e.g. keyword wasn't found so must fall back to 2nd row.

    font_size: int
        Fontsize to use. 14 for organs, 18 for everything else.

    fill_color: PyMuPDF RGB color tuple
        Color to use for the label.

    keywords: list of tuples
        List of tuples of (keyword, frequency) for the given category on the page.
        Used to search page text to determine label placement.

    Returns
    =======
    Label drawn onto page. The label is placed on left margin across from the keyword if the keyword is found;
    otherwise, it draws it in the 2nd row.
    """
    found, lower_opacity = False, False
    top_keyword_loc = None
    align = fitz.TEXT_ALIGN_CENTER
    # Special handling for 180-degree rotated pages - in these cases, multiply each comparison by derotation matrix
    # so that y coordinates are mapped correctly. This breaks behavior for 270-degree pages so do with flag structure
    do_derotate = page.rotation == 180
    # Search page text for all keywords that were identified earlier in the pipeline
    for keyword, frequency in keywords:
        keyword_loc = None
        keyword_locs = page.search_for(keyword)
        # Pick the appearance of the keyword in the page with the lowest y0, in case there are multiple appearances
        if keyword_locs:
            if do_derotate:
                keyword_loc = min([loc * page.derotation_matrix for loc in keyword_locs], key=lambda k: k.y0)
            else:
                keyword_loc = min(keyword_locs, key=lambda k: k.y0)
        # If keyword wasn't found, try getting page words, stripping punctuation, and checking that way
        # This is necessary because PyMuPDF search_for() includes punctuation and the keyword tokens don't
        else:
            all_words = page.get_text("words")
            all_word_strings = [re.sub(TOKEN_WORDS, "", w[4]).lower() for w in all_words]
            # Handle token apostrophe s-es
            # Tokens will look like "marfan s syndrome" but PyMuPDF would be "marfan's syndrome"
            apostrophe = re.search(APOSTROPHE_S, keyword)
            # If found, replace " s" with "'s"; either way, split words into a list
            if apostrophe:
                span = apostrophe.span()
                keyword_words = (keyword[:span[0]] + "'s" + keyword[span[1]:]).split(" ")
            else:
                keyword_words = keyword.split(" ")
            found = False
            # Check for presence of keyword in word string list
            for i in range(len(all_word_strings) - len(keyword_words) + 1):
                # Compare the keyword's words against the page words starting at position i
                for j in range(len(keyword_words)):
                    current_word = all_word_strings[i + j]
                    # Handle apostrophes - if the word on the page has apostrophe s, check both word equality and
                    # word equality without the apostrophe s
                    # This handles cases where the search keyword has an apostrophe s (legionnaire s disease),
                    # as well as cases where the page text has an apostrophe s (procedure: the patient's condition...)
                    if current_word.endswith("'s"):
                        if current_word != keyword_words[j] and current_word[:-2] != keyword_words[j]:
                            break
                    # If the keyword wasn't found, stop looking at the current word
                    elif current_word != keyword_words[j]:
                        break
                # If the loop is exhausted (meaning we found all necessary words), we have a hit
                # So pick the first word's x/y coordinates and use that for the target label
                else:
                    start = all_words[i]
                    keyword_loc = fitz.Rect(start[:4])
                    if do_derotate:
                        keyword_loc = keyword_loc * page.derotation_matrix
                    found = True
                    # Stop searching, we found a match
                    break
            # If the keyword STILL isn't found, try just searching for the first word
            if not found:
                keyword_locs = page.search_for(keyword.split(" ")[0])
                if keyword_locs:
                    if do_derotate:
                        keyword_loc = min([loc * page.derotation_matrix for loc in keyword_locs], key=lambda k: k.y0)
                    else:
                        keyword_loc = keyword_locs[0]
        # Keep the keyword location if it is the highest (vertically) on the page
        # keyword_loc is still None when none of the searches produced a hit, so skip it in that case
        if keyword_loc is not None and (top_keyword_loc is None or keyword_loc.y0 < top_keyword_loc.y0):
            top_keyword_loc = keyword_loc
    # If we found any kind of keyword hit, draw a label at the new location
    if top_keyword_loc is not None:
        found = True
        if do_derotate:
            page_bounds = page.bound() * page.derotation_matrix
        else:
            page_bounds = page.bound()
        offset = 0
        # Make sure label does not go outside page boundary - if it would, cap at page boundary
        if top_keyword_loc.y0 + VERTICAL_LABEL_LEN > page_bounds.y1:
            offset = (top_keyword_loc.y0 + VERTICAL_LABEL_LEN) - page_bounds.y1
        # Now, determine new annot_loc - use y values from keyword, but aligned with left margin, vertically
        # Include offset information determined by boundary and occlusion checks
        annot_loc = fitz.Rect(page_bounds.x0, top_keyword_loc.y0 - offset,
                              page_bounds.x0 + 15, (top_keyword_loc.y0 + VERTICAL_LABEL_LEN) - offset)
        # Make sure label does not occlude an existing label - if it does, keep whichever is more vertical on the page
        # Only way to do this is to get the rectangle of every label and cross-reference their coordinates
        for existing_annot in page.annots():
            existing_rect = existing_annot.rect
            existing_rect_mod = deepcopy(existing_rect)
            # Add a 15 pixel buffer in both directions before checking intersection for extra safety
            existing_rect_mod.y0 = existing_rect_mod.y0 - 15
            existing_rect_mod.y1 = existing_rect_mod.y1 + 15
            # If it intersects, determine which of the two labels is further up the page
            if existing_rect_mod.intersects(annot_loc):
                # If the existing label is the one that is further up, just put the new one below it
                if existing_rect.y0 < annot_loc.y0:
                    to_move = annot_loc
                    to_keep = existing_rect
                    do_set_rect = False
                # Otherwise, prepare to move the old one down below the new one
                else:
                    to_move = existing_rect
                    to_keep = annot_loc
                    do_set_rect = True
                # Move whichever one you are moving 5 pixels below the one you are not moving
                to_move.y0 = to_keep.y1 + 5
                # Do the same offset check again to check for running over the page length
                if to_move.y0 + VERTICAL_LABEL_LEN > page_bounds.y1:
                    offset = (to_move.y0 + VERTICAL_LABEL_LEN) - page_bounds.y1
                else:
                    offset = 0
                # Move the label you're moving
                to_move.y1 = (to_move.y0 + VERTICAL_LABEL_LEN) - offset
                # Call set_rect() if you're moving the existing label, as that is how you move an existing label
                if do_set_rect:
                    existing_annot.set_rect(existing_rect)
                # Otherwise, just set the new annotation location to the moved label
                else:
                    annot_loc = to_move
        # Finally, look inside annot_loc to see if there is any text in it; if there is, also lower opacity
        # Do this by searching in annot_loc plus a slight extra x buffer to ensure confidence and cover edge cases
        search_loc = fitz.Rect(annot_loc.x0, annot_loc.y0, annot_loc.x1 + 5, annot_loc.y1)
        if page.get_textbox(search_loc):
            lower_opacity = True
            LOGGER.debug("Found overlap between text and clinical summary label on %s page %i, lowering opacity.",
                         page.parent.name, page.number + 1)
    # If we didn't find any keywords on the page, fall back to top-of-page logic (2nd row with diagnostics)
    # This will revert fontsize, remove line breaks from the annotation text, and un-center the label text
    if not found:
        font_size = 18
        annot_text = annot_text.replace("\n", "")
        align = fitz.TEXT_ALIGN_LEFT
    # Draw actual label now
    label_annot = page.add_freetext_annot(annot_loc * page.derotation_matrix, annot_text, fontsize=font_size,
                                          fill_color=fill_color, text_color=BLACK, align=align, rotate=page.rotation)
    label_annot.set_info({"title": "NLP"})
    # Set opacity to 50% if we determined there's overlap with any text
    if lower_opacity:
        label_annot.update(opacity=0.50, fill_color=fill_color)


Oct 01 '22 22:10 weironglue

We are talking past each other: The code alone does not help! I need the file plus the code.
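When the input files cannot be shared, one way to at least narrow down which file and which Python call triggers the crash is to process each file in its own child interpreter with the standard `faulthandler` module enabled. The following is only a sketch under assumptions: the script name `process_one.py` in the usage note is hypothetical, and the signal reporting relies on POSIX behavior (a negative return code means the child was killed by that signal).

```python
import signal
import subprocess
import sys


def run_in_child(args):
    """Run `python -X faulthandler <args>` and return (returncode, stderr)."""
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", *args],
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stderr


def describe(returncode):
    """Turn a child's return code into a short, human-readable status."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX a negative return code means the child was killed by that signal
        return "killed by " + signal.Signals(-returncode).name
    return "exit code %d" % returncode
```

Calling `run_in_child(["process_one.py", pdf_path])` once per PDF keeps a crash in one file from taking down the whole run. If a child dies with SIGSEGV, its stderr should contain the Python traceback that `faulthandler` dumps, which points at the failing call without exposing the document itself.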

Oct 01 '22 22:10 JorjMcKie

Hi Jorj, We are processing medical records in PDF files. They contain personally identifiable information and personal health information, which we are not allowed to release. This is governed by law and there is no exception. Please see if it is possible to debug without the actual PDF files. Sorry for the inconvenience.

Thanks, Wei

Oct 11 '22 08:10 weironglue

Please see if it is possible to debug without the actual PDF files. Sorry for the inconvenience.

Then I am sorry to say that we cannot help. PyMuPDF has ways to remove (sensitive) text: determine the respective text positions, overlay them with redaction annotations, apply the redactions and save to a new file. The resulting file will no longer contain this sensitive information. You can prove this to your authority by extracting the text and confirming that it no longer contains any sensitive information. Then confirm that the new file still causes your problem. If so, you can use it to report the problem.
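The redaction workflow described above could look roughly like the sketch below. It is an illustration, not a vetted de-identification tool: the list of sensitive terms is assumed to be supplied by the caller, the function names are made up for this example, and only searchable text is covered (text inside scanned images or document metadata would need separate handling).

```python
def redact_terms(page, terms, fill=(0, 0, 0)):
    """Cover each hit of each term with a redaction annotation, then apply them.

    Returns the number of redaction annotations that were added.
    """
    count = 0
    for term in terms:
        # search_for() returns one rectangle per occurrence of the term
        for rect in page.search_for(term):
            page.add_redact_annot(rect, fill=fill)
            count += 1
    if count:
        # Removes the covered text from the page content
        page.apply_redactions()
    return count


def redact_file(src_path, dst_path, terms):
    """Redact all pages of src_path, save to dst_path, return terms still extractable."""
    import fitz  # PyMuPDF, imported lazily so redact_terms can be tried without it

    doc = fitz.open(src_path)
    for page in doc:
        redact_terms(page, terms)
    # garbage=4 and deflate=True drop unreferenced objects and compress the output
    doc.save(dst_path, garbage=4, deflate=True)
    doc.close()

    # Re-open the result and check whether any term can still be extracted as text
    check = fitz.open(dst_path)
    text = "".join(page.get_text() for page in check).lower()
    check.close()
    return [term for term in terms if term.lower() in text]
```

An empty list returned by `redact_file` indicates that none of the given terms can be extracted from the new file any more. That file can then be run through the failing script to see whether the crash still occurs.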

Another idea: We are currently working on the next PyMuPDF version. Possibly it would solve your problem. I could send you that version - however for Ubuntu Linux Python 3.10 only (or Windows), not Oracle Linux Python 3.7. Please drop me a note if that would help.

Oct 11 '22 10:10 JorjMcKie

Hi, I can upgrade our system to use Python 3.10. Please send me that version, I can try it. Thanks, Wei


Oct 14 '22 03:10 weironglue

PyMuPDF-1.20.4-cp310-cp310-linux_x86_64.zip

Unzip and then do: python -m pip install -U PyMuPDF-1.20.4-cp310-cp310-linux_x86_64.whl

Oct 14 '22 08:10 JorjMcKie

I hope your machine can accept the platform tag "-linux_x86_64". If not, you will have to wait for the official version.
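A rough way to check in advance whether such a wheel fits the running interpreter is to compare the tags in its file name with what Python reports about itself. The sketch below is a simplification that assumes CPython and an exact match; pip's real tag resolution is more permissive, and `python -m pip debug --verbose` lists the tags pip actually accepts. The function names are made up for this example.

```python
import sys
import sysconfig


def parse_wheel_tags(filename):
    """Split 'name-version-python-abi-platform.whl' into its last three tags."""
    stem = filename[:-len(".whl")] if filename.endswith(".whl") else filename
    parts = stem.split("-")
    return parts[-3], parts[-2], parts[-1]


def wheel_matches(filename, version_info=None, platform=None):
    """Return True if the wheel's Python and platform tags match exactly."""
    version_info = version_info or sys.version_info
    platform = platform or sysconfig.get_platform()
    python_tag, _abi_tag, platform_tag = parse_wheel_tags(filename)
    # CPython 3.10 corresponds to the tag "cp310"
    expected_python = "cp%d%d" % (version_info[0], version_info[1])
    # sysconfig reports e.g. "linux-x86_64"; wheel tags use underscores instead
    expected_platform = platform.replace("-", "_").replace(".", "_")
    return python_tag == expected_python and platform_tag == expected_platform
```

Called with only the file name, `wheel_matches` checks against the interpreter it runs in; the two optional arguments exist so the check can be tried for another Python version or platform.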

Oct 14 '22 08:10 JorjMcKie


Kernel Python3 segfault error 4 in Linux machine running PyMuPDF V1.20.1
