PyMuPDF
Kernel Python3 segfault error 4 on a Linux machine running PyMuPDF V1.20.1
Please provide all mandatory information!
Describe the bug (mandatory)
A clear and concise description of what the bug is.
Please see the following screenshot of the error that occurs when running PyMuPDF V1.20.1. The program on the Linux machine stops by itself after too many "kernel Python3 segfault error 4" errors, without printing any warning or error message.
When I run this
To Reproduce (mandatory)
Explain the steps to reproduce the behavior, For example, include a minimal code snippet, example files, etc.
The Linux machine stops running because PyMuPDF generates many Error 4 errors. The Linux system quits after reporting 10-20 of these errors. The only error we can see is in the system log of the Linux machine (as shown in the following screenshot).
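Because a segfault kills the interpreter before Python can print a traceback, one way to get more than the kernel log line is the standard-library `faulthandler` module. The following is a minimal sketch, not part of the original report; the helper name `enable_crash_traceback` and the log path are assumptions chosen for illustration.

```python
import faulthandler


def enable_crash_traceback(log_path):
    """Ask CPython to write Python tracebacks to log_path on a fatal signal.

    Hypothetical helper: the name and the log location are illustrative only.
    """
    # The file object must stay open for as long as faulthandler is enabled,
    # so it is returned to the caller instead of being closed here.
    log_file = open(log_path, "w")
    # all_threads=True dumps the traceback of every running thread.
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

Calling this once at program start, before any PyMuPDF work, should leave the Python-level call stack in the log file if the process dies with SIGSEGV, which may help narrow down the failing call.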
Expected behavior (optional)
Describe what you expected to happen (if not obvious).
Screenshots (optional)
If applicable, add screenshots to help explain your problem.
Your configuration (mandatory)
- Operating system, potentially version and bitness
- Python version, bitness
- PyMuPDF version, installation method (wheel or generated from source).
- Operating system: Oracle Linux, V 5.4.17
- Python version: 3.7.3
- PyMuPDF version: 1.20.1
For example, the output of print(sys.version, "\n", sys.platform, "\n", fitz.__doc__) would be sufficient (for the first two bullets).
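The configuration details requested above could be collected in one step. This is a small sketch built around the `print` call quoted in the template; the function name `describe_environment` is an assumption, and the fallback branch is there only so the snippet also runs where PyMuPDF cannot be imported.

```python
import platform
import sys


def describe_environment():
    """Collect interpreter, platform and PyMuPDF details as one string."""
    info = [
        "Python: " + sys.version,
        "Platform: " + sys.platform,
        "Bitness: " + platform.architecture()[0],
    ]
    try:
        import fitz  # PyMuPDF

        # fitz.__doc__ carries the PyMuPDF / MuPDF version banner.
        info.append("PyMuPDF: " + str(fitz.__doc__))
    except Exception:
        # Covers a missing package as well as a broken installation.
        info.append("PyMuPDF: not importable in this environment")
    return "\n".join(info)


print(describe_environment())
```

The installation method (wheel or built from source) is not detectable this way and still has to be stated by hand.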
Additional context (optional)
Add any other context about the problem here.
I need the script (a minimal part of it only!) plus the data files which cause the error. I need to reproduce the error on my machine. The console log alone tells me nothing. You also forgot to mention how you installed the package.
Are there any other PyMuPDF scripts running successfully? Were there problems running earlier versions of PyMuPDF?
Still waiting for material to reproduce the error ...
I am going to close this because of lack of supporting evidence / reproducibility.
Hi, I sent you an email yesterday. I am sending you the attached file again now. Again, the code runs without the kernel error in PyMuPDF 1.19.5. Please reply to my email to confirm you have received it. Thanks, Wei
def draw_clin_label(page, annot_text, annot_loc, font_size, fill_color, keywords):
    """Helper method to draw a new clinical summary label onto the page. These are placed parallel to where their
    associate keyword is on the page.

    Arguments
    =========
    page: PyMuPDF page
        PyMuPDF representation of a page.
    annot_text: string
        The text to write into the label.
    annot_loc: PyMuPDF Rectangle
        PyMuPDF page rectangle area in which to draw the label text.
        Only used in case of emergency, e.g. keyword wasn't found so must fall back to 2nd row.
    font_size: int
        Fontsize to use. 14 for organs, 18 for everything else.
    fill_color: PyMuPDF RGB color tuple
        Color to use for the label.
    keywords: list of tuples
        List of tuples of (keyword, frequency) for the given category on the page.
        Used to search page text to determine label placement.

    Returns
    =======
    Label drawn onto page. The label is placed on left margin across from the keyword if the keyword is found;
    otherwise, it draws it in the 2nd row.
    """
    found, lower_opacity = False, False
    top_keyword_loc = None
    align = fitz.TEXT_ALIGN_CENTER
    # Special handling for 180-degree rotated pages - in these cases, multiply each comparison by derotation matrix
    # so that y coordinates are mapped correctly. This breaks behavior for 270-degree pages so do with flag structure
    do_derotate = page.rotation == 180
    # Search page text for all keywords that were identified earlier in the pipeline
    for keyword, frequency in keywords:
        keyword_loc = None
        keyword_locs = page.search_for(keyword)
        # Pick the appearance of the keyword in the page with the lowest y0, in case there are multiple appearances
        if keyword_locs:
            if do_derotate:
                keyword_loc = min([loc * page.derotation_matrix for loc in keyword_locs], key=lambda k: k.y0)
            else:
                keyword_loc = min(keyword_locs, key=lambda k: k.y0)
        # If keyword wasn't found, try getting page words, stripping punctuation, and checking that way
        # This is necessary because PyMuPDF search_for() includes punctuation and the keyword tokens don't
        else:
            all_words = page.get_text("words")
            all_word_strings = [re.sub(TOKEN_WORDS, "", w[4]).lower() for w in all_words]
            # Handle token apostrophe s-es
            # Tokens will look like "marfan s syndrome" but PyMuPDF would be "marfan's syndrome"
            apostrophe = re.search(APOSTROPHE_S, keyword)
            # If found, replace " s" with "'s"; either way, split words into a list
            if apostrophe:
                span = apostrophe.span()
                keyword_words = (keyword[:span[0]] + "'s" + keyword[span[1]:]).split(" ")
            else:
                keyword_words = keyword.split(" ")
            found = False
            # Check for presence of keyword in word string list
            for i in range(len(all_word_strings) - len(keyword_words) + 1):
                # Stop searching if we found a match
                if found:
                    break
                # Loop through keywords
                for j in range(len(keyword_words)):
                    current_word = all_word_strings[i + j]
                    # Handle apostrophes - if the word on the page has apostrophe s, check both word equality and word
                    # quality without the apostrophe s
                    # This handles cases where the search keyword has an apostrophe s (legionnare s disease),
                    # as well as cases where the page text has an apostrophe s (procedure: the patient's condition...)
                    if current_word.endswith("'s"):
                        if current_word != keyword_words[j] and \
                                current_word[:len(current_word) - 2] != keyword_words[j]:
                            break
                    # If the keyword wasn't found, stop looking at the current word
                    elif current_word != keyword_words[j]:
                        break
                # If the loop is exhausted (meaning we found all necessary words), we have a hit
                # So pick the first word's x/y coordinates and use that for the target label
                else:
                    start = all_words[i]
                    if do_derotate:
                        keyword_loc = fitz.Rect(start[0], start[1], start[2], start[3]) * page.derotation_matrix
                    else:
                        keyword_loc = fitz.Rect(start[0], start[1], start[2], start[3])
                    found = True
                    break
            # If the keyword STILL isn't found, try just searching for the first word
            if not found:
                keyword_locs = page.search_for(keyword.split(" ")[0])
                if keyword_locs:
                    if do_derotate:
                        keyword_loc = min([loc * page.derotation_matrix for loc in keyword_locs], key=lambda k: k.y0)
                    else:
                        keyword_loc = keyword_locs[0]
        # Keep the keyword location if it is the highest (vertically) on the page
        if not top_keyword_loc:
            top_keyword_loc = keyword_loc
        else:
            if keyword_loc.y0 < top_keyword_loc.y0:
                top_keyword_loc = keyword_loc
    # If we found any kind of keyword hit, draw a label at the new location
    if top_keyword_loc:
        found = True
        if do_derotate:
            page_bounds = page.bound() * page.derotation_matrix
        else:
            page_bounds = page.bound()
        offset = 0
        # Make sure label does not go outside page boundary - if it would, cap at page boundary
        if top_keyword_loc.y0 + VERTICAL_LABEL_LEN > page_bounds.y1:
            offset = (top_keyword_loc.y0 + VERTICAL_LABEL_LEN) - page_bounds.y1
        # Now, determine new annot_loc - use y values from keyword, but aligned with left margin, vertically
        # Include offset information determined by boundary and occlusion checks
        annot_loc = fitz.Rect(page_bounds.x0, top_keyword_loc.y0 - offset,
                              page_bounds.x0 + 15, (top_keyword_loc.y0 + VERTICAL_LABEL_LEN) - offset)
        # Make sure label does not occlude an existing label - if it does, keep whichever is more vertical on the page
        # Only way to do this is to get the rectangle of every label and cross-reference their coordinates
        for existing_annot in page.annots():
            existing_rect = existing_annot.rect
            existing_rect_mod = deepcopy(existing_rect)
            # Add a 15 pixel buffer in both directions before checking intersection for extra safety
            existing_rect_mod.y0 = existing_rect_mod.y0 - 15
            existing_rect_mod.y1 = existing_rect_mod.y1 + 15
            # If it intersects, determine which of the two labels is further up the page
            if existing_rect_mod.intersects(annot_loc):
                # If the existing label is the one that is further up, just put the new one below it
                if existing_rect.y0 < annot_loc.y0:
                    to_move = annot_loc
                    to_keep = existing_rect
                    do_set_rect = False
                # Otherwise, prepare to move the old one down below the new one
                else:
                    to_move = existing_rect
                    to_keep = annot_loc
                    do_set_rect = True
                # Move whichever one you are moving 5 pixels below the one you are not moving
                to_move.y0 = to_keep.y1 + 5
                # Do the same offset check again to check for running over the page length
                if to_move.y0 + VERTICAL_LABEL_LEN > page_bounds.y1:
                    offset = (to_move.y0 + VERTICAL_LABEL_LEN) - page_bounds.y1
                else:
                    offset = 0
                # Move the label you're moving
                to_move.y1 = (to_move.y0 + VERTICAL_LABEL_LEN) - offset
                # Call set_rect() if you're moving the existing label, as that is how you move an existing label
                if do_set_rect:
                    existing_annot.set_rect(existing_rect)
                # Otherwise, just set the new annotation location to the moved label
                else:
                    annot_loc = to_move
        # Finally, look inside annot_loc to see if there is any text in it; if there is, also lower opacity
        # Do this by searching in annot_loc plus a slight extra x buffer to ensure confidence and cover edge cases
        search_loc = fitz.Rect(annot_loc.x0, annot_loc.y0, annot_loc.x1 + 5, annot_loc.y1)
        if page.get_textbox(search_loc):
            lower_opacity = True
            LOGGER.debug("Found overlap between text and clinical summary label on %s page %i, lowering opacity." %
                         (page.parent.name, page.number + 1))
    # If we didn't find any keywords on the page, fall back to top-of-page logic (2nd row with diagnostics)
    # This will revert fontsize, remove line breaks from the annotation text, and un-center the label text
    if not found:
        font_size = 18
        annot_text = annot_text.replace("\n", "")
        align = fitz.TEXT_ALIGN_LEFT
    # Draw actual label now
    label_annot = page.add_freetext_annot(annot_loc * page.derotation_matrix, annot_text, fontsize=font_size,
                                          fill_color=fill_color, text_color=BLACK, align=align, rotate=page.rotation)
    label_annot.set_info({"title": "NLP"})
    # Set opacity to 50% if we determined there's overlap with any text
    if lower_opacity:
        label_annot.update(opacity=0.50, fill_color=fill_color)
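The word-matching fallback in the function above is pure Python, so it can be exercised without PyMuPDF, which may help when cutting the script down to a minimal reproducer. The sketch below is an illustration only: the `APOSTROPHE_S` pattern is an assumed stand-in (the real constant is not shown in this thread), and the helper names `keyword_to_words` and `find_keyword` are hypothetical.

```python
import re

# Assumed pattern: the original APOSTROPHE_S constant is not shown in the thread.
APOSTROPHE_S = re.compile(r" s\b")


def keyword_to_words(keyword):
    """Turn a token such as "marfan s syndrome" into ["marfan's", "syndrome"]."""
    match = APOSTROPHE_S.search(keyword)
    if match:
        start, end = match.span()
        keyword = keyword[:start] + "'s" + keyword[end:]
    return keyword.split(" ")


def find_keyword(page_words, keyword_words):
    """Return the index of the first page word where keyword_words matches, else None."""
    for i in range(len(page_words) - len(keyword_words) + 1):
        for j, wanted in enumerate(keyword_words):
            current = page_words[i + j]
            if current == wanted:
                continue
            # The page text may carry a possessive "'s" that the keyword lacks.
            if current.endswith("'s") and current[:-2] == wanted:
                continue
            break
        else:
            # Inner loop was not broken: every keyword word matched.
            return i
    return None
```

If this part behaves as expected on the words extracted from the failing PDF, the crash is more likely to sit in one of the PyMuPDF calls (`search_for`, `annots`, `set_rect`, `add_freetext_annot`, `update`) than in the matching logic.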
I can see your new post here, and I also received it as an e-mail in my inbox. But I did not receive your input file - not here, not in my e-mail inbox.
Hi, I am still missing your input file. Jorj
Hi, I am sending you the code as follows. I have also attached it to my email one more time. Please let me know if you receive it. Thanks, Wei
[The message repeats the draw_clin_label function posted above.]
if to_move.y0 + VERTICAL_LABEL_LEN > page_bounds.y1: offset = (to_move.y0 + VERTICAL_LABEL_LEN) - page_bounds.y1 else: offset = 0
Move the label you're moving
to_move.y1 = (to_move.y0 + VERTICAL_LABEL_LEN) - offset
Call set_rect() if you're moving the existing label, as that is how you move an existing label
if do_set_rect: existing_annot.set_rect(existing_rect)
Otherwise, just set the new annotation location to the moved label
else: annot_loc = to_move
Finally, look inside annot_loc to see if there is any text in it; if there is, also lower opacity
Do this by searching in annot_loc plus a slight extra x buffer to ensure confidence and cover edge cases
search_loc = fitz.Rect(annot_loc.x0, annot_loc.y0, annot_loc.x1 + 5, annot_loc.y1) if page.get_textbox(search_loc): lower_opacity = True LOGGER.debug("Found overlap between text and clinical summary label on %s page %i, lowering opacity." % (page.parent.name, page.number + 1))
If we didn't find any keywords on the page, fall back to top-of-page logic (2nd row with diagnostics)
This will revert fontsize, remove line breaks from the annotation text, and un-center the label text
if not found: font_size = 18 annot_text = annot_text.replace("\n", "") align = fitz.TEXT_ALIGN_LEFT
Draw actual label now
label_annot = page.add_freetext_annot(annot_loc * page.derotation_matrix, annot_text, fontsize=font_size, fill_color=fill_color, text_color=BLACK, align=align, rotate=page.rotation) label_annot.set_info({"title": "NLP"})
Set opacity to 50% if we determined there's overlap with any text
if lower_opacity: label_annot.update(opacity=0.50, fill_color=fill_color)
import re
from copy import deepcopy

import fitz  # PyMuPDF

# Note: LOGGER, TOKEN_WORDS, APOSTROPHE_S, VERTICAL_LABEL_LEN and BLACK are module-level names
# defined elsewhere in the original script; they are not part of this excerpt.


def draw_clin_label(page, annot_text, annot_loc, font_size, fill_color, keywords):
    """Helper method to draw a new clinical summary label onto the page.
    These are placed parallel to where their associated keyword is on the page.

    Arguments
    =========
    page: PyMuPDF page
        PyMuPDF representation of a page.
    annot_text: string
        The text to write into the label.
    annot_loc: PyMuPDF Rectangle
        PyMuPDF page rectangle area in which to draw the label text.
        Only used in case of emergency, e.g. keyword wasn't found so must fall back to 2nd row.
    font_size: int
        Fontsize to use. 14 for organs, 18 for everything else.
    fill_color: PyMuPDF RGB color tuple
        Color to use for the label.
    keywords: list of tuples
        List of tuples of (keyword, frequency) for the given category on the page.
        Used to search page text to determine label placement.

    Returns
    =======
    Label drawn onto page. The label is placed on left margin across from the keyword if the keyword is found;
    otherwise, it draws it in the 2nd row.
    """
    found, lower_opacity = False, False
    top_keyword_loc = None
    align = fitz.TEXT_ALIGN_CENTER

    # Special handling for 180-degree rotated pages - in these cases, multiply each comparison by derotation matrix
    # so that y coordinates are mapped correctly. This breaks behavior for 270-degree pages so do with flag structure
    do_derotate = page.rotation == 180

    # Search page text for all keywords that were identified earlier in the pipeline
    for keyword, frequency in keywords:
        keyword_loc = None
        keyword_locs = page.search_for(keyword)

        # Pick the appearance of the keyword in the page with the lowest y0, in case there are multiple appearances
        if keyword_locs:
            if do_derotate:
                keyword_loc = min([loc * page.derotation_matrix for loc in keyword_locs], key=lambda k: k.y0)
            else:
                keyword_loc = min(keyword_locs, key=lambda k: k.y0)

        # If keyword wasn't found, try getting page words, stripping punctuation, and checking that way
        # This is necessary because PyMuPDF search_for() includes punctuation and the keyword tokens don't
        else:
            all_words = page.get_text("words")
            all_word_strings = [re.sub(TOKEN_WORDS, "", w[4]).lower() for w in all_words]

            # Handle token apostrophe s-es
            # Tokens will look like "marfan s syndrome" but PyMuPDF would be "marfan's syndrome"
            apostrophe = re.search(APOSTROPHE_S, keyword)

            # If found, replace " s" with "'s"; either way, split words into a list
            if apostrophe:
                span = apostrophe.span()
                keyword_words = (keyword[:span[0]] + "'s" + keyword[span[1]:]).split(" ")
            else:
                keyword_words = keyword.split(" ")
            found = False

            # Check for presence of keyword in word string list
            for i in range(len(all_word_strings) - len(keyword_words) + 1):
                # Stop searching if we found a match
                if found:
                    break

                # Loop through the words of the keyword
                for j in range(len(keyword_words)):
                    current_word = all_word_strings[i + j]

                    # Handle apostrophes - if the word on the page has apostrophe s, check both word equality and
                    # word equality without the apostrophe s
                    # This handles cases where the search keyword has an apostrophe s (legionnaire s disease),
                    # as well as cases where the page text has an apostrophe s (procedure: the patient's condition...)
                    if current_word.endswith("'s"):
                        if current_word != keyword_words[j] and current_word[:-2] != keyword_words[j]:
                            break

                    # If the keyword wasn't found, stop looking at the current word
                    elif current_word != keyword_words[j]:
                        break

                # If the loop is exhausted (meaning we found all necessary words), we have a hit
                # So pick the first word's x/y coordinates and use that for the target label
                else:
                    start = all_words[i]
                    if do_derotate:
                        keyword_loc = fitz.Rect(start[0], start[1], start[2], start[3]) * page.derotation_matrix
                    else:
                        keyword_loc = fitz.Rect(start[0], start[1], start[2], start[3])
                    found = True
                    break

            # If the keyword STILL isn't found, try just searching for the first word
            if not found:
                keyword_locs = page.search_for(keyword.split(" ")[0])
                if keyword_locs:
                    if do_derotate:
                        keyword_loc = min([loc * page.derotation_matrix for loc in keyword_locs], key=lambda k: k.y0)
                    else:
                        keyword_loc = keyword_locs[0]

        # Keep the keyword location if it is the highest (vertically) on the page.
        # Keywords with no hit leave keyword_loc as None and are skipped here; the original compared
        # keyword_loc.y0 unconditionally, which raises AttributeError once an earlier keyword had a hit.
        if keyword_loc is not None and (top_keyword_loc is None or keyword_loc.y0 < top_keyword_loc.y0):
            top_keyword_loc = keyword_loc

    # If we found any kind of keyword hit, draw a label at the new location
    if top_keyword_loc is not None:
        found = True
        if do_derotate:
            page_bounds = page.bound() * page.derotation_matrix
        else:
            page_bounds = page.bound()
        offset = 0

        # Make sure label does not go outside page boundary - if it would, cap at page boundary
        if top_keyword_loc.y0 + VERTICAL_LABEL_LEN > page_bounds.y1:
            offset = (top_keyword_loc.y0 + VERTICAL_LABEL_LEN) - page_bounds.y1

        # Now, determine new annot_loc - use y values from keyword, but aligned with left margin, vertically
        # Include offset information determined by boundary and occlusion checks
        annot_loc = fitz.Rect(page_bounds.x0, top_keyword_loc.y0 - offset,
                              page_bounds.x0 + 15, (top_keyword_loc.y0 + VERTICAL_LABEL_LEN) - offset)

        # Make sure label does not occlude an existing label - if it does, keep whichever is more vertical on the page
        # Only way to do this is to get the rectangle of every label and cross-reference their coordinates
        for existing_annot in page.annots():
            existing_rect = existing_annot.rect
            existing_rect_mod = deepcopy(existing_rect)

            # Add a 15 pixel buffer in both directions before checking intersection for extra safety
            existing_rect_mod.y0 = existing_rect_mod.y0 - 15
            existing_rect_mod.y1 = existing_rect_mod.y1 + 15

            # If it intersects, determine which of the two labels is further up the page
            if existing_rect_mod.intersects(annot_loc):
                # If the existing label is the one that is further up, just put the new one below it
                if existing_rect.y0 < annot_loc.y0:
                    to_move = annot_loc
                    to_keep = existing_rect
                    do_set_rect = False
                # Otherwise, prepare to move the old one down below the new one
                else:
                    to_move = existing_rect
                    to_keep = annot_loc
                    do_set_rect = True

                # Move whichever one you are moving 5 pixels below the one you are not moving
                to_move.y0 = to_keep.y1 + 5

                # Do the same offset check again to check for running over the page length
                if to_move.y0 + VERTICAL_LABEL_LEN > page_bounds.y1:
                    offset = (to_move.y0 + VERTICAL_LABEL_LEN) - page_bounds.y1
                else:
                    offset = 0

                # Move the label you're moving
                to_move.y1 = (to_move.y0 + VERTICAL_LABEL_LEN) - offset

                # Call set_rect() if you're moving the existing label, as that is how you move an existing label
                if do_set_rect:
                    existing_annot.set_rect(existing_rect)
                # Otherwise, just set the new annotation location to the moved label
                else:
                    annot_loc = to_move

        # Finally, look inside annot_loc to see if there is any text in it; if there is, also lower opacity
        # Do this by searching in annot_loc plus a slight extra x buffer to ensure confidence and cover edge cases
        search_loc = fitz.Rect(annot_loc.x0, annot_loc.y0, annot_loc.x1 + 5, annot_loc.y1)
        if page.get_textbox(search_loc):
            lower_opacity = True
            LOGGER.debug("Found overlap between text and clinical summary label on %s page %i, lowering opacity." %
                         (page.parent.name, page.number + 1))

    # If we didn't find any keywords on the page, fall back to top-of-page logic (2nd row with diagnostics)
    # This will revert fontsize, remove line breaks from the annotation text, and un-center the label text
    if not found:
        font_size = 18
        annot_text = annot_text.replace("\n", "")
        align = fitz.TEXT_ALIGN_LEFT

    # Draw actual label now
    label_annot = page.add_freetext_annot(annot_loc * page.derotation_matrix, annot_text, fontsize=font_size,
                                          fill_color=fill_color, text_color=BLACK, align=align, rotate=page.rotation)
    label_annot.set_info({"title": "NLP"})

    # Set opacity to 50% if we determined there's overlap with any text
    if lower_opacity:
        label_annot.update(opacity=0.50, fill_color=fill_color)
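The word-window fallback in the function above (match the keyword's words against consecutive page words, tolerating a trailing "'s" on the page word) can be tried without any PDF at all. The following is only a sketch of that matching step; `find_phrase` is a hypothetical helper name, not part of the original script or of PyMuPDF, and it takes plain lowercase word lists instead of `page.get_text("words")` output.

```python
def find_phrase(page_words, keyword_words):
    """Return the index of the first window of page_words that matches
    keyword_words, or -1 if there is no match.

    A page word ending in "'s" is accepted if it equals the keyword word
    either as-is or with the "'s" suffix removed.
    """
    n, m = len(page_words), len(keyword_words)
    for i in range(n - m + 1):
        for j in range(m):
            word = page_words[i + j]
            target = keyword_words[j]
            matches_without_suffix = word.endswith("'s") and word[:-2] == target
            if word != target and not matches_without_suffix:
                break  # this window fails, slide to the next start index
        else:
            # Inner loop finished without break: every word matched.
            return i
    return -1


if __name__ == "__main__":
    words = ["procedure", "the", "patient's", "condition"]
    print(find_phrase(words, ["patient", "condition"]))
```

Isolating the pure-Python part like this makes it easier to tell whether a failure comes from the matching logic or from the PyMuPDF calls around it.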
We are talking past each other: The code alone does not help! I need the file plus the code.
Hi Jorj, We are processing medical records in PDF files, which contain Personally Identifiable Information and Protected Health Information that can't be released. This is governed by law and there are no exceptions. Please see if it is possible to debug without the actual PDF files. Sorry for the inconvenience.
Thanks, Wei
Please see if it is possible to debug without the actual PDF files. Sorry for the inconvenience.
Then I am sorry to say that we cannot help. PyMuPDF has ways to remove (sensitive) text: determine the respective text positions, overlay them with redaction annotations, apply the redactions, and save to a new file. The resulting file will no longer contain the sensitive information. You can prove this to your authority by extracting the text and confirming that no sensitive information remains. Then confirm that the new file still causes your problem. If so, you can use it to report the problem.
Another idea: We are currently working on the next PyMuPDF version. Possibly it would solve your problem. I could send you that version - however for Ubuntu Linux Python 3.10 only (or Windows), not Oracle Linux Python 3.7. Please drop me a note if that would help.
Hi, I can upgrade our system to use Python 3.10. Please send me that version and I will try it. Thanks, Wei
PyMuPDF-1.20.4-cp310-cp310-linux_x86_64.zip
Unzip and then do python -m pip install -U PyMuPDF-1.20.4-cp310-cp310-linux_x86_64.whl
I hope your machine can accept the platform tag "-linux_x86_64". If not, you will have to wait for the official version.
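Before running pip, a quick and deliberately simplified check of whether the wheel's tags match the running interpreter could look like the sketch below. It only compares the Python tag and the platform tag for exact equality using the standard library; pip's real compatibility logic accepts a wider set of tags, so a `False` here is a hint, not a verdict. The function names are illustrative.

```python
import sys
import sysconfig


def parse_wheel_tags(filename):
    """Split a wheel filename into (python_tag, abi_tag, platform_tag)."""
    stem = filename[:-len(".whl")] if filename.endswith(".whl") else filename
    # Layout is name-version[-build]-python-abi-platform,
    # so the three tags are always the last three fields.
    return tuple(stem.split("-")[-3:])


def local_tags():
    """Approximate CPython tag and platform tag of the running interpreter."""
    python_tag = "cp%d%d" % sys.version_info[:2]
    platform_tag = sysconfig.get_platform().replace("-", "_").replace(".", "_")
    return python_tag, platform_tag


def looks_installable(filename):
    """True if the wheel's Python and platform tags equal the local ones."""
    python_tag, _abi_tag, platform_tag = parse_wheel_tags(filename)
    return (python_tag, platform_tag) == local_tags()


if __name__ == "__main__":
    wheel = "PyMuPDF-1.20.4-cp310-cp310-linux_x86_64.whl"
    print(parse_wheel_tags(wheel), local_tags(), looks_installable(wheel))
```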