PyMuPDF
page.get_text() returns hexadecimal text for some characters
Description of the bug
get_text() returns the numbers in the Cash Flow table of this document as non-printable characters (shown as hexadecimal escapes such as \x8e in the repr output). Copying and pasting from the page, or extracting with pdftotext, yields the correct text.
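The escapes seen in the PyMuPDF output of this report are C1 control characters rather than hexadecimal digits. As a minimal sketch (the helper name `find_control_chars` is hypothetical and not part of PyMuPDF), extracted text can be screened for such characters with the standard library alone. The sample string is copied from the PyMuPDF output in this report.

```python
import unicodedata


def find_control_chars(text, allowed="\n\t\r"):
    """Return the sorted, unique control characters (Unicode category Cc)
    found in text, ignoring ordinary whitespace controls."""
    return sorted(
        {ch for ch in text if unicodedata.category(ch) == "Cc" and ch not in allowed}
    )


# Fragment of the PyMuPDF output shown in this report.
sample = "TTM\n12/31/2023\n\x8e\x91,\x96\x8e\x95,\x8d\x8d\x8d\n"

suspects = find_control_chars(sample)
print([hex(ord(ch)) for ch in suspects])
# ['0x8d', '0x8e', '0x91', '0x95', '0x96']
```

A non-empty result is only a hint that a page may be affected. It does not say what the characters should have been.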
How to reproduce the bug
Ford Motor Company (F) Cash Flow - Yahoo Finance - Yahoo Finance.pdf
import fitz
import pdftotext
import pdfplumber

def print_comparison(fn, page):
    # pymupdf
    pymupdf_doc = fitz.open(fn)
    # pdftotext
    with open(fn, "rb") as f:
        pdftotext_doc = pdftotext.PDF(f)
    # pdfplumber
    pdfplumber_doc = pdfplumber.open(fn)

    print("PyMuPDF:\n")
    print(repr(pymupdf_doc[page].get_text()))
    print("\npdftotext:\n")
    print(repr(pdftotext_doc[page]))
    print("\npdfplumber:\n")
    print(repr(pdfplumber_doc.pages[page].extract_text()))

print_comparison('Ford.Motor.Company.F.Cash.Flow.-.Yahoo.Finance.-.Yahoo.Finance.pdf', 1)
PyMuPDF:
'Related Tickers\nTTM\n12/31/2023\n12/31/2022\n12/31/2021\n12/31/2020\n\x8e\x91,\x96\x8e\x95,\x8d\x8d\x8d\n\x8e\x91,\x96\x8e\x95,\x8d\x8d\x8d\n\x93,\x95\x92\x90,\x8d\x8d\x8d\n\x8e\x92,\x94\x95\x94,\x8d\x8d\x8d\n\x8f\x91,\x8f\x93\x96,\x8d\x8d\x8d\n-\x8e\x94,\x93\x8f\x95,\x8d\x8d\x8d\n-\x8e\x94,\x93\x8f\x95,\x8d\x8d\x8d\n-\x91,\x90\x91\x94,\x8d\x8d\x8d\n\x8f,\x94\x91\x92,\x8d\x8d\x8d\n-\x8e\x95,\x93\x8e\x92,\x8d\x8d\x8d\n\x8f,\x92\x95\x91,\x8d\x8d\x8d\n\x8f,\x92\x95\x91,\x8d\x8d\x8d\n\x8f,\x92\x8e\x8e,\x8d\x8d\x8d\n-\x8f\x90,\x91\x96\x95,\x8d\x8d\x8d\n\x8f,\x90\x8e\x92,\x8d\x8d\x8d\n\x8f\x92,\x8e\x8e\x8d,\x8d\x8d\x8d\n\x8f\x92,\x8e\x8e\x8d,\x8d\x8d\x8d\n\x8f\x92,\x90\x91\x8d,\x8d\x8d\x8d\n\x8f\x8d,\x94\x90\x94,\x8d\x8d\x8d\n\x8f\x92,\x96\x90\x92,\x8d\x8d\x8d\n-\x95,\x8f\x90\x93,\x8d\x8d\x8d\n-\x95,\x8f\x90\x93,\x8d\x8d\x8d\n-\x93,\x95\x93\x93,\x8d\x8d\x8d\n-\x93,\x8f\x8f\x94,\x8d\x8d\x8d\n-\x92,\x94\x91\x8f,\x8d\x8d\x8d\n\x92\x8e,\x93\x92\x96,\x8d\x8d\x8d\n\x92\x8e,\x93\x92\x96,\x8d\x8d\x8d\n\x91\x92,\x91\x94\x8d,\x8d\x8d\x8d\n\x8f\x94,\x96\x8d\x8e,\x8d\x8d\x8d\n\x93\x92,\x96\x8d\x8d,\x8d\x8d\x8d\n-\x91\x8e,\x96\x93\x92,\x8d\x8d\x8d\n-\x91\x8e,\x96\x93\x92,\x8d\x8d\x8d\n-\x91\x92,\x93\x92\x92,\x8d\x8d\x8d\n-\x92\x91,\x8e\x93\x91,\x8d\x8d\x8d\n-\x93\x8d,\x92\x8e\x91,\x8d\x8d\x8d\n-\x90\x90\x92,\x8d\x8d\x8d\n-\x90\x90\x92,\x8d\x8d\x8d\n-\x91\x95\x91,\x8d\x8d\x8d\n--\n--\n\x93,\x93\x95\x8f,\x8d\x8d\x8d\n\x93,\x93\x95\x8f,\x8d\x8d\x8d\n-\x8e\x90,\x8d\x8d\x8d\n\x96,\x92\x93\x8d,\x8d\x8d\x8d\n\x8e\x95,\x92\x8f\x94,\x8d\x8d\x8d\n \nYahoo Finance Plus Essential\naccess required.\nUnlock Access\nBreakdown\nOperating Cash\nFlow\nInvesting Cash\nFlow\nFinancing Cash\nFlow\nEnd Cash Position\nCapital Expenditure\nIssuance of Debt\nRepayment of Debt\nRepurchase of\nCapital Stock\nFree Cash Flow\n12/31/2020 - 6/1/1972\nGM\nGeneral Motors Compa…\n39.49 +1.23%\n\xa0\nRIVN\nRivian Automotive, Inc.\n15.39 -3.15%\n\xa0\nNIO\nNIO Inc.\n5.97 +0.17%\n\xa0\nSTLA\nStellantis N.V.\n25.63 
+0.91%\n\xa0\nLCID\nLucid Group, Inc.\n3.7000 +0.54%\n\xa0\nTSLA\nTesla, Inc.\n194.77 +0.52%\n\xa0\nTM\nToyota Motor Corporati…\n227.09 +0.14%\n\xa0\nXPEV\nXPeng Inc.\n9.08 +0.89%\n\xa0\nFSR\nFisker Inc.\n0.5579 -11.46%\n\xa0\nCopyright © 2024 Yahoo.\nAll rights reserved.\nPOPULAR QUOTES\nTesla\nDAX Index\nKOSPI\nDow Jones\nS&P BSE SENSEX\nSPDR S&P 500 ETF Trust\nEXPLORE MORE\nCredit Score Management\nHousing Market\nActive vs. Passive Investing\nShort Selling\nToday’s Mortgage Rates\nHow Much Mortgage Can You Afford\nABOUT\nData Disclaimer\nHelp\nSu\x0cestions\nSitemap\n'
pdftotext:
'Breakdown\n\nOperating Cash\nFlow\nInvesting Cash\nFlow\nFinancing Cash\nFlow\nEnd Cash Position\nCapital Expenditure\nIssuance of Debt\nRepayment of Debt\nRepurchase of\nCapital Stock\nFree Cash Flow\n\nRelated Tickers\nGM\nGeneral Motors Compa…\n39.49 +1.23%\n\nCopyright © 2024 Yahoo.\nAll rights reserved.\n\nTTM\n\n12/31/2023\n\n12/31/2022\n\n12/31/2021\n\n12/31/2020\n\n14,918,000\n\n14,918,000\n\n6,853,000\n\n15,787,000\n\n24,269,000\n\n-17,628,000\n\n-17,628,000\n\n-4,347,000\n\n2,745,000\n\n-18,615,000\n\n2,584,000\n\n2,584,000\n\n2,511,000\n\n-23,498,000\n\n2,315,000\n\n25,110,000\n-8,236,000\n51,659,000\n-41,965,000\n\n25,110,000\n-8,236,000\n51,659,000\n-41,965,000\n\n25,340,000\n-6,866,000\n45,470,000\n-45,655,000\n\n20,737,000\n-6,227,000\n27,901,000\n-54,164,000\n\n25,935,000\n-5,742,000\n65,900,000\n-60,514,000\n\n-335,000\n\n-335,000\n\n-484,000\n\n--\n\n--\n\n6,682,000\n\n6,682,000\n\n-13,000\n\n9,560,000\n\n18,527,000\n\nRIVN\nRivian Automotive, Inc.\n\nNIO\nNIO Inc.\n\nSTLA\nStellantis N.V.\n\nPOPULAR QUOTES\nTesla\nDAX Index\nKOSPI\nDow Jones\nS&P BSE SENSEX\nSPDR S&P 500 ETF Trust\n\nEXPLORE MORE\n\n15.39 -3.15%\n\n5.97 +0.17%\n\n25.63 +0.91%\n\nLCID\nLucid Group, Inc.\n\n3.7000 +0.54%\n\nCredit Score Management\nHousing Market\nActive vs. Passive Investing\nShort Selling\nToday’s Mortgage Rates\nHow Much Mortgage Can You Afford\n\nABOUT\n\nTSLA\nTesla, Inc.\n\n194.77 +0.52%\n\nData Disclaimer\nHelp\nSuggestions\nSitemap\n\n12/31/2020 - 6/1/1972\n\nYahoo Finance Plus Essential\naccess required.\nUnlock Access\n\nTM\nToyota Motor Corporati…\n227.09 +0.14%\n\nXPEV\nXPeng Inc.\n\n9.08 +0.89%\n\nFSR\nFisker Inc.\n\n0.5579 -11.46%\n\n\x0c'
pdfplumber:
'Breakdown TTM 12/31/2023 12/31/2022 12/31/2021 12/31/2020 12/31/2020 - 6/1/1972\nOperating Cash\n\x00\x00,\x00\x00\x00,\x00\x00\x00 \x00\x00,\x00\x00\x00,\x00\x00\x00 \x00,\x00\x00\x00,\x00\x00\x00 \x00\x00,\x00\x00\x00,\x00\x00\x00 \x00\x00,\x00\x00\x00,\x00\x00\x00\nFlow\nYahoo Finance Plus Essential\nInvesting Cash access required.\n-\x00\x00,\x00\x00\x00,\x00\x00\x00 -\x00\x00,\x00\x00\x00,\x00\x00\x00 -\x00,\x00\x00\x00,\x00\x00\x00 \x00,\x00\x00\x00,\x00\x00\x00 -\x00\x00,\x00\x00\x00,\x00\x00\x00\nFlow\nUnlock Access\nFinancing Cash\n\x00,\x00\x00\x00,\x00\x00\x00 \x00,\x00\x00\x00,\x00\x00\x00 \x00,\x00\x00\x00,\x00\x00\x00 -\x00\x00,\x00\x00\x00,\x00\x00\x00 \x00,\x00\x00\x00,\x00\x00\x00\nFlow\nEnd Cash Position \x00\x00,\x00\x00\x00,\x00\x00\x00 \x00\x00,\x00\x00\x00,\x00\x00\x00 \x00\x00,\x00\x00\x00,\x00\x00\x00 \x00\x00,\x00\x00\x00,\x00\x00\x00 \x00\x00,\x00\x00\x00,\x00\x00\x00\nCapital Expenditure -\x00,\x00\x00\x00,\x00\x00\x00 -\x00,\x00\x00\x00,\x00\x00\x00 -\x00,\x00\x00\x00,\x00\x00\x00 -\x00,\x00\x00\x00,\x00\x00\x00 -\x00,\x00\x00\x00,\x00\x00\x00\nIssuance of Debt \x00\x00,\x00\x00\x00,\x00\x00\x00 \x00\x00,\x00\x00\x00,\x00\x00\x00 \x00\x00,\x00\x00\x00,\x00\x00\x00 \x00\x00,\x00\x00\x00,\x00\x00\x00 \x00\x00,\x00\x00\x00,\x00\x00\x00\nRepayment of Debt -\x00\x00,\x00\x00\x00,\x00\x00\x00 -\x00\x00,\x00\x00\x00,\x00\x00\x00 -\x00\x00,\x00\x00\x00,\x00\x00\x00 -\x00\x00,\x00\x00\x00,\x00\x00\x00 -\x00\x00,\x00\x00\x00,\x00\x00\x00\nRepurchase of\n-\x00\x00\x00,\x00\x00\x00 -\x00\x00\x00,\x00\x00\x00 -\x00\x00\x00,\x00\x00\x00 -- --\nCapital Stock\nFree Cash Flow \x00,\x00\x00\x00,\x00\x00\x00 \x00,\x00\x00\x00,\x00\x00\x00 -\x00\x00,\x00\x00\x00 \x00,\x00\x00\x00,\x00\x00\x00 \x00\x00,\x00\x00\x00,\x00\x00\x00\nRelated Tickers\nGM RIVN NIO STLA LCID TSLA TM XPEV FSR\nGeneral Motors Compa… Rivian Automotive, Inc. NIO Inc. Stellantis N.V. Lucid Group, Inc. Tesla, Inc. Toyota Motor Corporati… XPeng Inc. 
Fisker Inc.\n39.49 +1.23% 15.39 -3.15% 5.97 +0.17% 25.63 +0.91% 3.7000 +0.54% 194.77 +0.52% 227.09 +0.14% 9.08 +0.89% 0.5579 -11.46%\nPOPULAR QUOTES EXPLORE MORE ABOUT\nTesla Credit Score Management Data Disclaimer\nCopyright © 2024 Yahoo.\nDAX Index Housing Market Help\nAll rights reserved.\nKOSPI Active vs. Passive Investing Su\x00estions\nShort Selling Sitemap\nDow Jones\nToday’s Mortgage Rates\nS&P BSE SENSEX\nHow Much Mortgage Can You Afford\nSPDR S&P 500 ETF Trust'
Expected behavior (optional)
I expect the numbers in the table to be returned as normal text, as pdftotext does.
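Lining up the PyMuPDF and pdftotext outputs above suggests that, in this particular PDF, the characters U+008D through U+0096 stand for the digits 0 through 9 (for example, '\x8e\x91,\x96\x8e\x95,\x8d\x8d\x8d' corresponds to '14,918,000'). The following is a stopgap sketch based only on that observation. The names `DIGIT_FIX` and `repair_digits` are hypothetical, and the mapping is an assumption that probably does not carry over to other documents or fonts.

```python
# Mapping inferred from the outputs in this report only:
# U+008D..U+0096 appear where the digits 0..9 are expected.
DIGIT_FIX = {0x8D + i: str(i) for i in range(10)}


def repair_digits(text):
    """Replace the garbled digit characters with ASCII digits."""
    return text.translate(DIGIT_FIX)


# Two values taken from the PyMuPDF output in this report.
print(repair_digits("\x8e\x91,\x96\x8e\x95,\x8d\x8d\x8d"))   # 14,918,000
print(repair_digits("-\x8e\x94,\x93\x8f\x95,\x8d\x8d\x8d"))  # -17,628,000
```

Both results match the corresponding pdftotext values shown above. Text that contains none of these characters passes through unchanged.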
PyMuPDF version
1.23.25
Operating system
Linux
Python version
3.10
Thanks for the report.
It looks like PyMuPDF with the latest MuPDF master branch does not include these control characters in the text. So this looks like a MuPDF issue.
I'll ask the MuPDF people about what has changed on MuPDF master relative to PyMuPDF's default MuPDF-1.23.10.
MuPDF master supports ActualText, which fixes this problem. We expect MuPDF to move to a new 1.24.x release branch in the next few weeks, which will include ActualText support, so the problem should be fixed in PyMuPDF shortly afterwards.
Fixed in 1.24.0.
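To check whether an installed version already contains the fix, a plain version comparison is enough. This is a minimal sketch: the helper names `parse_version` and `has_fix` are hypothetical, and it assumes a purely numeric version string such as the "1.23.25" reported above (a suffix like "rc1" would need extra handling).

```python
def parse_version(version):
    """Turn a numeric version string such as '1.23.25' into (1, 23, 25)."""
    return tuple(int(part) for part in version.split("."))


def has_fix(version, fixed_in="1.24.0"):
    """Return True if version is at or after the release that fixed this issue."""
    return parse_version(version) >= parse_version(fixed_in)


print(has_fix("1.23.25"))  # False
print(has_fix("1.24.0"))   # True
```

The string to pass in would typically be the installed PyMuPDF version, which as far as I know is exposed as `fitz.VersionBind`. Treat that attribute name as an assumption and confirm it against your installation.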