page.get_text() returns hexadecimal text for some characters

Open brandenkmurray opened this issue 1 year ago • 2 comments

Description of the bug

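get_text() returns the numbers in the Cash Flow table of this document as non-printable characters (shown as hexadecimal escapes such as \x8e in the repr below) instead of digits. Copying and pasting from the page, and extracting with pdftotext, both give the correct text.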

How to reproduce the bug

Ford Motor Company (F) Cash Flow - Yahoo Finance - Yahoo Finance.pdf

import fitz
import pdftotext
import pdfplumber

def print_comparison(fn, page):
    #pymupdf
    pymupdf_doc = fitz.open(fn)

    #pdftotext
    with open(fn, "rb") as f:
        pdftotext_doc = pdftotext.PDF(f)

    #pdfplumber
    pdfplumber_doc = pdfplumber.open(fn)

    print("PyMuPDF:\n")
    print(repr(pymupdf_doc[page].get_text()))
    print("\npdftotext:\n")
    print(repr(pdftotext_doc[page]))
    print("\npdfplumber:\n")
    print(repr(pdfplumber_doc.pages[page].extract_text()))


print_comparison('Ford.Motor.Company.F.Cash.Flow.-.Yahoo.Finance.-.Yahoo.Finance.pdf', 1)
PyMuPDF:

'Related Tickers\nTTM\n12/31/2023\n12/31/2022\n12/31/2021\n12/31/2020\n\x8e\x91,\x96\x8e\x95,\x8d\x8d\x8d\n\x8e\x91,\x96\x8e\x95,\x8d\x8d\x8d\n\x93,\x95\x92\x90,\x8d\x8d\x8d\n\x8e\x92,\x94\x95\x94,\x8d\x8d\x8d\n\x8f\x91,\x8f\x93\x96,\x8d\x8d\x8d\n-\x8e\x94,\x93\x8f\x95,\x8d\x8d\x8d\n-\x8e\x94,\x93\x8f\x95,\x8d\x8d\x8d\n-\x91,\x90\x91\x94,\x8d\x8d\x8d\n\x8f,\x94\x91\x92,\x8d\x8d\x8d\n-\x8e\x95,\x93\x8e\x92,\x8d\x8d\x8d\n\x8f,\x92\x95\x91,\x8d\x8d\x8d\n\x8f,\x92\x95\x91,\x8d\x8d\x8d\n\x8f,\x92\x8e\x8e,\x8d\x8d\x8d\n-\x8f\x90,\x91\x96\x95,\x8d\x8d\x8d\n\x8f,\x90\x8e\x92,\x8d\x8d\x8d\n\x8f\x92,\x8e\x8e\x8d,\x8d\x8d\x8d\n\x8f\x92,\x8e\x8e\x8d,\x8d\x8d\x8d\n\x8f\x92,\x90\x91\x8d,\x8d\x8d\x8d\n\x8f\x8d,\x94\x90\x94,\x8d\x8d\x8d\n\x8f\x92,\x96\x90\x92,\x8d\x8d\x8d\n-\x95,\x8f\x90\x93,\x8d\x8d\x8d\n-\x95,\x8f\x90\x93,\x8d\x8d\x8d\n-\x93,\x95\x93\x93,\x8d\x8d\x8d\n-\x93,\x8f\x8f\x94,\x8d\x8d\x8d\n-\x92,\x94\x91\x8f,\x8d\x8d\x8d\n\x92\x8e,\x93\x92\x96,\x8d\x8d\x8d\n\x92\x8e,\x93\x92\x96,\x8d\x8d\x8d\n\x91\x92,\x91\x94\x8d,\x8d\x8d\x8d\n\x8f\x94,\x96\x8d\x8e,\x8d\x8d\x8d\n\x93\x92,\x96\x8d\x8d,\x8d\x8d\x8d\n-\x91\x8e,\x96\x93\x92,\x8d\x8d\x8d\n-\x91\x8e,\x96\x93\x92,\x8d\x8d\x8d\n-\x91\x92,\x93\x92\x92,\x8d\x8d\x8d\n-\x92\x91,\x8e\x93\x91,\x8d\x8d\x8d\n-\x93\x8d,\x92\x8e\x91,\x8d\x8d\x8d\n-\x90\x90\x92,\x8d\x8d\x8d\n-\x90\x90\x92,\x8d\x8d\x8d\n-\x91\x95\x91,\x8d\x8d\x8d\n--\n--\n\x93,\x93\x95\x8f,\x8d\x8d\x8d\n\x93,\x93\x95\x8f,\x8d\x8d\x8d\n-\x8e\x90,\x8d\x8d\x8d\n\x96,\x92\x93\x8d,\x8d\x8d\x8d\n\x8e\x95,\x92\x8f\x94,\x8d\x8d\x8d\n \nYahoo Finance Plus Essential\naccess required.\nUnlock Access\nBreakdown\nOperating Cash\nFlow\nInvesting Cash\nFlow\nFinancing Cash\nFlow\nEnd Cash Position\nCapital Expenditure\nIssuance of Debt\nRepayment of Debt\nRepurchase of\nCapital Stock\nFree Cash Flow\n12/31/2020 - 6/1/1972\nGM\nGeneral Motors Compa…\n39.49 +1.23%\n\xa0\nRIVN\nRivian Automotive, Inc.\n15.39 -3.15%\n\xa0\nNIO\nNIO Inc.\n5.97 +0.17%\n\xa0\nSTLA\nStellantis N.V.\n25.63 
+0.91%\n\xa0\nLCID\nLucid Group, Inc.\n3.7000 +0.54%\n\xa0\nTSLA\nTesla, Inc.\n194.77 +0.52%\n\xa0\nTM\nToyota Motor Corporati…\n227.09 +0.14%\n\xa0\nXPEV\nXPeng Inc.\n9.08 +0.89%\n\xa0\nFSR\nFisker Inc.\n0.5579 -11.46%\n\xa0\nCopyright © 2024 Yahoo.\nAll rights reserved.\nPOPULAR QUOTES\nTesla\nDAX Index\nKOSPI\nDow Jones\nS&P BSE SENSEX\nSPDR S&P 500 ETF Trust\nEXPLORE MORE\nCredit Score Management\nHousing Market\nActive vs. Passive Investing\nShort Selling\nToday’s Mortgage Rates\nHow Much Mortgage Can You Afford\nABOUT\nData Disclaimer\nHelp\nSu\x0cestions\nSitemap\n'

pdftotext:

'Breakdown\n\nOperating Cash\nFlow\nInvesting Cash\nFlow\nFinancing Cash\nFlow\nEnd Cash Position\nCapital Expenditure\nIssuance of Debt\nRepayment of Debt\nRepurchase of\nCapital Stock\nFree Cash Flow\n\nRelated Tickers\nGM\nGeneral Motors Compa…\n39.49 +1.23%\n\nCopyright © 2024 Yahoo.\nAll rights reserved.\n\nTTM\n\n12/31/2023\n\n12/31/2022\n\n12/31/2021\n\n12/31/2020\n\n14,918,000\n\n14,918,000\n\n6,853,000\n\n15,787,000\n\n24,269,000\n\n-17,628,000\n\n-17,628,000\n\n-4,347,000\n\n2,745,000\n\n-18,615,000\n\n2,584,000\n\n2,584,000\n\n2,511,000\n\n-23,498,000\n\n2,315,000\n\n25,110,000\n-8,236,000\n51,659,000\n-41,965,000\n\n25,110,000\n-8,236,000\n51,659,000\n-41,965,000\n\n25,340,000\n-6,866,000\n45,470,000\n-45,655,000\n\n20,737,000\n-6,227,000\n27,901,000\n-54,164,000\n\n25,935,000\n-5,742,000\n65,900,000\n-60,514,000\n\n-335,000\n\n-335,000\n\n-484,000\n\n--\n\n--\n\n6,682,000\n\n6,682,000\n\n-13,000\n\n9,560,000\n\n18,527,000\n\nRIVN\nRivian Automotive, Inc.\n\nNIO\nNIO Inc.\n\nSTLA\nStellantis N.V.\n\nPOPULAR QUOTES\nTesla\nDAX Index\nKOSPI\nDow Jones\nS&P BSE SENSEX\nSPDR S&P 500 ETF Trust\n\nEXPLORE MORE\n\n15.39 -3.15%\n\n5.97 +0.17%\n\n25.63 +0.91%\n\nLCID\nLucid Group, Inc.\n\n3.7000 +0.54%\n\nCredit Score Management\nHousing Market\nActive vs. Passive Investing\nShort Selling\nToday’s Mortgage Rates\nHow Much Mortgage Can You Afford\n\nABOUT\n\nTSLA\nTesla, Inc.\n\n194.77 +0.52%\n\nData Disclaimer\nHelp\nSuggestions\nSitemap\n\n12/31/2020 - 6/1/1972\n\nYahoo Finance Plus Essential\naccess required.\nUnlock Access\n\nTM\nToyota Motor Corporati…\n227.09 +0.14%\n\nXPEV\nXPeng Inc.\n\n9.08 +0.89%\n\nFSR\nFisker Inc.\n\n0.5579 -11.46%\n\n\x0c'

pdfplumber:

'Breakdown TTM 12/31/2023 12/31/2022 12/31/2021 12/31/2020 12/31/2020 - 6/1/1972\nOperating Cash\n\x00\x00,\x00\x00\x00,\x00\x00\x00 \x00\x00,\x00\x00\x00,\x00\x00\x00 \x00,\x00\x00\x00,\x00\x00\x00 \x00\x00,\x00\x00\x00,\x00\x00\x00 \x00\x00,\x00\x00\x00,\x00\x00\x00\nFlow\nYahoo Finance Plus Essential\nInvesting Cash access required.\n-\x00\x00,\x00\x00\x00,\x00\x00\x00 -\x00\x00,\x00\x00\x00,\x00\x00\x00 -\x00,\x00\x00\x00,\x00\x00\x00 \x00,\x00\x00\x00,\x00\x00\x00 -\x00\x00,\x00\x00\x00,\x00\x00\x00\nFlow\nUnlock Access\nFinancing Cash\n\x00,\x00\x00\x00,\x00\x00\x00 \x00,\x00\x00\x00,\x00\x00\x00 \x00,\x00\x00\x00,\x00\x00\x00 -\x00\x00,\x00\x00\x00,\x00\x00\x00 \x00,\x00\x00\x00,\x00\x00\x00\nFlow\nEnd Cash Position \x00\x00,\x00\x00\x00,\x00\x00\x00 \x00\x00,\x00\x00\x00,\x00\x00\x00 \x00\x00,\x00\x00\x00,\x00\x00\x00 \x00\x00,\x00\x00\x00,\x00\x00\x00 \x00\x00,\x00\x00\x00,\x00\x00\x00\nCapital Expenditure -\x00,\x00\x00\x00,\x00\x00\x00 -\x00,\x00\x00\x00,\x00\x00\x00 -\x00,\x00\x00\x00,\x00\x00\x00 -\x00,\x00\x00\x00,\x00\x00\x00 -\x00,\x00\x00\x00,\x00\x00\x00\nIssuance of Debt \x00\x00,\x00\x00\x00,\x00\x00\x00 \x00\x00,\x00\x00\x00,\x00\x00\x00 \x00\x00,\x00\x00\x00,\x00\x00\x00 \x00\x00,\x00\x00\x00,\x00\x00\x00 \x00\x00,\x00\x00\x00,\x00\x00\x00\nRepayment of Debt -\x00\x00,\x00\x00\x00,\x00\x00\x00 -\x00\x00,\x00\x00\x00,\x00\x00\x00 -\x00\x00,\x00\x00\x00,\x00\x00\x00 -\x00\x00,\x00\x00\x00,\x00\x00\x00 -\x00\x00,\x00\x00\x00,\x00\x00\x00\nRepurchase of\n-\x00\x00\x00,\x00\x00\x00 -\x00\x00\x00,\x00\x00\x00 -\x00\x00\x00,\x00\x00\x00 -- --\nCapital Stock\nFree Cash Flow \x00,\x00\x00\x00,\x00\x00\x00 \x00,\x00\x00\x00,\x00\x00\x00 -\x00\x00,\x00\x00\x00 \x00,\x00\x00\x00,\x00\x00\x00 \x00\x00,\x00\x00\x00,\x00\x00\x00\nRelated Tickers\nGM RIVN NIO STLA LCID TSLA TM XPEV FSR\nGeneral Motors Compa… Rivian Automotive, Inc. NIO Inc. Stellantis N.V. Lucid Group, Inc. Tesla, Inc. Toyota Motor Corporati… XPeng Inc. 
Fisker Inc.\n39.49 +1.23% 15.39 -3.15% 5.97 +0.17% 25.63 +0.91% 3.7000 +0.54% 194.77 +0.52% 227.09 +0.14% 9.08 +0.89% 0.5579 -11.46%\nPOPULAR QUOTES EXPLORE MORE ABOUT\nTesla Credit Score Management Data Disclaimer\nCopyright © 2024 Yahoo.\nDAX Index Housing Market Help\nAll rights reserved.\nKOSPI Active vs. Passive Investing Su\x00estions\nShort Selling Sitemap\nDow Jones\nToday’s Mortgage Rates\nS&P BSE SENSEX\nHow Much Mortgage Can You Afford\nSPDR S&P 500 ETF Trust'

Expected behavior (optional)

I expect the numbers in the table to be returned as normal text, as they are with pdftotext.

PyMuPDF version

1.23.25

Operating system

Linux

Python version

3.10

brandenkmurray • Feb 22 '24 02:02

Thanks for the report.

It looks like PyMuPDF with the latest MuPDF master branch does not include these control characters in the text. So this looks like a MuPDF issue.
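For anyone wanting to check whether their own install is affected, a rough sketch is to count C1 control characters (U+0080 to U+009F) in the extracted text. The helper name `count_c1_controls` is made up for illustration, and the two sample strings are just the first table row from the outputs in the report; in practice you would pass in `page.get_text()`.

```python
import re

# C1 control range. The garbled digits in the report (\x8d to \x96) fall inside it,
# while the non-breaking space \xa0 that also appears in the output does not.
C1_CONTROLS = re.compile(r"[\x80-\x9f]")


def count_c1_controls(text: str) -> int:
    """Return how many C1 control characters appear in extracted text."""
    return len(C1_CONTROLS.findall(text))


# First table row as returned by PyMuPDF 1.23.25 and by pdftotext in the report.
affected = "TTM\n\x8e\x91,\x96\x8e\x95,\x8d\x8d\x8d\n"
clean = "TTM\n14,918,000\n"

print(count_c1_controls(affected))
print(count_c1_controls(clean))
```

A non-zero count on a page that should only hold ordinary text suggests the extraction is hitting this problem. It is only a heuristic and says nothing about the cause.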

I'll ask the MuPDF people about what has changed on MuPDF master relative to PyMuPDF's default MuPDF-1.23.10.

MuPDF master has support for ActualText, which fixes this problem. We expect MuPDF to move to a new 1.24.x release branch in the next few weeks, which will include ActualText support, so the problem will be fixed in PyMuPDF shortly afterwards.

Fixed in 1.24.0.
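If upgrading is not an option, a possible stopgap for this particular PDF: lining up the PyMuPDF and pdftotext outputs in the report, the garbled characters appear to map to digits in order, with \x8d as 0 through \x96 as 9. The sketch below relies on that observation only. It is an assumption tied to this document's font, not a general fix, and `repair_digits` is a hypothetical helper name.

```python
# Mapping inferred from the outputs in this issue only: '\x8d' lined up with '0',
# '\x8e' with '1', and so on up to '\x96' with '9'. Other PDFs will likely differ.
GARBLED_DIGITS = {0x8D + i: ord("0") + i for i in range(10)}


def repair_digits(text: str) -> str:
    """Translate the garbled code points seen in this issue back to ASCII digits."""
    return text.translate(GARBLED_DIGITS)


# Two table cells copied from the PyMuPDF output in the report.
sample = "\x8e\x91,\x96\x8e\x95,\x8d\x8d\x8d\n-\x8e\x94,\x93\x8f\x95,\x8d\x8d\x8d"
print(repair_digits(sample))
```

On the two sample cells this gives 14,918,000 and -17,628,000, which matches what pdftotext reports for the same positions.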