Split Page leads to duplicates in extract_text

Open mrkgoh opened this issue 2 years ago • 3 comments

I am trying to split a PDF that has two pages on each sheet and then extract the text.

The file is at https://disclosure.bursamalaysia.com/FileAccess/apbursaweb/download?id=212505&name=EA_DS_ATTACHMENTS

You can see that on each sheet there are two portrait pages side by side. I need to extract the text of the 'Statements of Financial Position' on page 10, where the left half is a page in portrait and the right half is a page in landscape.

I figured that I first need to split each page into 2. So I did it with the following:

import copy
import io
from typing import Optional
from PyPDF2 import PdfFileWriter, PdfFileReader
from PyPDF2.pdf import PageObject

def check_and_split_pdf(path: io.BytesIO, dest: Optional[str] = None) -> Optional[io.BytesIO]:

    reader = PdfFileReader(path)
    writer = PdfFileWriter()
    # split dual portrait pages into 2 pages
    any_page_dual = False
    for i in range(reader.getNumPages()):
        pp = reader.getPage(i)
        left = copy.copy(pp)
        right = copy.copy(pp)
        if not is_pdf_page_single_sheet(page=pp):
            left.mediaBox.upperRight = (
                left.mediaBox.getUpperRight_x() / 2,
                left.mediaBox.getUpperRight_y(),
            )
            writer.addPage(left)
            right.mediaBox.lowerLeft = (
                right.mediaBox.getUpperRight_x() / 2,
                right.mediaBox.getLowerLeft_y(),
            )
            writer.addPage(right)
            any_page_dual = True
            print(f'Page {i} is Dual page pdf, addPage: Done')
        else:
            writer.addPage(pp)
            print(f'Page {i} is Single page pdf, addPage: Done')
    
    if dest:
        with open(dest, 'wb') as out:
            writer.write(out)
    else:
        if any_page_dual:
            temp = io.BytesIO()
            writer.write(temp)
            return temp
        else:
            return path

def is_pdf_page_single_sheet(page: PageObject) -> bool:
    if 1.25 <= (page.mediaBox.getUpperRight_y()/page.mediaBox.getUpperRight_x()) <= 1.55:
        return True
    elif 0.70 <= (page.mediaBox.getUpperRight_y()/page.mediaBox.getUpperRight_x()) <= 0.80:
        return False
    else:
        raise UnableToCheckSinglePageSingleSheet()

class UnableToCheckSinglePageSingleSheet(Exception):
    pass

import urllib3
import pdfplumber
import io

url = 'https://disclosure.bursamalaysia.com/FileAccess/apbursaweb/download?id=212505&name=EA_DS_ATTACHMENTS'
http = urllib3.PoolManager()
temp = io.BytesIO()
temp.write(http.request("GET", url).data)
temp2 = check_and_split_pdf(temp)

This was successful: when I inspect the output PDF file, it now has one page per sheet. So I tried extracting the text with

import PyPDF2
dest = "/topglove4.pdf"
read_pdf = PyPDF2.PdfFileReader(dest)
all_text = {}  # maps page index to the text extracted from that page
for i in range(read_pdf.getNumPages()):
    page = read_pdf.getPage(i)
    single_page_text = page.extractText()
    # TypeError: can only concatenate str (not "NoneType") to str
    if single_page_text is not None:
        all_text[i] = single_page_text

But when I inspect all_text, I do get double the number of pages, yet the text is duplicated every 2 pages. It seems that splitting the page worked, but the text of the other half is still attached to the current page and is captured by page.extractText(), even though no text from the other half appears when you view the file.

As you can see page 0 and page 1 are the same:

all_text[0]
>>>
'176\n177\nOUR PERFORMANCE\nFINANCIAL \nSTATEMENTS\nDIRECTORS™ \n\nRESPONSIBILITY \n\nSTATEMENT\nFor the Audited Financial Statements\nThe Directors are required by the Companies Act 2016 (CA) to \nprepare the ˜nancial statements for each ˜nancial year which \n\nhave been made out in accordance with applicable Malaysian \n\nFinancial Reporting Standards (MFRSs), the Inter national \n\nFinancial Reporting Standards (IFRSs), and the requirements \n\nof the CA in Malaysia.\nThe Directors are responsible to ensure that the ˜nancial \nstatements give a true and fair view of the state of affairs of the \n\nGroup and of the Company at the end of the ˜nancial year, and \n\nof the results and cash ˚ows of the Group and of the Company \n\nfor the ˜nancial year.\nIn preparing the ˜nancial statements, the Directors ensured \nthat the Management has:\nŁ\n \nadopted appropriate accounting policies and applied \nthem consistently;\nŁ\n \nmade judgements and estimates that are reasonable and \n\nprudent; and\nŁ\n \nprepared the ˜nancial statements on a going concern \n\nbasis.\nThe Directors are responsible to ensure that the Group and \n\nthe Company keep accounting records which disclose the \n\n˜nancial position of the Group and of the Company with \n\nreasonable accuracy, enabling them to ensure that the ˜nancial \n\nstatements comply with the CA.\nThe Directors are responsible for taking such steps as are \nreasonably open to them to safeguard the assets of the Group \n\nand of the Company, and to detect and prevent fraud and \n\nother irregularities.\nOUR PERFORMANCE\n176\n \nDirectors™ Responsibility Statement\n177\n \nDirectors™ Report\n185\n \nStatement by Directors\n185\n \nStatutory Declaration\n186\n \nIndependent Auditors™ Report\n190\n \nStatements of Pro˜t or Loss\n191\n  \nStatements of Comprehensive \nIncome\n192\n  \nStatements of Financial Position\n195\n  \nStatements of Changes in Equity\n198\n \nStatements of Cash Flows\n202\n  \nNotes to the Financial 
Statements\nOTHER INFORMATION\n291\n \nList of Properties\n308\n \nAnalysis of Shareholdings\n311\n \nNotice of 23\nrd\n AGM\n317\n  \nAdministrative Details for 23\nrd\n AGM\n323\n \nProxy Form\n325\n \nGRI Content Index\n331\n \nIndependent External Assurance \n \n \nStatement\n335\n \nCorporate Song\nDIRECTORS™ REPORT\nThe directors have pleasure in presenting their report together with the audited ˜nancial statements of the Group and of the \nCompany for the ˜nancial year ended 31 August 2021.\nPRINCIPAL ACTIVITIES\n\nThe principal activities of the Company are investment holding and provision of management services. \n\nThe principal activities and other information of the subsidiaries are described in Note 19 to the ˜nancial statements. \n\nThere have been no signi˜cant changes in the nature of these principal activities during the ˜nancial year.\n\nRESULTS\n Group \n RM™000 \n Company \n RM™000 \nPro˜t net of tax\n 7,823,992 \n 6,461,350 \n \n \nPro˜t attributable to:\nOwners of the parent\n7,710,327\n 6,461,350 \nHolders of Perpetual Sukuk\n 51,350 \n - \nNon-controlling interests\n62,315\n - \n7,823,992\n 6,461,350\nThere were no material transfers to or from reserves or provisions during the ˜nancial year other than as disclosed in the ˜nancial \nstatements.\nIn the opinion of the directors, the results of the operations of the Group and of the Company during the ˜nancial year were not \nsubstantially affected by any item, transaction or event of a material and unusual nature.\nDIVIDENDS\n\nThe amounts of dividends paid by the Company since 31 August 2020 were as follows:\n \n \n RM™000 \nIn respect of the ˜nancial year ended 31 August 2021:\nThird tax exempt interim single tier dividend of 18 sen per share on 8,004,542,000 \n \nordinary shares, declared on 9 June 2021 and paid on 7 July 2021\n 1,440,559\nSecond tax exempt interim single tier dividend of 25.2 sen per share on 8,004,018,000 \n \nordinary shares, declared on 9 March 2021 and paid on 6 April 
2021\n 2,017,607\nFirst tax exempt interim single tier dividend of 16.5 sen per share on 8,022,604,000 \n \nordinary shares, declared on 9 December 2020 and paid on 11 January 2021\n 1,323,582\nIn respect of the ˜nancial year ended 31 August 2020:\nFinal tax exempt single tier dividend of 8.5 sen per share on 8,143,086,000 \n \nordinary shares, declared on 23 September 2020 and paid on 3 November 2020\n 692,321 \n 5,474,069\nFurther details on dividends recognised during the ˜nancial year are disclosed in Note 46 to the ˜nancial statements.\nA single tier ˜nal dividend in respect of the ˜nancial year ended 31 August 2021, of 5.4 sen per share on 8,007,085,000 ordinary \nshares amounting to RM432,454,000 had been declared on 17 September 2021 and paid on 15 October 2021. The ˜nancial \n\nstatements for the current ˜nancial year do not re˚ect this dividend. Such dividend will be accounted for within equity as an \n\nappropriation of retained earnings for the ˜nancial year ending 31 August 2022.\n'
all_text[1]
>>>
'176\n177\nOUR PERFORMANCE\nFINANCIAL \nSTATEMENTS\nDIRECTORS™ \n\nRESPONSIBILITY \n\nSTATEMENT\nFor the Audited Financial Statements\nThe Directors are required by the Companies Act 2016 (CA) to \nprepare the ˜nancial statements for each ˜nancial year which \n\nhave been made out in accordance with applicable Malaysian \n\nFinancial Reporting Standards (MFRSs), the Inter national \n\nFinancial Reporting Standards (IFRSs), and the requirements \n\nof the CA in Malaysia.\nThe Directors are responsible to ensure that the ˜nancial \nstatements give a true and fair view of the state of affairs of the \n\nGroup and of the Company at the end of the ˜nancial year, and \n\nof the results and cash ˚ows of the Group and of the Company \n\nfor the ˜nancial year.\nIn preparing the ˜nancial statements, the Directors ensured \nthat the Management has:\nŁ\n \nadopted appropriate accounting policies and applied \nthem consistently;\nŁ\n \nmade judgements and estimates that are reasonable and \n\nprudent; and\nŁ\n \nprepared the ˜nancial statements on a going concern \n\nbasis.\nThe Directors are responsible to ensure that the Group and \n\nthe Company keep accounting records which disclose the \n\n˜nancial position of the Group and of the Company with \n\nreasonable accuracy, enabling them to ensure that the ˜nancial \n\nstatements comply with the CA.\nThe Directors are responsible for taking such steps as are \nreasonably open to them to safeguard the assets of the Group \n\nand of the Company, and to detect and prevent fraud and \n\nother irregularities.\nOUR PERFORMANCE\n176\n \nDirectors™ Responsibility Statement\n177\n \nDirectors™ Report\n185\n \nStatement by Directors\n185\n \nStatutory Declaration\n186\n \nIndependent Auditors™ Report\n190\n \nStatements of Pro˜t or Loss\n191\n  \nStatements of Comprehensive \nIncome\n192\n  \nStatements of Financial Position\n195\n  \nStatements of Changes in Equity\n198\n \nStatements of Cash Flows\n202\n  \nNotes to the Financial 
Statements\nOTHER INFORMATION\n291\n \nList of Properties\n308\n \nAnalysis of Shareholdings\n311\n \nNotice of 23\nrd\n AGM\n317\n  \nAdministrative Details for 23\nrd\n AGM\n323\n \nProxy Form\n325\n \nGRI Content Index\n331\n \nIndependent External Assurance \n \n \nStatement\n335\n \nCorporate Song\nDIRECTORS™ REPORT\nThe directors have pleasure in presenting their report together with the audited ˜nancial statements of the Group and of the \nCompany for the ˜nancial year ended 31 August 2021.\nPRINCIPAL ACTIVITIES\n\nThe principal activities of the Company are investment holding and provision of management services. \n\nThe principal activities and other information of the subsidiaries are described in Note 19 to the ˜nancial statements. \n\nThere have been no signi˜cant changes in the nature of these principal activities during the ˜nancial year.\n\nRESULTS\n Group \n RM™000 \n Company \n RM™000 \nPro˜t net of tax\n 7,823,992 \n 6,461,350 \n \n \nPro˜t attributable to:\nOwners of the parent\n7,710,327\n 6,461,350 \nHolders of Perpetual Sukuk\n 51,350 \n - \nNon-controlling interests\n62,315\n - \n7,823,992\n 6,461,350\nThere were no material transfers to or from reserves or provisions during the ˜nancial year other than as disclosed in the ˜nancial \nstatements.\nIn the opinion of the directors, the results of the operations of the Group and of the Company during the ˜nancial year were not \nsubstantially affected by any item, transaction or event of a material and unusual nature.\nDIVIDENDS\n\nThe amounts of dividends paid by the Company since 31 August 2020 were as follows:\n \n \n RM™000 \nIn respect of the ˜nancial year ended 31 August 2021:\nThird tax exempt interim single tier dividend of 18 sen per share on 8,004,542,000 \n \nordinary shares, declared on 9 June 2021 and paid on 7 July 2021\n 1,440,559\nSecond tax exempt interim single tier dividend of 25.2 sen per share on 8,004,018,000 \n \nordinary shares, declared on 9 March 2021 and paid on 6 April 
2021\n 2,017,607\nFirst tax exempt interim single tier dividend of 16.5 sen per share on 8,022,604,000 \n \nordinary shares, declared on 9 December 2020 and paid on 11 January 2021\n 1,323,582\nIn respect of the ˜nancial year ended 31 August 2020:\nFinal tax exempt single tier dividend of 8.5 sen per share on 8,143,086,000 \n \nordinary shares, declared on 23 September 2020 and paid on 3 November 2020\n 692,321 \n 5,474,069\nFurther details on dividends recognised during the ˜nancial year are disclosed in Note 46 to the ˜nancial statements.\nA single tier ˜nal dividend in respect of the ˜nancial year ended 31 August 2021, of 5.4 sen per share on 8,007,085,000 ordinary \nshares amounting to RM432,454,000 had been declared on 17 September 2021 and paid on 15 October 2021. The ˜nancial \n\nstatements for the current ˜nancial year do not re˚ect this dividend. Such dividend will be accounted for within equity as an \n\nappropriation of retained earnings for the ˜nancial year ending 31 August 2022.\n'

Environment

Which environment were you using when you encountered the problem?

python3 -m platform
Linux-4.4.59+-x86_64-with-glibc2.2.5
python3 -c "import PyPDF2;print(PyPDF2.__version__)"
1.27.9

Code

This is a minimal, complete example that shows the issue:

See above

PDF

The file is at https://disclosure.bursamalaysia.com/FileAccess/apbursaweb/download?id=212505&name=EA_DS_ATTACHMENTS

Update - tested with this PDF and the same problem occurred: url = 'https://disclosure.bursamalaysia.com/FileAccess/apbursaweb/download?id=207448&name=EA_DS_ATTACHMENTS'

mrkgoh avatar Apr 25 '22 06:04 mrkgoh

The next step in a good bug ticket is to ensure that the code is minimal. For example, the code for downloading / cropping seems unnecessary if it's only about the text duplication.

Try to upload a PDF to GitHub that shows the problem and to which you hold the intellectual property (if that is possible).

MartinThoma avatar Apr 25 '22 11:04 MartinThoma

> The next step in a good bug ticket is to ensure that the code is minimal. For example, the code for downloading / cropping seems unnecessary if it's only about the text duplication.
>
> Try to upload a PDF to GitHub that shows the problem and to which you hold the intellectual property (if that is possible).

There is no intellectual property on the pdf. It is accessible via the link given above.

mrkgoh avatar Apr 27 '22 08:04 mrkgoh

Analysis at first sight: extract_text() does not take the cropping area into account, so all text is extracted whether or not it is visible in the viewer. I do not see an easy solution to get the displayed area, especially the right / bottom-left area.
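
The behaviour described above can be illustrated with a toy model. This is not pypdf code: the dict layout, the "box" and "runs" keys and the extract_all helper are all hypothetical, chosen only to show why changing a page's box on a shallow copy leaves the extracted text unchanged.

```python
import copy


def extract_all(page):
    """Toy extractor: returns every text run and never looks at page["box"]."""
    return [text for (_x, _y, text) in page["runs"]]


# A hypothetical sheet holding two side-by-side pages (coordinates are made up).
sheet = {
    "box": [0, 0, 200, 100],
    "runs": [(20, 50, "left page text"), (120, 50, "right page text")],
}

# Shallow copies share the same list of runs; only the box is replaced.
left = copy.copy(sheet)
left["box"] = [0, 0, 100, 100]
right = copy.copy(sheet)
right["box"] = [100, 0, 200, 100]

print(extract_all(left))
print(extract_all(right))
```

In this model both halves print the same two runs, which mirrors the duplicated all_text entries in the report: the box only limits what a viewer draws, not what the extractor reads.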

pubpub-zz avatar Jul 10 '22 08:07 pubpub-zz

@mrkgoh Visitor functions have been implemented, which offer a way to check the current text position: https://pypdf.readthedocs.io/en/latest/user/extract-text.html#example-1-ignore-header-and-footer With such a function you should be able to extract only the text that is visible (within the trimbox).

I am closing this issue as fixed in the meantime.
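
A minimal sketch of such a visitor, under stated assumptions: it needs a pypdf version whose extract_text() accepts visitor_text (PyPDF2 1.27.9 from the report does not), it assumes cm and tm arrive as six-element matrices in PDF order, and it uses the media box as the visible area because that is the box the splitting code changes. The helper names make_box_visitor and extract_text_in_mediabox are hypothetical, not part of pypdf.

```python
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (left, bottom, right, top)


def make_box_visitor(box: Box, parts: List[str]) -> Callable[..., None]:
    """Build a visitor_text callback that keeps text whose origin lies in box."""
    left, bottom, right, top = box

    def visitor(text, cm, tm, font_dict, font_size):
        # Map the text-matrix translation through the current transformation
        # matrix to get the text origin in page coordinates (assumption: both
        # are [a, b, c, d, e, f] lists as in the PDF specification).
        x = tm[4] * cm[0] + tm[5] * cm[2] + cm[4]
        y = tm[4] * cm[1] + tm[5] * cm[3] + cm[5]
        if left <= x <= right and bottom <= y <= top:
            parts.append(text)

    return visitor


def extract_text_in_mediabox(page) -> str:
    """Extract only the text whose origin falls inside the page's media box."""
    box = page.mediabox
    parts: List[str] = []
    visitor = make_box_visitor(
        (float(box.left), float(box.bottom), float(box.right), float(box.top)),
        parts,
    )
    page.extract_text(visitor_text=visitor)
    return "".join(parts)
```

The filter only tests where each text run starts, so a run that begins inside the box and extends past its edge is kept whole; for the side-by-side layout in this issue that may be acceptable, but it is worth checking against the actual file.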

pubpub-zz avatar Feb 26 '23 14:02 pubpub-zz