
ENH: Allow text extraction to keep indentation

Open MartinThoma opened this issue 11 months ago • 2 comments

When we extract Python code from a PDF, it's completely messed up. It would be nice to have an option that keeps the indentation. Maybe a flag for a layout-mode?

Code Example: How the new feature could be used

from pypdf import PdfReader

# https://arxiv.org/pdf/1601.03642.pdf
reader = PdfReader("1601.03642.pdf")
print(reader.pages[6].extract_text(layout_mode=True))

should give:

 * Increment the size file of the new incorrect UI_FILTER group information
 * of the size generatively.
 */
static int indicate_policy(void)
{
    int error;
    if (fd == MARN_EPT) {
        /*
         * The kernel blank will coeld it to userspace.
         */
        if (ss->segment < mem_total)
            unblock_graph_and_set_blocked();
        else
            ret = 1;
        goto bail;
    }
    segaddr = in_SB(in.addr);
    selector = seg / 16;
    setup_works = true;
    for (i = 0; i < blocks; i++) {
        seq = buf[i++];
        bpf = bd->bd.next + i * search;
        if (fd) {
            current = blocked;
        }
    }
    rw->name = "Getjbbregs";
    bprm_self_clearl(&iv->version);
    regs->new = blocks[(BPF_STATS << info->historidac)] | PFMR_CLOBATHINC_SECONDS << 12;
    return segtable;
}


D. Linux Code, 2

/*
* Copyright (c) 2006-2010, Intel Mobile Communications. All rights reserved.
*
* This program is free software; you can redistribute it and/or modify it
* under the terms of the GNU General Public License version 2 as published by
* the Free Software Foundation.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
*
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, write to the Free Software Foundation,
* Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
*/

#include <linux/kexec.h>
#include <linux/errno.h>
#include <linux/io.h>
#include <linux/platform_device.h>
#include <linux/multi.h>

Currently, we get:

*Increment the size file of the new incorrect UI_FILTER group information
*of the size generatively.
*/
static int indicate_policy(void)
{
int error;
if (fd == MARN_EPT) {
/*
*The kernel blank will coeld it to userspace.
*/
if (ss->segment < mem_total)
unblock_graph_and_set_blocked();
else
ret = 1;
goto bail;
}
segaddr = in_SB(in.addr);
selector = seg / 16;
setup_works = true;
for (i = 0; i < blocks; i++) {
seq = buf[i++];
bpf = bd->bd.next + i *search;
if (fd) {
current = blocked;
}
}
rw->name = "Getjbbregs";
bprm_self_clearl(&iv->version);
regs->new = blocks[(BPF_STATS << info->historidac)] | PFMR_CLOBATHINC_SECONDS << 12;
return segtable;
}
D. Linux Code, 2
/*
*Copyright (c) 2006-2010, Intel Mobile Communications. All rights reserved.
*
* This program is free software; you can redistribute it and/or modify it
*under the terms of the GNU General Public License version 2 as published by
*the Free Software Foundation.
*
* This program is distributed in the hope that it will be useful,
*but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
*
*GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, write to the Free Software Foundation,
*Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
*/
#include <linux/kexec.h>
#include <linux/errno.h>
#include <linux/io.h>
#include <linux/platform_device.h>
#include <linux/multi.h>

MartinThoma, Aug 01 '23 05:08

I'd be interested in contributing to this enhancement for PyPDF2, @MartinThoma. Let me know how I can be of assistance.

MrAnayDongre, Aug 03 '23 06:08

@MrAnayDongre PyPDF2 is deprecated. This is going into pypdf.

This is a very complex feature. I don't yet know myself what would be a good way to start on it.

If you want to start contributing to pypdf, I recommend having a look at https://github.com/py-pdf/pypdf/labels/Easy, then at https://github.com/py-pdf/pypdf/labels/help%20wanted

MartinThoma, Aug 03 '23 15:08

extract_text now has a "layout" extraction_mode. I am closing this old issue, which is now covered.

pubpub-zz, Apr 08 '24 21:04

@pubpub-zz The layout mode does not resolve this; the issue requires further work to convert horizontal positions into the corresponding amount of whitespace.

I have therefore re-opened this issue.
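
A minimal sketch of what "converting horizontal positions into whitespace" could look like, independent of how pypdf actually implements its layout mode. It assumes each text fragment is already available as a hypothetical (x, y, text) tuple in PDF points and that the font is monospaced with a known advance width; render_layout, char_width and line_tolerance are illustrative names, not pypdf API.

```python
def render_layout(fragments, char_width=6.0, line_tolerance=2.0):
    """Render (x, y, text) fragments as text, mapping x offsets to spaces.

    Assumptions (not taken from pypdf): coordinates are in PDF points,
    y grows upward, and every glyph is char_width points wide.
    """
    if not fragments:
        return ""

    # Group fragments into visual lines, top of the page first.
    lines = []  # each entry: (y of first fragment, [(x, text), ...])
    for x, y, text in sorted(fragments, key=lambda f: (-f[1], f[0])):
        if lines and abs(lines[-1][0] - y) <= line_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append((y, [(x, text)]))

    # The leftmost fragment on the page defines column 0.
    left_margin = min(x for x, _, _ in fragments)

    rendered = []
    for _, parts in lines:
        line = ""
        for x, text in sorted(parts):
            column = round((x - left_margin) / char_width)
            # Never overwrite earlier text; keep at least one separating space.
            if line and column <= len(line):
                column = len(line) + 1
            line += " " * (column - len(line)) + text
        rendered.append(line)
    return "\n".join(rendered)
```

With a 6-point glyph width, a fragment starting 24 points right of the margin lands in column 4. Proportional fonts break the fixed-width assumption, which is one reason the real problem is harder than this sketch suggests.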

stefan6419846, Apr 09 '24 05:04

@stefan6419846, this is the rendering: print(rr.pages[6].extract_text(extraction_mode="layout")) ->

 * Increment  the  size  file  of  the  new  incorrect  UI_FILTER  group  information
 * of  the  size  generatively.
 */
static  int  indicate_policy(void)
{
   int  error;
   if  (fd  ==  MARN_EPT)  {
     /*
       * The  kernel  blank  will  coeld  it  to  userspace.
       */
     if  (ss->segment  <  mem_total)
        unblock_graph_and_set_blocked();
     else
        ret  =  1;
     goto  bail;
   }
   segaddr  =  in_SB(in.addr);
   selector  =  seg  /  16;
   setup_works  =  true;
   for  (i  =  0;  i  <  blocks;  i++)  {
     seq  =  buf[i++];
     bpf  =  bd->bd.next  +  i  * search;
     if  (fd)  {
        current  =  blocked;
     }
   }
   rw->name  =  "Getjbbregs";
   bprm_self_clearl(&iv->version);
   regs->new  =  blocks[(BPF_STATS  <<  info->historidac)]  |  PFMR_CLOBATHINC_SECONDS  <<  12;
   return  segtable;
}


D. Linux Code, 2

/*
 *   Copyright  (c)  2006-2010,  Intel  Mobile  Communications.   All  rights  reserved.
 *
 *     This  program  is  free  software;  you  can  redistribute  it  and/or  modify  it
 * under  the  terms  of  the  GNU  General  Public  License  version  2  as  published  by
 * the  Free  Software  Foundation.
 *
 *               This  program  is  distributed  in  the  hope  that  it  will  be  useful,
 * but  WITHOUT  ANY  WARRANTY;  without  even  the  implied  warranty  of
 *     MERCHANTABILITY  or  FITNESS  FOR  A  PARTICULAR  PURPOSE.   See  the
 *
 *   GNU  General  Public  License  for  more  details.
 *
 *     You  should  have  received  a  copy  of  the  GNU  General  Public  License
 *       along  with  this  program;  if  not,  write  to  the  Free  Software  Foundation,
 *   Inc.,  675  Mass  Ave,  Cambridge,  MA  02139,  USA.
 */

#include  <linux/kexec.h>
#include  <linux/errno.h>
#include  <linux/io.h>
#include  <linux/platform_device.h>
#include  <linux/multi.h>

Isn't this good?

pubpub-zz, Apr 09 '24 17:04

Sorry, seems like my checkout was somehow broken. Still not optimal, but yes, then we can close this for now.
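
One possible post-processing step for output like the rendering quoted in this thread, where words are separated by two spaces: collapse runs of spaces inside each line while leaving the leading indentation untouched. This is only a sketch of a workaround, not something pypdf provides; tidy_layout_line and tidy_layout_text are hypothetical helper names.

```python
import re


def tidy_layout_line(line):
    """Collapse interior runs of spaces but keep the leading indentation."""
    stripped = line.lstrip(" ")
    indent = line[: len(line) - len(stripped)]
    return indent + re.sub(r" {2,}", " ", stripped).rstrip()


def tidy_layout_text(text):
    """Apply tidy_layout_line to every line of a layout-mode extraction."""
    return "\n".join(tidy_layout_line(line) for line in text.splitlines())
```

Note that this also collapses spacing that was intentional inside a line (for example aligned columns or string literals with several spaces), and it does not even out indentation steps that the extraction rendered unevenly.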

stefan6419846, Apr 09 '24 17:04