ExtractTablesFromPdf

Coordinate Y is not correct

Open CJ1789 opened this issue 7 years ago • 17 comments

I only want to extract the line positions, so that I can then extract the text from the PDF in the desired area. But this library takes the top-left corner as (0,0), while iTextSharp takes the bottom-left corner as (0,0) when I try to extract text, so I am not able to get the correct text.

Please help me, I am stuck.

CJ1789 avatar Dec 13 '17 07:12 CJ1789

The code does not work for all files. You could start by comparing the results of other (online) software to see whether it can extract the data. Then you can tailor the code to your PDF. Usually there is a lot of work to do to parse a PDF exactly as you need (so it is worth it if you need to extract data from a lot of PDFs).

bubibubi avatar Dec 13 '17 07:12 bubibubi

Can you please tell me why you transform the point's x and y like this:

    public double TransformX(double x, double y)
    {
        return a * x + c * y + e;

    }

    public double TransformY(double x, double y)
    {
        return b * x + d * y + f;

    }

and why, for a rotation of 0, the point's Y is changed to Y = 800 - Y:

    public Point Rotate(int pageRotation)
    {
        switch (pageRotation)
        {
            case 0:
                return new Point(X, 800 - Y);
            case 90:
                return new Point(Y, X);
            case 180:
                return new Point(X, Y);
            default:
                return this;
        }
    }

CJ1789 avatar Dec 13 '17 08:12 CJ1789

The first transformation is from the PDF guide: it applies the transformation matrix [a b c d e f] to a point.

About the second question, 0 as page rotation means no rotation. I prefer to have the origin in the upper-left corner, while the PDF origin is the lower-left. 800 - y flips the point vertically (800 works for me; you can use a different literal). Otherwise you have to do this in the 180 rotation.
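
As a sketch of the flip described above, assuming the page height is known and is used in place of the literal 800 (`PageCoordinates`, `FlipY`, `ToBottomLeftRect` and `pageHeight` are hypothetical names, not part of this library or of iTextSharp):

```csharp
using System;

public static class PageCoordinates
{
    // Converts a Y coordinate between a bottom-left origin (the PDF default)
    // and a top-left origin. The same formula works in both directions.
    // ASSUMPTION: pageHeight is the page height in points, replacing 800.
    public static double FlipY(double y, double pageHeight)
    {
        return pageHeight - y;
    }

    // Converts a rectangle given with a top-left origin (x, top, width, height)
    // into { llx, lly, urx, ury } with a bottom-left origin.
    public static double[] ToBottomLeftRect(
        double x, double top, double width, double height, double pageHeight)
    {
        double llx = x;
        double lly = pageHeight - (top + height); // bottom edge after the flip
        double urx = x + width;
        double ury = pageHeight - top;            // top edge after the flip
        return new[] { llx, lly, urx, ury };
    }
}
```

For example, on a page 842 points high, a region at top = 100 with height 30 in top-left coordinates spans y = 712 to y = 742 in bottom-left coordinates.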

bubibubi avatar Dec 13 '17 09:12 bubibubi

How do I do the 180 rotation?

CJ1789 avatar Dec 13 '17 09:12 CJ1789

You could swap the two rotations: 0 => y, 180 => 800 - y.

But then I think you'll find several things not working (the other functions expect the origin to be in the upper-left corner). Anyway, if you see that for some reason everything is already flipped, you could try it.

bubibubi avatar Dec 13 '17 09:12 bubibubi

I am not getting the right answer in either case. Please help me with what to do. Is 800 - y the general way to flip a PDF, or did you get this value for your PDF?

CJ1789 avatar Dec 13 '17 10:12 CJ1789

It is c - y, where c comes from my PDF. The condition for choosing c is c - y > 0, and since it is also used for rendering (debug) it can't be something like 1000000 - y.
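
A minimal sketch of picking such a constant from the data itself, assuming the Y values of the extracted points are already available as a list (`FlipConstant.Choose` and `margin` are hypothetical names, not part of the library):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FlipConstant
{
    // Returns a whole number c such that c - y > 0 for every y in ys,
    // while staying close to the largest y so a debug rendering stays small.
    // ASSUMPTION: margin is positive; an empty list throws InvalidOperationException.
    public static double Choose(IEnumerable<double> ys, double margin = 1.0)
    {
        double maxY = ys.Max();
        return Math.Ceiling(maxY + margin);
    }
}
```

With Y values 10, 250.5 and 791.2 this gives c = 793, so the largest Y maps to 1.8 and every flipped value stays positive.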

bubibubi avatar Dec 13 '17 10:12 bubibubi

What is c? I mean, how can I identify it for my PDF?

Can I send you my PDF?

CJ1789 avatar Dec 13 '17 10:12 CJ1789

c means a literal, a constant.

Yes, send me your pdf. I can have a look...

bubibubi avatar Dec 13 '17 10:12 bubibubi

Please send me your email address.

CJ1789 avatar Dec 13 '17 10:12 CJ1789

७_१२_6.pdf ७_१२_7.pdf ७_१२_8.pdf ७_१२_9.pdf ७_१२_10.pdf ७_१२_11.pdf ७_१२_data on 2 pages.pdf

I want to determine the vertical line position of line 3 (i.e. line[2]) and line 6 (i.e. line[5]).

CJ1789 avatar Dec 13 '17 10:12 CJ1789

OK, I had a look at the first PDF. You can do the same thing by updating the source code and using the BuildTablesFromPdf.Renderer app. The table on the first page is not really a table because it is not well aligned, so the library detects more cells than there are. Also, there is an issue with text positioning: I'm probably ignoring a PDF statement that puts the text in the right place.

About the second page, there is a different issue: the coordinates are wrong. I'm probably ignoring a PDF statement that I should consider. After solving this issue you will also have the text positioning issue, as on the first page.

I will probably try to fix it, but I'm not sure and I don't know when. If you fix it and share the code, it will be really appreciated.
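
It is not confirmed in this thread which statement is being ignored. One statement that does change coordinates is `cm`, which concatenates a new matrix onto the current transformation matrix. As a sketch only, using the same a to f naming as `TransformX` and `TransformY` above (the `Matrix` class here is hypothetical, not the library's own type):

```csharp
using System;

public sealed class Matrix
{
    public readonly double A, B, C, D, E, F;

    public Matrix(double a, double b, double c, double d, double e, double f)
    {
        A = a; B = b; C = c; D = d; E = e; F = f;
    }

    // Returns the matrix that applies 'this' first and 'next' second.
    // For a `cm` operand M and the current matrix CTM, the new CTM is M.Then(CTM).
    public Matrix Then(Matrix next)
    {
        return new Matrix(
            A * next.A + B * next.C,
            A * next.B + B * next.D,
            C * next.A + D * next.C,
            C * next.B + D * next.D,
            E * next.A + F * next.C + next.E,
            E * next.B + F * next.D + next.F);
    }

    public double TransformX(double x, double y) { return A * x + C * y + E; }
    public double TransformY(double x, double y) { return B * x + D * y + F; }
}
```

For instance, a translation by (0, 150) followed by the vertical flip `800 - y` combines into a single matrix that maps the point (10, 20) to (10, 630), the same result as applying the two steps one after the other.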

bubibubi avatar Dec 13 '17 11:12 bubibubi

OK, thanks.

CJ1789 avatar Dec 13 '17 12:12 CJ1789

I got the correct Y.

CJ1789 avatar Dec 14 '17 10:12 CJ1789

Could you send me the code? THX!!!

bubibubi avatar Dec 14 '17 14:12 bubibubi

I just had to modify it by adding 150.

But now a new issue has come up. I am able to extract text from the PDF, but some characters are not being recognized. Can you help me with that?

Examples: अधिकार comes out as अ\0धकार, महाराष्ट्र as महारा\0\0, क्षेत्र as \0े\0.

CJ1789 avatar Dec 15 '17 04:12 CJ1789

Hello,

The code runs perfectly for the first page of the PDF, but what should I do for the second page? I am not getting the correct Y there. Please help me.

CJ1789 avatar Dec 21 '17 10:12 CJ1789