PDFiumSharp

About page text extraction

Open kuyeduwu opened this issue 6 years ago • 6 comments

Thank you for this great work. It seems that only some of the types and methods have been implemented: the TEXTPAGE object is there, for example, but I couldn't find any method for getting the page text.

I'm not sure whether I missed something or whether those functions are still under development.

Also, this library is really helpful for my project, but the last commit was about a year ago. Is it still actively maintained?

kuyeduwu, Dec 05 '18 06:12

You are right, there are many functions that are not wrapped yet.

I'm sorry to say that it is not actively maintained; I'm tied up with work and other projects. I would welcome collaborators, though.

ArgusMagnus, Dec 05 '18 09:12

OK, maybe I could wrap some of them as my project progresses. It's not easy to find a good open source C# PDF library, and I'd like to contribute to keep this one going. :)

kuyeduwu, Dec 06 '18 08:12

I would like to help as well, although I don't have much free time. Anyway, for a start, here is a short snippet that gets the text from a page along with its bounding rects:

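using (var pdfDocument = new PdfDocument(pdfFile))
{
	var page = pdfDocument.Pages[0];
	// Build the text page for the first page of the document.
	var textPage = PDFium.FPDFText_LoadPage(page.Handle);
	// Count the text rects, starting at character 0; the large count is meant to cover the whole page.
	var rectCount = PDFium.FPDFText_CountRects(textPage, 0, 1000000);
	for (int i = 0; i < rectCount; i++)
	{
		// Read the bounds of rect i (left, top, right, bottom), then the text inside those bounds.
		PDFium.FPDFText_GetRect(textPage, i, out var l, out var t, out var r, out var b);
		var text = PDFium.FPDFText_GetBoundedText(textPage, l, t, r, b);
		Console.WriteLine($"{(int)l}, {(int)t}, {(int)r}, {(int)b} - {text}");
	}
}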

It is a prototype, so it does not unload or release any handles, and it is probably missing some handling for rotated pages.
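For the missing cleanup, a minimal sketch might look like the following. It assumes the PDFiumSharp binding exposes `PDFium.FPDFText_ClosePage` mirroring the native PDFium function of the same name; that wrapper is not confirmed anywhere in this thread, so check the binding before relying on it. The `TextRectsDemo` class and the `sample.pdf` fallback path are hypothetical names used only for illustration, and rotated pages are still not handled.

```csharp
using System;
using PDFiumSharp;

static class TextRectsDemo
{
    static void Main(string[] args)
    {
        // Hypothetical input path; pass a real PDF file as the first argument.
        var pdfFile = args.Length > 0 ? args[0] : "sample.pdf";

        using (var pdfDocument = new PdfDocument(pdfFile))
        {
            var page = pdfDocument.Pages[0];
            var textPage = PDFium.FPDFText_LoadPage(page.Handle);
            try
            {
                var rectCount = PDFium.FPDFText_CountRects(textPage, 0, 1000000);
                for (int i = 0; i < rectCount; i++)
                {
                    PDFium.FPDFText_GetRect(textPage, i, out var l, out var t, out var r, out var b);
                    var text = PDFium.FPDFText_GetBoundedText(textPage, l, t, r, b);
                    Console.WriteLine($"{(int)l}, {(int)t}, {(int)r}, {(int)b} - {text}");
                }
            }
            finally
            {
                // Assumption: the binding wraps the native FPDFText_ClosePage.
                // The text page is released here, before the document (and its pages) is disposed.
                PDFium.FPDFText_ClosePage(textPage);
            }
        }
    }
}
```

The `try`/`finally` keeps the release next to the load, so the text page handle is freed even if one of the rect calls throws.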

Morcatko, Dec 06 '18 18:12

@Morcatko, thank you for this snippet. And @ArgusMagnus, do you have any preferences regarding coding standards or anything like that for this project?

kuyeduwu, Dec 07 '18 10:12

I might be able to help you out. I spent the better part of this weekend coding up a "PdfText" class to support text extraction (with a painful level of granularity, should you need it). When I have time and feel that the code is solid enough, I can create a fork and maybe even a pull request. I'm new to GitHub, so I'm still learning the ropes here.

bradleypeet, Feb 18 '19 01:02

Attached are two files: the PdfText class I created and a modified FS_RECTF.cs file that it requires. You'll need to add or modify the following code in PdfPage.cs to hook up the Text property and ensure it gets disposed. Keep in mind that this is a work in progress and has not been thoroughly tested. If anything is missing or unresolved, let me know.

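// Backing field for the lazily loaded text of this page.
private PdfText _text;

// Loads the page text on first access and caches it.
public PdfText Text
{
    get
    {
        if (_text == null)
        {
            _text = PdfText.Load(this);
        }
        return _text;
    }
}

protected override void Dispose(FPDF_PAGE handle)
{
    // Release the text object before closing the page it belongs to.
    (_text as IDisposable)?.Dispose();
    PDFium.FPDF_ClosePage(handle);
}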

PdfText.cs.txt FS_RECTF.cs.txt

bradleypeet, Feb 18 '19 03:02