PDFiumSharp
About page text extraction
Thank you for this great work. It seems that only some of the types and methods have been implemented, though: for example, there is a TEXTPAGE object, but I couldn't find any implementation of a method to get the page text.
I'm not sure whether I missed something or whether those functions are still under development.
Also, I find this really helpful for my project, but the last commit was about a year ago. Is this still actively maintained?
You are right, there are many functions which are not wrapped yet.
I'm sorry to say that it is not actively maintained; I'm tied up with work and other projects. I would welcome collaborators, though.
OK, maybe I could wrap some of them as my project progresses. It's not easy to find a good open source C# PDF library, so I'd like to contribute to keep it going. :)
I would like to help as well, although I don't have much free time. Anyway, for a start, here is a short snippet to get text from a page with bounding rects:
using (var pdfDocument = new PdfDocument(pdfFile))
{
    var page = pdfDocument.Pages[0];
    var textPage = PDFium.FPDFText_LoadPage(page.Handle);
    var rectCount = PDFium.FPDFText_CountRects(textPage, 0, 1000000);
    for (int i = 0; i < rectCount; i++)
    {
        PDFium.FPDFText_GetRect(textPage, i, out var l, out var t, out var r, out var b);
        var text = PDFium.FPDFText_GetBoundedText(textPage, l, t, r, b);
        Console.WriteLine($"{(int)l}, {(int)t}, {(int)r}, {(int)b} - {text}");
    }
}
It is a prototype, so it does not unload or release any handles, and it probably lacks handling for rotated pages.
@Morcatko thank you for this snippet. And @ArgusMagnus, do you have any preferences regarding coding standards or anything like that for this project?
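For anyone who wants to close the handle gap, here is a rough sketch of the same loop with the text page released in a finally block. FPDFText_ClosePage and FPDFText_CountChars are functions in the native PDFium text API; that the PDFiumSharp wrapper exposes them under the same names, and that pdfFile is a path string, are assumptions on my part. The PageTextDump class name is made up for illustration. Rotated pages are still not handled.

```csharp
using System;
using PDFiumSharp;

// Hypothetical helper class, not part of PDFiumSharp.
static class PageTextDump
{
    // Assumption: pdfFile is a path to a PDF file on disk.
    public static void Print(string pdfFile)
    {
        using (var pdfDocument = new PdfDocument(pdfFile))
        {
            var page = pdfDocument.Pages[0];
            var textPage = PDFium.FPDFText_LoadPage(page.Handle);
            try
            {
                // Assumption: the wrapper exposes FPDFText_CountChars.
                // Using the real character count replaces the 1000000 placeholder.
                var charCount = PDFium.FPDFText_CountChars(textPage);
                var rectCount = PDFium.FPDFText_CountRects(textPage, 0, charCount);
                for (int i = 0; i < rectCount; i++)
                {
                    PDFium.FPDFText_GetRect(textPage, i, out var l, out var t, out var r, out var b);
                    var text = PDFium.FPDFText_GetBoundedText(textPage, l, t, r, b);
                    Console.WriteLine($"{(int)l}, {(int)t}, {(int)r}, {(int)b} - {text}");
                }
            }
            finally
            {
                // Assumption: the wrapper exposes FPDFText_ClosePage.
                // The text page is released before the document (and its pages) is disposed.
                PDFium.FPDFText_ClosePage(textPage);
            }
        }
    }
}
```

The try/finally keeps the release next to the load, so the text page handle is freed even if one of the calls in the loop throws.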
I might be able to help you out. I spent the better part of this weekend coding up a "PdfText" class to support text extraction (with a painful level of granularity, should you need it). When I have time and feel the code is solid enough, I can create a fork and maybe even a pull request. I'm new to GitHub, so I'm still learning the ropes here.
Attached are two files: the PdfText class I created and a modified FS_RECTF.cs file that it requires. You'll need to add or modify the following code in PdfPage.cs to hook up the Text property and ensure it gets disposed. Keep in mind this is a work in progress and not thoroughly tested. If anything is missing or unresolved, let me know.
public PdfText Text
{
    get
    {
        if (_text == null)
        {
            _text = PdfText.Load(this);
        }
        return _text;
    }
}
private PdfText _text;
protected override void Dispose(FPDF_PAGE handle)
{
    (_text as IDisposable)?.Dispose();
    PDFium.FPDF_ClosePage(handle);
}