PdfPig
testing page.text
page.Text does not give text with newlines. I am using this code with the latest PdfPig code:
using (PdfDocument doc = PdfDocument.Open(textBox1.Text, new ParsingOptions { Password = textBox2.Text }))
{
    var page = doc.GetPage(1);
    string pagetext = page.Text;
    File.WriteAllText("text.txt", pagetext);
    textBox3.Text = pagetext;
}
Hi there, has this changed between the versions or is this the first time you've run this document?
By default page.Text just gives the text in the order the PDF content stream presents it. Because PDF is a presentation format, the order of text in the file doesn't always correspond to reading order. You may be looking for something like page.GetWords(), which attempts to convert the presentation format to a logical word order.
For example the following is perfectly valid, in pseudo-pdf content:
Current Location 7, 7
Show-Text 'llo'
Move-To 5, 7
Show-Text 'He'
Move-To 23, 7
Show-Text 'World'
This would give the text (page.Text) lloHeWorld, whereas the actual reading order of the text is Hello World. Actually converting to reading order is a lot harder; page.GetWords() takes a simple approach but it's not going to be robust.
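To make the pseudo-PDF example above concrete, here is a minimal sketch that does not use PdfPig at all. The Fragment record, the widths and the gap threshold are assumptions made up for illustration; a real extractor has to derive them from font metrics.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical stand-in for a positioned run of text; this is not a PdfPig type.
public record Fragment(string Text, double X, double Y, double Width);

public static class ReadingOrderSketch
{
    // Concatenates the fragments exactly as the content stream lists them.
    public static string StreamOrder(IEnumerable<Fragment> fragments) =>
        string.Concat(fragments.Select(f => f.Text));

    // Sorts the fragments of a single line left to right and inserts a space
    // wherever the horizontal gap between neighbours exceeds gapThreshold.
    public static string LeftToRight(IEnumerable<Fragment> fragments, double gapThreshold)
    {
        var ordered = fragments.OrderBy(f => f.X).ToList();
        var text = new StringBuilder();
        for (var i = 0; i < ordered.Count; i++)
        {
            if (i > 0)
            {
                var previous = ordered[i - 1];
                var gap = ordered[i].X - (previous.X + previous.Width);
                if (gap > gapThreshold) text.Append(' ');
            }
            text.Append(ordered[i].Text);
        }
        return text.ToString();
    }
}

public static class Program
{
    public static void Main()
    {
        // Same order and positions as the pseudo-PDF content; the widths are invented.
        var fragments = new List<Fragment>
        {
            new Fragment("llo", 7, 7, 3),
            new Fragment("He", 5, 7, 2),
            new Fragment("World", 23, 7, 5),
        };
        Console.WriteLine(ReadingOrderSketch.StreamOrder(fragments));      // lloHeWorld
        Console.WriteLine(ReadingOrderSketch.LeftToRight(fragments, 2.0)); // Hello World
    }
}
```

This only handles a single line of left-to-right text, which is why real documents need something more involved.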
I am getting the correct order for all the text in the PDF. About 1% of the text is missing because it contains some Unicode letters. When I test the same PDF with PDFsharp, PDFBox (using IKVM) or iTextSharp, I get text with newlines. Here, for example, data like "hello word abc" comes out as hellowordabc. page.GetWords() works great. Special thanks for making this awesome project.
Do you have an example document you would be able to share please? I want to check Text is at least doing what I would expect according to the file content even if it's not an exact match for PDFBox.
Please give me your email address so I can send a file. It's a private document, so I'm not able to share it in public. Using page.GetWords() gives the words, but not in the proper order. For example, if I have two different paragraphs at the same Y position, the two lines are mixed in page.GetWords(). page.Text gives the correct order, but it is missing a newline, or at least a space, between two different lines.
using (PdfDocument doc = PdfDocument.Open(txtFilePath.Text, new ParsingOptions { Password = txtPassword.Text }))
{
    Page page = doc.GetPage(1);
    string text = page.Text;
    var pageWords = page.GetWords();
    textBox3.Text = text + Environment.NewLine + Environment.NewLine + Environment.NewLine + string.Join(" ", pageWords.Select(x => x.Text));
    var images = page.ExperimentalAccess.GetRawImages();
}
My email is elioty at hotmail dot co dot uk. I'll take a look when I get a chance. It sounds like the GetWords isn't returning the correct result because the text is multi-column which is something of an unsolved problem: https://en.wikipedia.org/wiki/Document_layout_analysis
But hopefully I can change the way Text is built to at least preserve line breaks if they are present.
Yes, there are multiple paragraphs starting at the same Y position. Please check your mail for the file.
Hi Eliot,
I just did a pull request with 2 Document Layout Analysis tools:
- Nearest Neighbour Word Extractor
- Recursive X-Y Cut algorithm
Might be a solution to this problem.
Hi Mayur,
I looked into the changes necessary to insert newlines into the content stream. Because the newlines are not predictably used in the document content it turned out to be too difficult to do this well for every document type.
I've attached an updated nuget package that seems to work well for your document and pushed the changes to a branch https://github.com/UglyToad/PdfPig/tree/newline-in-text but I think they're too risky to merge to master at the moment. As discussed I think implementing your own custom IWordExtractor would be a better way to approach this problem, page.Text is mainly provided to aid text searching of document content rather than a reading order representation of content.
Let me know if this works for you.
I tested more than 25 documents of 6 different types. It works like a charm, but for one of the document types two words still get mixed together when using page.Text. I am sending one of those documents to your mail. When two different words get mixed up it is misleading. If adding a newline is risky, is adding a space also risky?
I also tried @BobLd's example, but in my case it gives about 90% the same result as page.GetWords().
@BobLd, give me your email address and I will send the document for testing.
It looks like in that specific document there was a more general bug in font size handling. I've attached a patched nuget package with a fix for that case, please let me know how it goes.
I am just sending a different type of PDF that contains columns, where the text of two different columns gets mixed. pdfpig.0.0.6.35.1 works great for the file format I sent 2 days ago; I tested more than 15 files.
I've had a look at the documents you sent through. These ones are much more difficult: the underlying word order in the PDF content is entirely different to the visual reading order. I'd be interested to know if PDFSharp or PDFBox handle those word orders correctly?
The previous company I worked at built a product partly around correctly finding the correct word order for documents like these, it's not an easy problem to solve, especially where the words are in a table or column format.
I think you'd have more luck taking the output of GetWords and doing your own logic on the result to correctly order words because it would be very specific to the document types.
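As a rough idea of what such document-specific logic could look like, here is a sketch for a page with exactly two columns. It does not call PdfPig: PositionedWord is a hypothetical stand-in for the words returned by GetWords, and the column boundary and line tolerance are assumptions you would have to pick for your own document type.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a word and the bottom-left corner of its bounding box.
public record PositionedWord(string Text, double Left, double Bottom);

public static class TwoColumnOrdering
{
    // Reads the left column top to bottom, then the right column.
    // columnBoundaryX and lineTolerance are assumed to be known for the document type.
    public static string Order(IEnumerable<PositionedWord> words, double columnBoundaryX, double lineTolerance)
    {
        var lines = new List<string>();
        // Key false = left of the boundary, true = right; false sorts first.
        foreach (var column in words.GroupBy(w => w.Left >= columnBoundaryX).OrderBy(g => g.Key))
        {
            // PDF coordinates grow upwards, so the top line has the largest Y.
            var remaining = column.OrderByDescending(w => w.Bottom).ToList();
            while (remaining.Count > 0)
            {
                var top = remaining[0].Bottom;
                // Words within lineTolerance of the topmost remaining word form one line.
                var line = remaining.TakeWhile(w => top - w.Bottom <= lineTolerance).ToList();
                remaining.RemoveRange(0, line.Count);
                lines.Add(string.Join(" ", line.OrderBy(w => w.Left).Select(w => w.Text)));
            }
        }
        return string.Join("\n", lines);
    }
}
```

With two columns whose lines share the same Y values, calling Order with a boundary between the columns returns all lines of the left column before any line of the right one, instead of interleaving them.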
Guys, will newlines in page.Text be taken into account?
I'm not aware of any documents which include the new-line character inside the text itself, so page.Text does not include newlines (in all situations I've seen). To work out where the new-lines are you need to loop through the letters.
To give you an idea of what that might look like this code should do it (you'll need some better checking on deltaY otherwise sub/superscripts will be included on the newline):
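// Assumes `page` is a Page returned by PdfDocument.GetPage(...) with at least one letter;
// needs `using System;` and `using System.Text;`.
var myText = new StringBuilder();
var previous = page.Letters[0];
myText.Append(previous.Value);
for (int i = 1; i < page.Letters.Count; i++)
{
    var current = page.Letters[i];
    // A vertical jump larger than the threshold is treated as the start of a new line.
    var deltaY = Math.Abs(current.Location.Y - previous.Location.Y);
    if (deltaY > 3)
    {
        myText.AppendLine();
    }
    myText.Append(current.Value);
    previous = current;
}
var actualText = myText.ToString();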
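One possible way to do the "better checking on deltaY" mentioned above is to scale the threshold with the letter size, so that a small vertical shift for a sub/superscript does not count as a new line. This is only a sketch: Glyph is a hypothetical stand-in for a letter, and the 0.7 factor is a guess rather than a value taken from PdfPig.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for a letter: its text, baseline Y and size in points.
public record Glyph(string Value, double Y, double PointSize);

public static class LineBreakSketch
{
    // Starts a new line only when the baseline moves by more than a fraction of the
    // larger of the two neighbouring letter sizes. The default fraction is a guess.
    public static string WithNewlines(IReadOnlyList<Glyph> letters, double fraction = 0.7)
    {
        var text = new StringBuilder();
        for (var i = 0; i < letters.Count; i++)
        {
            if (i > 0)
            {
                var previous = letters[i - 1];
                var deltaY = Math.Abs(letters[i].Y - previous.Y);
                var threshold = fraction * Math.Max(letters[i].PointSize, previous.PointSize);
                if (deltaY > threshold) text.Append('\n');
            }
            text.Append(letters[i].Value);
        }
        return text.ToString();
    }
}
```

For example, a 6 point superscript raised by 4 units next to 10 point text stays on the same line, while a drop of 12 units starts a new one. Documents with very tight line spacing would still need a different rule.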
https://github.com/UglyToad/PdfPig/issues/35#issuecomment-593444156 Hello Eliot! Is it possible to add the above mentioned functionality to the library?
Sure, I have a version I'm planning to add which seems to work well for most documents with sensible letter order, I'll update you when it's available.
@EliotJones Is there a roadmap of planned features and feature releases? 😃
Another way to get the lines would be to use document analysis tools. Here is an example of what it could look like:
var myText = new StringBuilder(); // collects the extracted lines
using (PdfDocument document = PdfDocument.Open(pdfPath))
{
    for (int i = 0; i < document.NumberOfPages; i++)
    {
        var page = document.GetPage(i + 1); // pages are numbered from 1
        var words = NearestNeighbourWordExtractor.Instance.GetWords(page.Letters).ToList(); // extract words
        var blocks = RecursiveXYCut.Instance.GetBlocks(words, page.Width / 5.0); // extract blocks and lines
        var orderedBlocks = UnsupervisedReadingOrderDetector.Instance.Get(blocks); // order blocks
        foreach (var block in orderedBlocks)
        {
            foreach (var line in block.TextLines)
            {
                myText.AppendLine(line.Text);
            }
            myText.AppendLine("-----------------------------------------------");
        }
    }
}
var actualText = myText.ToString();
@Martin005 I don't have a formal/documented roadmap at present. To be honest I mainly just start adding features if I find them interesting when I get time and motivation to work on them (which is a bit tough at the moment because I've just started a new job so I'm a bit tired from that) but in my mental to-do list it's roughly something like:
- PDF Merging (this has already been started)
- PDF/A compliance for generated documents
- PNG support for document builder
- Ability to generate page images (I think @BobLd is doing something on this related to SVG)
Then there are other features contributed by other contributors, bug fixes for the core text extraction functionality and things like this issue, document layout/text analysis features which are very much ad-hoc.
@Martin005 , if you want to see some works in progress, you can have a look at my fork's branches
@Martin005 I've added a new approach and an example of extracting text from the raw letters here:
- New approach: https://github.com/UglyToad/PdfPig/blob/master/examples/ExtractTextWithNewlines.cs
- Roll-your-own: https://github.com/UglyToad/PdfPig/blob/master/examples/OpenDocumentAndExtractWords.cs
It's very cool.