PdfPig testing page.text

page.Text does not give text with newlines. I am using this code on the latest code of PdfPig:

using (PdfDocument doc = PdfDocument.Open(textBox1.Text, new ParsingOptions { Password = textBox2.Text }))
{
    var page = doc.GetPage(1);
    string pagetext = page.Text;
    File.WriteAllText("text.txt", pagetext);
    textBox3.Text = pagetext;
}

Jun 15 '19 11:06 mayurjansari

Hi there, has this changed between the versions or is this the first time you've run this document?

By default page.Text just gives the text in the order the PDF content stream presents it. Because PDF is a presentation format, text doesn't always correspond to reading order. You may be looking for something like page.GetWords(), which attempts to convert the presentation format to a logical word order.

For example the following is perfectly valid, in pseudo-pdf content:

Current Location 7, 7
Show-Text 'llo'
Move-To 5, 7
Show-Text 'He'
Move-To 23, 7
Show-Text 'World'

This would give the text (page.Text) lloHeWorld, whereas the actual reading order of the text is Hello World. Actually converting to reading order is a lot harder; page.GetWords() takes a simple approach, but it's not going to be robust.
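
To make the pseudo-PDF example concrete, here is a minimal sketch of the difference between content-stream order and a left-to-right ordering. It does not use PdfPig: `TextRun`, `ContentOrder`, `LeftToRight`, `charWidth` and `maxGap` are hypothetical names made up for illustration, and it assumes all runs sit on one line and every character is one unit wide.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical stand-in for a positioned run of text; not a PdfPig type.
public record TextRun(double X, double Y, string Text);

public static class ReadingOrderSketch
{
    // Concatenates the runs exactly as the content stream emits them.
    public static string ContentOrder(IEnumerable<TextRun> runs) =>
        string.Concat(runs.Select(r => r.Text));

    // Sorts the runs of a single line from left to right and inserts a space
    // wherever the gap to the end of the previous run exceeds maxGap.
    // Assumption: every character is charWidth units wide.
    public static string LeftToRight(IEnumerable<TextRun> runs, double charWidth, double maxGap)
    {
        var result = new StringBuilder();
        var previousEnd = double.NaN;

        foreach (var run in runs.OrderBy(r => r.X))
        {
            if (!double.IsNaN(previousEnd) && run.X - previousEnd > maxGap)
            {
                result.Append(' ');
            }

            result.Append(run.Text);
            previousEnd = run.X + run.Text.Length * charWidth;
        }

        return result.ToString();
    }
}
```

With the three runs from the example ('llo' at 7, 'He' at 5, 'World' at 23), `ContentOrder` returns lloHeWorld, while `LeftToRight` with `charWidth` 1 and `maxGap` 2 returns Hello World. Real documents have multiple lines, varying glyph widths and rotated text, which is where this kind of approach stops being robust.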

Jun 15 '19 12:06 EliotJones

I am getting the correct order for all of the text in the PDF. About 1% of the text is missing because it contains some Unicode letters. When I test the same PDF in PDFSharp, PDFBox (using IKVM) or iTextSharp, it gives me text with newlines. For example, data like "hello word abc" comes out as hellowordabc. page.GetWords() works great. Special thanks for making this awesome project.

Jun 15 '19 12:06 mayurjansari

Do you have an example document you would be able to share please? I want to check Text is at least doing what I would expect according to the file content even if it's not an exact match for PDFBox.

Jun 15 '19 12:06 EliotJones

Please give me your email address so I can send a file. It's a private document, so I'm not able to share it publicly. I used page.GetWords(); it gives the words, but not in the proper order. For example, if I have two different paragraphs at the same Y position, the two lines are mixed in page.GetWords(). page.Text gives the correct order, but it is missing the newline, or it puts a space between two different lines.

using (PdfDocument doc = PdfDocument.Open(txtFilePath.Text, new ParsingOptions { Password = txtPassword.Text }))
{
    Page page = doc.GetPage(1);
    string text = page.Text;
    var pageWords = page.GetWords();
    textBox3.Text = text + Environment.NewLine + Environment.NewLine + Environment.NewLine + string.Join(" ", pageWords.Select(x => x.Text));
    var images = page.ExperimentalAccess.GetRawImages();
}

Jun 16 '19 10:06 mayurjansari

My email is elioty at hotmail dot co dot uk. I'll take a look when I get a chance. It sounds like GetWords isn't returning the correct result because the text is multi-column, which is something of an unsolved problem: https://en.wikipedia.org/wiki/Document_layout_analysis

But hopefully I can change the way Text is built to at least preserve line breaks if they are present.

Jun 16 '19 11:06 EliotJones

Yes, it's multiple paragraphs starting at the same Y location. Please check your email for the file.

Jun 16 '19 11:06 mayurjansari

Hi Eliot,

I just did a pull request with 2 Document Layout Analysis tools:

Nearest Neighbour Word Extractor
Recursive X-Y Cut algorithm

Might be a solution to this problem.

Jun 16 '19 13:06 BobLd

Hi Mayur,

I looked into the changes necessary to insert newlines into the content stream. Because the newlines are not predictably used in the document content it turned out to be too difficult to do this well for every document type.

I've attached an updated nuget package that seems to work well for your document and pushed the changes to a branch https://github.com/UglyToad/PdfPig/tree/newline-in-text but I think they're too risky to merge to master at the moment. As discussed, I think implementing your own custom IWordExtractor would be a better way to approach this problem; page.Text is mainly provided to aid text searching of document content rather than as a reading order representation of content.

Let me know if this works for you.

PdfPig.0.0.6.2.zip

Jun 17 '19 21:06 EliotJones

I tested more than 25 documents of 6 different types. It works like a charm, but for one of the document types two words still get mixed when using page.Text. I am sending one of those documents to your email. When two different words get mixed up, it is misleading. If a newline is risky, then isn't a space also risky?

I also tried @BobLd's example, but in my case it gives 90% the same result as page.GetWords(). @BobLd, give me your email and I will send the document for testing.

Jun 19 '19 02:06 mayurjansari

It looks like in that specific document there was a more general bug in font size handling. I've attached a patched nuget package with a fix for that case, please let me know how it goes:

PdfPig.0.0.6.35.1.zip

Jun 19 '19 17:06 EliotJones

I am just sending a different type of PDF, one where two different columns of text get mixed. PdfPig.0.0.6.35.1 works great for the file format I sent 2 days ago; I tested more than 15 files.

Jun 21 '19 07:06 mayurjansari

I've had a look at the documents you sent through. These ones are much more difficult: the underlying word order in the PDF content is entirely different from the visual reading order. I'd be interested to know if PDFSharp or PDFBox handle those word orders correctly?

The previous company I worked at built a product partly around finding the correct word order for documents like these. It's not an easy problem to solve, especially where the words are in a table or column format.

I think you'd have more luck taking the output of GetWords and doing your own logic on the result to correctly order words because it would be very specific to the document types.
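
As a rough idea of what that document-specific logic could look like, here is a minimal sketch for one narrow case: a page with exactly two columns separated at a known X position. It does not use PdfPig types; `WordBox`, `ToLines`, `gutterX` and `lineTolerance` are hypothetical names for illustration, and the left/bottom coordinates are assumed to have been copied from each extracted word's bounding box.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for an extracted word; not a PdfPig type.
public record WordBox(double Left, double Bottom, string Text);

public static class TwoColumnOrderSketch
{
    // Orders words column by column: everything left of gutterX first, then the rest.
    // Within a column, lines run top to bottom (larger Bottom first, because the
    // PDF Y axis points up) and words within a line run left to right.
    public static IReadOnlyList<string> ToLines(IEnumerable<WordBox> words, double gutterX, double lineTolerance)
    {
        var lines = new List<string>();

        foreach (var column in words.GroupBy(w => w.Left < gutterX ? 0 : 1).OrderBy(g => g.Key))
        {
            List<WordBox> currentLine = new();

            foreach (var word in column.OrderByDescending(w => w.Bottom).ThenBy(w => w.Left))
            {
                // Start a new line when the word sits too far below the first word of the current line.
                if (currentLine.Count > 0 && Math.Abs(currentLine[0].Bottom - word.Bottom) > lineTolerance)
                {
                    lines.Add(Join(currentLine));
                    currentLine = new();
                }

                currentLine.Add(word);
            }

            if (currentLine.Count > 0)
            {
                lines.Add(Join(currentLine));
            }
        }

        return lines;
    }

    // Words on one line may differ slightly in Bottom, so re-sort them by Left before joining.
    private static string Join(List<WordBox> line) =>
        string.Join(" ", line.OrderBy(w => w.Left).Select(w => w.Text));
}
```

The fixed gutter position is the part that is specific to the document type; tables, a varying number of columns or full-width headings would each need their own rules.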

Jun 23 '19 15:06 EliotJones

Guys, will newlines be taken into account in page.Text?

Mar 02 '20 14:03 sachithanandhampalaniswamy

I'm not aware of any documents which include the new-line character inside the text itself, so page.Text does not include newlines (in all situations I've seen). To work out where the new-lines are you need to loop through the letters.

To give you an idea of what that might look like, this code should do it (you'll need some better checking on deltaY, otherwise sub/superscripts will be put on a new line):

var myText = new StringBuilder();
var previous = page.Letters[0];
myText.Append(previous.Value);
for (int i = 1; i < page.Letters.Count; i++)
{
   var current = page.Letters[i];
   var deltaY = Math.Abs(current.Location.Y - previous.Location.Y);
   if (deltaY > 3) 
   {
       myText.AppendLine();
   }

   myText.Append(current.Value);

   previous = current;
}
var actualText = myText.ToString();
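
One possible way to tighten the deltaY check is to scale the threshold by the font size instead of using a fixed value of 3. The following is a minimal sketch of that idea only, not what the library does: `Glyph`, `JoinWithNewlines` and the `factor` of 0.7 are assumptions for illustration, and the Y and point size values are assumed to have been copied from each letter beforehand.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for a letter's baseline Y, point size and value; not a PdfPig type.
public record Glyph(double Y, double PointSize, string Value);

public static class NewlineSketch
{
    // Inserts a newline only when the baseline moves by more than factor times the
    // larger of the two point sizes. Assumption: sub/superscripts shift by less than that.
    public static string JoinWithNewlines(IReadOnlyList<Glyph> glyphs, double factor = 0.7)
    {
        var text = new StringBuilder();

        for (int i = 0; i < glyphs.Count; i++)
        {
            if (i > 0)
            {
                var previous = glyphs[i - 1];
                var current = glyphs[i];
                var deltaY = Math.Abs(current.Y - previous.Y);
                var threshold = factor * Math.Max(current.PointSize, previous.PointSize);

                if (deltaY > threshold)
                {
                    text.AppendLine();
                }
            }

            text.Append(glyphs[i].Value);
        }

        return text.ToString();
    }
}
```

For a 12 point "x" followed by an 8 point superscript "2" raised by 4 units, the threshold is 8.4, so the superscript stays on the same line, whereas a fixed threshold of 3 would split it off. Starting the loop at index 0 also means an empty letter list returns an empty string instead of throwing.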

Mar 02 '20 15:03 EliotJones

https://github.com/UglyToad/PdfPig/issues/35#issuecomment-593444156 Hello Eliot! Is it possible to add the above mentioned functionality to the library?

Mar 23 '20 14:03 Martin005

Sure, I have a version I'm planning to add which seems to work well for most documents with sensible letter order, I'll update you when it's available.

Mar 24 '20 19:03 EliotJones

@EliotJones Is there a roadmap of planned features and feature releases? 😃

Mar 25 '20 12:03 Martin005

Another way to get the lines would be to use document analysis tools. Here is an example of what it could look like:

var myText = new StringBuilder();
using (PdfDocument document = PdfDocument.Open(pdfPath))
{
	for (int i = 0; i < document.NumberOfPages; i++)
	{
		var page = document.GetPage(i + 1);
		var words = NearestNeighbourWordExtractor.Instance.GetWords(page.Letters).ToList(); // extract words
		var blocks = RecursiveXYCut.Instance.GetBlocks(words, page.Width / 5.0);            // extract blocks and lines
		var orderedBlocks = UnsupervisedReadingOrderDetector.Instance.Get(blocks);          // order blocks

		foreach (var block in orderedBlocks)
		{
			foreach (var line in block.TextLines)
			{
				myText.AppendLine(line.Text);
			}
			myText.AppendLine("-----------------------------------------------");
		}
	}
}
var actualText = myText.ToString();

Mar 25 '20 13:03 BobLd

@Martin005 I don't have a formal/documented roadmap at present. To be honest I mainly just start adding features if I find them interesting when I get time and motivation to work on them (which is a bit tough at the moment because I've just started a new job so I'm a bit tired from that) but in my mental to-do list it's roughly something like:

PDF Merging (this has already been started)
PDF/A compliance for generated documents
PNG support for document builder
Ability to generate page images (I think @BobLd is doing something on this related to SVG)

Then there are other features contributed by other contributors, bug fixes for the core text extraction functionality and things like this issue: document layout/text analysis features, which are very much ad hoc.

Mar 25 '20 18:03 EliotJones

@Martin005, if you want to see some works in progress, you can have a look at my fork's branches.

Mar 25 '20 18:03 BobLd

@Martin005 I've added a new approach and an example of extracting text from the raw letters here:

New approach: https://github.com/UglyToad/PdfPig/blob/master/examples/ExtractTextWithNewlines.cs
Roll-your-own: https://github.com/UglyToad/PdfPig/blob/master/examples/OpenDocumentAndExtractWords.cs

Apr 19 '20 16:04 EliotJones

It's very cool.

Jun 02 '23 17:06 Vasanthvivi
