New lines

Open Lesaje opened this issue 1 year ago • 3 comments

        var result = new List<Invoice>();
        foreach (var file in directoryInfo.GetFiles("*.pdf"))
        {
            var text = "";
            using (PdfDocument document = PdfDocument.Open(file.FullName))
            {
                foreach (Page page in document.GetPages()) { text += page.Text; }
            }
            result.Add(new Invoice(file.Name, text));
        }
        return result;
        var result = new Invoice(file.Name.Replace(".pdf", ".txt"), text);
        File.WriteAllText(result.Path, result.Content);

When using this code, I get the following result: $10.00 USD due October 23, 2023Page 1 of 1Date of issueOctober 23, 2023Date dueOctober 23

So there is clearly some problem with new lines. Could that be fixed somehow?

Lesaje avatar Nov 03 '23 16:11 Lesaje

I had to work around both the missing line breaks and the missing spaces to mirror PDFBox's output, as follows. This works fine about 99% of the time, until it encounters a layout where the letters come in a very odd order. The result ends up in the variable strPDFTXTOut.

        // 1px used as pdf accuracy is not ideal
        var deviation = 1;
        // string to build to
        var strPDFTXTOut = string.Empty;
        var letters = page.Letters;
        // get the 1st letter for coordinates, sizes etc
        var lastLetter = letters[0];
        // each letter
        foreach (var letter in letters)
        {
            // calc difference in vertical and horizontal position to last letter
            var difY = letter.Location.Y - lastLetter.Location.Y;
            var difX = letter.Location.X - lastLetter.Location.X - lastLetter.Width;
            if (difY < -deviation || difY > deviation)
            {
                // if the letter is more than 1px vertically different from last letter then it's a carriage return
                strPDFTXTOut += "\r\n";
            }
            else if (difX > deviation)
            {
                // if the letter is more than 1px horizontally different from last letter then it's a space
                strPDFTXTOut += " ";
            }
            // add this letter
            strPDFTXTOut += letter.Value;
            // save this letter as last letter
            lastLetter = letter;
        }
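For anyone who wants to try the heuristic without a PDF at hand, here is a minimal sketch of the same idea against a plain stand-in type. Glyph and LineBreakHeuristic are hypothetical names made up for illustration and are not part of PdfPig; the fields only mimic the position, width and value used above. It also guards against an empty page and uses a StringBuilder instead of repeated string concatenation, which are my own assumptions about what is desirable, not something the original snippet does.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for a PDF letter: start position, width and text value.
public readonly record struct Glyph(double X, double Y, double Width, string Value);

public static class LineBreakHeuristic
{
    // Rebuilds text from glyphs in the order given. A vertical shift larger than
    // `tolerance` becomes a line break; a horizontal gap larger than `tolerance`
    // on the same line becomes a space.
    public static string Rebuild(IReadOnlyList<Glyph> glyphs, double tolerance = 1.0)
    {
        if (glyphs.Count == 0)
        {
            return string.Empty; // avoid indexing into an empty list
        }

        var sb = new StringBuilder();
        Glyph last = glyphs[0];

        foreach (Glyph glyph in glyphs)
        {
            double difY = glyph.Y - last.Y;
            double difX = glyph.X - last.X - last.Width;

            if (Math.Abs(difY) > tolerance)
            {
                sb.Append("\r\n");
            }
            else if (difX > tolerance)
            {
                sb.Append(' ');
            }

            sb.Append(glyph.Value);
            last = glyph;
        }

        return sb.ToString();
    }
}
```

Like the original, this relies entirely on the order in which the glyphs arrive, so it will still break on layouts where the content stream order does not match the reading order.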

HuwSy avatar Nov 27 '23 14:11 HuwSy

If you want new lines, try:

        var text = "";
        using (PdfDocument document = PdfDocument.Open(file.FullName))
        {
            foreach (Page page in document.GetPages())
            {
                // ContentOrderTextExtractor lives in UglyToad.PdfPig.DocumentLayoutAnalysis.TextExtractor
                text += ContentOrderTextExtractor.GetText(page);
            }
        }
        result.Add(new Invoice(file.Name, text));

mayurjansari avatar Nov 28 '23 06:11 mayurjansari

See also

  • https://github.com/UglyToad/PdfPig/issues/630
  • https://github.com/UglyToad/PdfPig/issues/274

EliotJones avatar Feb 18 '24 15:02 EliotJones