PdfPig
New lines
var result = new List<Invoice>();
foreach (var file in directoryInfo.GetFiles("*.pdf"))
{
    var text = "";
    using (PdfDocument document = PdfDocument.Open(file.FullName))
    {
        foreach (Page page in document.GetPages()) { text += page.Text; }
    }
    result.Add(new Invoice(file.Name, text));
}
return result;
var result = new Invoice(file.Name.Replace(".pdf", ".txt"), text);
File.WriteAllText(result.Path, result.Content);
When using this code, I get the following result:
$10.00 USD due October 23, 2023Page 1 of 1Date of issueOctober 23, 2023Date dueOctober 23
So there is clearly some problem with new lines. Could that be fixed somehow?
I have had to work around both the missing carriage returns and the missing spaces to mirror PDFBox, as follows. This appears to work fine 99% of the time, until a page with a very odd layout order is encountered. The outcome ends up in the variable strPDFTXTOut.
// 1px used as pdf accuracy is not ideal
var deviation = 1;
// string to build to
var strPDFTXTOut = string.Empty;
var letters = page.Letters;
// get the 1st letter for coordinates, sizes etc
var lastLetter = letters[0];
// each letter
foreach (var letter in letters)
{
// calc difference in vertical and horizontal position to last letter
var difY = letter.Location.Y - lastLetter.Location.Y;
var difX = letter.Location.X - lastLetter.Location.X - lastLetter.Width;
if (difY < -deviation || difY > deviation)
{
// if the letter is more than
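The snippet above is cut off, so the rest of the original logic is unknown. As a minimal sketch of the same idea, and not the commenter's actual code, the following assumes that a vertical jump larger than `deviation` means a new line and a horizontal gap larger than `spaceGap` means a missing space. `Glyph`, `LineRebuilder`, `BuildText` and `spaceGap` are hypothetical names introduced for illustration; `Glyph` is a plain stand-in for a PdfPig letter so that the sketch compiles without the library.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for a PdfPig letter: its text, X/Y position and width.
public sealed record Glyph(string Value, double X, double Y, double Width);

public static class LineRebuilder
{
    // Rebuilds text from glyphs given in reading order.
    // Assumptions: |difY| > deviation starts a new line,
    // difX > spaceGap inserts a single space.
    public static string BuildText(
        IReadOnlyList<Glyph> glyphs, double deviation = 1.0, double spaceGap = 1.0)
    {
        if (glyphs.Count == 0)
        {
            return string.Empty;
        }

        var sb = new StringBuilder();
        var last = glyphs[0];
        sb.Append(last.Value);

        for (var i = 1; i < glyphs.Count; i++)
        {
            var glyph = glyphs[i];

            // Difference in vertical and horizontal position to the previous glyph.
            var difY = glyph.Y - last.Y;
            var difX = glyph.X - last.X - last.Width;

            if (Math.Abs(difY) > deviation)
            {
                sb.Append('\n');   // moved to another baseline: new line
            }
            else if (difX > spaceGap)
            {
                sb.Append(' ');    // same line, but a visible gap: missing space
            }

            sb.Append(glyph.Value);
            last = glyph;
        }

        return sb.ToString();
    }
}
```

With PdfPig, the glyph list could presumably be filled from `page.Letters` using `letter.Value`, `letter.Location.X`, `letter.Location.Y` and `letter.Width`. Like the original workaround, this relies on the letters arriving in reading order, so unusual layouts can still produce wrong output.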
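If you want new lines, try ContentOrderTextExtractor (from the UglyToad.PdfPig.DocumentLayoutAnalysis.TextExtractor namespace):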
var text = "";
using (PdfDocument document = PdfDocument.Open(file.FullName))
{
    foreach (Page page in document.GetPages())
    {
        text += ContentOrderTextExtractor.GetText(page);
    }
}
result.Add(new Invoice(file.Name, text));
See also
- https://github.com/UglyToad/PdfPig/issues/630
- https://github.com/UglyToad/PdfPig/issues/274