
Question: Missing spaces in text output

Open TheRealThor opened this issue 2 years ago • 1 comments

Hey, this might not be a bug. I created a PDF from a DOCX file. In the PDF, the text is written like this: "(University of California, Los Angeles - School of Law)". In the text output extracted from the PDF, some of those spaces are missing and others are added where they don't belong:

Output:
{
      oc: '#424f78',
      x: 3.395,
      y: 3.774,
      w: 3.948,
      sw: 0.34115625,
      clr: -1,
      A: 'left',
      R: [Array],
      text: '(Universityof C alifornia,LosAngeles-SchoolofLaw)'
   },

I used this code to extract the text (from example):

function addTextToLines(textLines, item) {
  const existingLine = textLines.find(({ y }) => y === item.y);
  if (existingLine) {
    existingLine.text += "" + item.text;
  } else {
    textLines.push(item);
  }
}

TheRealThor, Mar 28 '22 13:03

It looks like the software that produced that PDF has a particular way of splitting words into text entries... Maybe because the paragraph of text was justified, i.e. a variable amount of space was added between words and letters to fill the width of the page?

In that case, I see two possible solutions that you may want to try:

  • try to regenerate that PDF file in a way that prevents variable spacing between words and letters, e.g. by using another PDF generator, or better: by disabling any text formatting that may cause this problem.
  • or replace "" with " " in your source code, to add spaces between text entries, and then process the resulting lines to de-duplicate spaces and fix words using a spell-checker / dictionary.
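
A minimal sketch of the second option, assuming the same addTextToLines helper as in the question. The normalizeSpaces function and the sample items are hypothetical and only for illustration; real items come from the parser. Note that this only collapses duplicated spaces: a word split in the middle (like "C alifornia") would still need a spell-checker / dictionary pass.

```javascript
// Variant of addTextToLines from the question: entries on the same line
// are joined with a space instead of an empty string.
function addTextToLines(textLines, item) {
  const existingLine = textLines.find(({ y }) => y === item.y);
  if (existingLine) {
    existingLine.text += " " + item.text;
  } else {
    // Copy the item so the original parser output is not mutated later.
    textLines.push({ ...item });
  }
}

// Hypothetical helper: collapse runs of whitespace into a single space
// and trim both ends of the line.
function normalizeSpaces(text) {
  return text.replace(/\s+/g, " ").trim();
}

// Made-up sample entries, standing in for items emitted by the parser.
const sampleItems = [
  { y: 3.774, text: "(University " },
  { y: 3.774, text: "of" },
  { y: 3.774, text: " California)" },
];

const textLines = [];
sampleItems.forEach((item) => addTextToLines(textLines, item));

const cleaned = textLines.map((line) => normalizeSpaces(line.text));
console.log(cleaned);
```

Joining with a space first and normalizing afterwards keeps the grouping logic unchanged, so the clean-up step can be replaced by something smarter later.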

adrienjoly, Apr 02 '22 11:04