pdf2json
Page unit conversion to PDF points
@richardlitt and I are also having a little trouble understanding the 'page unit', the coordinate convention, and how that relates to PDF points (8.5" x 11" = 612 x 792 points). Can you provide a little clarification?
To translate the page units to any coordinate system do this:
Take the width and height returned in the parsed PDF object, then work out scale factors for the units of measure you want. For example, if your page is 290 mm wide and 297 mm high, then:
scale x = 290 / pdf width
scale y = 297 / pdf height
Then correct x and y for any text object
x = text x * scale x
y = text y * scale y
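A rough JavaScript sketch of the steps above, where each scale factor is the target page size divided by the size reported in page units. The function name and the numbers are made up for illustration and are not part of pdf2json:

```javascript
// Hypothetical helper: builds a converter from pdf2json page units to any
// target unit, given the page size in both systems.
function makeConverter(pageWidthUnits, pageHeightUnits, targetWidth, targetHeight) {
  const scaleX = targetWidth / pageWidthUnits;
  const scaleY = targetHeight / pageHeightUnits;
  return (x, y) => ({ x: x * scaleX, y: y * scaleY });
}

// Made-up example: a page reported as 40 x 50 page units that should map
// onto a 200 x 250 mm sheet.
const toMm = makeConverter(40, 50, 200, 250);
console.log(toMm(10, 20)); // { x: 50, y: 100 }
```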
@rkpatel33, what @SPlatten wrote is the general case. Since you asked how it relates specifically to 72dpi for pdf2json with scaling to 8.5x11:
//scaled to 72 DPI from whatever resolution the parsed version is in
x72dpi = x * (8.5 * 72) / parsedPdf.formImage.Width;
y72dpi = y * (11 * 72) / parsedPdf.formImage.Pages[0].Height; //assuming all pages are same height
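Wrapped up as a function, the same calculation might look like the sketch below. It assumes the parsed object has the shape used in the snippet above (formImage.Width and formImage.Pages[i].Height), which may differ between pdf2json versions. The function name and the sample object are made up, and the height is read per page instead of assuming all pages match:

```javascript
const POINTS_PER_INCH = 72;

// Hypothetical helper: converts a page-unit coordinate to PDF points for a
// page known to be US Letter (8.5 x 11 inches).
function toLetterPoints(parsedPdf, pageIndex, x, y) {
  const widthUnits = parsedPdf.formImage.Width;
  const heightUnits = parsedPdf.formImage.Pages[pageIndex].Height;
  return {
    x: (x * 8.5 * POINTS_PER_INCH) / widthUnits,
    y: (y * 11 * POINTS_PER_INCH) / heightUnits,
  };
}

// Made-up stand-in for parser output, only to exercise the function.
const fakeParsed = { formImage: { Width: 17, Pages: [{ Height: 22 }] } };
console.log(toLetterPoints(fakeParsed, 0, 8.5, 11)); // { x: 306, y: 396 }
```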
This issue can most likely be closed. AFAIK @rkpatel33 is no longer interested in this issue.
@richardlitt agreed. This is more of a help question than an issue but wanted to provide a concrete example for anyone else that trips across this as I did.
Hello gentlemen,
I am keenly interested in this issue and hopeful you are still in the neighborhood.
To wit, I am working with the A4 paper size which measures:
- 210 × 297 millimeters ( ratio = 0.7070707070707071 ) or
- 8.27 × 11.69 inches ( ratio = 0.7074422583404619 ) or
- 595 × 842 points ( ratio = 0.7066508313539192 ) rounded off in Postscript
pdf2json is telling me these are the dimensions:
- 37.188000 x 52.625000 ( ratio = 0.7066603325415677 )
What, pray tell, ARE these "units" called in pdf2json ?
Thank you in advance for your time and attention.
@jdwilsonjr the units are relative to the scale of the page. As the page scales, the units should be scaled by the same factor. In simple terms you are just scaling the positional units by the ratio between the source page height/width and the destination page height/width.
In your case: destination A4 paper is 72dpi (595/72 ~= 8.264, 842/72 ~= 11.69 inches), source paper is pdf2json, which is 37.188x52.625. You don't need to worry about the DPI of pdf2json here because even if it was 300dpi the x/y numbers would just be double what they'd be at 150dpi. Your only concern is scaling them by the same amount you're scaling the page.
x72dpi = pdf2json.x * 595 / 37.188000; // aka (8.27 * 72) / 37.188
y72dpi = pdf2json.y * 842 / 52.625000; // aka (11.69 * 72) / 52.625
@matthewerwin thanks for the response.
Hmmm, so are you saying that the pdf2json "paper size" is (relatively) 37.1875 inches by 52.625 inches and that is always used irrespective of the actual document dpi ?
@jdwilsonjr no. I'm saying the units used in pdf2json cancel out. When you divide the x coordinate by the width of the page, the units cancel just like they do when converting English units to metric units in elementary math.
i.e. converting from metric centimeters to English system inches
(50 cm * 1in) / 2.54cm = 19.685in
Does that clarify why the units used by pdf2json don't matter and you only need to be concerned about how that ratio relates to your desired destination units?
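A small sketch of that cancellation, with made-up numbers apart from the A4 figures quoted earlier in the thread: if the source coordinate and the source page width are both expressed in a unit twice as fine, the converted result does not change.

```javascript
// Hypothetical helper: position as a fraction of the page width, then
// scaled to the destination width in points.
function xToPoints(x, pageWidthUnits, pageWidthPoints) {
  return (x / pageWidthUnits) * pageWidthPoints;
}

// Same physical position, expressed in two different source resolutions.
const atBaseResolution = xToPoints(10, 37.188, 595);
const atDoubleResolution = xToPoints(10 * 2, 37.188 * 2, 595);

// The source unit cancels in the division, so both results agree.
console.log(Math.abs(atBaseResolution - atDoubleResolution) < 1e-9);
```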
Ok. Thanks.
Hello,
I am writing some tests to verify the size of some pdf files (A5, A4, A3 etc).
The width and height returned by pdf2json in case of A4 is 37.185 x 52.62. The A4 paper has the width 595px, and 595 / 37.185 = 16.00. As far as I've seen, the scale is ~16.00 for all page sizes.
Is there a way to obtain that 16.00 scale factor programmatically, so that I can have something like this in my tests:
pdf2jsonWidthA4 * scale = pdfWidthA4
pdf2jsonHeightA3 * scale = pdfHeightA3
etc
which will tell if the pdfs have the desired dimensions.
Thank you!
@szabbal A4 paper doesn't have a fixed pixel width. Pixels are a product of "resolution" -- pixels are dots and are fundamentally fixed by the DPI of the device which displays/captures them (even if that "device" is virtual/memory/disk storage).
If you want to determine whether something is A4 paper, it's a question of inches, not pixels. The aspect ratio alone won't separate "A" paper from "B" or "C" paper, since all the ISO series share the same 1:√2 ratio, so use the width and height in inches to determine the series and whether it's A4, A5, A6, etc.
Might help you to read this to get the relationship of units mentally sorted out:
https://stackoverflow.com/a/46105049/1513347
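As a sketch of that kind of check, assuming the page size has already been converted to inches: first a sanity check on the aspect ratio, then a match against the nominal ISO 216 A-series sizes. The function name and the tolerances are arbitrary choices for illustration.

```javascript
const MM_PER_INCH = 25.4;

// Nominal ISO 216 A-series sizes in millimetres (width, height, portrait).
const A_SERIES_MM = {
  A0: [841, 1189], A1: [594, 841], A2: [420, 594], A3: [297, 420],
  A4: [210, 297], A5: [148, 210], A6: [105, 148],
};

// Hypothetical helper: returns "A0".."A6" or null if nothing matches.
function classifyASeries(widthIn, heightIn, toleranceMm = 2) {
  // Normalise to portrait so landscape pages are handled too.
  const w = Math.min(widthIn, heightIn) * MM_PER_INCH;
  const h = Math.max(widthIn, heightIn) * MM_PER_INCH;
  // Sanity check: ISO sheets have a width:height ratio close to 1/sqrt(2).
  if (Math.abs(w / h - Math.SQRT1_2) > 0.01) return null;
  for (const [name, [mmW, mmH]] of Object.entries(A_SERIES_MM)) {
    if (Math.abs(w - mmW) <= toleranceMm && Math.abs(h - mmH) <= toleranceMm) {
      return name;
    }
  }
  return null;
}

console.log(classifyASeries(8.27, 11.69)); // "A4"
console.log(classifyASeries(8.5, 11));     // null (US Letter)
```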
Hi @matthewerwin, thank you for your reply. Indeed measuring paper width and height makes more sense with inches, thank you for clarifying that. However something is still not clear for me (sorry if this was already asked): I am using pdf2json to determine if some pdf files have the size of A0, A1, A2 etc. For all of my test files pdf2json returns a width and height (in page unit):
- A0: 148.935 x 210.555
- A1: 105.255 x 148.935
- A2: 74.37 x 105.255
- A3: 52.62 x 74.37
- A4: 37.185 x 52.62
- A5: 26.19 x 37.185
- A6: 18.57 x 26.19
For all of them the width/height ratio is roughly 0.71.
All of the above dimensions are proportional to the width and height of standard A0-6 papers, and in all cases that proportion is 1/16 (with DPI 72). Consider the following example for A4:
width = (scale * pdf2jsonWidth) / DPI
width = (16 * 37.185) / 72
width = 8.26 - more or less the width of A4 which is 8.3
height = (scale * pdf2jsonHeight) / DPI
height = (16 * 52.62) / 72
height = 11.69 - which is again more or less the height of A4 (11.7 inch)
I would like to have the same results but without that magic number 16. I would like to know where that 16 is coming from and whether I can get it programmatically.
Thank you!
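The thread does not say where the 16 comes from. One way to avoid hard-coding it, sketched below, is to derive a points-per-page-unit factor from pages of known size. This uses the widths listed above together with the nominal ISO 216 widths in millimetres. It is an empirical check, not a documented pdf2json constant, and the names are made up.

```javascript
const POINTS_PER_MM = 72 / 25.4;

// Nominal ISO 216 widths in millimetres.
const isoWidthMm = { A0: 841, A1: 594, A2: 420, A3: 297, A4: 210, A5: 148, A6: 105 };

// Widths in page units as reported in the list above.
const reportedWidth = {
  A0: 148.935, A1: 105.255, A2: 74.37, A3: 52.62, A4: 37.185, A5: 26.19, A6: 18.57,
};

// Points per page unit implied by one reference size.
function pointsPerPageUnit(name) {
  return (isoWidthMm[name] * POINTS_PER_MM) / reportedWidth[name];
}

const factors = Object.keys(reportedWidth).map(pointsPerPageUnit);
const meanFactor = factors.reduce((sum, f) => sum + f, 0) / factors.length;

// Each factor lands close to 16 for this data set.
console.log(factors.map((f) => f.toFixed(3)), meanFactor.toFixed(3));
```

In a test suite, the factor measured from one reference file of known size could then be applied to the others instead of a literal 16.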