pdfium-lib icon indicating copy to clipboard operation
pdfium-lib copied to clipboard

Some Suggestions for some code improvement and a question related to removed functions

Open quincarter opened this issue 2 years ago • 21 comments

Hello there again!

I am actively looking at this and i noticed that the code has changed a lot in the last year or so. Glad to see you are working on the project!

There's some new things i found that are kind of difficult to understand just at first glance and i would recommend renaming a lot of the variables that are like i or H or F or E or something like one letter variables to be more descriptive.

Just as an example, i am not sure what the bottom lines do exactly. like what values do s, J, Z, E, m represent?

window.heap = (J, s) => {
                let E;
                switch (J) {
                    case Int8Array: E = Module.HEAP8; break;
                    case Int16Array: E = Module.HEAP16; break;
                    case Int32Array: E = Module.HEAP32; break;
                    case Uint8Array: E = Module.HEAPU8; break;
                    case Uint16Array: E = Module.HEAPU16; break;
                    case Uint32Array: E = Module.HEAPU32; break;
                    case Float32Array: E = Module.HEAPF32; break;
                    case Float64Array: E = Module.HEAPF64; break;
                }
                const Z = J.BYTES_PER_ELEMENT; const m = Module._malloc(s * Z);
                const a = Array(1 + s); a[0] = ({ s, J, Z, E, m, free: () => Module._free(m) }); // ???
                for (let i = 0; i < s; i++) a[i + 1] = ({ p: m + (i * Z), get v() { return E[m / Z + i]; } }); // ????
                return a;
            };

There are some things that can be cleaned up too -- like:

switch (lastError) {
                    case FPDF.LAST_ERROR.SUCCESS:
                        return "success";
                        break; // break isn't needed because this is unreachable since there is a return above it. :)
                    case FPDF.LAST_ERROR.UNKNOWN:
                        return "unknown error";
                        break; // break isn't needed because this is unreachable since there is a return above it. :)
                    case FPDF.LAST_ERROR.FILE:
                        return "file not found or could not be opened";
                        break; // break isn't needed because this is unreachable since there is a return above it. :)
                    case FPDF.LAST_ERROR.FORMAT:
                        return "file not in PDF format or corrupted";
                        break; // break isn't needed because this is unreachable since there is a return above it. :)
                    case FPDF.LAST_ERROR.PASSWORD:
                        return "password required or incorrect password";
                        break; // break isn't needed because this is unreachable since there is a return above it. :)
                    case FPDF.LAST_ERROR.SECURITY:
                        return "unsupported security scheme";
                        break; // break isn't needed because this is unreachable since there is a return above it. :)
                    case FPDF.LAST_ERROR.PAGE:
                        return "page not found or content error";
                        break; // break isn't needed because this is unreachable since there is a return above it. :)
                    default:
                        return "unknown error";

And one last thing i see there is an open issue for related to the Bitmap memory leak (#33). I see that there was a function or two removed from the code and possibly some documentation around how to actually run the alternative would be helpful. I am not exactly sure how to get a pdf to render in this new way.

This function was removed:

PDFium_GetRenderPageDataSize(doc, pageNumber)

I am not sure how to actually get the render page data size now with these new functions. So any help in that sense would be great. I am able to execute basically every other function on FPDF/Pdfium

EDIT - I had this working in an earlier version using that other function, utilizing the pdfium.wasm file and the glue code. Pretty sure that was even back in January I updated it and it is working on that version. But i need to update it to stay in sync with the code.

quincarter avatar May 21 '22 02:05 quincarter

Hi,

We are using the functions direct from PDFium C functions without need create custom code.

// index, width, height
getRender(i = 0, w, h) {
  const flag = FPDF.REVERSE_BYTE_ORDER | FPDF.ANNOT;
  const heap = Module._malloc(w * h * C);

  for (let i = 0; i < w * h * C; i++) {
      Module.HEAPU8[heap + i] = 0;
  }

  const bmap = FPDF.Bitmap_CreateEx(w, h, F, heap, w * C);
  const page = FPDF.LoadPage(this.wasmData.wasm, i);

  FPDF.Bitmap_FillRect(bmap, 0, 0, w, h, 0xFFFFFFFF);
  FPDF.RenderPageBitmap(bmap, page, 0, 0, w, h, 0, flag);
  FPDF.Bitmap_Destroy(bmap);
  FPDF.ClosePage(page);

  return heap;
}

paulocoutinhox avatar May 21 '22 04:05 paulocoutinhox

Ok so the for loop is where my code is failing.

for (let i = 0; i < w * h * C; i++) { // i is already defined 
      Module.HEAPU8[heap + i] = 0; // which i is this supposed to be using?
  }

In this instance, i is already defined in the above method signature. So what value is that supposed to be?

quincarter avatar May 21 '22 11:05 quincarter

Okay so i think i got a little farther. I have extracted all your classes to separate files and making this make more sense to me. But the variable names are still kind of arbitrary to me. I am having some issues figuring out what some of the single character variables mean. I see on this page that the C you use in CreatEx is part of the stride. When i set it to null, it actually runs through, but it messes with the rest of the functions and the loops don't get hit. If i keep it at your default value of 4 then it doesn't run at all. Any guidance as to what the variables mean would be helpful. So i guess my question is what C actually refers to in your code 😄

Thank you!

Code i am talking about:

// index, width, height
getRender(i = 0, w, h) {
  const flag = FPDF.REVERSE_BYTE_ORDER | FPDF.ANNOT;
  const heap = Module._malloc(w * h * C); // <<<<<< here what is C? I know you declare a global C and set it C = 4, but that isn't working in my code. 

  for (let i = 0; i < w * h * C; i++) {
      Module.HEAPU8[heap + i] = 0;
  }

  const bmap = FPDF.Bitmap_CreateEx(w, h, F, heap, w * C);
  const page = FPDF.LoadPage(this.wasmData.wasm, i);

  FPDF.Bitmap_FillRect(bmap, 0, 0, w, h, 0xFFFFFFFF);
  FPDF.RenderPageBitmap(bmap, page, 0, 0, w, h, 0, flag);
  FPDF.Bitmap_Destroy(bmap);
  FPDF.ClosePage(page);

  return heap;
}

quincarter avatar May 22 '22 22:05 quincarter

Okay well i got it. It's not rendering as pretty as it was before in the previous version. Any tips to get it not to shift so much? Not sure why it's doing that. Also when i change pages, it is distorting the image on the canvas. I am playing with that, but it's rendering in like 5ms on each image and cleaning the heap. So it's managing memory pretty well in this way.

image

Distorted image. (every page turn it gets worse) image

quincarter avatar May 23 '22 04:05 quincarter

Hi,

@cetinsert can you help with these questions?

I think that C is because the file format: BGRA.

https://github.com/chromium/pdfium/blob/3e36f68831431bf497babc74075cd69af5fd9823/public/fpdfview.h#L1025

Thanks.

paulocoutinhox avatar May 24 '22 05:05 paulocoutinhox

I think that C is because the file format: BGRA.

yes


// index, width, height
getRender(i = 0, w, h) {
  const flag = FPDF.REVERSE_BYTE_ORDER | FPDF.ANNOT;
  const heap = Module._malloc(w * h * C); // <<<<<< here what is C? I know you declare a global C and set it C = 4, but that isn't working in my code. 

  for (let i = 0; i < w * h * C; i++) {
      Module.HEAPU8[heap + i] = 0;
  }

  const bmap = FPDF.Bitmap_CreateEx(w, h, F, heap, w * C);
  const page = FPDF.LoadPage(this.wasmData.wasm, i);

  FPDF.Bitmap_FillRect(bmap, 0, 0, w, h, 0xFFFFFFFF);
  FPDF.RenderPageBitmap(bmap, page, 0, 0, w, h, 0, flag);
  FPDF.Bitmap_Destroy(bmap);
  FPDF.ClosePage(page);

  return heap;
}
  for (let i = 0; i < w * h * C; i++) {
      Module.HEAPU8[heap + i] = 0;
  }

I am not using this explicit loop. There is a faster way of doing it, which I will have to review.

  o.page_render = function(page, w, h, { rotate = 0, flag = 0 } = {}) { flag |= FPDF.Const.RenderPage.REVERSE_BYTE_ORDER;
    const  heap =               malloc                            (w * h * C);
    /**/          Module.HEAPU8.subarray(heap, heap +              w * h * C).fill(0);
    const  bmap =          FPDF.Bitmap_CreateEx                   (w,  h,  F, heap, w * C); //FPDF.Bitmap_FillRect(bmap, 0, 0, w, h, 0xFFFFFFFF /* AABBGGRR */);
    /**/                   FPDF.RenderPageBitmap(bmap, page, 0, 0, w,  h, rotate, flag);
    /**/                   FPDF.Bitmap_Destroy  (bmap);
    return heap;
  };
  o.render = function(n = 1, o = { /* scale, rotation, rotate, flag */ }) { // PDF.js -> PDFium
    const   page   =                                                          FPDF.LoadPage(this.wasm, n - 1);
    const  rotate  = o.rotate ?? (o.rotation !== undefined ? rotation2rotate(o.rotation) : 0);
    const [ w, h ] = this.getPageSize(n - 1,      { ...o, rotate });
    const   heap   = this.page_render(page, w, h, { ...o, rotate });          FPDF.ClosePage(page);
    const   data   = Module.HEAPU8.subarray(heap, heap + w * h * C); // 👈🏻
    const   i      = new ImageData(w, h); i.data.set(data); /* 👈🏻 */ free(heap); 
    return Comlink.transfer({ a: i.data.buffer, w, h, rotate }, [ i.data.buffer ]);
  };

CetinSert avatar May 24 '22 15:05 CetinSert

Yeah i would imagine doing this with a subArray() would be better, no?

new Uint8ClampedArray(
      this.Module.HEAPU8.subarray(
        someData
       somdData.length
      )
    );

quincarter avatar May 24 '22 15:05 quincarter

As for all the single-letter characters in

const heap = (J, s) => { // J = JS array type; s = typed array size
  let     E;
  switch (J) {
    case    Int8Array: E = Module.  HEAP8; break;
    case   Int16Array: E = Module. HEAP16; break;
    case   Int32Array: E = Module. HEAP32; break;
    case   Uint8Array: E = Module. HEAPU8; break;
    case  Uint16Array: E = Module.HEAPU16; break;
    case  Uint32Array: E = Module.HEAPU32; break;
    case Float32Array: E = Module.HEAPF32; break;
    case Float64Array: E = Module.HEAPF64; break;
  }
  const  Z = J.BYTES_PER_ELEMENT;          const m           =  malloc(s * Z);
  const  a = Array(1 + s); a[0] = ({ s, J, Z, E, m, free: () => free(m) });
  for (let i = 0; i < s; i++) a[i + 1] = ({ p: m + (i * Z), get v() { return E[m / Z + i]; } });
  return a;
};
const H = (t, s, d) => f => { // type size default
  const [ m,  ...a ] = heap(t, s);
  const v = f(...a.map(x => x.p)); if (!v) { m.free(); return d; } // !v :: C* return value not 0
  const r =      a.map(x => x.v) ;           m.free(); return r;
};

There's some new things i found that are kind of difficult to understand just at first glance and i would recommend renaming a lot of the variables that are like i or H or F or E or something like one letter variables to be more descriptive.

Just as an example, i am not sure what the bottom lines do exactly. like what values do s, J, Z, E, m represent?

I am responsible for this bit of the code. Feel free to revise it as you see fit. I would prefer you keep the neat 2d alignment in place though.

inside heap:

  • J: JS type
  • E: emscripten type
  • Z: byte siZe of the array element type (..., UInt81; ... UInt324, ...)
  • s: typed array size
  • i: typed array element index

inside H (which allocates heap space; calls f on it and frees the memory afterwards):

  • f: function to call on the temporary heap allocation

CetinSert avatar May 24 '22 15:05 CetinSert

@quincarter, @paulocoutinhox I hope I have answered all the questions.

CetinSert avatar May 24 '22 15:05 CetinSert

i am having some issues with the subarray and how to format it, but i do think it would be faster. I was using that in a previous iteration but the data was different.

quincarter avatar May 24 '22 15:05 quincarter

Okay well i got it. It's not rendering as pretty as it was before in the previous version. Any tips to get it not to shift so much? Not sure why it's doing that. Also when i change pages, it is distorting the image on the canvas. I am playing with that, but it's rendering in like 5ms on each image and cleaning the heap. So it's managing memory pretty well in this way.

image

Distorted image. (every page turn it gets worse) image

@cetinsert do you know why this is happening? The image distortion? I can get an initial page to work, but when i change pages, it is like the bitmap doesn't get cleared or something or fails to render it properly.

  • EDIT - I am building a separate project, extending this one and the functionality, trying to further optimize how it runs in the browser.

quincarter avatar May 24 '22 15:05 quincarter

No idea. Sorry.

CetinSert avatar May 24 '22 15:05 CetinSert

Are you using pdfium.wasm in this project? https://github.com/cetinsert/cetinsert

quincarter avatar May 24 '22 15:05 quincarter

@quincarter, @paulocoutinhox I hope I have answered all the questions.

Thanks, you are TOP!

paulocoutinhox avatar May 24 '22 15:05 paulocoutinhox

As for all the single-letter characters in

const heap = (J, s) => { // J = JS array type; s = typed array size
  let     E;
  switch (J) {
    case    Int8Array: E = Module.  HEAP8; break;
    case   Int16Array: E = Module. HEAP16; break;
    case   Int32Array: E = Module. HEAP32; break;
    case   Uint8Array: E = Module. HEAPU8; break;
    case  Uint16Array: E = Module.HEAPU16; break;
    case  Uint32Array: E = Module.HEAPU32; break;
    case Float32Array: E = Module.HEAPF32; break;
    case Float64Array: E = Module.HEAPF64; break;
  }
  const  Z = J.BYTES_PER_ELEMENT;          const m           =  malloc(s * Z);
  const  a = Array(1 + s); a[0] = ({ s, J, Z, E, m, free: () => free(m) });
  for (let i = 0; i < s; i++) a[i + 1] = ({ p: m + (i * Z), get v() { return E[m / Z + i]; } });
  return a;
};
const H = (t, s, d) => f => { // type size default
  const [ m,  ...a ] = heap(t, s);
  const v = f(...a.map(x => x.p)); if (!v) { m.free(); return d; } // !v :: C* return value not 0
  const r =      a.map(x => x.v) ;           m.free(); return r;
};

There's some new things i found that are kind of difficult to understand just at first glance and i would recommend renaming a lot of the variables that are like i or H or F or E or something like one letter variables to be more descriptive.

Just as an example, i am not sure what the bottom lines do exactly. like what values do s, J, Z, E, m represent?

I am responsible for this bit of the code. Feel free to revise it as you see fit. I would prefer you keep the neat 2d alignment in place though.

inside heap:

  • J: JS type
  • E: emscripten type
  • Z: byte siZe of the array element type (..., UInt81; ... UInt324, ...)
  • s: array element size
  • i: array element index

inside H (which allocates heap space; calls f on it and frees the memory afterwards):

  • f: function to call on the temporary heap allocation

Also thanks a ton for the explanation here! :) this helps a lot

quincarter avatar May 24 '22 15:05 quincarter

Are you using pdfium.wasm in this project? https://github.com/cetinsert/cetinsert

Yes. Exactly the same artefact from this repository; custom JS bindings. When I detect a leak, I nuke it and rebind ahead of time. That way I can keep rendering ad infinitum.


Some examples where PDFium would render better than PDF.js: https://github.com/pdf-ist/WebPDF/discussions Please note that PDF.js might have caught up by now.

I have not updated WebPDF.pro in a while because PDF.js was messing with how they want to do forms and I thought I can afford to give them a year to sort things out before I waste any time on this. PDF.js folks were overly Firefox-centric and were not open to suggestions that would make their code work better with Shadow DOM.

CetinSert avatar May 24 '22 16:05 CetinSert

I have a similar solution in a web component in production on our app but an earlier version.

quincarter avatar May 24 '22 16:05 quincarter

The difference in pdf.js and pdfium is the inherent way that WASM is used by the browser. Low end devices still struggle with pdf.js rendering. I have a side by side comparison in a video where PDF.js takes 12 seconds to render page 1 and pdfium takes 20ms. 6x CPU throttling and slow 3G Network.

WASM is still pre optimized out of the box where PDF.js needs to go through typical browser optimization as it compiles And finds paths of least resistance.

quincarter avatar May 24 '22 16:05 quincarter

The difference in pdf.js and pdfium is the inherent way that WASM is used by the browser. Low end devices still struggle with pdf.js rendering. I have a side by side comparison in a video where PDF.js takes 12 seconds to render page 1 and pdfium takes 20ms. 6x CPU throttling and slow 3G Network.

WASM is still pre optimized out of the box where PDF.js needs to go through typical browser optimization as it compiles And finds paths of least resistance.

I know. Also just updated https://github.com/paulocoutinhox/pdfium-lib/issues/71#issuecomment-1136147487 .

CetinSert avatar May 24 '22 16:05 CetinSert

As for unfriendliness, this guy tops it all: https://sert.works/posts/mupdf/ though.

CetinSert avatar May 24 '22 16:05 CetinSert

As for unfriendliness, this guy tops it all: https://sert.works/posts/mupdf/ though.

Omg. This is atrocious! This should be in some kind of hall of Fame. I appreciate anyone willing to use and contribute back to the open source community. Glad you have that documented.

quincarter avatar May 24 '22 16:05 quincarter

Hi,

I will refactor all and remake a lot of things in this project. And when i start it, i will change this too.

Im only finishing another open source project that is a requirement for me.

Stay tuned here and feel free to donate to help project development.

Thanks.

paulocoutinhox avatar Oct 21 '22 01:10 paulocoutinhox

Hi,

All problems was fixed in new release.

You can check here: https://github.com/paulocoutinhox/pdfium-lib/releases/tag/5407

The demo is here: https://pdfviewer.github.io/

Can you add these new code to web viewer?

Feel free to donate to help project development.

Thanks.

paulocoutinhox avatar Nov 08 '22 18:11 paulocoutinhox

Everything is ok? Left something?

paulocoutinhox avatar Sep 14 '23 15:09 paulocoutinhox

Everything is ok? Left something?

Hi there! I have since left the project I was working on that implemented this solution. I have some of this stashed locally but it's been a while! I can try to pick it back up when I free up.

quincarter avatar Sep 14 '23 22:09 quincarter

Ok, so i will close. Feel free if you need something. Thanks.

paulocoutinhox avatar Sep 14 '23 23:09 paulocoutinhox

Hi,

I have updated everything.

You can now use the latest version.

Thanks.

paulocoutinhox avatar Dec 15 '23 07:12 paulocoutinhox