
Error with `pdf-text-reader` and NODE_ENV=production

Open randompixel opened this issue 1 year ago • 4 comments

What version of Bun is running?

1.0.35+940448d6b

What platform is your computer?

Darwin 23.2.0 arm64 arm

What steps can reproduce the bug?

pdf-text-reader (which wraps Mozilla's pdf.js, where the error actually occurs) fails whenever I use it outside dev mode.

  • Developed in bun run dev = works fine
  • Ran bun build src/index.ts --outdir ./out --target bun --sourcemap=external and then bun run out/index.js = fail

So I changed my oven/bun Docker build to call bun run dev directly instead of building. That also failed once it went to production. After a lot more trial and error:

  • Commenting out the ENV NODE_ENV=production line in the Dockerfile makes it work.

So it appears that bun build also sets NODE_ENV=production, and that require can't be resolved when it is set.

Minimal reproduction:

import { readPdfText } from "pdf-text-reader";

/** Define the return type for a FileParser class' parse function */
type ParsedFile = {
	fileName: string;
	fileSize: number;
	fileType: string;
	body: string;
	footers?: string;
	headers?: string;
};

/** Define the signature for a FileParser class */
type ParseFunction = (file: File) => Promise<ParsedFile>;

/** Define the type of classes that the factory can return */
interface FileParser {
	parse: ParseFunction;
}

class PdfParser implements FileParser {
	public async parse(file: File): Promise<ParsedFile> {
		const blob = file;
		const stream = await blob.arrayBuffer();
		const readText = await readPdfText({ data: stream, worker: null });

		return {
			fileName: blob.name,
			fileSize: blob.size,
			fileType: blob.type,
			body: readText,
		};
	}
}

Bun.serve({
	port: 4000,
	async fetch(req) {
		const url = new URL(req.url);

		// parse form data at /parse
		if (url.pathname === '/parse') {
			const formdata = await req.formData();
			const file = formdata.get('file');
			console.log(file);
			// formdata.get() can return a string or null, so narrow to File first
			if (!(file instanceof File)) {
				return new Response("Expected a file upload", { status: 400 });
			}
			const parser = new PdfParser();
			const body = await parser.parse(file);
			return new Response(body.body);
		}

		return new Response("Not Found", { status: 404 });
	},
});

What is the expected behavior?

POST a file through form-data and it parses the text out of the PDF when running in NODE_ENV=production
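
For anyone who wants to exercise the /parse route from the reproduction, a client could be sketched roughly like this. The helper names buildPdfUpload and uploadPdf are hypothetical; the port, path, and "file" field name are taken from the server code above.

```typescript
// Hypothetical helper: wrap raw PDF bytes in multipart form data under the
// "file" field, which is the field name the reproduction's server reads.
function buildPdfUpload(bytes: ArrayBuffer, fileName: string): FormData {
  const form = new FormData();
  const blob = new Blob([bytes], { type: "application/pdf" });
  form.append("file", blob, fileName);
  return form;
}

// Hypothetical helper: POST the form to the server from the reproduction
// (port 4000, path /parse) and return the response body as text.
async function uploadPdf(bytes: ArrayBuffer, fileName: string): Promise<string> {
  const response = await fetch("http://localhost:4000/parse", {
    method: "POST",
    body: buildPdfUpload(bytes, fileName),
  });
  return response.text();
}
```

With the server running, something like uploadPdf(await Bun.file("dummy.pdf").arrayBuffer(), "dummy.pdf") should return the extracted text in the working case and surface the error in the failing one.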

What do you see instead?

Setting up fake worker failed: "Can't find variable: require"

Additional information

No response

randompixel avatar Mar 26 '24 16:03 randompixel

Running env NODE_ENV=production bun a.js works fine for me, but here is a minimal reproduction of the issue with bun build --target bun:

// a.js
if(typeof require === "function") {
  const mymodule = eval("require")("./b.js");
  mymodule.main();
}

// b.js
module.exports.main = function() {
  console.log("hello from b.js");
}

bun build a.js --outdir ./out --target bun
bun ./out/a.js
# should log "hello from b.js", instead errors

pdfjs-dist seems to hide the fake worker import behind eval("require"), perhaps so that the worker isn't pulled in when the library is bundled for the browser. That said, the browser path looks designed to run without a bundler at all, since it injects a <script> element to load the fake worker.
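
The failure mode here, where a bare require reference is evaluated at runtime and throws because no such variable exists, can be illustrated with a defensive lookup. This is only a sketch of the general pattern, not what pdfjs-dist does, and lookupRequire is a hypothetical name.

```typescript
type RequireLike = (id: string) => unknown;

// Hypothetical helper: look up `require` at runtime without throwing.
// `typeof` on an undeclared identifier yields "undefined" instead of raising
// a ReferenceError, so the check is safe even where no `require` exists.
// Keeping the lookup inside eval hides it from a bundler's static analysis,
// as eval("require") does.
function lookupRequire(): RequireLike | null {
  try {
    const found: unknown = eval('typeof require === "function" ? require : null');
    return typeof found === "function" ? (found as RequireLike) : null;
  } catch {
    // eval itself may be blocked (for example by a Content Security Policy).
    return null;
  }
}

const maybeRequire = lookupRequire();
console.log(maybeRequire ? "require is available" : "require is not available");
```

A caller can then branch on the null case instead of crashing with "Can't find variable: require".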


While trying to make a smaller reproduction, I got a different error (on Darwin 23.2.0 arm64 arm):

// a.js
import { readPdfText } from "pdf-text-reader";

const file = await Bun.file("dummy.pdf").arrayBuffer();
const readText = await readPdfText({ data: file, worker: null });
console.log(readText);

$> bun build a.js --outdir ./out --target bun --sourcemap=external
fish: Job 1, 'bun build a.js --outdir ./out -…' terminated by signal SIGBUS (Misaligned address error)
Exited with code [SIGBUS]

Without --sourcemap=external, the error doesn't show up.

This seems to be caused by strings.wtf8ByteSequenceLengthWithInvalid(remaining[0]); returning a number larger than remaining.len in sourcemap.zig:

https://github.com/oven-sh/bun/blob/d113803777b14f317188dbfa6bd4e49c54dce9fb/src/sourcemap/sourcemap.zig#L915
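
The out-of-bounds condition described above can be illustrated with a small sketch: derive the sequence length from the lead byte, then clamp it to the bytes that actually remain. This is TypeScript written for illustration, not Bun's Zig code. Both function names are hypothetical, and whether clamping is the right fix for sourcemap.zig is not established here.

```typescript
// Hypothetical helper: number of bytes a UTF-8 sequence claims to occupy,
// judged only from its lead byte. Invalid lead bytes count as one byte.
function utf8SequenceLength(lead: number): number {
  if (lead < 0x80) return 1;               // 0xxxxxxx: ASCII
  if ((lead & 0xe0) === 0xc0) return 2;    // 110xxxxx
  if ((lead & 0xf0) === 0xe0) return 3;    // 1110xxxx
  if ((lead & 0xf8) === 0xf0) return 4;    // 11110xxx
  return 1;                                // continuation or invalid byte
}

// Hypothetical helper: same as above, but never larger than the number of
// bytes left in the buffer, so slicing by the result cannot overrun it.
function clampedSequenceLength(remaining: Uint8Array): number {
  const lead = remaining[0];
  if (lead === undefined) return 0; // empty input
  return Math.min(utf8SequenceLength(lead), remaining.length);
}
```

For a truncated input such as the two bytes 0xE2 0x82, the lead byte claims three bytes, but the clamped length is two.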

pfgithub avatar Mar 29 '24 02:03 pfgithub

After upgrading to the v5 branch for pdf-text-reader, bun 1.1.8 won't even start

dyld[68449]: missing symbol called
error: script "dev" was terminated by signal SIGABRT (Abort)
[1]    68448 abort      bun dev

Both the 5.0.1 and 5.1.0 releases of pdf-text-reader fail with the above error: https://github.com/electrovir/pdf-text-reader/releases

Unfortunately, both of those releases contain the fix for a security issue in pdf.js that is being reported.

randompixel avatar May 21 '24 10:05 randompixel

@randompixel pdf-text-reader seems to be using either the V8 C++ API or libuv. Please follow along in #4290

@190n is actively working on supporting V8 C++ APIs in Bun

Jarred-Sumner avatar Aug 08 '24 19:08 Jarred-Sumner

Is there any other way to read PDF files in Bun?

anuragk15 avatar Aug 26 '24 03:08 anuragk15

+1

snowfluke avatar Feb 27 '25 08:02 snowfluke