
[WasmFS+OPFS] Unspecified file size limit

Open DenizUgur opened this issue 1 year ago • 19 comments

I have a wasm file compiled with -lopfs.js -sWASMFS -sFORCE_FILESYSTEM and I'm trying to perform I/O operations on a large (4GB) file. The process works properly for smaller files (~150MB, probably anything smaller than 2^32-1 bytes). However, reading this large file from WASM fails. When I read it with readFile from JS, the resulting ArrayBuffer doesn't have the correct size. I don't expect to get the whole file (because of the ArrayBuffer limit), but what I get is far smaller than my browser's (Chrome's) limit.

In essence, I'm trying to pass a File provided by the user to OPFS and use it with my executable. WORKERFS works very nicely if I create the node with WORKERFS.createNode. The problem with that approach, however, is reading the data back if it's larger than 2^32-1 bytes. I'm considering overriding WORKERFS.stream_ops.write/FS.write to write to OPFS and creating a File object from that.

I've verified that the file that has been written to OPFS was correct by downloading it again.

A wild guess but maybe seeking more than 2^32-1 bytes at once is creating this problem?

EDIT: I've confirmed this is the problem. I had an MP4 file intentionally made larger than it needs to be, by placing the first sample at an offset of ~4GB. Then I tried a normal 4GB file downloaded from YouTube and it worked properly. So it's not a size issue but rather a seeking issue.

What's your opinion on this process? Is there really a file size limit with OPFS or WasmFS? What about monkey-patching WORKERFS?

DenizUgur avatar Feb 13 '24 23:02 DenizUgur

In WasmFS, the readFile JS API first reads the entire file into Wasm memory, then copies the contents out into JS. For very large files, it will not be possible to allocate that much Wasm memory, so readFile will fail.

You could indeed hack your own version of an OPFS file system by modifying stream_ops and WORKERFS, but I wouldn't recommend it.

I recommend using read instead of readFile to read the file contents in chunks rather than all at once.
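A minimal sketch of the chunked approach, assuming plain POSIX file I/O from C. The function name `checksum_in_chunks` and the byte-sum "work" are hypothetical stand-ins for whatever the application does with each chunk; the point is only that memory use is bounded by the chunk size rather than the file size.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: sums every byte of the file at `path`, reading at most
 * `chunk_size` bytes at a time. Returns 0 on success, -1 on any error. */
int checksum_in_chunks(const char *path, size_t chunk_size,
                       uint64_t *total_bytes, uint64_t *sum)
{
  assert(chunk_size > 0);

  int fd = open(path, O_RDONLY);
  if (fd < 0)
    return -1;

  uint8_t *chunk = malloc(chunk_size);
  if (!chunk) {
    close(fd);
    return -1;
  }

  *total_bytes = 0;
  *sum = 0;
  int rc = 0;
  for (;;) {
    ssize_t n = read(fd, chunk, chunk_size);
    if (n == 0)
      break; /* end of file */
    if (n < 0) {
      if (errno == EINTR)
        continue; /* interrupted before any data was read; retry */
      rc = -1;
      break;
    }
    for (ssize_t i = 0; i < n; i++)
      *sum += chunk[i];
    *total_bytes += (uint64_t)n; /* 64-bit counter, so it can exceed 4GB */
  }

  free(chunk);
  close(fd);
  return rc;
}
```

The running total is kept in a 64-bit counter on purpose, so the bookkeeping itself does not wrap once more than 4GB has been read.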

tlively avatar Feb 14 '24 01:02 tlively

Thank you for your response. Indeed, readFile wouldn't be the correct method here. Regardless, the issue was related to seeking anyway; please see my last edit. This is very much an edge case for me, so I won't pursue it any further, but I wanted to report it because the behaviour differs from WORKERFS and I thought it might be unexpected.

DenizUgur avatar Feb 14 '24 01:02 DenizUgur

Hmm, seeking should use 64-bit values everywhere, so it should be able to handle very large files. What API were you calling where you saw problems with seeking?

tlively avatar Feb 14 '24 01:02 tlively

It's an educated guess. The application doesn't throw an exception and works properly with smaller offsets, so it's highly probable that it's related to seeking or something similar (I'm not sure what else it would be). I'm not using any JS API. Yes, I've used -sFORCE_FILESYSTEM, but I don't think I need it since the browser APIs for OPFS are enough.

I'm calling POSIX APIs for file I/O, nothing emscripten related I guess.

DenizUgur avatar Feb 14 '24 01:02 DenizUgur

Ok, sounds good. I'll close this issue, but feel free to reopen if it turns out that there is an Emscripten bug we can fix.

tlively avatar Feb 14 '24 02:02 tlively

Well, I'm not going to follow this issue because it's a very edge case. However, the problem with WasmFS is still present. If seeking with a value greater than 2^32-1 is something emscripten needs to handle then it needs to be checked. Maybe add a test case for this.

DenizUgur avatar Feb 14 '24 05:02 DenizUgur

I agree we should add a test case for this if there isn't one already. @tlively do we have a test for seeking in large files beyond 2^32? If not, let's add one.
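Such a test could look roughly like the sketch below. It is not taken from the Emscripten test suite: the function name is hypothetical, and it assumes the backend under test accepts a write far past the current end of the file (a sparse file on a native filesystem). It seeks past 2^32 with `lseek`, writes a marker, and checks that both the reported size and the data read back match.

```c
#define _POSIX_C_SOURCE 200809L
#define _FILE_OFFSET_BITS 64 /* 64-bit off_t on 32-bit glibc; a no-op where off_t is already 64-bit */
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

static_assert(sizeof(off_t) == 8, "off_t must be 64-bit to address offsets past 2^32");

/* Hypothetical test helper: returns 0 if a marker written just past the 4GB
 * boundary can be read back from the same offset, -1 otherwise. */
int seek_past_4gb_roundtrip(const char *path)
{
  const off_t target = ((off_t)1 << 32) + 123;
  const char marker[] = "past-4GB";
  char readback[sizeof marker] = {0};
  struct stat st;

  int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
  if (fd < 0)
    return -1;

  /* Each step must succeed for the next one to run. */
  int ok = lseek(fd, target, SEEK_SET) == target
        && write(fd, marker, sizeof marker) == (ssize_t)sizeof marker
        && fstat(fd, &st) == 0
        && st.st_size == target + (off_t)sizeof marker
        && lseek(fd, target, SEEK_SET) == target
        && read(fd, readback, sizeof readback) == (ssize_t)sizeof readback
        && memcmp(marker, readback, sizeof marker) == 0;

  close(fd);
  unlink(path);
  return ok ? 0 : -1;
}
```

A truncated offset anywhere in the stack would show up here either as a wrong return value from `lseek`, a wrong `st_size`, or a marker that reads back as zeros.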

sbc100 avatar Feb 14 '24 19:02 sbc100

Have you guys had a chance to test this?

DenizUgur avatar Apr 23 '24 20:04 DenizUgur

Alright, I made my own tests. I couldn't test if it was related to seeking because fseek only accepts long anyway. But writing past 2GB is not possible. @sbc100 I hope I'm not missing something here, I've theorized that this could be due to wasm being compiled to 32-bit but I'm not sure how that would affect it. I hope this helps.
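On the `fseek` point: on wasm32 `long` is 32 bits wide, so `fseek` cannot express an offset of 2^31 or more in a single call. POSIX provides `fseeko`/`ftello`, which take and return `off_t` instead. A minimal sketch, where `seek_to` is a hypothetical wrapper name:

```c
#define _POSIX_C_SOURCE 200809L
#define _FILE_OFFSET_BITS 64 /* 64-bit off_t on 32-bit glibc; a no-op where off_t is already 64-bit */
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>

static_assert(sizeof(off_t) == 8, "off_t must be 64-bit to address offsets past 2^32");

/* Hypothetical wrapper: moves the stream to an absolute 64-bit offset and
 * returns the position reported afterwards, or -1 on failure. */
off_t seek_to(FILE *f, off_t offset)
{
  if (fseeko(f, offset, SEEK_SET) != 0)
    return -1;
  return ftello(f);
}
```

Comparing the value returned by `ftello` against the requested offset is a cheap way to check whether the position survived the trip through the file system layer.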

I've used the following C code to test it.

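#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Try to write 4GB to OPFS in 1MB blocks and report the first block that fails.
void em_fs_4gb_test(void)
{
  FILE *f = fopen("/opfs/test.mp4", "wb");
  if (!f)
    return;

  printf("Writing 4GB file to /opfs/test.mp4\n");
  uint32_t block_size = 1024 * 1024; // 1MB per block
  uint32_t block_count = 4 * 1024;   // 4096 blocks, 4GB in total
  uint8_t *block = malloc(block_size);
  if (!block)
  {
    fclose(f);
    return;
  }
  memset(block, 1, block_size);
  for (uint32_t i = 0; i < block_count; i++)
  {
    // fwrite returns size_t: the number of items (here, bytes) written
    size_t written = fwrite(block, 1, block_size, f);
    if (written != block_size)
    {
      printf("Error writing block %u, trying to write %u bytes, wrote %zu\n", i, block_size, written);
      break;
    }
  }

  free(block);
  fclose(f);
}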

The link flags I've used

-O2 -sMODULARIZE=1 -sEXPORT_ES6=1 -sEXPORT_NAME=foo -sEXPORTED_FUNCTIONS=_free,_malloc,_main -sENVIRONMENT=web,worker -sEXPORTED_RUNTIME_METHODS=FS,OPFS,getValue,setValue,UTF8ToString,lengthBytesUTF8,stringToUTF8,cwrap,addFunction,PThread -sPTHREAD_POOL_SIZE=1 -sPTHREAD_POOL_SIZE_STRICT=0 -lopfs.js -sWASMFS -sFORCE_FILESYSTEM

And the console printed

Writing 4GB file to /opfs/test.mp4
Error writing block 2048, trying to write 1048576 bytes, wrote 0

The related portion for that error in the profiler

[image: profiler screenshot]

DenizUgur avatar Apr 25 '24 19:04 DenizUgur

I'd like to add my experience with this issue. I've loaded an 8GB file into OPFS and mounted it with WasmFS. When reading over the file with ifstream in 2^24 slices, it fails at offset 2^32. I checked the ifstream and the EOF bit was set. Has there been any progress on looking into this issue?

brianhvo02 avatar May 18 '24 00:05 brianhvo02

I investigated a bit, and it looks like when reading a file larger than 4GB, only the first 2GB and the last 2GB can be read. The offset remains the same though.

brianhvo02 avatar May 25 '24 21:05 brianhvo02

@tlively Right now, the _wasmfs_opfs_read_blob and _wasmfs_opfs_get_size_blob functions in library_wasmfs_opfs.js are using uint32_t for file sizes and offsets, which limits any operations to files less than 4GB. Using off_t would mean wrapping the offset and size in Number and BigInt constructors respectively. I'm thinking this could throw a wrench into things for people not using -sWASM_BIGINT, so what's your take on this?
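To make the 4GB limit concrete, here is a small sketch of what happens to a 64-bit offset when it is squeezed through a `uint32_t` parameter. The function name is hypothetical; the conversion itself is standard C, which reduces the value modulo 2^32.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for passing a 64-bit offset through a uint32_t
 * parameter: only the low 32 bits survive. */
uint32_t as_u32_param(int64_t offset)
{
  return (uint32_t)offset;
}
```

With this, an offset of 2^32 arrives as 0 and 2^32 + 4096 arrives as 4096, so a read that was meant for the region past 4GB silently lands near the start of the file instead of failing.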

brianhvo02 avatar May 26 '24 19:05 brianhvo02

Sounds like those functions need to be updated to use i64s and bigints. IIUC, Emscripten has utilities to make that relatively painless, but I don't know exactly what the state of the art is there.

tlively avatar May 28 '24 19:05 tlively

I'm running into this limit when working with large audio files (>2GB).

Looking at opfs_backend.h, it already uses off_t to set the size of the file (regardless of WASM_BIGINT), so changing the read and write functions to use off_t and __i53abi may work. However I've noticed some odd behaviour with __i53abi that causes arguments following the i53 argument to be invalid. Reordering the arguments to put the i53 argument last seems to work.

goldwaving avatar Sep 09 '24 19:09 goldwaving

> I'm running into this limit when working with large audio files (>2GB).
>
> Looking at opfs_backend.h, it already uses off_t to set the size of the file (regardless of WASM_BIGINT), so changing the read and write functions to use off_t and __i53abi may work. However I've noticed some odd behaviour with __i53abi that causes arguments following the i53 argument to be invalid. Reordering the arguments to put the i53 argument last seems to work.

That should not be necessary. Please file a bug if you can reproduce it.

sbc100 avatar Sep 09 '24 20:09 sbc100

Hi @sbc100, it's possible to reproduce this. See my latest comment above.

DenizUgur avatar Sep 09 '24 20:09 DenizUgur

> Hi @sbc100, it's possible to reproduce this. See my latest comment above.

Sorry, I was specifically referring to the __i53abi issue mentioned in https://github.com/emscripten-core/emscripten/issues/21335#issuecomment-2338966756. Is that what you were referring to too?

sbc100 avatar Sep 09 '24 20:09 sbc100

I made a copy of the OPFS related files to create a test version to support >4GB files. Here is a piece of the code for writing:

int _gwwasmfs_opfs_write_access(int access_id,
                                const uint8_t* buf,
                                uint32_t len,
                                off_t pos);

  _gwwasmfs_opfs_write_access__i53abi: true,
  _gwwasmfs_opfs_write_access__deps: ['$gwwasmfsOPFSAccessHandles'],
  _gwwasmfs_opfs_write_access: {{{ asyncIf(!PTHREADS) }}} function(accessID, bufPtr, len, pos) {
    let accessHandle = gwwasmfsOPFSAccessHandles.get(accessID);
    let data = HEAPU8.subarray(bufPtr, bufPtr + len);
    console.log( "Write2: ", pos  );

This fails because when pos reaches 2^31, it goes negative, despite it being positive on the C++ side. It seems like __i53abi had no effect. I had to explicitly add a __sig: 'iipij' to avoid that.

For the reading code:

int _gwwasmfs_opfs_read_blob(em_proxying_ctx* ctx,
                             int blob_id,
                             uint8_t* buf,
                             uint32_t len,
                             off_t pos,
                             int32_t* nread);

  _gwwasmfs_opfs_read_blob__i53abi: true,
  _gwwasmfs_opfs_read_blob__deps: ['$gwwasmfsOPFSBlobs', '$gwwasmfsOPFSProxyFinish'],
  _gwwasmfs_opfs_read_blob: async function(ctx, blobID, bufPtr, len, pos, nreadPtr) {

In this case, nreadPtr is always zero and you get a memory corruption error (and a read error, because the number of bytes read cannot be passed back). Again, it seems like __i53abi had no effect and an int32_t is being used for pos. An explicit __sig was needed.
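One possible reading of both symptoms, not confirmed anywhere in this thread: assume that without WASM_BIGINT a 64-bit argument reaches a JS import as two 32-bit halves, low half first, and that nothing recombined them here. The sketch below (hypothetical names, plain C) shows what those halves look like.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a 64-bit value passed as two signed 32-bit halves. */
typedef struct {
  int32_t lo; /* low 32 bits, seen as a signed integer */
  int32_t hi; /* high 32 bits */
} split_i64;

split_i64 split_offset(int64_t pos)
{
  uint64_t bits = (uint64_t)pos;
  split_i64 s;
  /* Converting a value >= 2^31 to int32_t is implementation-defined in C;
   * GCC and Clang wrap modulo 2^32, which is the behaviour assumed here. */
  s.lo = (int32_t)(uint32_t)(bits & 0xFFFFFFFFu);
  s.hi = (int32_t)(uint32_t)(bits >> 32);
  return s;
}

/* Recombines the halves into the original 64-bit value. */
int64_t join_offset(split_i64 s)
{
  return (int64_t)(((uint64_t)(uint32_t)s.hi << 32) | (uint32_t)s.lo);
}
```

Under that assumption, a function declared as `(accessID, bufPtr, len, pos)` would receive the low half as `pos`, which turns negative at exactly 2^31. In the read function, the high half (zero for any offset below 4GB) would occupy the slot where `nreadPtr` is expected.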

The _wasmfs_opfs_set_size_file function does not have an explicit __sig, so I don't know how (if?) that is working without it.

How are signatures determined when they are not given explicitly? I've had strange problems where I had to specify it for some functions, but not for others.

The end result is that using __i53abi and explicit __sig for the read and write functions in OPFS allows it to support >4GB regardless of WASM_BIGINT (_wasmfs_opfs_get_size_blob also has to be changed to return the size like wasmfs_opfs_get_size_file does).

Maybe the documentation here could mention the p type, the __i53abi decorator with the j type, and when an explicit __sig is required.

goldwaving avatar Sep 10 '24 13:09 goldwaving

> The _wasmfs_opfs_set_size_file function does not have an explicit __sig, so I don't know how (if?) that is working without it.

We automatically generate __sig attributes for all built-in JS functions. See https://github.com/emscripten-core/emscripten/blob/3073806d3fde4320c81cb2dc7cf0e00378f52df1/src/library_sigs.js#L439 and https://github.com/emscripten-core/emscripten/blob/main/src/library_sigs.js#L1

I agree we should improve the documentation here. I believe __i53abi doesn't do anything unless a signature with a j in it is found. We could probably add a helpful error message when it's used without a j in the sig, since that doesn't do anything.
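For context on the "i53" in the name: a JS Number is an IEEE 754 double, which represents every integer up to 2^53 - 1 exactly, far beyond the 4GB boundary discussed in this issue. The sketch below uses hypothetical helper names and only illustrates that numeric range; it says nothing about how Emscripten implements the decorator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 2^53 - 1, the same value as JavaScript's Number.MAX_SAFE_INTEGER. */
#define MAX_SAFE_INTEGER 9007199254740991LL

/* True if `v` can be carried by a double without losing precision. */
bool fits_in_i53(int64_t v)
{
  return v >= -MAX_SAFE_INTEGER && v <= MAX_SAFE_INTEGER;
}

/* Round-trips a 64-bit integer through a double, the representation behind
 * a JS Number. */
int64_t through_double(int64_t v)
{
  return (int64_t)(double)v;
}
```

An 8GB offset (2^33) fits comfortably and survives the round trip unchanged, whereas 2^53 + 1 does not: it comes back as 2^53.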

sbc100 avatar Sep 10 '24 16:09 sbc100