Loading Files from URL or IndexedDB without memory pressure
Amazing, amazing lib!
I noticed that the demuxer takes a File as input. In my use case, I work with large video files, and I load assets either from URLs or from files already cached in my IndexedDB. My issue is that, to comply with the current API, in both cases I need to create a Blob, but that loads the whole file into memory. I did my research on the topic and am sharing my findings below. I am no expert with Emscripten, but I'd gladly contribute.
Fetching from URL
Currently, I need to do something like this:
const blob = await fetch(url).then((r) => r.blob());
const file = new File([blob], filename, { type: mimeType });
await demuxer.load(file);
The issue is that Blob creation is memory-first:
If the in-memory space for blobs is getting full, or a new blob is too large to be in-memory, then the blob system uses the disk. This can either be paging old blobs to disk, or saving the new too-large blob straight to disk.
If memory gets overwhelmed, the browser evicts blob data to disk, but it would be much nicer to have a disk-first solution. And by default, fetch is memory-first as well.
Loading to disk first seems to be possible with Emscripten.
In the API, a loadFromUrl method could do this.
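Purely to illustrate the shape I have in mind (loadFromUrl does not exist in the current API; this is a hypothetical sketch):
// Hypothetical API sketch, not part of the current web-demuxer API:
// the library would stream the URL straight to disk-backed storage itself,
// so no full Blob/File ever has to be materialized in memory.
await demuxer.loadFromUrl(url);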
Fetching from IndexedDB
Currently, the same flow applies if I want to get my files from IndexedDB: create a Blob, then a File to demux.
It looks like it is also possible to mount an IndexedDB-backed filesystem, although I am not sure what an API for that should look like.
I believe that any application working with large video files would benefit from supporting these use cases. Please let me know your thoughts!
Recently, I’ve been thinking about how to support loading from a URL. One possible approach I’ve considered is to hack the WORKERFS.stream_ops.read. Since it reads from a stream, we might be able to rewrite it using the fetch API's ReadableStream. Perhaps you could give it a try.
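To sketch the primitive such a rewritten read would rely on: translating a (position, length) read into an HTTP Range request, roughly like this (not actual implementation code, and the real FS read path is synchronous, so the details would differ):
// Sketch only: fetch an arbitrary byte window of a remote file.
// A rewritten stream_ops.read could be built on this primitive.
async function readRange(url, position, length) {
  const res = await fetch(url, {
    headers: { Range: `bytes=${position}-${position + length - 1}` },
  });
  if (res.status !== 206) {
    throw new Error(`Range request not honored: ${res.status}`);
  }
  return new Uint8Array(await res.arrayBuffer());
}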
Regarding fetching from IndexedDB: when getting a blob, as long as you avoid methods like blob.arrayBuffer(), the entire blob won't be loaded into memory. And in WORKERFS, the blob is split into smaller chunks for reading. Therefore, you can directly use the blob to create a File for demuxing.
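A minimal sketch of that flow, assuming an already-open IDBDatabase db and a made-up "videos" object store:
// Sketch: pull a cached Blob out of IndexedDB and hand it to the demuxer.
// No blob.arrayBuffer() call, so the payload stays disk-backed and WORKERFS
// reads it in small slices.
function getCachedBlob(db, key) {
  return new Promise((resolve, reject) => {
    const request = db.transaction('videos').objectStore('videos').get(key);
    request.onsuccess = () => resolve(request.result); // the stored Blob
    request.onerror = () => reject(request.error);
  });
}

const blob = await getCachedBlob(db, 'my-large-video');
const file = new File([blob], 'my-large-video.mp4', { type: 'video/mp4' });
await demuxer.load(file);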
considered is to hack the WORKERFS.stream_ops.read.
Sounds interesting. I couldn't piece together, though, where the entry point for that would be.
What I tried is having WebDemuxer.load take either a File or a URL; the URL would be fetched when load is called on the demuxer and written to the filesystem chunk-by-chunk. Maybe not the most elegant, but it looks straightforward so far and has a relatively low memory footprint. I haven't got to a POC yet, but I think it could work. What do you think?
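Roughly the streaming half of what I mean, as a sketch; writeChunk is just a placeholder for whatever chunked write the filesystem layer would need to support:
// Sketch: consume the response body chunk by chunk instead of buffering it all.
const response = await fetch(url);
const reader = response.body.getReader();
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  writeChunk(value); // hypothetical chunked write into the WASM filesystem
}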
written to the filesystem chunk-by-chunk
Actually, the core concept is the same as I mentioned :). However, WORKERFS does not support writing files chunk by chunk; it only exposes a mount method. That’s why I mentioned that we need to rewrite some methods in WORKERFS.
WORKERFS does not support writing files chunk by chunk
I had to learn that the hard way.
Also, EMSCRIPTEN_FETCH_PERSIST_FILE sounds great for getting the file into IndexedDB straight away, but I came to the same conclusion as some folks before me: it is only usable for caching, and I can't directly access that file.
What I will try next is to utilize the Origin Private File System, but honestly, given these limitations, I no longer feel strongly about having this feature included as a built-in part of the library.
I will try it later. If I make some progress, I will let you know.
@bartadaniel I just published version 2.3.0, and it now supports loading from a file URL. The only usage change is passing the file URL to the load method.
This was accomplished by rewriting WORKERFS.stream_ops.read. You can see the detailed code here: https://github.com/ForeverSc/web-demuxer/blob/main/lib/web-demuxer/post.js#L66-L81
@ForeverSc this is really smart!
For the videos I tried, even though partial fetching was possible, the responses did not include the Content-Range header, so getFileSize failed to determine the file size. Changing getFileSize to get the size with a HEAD request worked like a charm:
function getFileSize(url) {
  // Blocking HEAD request: only the headers are needed, no body is downloaded.
  const xhr = new XMLHttpRequest();
  xhr.open('HEAD', url, false);
  xhr.send();
  if (xhr.status !== 200) {
    throw new Error(`getFileSize request failed: ${url}`);
  }
  // The size comes from the Content-Length header of the HEAD response.
  return parseInt(xhr.getResponseHeader('Content-Length'), 10);
}
This seems to be the primary usage of this method (link):
This method can be used in cases where a URL might produce a large download, for example, a HEAD request can read the Content-Length header to check the file size before downloading the file with a GET.
Another topic: the file will be fetched as many times as you call the API. If I call getMediaInfo, getVideoStream, then getAudioStream, the file will be fetched three times. After the first fetch it will likely be served from the browser cache, but it is still not optimal because the file is rewritten to disk each time. Perhaps we should keep and reuse the file for subsequent calls?
At first, I used HEAD to get the file size, but it is much slower than GET. It seems necessary to add some fallback logic: try GET first, and when that is not supported, fall back to HEAD.
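Something along these lines, as a rough sketch of the fallback (not the shipped code):
// Sketch: try a ranged GET first and read the total size from Content-Range;
// fall back to HEAD + Content-Length if the server doesn't support ranges.
async function getFileSizeWithFallback(url) {
  const ranged = await fetch(url, { headers: { Range: 'bytes=0-0' } });
  const contentRange = ranged.headers.get('Content-Range');
  if (ranged.status === 206 && contentRange) {
    // Content-Range looks like "bytes 0-0/12345"; the total is after the "/".
    return parseInt(contentRange.split('/')[1], 10);
  }
  const head = await fetch(url, { method: 'HEAD' });
  const contentLength = head.headers.get('Content-Length');
  if (!contentLength) {
    throw new Error(`Could not determine file size for: ${url}`);
  }
  return parseInt(contentLength, 10);
}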
Regarding multiple fetches, it’s problematic when storing files in the browser. I prefer using OPFS for storing and reusing files. However, I don’t think the storage logic should be mixed into the package; instead, it might be better to add a handler function parameter to the load method, such as:
load(url, function read(arraybuffer) {
  // cache in OPFS / IndexedDB / Memory
});
I prefer using OPFS for storing and reusing files. However, I don’t think the storage logic should be mixed into the package
I 100% agree with both; that's why I reconsidered my own stance on supporting URLs. If I need to take care of storage and caching, I would rather stream the file to OPFS separately and use the already existing load(File) method on the lib.
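For reference, this is roughly the flow I mean, sketched under the assumption that the browser supports OPFS with createWritable; videoUrl and the file name are placeholders:
// Sketch: stream a remote file into OPFS, then demux the resulting File.
// The payload lands on disk, and load(File) works exactly as it does today.
async function cacheToOpfs(url, name) {
  const root = await navigator.storage.getDirectory();
  const handle = await root.getFileHandle(name, { create: true });
  const writable = await handle.createWritable();
  const response = await fetch(url);
  await response.body.pipeTo(writable); // closes the writable when done
  return handle.getFile();
}

const file = await cacheToOpfs(videoUrl, 'large-video.mp4');
await demuxer.load(file);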