Customize loader
It would be great if there were a way to customize the HTTP loader so that a user could easily override authentication and base URLs.
Indeed, that sounds useful.
I wonder if we could keep that on the TypeScript side during URL registration. E.g. something like the following would be rather easy to do:
const db = new duckdb.AsyncDuckDB(...);
await db.instantiate(...);
await db.registerFileURL('somefilename', 'https://somelocation', {
    username: ...,
    password: ...,
});
await db.runQuery("select * from read_csv_auto('somefilename')");
Implementing that via SQL would force us to do that upstream in DuckDB.
I think the loader should absolutely live in TS. Have a look at how we customize loaders in Vega as an example of what I mean.
I think your example is too limited as there are many different auth mechanisms.
Sure, I was just referring to the way this would actually be used. Things like the HTTP username would then never make it into SQL text, which would make this very specific to the API around duckdb-wasm. But it's certainly easier that way.
I guess one could come up with a way of registering a filename along with a generic loader. If you have a proposal, feel free to add it here.
https://github.com/vega/vega/tree/master/packages/vega-loader has a good API.
I think the loader mostly needs a function (url: string) -> data, where one can customize exactly where the data is fetched from and how.
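As a minimal TypeScript sketch of that idea (all names here are illustrative, not existing duckdb-wasm API):

```typescript
// The `(url: string) -> data` loader shape, as a type sketch.
type Loader = (url: string) => Promise<Uint8Array>;

// A user could then swap in their own fetching strategy, e.g. a
// bearer-token loader (the token handling is purely illustrative):
const authenticatedLoader = (token: string): Loader => async (url) => {
    const response = await fetch(url, {
        headers: { Authorization: `Bearer ${token}` },
    });
    return new Uint8Array(await response.arrayBuffer());
};
```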
Hm, I'm not convinced that (url: string) -> data would add much value.
The current API already allows registering Blobs, so if the data is to be fetched all at once, the user can just do that upfront and register the Blob as a file, right?
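That up-front path can be sketched like this; `registerFileBuffer` does exist in duckdb-wasm, but the structural interface and helper below are just illustrations:

```typescript
// Structural type for the one duckdb-wasm method this sketch needs.
interface FileRegistry {
    registerFileBuffer(name: string, buffer: Uint8Array): Promise<void>;
}

// Fetch the whole resource once (with whatever auth is needed) and
// register the resulting bytes as a file. The load parameter keeps the
// fetching strategy pluggable; returns the number of bytes registered.
async function registerPrefetched(
    db: FileRegistry,
    name: string,
    load: (name: string) => Promise<Uint8Array>,
): Promise<number> {
    const bytes = await load(name);
    await db.registerFileBuffer(name, bytes);
    return bytes.byteLength;
}
```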
So for any loader that we register with DuckDB, I'd at least expect the loader to read data chunk-wise; otherwise it amounts to the existing Blob registration plus one up-front fetch. E.g., this is a subset of the interface that we currently use for the filesystem:
openFile(mod: DuckDBModule, fileId: number): void;
syncFile(mod: DuckDBModule, fileId: number): void;
closeFile(mod: DuckDBModule, fileId: number): void;
getLastFileModificationTime(mod: DuckDBModule, fileId: number): number;
getFileSize(mod: DuckDBModule, fileId: number): number;
truncateFile(mod: DuckDBModule, fileId: number, newSize: number): void;
readFile(mod: DuckDBModule, fileId: number, buffer: number, bytes: number, location: number): number;
writeFile(mod: DuckDBModule, fileId: number, buffer: number, bytes: number, location: number): number;
Right now, a SELECT count(*) FROM 'https://someserver/foo.parquet' will effectively only fire range requests for the parquet metadata in the footer of the file, which might be only a few kilobytes even for a file of multiple gigabytes.
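That behavior hinges on HTTP range requests. A sketch of the chunk-wise read, assuming a server that supports ranges (the function names here are illustrative, not duckdb-wasm internals):

```typescript
// Format an HTTP Range header for `bytes` bytes starting at `location`
// (the same parameters that readFile above receives).
const rangeHeader = (location: number, bytes: number): string =>
    `bytes=${location}-${location + bytes - 1}`;

// Chunk-wise HTTP read: only the requested byte span travels over the
// wire, which is why reading a parquet footer stays cheap.
async function readRange(url: string, location: number, bytes: number): Promise<Uint8Array> {
    const response = await fetch(url, {
        headers: { Range: rangeHeader(location, bytes) },
    });
    if (response.status !== 206) {
        // 206 Partial Content signals that the server honored the range.
        throw new Error(`range request not supported: ${response.status}`);
    }
    return new Uint8Array(await response.arrayBuffer());
}
```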
I think this differs significantly from what Vega is doing. If we go for loaders, I'd definitely favour fine-grained, chunk-wise loaders, plus a clear statement that anything beyond that should just be done externally and registered as a Blob.
Ahh, right. I agree that range requests are crucial. Let's extract the loader with the API above into a default loader, and if someone needs to customize it, they implement that API.
We should add some simple options to the default loader to customize the baseURL and to support basic-auth authentication.
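Assuming option names along the lines of this discussion (none of this is existing duckdb-wasm API), the default-loader options might look like:

```typescript
// Hypothetical options for the default HTTP loader.
interface DefaultLoaderOptions {
    baseURL?: string;
    username?: string;
    password?: string;
}

// Resolve the final URL and headers for a request. btoa is available
// in browsers and in modern Node.
function buildRequest(
    url: string,
    opts: DefaultLoaderOptions,
): { url: string; headers: Record<string, string> } {
    const resolved = opts.baseURL ? new URL(url, opts.baseURL).toString() : url;
    const headers: Record<string, string> = {};
    if (opts.username !== undefined) {
        headers['Authorization'] =
            'Basic ' + btoa(`${opts.username}:${opts.password ?? ''}`);
    }
    return { url: resolved, headers };
}
```

Relative file URLs would then resolve against baseURL, and the credentials would stay in the TS API rather than ever appearing in SQL text.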