Support of registered files (DuckDB-WASM) in vgplot
Data loading helpers create DuckDB tables/views from parquet, JSON or CSV files. Unfortunately, this process can take a while when data is big. Sometimes — in DuckDB-WASM — it is nicer to register files and run queries directly on them (see registerFileURL). Query performance on DuckDB tables is optimal, but queries on parquet files are fast, too.
Here, we try to use a registered CSV file (via HTTP) in vgplot:
import * as vg from "@uwdata/vgplot";
import { DuckDBDataProtocol } from '@duckdb/duckdb-wasm';
const wasm = await vg.wasmConnector();
wasm.db.registerFileURL(
'athletes.csv',
'https://uwdata.github.io/mosaic-datasets/data/athletes.csv',
DuckDBDataProtocol.HTTP,
true,
);
vg.coordinator().databaseConnector(wasm);
const $query = vg.Selection.intersect();
export default vg.vconcat(
vg.menu({
label: 'Sport',
as: $query,
from: 'athletes.csv',
column: 'sport',
value: 'aquatics'
}),
vg.vspace(10),
vg.plot(
vg.barX(vg.from('athletes.csv', { filterBy: $query }), {
x: vg.sum('gold'),
y: 'nationality',
fill: 'steelblue',
sort: { y: '-x', limit: 10 },
}),
vg.xLabel('Gold Medals'),
vg.yLabel('Nationality'),
vg.yLabelAnchor('top'),
vg.marginTop(15)
)
);
Note, that references of mosaic clients point now to registered file athletes.csv, not a DuckDB table anymore.
This works! However, currently not all marks from vgplot are supported with registered files. Connected marks (e.g. area, line) don't work. When replacing above mark barX by
vg.lineY(vg.from('athletes.csv', { filterBy: $query }), {
x: 'nationality',
y: vg.sum('gold')
})
an error is raised:
Coordinator.js:145 Uncaught (in promise) Error: Catalog Error: Table with name csv does not exist!
I think the reason is probably due to some preprocessing of table name, see here. This logic conflicts with registered file name. I tried changing name of registered file athletes.csv, but if I omit .csv suffix, then DuckDB-WASM interprets reference as table.
It would be great if vgplot fully works with registered files (via HTTP(S), S3, local storage, etc.).
I won’t be able to take a closer look at this until next week (the issue is that Mosaic interprets the table name as schema.table), but in the meantime here are two ideas:
- Can you create a view over the registered file (eg, using an
execquery)? Then you can query the view as a normal table. - You might try creating a file without an extension or perhaps wrap the file name in an explicit sql expression (
sql`’athletes.csv’`or similar)?
Thanks, first workaround (creating a view for it) worked out.
Since views work well for this. I think we also improved escaping of table names in the meantime. I'll close this for now but if you find other issues, we can reopen or log more specific issues.
Quoting is in https://github.com/uwdata/mosaic/issues/378.