Data Explorer: Create preliminary positron-duckdb extension using duckdb-wasm to provide "headless" data explorer backend

Open wesm opened this issue 1 year ago • 10 comments

For epic #2187, addresses #4963.

This provides a new built-in positron-duckdb extension that loads duckdb-wasm in a web worker and provides an RPC endpoint using VS Code's command service for fulfilling Data Explorer requests. Right now only getting schemas, data values, and null-count summary statistics is supported, so follow-on work includes:

  • Numeric formatting and string truncation (respecting the passed FormatOptions)
  • Row filtering
  • Sorting
  • Detailed summary statistics
  • Histograms and frequency tables for sparklines
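
As a rough illustration of the command-based RPC idea described above (not the extension's actual code), a single entry point can look up a handler per request method and return an explicit error for anything not yet supported, such as filtering or sorting. The request/response shapes and the names RpcDispatcher, RpcRequest, RpcResponse, and RpcHandler are assumptions made up for this sketch.

```typescript
// Hypothetical request/response shapes; NOT the actual Data Explorer protocol.
interface RpcRequest {
	method: string;
	uri: string;
	params: Record<string, unknown>;
}

type RpcResponse =
	| { result: unknown }
	| { error_message: string };

type RpcHandler = (uri: string, params: Record<string, unknown>) => Promise<unknown>;

class RpcDispatcher {
	private readonly handlers = new Map<string, RpcHandler>();

	// Register the handler that fulfills one request method.
	register(method: string, handler: RpcHandler): void {
		this.handlers.set(method, handler);
	}

	// Single entry point: this is what one registered command would call.
	async handle(request: RpcRequest): Promise<RpcResponse> {
		const handler = this.handlers.get(request.method);
		if (handler === undefined) {
			// Methods that are not implemented yet get an explicit error reply.
			return { error_message: `Unsupported method: ${request.method}` };
		}
		try {
			return { result: await handler(request.uri, request.params) };
		} catch (err) {
			// Turn handler failures into error replies instead of rejections.
			return { error_message: err instanceof Error ? err.message : String(err) };
		}
	}
}
```

With this shape, supporting a new request type later means registering one more handler, and the caller never has to distinguish a missing handler from a failed one.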

There are some rough edges: for example, if you click on a file before the extension has fully loaded at application startup, it will fail. I will need to consult others on how to fix that.

Lastly, I have checked in some small (~10K total) data files to use in the extension tests (yarn test-extension -l positron-duckdb) and added exclusions to hygiene.js so that pre-commit checks do not complain about them. I'm not sure if there is a better way to handle this.

Other notes:

  • Added code to comms/generate-comms.ts to generate interfaces containing all the parameters for each RPC, as already exists for Rust and Python. This was needed to provide a fully formed command protocol to communicate with the extension. We can potentially look at further improving the TypeScript code generation.
  • I copied the interface stubs needed into an interfaces.ts file in the extension. Maybe it's possible to cross-import from the main codebase into the extension but I do not know the right incantation of tsconfig.json/package.json configurations to do this.

In action

https://github.com/user-attachments/assets/70dabb96-6330-49e4-8db1-10293c331051

QA Notes

You can click on .parquet, .csv, or .tsv files in the file explorer after Positron has loaded to open the data explorer.

wesm avatar Oct 09 '24 18:10 wesm

I'm not planning to do any more work in this branch, and will work on additional features in a branch based on this until this gets merged.

wesm avatar Oct 11 '24 12:10 wesm

One thing I could use some help on is how to determine when the built-in positron-duckdb extension has been loaded (if you click on a file too fast when the application is initializing, it will create a broken data explorer). I could add some sleep/retry logic but maybe there is a cleaner way to wait for built-in extensions to be loaded.
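
For reference, the sleep/retry fallback mentioned above could look roughly like the following. It is a minimal sketch, not code from this branch: retryWithBackoff, backoffDelay, and sleep are hypothetical helpers, and the default retry count and base delay are arbitrary.

```typescript
// Delay before retry number `attemptIndex` (0-based): baseMs, 2*baseMs, 4*baseMs, ...
function backoffDelay(attemptIndex: number, baseMs: number): number {
	return baseMs * Math.pow(2, attemptIndex);
}

function sleep(ms: number): Promise<void> {
	return new Promise<void>((resolve) => setTimeout(resolve, ms));
}

// Call `attempt` once, then retry up to `retries` more times with growing delays.
// Rejects with the last error if every attempt fails.
async function retryWithBackoff<T>(
	attempt: () => Promise<T>,
	retries: number = 5,
	baseMs: number = 100
): Promise<T> {
	let lastError: unknown;
	for (let i = 0; i <= retries; i++) {
		try {
			return await attempt();
		} catch (err) {
			lastError = err;
			if (i < retries) {
				await sleep(backoffDelay(i, baseMs));
			}
		}
	}
	throw lastError;
}
```

The first request sent to the extension would be wrapped in retryWithBackoff, so a click during startup is retried rather than producing a broken data explorer. The downside is that a genuinely missing extension only surfaces as an error after all retries are exhausted.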

wesm avatar Oct 11 '24 15:10 wesm

One thing I could use some help on is how to determine when the built-in positron-duckdb extension has been loaded

This is a surprisingly hard problem in the VS Code system that I also ran into when trying to resolve all the asynchronous behavior around runtime startup. If your extension activates eagerly, you can use whenAllExtensionHostsStarted (which I added to solve a related problem); if not then the easiest way through is to have your extension invoke a command in its activate() method.
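
A sketch of the second option, under the assumption that the core side registers some "ready" command that the extension invokes from activate(). ReadyBarrier, backendReady, onBackendReadyCommand, and sendRequest are hypothetical names for illustration; none of them are existing VS Code or Positron APIs.

```typescript
// A one-shot gate: callers await wait() until open() has been called.
class ReadyBarrier {
	private opened = false;
	private release!: () => void;
	private readonly promise: Promise<void>;

	constructor() {
		// The executor runs synchronously, so `release` is set before the constructor returns.
		this.promise = new Promise<void>((resolve) => { this.release = resolve; });
	}

	get isOpen(): boolean {
		return this.opened;
	}

	open(): void {
		this.opened = true;
		this.release();
	}

	wait(): Promise<void> {
		return this.promise;
	}
}

// Hypothetical wiring: the core side owns the barrier...
const backendReady = new ReadyBarrier();

// ...the handler of the (hypothetical) "ready" command opens it when the
// extension calls that command from its activate() method...
function onBackendReadyCommand(): void {
	backendReady.open();
}

// ...and every request waits on the barrier before being forwarded.
async function sendRequest<T>(forward: () => Promise<T>): Promise<T> {
	await backendReady.wait();
	return forward();
}
```

Requests issued before activation simply queue on the barrier and proceed once the extension reports in, so no polling or sleeping is needed.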

jmcphers avatar Oct 11 '24 19:10 jmcphers

If your extension activates eagerly, you can use whenAllExtensionHostsStarted

It does activate eagerly, so I'll use that! How do you tell which positron-* extensions activate eagerly and which do not? (Mine does by chance, from copying lines of code from other Positron built-in extensions, not knowingly on my part.)

wesm avatar Oct 11 '24 20:10 wesm

I added

await this._extensionService.whenAllExtensionHostsStarted();

to the main _execRpc method and it seems to hang and never resolve, both before and after the application loading phase completes. So maybe we'll have to figure that out in a follow-up PR.
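
Until the cause is understood, one way to keep a never-resolving wait from stalling every request is to race it against a timer. This is only a sketch: withTimeout is a hypothetical helper, not a VS Code or Positron API, and the 5-second figure in the usage comment is arbitrary.

```typescript
// Resolve or reject with `promise` if it settles within `ms` milliseconds;
// otherwise reject with an error naming what was being waited on.
function withTimeout<T>(promise: Promise<T>, ms: number, what: string): Promise<T> {
	return new Promise<T>((resolve, reject) => {
		const timer = setTimeout(
			() => reject(new Error(`${what} did not settle within ${ms} ms`)),
			ms
		);
		promise.then(
			(value) => { clearTimeout(timer); resolve(value); },
			(err) => { clearTimeout(timer); reject(err); }
		);
	});
}

// Possible use inside _execRpc (untested assumption):
// await withTimeout(this._extensionService.whenAllExtensionHostsStarted(), 5000, 'extension host startup');
```

This does not fix the hang itself; it only converts it into a visible error that can be logged or handled.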

I'll try to make a release build and check that everything still works there.

wesm avatar Oct 11 '24 20:10 wesm

I tried making a release build and it has a webpack error:

[16:27:00] Bundled extension: positron-duckdb/extension.webpack.config.js...
[16:27:00] 'vscode' errored after 38 min
[16:27:00] Error: ModuleDependencyWarning: Critical dependency: the request of a dependency is an expression
ModuleDependencyWarning: Critical dependency: the request of a dependency is an expression
ModuleDependencyWarning: Critical dependency: the request of a dependency is an expression
ModuleDependencyWarning: Critical dependency: the request of a dependency is an expression
    at formatError (/home/wesm/code/positron/node_modules/gulp-cli/lib/versioned/^4.0.0/format-error.js:21:10)
    at Gulp.<anonymous> (/home/wesm/code/positron/node_modules/gulp-cli/lib/versioned/^4.0.0/log/events.js:33:15)
    at Gulp.emit (node:events:531:35)
    at Gulp.emit (node:domain:488:12)
    at Object.error (/home/wesm/code/positron/node_modules/undertaker/lib/helpers/createExtensions.js:61:10)
    at handler (/home/wesm/code/positron/node_modules/now-and-later/lib/map.js:50:14)
    at f (/home/wesm/code/positron/node_modules/once/once.js:25:25)
    at f (/home/wesm/code/positron/node_modules/once/once.js:25:25)
    at tryCatch (/home/wesm/code/positron/node_modules/async-done/index.js:24:15)
    at done (/home/wesm/code/positron/node_modules/async-done/index.js:40:12)
error Command failed with exit code 1.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.

The code path that webpack objects to seems to be this one, because the module paths are resolved dynamically at runtime:

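import { dirname, resolve } from 'path';
import * as duckdb from '@duckdb/duckdb-wasm';

// Locate the installed duckdb-wasm dist directory at runtime
const modPath = require.resolve('@duckdb/duckdb-wasm');
const dist_path = dirname(modPath);

// Node.js bundles (wasm module plus worker script) for each variant
const MANUAL_BUNDLES = {
	mvp: {
		mainModule: resolve(dist_path, './duckdb-mvp.wasm'),
		mainWorker: resolve(dist_path, './duckdb-node-mvp.worker.cjs')
	},
	eh: {
		mainModule: resolve(dist_path, './duckdb-eh.wasm'),
		mainWorker: resolve(dist_path, './duckdb-node-eh.worker.cjs')
	}
};

// Let duckdb-wasm pick the bundle suited to the current platform
const bundle = await duckdb.selectBundle(MANUAL_BUNDLES);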

The duckdb-wasm documentation has a section about use with webpack. I tinkered with it but wasn't able to get it working, and I don't really know what I'm doing here, so I'm going to need some help from others: @petetronic @jmcphers @seeM

https://duckdb.org/docs/api/wasm/instantiation.html#webpack

wesm avatar Oct 12 '24 20:10 wesm

Here's what ChatGPT has to say on the matter if it is not hallucinating:

https://gist.github.com/wesm/f6e227b72653167dbc966031e7933782

wesm avatar Oct 12 '24 20:10 wesm

It seems like we will have to do some work to get the wasm bundles loading both in a development context and a webpack context, maybe similar to the tree-sitter-wasm stuff. Let me know if someone is available to help me with this. In the meantime, I'll work on follow-on data explorer features on top of this in a separate branch.

wesm avatar Oct 12 '24 20:10 wesm

I've spent half my weekend trying to get the webpack build to work and I'm completely stumped.

I have an error like:

$ yarn gulp compile-extensions-build
<SNIP>
[14:47:46] Bundled extension: positron-duckdb/extension.webpack.config.js...
[14:47:47] 'compile-extensions-build' errored after 1.25 min
[14:47:47] Error: ModuleDependencyWarning: Critical dependency: the request of a dependency is an expression
ModuleDependencyWarning: Critical dependency: the request of a dependency is an expression
ModuleDependencyWarning: Critical dependency: the request of a dependency is an expression
ModuleDependencyWarning: Critical dependency: the request of a dependency is an expression
    at formatError (/home/wesm/code/positron/node_modules/gulp-cli/lib/versioned/^4.0.0/format-error.js:21:10)
    at Gulp.<anonymous> (/home/wesm/code/positron/node_modules/gulp-cli/lib/versioned/^4.0.0/log/events.js:33:15)
    at Gulp.emit (node:events:531:35)
    at Gulp.emit (node:domain:488:12)
    at Object.error (/home/wesm/code/positron/node_modules/undertaker/lib/helpers/createExtensions.js:61:10)
    at handler (/home/wesm/code/positron/node_modules/now-and-later/lib/map.js:50:14)
    at f (/home/wesm/code/positron/node_modules/once/once.js:25:25)
    at f (/home/wesm/code/positron/node_modules/once/once.js:25:25)
    at tryCatch (/home/wesm/code/positron/node_modules/async-done/index.js:24:15)
    at done (/home/wesm/code/positron/node_modules/async-done/index.js:40:12)
error Command failed with exit code 1.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.

I fiddled with the webpack config to try to isolate the Node.js DuckDB wasm configuration (e.g. using webpack.IgnorePlugin), but it appears to be trying to analyze and bundle the getDuckDBNodeBundles function. So I'm going to stop here and send out an SOS for someone else to help figure this out.

wesm avatar Oct 13 '24 19:10 wesm

In case it helps others, I found this webpack-based web app in the duckdb-wasm repository, which may be a useful reference:

https://github.com/duckdb/duckdb-wasm/tree/main/packages/duckdb-wasm-app

I'm going to stop spending more time on this before I tear all my hair out =)

wesm avatar Oct 13 '24 20:10 wesm

I don't believe I'm going to be able to get this working on my own, so I am going to stop fiddling with it and making more of a mess.

wesm avatar Oct 15 '24 23:10 wesm

I tested release builds locally on Linux and macOS, so I'm going ahead and merging this. Thanks @jmcphers for the save on the webpack issues!

wesm avatar Oct 17 '24 22:10 wesm