
No functioning example

Open v4lue4dded opened this issue 3 months ago • 1 comments

I just spent an entire day trying to get parquet-wasm to read a parquet file and console.log() the result, and couldn't get it done. Admittedly, I'm a Python programmer and new to JavaScript.

However, as far as I could tell, none of the examples currently in the README.MD work out of the box.

This is very unfortunate: since this is a JavaScript library, it should be possible to run a functioning example right in the GitHub Pages of a repo. (Not necessarily this repo, just some example repo with code that already runs.)

Something similar to https://hyparam.github.io/hyparquet/ would go a long way toward making this library more user-friendly to people like me.

For now I will be giving up on this library, since I cannot get it to work in a reasonable amount of time.

v4lue4dded avatar Apr 07 '24 17:04 v4lue4dded

These two Observable examples are online, reproducible examples: https://github.com/kylebarron/parquet-wasm#published-examples

kylebarron avatar Apr 08 '24 15:04 kylebarron

Thank you for the reply :)

I did see the Observable examples, but I admittedly found the platform clunky and unintuitive to use, and it felt like it was hiding a lot of code from me.

I have by now figured out how to download the code from the example. From what I can tell, though, it is not plain JavaScript at work but some proprietary wrapper around the JavaScript that I can't really replicate.

//...



function _d3(require){return(
require("https://d3js.org/d3.v5.min.js")
)}

function _mapboxgl(require){return(
require("[email protected]/dist/mapbox-gl.js")
)}

function _arrow(require){return(
require("apache-arrow")
)}

function _deck(require){return(
require.alias({
  h3: {}
})("[email protected]/dist.min.js")
)}

function _deckgl(mapContainer,deck,mapboxgl)
{
  // This is an Observable hack: clear previously generated content
  mapContainer.innerHTML = "";

  return new deck.DeckGL({
    // The HTML container to render into
    container: mapContainer,
    map: mapboxgl,
    mapStyle:
      "https://basemaps.cartocdn.com/gl/positron-nolabels-gl-style/style.json",

    // Viewport settings
    initialViewState: {
      longitude: 0,
      latitude: 15,
      zoom: 1,
      pitch: 0,
      bearing: 0
    },
    controller: true
  });
}


export default function define(runtime, observer) {
  const main = runtime.module();
  function toString() { return this.url; }
  const fileAttachments = new Map([
    ["[email protected]", {url: new URL("./files/ad0a1f0e7e5cc8290068443d99bbd1307877e1ba631e30622bbd5fd8adca660d2644fe8181db5dbd8d41be0c2eae868304deeb0efc8690d373553dcb859bc767.bin", import.meta.url), mimeType: "application/octet-stream", toString}]
  ]);
  main.builtin("FileAttachment", runtime.fileAttachments(name => fileAttachments.get(name)));
  main.variable(observer()).define(["md"], _1);
  main.variable(observer()).define(["md"], _2);
  main.variable(observer()).define(["md"], _3);
  main.variable(observer()).define(["md"], _4);
  main.variable(observer()).define(["md"], _5);
  main.variable(observer()).define(["md"], _6);
  main.variable(observer()).define(["md"], _7);
  main.variable(observer()).define(["md"], _8);
  main.variable(observer("viewof form")).define("viewof form", ["Inputs"], _form);
  main.variable(observer("form")).define("form", ["Generators", "viewof form"], (G, _) => G.input(_));
  main.variable(observer("mapContainer")).define("mapContainer", ["html"], _mapContainer);
  main.variable(observer("metricMapping")).define("metricMapping", _metricMapping);
  main.variable(observer("readParquet")).define("readParquet", _readParquet);
  main.variable(observer("arrowTable")).define("arrowTable", ["parquetFile","readParquet","arrow"], _arrowTable);
  main.variable(observer("parquetFile")).define("parquetFile", ["FileAttachment"], _parquetFile);
  main.variable(observer("geometryColumn")).define("geometryColumn", ["arrowTable"], _geometryColumn);
  main.variable(observer("flatCoordinateArray")).define("flatCoordinateArray", ["geometryColumn"], _flatCoordinateArray);
  main.variable(observer("layer")).define("layer", ["arrowTable","flatCoordinateArray","colorAttribute","deck","deckgl"], _layer);
  main.variable(observer("colorAttribute")).define("colorAttribute", ["metricMapping","form","arrowTable","colorScale"], _colorAttribute);
  main.variable(observer("colorScale")).define("colorScale", ["d3","form"], _colorScale);
  main.variable(observer("d3")).define("d3", ["require"], _d3);
  main.variable(observer("mapboxgl")).define("mapboxgl", ["require"], _mapboxgl);
  main.variable(observer("arrow")).define("arrow", ["require"], _arrow);
  main.variable(observer("deck")).define("deck", ["require"], _deck);
  main.variable(observer("deckgl")).define("deckgl", ["mapContainer","deck","mapboxgl"], _deckgl);
  return main;
}

I'll probably try again next weekend to unwrap that code to see if I can get it working for my project.

Both examples do seem to use outdated version of the library though: https://observablehq.com/@bmschmidt/hello-parquet-wasm uses https://unpkg.com/[email protected]/web.js which seems like a very early version and https://observablehq.com/@kylebarron/geoparquet-on-the-web uses https://unpkg.com/[email protected]/esm/arrow2.js which is no longer recommended since it is a 2 if I understand things correctly.

It would just have been very useful to a javascript beginner like me to have a very simple example on github pages that uses the currently recommended version of the library to simply read a complete parquet file (either a small example from the github repo or a drop in file) and displays the result on screen. That would be a lot easier for me to iterate from.

v4lue4dded avatar Apr 10 '24 16:04 v4lue4dded

which is no longer recommended since it is a 2 if I understand things correctly

The arrow2 API is deprecated and won't receive updates, but it should still work. The API of the latest beta is very similar to the previous API though.

It would just have been very useful to a javascript beginner like me to have a very simple example on github pages that uses the currently recommended version of the library to simply read a complete parquet file (either a small example from the github repo or a drop in file) and displays the result on screen. That would be a lot easier for me to iterate from.

I agree that would be nice, but I don't have time to create a standalone example at this point. Contributions (from you or someone else) would be welcome.

I generally recommend using the type hints on each function as a guide for how to fetch data; that is the easiest way to get started.

kylebarron avatar Apr 10 '24 16:04 kylebarron

In case it's useful to you, I'm using this in production here: https://github.com/developmentseed/lonboard/blob/dca942da9b5bd40769068a76c45e76c9b1c9c49c/src/parquet.ts

kylebarron avatar Apr 10 '24 16:04 kylebarron

I published 0.6.0, added new content to the README, and updated https://observablehq.com/@kylebarron/geoparquet-on-the-web to use parquet-wasm 0.6. Hopefully this is easier to follow.

kylebarron avatar Apr 21 '24 21:04 kylebarron

This should work in vanilla JavaScript:

import initParquetWasm, {readParquet} from "https://cdn.jsdelivr.net/npm/[email protected]/+esm";

await initParquetWasm("https://cdn.jsdelivr.net/npm/[email protected]/esm/parquet_wasm_bg.wasm");

(Unfortunately the default path to parquet_wasm_bg.wasm doesn’t work when using /+esm because it resolves to the wrong directory. I think it’s possible that it would work if you used import.meta.resolve instead of new URL(…, import.meta.url), but I’m not sure whether jsDelivr will rewrite import.meta.resolve calls to fix the relative path when using /+esm.)

mbostock avatar Apr 21 '24 22:04 mbostock

It does work for me (at least in Deno) with

import initParquetWasm, {readParquet} from "https://cdn.jsdelivr.net/npm/[email protected]/esm/parquet_wasm.js";
await initParquetWasm();

I don't know if it's possible to rewrite the import with +esm. I specifically enabled that path as a known entry point so that import "parquet-wasm/esm/parquet_wasm.js" would work both in an application and from a browser. https://github.com/kylebarron/parquet-wasm/blob/09bc32e9b0cc2a44fd55dc7990f594fbaa08988b/templates/package.json#L37-L40

I think it’s possible that it would work if you used import.meta.resolve instead of new URL(…, import.meta.url)

That part is auto-generated by wasm-bindgen, so it's not something easy for me to change.

kylebarron avatar Apr 22 '24 02:04 kylebarron

Yes, that would work too. The /+esm is nice because it bundles and minifies local imports, so the module publisher (you) typically doesn’t have to build and publish the bundle; the CDN does it.

It also works if you do this:

import initParquetWasm, {readParquet} from "https://cdn.jsdelivr.net/npm/[email protected]/esm/+esm";

await initParquetWasm();

This uses your ./esm entry point, and because it’s in the same folder as the source file, the relative path to the .wasm file works.
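The directory mismatch described above can be reproduced with plain URL resolution. This is a minimal sketch, assuming the generated glue code locates the binary with new URL("parquet_wasm_bg.wasm", import.meta.url) as discussed in this thread. The resolveWasm helper is hypothetical, and the module URLs omit the version pin purely for illustration.

```javascript
// Hypothetical helper: mimics `new URL(<relative>, import.meta.url)` by taking
// the importing module's URL as an explicit argument.
function resolveWasm(moduleUrl) {
  return new URL("parquet_wasm_bg.wasm", moduleUrl).href;
}

// Module URLs in the jsDelivr style (assumption: version pin omitted).
const rootEntry = "https://cdn.jsdelivr.net/npm/parquet-wasm/+esm";
const esmEntry = "https://cdn.jsdelivr.net/npm/parquet-wasm/esm/+esm";

// Relative resolution replaces the last path segment ("+esm") of the base URL.
const fromRoot = resolveWasm(rootEntry);
const fromEsm = resolveWasm(esmEntry);

console.log(fromRoot); // resolves against the package root
console.log(fromEsm);  // resolves against the esm/ folder, next to the source file
```

Only the second URL ends in esm/parquet_wasm_bg.wasm, which matches the explicit path passed to initParquetWasm earlier in the thread.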

I would consider using import.meta.resolve instead of import.meta.url though, as it’s the more semantic way of resolving a relative resource.

Also, I think you’ll want to add the .wasm to your exports map in the package.json because these files are part of your module’s public API and you expect people to load them.

mbostock avatar Apr 22 '24 03:04 mbostock

Thanks for the tips!

Yes, that would work too. The /+esm is nice because it bundles and minifies local imports, so the module publisher (you) typically doesn’t have to build and publish the bundle; the CDN does it.

Oh very cool. I probably should suggest that from the README.

I would consider using import.meta.resolve instead of import.meta.url though, as it’s the more semantic way of resolving a relative resource.

I see. That makes sense. MDN does say

you should use import.meta.resolve(moduleName) instead of new URL(moduleName, import.meta.url) for these use cases wherever possible

I'll make an issue in wasm-bindgen tomorrow.

Also, I think you’ll want to add the .wasm to your exports map in the package.json because these files are part of your module’s public API and you expect people to load them.

Thanks for pointing this out. I see duckdb-wasm does this too. https://github.com/duckdb/duckdb-wasm/blob/58fcb9a46b73eac1abb9b0dee9d7c46d1a84f628/packages/duckdb-wasm/package.json#L99-L101

kylebarron avatar Apr 22 '24 04:04 kylebarron

In case it's useful to you, I'm using this in production here: https://github.com/developmentseed/lonboard/blob/dca942da9b5bd40769068a76c45e76c9b1c9c49c/src/parquet.ts

@kylebarron FYI: I got it working a week ago with that code snippet. Sorry that I hadn't answered yet! Thanks for that! I had to use the webpack bundler though, which was a bit of a step for me. ^^

Do I understand it correctly that (https://github.com/kylebarron/parquet-wasm/issues/489#issuecomment-2068228673) means it would work without working with a bundler, just with a cdn.jsdelivr.net import? :)

That would be really cool!!

v4lue4dded avatar Apr 23 '24 12:04 v4lue4dded

it would work without working with a bundler, just with a cdn.jsdelivr.net import? :)

Yes. But you need to make sure you manually initialize the wasm code, whereas with the bundler entry point the wasm should, I think, be initialized behind the scenes.
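One way to make sure manual initialization happens exactly once is to cache the promise returned by the init call. This is a minimal sketch, assuming an async init function such as the initParquetWasm default export shown earlier in the thread. Here fakeInit is a stand-in so the snippet runs without network access, and ensureWasmReady is a hypothetical helper name, not part of parquet-wasm.

```javascript
let initCalls = 0;

// Stand-in for the real async init function (assumption: it returns a promise).
async function fakeInit() {
  initCalls += 1;
}

let ready; // cached promise; undefined until the first call

// Hypothetical helper: starts initialization on first use, then reuses the promise.
function ensureWasmReady(init = fakeInit) {
  ready ??= init();
  return ready;
}

// Two callers share one initialization.
const first = ensureWasmReady();
const second = ensureWasmReady();
first.then(() => console.log(`init ran ${initCalls} time(s)`));
```

In real code, each caller would presumably await ensureWasmReady(initParquetWasm) before calling readParquet, so that no read starts before the wasm module is ready.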

I made a PR to update the jsdelivr link in the readme, and made new issues for the other comments above. So I think this issue can be closed.

kylebarron avatar Apr 23 '24 14:04 kylebarron