plot
plot copied to clipboard
Optimize field shorthand for columnar data?
We currently support field (named properties specified as strings) as channel value shorthand, e.g., y: "foo" is upgraded to y: d => d["foo"] or equivalently y: data.map(d => d["foo"]). For Arquero (and perhaps other column-oriented data structures), this still works because Arquero provides an object iterator; however, it’d be more efficient if Plot supported the same shorthand with Arquero’s column accessors, upgrading y: "foo" to ~~y: data.column("foo")~~ y: data.array("foo"), which returns an iterable over the values. ~~Although, currently this would still involve copying the column into a new array, so we actually want y: data.column("foo").data if possible (e.g., Arquero’s Column class could support an toArray method which returns the data if possible, and otherwise constructs a new array using the iterator).~~
I was going to post an issue to this effect after a brief skim of the source code, as it appears to involve a fair bit of data copying. Glad to see an issue is already here! Arquero includes a table.array method that extracts an array (optionally into a specified typed array type). Currently this method performs a copy (necessary if coercing types or if the table has filter or orderby criteria) but we could optimize this to return an underlying array directly (zero copy) when applicable. Even with an array copy here, it might be nice to avoid object serialization of table rows. Along those lines, Arquero can also provide a column iterator using table.values.
Also, I would advise against column('foo').data as a general pattern, as other column implementations are not guaranteed to support this (Apache Arrow data, etc.).
If Observable/Plot eventually settles on a more general convention for accessing columnar data sources, we're happy to look into adapting Arquero along those lines.
I think we should focus on doing this specifically for Apache Arrow table instances first (since these are the most widely-used columnar data representation, including by the DuckDBClient/SQL code blocks in Observable Framework).
Here’s is trivial example of how to pass in columnar data. Given an Arrow table:
const data = Arrow.tableFromArrays({
id: [1, 2, 3],
name: ["Alice", "Bob", "Charlie"],
age: [35, 25, 45]
});
You can pass the columns in like so:
Plot.barY({length: data.numRows}, {x: data.getChild("name"), y: data.getChild("age")}).plot()
The goal of this issue is to make the above equivalent to the following shorthand syntax:
Plot.barY(data, {x: "name", y: "age"}).plot()
To do this, we will need to detect when data is an Apache Arrow table, and change the definition of the field shorthand to use data.getChild(name) instead of Array.from(data, (d) => d[name]). Also, we’ll want to avoid materializing the array-of-objects as data (the arrayify(this.data) call in mark.initialize), and instead use an efficient numeric index [0, … numRows - 1].