Persistent shaper configs applied upon ingest to pools
To date, applying shapers has always been a client-side operation. A couple of examples:

- `zq` may be used (often with `-I`) to shape data before importing it into a lake via `zed load`, dragging it into the Brim app, etc.
- Brimcap performs all its shaping to turn data into rich ZNG client-side before it's posted to a Zed Lake
However, a useful app workflow might be to define a persistent shaper config such that unshaped data could be incrementally added to a pool and shaped server-side without the user having to include or mention the shaper code every time. For example, in a multi-person org, a Zed-savvy user may be responsible for perfecting the "golden" shaping configs and defining policy that ensures they're applied on incoming data for certain pools. Then other users could just import their NDJSON/CSV/etc. directly to those pools without having to know anything about shapers.
We expect this issue to start with design tasks such as determining how the shaping configs are attached/persisted in the Lake, then thinking about how they'd be invoked by the Brim app and `zed`.
Note: This may overlap with the "intake" concept we've discussed in the past and tracked via brimdata/brim#1481.
Moving to icebox for now, as this should probably be part of an ingest system design instead of an attribute of a pool in the backend.
Today I thought of this issue again in the context of a community user's inquiry in a recent Slack thread. They were trying to use the Python client to replicate command lines they'd traditionally done at the shell. Their specific question:
is there a way to specify the type of some fields like I can with `zq | zed load -`? For eg: `zq -i json '_ts:=time(created)' infile.json | zed load -`
I couldn't think of a way for them to replicate that whole pipeline within Python unless they were invoking the zq binary. That is, the Zed Python client can load data from file-like objects into the lake, or read data back out of the lake using queries, but the first half of that pipeline is entirely "non-lake". That got me to thinking again about how it would be handy if that kind of shaping were somehow persisted server-side so it could be applied on ingest, since that way a dumber client like the Python one (or an even dumber one like curl) could post the unshaped data and have it be shaped before being stored. FWIW, when I described this to the user, their response was "oh that would be awesome!", but per @mccanne's most recent comment above, I'm sure there's other design considerations that might favor an approach other than the one I originally thought of in this issue.
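Lacking server-side shaping, the closest pure-Python stand-in is to transform each record before handing it to the client's load method, reproducing the effect of `_ts:=time(created)` in the user's pipeline. A minimal sketch (the `created` field name comes from their example; the `shape` helper and sample record are illustrative, not part of any Zed API):

```python
import json
from datetime import datetime, timezone

def shape(record: dict) -> dict:
    """Emulate the zq expression `_ts:=time(created)` in plain Python:
    parse the `created` string into a timestamp and store it as `_ts`."""
    shaped = dict(record)
    # fromisoformat doesn't accept a trailing "Z" before Python 3.11,
    # so normalize it to an explicit UTC offset first.
    created = record["created"].replace("Z", "+00:00")
    shaped["_ts"] = datetime.fromisoformat(created).astimezone(timezone.utc)
    return shaped

# One unshaped input record, as it might arrive from a JSON file.
raw = json.loads('{"created": "2024-12-05T09:35:27+00:00", "msg": "hi"}')
shaped = shape(raw)
print(shaped["_ts"].isoformat())  # 2024-12-05T09:35:27+00:00
```

The shaped records could then be serialized and passed to the client's load call, but the shaping logic still has to live in every client, which is exactly the duplication a persistent server-side shaper would avoid.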
A community user asked about this in another recent Slack thread. In their own words:
Im loading data via python and zed does not recognize any of the timestamps in the row as a timestamp even when selected within zui. Is there a way when loading data to tell zed a field is a timestamp? I have also tried `client.create_pool(pool, layout={'keys':[['ts']]})`. These didnt work.

Example fields from their data:

`ts: "2024-12-05T09:35:27.000000+00:00", timestamp: "2024-12-05T09:35:52.763Z", eventTime: "2024-12-05 09:35:27.000000", epochTime: 1733391327000000`
Their use case was unique because they were using the load method in the Python client, so the typical approaches, such as shaping via the Preview & Load screen in the GUI, were not an option, and they explained that using Python subprocesses to call out to zq before load was not feasible. I then mentioned how the current work on SuperSQL may result in a feature similar to the one described in this issue, since a common pattern in SQL is to define the schema of a particular table, including type definitions, before data is inserted into the table, which results in a shape-on-load behavior. This did indeed resonate with the user. Once again in their own words:
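To make the user's problem concrete: the timestamp representations in their sample row all denote time values but each needs a different parsing rule, which is precisely the kind of per-field knowledge a shape-on-load config (or SQL-style column types) would capture once, server-side. A sketch in plain Python using the sample values from the thread (the field names are theirs; the parsing choices are mine, and the zone-less `eventTime` string is assumed to be UTC):

```python
from datetime import datetime, timezone

# The four representations from the user's sample row.
ts = "2024-12-05T09:35:27.000000+00:00"    # ISO 8601 with explicit offset
timestamp = "2024-12-05T09:35:52.763Z"     # ISO 8601 with "Z" suffix
event_time = "2024-12-05 09:35:27.000000"  # no zone; assumed UTC here
epoch_time = 1733391327000000              # microseconds since the epoch

dt_ts = datetime.fromisoformat(ts)
# Normalize "Z" for pre-3.11 fromisoformat.
dt_timestamp = datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
dt_event = datetime.fromisoformat(event_time).replace(tzinfo=timezone.utc)
dt_epoch = datetime.fromtimestamp(epoch_time / 1_000_000, tz=timezone.utc)

# ts, eventTime, and epochTime all name the same instant.
assert dt_ts == dt_event == dt_epoch
```

Each representation needs its own one-off handling in every client; persisting rules like these with the pool would let a "dumber" client post the raw JSON unchanged.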
You are correct giving a schema or etl capabilities would be great. Im pulling data in python from kafka so its all json but if I can set field types proactively to ip etc... it would be amazing