Handle non-adjacent fields in nest_dotted()
Repro is with Zed commit 48342ce.
Start from the simple example in the nest_dotted() docs.
$ zq -version
Version: v1.11.1-3-g48342ce9
$ echo '{"a.b.c":"foo"}' | zq -Z 'yield nest_dotted()' -
{
    a: {
        b: {
            c: "foo"
        }
    }
}
Now let's add on a second nested field and a top-level field, which also works fine.
$ echo '{"a.b.c":"foo", "a.b.d":"bar", "x":"baz"}' | zq -Z 'yield nest_dotted()' -
{
    a: {
        b: {
            c: "foo",
            d: "bar"
        }
    },
    x: "baz"
}
However, if the top-level field moves one position to the left, so that it sits between the two nested fields, it fails.
$ echo '{"a.b.c":"foo", "x":"baz", "a.b.d":"bar"}' | zq -Z 'yield nest_dotted()' -
error({
    message: "nest_dotted: fields in record a must be adjacent",
    on: {
        "a.b.c": "foo",
        x: "baz",
        "a.b.d": "bar"
    }
})
How difficult would it be to start supporting this?
Context
When testing the Grok functionality added in #4827, things improved to the point where we could successfully parse using the example Grok from this Elastic article. Field names containing dots are produced, with the understanding that the user could keep them as is if that's what they wanted, or apply nest_dotted() downstream in their Zed pipeline.
$ zq -version
Version: v1.11.1-3-gcc689620
$ echo '"55.3.244.1 GET /index.html 15824 0.043 other stuff"' | zq -Z 'yield grok("%{IP:host.ip} %{WORD:http.request.method} %{URIPATHPARAM:url.original} %{NUMBER:http.request.bytes:int} %{NUMBER:event.duration:double} %{GREEDYDATA:my_greedy_match}", this)' -
{
    "host.ip": "55.3.244.1",
    "http.request.method": "GET",
    "url.original": "/index.html",
    "http.request.bytes": "15824",
    "event.duration": "0.043",
    my_greedy_match: "other stuff"
}
However, if I go ahead and take the next logical step, I get the kind of failure I showed above.
$ echo '"55.3.244.1 GET /index.html 15824 0.043 other stuff"' | zq -Z 'yield grok("%{IP:host.ip} %{WORD:http.request.method} %{URIPATHPARAM:url.original} %{NUMBER:http.request.bytes:int} %{NUMBER:event.duration:double} %{GREEDYDATA:my_greedy_match}", this) | nest_dotted()' -
error({
    message: "nest_dotted: fields in record http must be adjacent",
    on: {
        "host.ip": "55.3.244.1",
        "http.request.method": "GET",
        "url.original": "/index.html",
        "http.request.bytes": "15824",
        "event.duration": "0.043",
        my_greedy_match: "other stuff"
    }
})
It took a bit of head scratching for me to remember this limitation even existed. I went on a historical tour and it looks like this limitation first arrived almost 4 years ago as part of #127. A relevant comment from @aswan back then:
// Note that we require any nested fields from the same parent record
// to be adjacent. Alternatively we could re-order provided fields
// so the output record can be constructed efficiently, though we don't
// do this now since it might confuse users who expect to see output
// fields in the order they specified.
That was written in the context of the cut processor. While I can still see some sense in that argument, it seems that needs to be weighed against the user experiencing this failure and now having to:
- Understand what it's saying, then,
- Perform a potentially lengthy exercise with order(), rename, put, cut, etc. to create the desired nesting by hand.
Personally, I think if it "anchored" the nesting starting with the left-most appearance of a top level field name (e.g., http in this case) that would seem just as defensible but would avoid the error. Is that super complicated or a performance killer? 🤔
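To make the "anchoring" idea concrete, here's a rough Python sketch of the semantics I have in mind. It's only a model of the proposed behavior, not how nest_dotted() is actually implemented: the name nest_anchored is made up, plain dicts stand in for Zed records, and conflicting fields (e.g., both "a" and "a.b" in the same record) are not handled.

```python
def nest_anchored(record: dict) -> dict:
    """Hypothetical model: nest dotted keys, placing each parent record
    at the position where its name first appears in the input."""
    out = {}
    # Python dicts preserve insertion order, so a parent created on its
    # first appearance stays "anchored" there even if more of its fields
    # show up later, after unrelated fields.
    for key, value in record.items():
        *parents, leaf = key.split(".")
        node = out
        for name in parents:
            # Reuse the parent if it already exists, otherwise create it.
            node = node.setdefault(name, {})
        node[leaf] = value
    return out


# The record from the first failing example above.
example = {"a.b.c": "foo", "x": "baz", "a.b.d": "bar"}
nested = nest_anchored(example)
```

Here nested has a first (holding both c and d under b) and x second, since a appeared before x in the input. The cost is a lookup per path component for every field, which is the part I'd guess matters for the performance question.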
While we're waiting on a proper fix for this, I realized it's possible to put together a decent workaround using existing building blocks. Expressed as a user-defined operator and stored in a file nest_dotted_nonadj.zed:
op nest_dotted_nonadj(): (
  over flatten(this) => (
    sort key
    | collect(this)
    | unflatten(this)
  )
  | nest_dotted(this)
)
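For anyone wondering why the sort makes the error go away: assuming the sort orders the flattened keys lexicographically, fields that share a dotted prefix end up next to each other, which is exactly what the adjacency requirement wants. Below is a small Python model of that reasoning. The names nest_dotted_strict and workaround are hypothetical, dicts stand in for Zed records, and the adjacency check only looks at the top-level parent, which is my simplification rather than the actual implementation.

```python
def nest_dotted_strict(record: dict) -> dict:
    """Hypothetical model of the current behavior: nest dotted keys, but
    reject input where a top-level parent's fields are not adjacent."""
    out = {}
    prev_top = None
    for key, value in record.items():
        *parents, leaf = key.split(".")
        top = key.split(".")[0]
        # Seeing a parent again after some other field intervened
        # is the non-adjacent case.
        if top in out and top != prev_top:
            raise ValueError(f"fields in record {top} must be adjacent")
        node = out
        for name in parents:
            node = node.setdefault(name, {})
        node[leaf] = value
        prev_top = top
    return out


def workaround(record: dict) -> dict:
    """Models the operator: sort the fields by key, rebuild, then nest."""
    # Stands in for flatten | sort key | collect | unflatten.
    by_key = dict(sorted(record.items(), key=lambda kv: kv[0]))
    return nest_dotted_strict(by_key)


raw = {"a.b.c": "foo", "x": "baz", "a.b.d": "bar"}
fixed = workaround(raw)  # nest_dotted_strict(raw) alone would raise
```

One side effect worth knowing about: because of the sort, the output fields come out in key order rather than input order.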
To test it, we'll use this input file data.json, which contains both of the records shown above that trigger the error when sent directly to nest_dotted().
{"a.b.c":"foo", "x":"baz", "a.b.d":"bar"}
{"host.ip": "55.3.244.1", "http.request.method": "GET", "url.original": "/index.html", "http.request.bytes": "15824", "event.duration": "0.043", my_greedy_match: "other stuff"}
Putting it all together:
$ zq -version
Version: v1.12.0-5-g90f5e2c1
$ zq -Z -I nest_dotted_nonadj.zed 'nest_dotted_nonadj()' data.json
{
    a: {
        b: {
            c: "foo",
            d: "bar"
        }
    },
    x: "baz"
}
{
    event: {
        duration: "0.043"
    },
    host: {
        ip: "55.3.244.1"
    },
    http: {
        request: {
            bytes: "15824",
            method: "GET"
        }
    },
    my_greedy_match: "other stuff",
    url: {
        original: "/index.html"
    }
}