Cascalog should support nested tuples
e.g.
(<- [?blarg ?pivot] (src ?blarg ?blurg) (pivot ?blurg :>> ?pivot))
:>> into a var will capture the output into a nested tuple (just a seq of fields).
Unclear how to handle nested serialization. Perhaps Cascading can handle nested tuples?
Nested serialization is now trivial with Kryo.
I wonder what the efficiency (and syntax) of arbitrary destructuring forms in a query might be:
(let [src [[1 2 [[3] 4]]]]
  (<- [?three]
      (src [_ _ [[?three] ?four]])))
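For comparison, here is a minimal sketch of the same shape destructured with ordinary Clojure binding forms, outside of any query. It only shows the semantics the proposed syntax would mirror; `threes` is a hypothetical helper name, and nothing here is Cascalog API.

```clojure
;; Plain Clojure, no Cascalog involved.
;; `threes` is a hypothetical name used only for illustration.
(defn threes
  "Pull the innermost first element out of each nested tuple in src."
  [src]
  (for [[_ _ [[three] _]] src]  ; same nesting as the query form above
    three))

(threes [[1 2 [[3] 4]]])
;; => (3)
```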
Arbitrary destructuring syntax is essential for coping with schemes like the Protobuf messages handled by elephant-bird. The "flatten" keyword in Pig Latin works quite well in this situation.
Interesting, can you give me an example of how this would look in Cascalog, with a protobuf message?
We sort of talked about this on the forum already.
message DateOfBirth {
  message Date {
    required int32 year = 1;
    required int32 month = 2;
    required int32 day = 3;
  }
  required int64 timestamp = 1;
  optional string user_id = 2;
  required Date date = 3;
}
To access this, we're currently using:
(defn to-dob-y-m-d [x]
  (let [y (.getInteger x 0)
        m (.getInteger x 1)
        d (.getInteger x 2)]
    [y m d]))

(defn dob-generator [dir]
  (let [src (hfs-protobuf dir Customer$DateOfBirth ;; custom tap
                          :outfields customer-date-of-birth-names)]
    (<- customer-date-of-birth-fields
        (src :>> (to-cascalog-fields customer-date-of-birth-names))
        (to-dob-y-m-d ?date :> ?dob-year ?dob-month ?dob-day))))
Ideally, a destructuring approach (suggested by @pingles) would let us skip to-dob-y-m-d, so that we could write:
(defn dob-generator [dir]
  (let [src (hfs-protobuf dir Customer$DateOfBirth ;; custom tap
                          :outfields customer-date-of-birth-names)]
    (<- customer-date-of-birth-fields
        (src ?timestamp ?user_id [?dob-year ?dob-month ?dob-day]))))
I bet we could do this through something like a Cascalog destructuring protocol. Cascalog could provide implementations for the sequential data structures in Clojure, and users could extend the protocol to Thrift and Protobuf objects. This would be awesome.
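A rough sketch of what such a protocol might look like, in plain Clojure. Every name here (`IDestructure`, `nested-fields`, `bind`, and the `Date` record standing in for a generated Protobuf class) is hypothetical and made up for illustration; none of it exists in Cascalog, and a real implementation would have to hook into the query compiler rather than return a map.

```clojure
;; Hypothetical protocol: not part of Cascalog.
(defprotocol IDestructure
  (nested-fields [this] "Return this value's fields as an ordered seq."))

;; Implementation the library could ship for Clojure's sequential types.
(extend-protocol IDestructure
  clojure.lang.Sequential
  (nested-fields [this] (seq this)))

;; Stand-in for a generated Protobuf/Thrift message class (assumption).
(defrecord Date [year month day])

;; What a user would write to opt their own message type in.
(extend-protocol IDestructure
  Date
  (nested-fields [this] [(:year this) (:month this) (:day this)]))

(defn bind
  "Match a nested pattern of symbols against a value and return a map of
  symbol -> value. The symbol _ matches anything and binds nothing."
  [pattern value]
  (cond
    (= '_ pattern)    {}
    (symbol? pattern) {pattern value}
    :else             (apply merge {} (map bind pattern (nested-fields value)))))

(bind '[_ _ [[?three] ?four]] [1 2 [[3] 4]])
;; => {?three 3, ?four 4}

(bind '[?timestamp ?user_id [?dob-year ?dob-month ?dob-day]]
      [1325376000 "u1" (->Date 1980 5 17)])
;; => {?timestamp 1325376000, ?user_id "u1",
;;     ?dob-year 1980, ?dob-month 5, ?dob-day 17}
```

Dispatching through a protocol keeps the pattern walker ignorant of the concrete types: supporting a new message class only needs one more `extend-protocol` form.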
I think it's time for me to buckle down and learn a bit more about core.logic so we can start pulling more ideas and syntax from that project. (Cascalog's Datalog is different from the Prolog in core.logic, but I'd like to follow their lead with destructuring, at least.)
I'm not sure when I'll be able to get to this, but I'd be happy to accept a pull request with some initial work. What do you think?
Sounds good to me. Will start looking into it.
Nice idea, Sam. I'll see if Paul and I can take a look some time today. I haven't looked through much of Cascalog yet, so any pointers to the current destructuring/var binding from tuples would be cool.
Time to dive into core.logic then :)