
Cascalog should support nested tuples

Open sritchie opened this issue 13 years ago • 8 comments

e.g.

(<- [?blarg ?pivot] (src ?blarg ?blurg) (pivot ?blurg :>> ?pivot))

Using :>> with a single var would capture the output as a nested tuple (just a seq of fields).

Unclear how to handle nested serialization. Perhaps Cascading can handle nested tuples?

sritchie avatar Dec 06 '11 02:12 sritchie

Nested serialization is now trivial with Kryo.

sritchie avatar Dec 06 '11 02:12 sritchie

I wonder what the efficiency (and syntax) of arbitrary destructuring forms in a query might be:

(let [src [[1 2 [[3] 4]]]]
   (<- [?three]
        (src [_ _ [[?three] ?four]])))
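
For comparison, plain Clojure destructuring already handles the same nested shape outside a query. This is only a sketch of the binding behaviour being asked for, not Cascalog syntax; the helper name `threes` is made up for illustration:

```clojure
(def src [[1 2 [[3] 4]]])

(defn threes
  "Pull the innermost value out of each nested row using ordinary
  sequential destructuring. Hypothetical helper, not part of Cascalog."
  [rows]
  (for [[_ _ [[three] _four]] rows]
    three))

(threes src)
```

A query-level version would need to do the same positional matching against nested tuples rather than Clojure vectors.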

sritchie avatar Jan 23 '12 18:01 sritchie

Arbitrary destructuring syntax is essential for coping with schemes such as Protobuf messages read through elephant-bird. The "flatten" keyword in Pig Latin works quite well in this situation.

isaiah avatar Mar 08 '12 07:03 isaiah

Interesting, can you give me an example of how this would look in Cascalog, with a protobuf message?

sritchie avatar Mar 08 '12 08:03 sritchie

We sort of talked about this on the forum already.

message DateOfBirth {
  message Date {
    required int32 year = 1;
    required int32 month = 2;
    required int32 day = 3;
  }
  required int64 timestamp = 1;
  optional string user_id = 2;
  required Date date = 3;
}

To access this, we're currently using:

;; Pull year, month and day out of the nested date tuple by position.
(defn to-dob-y-m-d [x]
  (let [y  (.getInteger x 0)
        m  (.getInteger x 1)
        d  (.getInteger x 2)]
    [y m d]))

(defn dob-generator [dir]
  (let [src    (hfs-protobuf dir Customer$DateOfBirth     ;; custom tap
                             :outfields customer-date-of-birth-names)]
    (<- customer-date-of-birth-fields
        (src :>> (to-cascalog-fields customer-date-of-birth-names))
        (to-dob-y-m-d ?date :> ?dob-year ?dob-month ?dob-day))))

Ideally, a destructuring approach (suggested by @pingles) would let us skip to-dob-y-m-d, so that we could write:

(defn dob-generator [dir]
  (let [src    (hfs-protobuf dir Customer$DateOfBirth     ;; custom tap
                             :outfields customer-date-of-birth-names)]
    (<- customer-date-of-birth-fields
        (src ?timestamp ?user_id [?dob-year ?dob-month ?dob-day]))))

Quantisan avatar Mar 08 '12 10:03 Quantisan

I bet we could do this through something like a Cascalog destructuring protocol. Cascalog could provide implementations for the sequential data structures in Clojure, and users could extend the protocol to Thrift and Protobuf objects. This would be awesome.
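
A rough sketch of what such a protocol might look like, in plain Clojure with no Cascalog dependency. Every name here (`Destructurable`, `fields-of`, `bind-pattern`, `DobDate`) is hypothetical, and the record merely stands in for a Protobuf or Thrift object:

```clojure
;; Hypothetical sketch: none of these names exist in Cascalog.
(defprotocol Destructurable
  (fields-of [x] "Return the positional fields of x as a vector."))

;; Stand-in for a generated Protobuf/Thrift class.
(defrecord DobDate [year month day])

(extend-protocol Destructurable
  ;; Default implementation for Clojure's sequential collections.
  clojure.lang.Sequential
  (fields-of [x] (vec x))
  ;; A user-supplied extension for a message type.
  DobDate
  (fields-of [d] [(:year d) (:month d) (:day d)]))

(defn bind-pattern
  "Match a nested pattern of symbols against a value and return a map
  of symbol -> value. The symbol _ is ignored."
  [pattern value]
  (cond
    (= '_ pattern)    {}
    (symbol? pattern) {pattern value}
    :else             (apply merge {}
                             (map bind-pattern pattern (fields-of value)))))

(bind-pattern '[?timestamp ?user_id [?dob-year ?dob-month ?dob-day]]
              [100 "u1" (->DobDate 1980 5 17)])
```

The point of going through a protocol is that the pattern walker never needs to know about the message type; it only asks each value for its fields.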

I think it's time for me to buckle down and learn a bit more about core.logic so we can start pulling more ideas and syntax from that project. (Cascalog's Datalog is different from the Prolog in core.logic, but I'd like to follow their lead with destructuring, at least.)

I'm not sure when I'll be able to get to this, but I'd be happy to accept a pull request with some initial work. What do you think?

sritchie avatar Mar 08 '12 18:03 sritchie

Sounds good to me. I'll start looking into it.

Quantisan avatar Mar 09 '12 09:03 Quantisan

Nice idea, Sam. I'll see if Paul and I can take a look some time today. I've not looked through much of Cascalog yet, so any pointers to the current destructuring/var binding from tuples would be cool.

Time to dive into core.logic then :)

pingles avatar Mar 09 '12 09:03 pingles