elephant-bird json handling improvements

Open traviscrawford opened this issue 12 years ago • 1 comment

JSON is currently handled a little oddly. Pig can read JSON from any file input format; however, MapReduce jobs can only read JSON from LZO files. Additionally, parsing is done a little differently in the two places.

I think the ideal situation would be a json input format that wraps a given actual input format, converting each valid value to json. The json pig loader would simply use the json input format, and convert each record into a tuple (rather than parsing records into json like it does today).
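
To make the wrapping idea concrete, here is a rough sketch in plain Java. It is only an illustration under stated assumptions: `WrappingRecordReader` and `toyParse` are hypothetical names, not elephant-bird or Hadoop API, and `toyParse` is a stand-in for a real JSON parser. The wrapper delegates to any underlying source of raw values and hands back only the values that parse.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Optional;
import java.util.function.Function;

/**
 * Hypothetical sketch: wraps any source of raw values and yields only the
 * values that the supplied parser accepts. Not elephant-bird or Hadoop API.
 */
public class WrappingRecordReader<T> implements Iterator<T> {
    private final Iterator<String> delegate;              // the "actual" input
    private final Function<String, Optional<T>> parser;   // raw value -> record
    private T next;                                       // buffered valid record

    public WrappingRecordReader(Iterator<String> delegate,
                                Function<String, Optional<T>> parser) {
        this.delegate = delegate;
        this.parser = parser;
    }

    @Override
    public boolean hasNext() {
        // Skip raw values until one parses or the delegate is exhausted.
        while (next == null && delegate.hasNext()) {
            next = parser.apply(delegate.next()).orElse(null);
        }
        return next != null;
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        T result = next;
        next = null;
        return result;
    }

    /** Toy stand-in for a JSON parser: accepts text that looks like an object. */
    public static Optional<String> toyParse(String raw) {
        String s = raw.trim();
        return (s.startsWith("{") && s.endsWith("}")) ? Optional.of(s) : Optional.empty();
    }

    public static void main(String[] args) {
        List<String> raw = List.of("{\"a\": 1}", "not json", " {\"b\": 2} ");
        Iterator<String> reader =
                new WrappingRecordReader<>(raw.iterator(), WrappingRecordReader::toyParse);
        List<String> records = new ArrayList<>();
        reader.forEachRemaining(records::add);
        System.out.println(records); // the invalid middle value is skipped
    }
}
```

In this shape the Pig loader would only need the record-to-tuple step, since parsing and the skipping of invalid values already happen in the wrapper.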

Thoughts? I think it would be cool to separate the app data format (pig tuple) from the storage record format (json) and from the file format (lzo compressed file), so users can mix & match as they see fit. For example, there's no reason why someone who wants to compress files with snappy shouldn't be able to use the json & pig layers.
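
A small sketch of the mix & match idea, again in plain Java and under loud assumptions: `LayeredRead` and `Codec` are hypothetical names, gzip from the JDK stands in for lzo or snappy (neither ships with the JDK), and the record and app layers are toy functions rather than real JSON parsing or Pig tuples. The point is only that the same record and app layers run unchanged over different file layers.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical sketch of three independent layers; not elephant-bird API. */
public class LayeredRead {
    /** File layer: turns stored bytes into a plain byte stream. */
    public interface Codec {
        InputStream open(InputStream stored) throws IOException;
    }

    public static final Codec PLAIN = stored -> stored;
    public static final Codec GZIP = GZIPInputStream::new; // stand-in for lzo/snappy

    /** Reads lines through the codec, then applies the record and app layers. */
    public static <R, A> List<A> read(byte[] stored, Codec codec,
                                      Function<String, R> recordLayer,
                                      Function<R, A> appLayer) {
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                codec.open(new ByteArrayInputStream(stored)), StandardCharsets.UTF_8))) {
            return in.lines().map(recordLayer).map(appLayer).collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Helper for the demo: gzip-compresses a string. */
    public static byte[] gzip(String text) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buf)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        String data = "{\"id\": 1}\n{\"id\": 2}\n";
        Function<String, String> record = String::trim;       // toy record layer
        Function<String, List<String>> app = r -> List.of(r); // toy "tuple" layer

        List<List<String>> fromPlain =
                read(data.getBytes(StandardCharsets.UTF_8), PLAIN, record, app);
        List<List<String>> fromGzip = read(gzip(data), GZIP, record, app);

        // Only the file layer changed; the upper layers are untouched.
        System.out.println(fromPlain.equals(fromGzip));
    }
}
```

Swapping the codec is a one-argument change here, which is the kind of independence between layers I'm after.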

I started playing around with this idea in the recent json loader refactor, and tonight in https://github.com/traviscrawford/elephant-bird/tree/json_record_reader but want to get some feedback before going too far.

Mar 03 '12 06:03 traviscrawford

It will be easier to look at the diff if this is a pull request. Can you switch to a pull request?

Mar 05 '12 07:03 rangadi
