elephant-bird
json handling improvements
JSON handling is currently inconsistent. Pig can read JSON from any file input format, but MapReduce jobs can only read JSON from LZO files. The parsing logic also differs slightly between the two code paths.
I think the ideal situation would be a JSON input format that wraps a given underlying input format, converting each valid value to JSON. The JSON Pig loader would simply use the JSON input format and convert each record into a tuple (rather than parsing records into JSON itself, as it does today).
Thoughts? I think it would be cool to separate the app data format (Pig tuple) from the storage record format (JSON) and from the file format (LZO-compressed file), so users can mix and match as they see fit. For example, there's no reason why someone who wants to compress files with Snappy shouldn't be able to use the JSON and Pig layers.
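To make the layering concrete, here is a rough, language-neutral sketch in Python rather than actual Hadoop or Pig code. All function names (read_lines, json_records, to_tuples) are hypothetical, and gzip and bz2 stand in for LZO and Snappy only because they ship with the standard library. It assumes one JSON object per line and that invalid values are skipped rather than reported.

```python
import bz2
import gzip
import io
import json
from typing import Any, Callable, Dict, Iterable, Iterator, Sequence, Tuple


def read_lines(data: bytes, opener: Callable) -> Iterator[str]:
    """File layer: decompress bytes with the given opener and yield text lines."""
    with opener(io.BytesIO(data), "rt", encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")


def json_records(raw_values: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Record layer: wrap any source of raw values and yield JSON objects.

    Values that fail to parse, or that are not JSON objects, are skipped.
    """
    for value in raw_values:
        try:
            parsed = json.loads(value)
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            continue
        if isinstance(parsed, dict):
            yield parsed


def to_tuples(records: Iterable[Dict[str, Any]],
              fields: Sequence[str]) -> Iterator[Tuple[Any, ...]]:
    """App layer: project each JSON object onto a fixed list of fields."""
    for record in records:
        yield tuple(record.get(field) for field in fields)


def load(data: bytes, opener: Callable, fields: Sequence[str]) -> list:
    """Compose the three layers; only the opener depends on the file format."""
    return list(to_tuples(json_records(read_lines(data, opener)), fields))
```

With this shape, swapping the compression only changes the opener passed in, e.g. load(gzip.compress(payload), gzip.open, ["id", "name"]) versus load(bz2.compress(payload), bz2.open, ["id", "name"]). The record and tuple layers stay untouched, which is the mix-and-match property described above.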
I started playing around with this idea in the recent JSON loader refactor, and tonight in https://github.com/traviscrawford/elephant-bird/tree/json_record_reader, but I want to get some feedback before going too far.
It will be easier to look at the diff if this is a pull request. Can you switch to a pull request?