
Added Support for Handling Invalid JSON Records

Open gurmeetsaran opened this issue 10 years ago • 3 comments

Hi Kevin,

I have added support for handling invalid JSON records. Currently, the loader just filters out invalid JSON records. With this change it can instead return the invalid records, so the input can be segregated into good and bad records. This helps if we want to report invalid records to the upstream user who is generating them.

This behavior is enabled by setting the -invalidRecord option in the JsonLoader constructor. Example:

    source_data = LOAD '$input' USING com.intuit.iac.pig.udf.JsonLoader('-nestedLoad -invalidRecord') as (json:map[]);
    SPLIT source_data INTO
        source_data_good_record IF json#'error_string' is null,
        source_data_bad_record IF json#'error_string' != '';
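To make the intended behavior concrete, here is a minimal Python sketch of the idea, not the actual loader code. It assumes that an invalid record comes back as a map carrying an `error_string` key, as in the SPLIT example above. The `load_record` and `split_records` names, the `keep_invalid` flag, and the `raw_record` key are hypothetical and only for illustration.

```python
import json


def load_record(line, keep_invalid=False):
    """Parse one input line as a JSON object.

    keep_invalid=False mimics the current behavior (invalid records are dropped).
    keep_invalid=True mimics the proposed -invalidRecord option: the failure is
    returned as a map with an 'error_string' entry instead of being dropped.
    """
    try:
        value = json.loads(line)
        if not isinstance(value, dict):
            raise ValueError("top-level JSON value is not an object")
        return value
    except ValueError as exc:  # json.JSONDecodeError is a subclass of ValueError
        if not keep_invalid:
            return None
        # 'raw_record' is a hypothetical extra key, handy for reporting upstream.
        return {"error_string": str(exc), "raw_record": line}


def split_records(lines):
    """Segregate lines into good and bad records, like the Pig SPLIT above."""
    good, bad = [], []
    for line in lines:
        record = load_record(line, keep_invalid=True)
        if record.get("error_string") is None:
            good.append(record)
        else:
            bad.append(record)
    return good, bad


lines = ['{"id": 1}', '{"id": 2', 'not json']
good, bad = split_records(lines)
```

One caveat of this approach: a valid record that happens to contain its own `error_string` field would land in the bad bucket, so a less common key name may be safer.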

gurmeetsaran, Mar 14 '14 06:03

+1! This would be awesome to have. (Try debugging TBs of input data without something like this.)

ojilles, Apr 28 '14 08:04

> (Try debugging TBs of input data without something like this)

You can do this even now. The existing loader increments counters and logs such lines to the mapper's stderr, so you can check the per-task counters. Hadoop 2 makes it much easier to find the tasks that have a non-zero value for a specific counter.
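As a rough illustration of that counter-based approach, here is a small Python sketch that simulates it outside Hadoop. It is not the loader's real code: the counter names (`invalid_json`, `records_parsed`), the `run_task` helper, and the task ids are all made up for the example.

```python
import json
import sys
from collections import Counter


def run_task(lines, counters, log=sys.stderr):
    """Parse lines, counting failures and logging the offending lines to stderr."""
    records = []
    for line in lines:
        try:
            records.append(json.loads(line))
            counters["records_parsed"] += 1
        except ValueError:
            # Hypothetical counter name; a real job would use a Hadoop counter.
            counters["invalid_json"] += 1
            print("invalid JSON record: " + line, file=log)
    return records


# Two simulated map tasks; only task_1 contains a malformed line.
tasks = {
    "task_0": ['{"a": 1}', '{"b": 2}'],
    "task_1": ['{"c": 3}', '{"d": '],
}

per_task = {}
for task_id, lines in tasks.items():
    per_task[task_id] = Counter()
    run_task(lines, per_task[task_id])

# Tasks worth inspecting are those with a non-zero invalid-record counter.
suspect = [t for t, c in per_task.items() if c["invalid_json"] > 0]
```

This finds which tasks saw bad input, but the bad lines themselves stay in the task logs rather than in the job output.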

rangadi, Apr 28 '14 17:04

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


gurmeetsaran does not seem to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

CLAassistant, Jul 18 '19 15:07