elephant-bird
Provide an option for the LZO input format not to read the index file (or to read it remotely)
The index for an LZO file is read on the client while computing the splits. For large inputs this takes a very long time, since the files are read serially.
Sometimes users may not need to split the files at all (say, there are already lots of files), so a simple option to disable reading the index might be good enough.
Another option is to read the index on the remote tasks. The client would create splits without consulting the index, and each record reader would then adjust its own split boundaries based on the index.
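
To make the second option concrete, here is a rough sketch of how a record reader might adjust its split once it has loaded the index on the task side. It is not existing elephant-bird code: the class `LzoSplitAligner`, its methods, and the representation of the index as a sorted array of compressed block start offsets are all assumptions made for illustration. The idea is that the start moves forward to the next block boundary, and the end moves forward to the next block boundary as well (or to the end of the file), so that neighbouring splits neither overlap nor leave gaps.

```java
import java.util.Arrays;

/** Hypothetical helper, for illustration only: aligns a byte-range split to LZO block boundaries. */
final class LzoSplitAligner {
    static final long NOT_FOUND = -1L;

    /** Returns the first block offset that is >= pos, or NOT_FOUND if there is none. */
    static long nextBlockStart(long[] sortedBlockOffsets, long pos) {
        int i = Arrays.binarySearch(sortedBlockOffsets, pos);
        if (i < 0) {
            i = -i - 1; // not an exact match: convert to the insertion point
        }
        return i < sortedBlockOffsets.length ? sortedBlockOffsets[i] : NOT_FOUND;
    }

    /**
     * Returns {alignedStart, alignedEnd}, or null when no block begins inside
     * [start, end), in which case this reader has nothing to read.
     */
    static long[] align(long[] sortedBlockOffsets, long start, long end, long fileLength) {
        // Assumption: the split starting at offset 0 keeps its start so the
        // reader still sees the file header.
        long newStart = (start == 0) ? 0 : nextBlockStart(sortedBlockOffsets, start);
        if (newStart == NOT_FOUND || newStart >= end) {
            return null;
        }
        // The block that straddles 'end' belongs to this split, so read up to
        // the next block boundary, or to the end of the file if there is none.
        long newEnd = nextBlockStart(sortedBlockOffsets, end);
        if (newEnd == NOT_FOUND) {
            newEnd = fileLength;
        }
        return new long[] {newStart, newEnd};
    }
}
```

For example, with block offsets `{100, 1000, 2000, 3000}` and a 3500-byte file, the raw splits `[0, 1500)`, `[1500, 3000)` and `[3000, 3500)` would become `[0, 2000)`, `[2000, 3000)` and `[3000, 3500)`. A split such as `[1100, 1900)` contains no block start, so its reader would simply produce no records. The trade-off is that every task has to open the index file itself, but those reads happen in parallel rather than serially on the client.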