elephant-bird
SequenceFileLoader fails if given comma-separated list of files
data = LOAD 'hdfs://localhost//foo/23,hdfs://localhost/foo/24' USING com.twitter.elephantbird.pig.load.SequenceFileLoader();
produces:
Backend error message during job submission
org.apache.pig.backend.executionengine.ExecException: ERROR 2118: java.net.URISyntaxException: Illegal character in scheme name at index 0: 23,hdfs:
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:280)
at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:944)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:961)
at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:880)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:833)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:833)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:807)
at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
at org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException: Illegal character in scheme name at index 0: 23,hdfs:
at org.apache.hadoop.fs.Path.initialize(Path.java:148)
at org.apache.hadoop.fs.Path.
Verified this behavior -- it's a real bug, thanks. We'll check it out.
Thanks, Dmitriy. As a workaround, I tried using a Hadoop-style glob:
LOAD '{hdfs://localhost//foo/23,hdfs://localhost/foo/24}'
This also failed, because Pig (0.8.1) treated it as a relative path and tried to dereference it:
java.lang.IllegalArgumentException: java.net.URISyntaxException: Illegal character in scheme name at index 0: {hdfs
at org.apache.hadoop.fs.Path.initialize(Path.java:148)
at org.apache.hadoop.fs.Path.
Any ideas on how I can hack around this bug until a fix is ready?
2 loads and a union?
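For example, something along these lines, using paths like those in the original report (just a sketch; I haven't run it against this loader):

-- load each file separately, then merge the two relations
part23 = LOAD 'hdfs://localhost/foo/23' USING com.twitter.elephantbird.pig.load.SequenceFileLoader();
part24 = LOAD 'hdfs://localhost/foo/24' USING com.twitter.elephantbird.pig.load.SequenceFileLoader();
data = UNION part23, part24;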
Unfortunately it's actually about 300 files, not just 2. I'll just fall back on indexed LZO for the time being.
This is really an issue with SequenceFileInputFormat in MapReduce. This class needs to override listFiles().
BTW, you can try LOAD 'hdfs://localhost//foo/{23,24}' ...
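That is, keep the braces inside the path portion so the URI parser never sees '{' at the start of the scheme. A complete statement along those lines (untested here, and assuming Pig passes the Hadoop-style glob through to the underlying filesystem) would be:

data = LOAD 'hdfs://localhost/foo/{23,24}' USING com.twitter.elephantbird.pig.load.SequenceFileLoader();

For the ~300-file case, a wider glob such as 'hdfs://localhost/foo/*' could cover them in one LOAD if they all live under a single directory.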