elephant-bird
SequenceFileLoader fails if given comma-separated list of files
data = LOAD 'hdfs://localhost//foo/23,hdfs://localhost/foo/24' USING com.twitter.elephantbird.pig.load.SequenceFileLoader();
produces:
Backend error message during job submission
org.apache.pig.backend.executionengine.ExecException: ERROR 2118: java.net.URISyntaxException: Illegal character in scheme name at index 0: 23,hdfs:
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:280)
at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:944)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:961)
at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:880)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:833)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:833)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:807)
at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
at org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException: Illegal character in scheme name at index 0: 23,hdfs:
at org.apache.hadoop.fs.Path.initialize(Path.java:148)
at org.apache.hadoop.fs.Path.
Verified this behavior -- it's a real bug, thanks. We'll check it out.
Thanks, Dmitriy. As a workaround, I tried using a Hadoop-style glob:
LOAD '{hdfs://localhost//foo/23,hdfs://localhost/foo/24}'
This also failed, because Pig (0.8.1) treated it as a relative path and tried to dereference it:
java.lang.IllegalArgumentException: java.net.URISyntaxException: Illegal character in scheme name at index 0: {hdfs
at org.apache.hadoop.fs.Path.initialize(Path.java:148)
at org.apache.hadoop.fs.Path.
Any ideas on how I can hack around this bug until a fix is ready?
2 loads and a union?
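For example, something along these lines, using paths like those in the original report (just a sketch; I haven't run it against this loader):

-- load each file separately, then merge the two relations
part23 = LOAD 'hdfs://localhost/foo/23' USING com.twitter.elephantbird.pig.load.SequenceFileLoader();
part24 = LOAD 'hdfs://localhost/foo/24' USING com.twitter.elephantbird.pig.load.SequenceFileLoader();
data = UNION part23, part24;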
Unfortunately it's actually about 300 files, not just 2. I'll just fall back on indexed LZO for the time being.
This is really an issue with SequenceFileInputFormat in MapReduce. This class needs to override listFiles().
BTW, you can try LOAD 'hdfs://localhost//foo/{23,24}' ...
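That is, keep the braces inside the path portion so the URI parser never sees '{' at the start of the scheme. A complete statement along those lines (untested here, and assuming Pig passes the Hadoop-style glob through to the underlying filesystem) would be:

data = LOAD 'hdfs://localhost/foo/{23,24}' USING com.twitter.elephantbird.pig.load.SequenceFileLoader();

For the ~300-file case, a wider glob such as 'hdfs://localhost/foo/*' could cover them in one LOAD if they all live under a single directory.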