Py4JJavaError: An error occurred while calling o2.iterator.

Open Vimos opened this issue 8 years ago • 7 comments
Hi, I am trying to read an orc file.

In [1]: from orcreader import OrcReader
   ...: reader = OrcReader('dt=2017-05-14_os=android_part=000004_0')
   ...: reader.open()
   ...: 

I can successfully retrieve the schema:

In [3]: reader.schema()
Out[3]: 
OrderedDict([(u'log_id', u'string'),
             (u'city_id', u'string'),
             (u'city_name', u'string'),
             (u'city_name_en', u'string'),
             (u'province_id', u'string'),
             (u'province_name', u'string'),
 ...  
             (u'activity_flag', u'string')])

But when I try to read rows, it reports the following error:

In [2]: for row in reader:
   ...:     print row
   ...:     
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-2-df0cbab3b6b5> in <module>()
----> 1 for row in reader:
      2     print row
      3 

/usr/local/lib/python2.7/dist-packages/python_orc-0.0.1-py2.7.egg/orcreader/reader.pyc in __iter__(self)
     79 
     80     def __iter__(self):
---> 81         return OrcRecordIterator(self.reader.iterator())
     82 
     83     def __enter__(self):

/usr/local/lib/python2.7/dist-packages/py4j-0.10.4-py2.7.egg/py4j/java_gateway.pyc in __call__(self, *args)
   1131         answer = self.gateway_client.send_command(command)
   1132         return_value = get_return_value(
-> 1133             answer, self.gateway_client, self.target_id, self.name)
   1134 
   1135         for temp_arg in temp_args:

/usr/local/lib/python2.7/dist-packages/py4j-0.10.4-py2.7.egg/py4j/protocol.pyc in get_return_value(answer, gateway_client, target_id, name)
    317                 raise Py4JJavaError(
    318                     "An error occurred while calling {0}{1}{2}.\n".
--> 319                     format(target_id, ".", name), value)
    320             else:
    321                 raise Py4JError(

Py4JJavaError: An error occurred while calling o2.iterator.
: java.lang.RuntimeException: Unable to init iterator
	at com.pythonorc.SimplifiedOrcReader.iterator(SimplifiedOrcReader.java:72)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:280)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:748)

Any suggestions on how to debug this error?

Vimos avatar May 15 '17 06:05 Vimos

Is it possible to share the ORC file? I can try to take a look at it.

nqbao avatar May 15 '17 06:05 nqbao

I'm sorry, but I am not permitted to share the data. I can offer more debug information from the Java side.

/usr/lib/jvm/java-8-oracle/bin/java -agentlib:jdwp=transport=dt_socket,address=127.0.0.1:32805,suspend=y,server=n -Dfile.encoding=UTF-8 -classpath /usr/lib/jvm/java-8-oracle/jre/lib/charsets.jar:/usr/lib/jvm/java-8-oracle/jre/lib/deploy.jar:/usr/lib/jvm/java-8-oracle/jre/lib/ext/cldrdata.jar:/usr/lib/jvm/java-8-oracle/jre/lib/ext/dnsns.jar:/usr/lib/jvm/java-8-oracle/jre/lib/ext/jaccess.jar:/usr/lib/jvm/java-8-oracle/jre/lib/ext/jfxrt.jar:/usr/lib/jvm/java-8-oracle/jre/lib/ext/localedata.jar:/usr/lib/jvm/java-8-oracle/jre/lib/ext/nashorn.jar:/usr/lib/jvm/java-8-oracle/jre/lib/ext/sunec.jar:/usr/lib/jvm/java-8-oracle/jre/lib/ext/sunjce_provider.jar:/usr/lib/jvm/java-8-oracle/jre/lib/ext/sunpkcs11.jar:/usr/lib/jvm/java-8-oracle/jre/lib/ext/zipfs.jar:/usr/lib/jvm/java-8-oracle/jre/lib/javaws.jar:/usr/lib/jvm/java-8-oracle/jre/lib/jce.jar:/usr/lib/jvm/java-8-oracle/jre/lib/jfr.jar:/usr/lib/jvm/java-8-oracle/jre/lib/jfxswt.jar:/usr/lib/jvm/java-8-oracle/jre/lib/jsse.jar:/usr/lib/jvm/java-8-oracle/jre/lib/management-agent.jar:/usr/lib/jvm/java-8-oracle/jre/lib/plugin.jar:/usr/lib/jvm/java-8-oracle/jre/lib/resources.jar:/usr/lib/jvm/java-8-oracle/jre/lib/rt.jar:/home/vimos/Public/github/ml/python-orc/java-gateway/target/classes:/data/home/vimos/.m2/repository/net/sf/py4j/py4j/0.10.2.1/py4j-0.10.2.1.jar:/data/home/vimos/.m2/repository/org/apache/orc/orc-core/1.1.1/orc-core-1.1.1.jar:/data/home/vimos/.m2/repository/com/google/protobuf/protobuf-java/2.5.0/protobuf-java-2.5.0.jar:/data/home/vimos/.m2/repository/commons-cli/commons-cli/1.2/commons-cli-1.2.jar:/data/home/vimos/.m2/repository/commons-lang/commons-lang/2.6/commons-lang-2.6.jar:/data/home/vimos/.m2/repository/org/apache/hadoop/hadoop-common/2.6.0/hadoop-common-2.6.0.jar:/data/home/vimos/.m2/repository/org/apache/hadoop/hadoop-annotations/2.6.0/hadoop-annotations-2.6.0.jar:/usr/lib/jvm/java-8-oracle/lib/tools.jar:/data/home/vimos/.m2/repository/org/apache/commons/commons-math3/3.1.1/commons-math3-3.1.1.jar:/data/home
/vimos/.m2/repository/xmlenc/xmlenc/0.52/xmlenc-0.52.jar:/data/home/vimos/.m2/repository/commons-httpclient/commons-httpclient/3.1/commons-httpclient-3.1.jar:/data/home/vimos/.m2/repository/commons-codec/commons-codec/1.4/commons-codec-1.4.jar:/data/home/vimos/.m2/repository/commons-io/commons-io/2.4/commons-io-2.4.jar:/data/home/vimos/.m2/repository/commons-net/commons-net/3.1/commons-net-3.1.jar:/data/home/vimos/.m2/repository/commons-collections/commons-collections/3.2.1/commons-collections-3.2.1.jar:/data/home/vimos/.m2/repository/com/sun/jersey/jersey-core/1.9/jersey-core-1.9.jar:/data/home/vimos/.m2/repository/com/sun/jersey/jersey-json/1.9/jersey-json-1.9.jar:/data/home/vimos/.m2/repository/com/sun/xml/bind/jaxb-impl/2.2.3-1/jaxb-impl-2.2.3-1.jar:/data/home/vimos/.m2/repository/javax/xml/bind/jaxb-api/2.2.2/jaxb-api-2.2.2.jar:/data/home/vimos/.m2/repository/javax/xml/stream/stax-api/1.0-2/stax-api-1.0-2.jar:/data/home/vimos/.m2/repository/javax/activation/activation/1.1/activation-1.1.jar:/data/home/vimos/.m2/repository/org/codehaus/jackson/jackson-jaxrs/1.8.3/jackson-jaxrs-1.8.3.jar:/data/home/vimos/.m2/repository/org/codehaus/jackson/jackson-xc/1.8.3/jackson-xc-1.8.3.jar:/data/home/vimos/.m2/repository/com/sun/jersey/jersey-server/1.9/jersey-server-1.9.jar:/data/home/vimos/.m2/repository/asm/asm/3.1/asm-3.1.jar:/data/home/vimos/.m2/repository/tomcat/jasper-compiler/5.5.23/jasper-compiler-5.5.23.jar:/data/home/vimos/.m2/repository/tomcat/jasper-runtime/5.5.23/jasper-runtime-5.5.23.jar:/data/home/vimos/.m2/repository/commons-el/commons-el/1.0/commons-el-1.0.jar:/data/home/vimos/.m2/repository/commons-logging/commons-logging/1.1.3/commons-logging-1.1.3.jar:/data/home/vimos/.m2/repository/log4j/log4j/1.2.17/log4j-1.2.17.jar:/data/home/vimos/.m2/repository/net/java/dev/jets3t/jets3t/0.9.0/jets3t-0.9.0.jar:/data/home/vimos/.m2/repository/org/apache/httpcomponents/httpclient/4.1.2/httpclient-4.1.2.jar:/data/home/vimos/.m2/repository/org/apache/httpcomponents/httpc
ore/4.1.2/httpcore-4.1.2.jar:/data/home/vimos/.m2/repository/com/jamesmurty/utils/java-xmlbuilder/0.4/java-xmlbuilder-0.4.jar:/data/home/vimos/.m2/repository/commons-configuration/commons-configuration/1.6/commons-configuration-1.6.jar:/data/home/vimos/.m2/repository/commons-digester/commons-digester/1.8/commons-digester-1.8.jar:/data/home/vimos/.m2/repository/commons-beanutils/commons-beanutils/1.7.0/commons-beanutils-1.7.0.jar:/data/home/vimos/.m2/repository/commons-beanutils/commons-beanutils-core/1.8.0/commons-beanutils-core-1.8.0.jar:/data/home/vimos/.m2/repository/org/slf4j/slf4j-log4j12/1.7.5/slf4j-log4j12-1.7.5.jar:/data/home/vimos/.m2/repository/org/codehaus/jackson/jackson-core-asl/1.9.13/jackson-core-asl-1.9.13.jar:/data/home/vimos/.m2/repository/org/codehaus/jackson/jackson-mapper-asl/1.9.13/jackson-mapper-asl-1.9.13.jar:/data/home/vimos/.m2/repository/com/google/code/gson/gson/2.2.4/gson-2.2.4.jar:/data/home/vimos/.m2/repository/org/apache/hadoop/hadoop-auth/2.6.0/hadoop-auth-2.6.0.jar:/data/home/vimos/.m2/repository/org/apache/directory/server/apacheds-kerberos-codec/2.0.0-M15/apacheds-kerberos-codec-2.0.0-M15.jar:/data/home/vimos/.m2/repository/org/apache/directory/server/apacheds-i18n/2.0.0-M15/apacheds-i18n-2.0.0-M15.jar:/data/home/vimos/.m2/repository/org/apache/directory/api/api-asn1-api/1.0.0-M20/api-asn1-api-1.0.0-M20.jar:/data/home/vimos/.m2/repository/org/apache/directory/api/api-util/1.0.0-M20/api-util-1.0.0-M20.jar:/data/home/vimos/.m2/repository/org/apache/curator/curator-framework/2.6.0/curator-framework-2.6.0.jar:/data/home/vimos/.m2/repository/com/jcraft/jsch/0.1.42/jsch-0.1.42.jar:/data/home/vimos/.m2/repository/org/apache/curator/curator-client/2.6.0/curator-client-2.6.0.jar:/data/home/vimos/.m2/repository/org/apache/curator/curator-recipes/2.6.0/curator-recipes-2.6.0.jar:/data/home/vimos/.m2/repository/org/htrace/htrace-core/3.0.4/htrace-core-3.0.4.jar:/data/home/vimos/.m2/repository/org/apache/zookeeper/zookeeper/3.4.6/zookeeper-3.4.
6.jar:/data/home/vimos/.m2/repository/org/apache/commons/commons-compress/1.4.1/commons-compress-1.4.1.jar:/data/home/vimos/.m2/repository/org/tukaani/xz/1.0/xz-1.0.jar:/data/home/vimos/.m2/repository/org/apache/hadoop/hadoop-hdfs/2.6.0/hadoop-hdfs-2.6.0.jar:/data/home/vimos/.m2/repository/commons-daemon/commons-daemon/1.0.13/commons-daemon-1.0.13.jar:/data/home/vimos/.m2/repository/io/netty/netty/3.6.2.Final/netty-3.6.2.Final.jar:/data/home/vimos/.m2/repository/xerces/xercesImpl/2.9.1/xercesImpl-2.9.1.jar:/data/home/vimos/.m2/repository/xml-apis/xml-apis/1.3.04/xml-apis-1.3.04.jar:/data/home/vimos/.m2/repository/org/apache/hive/hive-storage-api/2.1.0-pre-orc/hive-storage-api-2.1.0-pre-orc.jar:/data/home/vimos/.m2/repository/org/codehaus/jettison/jettison/1.1/jettison-1.1.jar:/data/home/vimos/.m2/repository/stax/stax-api/1.0.1/stax-api-1.0.1.jar:/data/home/vimos/.m2/repository/org/iq80/snappy/snappy/0.2/snappy-0.2.jar:/data/home/vimos/.m2/repository/org/slf4j/slf4j-api/1.7.5/slf4j-api-1.7.5.jar:/data/home/vimos/.m2/repository/com/google/guava/guava/14.0.1/guava-14.0.1.jar:/opt/jetbrains/idea-IU-171.4249.39/lib/idea_rt.jar com.pythonorc.SimplifiedOrcReader
Connected to the target VM, address: '127.0.0.1:32805', transport: 'socket'
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
[log_id, mhotelid, city_id, city_name, city_name_en, province_id, province_name, province_name_en, m_city_id, m_province_id, log_type, uid, source, os_version, card_number, user_name, user_ip, appid, user_agent, trace_id, latitude, longitude, carrier, userinfo_channel, level, model, brand, orderid, proxyid, caller_attr_channel, economic_hotel, fast_filter_keywords, mhotel_ids, return_has_xianfu_hotel, return_has_yufu_hotel, hotel_brand_id, only_limitime_sale, facility_ids, theme_ids, star_rates, district_id, district_type, price_pair, payment_methods, nearby, poi_id, region_id, check_in, check_out, id, executetime, keywords, setkeywords, setbrandid, setstarrates, inner_search_type, hotel_group_id, sorting_method, setfilterattr, mrankflag, mranktype, setnearby, setfastfilter_attr, setpoi_id, sethotel_group_id, response_mhotelids, setprice_pair, star_ratessize, facility_idssize, setdistrict_type, settheme_ids, setdistrict_id, settrace_id, pageindex, pagesize, recreqattrtype, ifun, crawled_flag, geo_type, activity_flag]
80
1149130
Exception in thread "main" java.lang.IllegalArgumentException: Buffer size too small. size = 262144 needed = 3863789
	at org.apache.orc.impl.InStream$CompressedStream.readHeader(InStream.java:217)
	at org.apache.orc.impl.InStream$CompressedStream.read(InStream.java:262)
	at java.io.InputStream.read(InputStream.java:101)
	at com.google.protobuf.CodedInputStream.refillBuffer(CodedInputStream.java:737)
	at com.google.protobuf.CodedInputStream.isAtEnd(CodedInputStream.java:701)
	at com.google.protobuf.CodedInputStream.readTag(CodedInputStream.java:99)
	at org.apache.orc.OrcProto$StripeFooter.<init>(OrcProto.java:10679)
	at org.apache.orc.OrcProto$StripeFooter.<init>(OrcProto.java:10643)
	at org.apache.orc.OrcProto$StripeFooter$1.parsePartialFrom(OrcProto.java:10748)
	at org.apache.orc.OrcProto$StripeFooter$1.parsePartialFrom(OrcProto.java:10743)
	at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:89)
	at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:95)
	at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
	at org.apache.orc.OrcProto$StripeFooter.parseFrom(OrcProto.java:10976)
	at org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.readStripeFooter(RecordReaderUtils.java:165)
	at org.apache.orc.impl.RecordReaderImpl.readStripeFooter(RecordReaderImpl.java:236)
	at org.apache.orc.impl.RecordReaderImpl.beginReadStripe(RecordReaderImpl.java:849)
	at org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:820)
	at org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:977)
	at org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1012)
	at org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:212)
	at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:579)
	at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:566)
	at com.pythonorc.SimplifiedOrcReader.iterator(SimplifiedOrcReader.java:70)
	at com.pythonorc.SimplifiedOrcReader.main(SimplifiedOrcReader.java:285)
Disconnected from the target VM, address: '127.0.0.1:32805', transport: 'socket'

Vimos avatar May 15 '17 06:05 Vimos

That gives me a hint; let me try to work on it.

nqbao avatar May 16 '17 04:05 nqbao

I don't have a sample that reproduces this error. If I understand correctly, the bufferSize is stored in the footer of the ORC file, so the bufferSize recorded there may be incorrect for some reason.

Could you check out the branch add-fetch-filemetainfo, build again, then fetch reader.fileMetaInfo and paste the result back here? A sample output would be:

{u'metadataSize': u'250', u'compressionType': u'ZLIB', u'writerVersion': u'1', u'versionLists': u'[0, 12]', u'bufferSize': u'10000'}

This information will help me debug the problem further.

nqbao avatar May 16 '17 17:05 nqbao

I ran the ORC tools (orc-metadata) on the file and got this:

➜  src git:(master) ./orc-metadata ../../../../../python-orc/dt=2017-05-14_os=android_part=000004_0
{ "name": "../../../../../python-orc/dt=2017-05-14_os=android_part=000004_0",
  "type": "struct<log_id:string,mhotelid:string,city_id:string,city_name:string,city_name_en:string,province_id:string,province_name:string,province_name_en:string,m_city_id:string,m_province_id:string,log_type:string,uid:string,source:string,os_version:string,card_number:string,user_name:string,user_ip:string,appid:string,user_agent:string,trace_id:string,latitude:string,longitude:string,carrier:string,userinfo_channel:string,level:string,model:string,brand:string,orderid:string,proxyid:string,caller_attr_channel:string,economic_hotel:string,fast_filter_keywords:string,mhotel_ids:string,return_has_xianfu_hotel:string,return_has_yufu_hotel:string,hotel_brand_id:string,only_limitime_sale:string,facility_ids:string,theme_ids:string,star_rates:string,district_id:string,district_type:string,price_pair:string,payment_methods:string,nearby:string,poi_id:string,region_id:string,check_in:string,check_out:string,id:string,executetime:string,keywords:string,setkeywords:string,setbrandid:string,setstarrates:string,inner_search_type:string,hotel_group_id:string,sorting_method:string,setfilterattr:string,mrankflag:string,mranktype:string,setnearby:string,setfastfilter_attr:string,setpoi_id:string,sethotel_group_id:string,response_mhotelids:array<string>,setprice_pair:string,star_ratessize:string,facility_idssize:string,setdistrict_type:string,settheme_ids:string,setdistrict_id:string,settrace_id:string,pageindex:string,pagesize:string,recreqattrtype:string,ifun:string,crawled_flag:string,geo_type:string,activity_flag:string>",
  "rows": 1149130,
  "stripe count": 3,
  "format": "0.12", "writer version": "original",
  "compression": "zlib", "compression block": 262144,
  "file length": 116043721,
  "content": 116041038, "stripe stats": 3599, "footer": 2549, "postscript": 23,
  "row index stride": 10000,
  "user metadata": {
  },
  "stripes": [
    { "stripe": 0, "rows": 575000,
      "offset": 3, "length": 57272824,
      "index": 72291, "data": 57198680, "footer": 1853
    },
    { "stripe": 1, "rows": 510000,
      "offset": 57272827, "length": 51277701,
      "index": 64624, "data": 51211228, "footer": 1849
    },
    { "stripe": 2, "rows": 64130,
      "offset": 108550528, "length": 7490510,
      "index": 12819, "data": 7475943, "footer": 1748
    }
  ]
}
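
As a sanity check on the metadata above, the stripe layout can be verified with simple arithmetic. The sketch below copies the numbers from the orc-metadata output; the helper name check_stripes is hypothetical, and treating "content" as the end offset of the last stripe is an assumption that happens to hold for these numbers.

```python
# Stripe entries copied from the orc-metadata output above.
STRIPES = [
    {"stripe": 0, "rows": 575000, "offset": 3, "length": 57272824,
     "index": 72291, "data": 57198680, "footer": 1853},
    {"stripe": 1, "rows": 510000, "offset": 57272827, "length": 51277701,
     "index": 64624, "data": 51211228, "footer": 1849},
    {"stripe": 2, "rows": 64130, "offset": 108550528, "length": 7490510,
     "index": 12819, "data": 7475943, "footer": 1748},
]


def check_stripes(stripes, total_rows, content_end):
    """Return a list of inconsistencies found in the stripe layout.

    Hypothetical helper: it only checks arithmetic on the reported
    numbers and never touches the ORC file itself.
    """
    problems = []
    expected_offset = stripes[0]["offset"]
    for s in stripes:
        # Each stripe should start where the previous one ended.
        if s["offset"] != expected_offset:
            problems.append("stripe %d starts at %d, expected %d"
                            % (s["stripe"], s["offset"], expected_offset))
        # index + data + footer should add up to the stripe length.
        parts = s["index"] + s["data"] + s["footer"]
        if parts != s["length"]:
            problems.append("stripe %d sections sum to %d, length is %d"
                            % (s["stripe"], parts, s["length"]))
        expected_offset = s["offset"] + s["length"]
    # Assumption: "content" marks the end of the last stripe.
    if expected_offset != content_end:
        problems.append("stripes end at %d, content is %d"
                        % (expected_offset, content_end))
    rows = sum(s["rows"] for s in stripes)
    if rows != total_rows:
        problems.append("stripe rows sum to %d, file reports %d"
                        % (rows, total_rows))
    return problems


if __name__ == "__main__":
    print(check_stripes(STRIPES, total_rows=1149130, content_end=116041038))
```

An empty list means the stripe offsets, section sizes and row counts agree with each other, which would suggest the problem lies in how a compressed chunk is read rather than in the stripe layout.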

Vimos avatar May 17 '17 02:05 Vimos

Using the new branch, I got this:

In [1]: from orcreader import OrcReader
   ...: reader = OrcReader('dt=2017-05-14_os=android_part=000004_0')
   ...: reader.open()
   ...: 

In [2]: reader.fileMetaInfo
Out[2]: {u'metadataSize': u'3599', u'compressionType': u'ZLIB', u'writerVersion': u'0', u'versionLists': u'[0, 12]', u'bufferSize': u'262144'}

Vimos avatar May 17 '17 02:05 Vimos

Yes. By default the reader uses the blockSize from the file metadata, which here is "compression block": 262144. The chunk that fails needs 3863789 bytes, which is more than that.

One possible option is to override the blockSize manually. I will work on this later today.
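
For context on where the "Buffer size too small" message comes from: the ORC specification describes each compressed chunk as starting with a 3-byte little-endian header holding (chunkLength * 2 + isOriginal), and the Java reader rejects any chunk whose length exceeds the buffer size. Below is a minimal sketch of that check in Python. The function names are hypothetical and are not part of python-orc or the ORC library.

```python
def decode_chunk_header(header):
    """Decode a 3-byte ORC compression chunk header.

    Returns (chunk_length, is_original), where is_original means the
    chunk was stored uncompressed.
    """
    if len(header) != 3:
        raise ValueError("chunk header must be exactly 3 bytes")
    b0, b1, b2 = bytearray(header)
    value = b0 | (b1 << 8) | (b2 << 16)  # little-endian 24-bit value
    return value >> 1, bool(value & 1)


def encode_chunk_header(length, is_original):
    """Build the 3-byte header for a chunk (inverse of decode)."""
    value = (length << 1) | int(is_original)
    return bytes(bytearray([value & 0xFF,
                            (value >> 8) & 0xFF,
                            (value >> 16) & 0xFF]))


def check_chunk(header, buffer_size):
    """Mimic the reader's check: fail if the chunk exceeds the buffer."""
    length, is_original = decode_chunk_header(header)
    if length > buffer_size:
        raise ValueError("Buffer size too small. size = %d needed = %d"
                         % (buffer_size, length))
    return length, is_original


if __name__ == "__main__":
    # Rebuild a header that claims the length seen in the stack trace.
    header = encode_chunk_header(3863789, False)
    try:
        check_chunk(header, buffer_size=262144)
    except ValueError as exc:
        print(exc)
```

Under this reading, a needed size of 3863789 against a buffer of 262144 could mean either that the file was written with larger chunks than the metadata records, or that the bytes being decoded are not really a chunk header. The thread does not establish which.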

nqbao avatar May 17 '17 06:05 nqbao