python-orc
Py4JJavaError: An error occurred while calling o2.iterator.
Hi, I am trying to read an orc file.
In [1]: from orcreader import OrcReader
...: reader = OrcReader('dt=2017-05-14_os=android_part=000004_0')
...: reader.open()
...:
I can successfully get the schema like this:
In [3]: reader.schema()
Out[3]:
OrderedDict([(u'log_id', u'string'),
(u'city_id', u'string'),
(u'city_name', u'string'),
(u'city_name_en', u'string'),
(u'province_id', u'string'),
(u'province_name', u'string'),
...
(u'activity_flag', u'string')])
But when I try to read rows, it reports the following error:
In [2]: for row in reader:
...: print row
...:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-2-df0cbab3b6b5> in <module>()
----> 1 for row in reader:
2 print row
3
/usr/local/lib/python2.7/dist-packages/python_orc-0.0.1-py2.7.egg/orcreader/reader.pyc in __iter__(self)
79
80 def __iter__(self):
---> 81 return OrcRecordIterator(self.reader.iterator())
82
83 def __enter__(self):
/usr/local/lib/python2.7/dist-packages/py4j-0.10.4-py2.7.egg/py4j/java_gateway.pyc in __call__(self, *args)
1131 answer = self.gateway_client.send_command(command)
1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
1134
1135 for temp_arg in temp_args:
/usr/local/lib/python2.7/dist-packages/py4j-0.10.4-py2.7.egg/py4j/protocol.pyc in get_return_value(answer, gateway_client, target_id, name)
317 raise Py4JJavaError(
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
321 raise Py4JError(
Py4JJavaError: An error occurred while calling o2.iterator.
: java.lang.RuntimeException: Unable to init iterator
at com.pythonorc.SimplifiedOrcReader.iterator(SimplifiedOrcReader.java:72)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Any suggestions on how to debug this error?
Is it possible to share the ORC file? I can try to take a look at it.
I am sorry, I am not permitted to share the data. I can offer more debug info from the Java side:
/usr/lib/jvm/java-8-oracle/bin/java -agentlib:jdwp=transport=dt_socket,address=127.0.0.1:32805,suspend=y,server=n -Dfile.encoding=UTF-8 -classpath /usr/lib/jvm/java-8-oracle/jre/lib/charsets.jar:/usr/lib/jvm/java-8-oracle/jre/lib/deploy.jar:/usr/lib/jvm/java-8-oracle/jre/lib/ext/cldrdata.jar:/usr/lib/jvm/java-8-oracle/jre/lib/ext/dnsns.jar:/usr/lib/jvm/java-8-oracle/jre/lib/ext/jaccess.jar:/usr/lib/jvm/java-8-oracle/jre/lib/ext/jfxrt.jar:/usr/lib/jvm/java-8-oracle/jre/lib/ext/localedata.jar:/usr/lib/jvm/java-8-oracle/jre/lib/ext/nashorn.jar:/usr/lib/jvm/java-8-oracle/jre/lib/ext/sunec.jar:/usr/lib/jvm/java-8-oracle/jre/lib/ext/sunjce_provider.jar:/usr/lib/jvm/java-8-oracle/jre/lib/ext/sunpkcs11.jar:/usr/lib/jvm/java-8-oracle/jre/lib/ext/zipfs.jar:/usr/lib/jvm/java-8-oracle/jre/lib/javaws.jar:/usr/lib/jvm/java-8-oracle/jre/lib/jce.jar:/usr/lib/jvm/java-8-oracle/jre/lib/jfr.jar:/usr/lib/jvm/java-8-oracle/jre/lib/jfxswt.jar:/usr/lib/jvm/java-8-oracle/jre/lib/jsse.jar:/usr/lib/jvm/java-8-oracle/jre/lib/management-agent.jar:/usr/lib/jvm/java-8-oracle/jre/lib/plugin.jar:/usr/lib/jvm/java-8-oracle/jre/lib/resources.jar:/usr/lib/jvm/java-8-oracle/jre/lib/rt.jar:/home/vimos/Public/github/ml/python-orc/java-gateway/target/classes:/data/home/vimos/.m2/repository/net/sf/py4j/py4j/0.10.2.1/py4j-0.10.2.1.jar:/data/home/vimos/.m2/repository/org/apache/orc/orc-core/1.1.1/orc-core-1.1.1.jar:/data/home/vimos/.m2/repository/com/google/protobuf/protobuf-java/2.5.0/protobuf-java-2.5.0.jar:/data/home/vimos/.m2/repository/commons-cli/commons-cli/1.2/commons-cli-1.2.jar:/data/home/vimos/.m2/repository/commons-lang/commons-lang/2.6/commons-lang-2.6.jar:/data/home/vimos/.m2/repository/org/apache/hadoop/hadoop-common/2.6.0/hadoop-common-2.6.0.jar:/data/home/vimos/.m2/repository/org/apache/hadoop/hadoop-annotations/2.6.0/hadoop-annotations-2.6.0.jar:/usr/lib/jvm/java-8-oracle/lib/tools.jar:/data/home/vimos/.m2/repository/org/apache/commons/commons-math3/3.1.1/commons-math3-3.1.1.jar:/data/home
/vimos/.m2/repository/xmlenc/xmlenc/0.52/xmlenc-0.52.jar:/data/home/vimos/.m2/repository/commons-httpclient/commons-httpclient/3.1/commons-httpclient-3.1.jar:/data/home/vimos/.m2/repository/commons-codec/commons-codec/1.4/commons-codec-1.4.jar:/data/home/vimos/.m2/repository/commons-io/commons-io/2.4/commons-io-2.4.jar:/data/home/vimos/.m2/repository/commons-net/commons-net/3.1/commons-net-3.1.jar:/data/home/vimos/.m2/repository/commons-collections/commons-collections/3.2.1/commons-collections-3.2.1.jar:/data/home/vimos/.m2/repository/com/sun/jersey/jersey-core/1.9/jersey-core-1.9.jar:/data/home/vimos/.m2/repository/com/sun/jersey/jersey-json/1.9/jersey-json-1.9.jar:/data/home/vimos/.m2/repository/com/sun/xml/bind/jaxb-impl/2.2.3-1/jaxb-impl-2.2.3-1.jar:/data/home/vimos/.m2/repository/javax/xml/bind/jaxb-api/2.2.2/jaxb-api-2.2.2.jar:/data/home/vimos/.m2/repository/javax/xml/stream/stax-api/1.0-2/stax-api-1.0-2.jar:/data/home/vimos/.m2/repository/javax/activation/activation/1.1/activation-1.1.jar:/data/home/vimos/.m2/repository/org/codehaus/jackson/jackson-jaxrs/1.8.3/jackson-jaxrs-1.8.3.jar:/data/home/vimos/.m2/repository/org/codehaus/jackson/jackson-xc/1.8.3/jackson-xc-1.8.3.jar:/data/home/vimos/.m2/repository/com/sun/jersey/jersey-server/1.9/jersey-server-1.9.jar:/data/home/vimos/.m2/repository/asm/asm/3.1/asm-3.1.jar:/data/home/vimos/.m2/repository/tomcat/jasper-compiler/5.5.23/jasper-compiler-5.5.23.jar:/data/home/vimos/.m2/repository/tomcat/jasper-runtime/5.5.23/jasper-runtime-5.5.23.jar:/data/home/vimos/.m2/repository/commons-el/commons-el/1.0/commons-el-1.0.jar:/data/home/vimos/.m2/repository/commons-logging/commons-logging/1.1.3/commons-logging-1.1.3.jar:/data/home/vimos/.m2/repository/log4j/log4j/1.2.17/log4j-1.2.17.jar:/data/home/vimos/.m2/repository/net/java/dev/jets3t/jets3t/0.9.0/jets3t-0.9.0.jar:/data/home/vimos/.m2/repository/org/apache/httpcomponents/httpclient/4.1.2/httpclient-4.1.2.jar:/data/home/vimos/.m2/repository/org/apache/httpcomponents/httpc
ore/4.1.2/httpcore-4.1.2.jar:/data/home/vimos/.m2/repository/com/jamesmurty/utils/java-xmlbuilder/0.4/java-xmlbuilder-0.4.jar:/data/home/vimos/.m2/repository/commons-configuration/commons-configuration/1.6/commons-configuration-1.6.jar:/data/home/vimos/.m2/repository/commons-digester/commons-digester/1.8/commons-digester-1.8.jar:/data/home/vimos/.m2/repository/commons-beanutils/commons-beanutils/1.7.0/commons-beanutils-1.7.0.jar:/data/home/vimos/.m2/repository/commons-beanutils/commons-beanutils-core/1.8.0/commons-beanutils-core-1.8.0.jar:/data/home/vimos/.m2/repository/org/slf4j/slf4j-log4j12/1.7.5/slf4j-log4j12-1.7.5.jar:/data/home/vimos/.m2/repository/org/codehaus/jackson/jackson-core-asl/1.9.13/jackson-core-asl-1.9.13.jar:/data/home/vimos/.m2/repository/org/codehaus/jackson/jackson-mapper-asl/1.9.13/jackson-mapper-asl-1.9.13.jar:/data/home/vimos/.m2/repository/com/google/code/gson/gson/2.2.4/gson-2.2.4.jar:/data/home/vimos/.m2/repository/org/apache/hadoop/hadoop-auth/2.6.0/hadoop-auth-2.6.0.jar:/data/home/vimos/.m2/repository/org/apache/directory/server/apacheds-kerberos-codec/2.0.0-M15/apacheds-kerberos-codec-2.0.0-M15.jar:/data/home/vimos/.m2/repository/org/apache/directory/server/apacheds-i18n/2.0.0-M15/apacheds-i18n-2.0.0-M15.jar:/data/home/vimos/.m2/repository/org/apache/directory/api/api-asn1-api/1.0.0-M20/api-asn1-api-1.0.0-M20.jar:/data/home/vimos/.m2/repository/org/apache/directory/api/api-util/1.0.0-M20/api-util-1.0.0-M20.jar:/data/home/vimos/.m2/repository/org/apache/curator/curator-framework/2.6.0/curator-framework-2.6.0.jar:/data/home/vimos/.m2/repository/com/jcraft/jsch/0.1.42/jsch-0.1.42.jar:/data/home/vimos/.m2/repository/org/apache/curator/curator-client/2.6.0/curator-client-2.6.0.jar:/data/home/vimos/.m2/repository/org/apache/curator/curator-recipes/2.6.0/curator-recipes-2.6.0.jar:/data/home/vimos/.m2/repository/org/htrace/htrace-core/3.0.4/htrace-core-3.0.4.jar:/data/home/vimos/.m2/repository/org/apache/zookeeper/zookeeper/3.4.6/zookeeper-3.4.
6.jar:/data/home/vimos/.m2/repository/org/apache/commons/commons-compress/1.4.1/commons-compress-1.4.1.jar:/data/home/vimos/.m2/repository/org/tukaani/xz/1.0/xz-1.0.jar:/data/home/vimos/.m2/repository/org/apache/hadoop/hadoop-hdfs/2.6.0/hadoop-hdfs-2.6.0.jar:/data/home/vimos/.m2/repository/commons-daemon/commons-daemon/1.0.13/commons-daemon-1.0.13.jar:/data/home/vimos/.m2/repository/io/netty/netty/3.6.2.Final/netty-3.6.2.Final.jar:/data/home/vimos/.m2/repository/xerces/xercesImpl/2.9.1/xercesImpl-2.9.1.jar:/data/home/vimos/.m2/repository/xml-apis/xml-apis/1.3.04/xml-apis-1.3.04.jar:/data/home/vimos/.m2/repository/org/apache/hive/hive-storage-api/2.1.0-pre-orc/hive-storage-api-2.1.0-pre-orc.jar:/data/home/vimos/.m2/repository/org/codehaus/jettison/jettison/1.1/jettison-1.1.jar:/data/home/vimos/.m2/repository/stax/stax-api/1.0.1/stax-api-1.0.1.jar:/data/home/vimos/.m2/repository/org/iq80/snappy/snappy/0.2/snappy-0.2.jar:/data/home/vimos/.m2/repository/org/slf4j/slf4j-api/1.7.5/slf4j-api-1.7.5.jar:/data/home/vimos/.m2/repository/com/google/guava/guava/14.0.1/guava-14.0.1.jar:/opt/jetbrains/idea-IU-171.4249.39/lib/idea_rt.jar com.pythonorc.SimplifiedOrcReader
Connected to the target VM, address: '127.0.0.1:32805', transport: 'socket'
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
[log_id, mhotelid, city_id, city_name, city_name_en, province_id, province_name, province_name_en, m_city_id, m_province_id, log_type, uid, source, os_version, card_number, user_name, user_ip, appid, user_agent, trace_id, latitude, longitude, carrier, userinfo_channel, level, model, brand, orderid, proxyid, caller_attr_channel, economic_hotel, fast_filter_keywords, mhotel_ids, return_has_xianfu_hotel, return_has_yufu_hotel, hotel_brand_id, only_limitime_sale, facility_ids, theme_ids, star_rates, district_id, district_type, price_pair, payment_methods, nearby, poi_id, region_id, check_in, check_out, id, executetime, keywords, setkeywords, setbrandid, setstarrates, inner_search_type, hotel_group_id, sorting_method, setfilterattr, mrankflag, mranktype, setnearby, setfastfilter_attr, setpoi_id, sethotel_group_id, response_mhotelids, setprice_pair, star_ratessize, facility_idssize, setdistrict_type, settheme_ids, setdistrict_id, settrace_id, pageindex, pagesize, recreqattrtype, ifun, crawled_flag, geo_type, activity_flag]
80
1149130
Exception in thread "main" java.lang.IllegalArgumentException: Buffer size too small. size = 262144 needed = 3863789
at org.apache.orc.impl.InStream$CompressedStream.readHeader(InStream.java:217)
at org.apache.orc.impl.InStream$CompressedStream.read(InStream.java:262)
at java.io.InputStream.read(InputStream.java:101)
at com.google.protobuf.CodedInputStream.refillBuffer(CodedInputStream.java:737)
at com.google.protobuf.CodedInputStream.isAtEnd(CodedInputStream.java:701)
at com.google.protobuf.CodedInputStream.readTag(CodedInputStream.java:99)
at org.apache.orc.OrcProto$StripeFooter.<init>(OrcProto.java:10679)
at org.apache.orc.OrcProto$StripeFooter.<init>(OrcProto.java:10643)
at org.apache.orc.OrcProto$StripeFooter$1.parsePartialFrom(OrcProto.java:10748)
at org.apache.orc.OrcProto$StripeFooter$1.parsePartialFrom(OrcProto.java:10743)
at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:89)
at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:95)
at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
at org.apache.orc.OrcProto$StripeFooter.parseFrom(OrcProto.java:10976)
at org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.readStripeFooter(RecordReaderUtils.java:165)
at org.apache.orc.impl.RecordReaderImpl.readStripeFooter(RecordReaderImpl.java:236)
at org.apache.orc.impl.RecordReaderImpl.beginReadStripe(RecordReaderImpl.java:849)
at org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:820)
at org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:977)
at org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1012)
at org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:212)
at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:579)
at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:566)
at com.pythonorc.SimplifiedOrcReader.iterator(SimplifiedOrcReader.java:70)
at com.pythonorc.SimplifiedOrcReader.main(SimplifiedOrcReader.java:285)
Disconnected from the target VM, address: '127.0.0.1:32805', transport: 'socket'
That gives me a hint; let me try to work on it.
I don't have a sample file that reproduces this error. But if I understand correctly, the bufferSize is read from the footer of the ORC file, so maybe the bufferSize recorded there is incorrect for some reason.
Could you check out the branch add-fetch-filemetainfo, build again, then fetch reader.fileMetaInfo and paste the output back here? A sample output would be:
{u'metadataSize': u'250', u'compressionType': u'ZLIB', u'writerVersion': u'1', u'versionLists': u'[0, 12]', u'bufferSize': u'10000'}
This info will help me to debug further into the problem.
I used the ORC tools and got this:
➜ src git:(master) ./orc-metadata ../../../../../python-orc/dt=2017-05-14_os=android_part=000004_0
{ "name": "../../../../../python-orc/dt=2017-05-14_os=android_part=000004_0",
"type": "struct<log_id:string,mhotelid:string,city_id:string,city_name:string,city_name_en:string,province_id:string,province_name:string,province_name_en:string,m_city_id:string,m_province_id:string,log_type:string,uid:string,source:string,os_version:string,card_number:string,user_name:string,user_ip:string,appid:string,user_agent:string,trace_id:string,latitude:string,longitude:string,carrier:string,userinfo_channel:string,level:string,model:string,brand:string,orderid:string,proxyid:string,caller_attr_channel:string,economic_hotel:string,fast_filter_keywords:string,mhotel_ids:string,return_has_xianfu_hotel:string,return_has_yufu_hotel:string,hotel_brand_id:string,only_limitime_sale:string,facility_ids:string,theme_ids:string,star_rates:string,district_id:string,district_type:string,price_pair:string,payment_methods:string,nearby:string,poi_id:string,region_id:string,check_in:string,check_out:string,id:string,executetime:string,keywords:string,setkeywords:string,setbrandid:string,setstarrates:string,inner_search_type:string,hotel_group_id:string,sorting_method:string,setfilterattr:string,mrankflag:string,mranktype:string,setnearby:string,setfastfilter_attr:string,setpoi_id:string,sethotel_group_id:string,response_mhotelids:array<string>,setprice_pair:string,star_ratessize:string,facility_idssize:string,setdistrict_type:string,settheme_ids:string,setdistrict_id:string,settrace_id:string,pageindex:string,pagesize:string,recreqattrtype:string,ifun:string,crawled_flag:string,geo_type:string,activity_flag:string>",
"rows": 1149130,
"stripe count": 3,
"format": "0.12", "writer version": "original",
"compression": "zlib", "compression block": 262144,
"file length": 116043721,
"content": 116041038, "stripe stats": 3599, "footer": 2549, "postscript": 23,
"row index stride": 10000,
"user metadata": {
},
"stripes": [
{ "stripe": 0, "rows": 575000,
"offset": 3, "length": 57272824,
"index": 72291, "data": 57198680, "footer": 1853
},
{ "stripe": 1, "rows": 510000,
"offset": 57272827, "length": 51277701,
"index": 64624, "data": 51211228, "footer": 1849
},
{ "stripe": 2, "rows": 64130,
"offset": 108550528, "length": 7490510,
"index": 12819, "data": 7475943, "footer": 1748
}
]
}
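As a quick sanity check on the numbers above, here is a small Python sketch that tests whether the stripe layout is internally consistent. The values are copied by hand from the orc-metadata output; the helper name check_stripes is hypothetical, and comparing the end of the last stripe with the "content" value is an assumption about what that field means. If the list of problems comes back empty, the stripe boundaries at least agree with each other, which would hint that the file is not simply truncated.

```python
# Values copied from the orc-metadata output above.
STRIPES = [
    {"rows": 575000, "offset": 3, "length": 57272824,
     "index": 72291, "data": 57198680, "footer": 1853},
    {"rows": 510000, "offset": 57272827, "length": 51277701,
     "index": 64624, "data": 51211228, "footer": 1849},
    {"rows": 64130, "offset": 108550528, "length": 7490510,
     "index": 12819, "data": 7475943, "footer": 1748},
]
TOTAL_ROWS = 1149130
CONTENT_LENGTH = 116041038


def check_stripes(stripes, total_rows, content_length):
    """Return a list of human-readable inconsistencies (empty if none)."""
    problems = []
    expected_offset = stripes[0]["offset"]
    for i, s in enumerate(stripes):
        # Each stripe should start where the previous one ended.
        if s["offset"] != expected_offset:
            problems.append("stripe %d: offset %d, expected %d"
                            % (i, s["offset"], expected_offset))
        # index + data + footer should add up to the stripe length.
        parts = s["index"] + s["data"] + s["footer"]
        if parts != s["length"]:
            problems.append("stripe %d: parts sum to %d, length is %d"
                            % (i, parts, s["length"]))
        expected_offset = s["offset"] + s["length"]
    # Assumption: "content" marks the end of the last stripe.
    if expected_offset != content_length:
        problems.append("last stripe ends at %d, content is %d"
                        % (expected_offset, content_length))
    rows = sum(s["rows"] for s in stripes)
    if rows != total_rows:
        problems.append("stripe rows sum to %d, file reports %d"
                        % (rows, total_rows))
    return problems


problems = check_stripes(STRIPES, TOTAL_ROWS, CONTENT_LENGTH)
print(problems)
```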
Using the new branch, I got this.
In [1]: from orcreader import OrcReader
...: reader = OrcReader('dt=2017-05-14_os=android_part=000004_0')
...: reader.open()
...:
In [2]: reader.fileMetaInfo
Out[2]: {u'metadataSize': u'3599', u'compressionType': u'ZLIB', u'writerVersion': u'0', u'versionLists': u'[0, 12]', u'bufferSize': u'262144'}
Yeah. By default the reader uses the blockSize from the file metadata, which here is "compression block": 262144 (the same value as bufferSize in your output).
One possible option is to manually override the blockSize. I will work on this later today.
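For context on where "size = 262144 needed = 3863789" comes from: in the ORC format, each compressed chunk starts with a 3-byte little-endian header whose lowest bit is the "original (uncompressed)" flag and whose remaining bits are the chunk length. The reader rejects a chunk whose declared length exceeds the buffer size. Below is a minimal Python sketch of that check. The function names are hypothetical, and the example bytes are constructed to decode to the reported value; they are not taken from the actual file. It is also possible that the reader is looking at the wrong offset, so the decoded "length" may not be a real chunk length at all.

```python
def decode_chunk_header(header):
    """Decode a 3-byte ORC compression chunk header.

    Returns (chunk_length, is_original).
    """
    if len(header) != 3:
        raise ValueError("ORC chunk header must be exactly 3 bytes")
    b0, b1, b2 = bytearray(header)
    value = b0 | (b1 << 8) | (b2 << 16)  # little-endian 24-bit value
    return value >> 1, bool(value & 1)


def check_chunk(header, buffer_size):
    """Mimic the reader's size check for a single chunk header."""
    chunk_length, is_original = decode_chunk_header(header)
    if chunk_length > buffer_size:
        raise ValueError("Buffer size too small. size = %d needed = %d"
                         % (buffer_size, chunk_length))
    return chunk_length, is_original


# Hypothetical header bytes that decode to the length in the stack trace.
example = bytes(bytearray([0xDA, 0xE9, 0x75]))
print(decode_chunk_header(example))
```

With a buffer size of 262144, check_chunk(example, 262144) raises an error of the same shape as the one in the Java trace, while a larger buffer size (for example 4194304) would let the same header through.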