Support reading large statistics exceeding 2GB

deshanxiao opened this issue 2 years ago • 7 comments

In the Java and C++ readers, we cannot read an ORC file whose statistics exceed 2GB. We should find a new approach or design to support reading these files.

com.google.protobuf.InvalidProtocolBufferException: Protocol message was too large.  May be malicious.  Use CodedInputStream.setSizeLimit() to increase the size limit.
	at com.google.protobuf.InvalidProtocolBufferException.sizeLimitExceeded(InvalidProtocolBufferException.java:154)
	at com.google.protobuf.CodedInputStream$StreamDecoder.readRawBytesSlowPathOneChunk(CodedInputStream.java:2954)
	at com.google.protobuf.CodedInputStream$StreamDecoder.readBytesSlowPath(CodedInputStream.java:3035)
	at com.google.protobuf.CodedInputStream$StreamDecoder.readBytes(CodedInputStream.java:2446)
	at org.apache.orc.OrcProto$StringStatistics.<init>(OrcProto.java:2118)
	at org.apache.orc.OrcProto$StringStatistics.<init>(OrcProto.java:2070)
	at org.apache.orc.OrcProto$StringStatistics$1.parsePartialFrom(OrcProto.java:3285)
	at org.apache.orc.OrcProto$StringStatistics$1.parsePartialFrom(OrcProto.java:3279)
	at com.google.protobuf.CodedInputStream$StreamDecoder.readMessage(CodedInputStream.java:2423)
	at org.apache.orc.OrcProto$ColumnStatistics.<init>(OrcProto.java:8172)
	at org.apache.orc.OrcProto$ColumnStatistics.<init>(OrcProto.java:8093)
	at org.apache.orc.OrcProto$ColumnStatistics$1.parsePartialFrom(OrcProto.java:10494)
	at org.apache.orc.OrcProto$ColumnStatistics$1.parsePartialFrom(OrcProto.java:10488)
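
For context, the exception message points at CodedInputStream.setSizeLimit(). The sketch below is illustrative only: it shows what raising that limit looks like when parsing a serialized footer by hand. The file name is a placeholder, ORC's readers construct their CodedInputStream internally, and because the limit is an int it can never go beyond Integer.MAX_VALUE (about 2GB), which is exactly the ceiling this issue runs into.

```java
import com.google.protobuf.CodedInputStream;
import org.apache.orc.OrcProto;

import java.io.FileInputStream;
import java.io.InputStream;

public class RaiseProtobufLimit {
  public static void main(String[] args) throws Exception {
    // "footer.bin" is a hypothetical file holding a serialized OrcProto.Footer.
    try (InputStream in = new FileInputStream("footer.bin")) {
      CodedInputStream coded = CodedInputStream.newInstance(in);
      // setSizeLimit takes an int, so Integer.MAX_VALUE (~2GB) is the hard
      // ceiling; statistics larger than that cannot be parsed as one message.
      coded.setSizeLimit(Integer.MAX_VALUE);
      OrcProto.Footer footer = OrcProto.Footer.parseFrom(coded);
      System.out.println("column statistics entries: " + footer.getStatisticsCount());
    }
  }
}
```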

deshanxiao · Feb 09 '23 05:02

In fact, it is not reasonable to have such large statistics for a single ORC file; it requires too much memory. My suggestion is to limit writing such huge statistics on the writer side. What do you think? @zabetak
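
To make the proposal concrete, here is a purely hypothetical sketch of a writer-side cap on string min/max values before they end up in the serialized column statistics; the helper name and the byte limit are assumptions, and ORC does not expose this exact hook.

```java
import java.nio.charset.StandardCharsets;

public class StatsTruncationSketch {
  // Hypothetical helper: cap a string value before recording it as a
  // column-statistics min/max so the serialized statistics stay bounded.
  static String truncateForStats(String value, int maxBytes) {
    byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
    if (utf8.length <= maxBytes) {
      return value;
    }
    // Keep only a prefix; a real implementation would also need to back off
    // to a valid UTF-8 boundary and adjust min/max semantics for the cut.
    return new String(utf8, 0, maxBytes, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    String longValue = "x".repeat(10_000);
    System.out.println(truncateForStats(longValue, 1024).length()); // 1024
  }
}
```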

deshanxiao · Feb 22 '23 08:02

At first glance, 2GB of metadata seems big, especially when considering the toy example that I made in ORC-1361. However, if you have a 500GB ORC file, then 2GB of metadata does not appear too big anymore, so things are relative.

Are there limitations on the maximum size of an ORC file? Do we support such use cases?

If we add a limit while writing (which by the way I also suggested in https://github.com/protocolbuffers/protobuf/issues/11729) then we should define what happens when the limit is exceeded:

  • drop all metadata
  • fail the entire write
  • ignore metadata over the limit (keep partial metadata)

zabetak · Feb 22 '23 11:02

@zabetak Could you provide a scenario for a 500GB ORC file? In my experience, columnar storage generally serves big data engines, and each of these big data files is typically about 256MB-512MB.

deshanxiao · Feb 22 '23 12:02

Yeah, I've seen ORC files up to 2GB or so, but @deshanxiao's point is accurate that you'll probably get better performance out of Spark & Trino if you keep them smaller than 1-2GB.

That said, can you share how many rows & columns are in the 500GB file? How big did you configure the stripe buffers, and how large are your actual stripes? You should bump up your stripe buffer to get your stripe size to around 256MB. That will still give you 2000 stripes, which, unless you have a crazy number of columns, should keep the metadata under 2GB.
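
For reference, a minimal sketch of bumping the stripe size with the Java writer API, assuming orc-core and the Hadoop client libraries are on the classpath; the output path and schema are placeholders, and to my knowledge the same knob is also exposed as the orc.stripe.size configuration property.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class StripeSizeExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    TypeDescription schema = TypeDescription.fromString("struct<id:bigint,name:string>");

    // Ask for ~256MB stripes so a very large file ends up with fewer, larger
    // stripes and therefore a smaller footer/metadata section.
    Writer writer = OrcFile.createWriter(
        new Path("/tmp/example.orc"),            // placeholder output path
        OrcFile.writerOptions(conf)
            .setSchema(schema)
            .stripeSize(256L * 1024 * 1024));
    // Rows would be appended here via writer.addRowBatch(...).
    writer.close();
  }
}
```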

My guess is that you have a huge number of very small stripes. That will hurt performance, especially if you are using the C++ reader.

omalley · Feb 22 '23 18:02

I am rather positive on the idea of enforcing limits on the writer, but I would expect this to be in the protobuf layer (https://github.com/protocolbuffers/protobuf/issues/11729). The limitation comes from protobuf, so it seems natural to add checks there and not in ORC code.

The reason I brought up the question about maximum size is that as the file grows, so does the metadata, and clearly here we have a hard limit on how far it can go. If there is a compelling use case for supporting arbitrarily big files (with arbitrarily big metadata), then investing in a new design would make sense.

To be clear, I am not pushing for a new design, since I fully agree with both @deshanxiao and @omalley that with proper schema design + configuration the chances of hitting the problem are rather small. From the ORC perspective, it seems acceptable to settle for the 1/2GB limit on the metadata section.

I didn't mean to imply that having 500GB is a good/normal thing, but if nothing prevents users from doing it, eventually they will get there. :)

Speaking of actual use cases, in Hive I have recently seen OrcSplit reporting a fileLength of 216488139040 (roughly 200GB) for files under certain table partitions. I am not sure how well this number translates to the actual file size, nor about the configuration that led to this situation, since I am not the end user myself; I was just investigating the problem by checking the Hive application logs.

Summing up, I don't think a new metadata design is worth it at the moment, and limiting the writer seems more appropriately done in the protobuf layer. From my perspective, the only actionable item for this issue at this point would be to add a brief mention of the metadata size limitation on the website (e.g., https://orc.apache.org/specification/ORCv1/).

zabetak · Feb 23 '23 13:02

@zabetak Could you explain in detail why we can write but not read in the protobuf layer?

deshanxiao · Feb 24 '23 06:02

@deshanxiao In https://github.com/protocolbuffers/protobuf/issues/11729 I shared a very simple project reproducing/explaining the write/read problem. Please have a look and if you have questions I will be happy to elaborate more.

zabetak · Feb 24 '23 09:02