Silent data corruption when writing a message exceeding 2GB in size

Open zabetak opened this issue 2 years ago • 11 comments

What version of protobuf and what language are you using?

Version: v21.12
Language: Java

What operating system (Linux, Windows, ...) and version?

Ubuntu 20.04.5 LTS

What runtime / compiler are you using (e.g., python version or gcc version)?

What did you do?

  1. Create a (huge) message greater than 2GB.
  2. Write the message to a file.
  3. Read back the message from the file.

https://github.com/zabetak/protobuf-large-message is a simple project that reproduces the problem with a reduced message model from the protobuf tutorial.

What did you expect to see?

I was expecting the message creation (Step 1) or the write (Step 2) to fail with a meaningful message, or the message to be read correctly from the file.

What did you see instead?

I got the following exception while attempting to read the message from the file.

Exception in thread "main" com.google.protobuf.InvalidProtocolBufferException: Protocol message was too large.  May be malicious.  Use CodedInputStream.setSizeLimit() to increase the size limit.
	at com.google.protobuf.InvalidProtocolBufferException.sizeLimitExceeded(InvalidProtocolBufferException.java:162)
	at com.google.protobuf.CodedInputStream$StreamDecoder.refillBuffer(CodedInputStream.java:2781)
	at com.google.protobuf.CodedInputStream$StreamDecoder.readRawByte(CodedInputStream.java:2859)
	at com.google.protobuf.CodedInputStream$StreamDecoder.readRawVarint64SlowPath(CodedInputStream.java:2648)
	at com.google.protobuf.CodedInputStream$StreamDecoder.readRawVarint32(CodedInputStream.java:2542)
	at com.google.protobuf.CodedInputStream$StreamDecoder.readMessage(CodedInputStream.java:2405)
	at com.github.zabetak.protobuf.bugs.protos.AddressBook$Builder.mergeFrom(AddressBook.java:440)
	at com.github.zabetak.protobuf.bugs.protos.AddressBook$1.parsePartialFrom(AddressBook.java:742)
	at com.github.zabetak.protobuf.bugs.protos.AddressBook$1.parsePartialFrom(AddressBook.java:734)
	at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:86)
	at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:91)
	at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:48)
	at com.google.protobuf.GeneratedMessageV3.parseWithIOException(GeneratedMessageV3.java:364)
	at com.github.zabetak.protobuf.bugs.protos.AddressBook.parseFrom(AddressBook.java:219)
	at com.github.zabetak.protobuf.bugs.serde.WriteReadBook.readBooks(WriteReadBook.java:63)
	at com.github.zabetak.protobuf.bugs.serde.WriteReadBook.main(WriteReadBook.java:35)

The reason for the exception is clear and understood, but now that the file is (permanently?) corrupted there is not much we can do.

Anything else we should know about your project / environment?

I understand that it is not good practice to write huge messages, but there should be some kind of guard against data corruption.

This problem came up in actual deployments of Apache Hive, Spark, etc., due to ORC metadata that is stored as protobuf messages.

The ORC team will probably reconsider the message layout in future versions, but it may be a good idea to find a way to prevent this from within the protobuf library.

Relevant tickets:

  • https://issues.apache.org/jira/browse/HIVE-11268
  • https://issues.apache.org/jira/browse/HIVE-11592
  • https://issues.apache.org/jira/browse/HIVE-26987
  • https://issues.apache.org/jira/browse/ORC-1361

zabetak avatar Jan 31 '23 18:01 zabetak

While I'm looking for an answer, have you tried increasing the limit as prompted?

shaod2 avatar Jan 31 '23 22:01 shaod2

Thanks for looking into this @shaod2! It is not possible to increase the limit further because it is already set to the maximum value, Integer.MAX_VALUE (~2GB).
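
To illustrate why the limit cannot go higher: the ceiling is a property of Java's 32-bit `int`. The following is only a sketch using plain JDK arithmetic; the class and method names are hypothetical and no protobuf code is involved. It shows that a size held in an `int` wraps silently past `Integer.MAX_VALUE` unless the addition is checked.

```java
// Sketch only: plain-Java arithmetic, no protobuf classes involved.
final class IntSizeLimitDemo {
    /** Adds two sizes with ordinary int arithmetic, which wraps silently on overflow. */
    static int wrappingAdd(int a, int b) {
        return a + b;
    }

    /** Adds two sizes and throws ArithmeticException instead of wrapping. */
    static int checkedAdd(int a, int b) {
        return Math.addExact(a, b);
    }

    public static void main(String[] args) {
        // 2147483647 bytes, i.e. 2 GiB minus one byte
        System.out.println(Integer.MAX_VALUE);
        // -2147483648: one more byte and the value wraps around without any error
        System.out.println(wrappingAdd(Integer.MAX_VALUE, 1));
        try {
            checkedAdd(Integer.MAX_VALUE, 1);
        } catch (ArithmeticException e) {
            System.out.println("overflow detected: " + e.getMessage());
        }
    }
}
```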

If there is no way to read back messages greater than 2GB, then I would suggest adding checks to prevent these messages from being constructed/serialized in the first place.
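
As a rough idea of what such a check could look like on the caller's side today, here is a minimal sketch, not anything protobuf provides: `LimitedOutputStream` is a hypothetical helper built only on `java.io` that fails the write as soon as a byte budget is exceeded, so the error surfaces at write time instead of at read time.

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

/** Hypothetical helper (not part of protobuf): fails a write once a byte budget is exceeded. */
final class LimitedOutputStream extends FilterOutputStream {
    private final long limit;   // maximum number of bytes allowed through
    private long written;       // bytes passed to the underlying stream so far

    LimitedOutputStream(OutputStream out, long limit) {
        super(out);
        this.limit = limit;
    }

    @Override
    public void write(int b) throws IOException {
        ensureRoom(1);
        out.write(b);
        written++;
    }

    @Override
    public void write(byte[] buf, int off, int len) throws IOException {
        ensureRoom(len);
        out.write(buf, off, len);
        written += len;
    }

    /** Throws before any byte that would cross the limit is written. */
    private void ensureRoom(long extra) throws IOException {
        if (written + extra > limit) {
            throw new IOException("Refusing to write more than " + limit + " bytes");
        }
    }

    long bytesWritten() {
        return written;
    }
}
```

A caller could wrap the file stream as `new LimitedOutputStream(fileOut, Integer.MAX_VALUE)` before passing it to `writeTo(OutputStream)`. Note that the bytes written before the failure stay in the file, so this assumes the caller writes to a temporary file and discards it on error.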

zabetak avatar Feb 01 '23 09:02 zabetak

@shaod2 Did you have a chance to look into this? Can you confirm that it is a problem that should be addressed?

zabetak avatar Feb 22 '23 11:02 zabetak

We triage inactive PRs and issues in order to make it easier to find active work. If this issue should remain active or becomes active again, please add a comment.

This issue is labeled inactive because the last activity was over 90 days ago.

github-actions[bot] avatar Dec 10 '23 10:12 github-actions[bot]

The issue is still relevant and, IMHO, quite important to address in order to avoid unexpected data loss.

zabetak avatar Dec 11 '23 08:12 zabetak

We triage inactive PRs and issues in order to make it easier to find active work. If this issue should remain active or becomes active again, please add a comment.

This issue is labeled inactive because the last activity was over 90 days ago.

github-actions[bot] avatar Mar 12 '24 10:03 github-actions[bot]

This is still relevant and should remain active.

zabetak avatar Mar 13 '24 08:03 zabetak

We triage inactive PRs and issues in order to make it easier to find active work. If this issue should remain active or becomes active again, please add a comment.

This issue is labeled inactive because the last activity was over 90 days ago.

github-actions[bot] avatar Jun 12 '24 10:06 github-actions[bot]

Still relevant!

zabetak avatar Jun 12 '24 10:06 zabetak

Hi! Thank you for reporting this issue.

I'm also a user of Protobuf, and I wanted to ask whether the size limit applies to:

  1. Any Protobuf message larger than 2GB, or
  2. Protobuf messages that store N objects, where N exceeds Integer.MAX_VALUE.

This is unclear from the documentation.

Thank you in advance :pray:

Best regards, Yoshua Nava

YoshuaNava avatar Jul 12 '24 14:07 YoshuaNava

We triage inactive PRs and issues in order to make it easier to find active work. If this issue should remain active or becomes active again, please add a comment.

This issue is labeled inactive because the last activity was over 90 days ago. This issue will be closed and archived after 14 additional days without activity.

github-actions[bot] avatar Oct 11 '24 10:10 github-actions[bot]

We triage inactive PRs and issues in order to make it easier to find active work. If this issue should remain active or becomes active again, please reopen it.

This issue was closed and archived because there has been no new activity in the 14 days since the inactive label was added.

github-actions[bot] avatar Oct 26 '24 10:10 github-actions[bot]

@YoshuaNava The max size is 2GB; see https://protobuf.dev/programming-guides/proto-limits/

ufengtao avatar Dec 26 '24 01:12 ufengtao