
JCR-4369: Avoid S3 Incomplete Read Warning by elegant aborting

Open woonsan opened this issue 7 years ago • 0 comments

The AWS S3 SDK recommends aborting an S3ObjectInputStream when the caller does not intend to consume its data: under the hood, HttpClient's connection pool drains any remaining data on close() so that the connection can be reused, which can cause a serious performance problem when the object is large. From the SDK's perspective, it is better to simply abort the underlying HttpRequestBase and evict the connection from the pool.
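
For illustration, here is a minimal sketch of that recommendation; the helper name and surrounding plumbing are hypothetical, not part of the patch:

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.S3Object;
import com.amazonaws.services.s3.model.S3ObjectInputStream;

public class S3AbortSketch {

    // Hypothetical helper: the caller fetched an object but decided not to
    // consume its content after all.
    static void discardUnconsumed(AmazonS3 s3, String bucket, String key) {
        S3Object object = s3.getObject(bucket, key);
        S3ObjectInputStream in = object.getObjectContent();
        // close() would drain the remaining body so the pooled connection can
        // be reused, which is costly for large objects; abort() instead drops
        // the underlying HttpRequestBase and evicts the connection.
        in.abort();
    }
}
```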

In a multi-threaded environment (due to concurrent requests and/or the proactiveCaching mode of CachingDataStore), the read-and-store flow in o.a.j.core.data.CachingDataStore.getStream(DataIdentifier) can fall into the else block of o.a.j.core.data.LocalCache.store(String, InputStream) even though a file with that name already exists by the time the else block executes. In that case, the S3ObjectInputStream is neither read nor aborted. As a result, com.amazonaws.services.s3.internal.S3AbortableInputStream#close() ends up complaining about an input stream that was not aborted and not fully read.
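
A simplified sketch of that check-then-act race (illustrative names only, not the actual LocalCache code):

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

class LocalCacheRaceSketch {

    private final File cacheDir = new File("cache");

    InputStream store(String fileName, InputStream backendStream) throws IOException {
        File cached = new File(cacheDir, fileName);
        if (!cached.exists()) {
            // Winner thread: consumes the backend stream while copying it
            // into the local cache file.
            copyToCache(backendStream, cached);
        } else {
            // Loser thread: another thread cached the file between the cache
            // miss and this call, so backendStream is dropped without ever
            // being read or aborted; this is what triggers the
            // S3AbortableInputStream close() warning.
        }
        return new FileInputStream(cached);
    }

    private void copyToCache(InputStream in, File target) throws IOException {
        // copying elided for brevity
    }
}
```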

Therefore, my fix includes the following:

  • LocalCache checks whether the backend resource input stream is abortable; if so, it aborts the backend resource stream. For this purpose, a new BackendResourceAbortable interface is introduced in jackrabbit-data (see the sketch after this list).
  • S3Backend wraps the S3ObjectInputStream to implement BackendResourceAbortable by leveraging commons-io's ProxyInputStream.
  • Some unit tests.
  • Just FYI: I also tested this locally against an S3-compatible system (ref: https://github.com/woonsanko/hippo-davstore-demo/tree/feature/vfs-file-system#option-4-using-the-aws-s3-datastore-instead-of-vfs-datastore).
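
A minimal sketch of how these pieces could fit together; the interface name follows the description above, but the wrapper class, helper names, and method bodies are illustrative, not the actual patch:

```java
import java.io.InputStream;

import org.apache.commons.io.input.ProxyInputStream;

import com.amazonaws.services.s3.model.S3ObjectInputStream;

// Proposed interface in jackrabbit-data: marks a stream whose backend
// resource can be aborted without being fully consumed.
interface BackendResourceAbortable {
    void abort();
}

// Illustrative wrapper a backend such as S3Backend could return: it proxies
// the S3ObjectInputStream through commons-io's ProxyInputStream and exposes
// abort() without leaking S3 types to the caller.
class AbortableS3InputStream extends ProxyInputStream implements BackendResourceAbortable {

    private final S3ObjectInputStream s3in;

    AbortableS3InputStream(S3ObjectInputStream s3in) {
        super(s3in);
        this.s3in = s3in;
    }

    @Override
    public void abort() {
        // Drops the underlying HttpRequestBase and evicts the pooled
        // connection instead of draining the remaining body.
        s3in.abort();
    }
}

// Illustrative LocalCache-side check before an unconsumed stream is dropped:
// a plain instanceof test keeps LocalCache free of any S3 dependency.
class LocalCacheAbortSketch {

    static void abortIfPossible(InputStream backendStream) {
        if (backendStream instanceof BackendResourceAbortable) {
            ((BackendResourceAbortable) backendStream).abort();
        }
    }
}
```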

woonsan · Sep 05 '18 22:09