Presto Alluxio sdk Cache issue for file changes of the same s3 URI
We have used the Presto SDK cache for some time on version 0.275 with Alluxio 2.9.3. About once every 1-2 months the cache would become invalid and couldn't be queried, and everything would be fine again after manually clearing all the caches. So we decided to upgrade Presto + Alluxio to the latest release, Presto 0.285.1 + Alluxio 304, for new features and bug fixes. But things seem to be worse:
We have some Hive tables with no partitions whose content may be updated hourly or daily, as we only care about the latest data. Queries and the cache work fine for the first version of the files, but after the file content changes for the same S3 URI, the table can no longer be queried and throws exceptions. Queries can be resumed after manually emptying the cached files.
The error type varies and seems to depend on which files are read:
- first seen: `don't know what type: 15`
- then: `Not valid Parquet file`
- and sometimes: `java.lang.ArrayIndexOutOfBoundsException`
Our previous deployment, Presto 0.275 with Alluxio 2.9.3, doesn't have this issue, and changed files can be read successfully most of the time.
We have currently disabled the cache for our 0.285.1 deployment.
Your Environment
- Presto version used: 0.285.1 with Alluxio 304
- Storage (HDFS/S3/GCS..): S3
- Data source and connector used: Hive + Parquet
- Deployment (Cloud or On-prem): native deployment on AWS EC2, Java version: 1.8.0_181
- complete debug logs: see files attached
error 1: Not valid Parquet file
coordinator_error_1.log
worker_error_1.log
error 2: java.lang.ArrayIndexOutOfBoundsException
coordinator_error_2.log
worker_error_2.log
error 3: don't know what type: 15
coordinator_error_3.log
worker_error_3.log
Expected Behavior
Changed files should be read successfully, and the cache should be updated accordingly.
Current Behavior
After the file content changes for the same S3 URI, the table can no longer be queried and throws exceptions. Queries can be resumed after manually emptying the cached files.
Steps to Reproduce
- Start Presto with an empty cache.
- Query a table with Parquet storage at a fixed location (no partitions), so the data gets cached, e.g. `select * from db.table limit 100`.
- Update the table content with the same S3 file name (same URI), for example via an `insert overwrite`; Hive or Spark will normally produce the same file name under the fixed location, such as `000001_0`.
- Rerun the previous query and get the error.
Screenshots (if appropriate)
Context
It looks very much like the cached metadata doesn't match the new files.
I tried upgrading Alluxio to 307, as prepared for 0.286; that didn't help.
The Presto setup and configs are attached; oddly, GitHub doesn't support attaching .properties files...
hive.properties.txt coordinator-config.properties.txt worker-config.properties.txt jvm.config.txt
full server start log: server.log
CC: @beinan @apc999
Presto 0.285.1 + Alluxio 310 has this issue as well.
There is a hack on the Presto side to fix this issue by adding the last_modified_timestamp to the cache key. Let me try to find the code change, or I will post a PR for this issue later.
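The idea behind that hack can be sketched as follows. This is a hypothetical illustration, not the actual Presto internals: a cache key class (the name `FileCacheKey` and its fields are assumptions) that folds the file's last-modified time into equality and hashing, so a file overwritten in place at the same S3 URI maps to a new cache entry instead of hitting the stale one.

```java
import java.util.Objects;

// Hypothetical sketch, not the actual Presto code: a cache key that
// combines the file path with its last-modified time, so an overwritten
// file (same S3 URI, new content) no longer matches the stale entry.
public final class FileCacheKey {
    private final String path;           // e.g. "s3://bucket/db/table/000001_0"
    private final long lastModifiedTime; // epoch millis of this file version

    public FileCacheKey(String path, long lastModifiedTime) {
        this.path = Objects.requireNonNull(path, "path is null");
        this.lastModifiedTime = lastModifiedTime;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof FileCacheKey)) {
            return false;
        }
        FileCacheKey that = (FileCacheKey) o;
        return lastModifiedTime == that.lastModifiedTime && path.equals(that.path);
    }

    @Override
    public int hashCode() {
        // Without lastModifiedTime here, a rewritten file would collide
        // with the cached entry for its old content.
        return Objects.hash(path, lastModifiedTime);
    }
}
```

With a key like this, an `insert overwrite` that reuses the file name `000001_0` but bumps the S3 object's last-modified time produces a cache miss, forcing a fresh read instead of serving stale Parquet bytes.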
Hi @diablo47, here is the fix: https://github.com/prestodb/presto/pull/22750
A new config should be added to enable this cache refresh feature, as below:
`cache.last-modified-time-check-enabled = true`
Do you mind giving it a shot? Thanks!
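For context, a sketch of how that flag might sit alongside the existing local-cache settings in the catalog's hive.properties. The surrounding property names and the cache directory path are assumptions drawn from Presto's standard local data cache configuration, not from this deployment's attached configs:

```properties
# Existing local data cache settings (names from Presto's standard cache
# config; adjust to match your deployment)
cache.enabled=true
cache.base-directory=file:///mnt/presto-cache

# New flag from PR 22750: include the file's last-modified time in the
# cache key so overwritten files are re-read instead of served stale
cache.last-modified-time-check-enabled=true
```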
For error 2, it should be fixed in 310, right? We haven't seen this issue for a while now.
fix: https://github.com/prestodb/presto/pull/22750 merged.
> For error 2, it should be fixed in 310, right? We haven't seen this issue for a while now.
Yes, I believe so
> Hi @diablo47, here is the fix: #22750
> A new config should be added to enable this cache refresh feature, as below:
> `cache.last-modified-time-check-enabled = true`
> Do you mind giving it a shot? Thanks!
Hi @beinan, thank you for this fix. I will try it along with Alluxio 310.