Presto Alluxio SDK cache issue when file content changes under the same S3 URI

Open diablo47 opened this issue 11 months ago • 5 comments

We have used the Presto SDK cache for some time on version 0.275 with Alluxio 2.9.3. About once every 1-2 months the cache would become invalid and could not be queried, and everything was fine again after manually clearing all the caches. So we decided to upgrade Presto and Alluxio to the latest releases, Presto 0.285.1 + Alluxio 304, for new features and bug fixes. But things seem to be worse:

We have some unpartitioned Hive tables whose content may be updated hourly or daily, since we only care about the latest data. Queries and the cache work fine for the first version of the files. After the file content changes under the same S3 URI, the table can no longer be queried and queries fail with exceptions. Queries work again after manually emptying the cached files.

The error type varies, apparently depending on which files are read first. So far we have seen: don't know what type: 15, then Not valid Parquet file, and sometimes java.lang.ArrayIndexOutOfBoundsException.

Our previous setup, Presto 0.275 with Alluxio 2.9.3, doesn't have this issue; changed files can be read successfully most of the time.

We have currently disabled the cache for our 0.285.1 deployment.

Your Environment

  • Presto version used: 0.285.1 with Alluxio version 304
  • Storage (HDFS/S3/GCS..): S3
  • Data source and connector used: Hive + Parquet
  • Deployment (Cloud or On-prem): native deployment on AWS EC2, Java version: 1.8.0_181
  • Complete debug logs: see files attached

error 1: Not valid Parquet file (coordinator_error_1.log, worker_error_1.log)

error 2: java.lang.ArrayIndexOutOfBoundsException (coordinator_error_2.log, worker_error_2.log)

error 3: don't know what type: 15 (coordinator_error_3.log, worker_error_3.log)

Expected Behavior

Changed files should be read successfully and the cache should be updated accordingly.

Current Behavior

After the file content changes under the same S3 URI, the table can no longer be queried and queries fail with exceptions. Queries work again after manually emptying the cached files.

Steps to Reproduce

  1. Start Presto with an empty cache.
  2. Query a Parquet table with a fixed location (no partitions) so that the data gets cached, for example: select * from db.table limit 100;
  3. Update the table content so that the same S3 file name (same URI) is reused, for example with an insert overwrite. Hive or Spark will normally produce the same file name under the fixed location, such as 000001_0.
  4. Rerun the previous query and get the error.

Context

It looks very much like the cached metadata doesn't match the new files.
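The suspicion above can be illustrated with a toy model. This is a minimal sketch, not the real Alluxio or Presto code: the class `UriKeyedPageCache`, the page size, and the bucket path are all hypothetical. It only shows how a page cache keyed by URI and page index alone can hand back a mix of old and new bytes once an object is overwritten in place.

```python
PAGE_SIZE = 4  # tiny page size so the example stays readable (hypothetical value)


class UriKeyedPageCache:
    """Toy page cache keyed by (uri, page_index) only. Hypothetical, for illustration."""

    def __init__(self, storage):
        self.storage = storage  # dict: uri -> bytes, stands in for S3
        self.pages = {}         # (uri, page_index) -> cached page bytes

    def read_page(self, uri, page_index):
        key = (uri, page_index)
        if key not in self.pages:
            # Cache miss: load the page from the backing store and remember it.
            start = page_index * PAGE_SIZE
            self.pages[key] = self.storage[uri][start:start + PAGE_SIZE]
        return self.pages[key]


uri = "s3://bucket/table/000001_0"  # hypothetical location
storage = {uri: b"AAAAaaaa"}
cache = UriKeyedPageCache(storage)

# The first query touches only page 0, so only page 0 gets cached.
first = cache.read_page(uri, 0)

# An "insert overwrite" rewrites the object under the same URI.
storage[uri] = b"BBBBbbbb"

# Page 0 is served from the stale cache, page 1 is fetched fresh from storage.
mixed = cache.read_page(uri, 0) + cache.read_page(uri, 1)
```

A reader that receives `mixed` sees bytes from two different file versions. If the real cache behaves this way, that would be consistent with the error changing depending on which parts of the file were cached first.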

I tried upgrading Alluxio to 307, the version being prepared for 0.286, but that didn't help.

The Presto setup and configs are attached (weird that GitHub doesn't support .properties files...).

hive.properties.txt, coordinator-config.properties.txt, worker-config.properties.txt, jvm.config.txt

full server start log: server.log

diablo47 avatar Feb 27 '24 12:02 diablo47

CC: @beinan @apc999

imjalpreet avatar Feb 27 '24 16:02 imjalpreet

Presto 0.285.1 + Alluxio 310 has this issue as well.

diablo47 avatar Feb 29 '24 03:02 diablo47

There is a hack on the Presto side to fix this issue by adding the last_modified_timestamp to the cache key. Let me try to find the code change, or I will post a PR for this issue later.
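As a rough illustration of that idea (a sketch under assumptions, not the code from the actual Presto change), the key can carry the file's last-modified time next to the URI, so an overwritten file no longer matches the entries cached for the previous version. `CacheKey`, `VersionedPageCache`, and the timestamps are hypothetical names and values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheKey:
    """Hypothetical cache key: the URI alone is not enough to identify a file version."""
    uri: str
    last_modified: int  # e.g. epoch millis reported by the file system (assumption)
    page_index: int


class VersionedPageCache:
    """Toy cache that stores pages under the versioned key above."""

    def __init__(self):
        self.pages = {}

    def get_or_load(self, key, loader):
        # Load from the backing store only when this exact file version is not cached.
        if key not in self.pages:
            self.pages[key] = loader()
        return self.pages[key]


cache = VersionedPageCache()
uri = "s3://bucket/table/000001_0"  # hypothetical location

# First version of the file, last modified at t=1000.
old = cache.get_or_load(CacheKey(uri, 1000, 0), lambda: b"AAAA")

# After the overwrite the last-modified time changes, so the lookup misses
# and the new bytes are loaded instead of the stale page.
new = cache.get_or_load(CacheKey(uri, 2000, 0), lambda: b"BBBB")
```

In this sketch the stale entry for the old version simply stays in the cache until something evicts it, so the approach trades some cache space for correctness.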

beinan avatar May 14 '24 23:05 beinan

Hi @diablo47 , here is the fix: https://github.com/prestodb/presto/pull/22750

A new config should be added to enable this cache refresh feature as below:

cache.last-modified-time-check-enabled = true

Do you mind giving it a shot? Thanks!

beinan avatar May 15 '24 00:05 beinan

For error 2, it should be fixed in 310, right? We haven't seen this issue for a while now.

zacw7 avatar May 16 '24 18:05 zacw7

fix: https://github.com/prestodb/presto/pull/22750 merged.

beinan avatar May 19 '24 03:05 beinan

For error 2, it should be fixed in 310, right? We haven't seen this issue for a while now.

Yes, I believe so

beinan avatar May 19 '24 03:05 beinan

Hi @diablo47 , here is the fix: #22750

A new config should be added to enable this cache refresh feature as below:

cache.last-modified-time-check-enabled = true

Do you mind giving it a shot? Thanks!

Hi beinan, thank you for this fix. I will try it along with Alluxio 310.

diablo47 avatar Jun 07 '24 02:06 diablo47