
[BUG] Always read old data from alluxio regardless of S3 changes when using CONVERT_TIME replacement algorithm

Open res-life opened this issue 2 years ago • 2 comments

Describe the bug
When using the CONVERT_TIME Alluxio replacement algorithm with the auto-mount method, Spark RAPIDS cannot read files newly added to S3. The query path is replaced with the Alluxio path, and Alluxio does not fetch the new files from S3, which causes this issue.

For example: a read from s3://bucket/tab is rewritten to alluxio://master_ip:19998/tab. Adding new files to s3://bucket/tab afterwards reproduces the issue.
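The replacement itself is essentially a scheme/authority rewrite. A minimal sketch of that step (this is NOT the actual AlluxioUtils implementation; the function name and defaults are illustrative):

```scala
// Hypothetical sketch of the path rewrite performed when auto-mount
// replaces an S3 scheme with the Alluxio master address.
def replaceWithAlluxio(s3Path: String,
                       masterHost: String,
                       masterPort: Int = 19998): String = {
  val prefix = "s3a://"
  require(s3Path.startsWith(prefix), s"expected an s3a:// path, got: $s3Path")
  // s3a://bucket/dir -> alluxio://host:port/bucket/dir
  s"alluxio://$masterHost:$masterPort/" + s3Path.stripPrefix(prefix)
}
```

For instance, `replaceWithAlluxio("s3a://liangcail/chongg-test", "10.59.255.53")` yields `alluxio://10.59.255.53:19998/liangcail/chongg-test`, matching the DEBUG log in this report. The rewrite only changes the path; whether Alluxio's metadata for that path is fresh is a separate question, which is the crux of this bug.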

Steps/Code to reproduce bug
Read from an S3 path:

val base_path = "s3a://liangcail/chongg-test"
spark.conf.set("spark.rapids.alluxio.replacement.algo", "CONVERT_TIME")
spark.conf.set("spark.sql.adaptive.enabled", "false")

spark.read.parquet(base_path).createOrReplaceTempView("tbl")
spark.sql("select (sum(ws_sold_time_sk) + sum(ws_ship_date_sk)) / 3800809931000 / 2 from tbl").show()

Then add a file to the S3 path:

aws s3 cp x.parquet s3://liangcail/chongg-test/new_name.parquet

Logs:

22/10/13 09:57:47 DEBUG AlluxioUtils: 
Replace s3a://liangcail/chongg-test to alluxio://10.59.255.53:19998/liangcail/chongg-test
22/10/13 09:57:47 DEBUG AlluxioUtils: 
Automount replacing paths: AlluxioPathReplaceConvertTime(alluxio://10.59.255.53:19998/liangcail/chongg-test,Some(s3a://))

22/10/13 09:57:47 DEBUG FileSystemMasterClient: Exit (OK): ListStatus(path=/liangcail/chongg-test,options=loadMetadataType: ONCE
commonOptions {
  syncIntervalMs: -1
  ttl: -1
  ttlAction: DELETE
}
loadMetadataOnly: false
) in 5 ms

It seems Alluxio did not re-sync the directory listing from S3 when serving ListStatus (note syncIntervalMs: -1 above).

Environment details (please complete the following information)
Databricks with open-source Alluxio

Additional context
With a default Alluxio installation, Alluxio never re-synchronizes the file list from S3 for any directory. The Alluxio ListStatus operation likewise does not sync by default.

Note: calling createOrReplaceTempView before each query does not help; the issue still occurs.

Adding the following config works:

alluxio.user.file.metadata.sync.interval=5s
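For reference, a client-side property like this is typically set in alluxio-site.properties on the client classpath, or passed to the Spark JVMs as a system property; a sketch (property names per the Alluxio docs, the 5s value is illustrative):

```properties
# alluxio-site.properties (client side): treat cached metadata older than 5s
# as stale and re-sync with the under store; -1 (the default) never syncs.
alluxio.user.file.metadata.sync.interval=5s

# Or pass it to the Spark driver/executors as a JVM system property:
# spark.driver.extraJavaOptions=-Dalluxio.user.file.metadata.sync.interval=5s
# spark.executor.extraJavaOptions=-Dalluxio.user.file.metadata.sync.interval=5s
```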

Not sure about the acceptable delay; there are two possible behaviors:

  • Sync immediately when the query runs, so the query result is correct right away.
  • Let the query use the old metadata (so that query's result is stale), and start an async thread to sync; after the interval (e.g. 5s), subsequent queries are correct.
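The second option could look roughly like the following sketch: a daemon thread that periodically re-triggers a metadata sync. The `syncRoot` callback is a stand-in for whatever actually refreshes Alluxio's view of S3 (e.g. shelling out to `alluxio fs ls` with a sync interval, or the equivalent client API); everything here is illustrative, not the plugin's actual code.

```scala
import java.util.concurrent.{Executors, ScheduledExecutorService, ThreadFactory, TimeUnit}

// Hypothetical sketch of an async periodic metadata sync (option 2 above).
// `syncRoot` stands in for the real sync call against the Alluxio master.
def startPeriodicSync(intervalMs: Long)(syncRoot: () => Unit): ScheduledExecutorService = {
  val factory = new ThreadFactory {
    override def newThread(r: Runnable): Thread = {
      val t = new Thread(r, "alluxio-metadata-sync")
      t.setDaemon(true) // don't keep the application alive just for syncing
      t
    }
  }
  val scheduler = Executors.newSingleThreadScheduledExecutor(factory)
  // Fixed delay: the next sync starts intervalMs after the previous one
  // finishes, so a slow sync (e.g. an 11s full listing) never overlaps itself.
  scheduler.scheduleWithFixedDelay(new Runnable {
    override def run(): Unit = syncRoot()
  }, 0L, intervalMs, TimeUnit.MILLISECONDS)
  scheduler
}
```

With this design, queries keep using whatever metadata is current, so a result can be up to one interval stale; the scheduler should be shut down with shutdownNow() when the session ends.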

res-life avatar Oct 13 '22 11:10 res-life

This seems to be working as designed in Alluxio. Did you try setting the metadata sync interval to 0? I wouldn't expect a 5-second interval to pick up changes immediately; it would only work if the put happens in AWS S3 and you then wait 5+ seconds for Alluxio to do the sync, after which it would work.

When its value is set to 0, Alluxio will always resync with under storage whenever metadata is accessed.

I also don't understand how this differs between CONVERT_TIME and TASK_TIME; can you explain? Is it because TASK_TIME doesn't do the ListStatus, so the metadata in Alluxio isn't filled in? I wouldn't expect that, since as soon as you actually read the data I would think the metadata would get filled in.

tgravescs avatar Oct 13 '22 13:10 tgravescs

@res-life I still have questions above, please answer

tgravescs avatar Oct 17 '22 13:10 tgravescs

Default values of the two relevant configs:

alluxio.user.file.metadata.load.type = ONCE
`ONCE` will access the UFS the "first" time a path is accessed (tracked by a cache), but not after that.

alluxio.user.file.metadata.sync.interval = -1
`-1` means metadata is never re-synced with the UFS.

So by default, ListStatus syncs the sub-files only on first access, and never again after that.

CONVERT_TIME constructs a new FileIndex, and the constructor triggers the sync (which is skipped by default). If we set alluxio.user.file.metadata.sync.interval to a positive value, the sync happens; see the logs:

22/11/01 11:37:15 DEBUG FileSystemMasterClient: Exit (OK): ListStatus(path=/nds20/nds2-sf2-parquet/store_sales,options=loadMetadataType: ONCE
commonOptions {
  syncIntervalMs: 5000
  ttl: -1
  ttlAction: DELETE
}
loadMetadataOnly: false
) in 281 ms

The TASK_TIME algorithm does not construct a new FileIndex and thus never triggers the sync. For this issue, we want to remove the CONVERT_TIME algorithm.

With the TASK_TIME algorithm, setting sync.interval has no effect, but we can sync the root path periodically. The command is:

alluxio fs ls -R -Dalluxio.user.file.metadata.sync.interval=5s /
# Note: the equivalent Java client API could be invoked instead.

The above command takes about 11s when the root path contains 36,121 files (including folders).

Running this sync significantly impacts concurrent queries: query time increased from about 500 ms to about 5,000 ms while the sync was in progress.

Details:

code:

%scala
sc.setLogLevel("DEBUG")

spark.conf.set("spark.rapids.alluxio.automount.enabled", "true")
val base_path = "s3a://nds20/nds2-sf2-parquet"
spark.conf.set("spark.rapids.alluxio.replacement.algo", "TASK_TIME")
spark.conf.set("spark.sql.adaptive.enabled", "false")
spark.read.parquet(base_path + "/" + "store_sales").createOrReplaceTempView("store_sales")

for(i <- 0 to 10000) {
  spark.sql("select count(*) from store_sales").show()
  Thread.sleep(1000L)
}

sc.setLogLevel("INFO")

Sync scripts:

for (( i = 0; i < 1000; ++i )) ;  do  \
./alluxio fs ls -R -Dalluxio.user.file.metadata.sync.interval=5s / ;  \
done

Output:

// Query times in ms. About every 5s, performance degrades while the sync runs.
used: 593
used: 469
used: 544
used: 726
used: 5939
used: 635
used: 817
used: 808
used: 652
used: 5178
used: 890
used: 677
used: 777
used: 801
used: 634
used: 2723
used: 3095
used: 629
used: 765
used: 802
used: 609
used: 655
used: 4885
used: 733
used: 890
used: 715
used: 605
used: 4018
used: 2973
used: 585
used: 753
used: 851
used: 660
used: 521
used: 5082

res-life avatar Nov 01 '22 13:11 res-life

We should document this Alluxio behavior in our docs: mention that sync is disabled by default, so changed files won't be picked up, and that enabling sync has a performance impact. We need to document where it applies and how it works.

tgravescs avatar Nov 03 '22 13:11 tgravescs

@res-life do you have another issue to document the sync setup?

tgravescs avatar Nov 08 '22 14:11 tgravescs

OK, I'll file a follow-up issue to track.

res-life avatar Nov 09 '22 01:11 res-life

The follow-up issue: https://github.com/NVIDIA/spark-rapids/issues/7079

res-life avatar Nov 16 '22 10:11 res-life