
[BugFix] The HDFS directory is not removed when the Spark resource is dropped

Open · blanklin030 opened this issue on May 20, 2024 · 0 comments

Steps to reproduce the behavior (Required)

    1. Create a Spark load job (a sketch of the assumed Spark resource definition follows the statement):
LOAD LABEL pre_stream.test_load_ly_2 (
DATA FROM TABLE test_list_dup_sr_external_h2s_foit_820240510
INTO TABLE test_list_dup_sr
TEMPORARY PARTITION(temp__p20230930_BR)
SET (
    `id` = `id`,
    `name` = `name`,
    `dt` = '2023-09-30',
    `country_code` = 'BR'
) 
) WITH RESOURCE 'spark_resource' (
  "spark.yarn.tags" = "xxx05131",
  "spark.dynamicAllocation.enabled" = "true",
  "spark.executor.memory" = "3g",
  "spark.executor.memoryOverhead" = "2g",
  "spark.streaming.batchDuration" = "5",
  "spark.executor.cores" = "1",
  "spark.yarn.executor.memoryOverhead" = "2g",
  "spark.speculation" = "false",
  "spark.dynamicAllocation.minExecutors" = "2",
  "spark.dynamicAllocation.maxExecutors" = "100"
) PROPERTIES (
  "timeout" = "72000",
  "spark_load_submit_timeout" = "7200"
)
;
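
For context, the load job above assumes that a Spark resource named spark_resource already exists. A minimal sketch of such a resource definition is shown below; the working_dir is inferred from the HDFS paths in the FE logs, while the broker name and YARN addresses are placeholders rather than values from the actual environment:

-- Hypothetical resource definition assumed by the load job above.
-- working_dir matches the hdfs://ClusterNmg/user/prod_xxx/sparketl prefix
-- seen in the logs; broker0 and the YARN addresses are placeholders.
CREATE EXTERNAL RESOURCE 'spark_resource'
PROPERTIES (
    "type" = "spark",
    "spark.master" = "yarn",
    "spark.submit.deployMode" = "cluster",
    "spark.hadoop.yarn.resourcemanager.address" = "resourcemanager_host:8032",
    "spark.hadoop.fs.defaultFS" = "hdfs://ClusterNmg",
    "working_dir" = "hdfs://ClusterNmg/user/prod_xxx/sparketl",
    "broker" = "broker0"
);
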
    2. The Spark repository directory and dependency files are created on HDFS while the load runs:
2024-05-14 01:42:12,013 INFO (pending_load_task_scheduler_pool-1|498) [SparkRepository.upload():302] finished to upload file, localPath=/home/hadoop/starrocks-current/fe/spark-dpp/spark-dpp-1.0.0-jar-with-dependencies.jar, remotePath=hdfs://ClusterNmg/user/prod_xxx/sparketl/1384206915/__spark_repository__db__tb_sr__1019adb1d38c/__archive_1.0.0/__lib__spark-dpp-1.0.0-jar-with-dependencies.jar


2024-05-14 01:42:12,077 INFO (pending_load_task_scheduler_pool-1|498) [SparkRepository.rename():316] finished to rename file, originPath=hdfs://ClusterNmg/user/prod_xxx/sparketl/1384206915/__spark_repository__db__tb_sr__1019adb1d38c/__archive_1.0.0/__lib__spark-dpp-1.0.0-jar-with-dependencies.jar, destPath=hdfs://ClusterNmg/user/prod_xxx/sparketl/1384206915/__spark_repository__db__tb_sr__1019adb1d38c/__archive_1.0.0/__lib_70688c469808112f344091125a860404_spark-dpp-1.0.0-jar-with-dependencies.jar
    3. Drop the Spark resource:
DROP RESOURCE 'spark_resource';
    4. The HDFS directory is not removed after the Spark resource is dropped:
[hadoop@bigdata-starrocks-xxx ~]$ hdfs dfs -ls hdfs://ClusterNmg/user/prod_xxx/sparketl/1384206915/__spark_repository__spark_resource/__archive_1.0.0/
Found 2 items
-rw-r--r--   3 prod_xxx supergroup  394653421 2024-05-20 10:54 hdfs://ClusterNmg/user/prod_xxx/sparketl/1384206915/__spark_repository__spark_resource/__archive_1.0.0/__lib_62eff19a2751990e17b47aa258fb7623_spark-2x.zip
-rw-r--r--   3 prod_xxx supergroup    4013682 2024-05-20 10:53 hdfs://ClusterNmg/user/prod_xxx/sparketl/1384206915/__spark_repository__spark_resource/__archive_1.0.0/__lib_70688c469808112f344091125a860404_spark-dpp-1.0.0-jar-with-dependencies.jar

Expected behavior (Required)

Dropping the Spark resource should also delete the corresponding Spark repository directory on HDFS.

Real behavior (Required)

After the Spark resource is dropped, the Spark repository directory on HDFS is not removed.
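
The inconsistency can also be verified from the FE side; a minimal check, assuming the standard SHOW RESOURCES statement:

-- After the drop, spark_resource no longer appears in FE metadata ...
SHOW RESOURCES;
-- ... yet the __spark_repository__spark_resource directory and its
-- __lib_* files still exist on HDFS, as listed above.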

StarRocks version (Required)

  • You can get the StarRocks version by executing the SQL statement select current_version(), as shown below.
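
For reference, the version check mentioned above:

SELECT current_version();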

blanklin030 · May 20 '24 03:05