carbondata icon indicating copy to clipboard operation
carbondata copied to clipboard

[CARBONDATA-3985] Optimize the segment-timestamp file clean up

Open su-article opened this issue 4 years ago • 12 comments

Why is this PR needed?

For data update, in the CarbonProjectForUpdateCommand process, after the delete delta file is generated, the status of each segment is checked. If the status is not successful, all the segment directories are traversed to clean up the timestamp corresponding .carbondata, .carbonindex and .deletedelta files. If a great many segments have been generated in the Partion directory, it will be very time-consuming.

What changes were proposed in this PR?

In the process of generating delete delta, record the segment path involved in this update; after entering the checkAndUpdateStatusFiles() function, if a segment status is found to be not successful, it will be cleaned directly according to the segment path list that has been recorded during generating delete delta, without searching all the segment directories.

Does this PR introduce any user interface change?

  • No

Is any new testcase added?

  • No

su-article avatar Sep 14 '20 06:09 su-article

Can one of the admins verify this patch?

CarbonDataQA1 avatar Sep 14 '20 06:09 CarbonDataQA1

Add to whitelist

ajantha-bhat avatar Sep 14 '20 07:09 ajantha-bhat

retest this please

ajantha-bhat avatar Sep 14 '20 07:09 ajantha-bhat

add to whitelist

akashrn5 avatar Sep 14 '20 08:09 akashrn5

@su-article as i can see from code, the delete status which is deleteStatus is updated in deleteDeltaFunc function and inside the executor, and we are returning the result in tuple.

Observations:

  1. In deleteDeltaFunc function, the deleteStatus is updated to success, if everything passes, or else it will be as failure(default value). If any exception occurs, we catch and throw from the executor, which is complete delete operation failure.
  2. Once the exception is thrown from the function, it's directly caught on the driver side at CarbonProjectForUpdateCommand or CarbonProjectForDeleteCommand catch blocks.
  3. The code never hits the checkAndUpdateStatusFiles in failure case of delete, as you have mentioned in description.
  4. The CarbonUpdateUtil.cleanStaleDeltaFiles is called from the catch block of update and delete command classes, where we pass the currentTimestamp which is the factTimeStamp for update and delete to delete the stale files in case of any failure.

So, from my analysis, i think you are handling this in place of dead codes and it wont improve anything.

can you please check the above points from your testing and code review? Correct me if i'm wrong in the analysis.

You may need to handle in cleanStaleDeltaFiles call from both the command classes, instead of the existing change. Please check and reply.

akashrn5 avatar Sep 14 '20 08:09 akashrn5

@marchpure can you please check my comment and give your opinion?

akashrn5 avatar Sep 14 '20 08:09 akashrn5

Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4062/

CarbonDataQA1 avatar Sep 14 '20 09:09 CarbonDataQA1

Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2323/

CarbonDataQA1 avatar Sep 14 '20 10:09 CarbonDataQA1

Actually, we shall not do the clean work during UPDATE/DELETE. I think we can directly remove this code to avoid any possiblity of data lose.

@marchpure can you please check my comment and give your opinion?

marchpure avatar Sep 15 '20 03:09 marchpure

Actually, we shall not do the clean work during UPDATE/DELETE. I think we can directly remove this code to avoid any possiblity of data lose.

@marchpure can you please check my comment and give your opinion?

@marchpure i agree that, it will create some serious problems in some corner cases. But if we remove the cleaning code now, there can issues of data mismatch or wrong data, which is also a serious one. So basically we need to decide how to delete correctly and at what time. For this already @vikramahuja1001 have raised a discussion, can you please check and give inputs there?. You can check here http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Clean-files-enhancement-tt100088.html

akashrn5 avatar Sep 15 '20 04:09 akashrn5

Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4089/

CarbonDataQA1 avatar Sep 16 '20 03:09 CarbonDataQA1

Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2349/

CarbonDataQA1 avatar Sep 16 '20 03:09 CarbonDataQA1