[CARBONDATA-3985] Optimize the segment-timestamp file clean up
Why is this PR needed?
For a data update, in the `CarbonProjectForUpdateCommand` flow, after the delete delta files are generated, the status of each segment is checked. If a status is not successful, all segment directories are traversed to clean up the `.carbondata`, `.carbonindex` and `.deletedelta` files corresponding to the update timestamp. If a great many segments have accumulated in the partition directory, this traversal is very time-consuming.
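To make the cost concrete, here is a minimal, hypothetical sketch of this cleanup pattern using plain `java.nio.file` calls; the object and method names are illustrative, not CarbonData's actual `CarbonUpdateUtil` code:

```scala
import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._

// Hypothetical sketch of the existing cleanup pattern (not CarbonData's
// actual code): walk EVERY segment directory under the table path and
// delete the files whose names carry the failed update's timestamp.
object TimestampCleanupSketch {
  def cleanByTimestamp(tablePath: Path, timestamp: String): Unit = {
    val stream = Files.walk(tablePath)
    try {
      stream.iterator().asScala
        .filter(p => Files.isRegularFile(p))
        .filter { p =>
          val name = p.getFileName.toString
          // only files written with this update's timestamp are stale
          name.contains(timestamp) &&
            (name.endsWith(".carbondata") || name.endsWith(".carbonindex") ||
              name.endsWith(".deletedelta"))
        }
        .foreach(p => Files.deleteIfExists(p))
    } finally stream.close() // Files.walk streams must be closed
  }
}
```

The walk touches every file under the table path, so its cost grows with the total number of segments, which is exactly why it becomes slow once many segments accumulate under a partition directory.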
What changes were proposed in this PR?
While generating the delete delta, record the segment paths involved in this update. Then, inside `checkAndUpdateStatusFiles()`, if a segment status is found to be not successful, clean up directly using the segment path list recorded during delta generation, instead of searching all segment directories.
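A minimal sketch of the idea (names and structure are illustrative, not the PR's actual code), reusing the timestamp filter from the sketch above:

```scala
import java.nio.file.Paths
import scala.collection.mutable

// Hypothetical sketch of the proposed change: remember which segments this
// update actually touched, and clean only those on failure.
object RecordedSegmentCleanupSketch {
  private val touchedSegments = mutable.LinkedHashSet.empty[String]

  // called as each delete delta file is generated
  def recordSegment(segmentPath: String): Unit =
    touchedSegments += segmentPath

  // called from checkAndUpdateStatusFiles() when a segment status is not
  // successful: O(segments touched by this update), not O(all segments)
  def cleanOnFailure(timestamp: String): Unit =
    touchedSegments.foreach { seg =>
      TimestampCleanupSketch.cleanByTimestamp(Paths.get(seg), timestamp)
    }
}
```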
Does this PR introduce any user interface change?
- No
Is any new testcase added?
- No
Can one of the admins verify this patch?
Add to whitelist
retest this please
add to whitelist
@su-article As I can see from the code, the delete status (`deleteStatus`) is updated in the `deleteDeltaFunc` function inside the executor, and we return the result in a tuple.
Observations:
- In the `deleteDeltaFunc` function, the `deleteStatus` is updated to success if everything passes; otherwise it stays at failure (the default value). If any exception occurs, we catch and rethrow it from the executor, which means the whole delete operation fails.
- Once the exception is thrown from the function, it is caught directly on the driver side in the `CarbonProjectForUpdateCommand` or `CarbonProjectForDeleteCommand` catch blocks.
- In the failure case of a delete, the code never reaches `checkAndUpdateStatusFiles`, as you have mentioned in the description.
- `CarbonUpdateUtil.cleanStaleDeltaFiles` is called from the catch blocks of the update and delete command classes, where we pass the `currentTimestamp` (the `factTimeStamp` of the update/delete) to delete the stale files in case of any failure.
So, from my analysis, I think you are handling this in dead code and it won't improve anything.
Can you please check the above points in your testing and code review? Correct me if I'm wrong in the analysis.
You may need to handle this in the `cleanStaleDeltaFiles` call from both command classes, instead of the existing change. Please check and reply.
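For clarity, here is a minimal control-flow sketch of the failure path described above; the function names are taken from the discussion, but their signatures are assumptions:

```scala
// Hypothetical sketch of the driver-side flow the review describes:
// an executor failure never reaches checkAndUpdateStatusFiles; it lands in
// the command's catch block, where cleanStaleDeltaFiles(factTimeStamp) runs.
object UpdateCommandFlowSketch {
  def run(factTimeStamp: Long,
          writeDeleteDelta: () => Unit,          // may throw on executor failure
          checkAndUpdateStatusFiles: () => Unit, // success path only
          cleanStaleDeltaFiles: Long => Unit): Unit = {
    try {
      writeDeleteDelta()
      checkAndUpdateStatusFiles() // never reached when the delta write fails
    } catch {
      case e: Exception =>
        cleanStaleDeltaFiles(factTimeStamp) // stale-file removal on failure
        throw e
    }
  }
}
```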
@marchpure can you please check my comment and give your opinion?
Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4062/
Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2323/
Actually, we shall not do the cleanup work during UPDATE/DELETE. I think we can directly remove this code to avoid any possibility of data loss.
@marchpure I agree that it will create some serious problems in some corner cases. But if we remove the cleaning code now, there can be issues of data mismatch or wrong data, which is also serious. So basically we need to decide how to delete correctly, and at what time. @vikramahuja1001 has already raised a discussion on this; can you please check it and give your inputs there? You can find it here: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Clean-files-enhancement-tt100088.html
Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4089/
Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2349/