dirac-dms-remove-catalog-replicas should remove LFN from file catalog if it deletes the last replica of a file

Open marianne013 opened this issue 2 years ago • 2 comments

Related to https://github.com/DIRACGrid/DIRAC/issues/7075

While this is not the only way this can happen, we think the most likely cause for a database entry for a file with zero replicas was the following sequence of events:

  • A number of files were lost on disk
  • The file catalogue was tidied up using dirac-dms-remove-catalog-replicas specifying the misbehaving storage element
  • If a lost file was the last replica, dirac-dms-remove-catalog-replicas leaves an entry in the database behind
  • This entry can be removed with dirac-dms-remove-catalog-files, but given an (all too common) scenario of files being lost at a specific storage element, dirac-dms-remove-catalog-replicas seems the obvious choice of tool to tidy up the catalogue, as other replicas are unaffected.
  • This then would require an extra check by the user to see if it was the last replica. They will not be aware of this. The differences in output are also rather subtle (see below) and do not indicate a problem even to an experienced admin.

This scenario can be reproduced with the following sequence:

(base) gridpp_py3 > dirac-dms-add-file /t2k.org/user/d/reptest.txt reptest.txt UKI-LT2-IC-HEP-disk

Uploading /t2k.org/user/d/reptest.txt
Successfully uploaded file to UKI-LT2-IC-HEP-disk

[now sneakily delete file on disk to simulate storage meltdown]

Try to clean up the catalog:

(base) gridpp_py3 > dirac-dms-remove-catalog-replicas /t2k.org/user/d/reptest.txt UKI-LT2-IC-HEP-disk
Successfully remove 1 catalog replicas at UKI-LT2-IC-HEP-disk

It still knows the LFN:

(base) gridpp_py3 > dirac-dms-lfn-replicas /t2k.org/user/d/reptest.txt
No output

That's rather subtle and you only notice that something is amiss once you realize that the output for a truly non-existent LFN is different:

(base) gridpp_py3 > dirac-dms-lfn-replicas /t2k.org/user/d/reptest.txt0
LFN                          StorageElement URL
===============================================
/t2k.org/user/d/reptest.txt0  Unknown        No such file or directory

Apply the nuclear option:

(base) gridpp_py3 > dirac-dms-remove-catalog-files /t2k.org/user/d/reptest.txt
Successfully removed 1 catalog files.

(base) lx04:2023_May_16_1234_gridpp_py3 > dirac-dms-lfn-replicas /t2k.org/user/d/reptest.txt
LFN                         StorageElement URL
==============================================
/t2k.org/user/d/reptest.txt Unknown        No such file or directory

There are probably other ways database entries can get into this state, but this is one of the more likely scenarios.
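To make the three states in the transcript above explicit, here is a minimal sketch of how they could be told apart. It uses a plain dict as a stand-in for the file catalogue; the names `classify_lfn` and `find_orphaned_lfns` and the state labels are made up for illustration and are not DIRAC APIs.

```python
from typing import Dict, List

# Hypothetical stand-in for the file catalogue: LFN -> {storage element: URL}.
# An LFN mapped to an empty dict is registered but has zero replicas.
Catalog = Dict[str, Dict[str, str]]


def classify_lfn(catalog: Catalog, lfn: str) -> str:
    """Return 'missing', 'orphaned' (registered, zero replicas) or 'ok'."""
    if lfn not in catalog:
        return "missing"
    return "ok" if catalog[lfn] else "orphaned"


def find_orphaned_lfns(catalog: Catalog) -> List[str]:
    """List every registered LFN that has no replicas left."""
    return sorted(lfn for lfn, replicas in catalog.items() if not replicas)
```

The "orphaned" case is the one that currently shows up as "No output", while "missing" corresponds to "No such file or directory".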

Can you please fix the following issues:

  • [ ] dirac-dms-remove-catalog-replicas should delete the file catalog entry if the replica it removes is the last of its kind (1 bonus point)
  • [x] Instead of returning "No output" dirac-dms-lfn-replicas should return an error if there are zero replicas of a file as this is an error state (i.e. not foreseen in the DIRAC code) (1 bonus point)
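As a rough sketch of the behaviour requested in the two items above (plain Python on a dict standing in for the catalogue; `remove_catalog_replica`, `list_replicas` and `ZeroReplicasError` are hypothetical names, not existing DIRAC functions):

```python
from typing import Dict

# Hypothetical stand-in for the file catalogue: LFN -> {storage element: URL}.
Catalog = Dict[str, Dict[str, str]]


class ZeroReplicasError(Exception):
    """Raised when an LFN is registered but has no replicas."""


def remove_catalog_replica(catalog: Catalog, lfn: str, se: str) -> bool:
    """Remove the replica at `se`; drop the LFN entry if it was the last one.

    Returns True if the LFN entry itself was removed as well.
    """
    replicas = catalog[lfn]   # KeyError if the LFN is unknown
    del replicas[se]          # KeyError if there is no replica at this SE
    if not replicas:
        # Last replica gone: do not leave a zero-replica entry behind.
        del catalog[lfn]
        return True
    return False


def list_replicas(catalog: Catalog, lfn: str) -> Dict[str, str]:
    """Return the replicas of `lfn`, treating zero replicas as an error."""
    if lfn not in catalog:
        raise KeyError(f"No such file or directory: {lfn}")
    if not catalog[lfn]:
        raise ZeroReplicasError(f"{lfn} is registered but has zero replicas")
    return dict(catalog[lfn])
```

With this logic the reproduction above would end with the LFN gone from the catalogue after the single remove-replicas call, and any entry that still ended up with zero replicas would be reported as an error.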

Once your bonus stamp card is full, you can claim your free beer.

Tagging @sfayer so he knows it's all filed.

marianne013, Jun 23 '23 14:06

Part of it is in https://github.com/DIRACGrid/DIRAC/pull/7077/commits/f6f0e37504c97e6a38549c40397ddbf15addebe2

For remove-catalog-replicas, why not just use dirac-dms-remove-replicas? It will tell you if you are trying to delete the last replica.

Also, this behavior is rather inconsistent with the rest of the system, which prevents you from calling removeReplicas on the last replica.

chaen, Jun 24 '23 21:06

Sorry, my google filter ate the reply. dirac-dms-remove-replicas was ruled out as it would not be able to remove the replica as the file is already gone on disk.

marianne013, Jun 29 '23 13:06