Request to rework the pool's checksum module
Hello dCache.org,
in the past two weeks we've had two major incidents with our storage setup. It was not the fault of dCache, no worries there. 😉 After we had recovered from the incidents, we wanted to let dCache validate the data it holds by using the checksum module. We quickly realized that this feature in its current state is of little to no help here. These are the major issues we have with it:
- `csm` currently has just two modes of operation: either check a single file or the entire inventory of a pool.
  - You cannot give a list of files or maybe a storage class. So if you want to check ten specific files, you have to start the process for one file and check regularly whether the task is done before starting the next one. That is acceptable only for a very small number of files and doesn't scale at all.
  - The other extreme is to validate the entire inventory with zero customization. `csm` will work through the entire catalog sequentially, with no parallelization, so it will take ages with modern pool sizes (>1 PB for us).
- As mentioned before, `csm` works in the background asynchronously. That's desirable for a large batch of files and a nuisance for small numbers.
- The pool does not inform the admin about the current state or progress. We have to poll that regularly ourselves. That is fine for long-running tasks, I guess, yet the information provided is still insufficient.
- Scanning all files may abort with any exception. It depends on the error whether we can find out what the actual problem was, or even which file was affected. There is no way to resume from where the scan aborted, or to tell `csm` to keep going anyway so as to process all files eventually (even if it takes weeks). You can only try to fix the problem and start over from the beginning.
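For reference, this is roughly what our one-file-at-a-time workaround looks like when scripted (a sketch only; `run_admin` is a hypothetical helper that forwards a single command to the pool via the admin interface, e.g. over ssh, and the terminal status strings are taken from the `csm status` output we observed):

```python
import time

def check_files(pnfsids, run_admin, poll_interval=60):
    """Check a list of PNFS-IDs one at a time, waiting for each to finish.

    run_admin is a hypothetical helper: it sends one admin-shell command
    to the pool (e.g. over the ssh admin interface) and returns the output
    as a string.
    """
    results = {}
    for pnfsid in pnfsids:
        run_admin("csm check %s" % pnfsid)
        while True:
            status = run_admin("csm status")
            # Terminal states we have observed are "Idle" and "Aborted";
            # keep polling until one of them shows up.
            if "Idle" in status or "Aborted" in status:
                results[pnfsid] = status
                break
            time.sleep(poll_interval)
    return results
```

This is exactly the kind of busy-polling loop that doesn't scale, which is the point of the complaint above.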
With our most recent incident we encountered two situations where the full pool scan aborts:
- A file has an unexpected file size. `csm status` shows this:

  ```
  FullScan Aborted (failure in underlying storage: failed to read 00009EE4DF198CF1468B8E99180D8CFA9522: Failed to read the file, because file is Broken.)
  1038 files: 0 corrupt, 1 unable to check
  ```

  - The file is marked bad by the pool. I'm not entirely sure the file wasn't already marked bad when `csm` got to it, but at least now it is.
  - The FullScan aborted, as the message says, after 1038 files. Which files were those? No clue. Can I skip recalculating the checksum for those in order to also validate the other x million files? Doesn't look like it.
  - 1 file was unable to be checked. Which file? For what reason? `csm show errors` yields nothing: neither the broken file nor the file that was uncheckable.
  - So we only know that at least 1037 files have a correct checksum.
- We actually don't know what the problem was when `csm` aborted this full scan:

  ```
  FullScan Idle
  java.lang.IllegalArgumentException: No expected checksums
  27280 files: 0 corrupt, 0 unable to check
  ```

  - I interpreted it such that a file was found for which dCache didn't know a checksum beforehand. That should be impossible, but just maybe? So I checked whether Chimera knew any files for that pool that actually had no checksum, but found nothing.
  - OK, then maybe that was a glitch and next time things will go better? So I restarted `csm check *` (because we cannot resume), but the same issue occurred again. We don't know whether the same file was the breaking point or not, because `csm` doesn't tell us.
Because `csm` does such a poor job for us (sorry for the harsh criticism), we have now resorted to calculating the checksums ourselves. Once that is done, we'll also have to compare them to what Chimera knows ourselves, but at least there is some progress.
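For anyone in the same situation: since dCache's default checksum type is ADLER32, recomputing the checksums outside of dCache can be done with zlib's running `adler32`. A sketch, assuming replicas live in the pool's data directory named by their PNFS-ID (which is how pools store replicas on disk, as far as we can tell):

```python
import os
import zlib

def adler32_of_file(path, chunk_size=1 << 20):
    """Compute the ADLER32 checksum of a file, streaming in 1 MiB chunks."""
    value = zlib.adler32(b"")  # initial Adler-32 value (1)
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            value = zlib.adler32(chunk, value)
    return "%08x" % (value & 0xFFFFFFFF)

def scan_pool_data(data_dir):
    """Yield (pnfsid, checksum) pairs for every replica in the data dir."""
    for name in sorted(os.listdir(data_dir)):
        path = os.path.join(data_dir, name)
        if os.path.isfile(path):
            yield name, adler32_of_file(path)
```

The resulting list can then be compared against the checksums Chimera has stored for those files.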
Here are a couple of ideas/proposals for how to improve `csm`:
- If it wasn't clear by now, we need more versatility overall:
  - More modes of operation: providing a list of PNFS-IDs at minimum, but a more versatile filter like the one of the `migration` tool would be much appreciated.
  - `csm` has to be able to work multithreaded. Alternatively, it should be possible to start several `csm` tasks, again similar to the `migration` tool (which only makes sense when we can filter the files to be checked).
  - Most importantly, `csm` should complete the task and not abort when it finds a problem with a file (other than a checksum mismatch)! A lot of time wasted waiting/idling, and on restarts processing some files multiple times over, could be avoided that way.
- We need more details from `csm status`: How many files are to be checked? At what rate is `csm` working? How far has it progressed already? Once more, the `migration` tool does a very good job in this regard, too.
- In my humble opinion, `csm` should be able to write out all of its findings, maybe with a configurable level of detail, into a logfile (not the pool's logfile).
Again, I'm sorry to bash on `csm` so much. It is a valuable tool and has been for the longest time. But in order to stay relevant it needs a bit of ❤️ 🙂
Thank you for your time and effort, Xavier.
/cc @nsc-jens, @samuambroj , @cfgamboa
I didn't get the ping for some reason, but here are my comments:
I too have problems with the single-file check in the `csm` module, with the same use case: external issues on the underlying file system. In my case I want to check files touched within a small time frame.
Two possible solutions: give `csm check` a flag to run in the foreground and report the status of the checked file, so I could script this; or give `csm check` the ability to take multiple files as arguments (`*|PNFSID [PNFSID]`). Both would solve my problems, provided the output from `csm status` reports something reasonable (like which files didn't match the checksum).
On another note: you can start a `csm check *` checksum operation, but I don't think you can abort it without restarting the pool. If the code is touched anyway, perhaps this could be implemented.
If a file has two checksums, `csm check` would verify both of them. Should this command select the checksum type to check?
> if the file has two checksums `csm check` would verify both of them. Should this command select the checksum type to check?
I don't really know. I would assume that it would check all.
Hi Xavier, we've also faced the "no expected checksums" problem, and the files affected were those whose upload had not finished and which remained with the temporary name "/upload/...."
See below an example of how we find the files:
```
chimera=# select count(*) from t_locationinfo
            where t_locationinfo.ilocation='pool_name'
              and t_locationinfo.inumber not in
                  (select t_locationinfo.inumber
                     from t_locationinfo, t_inodes_checksum
                    where t_locationinfo.inumber=t_inodes_checksum.inumber
                      and t_locationinfo.ilocation='pool_name');
```

```
chimera=# select ipnfsid, inumber2path(inumber) from t_inodes
            where inumber in
                  (select t_locationinfo.inumber from t_locationinfo
                    where t_locationinfo.ilocation='dc023_2'
                      and t_locationinfo.inumber not in
                          (select t_locationinfo.inumber
                             from t_locationinfo, t_inodes_checksum
                            where t_locationinfo.inumber=t_inodes_checksum.inumber
                              and t_locationinfo.ilocation='dc023_2'));
               ipnfsid                |                              inumber2path
--------------------------------------+--------------------------------------------------------------------------
 0000E207FF0AB45B49BD9C328994EE9EA5D7 | /upload/15/ba724d4f-1ea3-4b27-bec4-0c5619554bff/00155700_00030184_1.sim
 000020C76A3405464EC4A598AE5F853E0A6C | /upload/2/f792e67a-714a-4818-b827-bda75c0f720f/00156798_00024585_1.sim
 0000A3AB707DAB534AED94984319E2B0BA04 | /upload/5/3319990e-13db-4253-8b02-78e70828d9c7/00156789_00015844_1.sim
 0000EFB384FD546E4A01820EA53C648D6248 | /upload/2/c73a4853-a5aa-4e47-823b-fec57fa8b1f7/00156786_00010064_1.sim
 00004B2CDE2431CB4182AD30FDF64D71BBF0 | /upload/5/df174ae5-3dbe-4207-a029-0bd406b16083/00155791_00023104_1.sim
 [...]
```
Once the files are identified, we delete them with `chimera rm`.
We discussed this issue today. Here are some examples of the wishes for functionality. Suggestions only.
- `csm check * -size=0`
- `csm check * -size=..1000000000` like the migration module, but perhaps with suffixes, i.e. `1000M..10GB`
- `csm check * -accessed=..3600` like the migration module, or perhaps something more evolved (ctime?).
- `csm check * -storage=snic:chimera`
- `csm check 0000A3AB707DAB534AED94984319E2B0BA04 -foreground` or `csm check 0000A3AB707DAB534AED94984319E2B0BA04 -wait`
Maybe also being able to check files that are not popular or recently accessed.
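To illustrate the suggested `-size` syntax with suffixes, a parser for ranges like `1000M..10GB` could look like this (a sketch only; the accepted suffixes, and whether they are decimal or binary, would be up to the implementers — decimal is assumed here):

```python
import re

# Decimal suffixes, as an assumption; the implementation would pick a convention.
_SUFFIXES = {"": 1, "K": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}

def parse_size(text):
    """Parse '10GB', '1000M' or a plain number of bytes like '1000000000'."""
    m = re.fullmatch(r"(\d+)([KMGT]?)B?", text.strip().upper())
    if not m:
        raise ValueError("bad size: %r" % text)
    return int(m.group(1)) * _SUFFIXES[m.group(2)]

def parse_size_range(text):
    """Parse 'a..b', '..b', 'a..' or a single value into (min, max) bytes.

    A max of None means unbounded, matching the open-ended range syntax.
    """
    if ".." in text:
        lo, hi = text.split("..", 1)
        return (parse_size(lo) if lo else 0,
                parse_size(hi) if hi else None)
    n = parse_size(text)
    return (n, n)
```

With this, `-size=0` selects exactly the zero-length files, and `-size=..1000000000` everything up to 1 GB.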
Thanks for the suggestions @nsc-jens and @cfgamboa
To be honest, I'm a little unclear about `csm check * -size=0`. Would this be to check all zero-length files?
Just to add my 2c-worth.
I do see running a single-file checksum and reporting via the console as a useful feature. Currently, the admin interface isn't really "tuned" for long-running jobs; for example, one cannot abort/interrupt an admin command, once started, nor run admin commands "in the background" (like with Unix shells) and continue with other work. These may be things we would want to look at, before implementing a `-foreground` or `-wait` option.
On @XMol's comment about not being informed about problems with a background checksum: I think the existing (built-in) alert mechanism might be one possible solution there.
He also mentioned a list of small-but-very-annoying problems. These should "just" be fixed.
Aborting ongoing checksum runs. Sure, why not?
@elenamplanas The files in the /upload directory are ongoing SRM-mediated uploads or xroot upload with the kXR_posc option. Once the upload is complete, the file is moved to the destination directory (if successful) and the upload-specific directory is removed. For SRM, the transfer is complete only if the client calls srmPutDone; if not, then the file should be garbage-collected once the upload times out. With either SRM or xroot, it's not possible for a client to specify an expected checksum. Therefore, dCache will only know about the checksum of the file once it is successfully uploaded. The expectation is that the client will verify the checksum (once uploaded) and remove the file if corrupt.
Yes, `csm check * -size=0` would check all files with size zero. A fault scenario we see sometimes is:
- Writes are ongoing to the pool.
- A power outage / machine crash / RAID controller hiccup happens.
- Machine and dCache restarted.
- For some reason the written files are zero length in the file system, but the content of the local meta database disagrees. I haven't been able to check whether the registered checksum is wrong or whether the registered file size (is it registered in the local database?) is wrong. At least the pool startup doesn't detect this problem.
- The file is read by a client somewhere.
- dCache detects that the file is broken, marks the file as broken and disables the pool for self protection (Broken storage system!).
Being able to run a `csm check` of all zero-byte files (according to the file system!) or by age of the file would be a manual way to clean up the pool after a crash and prevent some of these automatic shutdowns. Today we are limited to running a check on all files (`csm check *`), which takes a very long time with 500 TB+ pools. Or running a pretty lengthy `find` on a file system with many millions of files to get a list of recent and/or zero-byte files, and then running a pretty complicated `csm check` operation to check these.
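That manual clean-up (find zero-length or recently modified replicas, then feed them to `csm check`) can be scripted; a sketch, assuming replicas sit directly in the pool's data directory named by PNFS-ID:

```python
import os
import time

def suspect_replicas(data_dir, max_age=None, zero_length=False):
    """List PNFS-IDs of replicas that are zero-length and/or recently modified.

    max_age: only report files modified within the last max_age seconds.
    zero_length: only report files whose size on disk is zero.
    """
    now = time.time()
    suspects = []
    for name in os.listdir(data_dir):
        path = os.path.join(data_dir, name)
        if not os.path.isfile(path):
            continue
        st = os.stat(path)
        if zero_length and st.st_size != 0:
            continue
        if max_age is not None and now - st.st_mtime > max_age:
            continue
        suspects.append(name)
    return suspects

def emit_commands(pnfsids):
    """One admin command per suspect, to be pasted into the admin shell."""
    return ["csm check %s" % p for p in pnfsids]
```

This is essentially what the proposed `-size=0` and `-accessed=..3600` filters would do server-side, without the external `find` pass.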
The worst case so far was 10 pools with a huge CEPH backend. It seems that writes were acked and everything looked fine, but the data wasn't flushed to stable storage until 5 minutes later. So after a spectacular crash, all data written in the past 5 minutes was lost. We used `find` (took a full day) to get a list of all files that were touched within those 5 minutes, and made a script that checked one file, waited 2 minutes, then checked a new file; afterwards we manually checked the few files that failed the operation (`csm` already running). It seems that a pool isn't disabled when a `csm check` command is run on a broken file: the file is marked as broken, and when a read request later comes in for a file marked as broken, the request is just rejected with the pool left running.
PS. Perhaps `csm check *` should be `csm check all` instead? There is no wildcard matching going on here, I think.
Thanks @nsc-jens for the background!
What you describe sounds like a failure of the pool's start-up. Prima facie, I would have expected the pool to detect these problem files on start-up, rather than when a client attempted to read the file. The pool disabling itself is a panic reaction when (what the pool believed to be) a valid file is suddenly found to be corrupt (the pool doesn't know if it is causing this corruption, so disables itself). To me, this is a consequence of the first problem: the inconsistency was not detected during the pool's start-up.
I would say this should be a separate issue, as it's not really a checksum issue.
Further, I think adding a csm check for this scenario is treating the symptom, rather than the underlying problem. Sure, we could do it; however, you would be hoping that the checksum command would find any corrupt data before a client attempts to open that file. Such a race between two components doesn't sound like a good idea.
On the more general point of CEPH delaying writing of data.
To be honest, I'm not really sure what to make of a file-based storage system that doesn't guarantee data integrity until five minutes have elapsed. (Is this a worst-case, average, or guaranteed delay?) One strategy would be for dCache to delay the final OK response to the client (e.g., the 201 response status code for an HTTP PUT request) until after CEPH guarantees the data is stored. That would make sense, but may result in clients timing out.
Perhaps this could be the topic of a different issue. I'm not convinced this is really an issue with the pool's checksum module: I think it goes deeper.