dcache icon indicating copy to clipboard operation
dcache copied to clipboard

billing: some remove messages lack DN/FQAN information

Open calestyo opened this issue 11 months ago • 4 comments

Hey.

We recently had an issue where files got "missing" but where it turned out that most likely ATLAS had just deleted them earlier.

Alongside of debugging that, I noticed that for many "door-request-info" message we don't see any DN/FQAN in the billing log entry, despite being configured in our billing log format strings:

billing.text.format.door-request-info-message=$date; format="${lmu.miscellaneous.date-time-format}"$$$$\\t$drMsg:$\\t$[$cellType$:$cellName.cell$@$cellName.domain$:$type$]$\\t$[$session$]$\\t$[$pnfsid$:$path$]$\\t$$$$filesize$B$\\t$[$if(storage)$$$$storage.storageClass$@$storage.hsm$$$$else$<unknown>$endif$]$\\t$[$subject.loginName$]$\\t$[$subject.dn$]$\\t$[[$subject.primaryFqan$]:[$subject.fqans; separator="|"$]]$\\t$[$subject.userName$]$\\t$[$subject.uid$]$\\t$[$subject.primaryGid$:$subject.gids; separator="|"$]$\\t$$$$queuingTime$ms$\\t$[$clientChain$]$\\t$$$$transactionTime$ms$\\t$[$transferPath$]$\\t$[$rc$:"$message$"]
billing.text.format.pool-hit-info-message=$date; format="${lmu.miscellaneous.date-time-format}"$$$$\\t$phMsg:$\\t$[$cellType$:$cellName.cell$@$cellName.domain$:$type$]$\\t$[$session$]$\\t$[$pnfsid$:$path$]$\\t$$$$filesize$B$\\t$[$if(storage)$$$$storage.storageClass$@$storage.hsm$$$$else$<unknown>$endif$]$\\t$[$subject.loginName$]$\\t$[$subject.dn$]$\\t$[[$subject.primaryFqan$]:[$subject.fqans; separator="|"$]]$\\t$[$subject.userName$]$\\t$[$subject.uid$]$\\t$[$subject.primaryGid$:$subject.gids; separator="|"$]$\\t$$$$queuingTime$ms$\\t$[$protocol$]$\\t$$$$if(cached)$cached$else$not-cached$endif$$$$\\t$[$transferPath$]$\\t$[$rc$:"$message$"]
billing.text.format.storage-info-message=$date; format="${lmu.miscellaneous.date-time-format}"$$$$\\t$sMsg:$\\t$[$cellType$:$cellName.cell$@$cellName.domain$:$type$]$\\t$[$session$]$\\t$[$pnfsid$:$path$]$\\t$$$$filesize$B$\\t$[$if(storage)$$$$storage.storageClass$@$storage.hsm$$$$else$<unknown>$endif$]$\\t$[$subject.loginName$]$\\t$[$subject.dn$]$\\t$[[$subject.primaryFqan$]:[$subject.fqans; separator="|"$]]$\\t$[$subject.userName$]$\\t$[$subject.uid$]$\\t$[$subject.primaryGid$:$subject.gids; separator="|"$]$\\t$$$$queuingTime$ms$\\t$$$$transferTime$ms$\\t$[$rc$:"$message$"]
billing.text.format.mover-info-message=$date; format="${lmu.miscellaneous.date-time-format}"$$$$\\t$mMsg:$\\t$[$cellType$:$cellName.cell$@$cellName.domain$:$type$]$\\t$[$session$]$\\t$[$pnfsid$:$path$]$\\t$$$$filesize$B$\\t$[$if(storage)$$$$storage.storageClass$@$storage.hsm$$$$else$<unknown>$endif$]$\\t$[$subject.loginName$]$\\t$[$subject.dn$]$\\t$[[$subject.primaryFqan$]:[$subject.fqans; separator="|"$]]$\\t$[$subject.userName$]$\\t$[$subject.uid$]$\\t$[$subject.primaryGid$:$subject.gids; separator="|"$]$\\t$$$$queuingTime$ms$\\t$[$protocol$]$\\t$[$initiator$]$\\t$$$$if(p2p)$p2p$else$no-p2p$endif$$$$\\t$$$$if(created)$upload$else$download$endif$$$$\\t$$$$transferred$B$\\t$$$$meanReadBandwidth$B/s$\\t$$$$meanWriteBandwidth$B/s$\\t$$$$connectionTime$ms$\\t$$$$readActive$ms$\\t$$$$readIdle$ms$\\t$$$$writeActive$ms$\\t$$$$writeIdle$ms$\\t$[$transferPath$]$\\t$[$rc$:"$message$"]
billing.text.format.remove-file-info-message=$date; format="${lmu.miscellaneous.date-time-format}"$$$$\\t$rfMsg:$\\t$[$cellType$:$cellName.cell$@$cellName.domain$:$type$]$\\t$[$session$]$\\t$[$pnfsid$:$path$]$\\t$$$$filesize$B$\\t$[$if(storage)$$$$storage.storageClass$@$storage.hsm$$$$else$<unknown>$endif$]$\\t$[$subject.loginName$]$\\t$[$subject.dn$]$\\t$[[$subject.primaryFqan$]:[$subject.fqans; separator="|"$]]$\\t$[$subject.userName$]$\\t$[$subject.uid$]$\\t$[$subject.primaryGid$:$subject.gids; separator="|"$]$\\t$$$$queuingTime$ms$\\t$[$rc$:"$message$"]
billing.text.format.warning-pnfs-file-info-message=$date; format="${lmu.miscellaneous.date-time-format}"$$$$\\t$wpfMsg:$\\t$[$cellType$:$cellName.cell$@$cellName.domain$:$type$]$\\t$[$session$]$\\t$[$pnfsid$:$path$]$\\t$$$$filesize$B$\\t$[$if(storage)$$$$storage.storageClass$@$storage.hsm$$$$else$<unknown>$endif$]$\\t$[$subject.loginName$]$\\t$[$subject.dn$]$\\t$[[$subject.primaryFqan$]:[$subject.fqans; separator="|"$]]$\\t$[$subject.userName$]$\\t$[$subject.uid$]$\\t$[$subject.primaryGid$:$subject.gids; separator="|"$]$\\t$$$$queuingTime$ms$\\t$[$transferPath$]$\\t$[$rc$:"$message$"]

A concrete log entry, where the info is missing looks like:

billing-2023.07.01:2023-07-01 04:24:13+02:00	drMsg:	[door:webdav.tls_lcg-lrz-dc36@webdav_lcg-lrz-dc36:remove]	[door:webdav.tls_lcg-lrz-dc36@webdav_lcg-lrz-dc36:1688178253327-490853]	[0000841D6A56EDE047699BA6A6B39B2EDEC8:/pnfs/lrz-muenchen.de/data/atlas/dq2/atlasscratchdisk/rucio/panda/28/b5/panda.um.user.hiito.33917626._000421.hist-output.root]	10018B	[<unknown>]	[]	[]	[[]:[]]	[prdatl01]	[50201]	[1307:1307]	0ms	[unknown]	0ms	[/pnfs/lrz-muenchen.de/data/atlas/dq2/atlasscratchdisk/rucio/panda/28/b5/panda.um.user.hiito.33917626._000421.hist-output.root]	[0:""]

one where it's not missing like:

billing-2023.07.05:2023-07-05 21:10:52+02:00	drMsg:	[door:webdav.tls_lcg-lrz-dc35@webdav_lcg-lrz-dc35:remove]	[door:webdav.tls_lcg-lrz-dc35@webdav_lcg-lrz-dc35:1688584252881-574377]	[000079F49D6B2CA2473791F504BC9E98729B:/pnfs/lrz-muenchen.de/data/ops/test_webdav_access_84_moved]	121B	[<unknown>]	[]	[/DC=EU/DC=EGI/C=GR/O=Robots/O=Greek Research and Technology Network/CN=Robot:[email protected]]	[[/ops]:[/ops]]	[ops001]	[45001]	[45000:45000]	0ms	[unknown]	0ms	[/pnfs/lrz-muenchen.de/data/ops/test_webdav_access_84_moved]	[0:""]

with the following numbers of cases (for e.g. this month):

/var/lib/dcache/billing/2023/07# grep ":remove" billing-2023.07.* | grep -v cleaner | grep $'\[[]:[]]\t\[' | wc -l
897369
/var/lib/dcache/billing/2023/07# grep ":remove" billing-2023.07.* | grep -v cleaner | grep -v $'\[[]:[]]\t\[' | wc -l
2457

so for nearly all transfers, the information is missing. And from those where it's not:

/var/lib/dcache/billing/2023/07# grep ":remove" billing-2023.07.* | grep -v cleaner | grep -v $'\[[]:[]]\t\[' | grep '\[ops001]' | wc -l
2276

The bigger part is from ops tests.

So I'd presume there's some systematic difference in how these deletions are done.

Perhaps ATLAS uses macaroons, and then the info is missing?
But that would IMO be quite a deficiency in the system, as it seems to make tracking down who's to blame quite difficult.

So if it would be indeed macaroons,... couldn't it be made, that dcache keeps the information from which DN/FQAN the macaroon used for authentication was generated and include that in the billing?

Cheers, Chris.

calestyo avatar Jul 17 '23 11:07 calestyo

Hi Chris.

Which dCache version is this?

Cheers, Lea

lemora avatar Oct 10 '23 10:10 lemora

Hey.

9.1.2.

Cheers, Chris

calestyo avatar Oct 10 '23 10:10 calestyo

Okay. To be honest, this is very likely token-related, and in that case there is nothing we can do to retrrieve and log a DN/FQAN.

lemora avatar Oct 10 '23 16:10 lemora

Phew... that sounds pretty serious if one could't trace down which user is responsible for some action.

calestyo avatar Oct 10 '23 17:10 calestyo