
Migration move jobs: random flag verification

Open cfgamboa opened this issue 1 year ago • 10 comments

Dear all,

As reported today in the Tier1 dev meeting, our DMZ pools use migration move jobs to distribute files to TAPE and DISK-ONLY pool groups. The following is an example of the migration jobs used to move files from DMZ pools to TAPE-like pools in a pool group.

migration move -storage=bnlt1d0:BNLT1D0 -permanent -concurrency=40 -eager -select=random -replicas=1 -target=pgroup -- DATATAPE-write

migration move -storage=MCTAPE:MC -permanent -concurrency=40 -eager -select=random -replicas=1 -target=pgroup -- MCTAPE\-write

There are 16 DMZ pools that are enabled/configured in a similar way.
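For illustration, a minimal Python sketch of how such per-storage-class commands could be generated consistently across pools. The storage classes and target pool groups are taken from the commands above; everything else (variable names, the generation approach itself) is hypothetical and not part of the actual setup:

    # Sketch: build one "migration move" command per storage class,
    # matching the pattern used on the DMZ pools above.
    STORAGE_TO_PGROUP = {
        "bnlt1d0:BNLT1D0": "DATATAPE-write",
        "MCTAPE:MC": "MCTAPE\\-write",  # backslash kept as in the original command
    }

    def migration_command(storage: str, pgroup: str) -> str:
        return (f"migration move -storage={storage} -permanent "
                f"-concurrency=40 -eager -select=random -replicas=1 "
                f"-target=pgroup -- {pgroup}")

    for storage, pgroup in STORAGE_TO_PGROUP.items():
        print(migration_command(storage, pgroup))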

Attached is a picture of the pool monitor; it corresponds to a period in which the DMZ pools are saturated (many TAPE files waiting to be moved to the internal TAPE pool groups).

[image: pool monitor screenshot showing the saturated DMZ pools]

It is not clear why only a few pools are chosen as destinations by the migration jobs.

This situation was first observed when we used the default setting of the -select parameter.

I was expecting a more even distribution of destination pools across the TAPE pool group.

Could you please advise?

All the best, Carlos

cfgamboa avatar Apr 17 '24 14:04 cfgamboa

Hi Carlos.

Could you please provide more information on these jobs via migration info? Is there anything logged on the origin pools or in PoolManager? Were these other pools ever tried?

Thanks. Lea

lemora avatar Apr 17 '24 16:04 lemora

Hello Lea,

 migration info 179
Command    : migration move -storage=MCTAPE:MC -permanent -concurrency=40 -eager -select=random -replicas=1 -target=pgroup -- MCTAPE\-write
State      : SLEEPING
Queued     : 0
Attempts   : 2929
Targets    : dc266_10,dc263_10,dc245_10,dc268_10,dc269_10,dc249_10,dc246_10,dc267_10,dc248_10,dc264_10,dc265_10,dc259_10,dc260_10,dc261_10,dc254_10,dc253_10,dc252_10,dc270_10,dc255_10,dc258_10
Completed  : 2821 files; 4344642122070 bytes; 100%
Total      : 4344642122070 bytes
Concurrency: 40
Running tasks:
Most recent errors:
08:26:06 [4655] 0000FF579F67CAAF40D7926FBE1A57B40250: File does not exist, skipped
08:26:16 [4660] 00009A193BE526A244ECB444F4A210EC56A1: Transfer to [dc269_10@local] failed (No such file or directory: 00009A193BE526A244ECB444F4A210EC56A1); will not be retried

Carlos

cfgamboa avatar Apr 17 '24 16:04 cfgamboa

@lemora here is an example where the selection goes to one pool:

    Command    : migration move -storage=bnlt1d0:BNLT1D0 -permanent -concurrency=40 -eager -select=random -replicas=1 -target=pgroup -- DATATAPE\-write
    State      : RUNNING
    Queued     : 0
    Attempts   : 1731
    Targets    : dc266_10,dc263_10,dc245_10,dc268_10,dc269_10,dc249_10,dc246_10,dc267_10,dc248_10,dc264_10,dc265_10,dc259_10,dc260_10,dc261_10,dc254_10,dc253_10,dc252_10,dc270_10,dc255_10,dc258_10
    Completed  : 1675 files; 7823832367778 bytes; 98%
    Total      : 7968129878941 bytes
    Concurrency: 40
    Running tasks:
    [16785] 000038B8C629722342969DA89EFF9978416D: TASK.Copying -> [dc258_10@local]
    [16899] 0000AAC5B024EE984B5B8C9C748D0384C90C: TASK.Copying -> [dc258_10@local]
    [16928] 0000550312C67B0F499BA575AE34B0E82E03: TASK.Copying -> [dc258_10@local]
    [16937] 000091015BAE1D12491BB97546EE57906F20: TASK.Copying -> [dc258_10@local]
    [16994] 0000330AEAE002D942B3BCBA526AEDCF96D5: TASK.Copying -> [dc258_10@local]
    [17031] 0000476DD01AE9554AD9A9F1338A983C7F8A: TASK.Copying -> [dc258_10@local]
    [17351] 0000276485F3CE9349648672BCC6E65684BA: TASK.Copying -> [dc258_10@local]
    [17447] 0000F6747705F7CA4946BF641F828ED7007F: TASK.Copying -> [dc258_10@local]
    [17459] 0000132D43F0281D45D0B9481DDFC2F1D790: TASK.Copying -> [dc258_10@local]
    [17472] 0000156D3D980FB744CB85AF804115C5BD8E: TASK.Copying -> [dc258_10@local]
    [17651] 00005430A7E0A45F479DA1C7E0E3C4F80338: TASK.Copying -> [dc258_10@local]
    [17930] 00005EE96E2A319644B6B0152F19A9DD8790: TASK.Copying -> [dc258_10@local]
    [18300] 0000AE20E0EDEC8D4EC08538C148ED24A892: TASK.Copying -> [dc258_10@local]
    [18617] 00001390D214813F44449AFCFD9D9B855EDC: TASK.Copying -> [dc253_10@local]
    [18752] 00003BD558A54ADA430C81FBE2AAB170042B: TASK.Copying -> [dc258_10@local]
    [18764] 000011532DE9D5FA468E8516C38055CB6DD5: TASK.Copying -> [dc258_10@local]
    [18993] 00002D3A5F89BF85471BA65B6873D4C9B8C5: TASK.Copying -> [dc258_10@local]
    [19047] 00007415AE95AD1A4F649E83CDD9BD6FB8F7: TASK.Copying -> [dc258_10@local]
    [19125] 000029FC4B8489D2476FAD4EB078DE636875: TASK.Copying -> [dc258_10@local]
    [19171] 00002E6EF98D3FEA45B7B2ECC957866F22AA: TASK.Copying -> [dc258_10@local]
    [19257] 0000BCBBB24C9AD147AFB847CE43AA2E7327: TASK.Copying -> [dc253_10@local]
    [19293] 0000588897B48404491E9D2289658255D90C: TASK.Copying -> [dc253_10@local]
    [19329] 0000ECB0152EEDF34A4C9FB38E0DC5CDFF24: TASK.Copying -> [dc253_10@local]
    [19336] 0000C263D49A3AC74CC4B6F37E12A99F9F8D: TASK.Copying -> [dc258_10@local]

Many migration jobs select the same pool:

[image: pool monitor screenshot showing many migration jobs selecting the same destination pool]

cfgamboa avatar Apr 18 '24 12:04 cfgamboa

Only when I cancelled the ongoing migration that was stuck (hot pool) and excluded the HOT pool from the migration job destinations did the destination pools for transfers start to be more diverse.

    Command    : migration move -storage=MCTAPE:MC -permanent -concurrency=40 -eager -select=random -replicas=1 -exclude=dc258_10 -target=pgroup -- MCTAPE\-write
    State      : RUNNING
    Queued     : 380
    Attempts   : 103
    Targets    : dc266_10,dc263_10,dc245_10,dc268_10,dc269_10,dc249_10,dc246_10,dc267_10,dc248_10,dc264_10,dc265_10,dc259_10,dc260_10,dc261_10,dc254_10,dc253_10,dc252_10,dc270_10,dc255_10
    Completed  : 63 files; 108070950853 bytes; 6%
    Total      : 1766407491660 bytes
    Concurrency: 40
    Running tasks:
    [20366] 0000FF68EBD77BA84E39B94455CC0B90DF0A: TASK.Copying -> [dc254_10@local]
    [20368] 00003467F90D4D524A488AC8EC789E18C780: TASK.Copying -> [dc246_10@local]
    [20370] 0000580E1B24CF104896A7C8F5D03DDA3CDA: TASK.Copying -> [dc254_10@local]
    [20373] 00009ECC758E07F840E182D55601B52023AA: TASK.Copying -> [dc266_10@local]
    [20374] 000009980CDD21E840D39B7AE2DD21A4F49C: TASK.Copying -> [dc254_10@local]
    [20375] 0000D64A5282E0AD499D90A3036B3D685FFD: TASK.Copying -> [dc249_10@local]
    [20379] 0000352929A645D2463C8518E351980B498B: TASK.Copying -> [dc249_10@local]
    [20381] 00005FC09C36DB52460092C2F963589CC22E: TASK.Copying -> [dc253_10@local]
    [20393] 0000D5BF282B5C864C729F927122C279F551: TASK.Copying -> [dc264_10@local]
    [20399] 0000D276AB59EE5C4036B249AAFBD503EE0C: TASK.Copying -> [dc253_10@local]
    [20404] 00003DECF626706E41AF861BFF261AB69EAC: TASK.Copying -> [dc259_10@local]
    [20407] 00007417BDEE533C4E449DF48EA5C64F3469: TASK.Copying -> [dc249_10@local]
    [20411] 0000FAC0B45CF886425EAE11BE8F64672F69: TASK.Copying -> [dc254_10@local]
    [20413] 0000626112F23C7A4362B9F184C907A70C6E: TASK.Copying -> [dc254_10@local]
    [20426] 0000F9B07024EA5F4765883F4D9BCECE51C2: TASK.Copying -> [dc255_10@local]
    [20428] 0000E91CF3FEA9104B96AD086714F246EA23: TASK.Copying -> [dc254_10@local]
    [20435] 000042579D86E46B4E0290D34A723BA4AC46: TASK.Copying -> [dc265_10@local]
    [20436] 00007E30937DF40F432BBE6B3164C2AEACFF: TASK.Copying -> [dc254_10@local]
    [20437] 00005BCEE90C79DD46359F2A7AD05398585D: TASK.Copying -> [dc268_10@local]
    [20438] 00006E2857CB031042788240A9C7B45F85DB: TASK.Copying -> [dc245_10@local]
    [20449] 0000ABF2340E29364D09831942BF148445C5: TASK.Copying -> [dc254_10@local]
    [20453] 000067DE01BD61BC487CA29E91AA53E4958C: TASK.Copying -> [dc266_10@local]
    [20454] 000000D23598EEA2447996B47C0660E30B26: TASK.Copying -> [dc253_10@local]
    [20456] 0000091F04719AFC4984A1DA08753086629B: TASK.Copying -> [dc264_10@local]
    [20458] 0000CBAF5B648EEC4EA481366A8B87543CEF: TASK.Copying -> [dc254_10@local]
    [20460] 0000A9E9C70532074B918B264289E5039DAF: TASK.Copying -> [dc254_10@local]
    [20461] 00004E9436A3151142C6B2C4F3F31CA2DB1B: TASK.Copying -> [dc253_10@local]
    [20463] 00000D635C6F2DBA488EAB74383AD976E361: TASK.Copying -> [dc252_10@local]
    [20464] 00009DC49ADFF6724B56BF306C26F117E626: TASK.Copying -> [dc264_10@local]
    [20466] 0000F74DCE54C5A74217B31AC75F109B0E61: TASK.Copying -> [dc263_10@local]
    [20467] 0000893BA234596A4575B837A0ADDD4A45E7: TASK.Copying -> [dc270_10@local]
    [20468] 0000083A69109D5D4DA6BB5AF9B06F2C3CCA: TASK.Copying -> [dc268_10@local]
    [20471] 000094EBE42BE8174FD7B2079B967032FC06: TASK.Copying -> [dc261_10@local]
    [20472] 0000F1745EF967904B318C9595AF24BD6527: TASK.Copying -> [dc260_10@local]
    [20476] 0000267757BADBCB4BC9AACD99196F606619: TASK.Copying -> [dc248_10@local]
    [20477] 000018D1DEC8F9A1472F8A93B70F7C3B8C70: TASK.Copying -> [dc245_10@local]
    [20478] 0000B8D1CC202EF74E1EA28E45760D8A72A4: TASK.Copying -> [dc245_10@local]
    [20491] 0000B22DD9631AF14FADB85A59AD701F9A9D: TASK.Copying -> [dc245_10@local]
    [20492] 00006A94B6B915564D6390226D8987B7F95E: TASK.Copying -> [dc267_10@local]
    [20493] 00002FF8EE11B9634689997D73AE2FAABFF5: TASK.Copying -> [dc246_10@local]

cfgamboa avatar Apr 18 '24 12:04 cfgamboa

Is it possible that the migration was going on, but you see only stuck tasks in the output?


kofemann avatar Apr 18 '24 14:04 kofemann

The migration from the source does not stop. The problem here is that it keeps choosing the same destination pool; it does not seem to be a purely random process.

cfgamboa avatar Apr 18 '24 18:04 cfgamboa

@cfgamboa can you check in billing and confirm whether all p2p transfers went into one pool while all the others got less traffic, or whether on average the data distribution is flat?

kofemann avatar Apr 23 '24 15:04 kofemann
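For illustration, one way to do that check: a minimal Python sketch, assuming the p2p destination pool names have already been extracted from the billing records into a plain-text file with one pool name per line (the extraction step itself, and the file name destinations.txt, are hypothetical):

    # Sketch: tally p2p destinations to see whether traffic is flat
    # across the pool group or concentrated on a few pools.
    from collections import Counter

    with open("destinations.txt") as f:
        counts = Counter(line.strip() for line in f if line.strip())

    total = sum(counts.values())
    for pool, n in counts.most_common():
        print(f"{pool:12s} {n:6d}  {100.0 * n / total:5.1f}%")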

Hi

It seems that disabling the random flag helps to spread out the load across the pool group.

Carlos


cfgamboa avatar Apr 23 '24 15:04 cfgamboa

This is the best indication that there is a load pattern that sculpts the initially random distribution. Do you have other activities on the destination pools? Those may sculpt the initially random distribution, whereas not specifying random takes pool load (and space) into account.

(An example of sculpting: a slow pool will seem to be "attracting" many transfers when pools are selected randomly, because its transfers linger while faster pools finish theirs and empty out.)
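A minimal simulation sketch of that effect in Python, with hypothetical rates: transfers are assigned uniformly at random across 20 pools, each queued transfer completes with a pool-specific probability per step, and the single slow pool ends up holding most of the in-flight transfers at any given moment:

    # Sketch: under uniform random selection, a slow pool accumulates
    # in-flight transfers. Pool 0 drains 10x slower than the others.
    import random

    random.seed(1)
    POOLS = 20
    DRAIN = [0.01] + [0.10] * (POOLS - 1)  # per-step completion probability
    queues = [0] * POOLS

    for _ in range(10_000):
        queues[random.randrange(POOLS)] += 1    # uniform random arrival
        for p in range(POOLS):                  # drain each queue
            done = sum(random.random() < DRAIN[p] for _ in range(queues[p]))
            queues[p] -= done

    print("slow pool in-flight:", queues[0])
    print("busiest fast pool:  ", max(queues[1:]))

With these numbers the slow pool typically ends up with noticeably more in-flight transfers than any of the fast pools, even though every pool receives exactly the same share of new transfers.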

DmitryLitvintsev avatar Apr 23 '24 15:04 DmitryLitvintsev

Yes, there are other activities at the destination pools. Also, on the DMZ pools there are other migration jobs to other pool groups.

cfgamboa avatar May 08 '24 13:05 cfgamboa