dcache icon indicating copy to clipboard operation
dcache copied to clipboard

Possible bug in ac_psu_match

Open alrossi opened this issue 2 years ago • 4 comments

This is more a placeholder for myself (I am working on some extensions to this method).

It seems to me that if the following match returns the indicated pools:

psu match read tape.dcache-devel-test@enstore * fndcatemp1.fnal.gov */*
Preference : 0
       Tag : stage
  dcatest03-5
  dcatest08-5
  dcatest09-5
  dcatest04-7
  dcatest04-5
(time used : 28661 millis)

then the "globbed" version for the protocol ("*") should return at least those pools. Yet it returns nothing:

psu match read tape.dcache-devel-test@enstore * fndcatemp1.fnal.gov *
(time used : 1 millis)

Also, 28661 millis cannot possibly be correct, as the command returns nearly instantaneously (it does not take 28.6 seconds).

With debug on, the following is observed.

For the first command:

22 Sep 2022 08:39:39 (PoolManager) [TrQ:GU admin] running match: type=READ store=tape.dcache-devel-test@enstore dCacheUnit=null net=fndcatemp1.fnal.gov protocol=*/* keys={} locations=[] linkGroup=null
22 Sep 2022 08:39:39 (PoolManager) [TrQ:GU admin] matching protocol unit found: */*  (type=PROTOCOL;canonical=*/*;uGroups=1)
22 Sep 2022 08:39:39 (PoolManager) [TrQ:GU admin] matching net unit found: 0.0.0.0/0.0.0.0  (type=NET;canonical=0.0.0.0/0;uGroups=1)
22 Sep 2022 08:39:39 (PoolManager) [TrQ:GU admin] link tape-link matching to unit tape.dcache-devel-test@enstore  (type=STORE;canonical=tape.dcache-devel-test@enstore;uGroups=1) (required=null; onlyOneCopyPer=[])
22 Sep 2022 08:39:39 (PoolManager) [TrQ:GU admin] link stage-link matching to unit tape.dcache-devel-test@enstore  (type=STORE;canonical=tape.dcache-devel-test@enstore;uGroups=1) (required=null; onlyOneCopyPer=[])
22 Sep 2022 08:39:49 (PoolManager) [TrQ:GU admin] link tape-link matching to unit */*  (type=PROTOCOL;canonical=*/*;uGroups=1)
22 Sep 2022 08:39:49 (PoolManager) [TrQ:GU admin] link stage-link matching to unit */*  (type=PROTOCOL;canonical=*/*;uGroups=1)
22 Sep 2022 08:39:49 (PoolManager) [TrQ:GU admin] link persistent-link matching to unit */*  (type=PROTOCOL;canonical=*/*;uGroups=1)
22 Sep 2022 08:39:49 (PoolManager) [TrQ:GU admin] link internal-link matching to unit */*  (type=PROTOCOL;canonical=*/*;uGroups=1)
22 Sep 2022 08:39:49 (PoolManager) [TrQ:GU admin] link highavail-link matching to unit */*  (type=PROTOCOL;canonical=*/*;uGroups=1)
22 Sep 2022 08:39:49 (PoolManager) [TrQ:GU admin] link persistent-tape-link matching to unit */*  (type=PROTOCOL;canonical=*/*;uGroups=1)
22 Sep 2022 08:39:49 (PoolManager) [TrQ:GU admin] link dmz-link matching to unit */*  (type=PROTOCOL;canonical=*/*;uGroups=1)
22 Sep 2022 08:39:57 (PoolManager) [TrQ:GU admin] link tape-link matching to unit 0.0.0.0/0.0.0.0  (type=NET;canonical=0.0.0.0/0;uGroups=1)
22 Sep 2022 08:39:57 (PoolManager) [TrQ:GU admin] link stage-link matching to unit 0.0.0.0/0.0.0.0  (type=NET;canonical=0.0.0.0/0;uGroups=1)
22 Sep 2022 08:39:57 (PoolManager) [TrQ:GU admin] link persistent-link matching to unit 0.0.0.0/0.0.0.0  (type=NET;canonical=0.0.0.0/0;uGroups=1)
22 Sep 2022 08:39:57 (PoolManager) [TrQ:GU admin] link internal-link matching to unit 0.0.0.0/0.0.0.0  (type=NET;canonical=0.0.0.0/0;uGroups=1)
22 Sep 2022 08:39:57 (PoolManager) [TrQ:GU admin] link highavail-link matching to unit 0.0.0.0/0.0.0.0  (type=NET;canonical=0.0.0.0/0;uGroups=1)
22 Sep 2022 08:39:57 (PoolManager) [TrQ:GU admin] link persistent-tape-link matching to unit 0.0.0.0/0.0.0.0  (type=NET;canonical=0.0.0.0/0;uGroups=1)
22 Sep 2022 08:39:57 (PoolManager) [TrQ:GU admin] link dmz-link matching to unit 0.0.0.0/0.0.0.0  (type=NET;canonical=0.0.0.0/0;uGroups=1)
22 Sep 2022 08:40:08 (PoolManager) [TrQ:GU admin] Pool: dcatest03-5  (enabled=true;active=56;rdOnly=false;links=0;pgroups=1;hsm=[enstore];mode=enabled) can read from tape? : true
22 Sep 2022 08:40:08 (PoolManager) [TrQ:GU admin] Pool: dcatest08-5  (enabled=true;active=40;rdOnly=false;links=0;pgroups=1;hsm=[enstore];mode=enabled) can read from tape? : true
22 Sep 2022 08:40:08 (PoolManager) [TrQ:GU admin] Pool: dcatest09-5  (enabled=true;active=39;rdOnly=false;links=0;pgroups=1;hsm=[enstore];mode=enabled) can read from tape? : true
22 Sep 2022 08:40:08 (PoolManager) [TrQ:GU admin] Pool: fndcatemp2-7  (enabled=true;active=no;rdOnly=false;links=0;pgroups=1;hsm=[];mode=disabled()) can read from tape? : false
22 Sep 2022 08:40:08 (PoolManager) [TrQ:GU admin] Pool: dcatest04-7  (enabled=true;active=46;rdOnly=false;links=0;pgroups=1;hsm=[enstore];mode=enabled) can read from tape? : true
22 Sep 2022 08:40:08 (PoolManager) [TrQ:GU admin] Pool: dcatest04-5  (enabled=true;active=46;rdOnly=false;links=0;pgroups=1;hsm=[enstore];mode=enabled) can read from tape? : true
22 Sep 2022 08:40:08 (PoolManager) [TrQ:GU admin] match done: [0] : dcatest03-5 dcatest08-5 dcatest09-5 dcatest04-7 dcatest04-5

for the second:

22 Sep 2022 08:40:17 (PoolManager) [TrQ:GU admin] running match: type=READ store=tape.dcache-devel-test@enstore dCacheUnit=null net=fndcatemp1.fnal.gov protocol=null keys={} locations=[] linkGroup=null
22 Sep 2022 08:40:17 (PoolManager) [TrQ:GU admin] matching net unit found: 0.0.0.0/0.0.0.0  (type=NET;canonical=0.0.0.0/0;uGroups=1)
22 Sep 2022 08:40:17 (PoolManager) [TrQ:GU admin] link tape-link matching to unit tape.dcache-devel-test@enstore  (type=STORE;canonical=tape.dcache-devel-test@enstore;uGroups=1) (required=null; onlyOneCopyPer=[])
22 Sep 2022 08:40:17 (PoolManager) [TrQ:GU admin] link stage-link matching to unit tape.dcache-devel-test@enstore  (type=STORE;canonical=tape.dcache-devel-test@enstore;uGroups=1) (required=null; onlyOneCopyPer=[])
22 Sep 2022 08:40:17 (PoolManager) [TrQ:GU admin] link tape-link matching to unit 0.0.0.0/0.0.0.0  (type=NET;canonical=0.0.0.0/0;uGroups=1)
22 Sep 2022 08:40:17 (PoolManager) [TrQ:GU admin] link stage-link matching to unit 0.0.0.0/0.0.0.0  (type=NET;canonical=0.0.0.0/0;uGroups=1)
22 Sep 2022 08:40:17 (PoolManager) [TrQ:GU admin] link persistent-link matching to unit 0.0.0.0/0.0.0.0  (type=NET;canonical=0.0.0.0/0;uGroups=1)
22 Sep 2022 08:40:17 (PoolManager) [TrQ:GU admin] link internal-link matching to unit 0.0.0.0/0.0.0.0  (type=NET;canonical=0.0.0.0/0;uGroups=1)
22 Sep 2022 08:40:17 (PoolManager) [TrQ:GU admin] link highavail-link matching to unit 0.0.0.0/0.0.0.0  (type=NET;canonical=0.0.0.0/0;uGroups=1)
22 Sep 2022 08:40:17 (PoolManager) [TrQ:GU admin] link persistent-tape-link matching to unit 0.0.0.0/0.0.0.0  (type=NET;canonical=0.0.0.0/0;uGroups=1)
22 Sep 2022 08:40:17 (PoolManager) [TrQ:GU admin] link dmz-link matching to unit 0.0.0.0/0.0.0.0  (type=NET;canonical=0.0.0.0/0;uGroups=1)
22 Sep 2022 08:40:17 (PoolManager) [TrQ:GU admin] match done: 

My guess is that the glob, which translates under the covers to NULL, makes the matcher skip that particular unit, but this results in the LinkMapEntry not being properly decremented so that those links actually do not count as having matched all their linked units.

In other words, instead of glob=>NULL, it should be glob=>ALL the mapped units of that type should be added.

I don't think this affects normal dCache operations, since matching internally does not use globs. But the shell commands seem to me to be broken, and as a consequence, so is the web interface matching when it tries to use the glob.

NOTE: This problem exists for net unit and protocol unit, but cache class seems to work with the glob.

alrossi avatar Sep 22 '22 14:09 alrossi

Just to add an observation/comment here, dCache PSU has the concept of a default units (default protocol unit, default net unit, etc). These match if nothing else matches. Having such a default or fall-back behaviour is often very useful in pool-manager configuration. It's also a fairly common concept in other context (outside of dCache).

The long standing problem is that dCache uses a * to represent the default unit.

Thing brings (at least) two problems:

  • people expect * to have wildcard (match everything) semantics, which isn't the case.
  • given * means "default" (or "fall-back"), there's now no nice way to use a glob pattern.

You may have found a bug in how the methods are written (I don't know enough about the use-case to be sure). Even if not, I would say this is another case where the choice of * to mean "default" is hurting us.

As a suggestion, you could write a unit test that currently fails, but would pass if dCache reacted the way you expect. We can look at this patch in RB and decide whether it's really a bug.

paulmillar avatar Sep 23 '22 08:09 paulmillar

then the "globbed" version for the protocol ("*") should return at least those pools. Yet it returns nothing: psu match read tape.dcache-devel-test@enstore fndcatemp1.fnal.gov *

The issue here is that protocols are not treated as string. They are combination of <protocol name> and <protocol version>. Thus you have to say any protocol/any version => */*

Jep. This is confusion. And globs in dCache works as 'others' (just line unix permissions) in opposite to 'everything' (line acl's EVERYONE), as Paul already have mentioned.

kofemann avatar Sep 23 '22 08:09 kofemann

Thanks to both of you for the input here. I have examined the code, and here is the problem that I see which underlies it.

  • '*' on the command line is translated into a null under the covers.
  • A null given to the bottom-most match method which actually does the work means that adding those units to the list to be matched is skipped.
  • Since those units are skipped, the decrementing of an internal count based on the unitGroup size on the linkMapEntry does not take place.
  • The algorithm specifies that only those entries with an internal count <= 0 are matches. Since a parameter is essentially excluded by the null, this means that the count does not reach 0.

I have attempted to fix this by changing this behavior to add all mapped units of the given type when its parameter value is null. This, however, results in overkill, because a link could then match on a unit twice, and not match on a different required unit, but still be returned as valid, since the count was successfully decremented to 0.

If "*" means "0.0.0.0/0.0.0.0" or "::/0" for a net unit and "*/*" for a protocol unit, one could instead translate the null into those, but that would require that those units actually be part of the poolmanager config; otherwise you get an IllegalArgumentException. But what is the default for a dCache unit (e.g., cacheClass)? ...

A second way to achieve this might be to treat the matching of a particular unit type as OR, but across unit types as AND. To me that seems to be the correct logic here. Then you could add all the unit types and the first to match short circuits, when no unit types are given as the parameter. This would require the match method to be rewritten, I think, which would not be a bad thing. I could do this, and add a fuller set of unit tests.

Another way to handle it might be to rewrite the sub-methods so that the glob is directly handled: that is, it just adds the link regardless. This might be the easiest fix, and is one I am currently trying out.

@paulmillar 's suggestion to write a unit test is not bad. I may do that.

In any case, it is also my opinion that

  1. The main match method needs some refactoring into a little more intelligible form
  2. All of the psu code should be updated to use the more modern command API.

Those, of course, are separate issues (patches) I am willing to undertake.

alrossi avatar Sep 23 '22 12:09 alrossi

I think the long and short of this is simply that if the glob is not supposed to work as all but as default, then the code is wrong, because it does not translate to default under the covers.

The easiest solution in the case of protocol is to translate the glob into "*/*".

I am not sure what to do for net (which mask is the correct one?), or for cacheClass/ dCacheUnit.

alrossi avatar Sep 23 '22 13:09 alrossi