aiida-core icon indicating copy to clipboard operation
aiida-core copied to clipboard

`QueryBuilder` is broken in v2.0 when joining `Group` onto `Node`

Open mkotiuga opened this issue 2 years ago • 11 comments

After upgrading to aiida v2 when I list my upf families using verdi data core.upf listfamilies each family is printed out once for each ufp it contains - instead of once.

So if I request all the families that have the elements "O" and "Ti" (which are in all the same upf families) I get each listed twice:

(aiida3.8) mkotiuga@theospc2:~$ verdi data core.upf listfamilies -e O Ti
Success: * Ba_GBRV_Ti_O_ssspP [3 pseudos]
Success: * PNO-upf [3 pseudos]
Success: * SSSP_PBEsolA [85 pseudos]
Success: * SSSP_PBEsolE [85 pseudos]
Success: * Ba_GBRV_Ti_O_ssspP [3 pseudos]
Success: * SSSP_PBEsolA [85 pseudos]
Success: * SSSP_PBEsolE [85 pseudos]
Success: * Gio [5 pseudos]
Success: * Gio_sept [5 pseudos]
Success: * Gio [5 pseudos]
Success: * Gio_sept [5 pseudos]
Success: * Gio_pbesol_us [72 pseudos]
Success: * Gio_pbesol_us [72 pseudos]
Success: * SSSP_PBE_E [85 pseudos]
Success: * SSSP_PBE_E [85 pseudos]
Success: * Gio_LDA [3 pseudos]
Success: * Gio_LDA [3 pseudos]

Environment:

  • Operating system: Ubuntu 18.04
  • Python version: 3.8.0
  • aiida-core version: 2.0.1

mkotiuga avatar May 23 '22 12:05 mkotiuga

I can't help but notice that there seem to be some families that are not duplicated. For example PNO-upf. Could you run the following snippet in verdi shell and report the output perhaps?

from aiida.orm import Group, QueryBuilder, UpfData

query = QueryBuilder()
query.append(UpfData, tag='upf')
query.append(Group, with_node='upf', project=['id', 'uuid', 'label', 'type_string'])

print(query.all())

Just to make sure there are not actual duplicates that are simply getting correctly reported

sphuber avatar May 31 '22 09:05 sphuber

Sorry - the family PNO-upf doesn't contain Ti, so it being listed once makes sense. The output of the above snippet it really long, but removing the duplicates results in the following :

In [1]: from aiida.orm import Group, QueryBuilder, UpfData
   ...: query = QueryBuilder()
   ...: query.append(UpfData, tag='upf')
   ...: query.append(Group, with_node='upf', project=['id', 'uuid', 'label', 'type_string'])
   ...: 
   ...: psp_set = set(tuple(i) for i in query.all())
   ...: psp_set
Out[1]: 
{(2, '9339fc73-ac27-45a6-8a84-71cc0655064e', 'SSSP_PBEsolE', 'core.upf'),
 (3, '7cc6aa98-46ba-472c-8daf-623d6d88ee6a', 'SSSP_PBEsolA', 'core.upf'),
 (15, 'c7e4e8a7-15e0-4db1-ac61-a0120d69ad12', 'Gio', 'core.upf'),
 (96,
  '28027471-08f3-4092-ac91-228185dc6a7c',
  'Ba_GBRV_Ti_O_ssspP',
  'core.upf'),
 (111, 'c9ea4e8e-955d-441b-ae9e-ec5d39bb6be6', 'Gio_sept', 'core.upf'),
 (141, '9e39941a-4727-4b50-8546-6c73871b8f0e', 'Gio_pbesol_us', 'core.upf'),
 (153, '72f21339-6ef2-4879-b020-be138a58face', 'SSSP_PBE_E', 'core.upf'),
 (268,
  'fbbb05ef-ebd5-48b1-96c5-d251e72ec4fd',
  '20210518-003510',
  'core.import'),
 (279,
  '0d197c9c-7973-4f5a-a8d4-43de65571506',
  'calculations/Giovanni/13pero',
  'core'),
 (316, 'ecfcda7d-3424-45f1-80cc-011a684c49f5', 'PNO-upf', 'core.upf'),
 (393, '071d562c-2319-48fe-948a-8386399270e9', 'Gio_LDA', 'core.upf')}

So it seems each family is stored just once.

mkotiuga avatar May 31 '22 22:05 mkotiuga

The point was to see if there were actually duplicates though :sweat_smile: Are you saying there are actually duplicates in those results that you had to remove? Because that would be the problem then, and would also be super weird and some big bug, because it would mean there are groups with identical pk and UUID which shouldn't happen.

sphuber avatar Jun 01 '22 05:06 sphuber

If I run that code query one step at a time. First I get all the pseudos stored - there are 283. Then the next I get the group for each with the data project=['id', 'uuid', 'label', 'type_string']. Here are the first 20 entries when I do not remove duplicates:

In [3]: from aiida.orm import Group, QueryBuilder, UpfData
   ...: query = QueryBuilder()
   ...: query.append(UpfData, tag='upf')
   ...: query.append(Group, with_node='upf', project=['id', 'uuid', 'label', 'type_string'])
   ...: 
   ...: psp_grp_list = query.all()
   ...: psp_grp_list[:20]
Out[3]: 
[[15, 'c7e4e8a7-15e0-4db1-ac61-a0120d69ad12', 'Gio', 'core.upf'],
 [268,
  'fbbb05ef-ebd5-48b1-96c5-d251e72ec4fd',
  '20210518-003510',
  'core.import'],
 [279,
  '0d197c9c-7973-4f5a-a8d4-43de65571506',
  'calculations/Giovanni/13pero',
  'core'],
 [3, '7cc6aa98-46ba-472c-8daf-623d6d88ee6a', 'SSSP_PBEsolA', 'core.upf'],
 [153, '72f21339-6ef2-4879-b020-be138a58face', 'SSSP_PBE_E', 'core.upf'],
 [268,
  'fbbb05ef-ebd5-48b1-96c5-d251e72ec4fd',
  '20210518-003510',
  'core.import'],
 [279,
  '0d197c9c-7973-4f5a-a8d4-43de65571506',
  'calculations/Giovanni/13pero',
  'core'],
 [2, '9339fc73-ac27-45a6-8a84-71cc0655064e', 'SSSP_PBEsolE', 'core.upf'],
 [153, '72f21339-6ef2-4879-b020-be138a58face', 'SSSP_PBE_E', 'core.upf'],
 [2, '9339fc73-ac27-45a6-8a84-71cc0655064e', 'SSSP_PBEsolE', 'core.upf'],
 [3, '7cc6aa98-46ba-472c-8daf-623d6d88ee6a', 'SSSP_PBEsolA', 'core.upf'],
 [2, '9339fc73-ac27-45a6-8a84-71cc0655064e', 'SSSP_PBEsolE', 'core.upf'],
 [3, '7cc6aa98-46ba-472c-8daf-623d6d88ee6a', 'SSSP_PBEsolA', 'core.upf'],
 [2, '9339fc73-ac27-45a6-8a84-71cc0655064e', 'SSSP_PBEsolE', 'core.upf'],
 [3, '7cc6aa98-46ba-472c-8daf-623d6d88ee6a', 'SSSP_PBEsolA', 'core.upf'],
 [2, '9339fc73-ac27-45a6-8a84-71cc0655064e', 'SSSP_PBEsolE', 'core.upf'],
 [3, '7cc6aa98-46ba-472c-8daf-623d6d88ee6a', 'SSSP_PBEsolA', 'core.upf'],
 [153, '72f21339-6ef2-4879-b020-be138a58face', 'SSSP_PBE_E', 'core.upf'],
 [2, '9339fc73-ac27-45a6-8a84-71cc0655064e', 'SSSP_PBEsolE', 'core.upf'],
 [3, '7cc6aa98-46ba-472c-8daf-623d6d88ee6a', 'SSSP_PBEsolA', 'core.upf']]

You can see the repeats. But is this a duplicate of the pk and UUID - or just how the QueryBuilder works?

mkotiuga avatar Jun 01 '22 07:06 mkotiuga

You can see the repeats. But is this a duplicate of the pk and UUID - or just how the QueryBuilder works?

Not a 100% sure. The only way to make sure is to go into your Postgres database. Do you know how to do this? Otherwise, we can do it together in/during the meeting at 11. I am afraid though that this is really what is in the database and so that would be really problematic. I would be really surprised though because there is a uniqueness constraint on the UUID field in the Group model, so that should have always failed on the PostgreSQL level.

sphuber avatar Jun 01 '22 08:06 sphuber

@mkotiuga I cannot reproduce the problem. Also, thinking about it again, the query I had you execute only projects on the Group, so that should only return one row for each group, not one row for each UpfData that it contains. Can you login to your postgresql database, something like

sudo su - postgres
psql

and then execute the following query:

\c <DATABASE_NAME>
select id, uuid, type_string, label from db_dbgroup where type_string = 'core.upf';

replacing <DATABASE_NAME> with the name of your database.

sphuber avatar Jun 01 '22 12:06 sphuber

@sphuber - the output looks sensical

postgres=# \c BTO_mkotiuga_d5233c38bc66abafc152ecd80b202f65
You are now connected to database "BTO_mkotiuga_d5233c38bc66abafc152ecd80b202f65" as user "postgres".
BTO_mkotiuga_d5233c38bc66abafc152ecd80b202f65=# select id, uuid, type_string, label from db_dbgroup where type_string = 'core.upf';
 id  |                 uuid                 | type_string |       label        
-----+--------------------------------------+-------------+--------------------
  96 | 28027471-08f3-4092-ac91-228185dc6a7c | core.upf    | Ba_GBRV_Ti_O_ssspP
  15 | c7e4e8a7-15e0-4db1-ac61-a0120d69ad12 | core.upf    | Gio
 141 | 9e39941a-4727-4b50-8546-6c73871b8f0e | core.upf    | Gio_pbesol_us
 153 | 72f21339-6ef2-4879-b020-be138a58face | core.upf    | SSSP_PBE_E
   1 | 83a42a80-07ac-4942-af43-c6f7beb68a68 | core.upf    | SSSP_PBEsol
   2 | 9339fc73-ac27-45a6-8a84-71cc0655064e | core.upf    | SSSP_PBEsolE
   3 | 7cc6aa98-46ba-472c-8daf-623d6d88ee6a | core.upf    | SSSP_PBEsolA
 111 | c9ea4e8e-955d-441b-ae9e-ec5d39bb6be6 | core.upf    | Gio_sept
 316 | ecfcda7d-3424-45f1-80cc-011a684c49f5 | core.upf    | PNO-upf
 393 | 071d562c-2319-48fe-948a-8386399270e9 | core.upf    | Gio_LDA
(10 rows)

mkotiuga avatar Jun 01 '22 21:06 mkotiuga

~~Huh 🤔 what version of aiida-core are you using?~~ Never mind, I see in the OP you mentioned v2.0.1. Really weird, I tried to reproduce it but can't.

sphuber avatar Jun 02 '22 07:06 sphuber

I just did a querybuilder test. If I query for structures in a given group and then append the group as the last step of the query - the length of res is the number of structures - not just the name of the group (which is just the group I name in the first line)

name = 'PZO'
gs = Group.get(label='structures/dispero/'+name+'/cubic5atom')
qb = QueryBuilder()
     
qb.append(Group, filters={'id': gs.pk}, tag='struct_group')
qb.append(StructureData, with_group='struct_group', tag='structs')
qb.append(Group, with_node='structs')
res = qb.all()
res[0:10]

output

[[<Group: 'structures/dispero/PZO/cubic5atom' [type core], of user [email protected]>],
 [<Group: 'structures/dispero/PZO/cubic5atom' [type core], of user [email protected]>],
 [<Group: 'structures/dispero/PZO/cubic5atom' [type core], of user [email protected]>],
 [<Group: 'structures/dispero/PZO/cubic5atom' [type core], of user [email protected]>],
 [<Group: 'structures/dispero/PZO/cubic5atom' [type core], of user [email protected]>],
 [<Group: 'structures/dispero/PZO/cubic5atom' [type core], of user [email protected]>],
 [<Group: 'structures/dispero/PZO/cubic5atom' [type core], of user [email protected]>],
 [<Group: 'structures/dispero/PZO/cubic5atom' [type core], of user [email protected]>],
 [<Group: 'structures/dispero/PZO/cubic5atom' [type core], of user [email protected]>],
 [<Group: 'structures/dispero/PZO/cubic5atom' [type core], of user [email protected]>]]

mkotiuga avatar Jun 02 '22 12:06 mkotiuga

So maybe this is an issues not with my psp families - but the query builder? If @sphuber cannot reproduce the issue, this seems to be an issue in my installation, i.e. it may not be an issue unique to me but doesn't appear to be universal....

mkotiuga avatar Jun 02 '22 12:06 mkotiuga

Thanks @mkotiuga . I found the problem. It is indeed with the QueryBuilder and it was indeed introduced in v2.0 as it works properly in v1.6.8. The problem is more generic and will occur anytime you query for a group containing a type of node, it will return rows for each node in that group.

Take the simple query:

query = QueryBuilder()
query.append(Node, tag='node')
query.append(Group, with_node='node', project='id')
query.distinct()

For v2.0 this produces the following SQL

SELECT DISTINCT db_dbnode_1.id, db_dbgroup_1.id AS id_1
FROM db_dbnode AS db_dbnode_1 JOIN db_dbgroup_dbnodes AS db_dbgroup_dbnodes_1 ON db_dbgroup_dbnodes_1.dbnode_id = db_dbnode_1.id JOIN db_dbgroup AS db_dbgroup_1 ON db_dbgroup_1.id = db_dbgroup_dbnodes_1.dbgroup_id
WHERE CAST(db_dbnode_1.node_type AS VARCHAR) LIKE '%%' AND CAST(db_dbgroup_1.type_string AS VARCHAR) LIKE '%%'

and for v1.6 instead this produces

SELECT DISTINCT db_dbgroup_1.id 
FROM db_dbnode AS db_dbnode_1 JOIN db_dbgroup_dbnodes AS db_dbgroup_dbnodes_1 ON db_dbgroup_dbnodes_1.dbnode_id = db_dbnode_1.id JOIN db_dbgroup AS db_dbgroup_1 ON db_dbgroup_1.id = db_dbgroup_dbnodes_1.dbgroup_id 
WHERE CAST(db_dbnode_1.node_type AS VARCHAR) LIKE '%%' AND CAST(db_dbgroup_1.type_string AS VARCHAR) LIKE '%%'

The problem is that the query in v2.0 also projects the db_dbnode_1.id even though that wasn't ever requested.

When looking at the QueryBuilder's internal representation we get the following

v2.0

{
    "path": [
        {
            "entity_type": "",
            "orm_base": "node",
            "tag": "node",
            "joining_keyword": null,
            "joining_value": null,
            "edge_tag": null,
            "outerjoin": false
        },
        {
            "entity_type": "group.core",
            "orm_base": "group",
            "tag": "core_1",
            "joining_keyword": "with_node",
            "joining_value": "node",
            "edge_tag": "node--core_1",
            "outerjoin": false
        }
    ],
    "filters": {
        "node": {
            "node_type": {
                "like": "%"
            }
        },
        "core_1": {
            "type_string": {
                "like": "%"
            }
        },
        "node--core_1": {}
    },
    "project": {
        "node": [],
        "core_1": [
            {
                "id": {}
            }
        ],
        "node--core_1": []
    },
    "order_by": [],
    "limit": null,
    "offset": null,
    "distinct": true
}

v16x

{
    "path": [
        {
            "entity_type": "",
            "tag": "node",
            "joining_keyword": null,
            "joining_value": null,
            "outerjoin": false,
            "edge_tag": null
        },
        {
            "entity_type": "group.core",
            "tag": "core_1",
            "joining_keyword": "with_node",
            "joining_value": "node",
            "outerjoin": false,
            "edge_tag": "node--core_1"
        }
    ],
    "filters": {
        "node": {
            "node_type": {
                "like": "%"
            }
        },
        "core_1": {
            "type_string": {
                "like": "%"
            }
        },
        "node--core_1": {}
    },
    "project": {
        "node": [],
        "core_1": [
            {
                "id": {}
            }
        ],
        "node--core_1": []
    },
    "order_by": {},
    "limit": null,
    "offset": null
}

Apart from the orm_base keys which are added to v2.0, there is no difference. The distinct key seems to be missing from v1.6, which I think was a bug that was addressed in 4174e5de3adbeec785290a02a0fc78d4597e42e0. It should not actually have an effect, it just doesn't show up in the QueryBuilder.queryhelp output.

I think that commit 4174e5de3adbeec785290a02a0fc78d4597e42e0 also created the bug we are seeing here. It completely refactored the QueryBuilder implementation. The following test checks the regression

def test_joins_group_node_project_group(self):
    """Test that when protecting only the group for a join on nodes, only unique groups are returned.

    Regression test for #5535
    """
    group = orm.Group(label=str(uuid.uuid4())).store()
    node_a = orm.Data().store()
    node_b = orm.Data().store()
    group.add_nodes([node_a, node_b])

    query = orm.QueryBuilder()
    query.append(orm.Data, tag='node')
    query.append(orm.Group, with_node='node', project='id')
    query.distinct()

    assert query.count() == 1
    assert query.all(flat=True) == [group.pk]

Note that when swapping the joining order, things work as expected:

query = orm.QueryBuilder()
query.append(orm.Group, tag='group', project='id')
query.append(orm.Data, with_group='group')
query.distinct()

I have drafted a PR that adds the regression test #5554

@chrisjsewell any idea where this might be going wrong?

sphuber avatar Jun 02 '22 16:06 sphuber

@chrisjsewell I think I have found the source of the discrepancy between v1.6 and v2.0. The QueryBuilder._build method which constructs the Query object from the internal query representation, starts with calling session.query(first_alias). Then it loops over the joins, starting from the second edge (since the first is already added). In v1.6, at the very end, the initially added entity is popped: https://github.com/aiidateam/aiida-core/blob/82a81c9ef839f22f1ac0ce8711f17bec605aeb6f/aiida/orm/querybuilder.py#L1997

This line is no longer there in v2.0. The reason is probably since the Query._entities attribute no longer exists in SqlAlchemy v1.4. It seems plausible that since the first entity is no longer popped, that is why in the problematic example explained in the post above, the node is also projected, in addition to the group. However, we only want the group properties to be projected. If we find a way to accomplish the same in SQLA v1.4, i.e., remove the projection of the first alias, we should solve this problem.

sphuber avatar Sep 15 '22 09:09 sphuber