
Collections endpoint

Open · jvita opened this issue 2 years ago · 31 comments

Starting point for discussing a general "Collections" entry

jvita avatar Sep 22 '21 19:09 jvita

Thank you for starting this PR! I always find it helpful to have an example. It would be nice if you could add one if you have the time.

JPBergsma avatar Sep 28 '21 11:09 JPBergsma

Sorry I haven't responded to this in a while; I've been a bit overwhelmed with other projects and haven't found the time to work on this PR.

If entries get updated regularly, you could place relationships in each entry pointing to the collections they belong to. If this rarely happens, you could instead query all the collections to check whether a particular entry is in a given collection and then update it. We don't have to specify in the specification how to update the data belonging to the collections, though, only that it SHOULD be updated. In some cases it could, however, be worthwhile to create a new structure rather than to update the existing one, for example when you want a collection you refer to in an article to stay the same.

I've been struggling with deciding what the best approach to this would be. It seems like it would be difficult and/or expensive to make sure to update everything whenever an entry is changed, especially in the case where you have collections of collections. An alternative could be to not automatically update a collection when one of its entries changes, but to instead provide some kind of functionality to "re-synchronize" the collection to aggregate any desired properties up to the collection. This option would mean the collection isn't guaranteed to be up-to-date unless it's re-synchronized regularly, but it would also avoid running the potentially expensive task of updating everything on every change.

I always find it helpful to have an example. It would be nice if you could add one if you have the time.

I'll try to add an example soon. We (the OpenKIM team) are still working on ironing out our data structures, so I've been holding off on providing an example until I have one that would be representative of our use case.

jvita avatar Oct 14 '21 15:10 jvita

I've been struggling with deciding what the best approach to this would be. It seems like it would be difficult and/or expensive to make sure to update everything whenever an entry is changed, especially in the case where you have collections of collections. An alternative could be to not automatically update a collection when one of its entries changes, but to instead provide some kind of functionality to "re-synchronize" the collection to aggregate any desired properties up to the collection. This option would mean the collection isn't guaranteed to be up-to-date unless it's re-synchronized regularly, but it would also avoid running the potentially expensive task of updating everything on every change.

I think that it is most important that the database as presented to the outside world is consistent. That a new entry appears immediately is less important.

So I think you could create a backlog with all the changes that need to be made to the database and execute these at an appropriate time, for example when the server load is low. This would allow you to accumulate multiple changes, so you would only have to update a collection once. At that moment you could create an updated collection entry and then write both the updated collection and the updated entries belonging to it to the database, thereby minimizing the time the database is in an inconsistent state.

I am not sure whether it will be that much work to update a collection, though. In most cases you will not need to access the other entries in a collection. For example: if the property in the collection is an average, you can update it without needing to access the other entries, as long as you know the old value and the number of entries in the collection. Adding an entry should also be possible without needing to access any of the other entries in the collection. Removing an entry would be more work, as you have to check whether any of the properties in a list (for example, which elements occur in any of the structures in the collection) also occur in another entry. But in many cases these values will occur in another entry, and you can stop as soon as you have found them.

Therefore, I am not sure if it is necessary to implement a log of all the changes that need to be executed.
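The average-update argument above can be checked with a few lines of Python. This is purely illustrative (the helper names are made up, and nothing here is specified by OPTIMADE): an aggregated average can be maintained from the old value and the entry count alone, without touching the other entries.

```python
# Incremental update of an aggregated average, as described above.
# Hypothetical helpers; not part of any OPTIMADE specification.

def add_to_average(old_avg, n, new_value):
    """Average over n+1 entries, given the average over the first n."""
    return (old_avg * n + new_value) / (n + 1)

def remove_from_average(old_avg, n, removed_value):
    """Average over n-1 entries after removing one (requires n > 1)."""
    return (old_avg * n - removed_value) / (n - 1)

values = [2.0, 4.0, 9.0]
avg = sum(values) / len(values)  # 5.0

# Adding an entry touches no other entries:
assert add_to_average(avg, len(values), 5.0) == sum(values + [5.0]) / 4
# Removing an entry likewise (for averages; list-valued aggregates differ):
assert remove_from_average(avg, len(values), 9.0) == sum(values[:2]) / 2
```

As noted above, list-valued aggregates (e.g., which elements occur in a collection) are the harder case on removal, since the removed value may or may not occur in another entry.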

JPBergsma avatar Oct 18 '21 11:10 JPBergsma

I think that it is most important that the database as presented to the outside world is consistent. That a new entry appears immediately is less important.

So I think you could create a backlog with all the changes that need to be made to the database and execute these at an appropriate time, for example when the server load is low. This would allow you to accumulate multiple changes, so you would only have to update a collection once.

I like your idea of creating a backlog of pending changes that can be applied at chosen moments rather than automatically. We have implemented something similar in our software package, where our collections have a boolean flag that tracks whether any of their entries have been changed, so that they can apply those changes before doing any critical operations (like saving to a file, or providing an aggregated statistic).
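A minimal sketch of this dirty-flag scheme (the actual ColabFit implementation is not shown here; all names are hypothetical): entry changes only mark the collection stale, and aggregation is deferred until a critical operation asks for it.

```python
# Dirty-flag collection: aggregation is deferred until needed.
# Illustrative sketch only; field names follow the (x, y) examples above.

class Collection:
    def __init__(self):
        self.entries = {}
        self.aggregated_fields = {}
        self._dirty = False  # set whenever an entry changes

    def add_entry(self, name, entry):
        self.entries[name] = entry
        self._dirty = True  # record that aggregates are now stale

    def _resync(self):
        values = self.entries.values()
        self.aggregated_fields = {
            "x": sorted({e["x"] for e in values}),
            "y": sorted({e["y"] for e in values}),
        }
        self._dirty = False

    def aggregated(self):
        # Critical operations call this, so stale data is never served.
        if self._dirty:
            self._resync()
        return self.aggregated_fields

c = Collection()
c.add_entry("A", {"x": 0, "y": 0})
c.add_entry("B", {"x": 0, "y": 1})
print(c.aggregated())  # {'x': [0], 'y': [0, 1]}
```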

I am not sure whether it will be that much work to update a collection though. In most cases you will not need to access the other entries in a collection.

I agree with you that updating a single collection with multiple changes might not be that bad, but what I'm also worried about is having to update many collections that all share a common entry (or, equivalently, nested collections). In this case you would need to propagate the changes to all collections containing the changed entry, to all collections containing those collections, and so on. In this way, for a single changed entry, the number of collections that need to be updated may grow exponentially, and keeping the database synchronized could become a computationally demanding task.
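The propagation concern can be sketched as a traversal of a "which collections contain me" map (the map and names below are made up for illustration):

```python
# Finding every collection that must be revisited when one entry changes.
# Hypothetical parent map; "coll_C" is a collection of collections.

from collections import deque

# child -> collections that directly contain it
parents = {
    "entry_1": ["coll_A", "coll_B"],
    "coll_A": ["coll_C"],
    "coll_B": ["coll_C", "coll_D"],
    "coll_C": [],
    "coll_D": [],
}

def collections_to_update(changed, parents):
    """All collections transitively containing `changed` (breadth-first)."""
    seen, queue = set(), deque([changed])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

print(sorted(collections_to_update("entry_1", parents)))
# ['coll_A', 'coll_B', 'coll_C', 'coll_D']
```

One observation from the sketch: deduplicating with a visited set keeps the work proportional to the number of distinct ancestor collections rather than the number of containment paths, which is where the exponential blow-up would otherwise come from.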

Here is a basic example that I think will be useful for these discussions. Perhaps we can edit this example as we go:

# A basic collection of (x, y) points
collection_1 = {
	'A': {'x': 0, 'y': 0},  # A, B, and C are low-level entries
	'B': {'x': 0, 'y': 1},
	'C': {'x': 0, 'y': 2},
	'aggregated_fields': {  # these are the fields that have been chosen to be aggregated
		'x': [0],
		'y': [0, 1, 2]
	}
}

# A second collection
collection_2 = {
	'D': {'x': 0, 'y': 3},
	'E': {'x': 1, 'y': 3},
	'F': {'x': 2, 'y': 3},
	'aggregated_fields':  {
		'x': [0, 1, 2],
		'y': [3],
	}
}

# And now, a collection of collections
collection_3 = {
	'collection_1': <pointer_to_collection_1>,
	'collection_2': <pointer_to_collection_2>,
	'aggregated_fields': {
		'x': [0, 1, 2],  # "double aggregation"
		'y': [0, 1, 2, 3],
	}
}

# When a user queries for collections with 'aggregated_fields.x',
# it's important that the database has no backlogged changes,
# so that the query returns the correct results

Here is a second example in which collections are allowed to be non-homogeneous (i.e., the entries are not all required to have the same fields). Again, I think we should discuss whether we want this to be allowed:

# A basic collection of (x, y) points
collection_1 = {
	'A': {'x': 0, 'y': 0},  # A, B, and C are low-level entries
	'B': {'x': 0, 'y': 1},
	'C': {'x': 0, 'y': 2},
	'aggregated_fields': {  # these are the fields that have been chosen to be aggregated
		'x': [0],
		'y': [0, 1, 2]
	}
}

# A collection of (x, y, z) points
collection_2 = {
	'D': {'x': 1, 'y': 3, 'z': 0},
	'E': {'x': 1, 'y': 3, 'z': 1},
	'F': {'x': 1, 'y': 3, 'z': 2},
	'aggregated_fields':  {  # not all fields are being aggregated
		'x': [1],
		'z': [0, 1, 2],
	}
}

# And now, a collection of collections
collection_3 = {
	'collection_1': <pointer_to_collection_1>,
	'collection_2': <pointer_to_collection_2>,
	'aggregated_fields': {
		'x': [0, 1],  # "double aggregation"
		'y': [0, 1, 2],  # note that 'y' wasn't aggregated in collection_2, so we don't get the value '3'
		'z': [0, 1, 2],
	}
}

jvita avatar Nov 05 '21 14:11 jvita

I like your idea of creating a backlog of pending changes that can be applied at chosen moments rather than automatically. We have implemented something similar in our software package, where our collections have a boolean flag that tracks whether any of their entries have been changed, so that they can apply those changes before doing any critical operations (like saving to a file, or providing an aggregated statistic).

Another option could be to have "two" databases. Each time you want to update the database, you perform the update on a copy. Once you finish updating, you switch which database you serve. If you include a version number in the pagination links, you can inform clients that the database has been updated and that the results they get may not be consistent.
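A toy sketch of this "two databases" idea (entirely illustrative; neither the class nor the `db_version` link parameter is part of OPTIMADE): updates are applied to an offline copy, the served copy is swapped atomically, and a version number carried in the pagination links lets a client detect that a swap happened mid-query.

```python
# Blue/green-style update with a version number in the pagination link.
# Hypothetical names throughout; a sketch, not a real implementation.

class VersionedStore:
    def __init__(self, data):
        self.version = 1
        self._live = dict(data)  # the copy currently being served

    def update(self, changes):
        staging = dict(self._live)  # apply updates to an offline copy
        staging.update(changes)
        self._live = staging        # atomic swap of the served copy
        self.version += 1

    def page_link(self, offset):
        # Hypothetical pagination link carrying the database version
        return f"/collections?page_offset={offset}&db_version={self.version}"

store = VersionedStore({"A": 1})
link_before = store.page_link(10)
store.update({"B": 2})
# A client comparing versions can tell the data changed under it:
print(link_before)          # /collections?page_offset=10&db_version=1
print(store.page_link(10))  # /collections?page_offset=10&db_version=2
```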

I agree with you that updating a single collection with multiple changes might not be that bad, but what I'm also worried about is having to update many collections that all share a common entry (or, equivalently, nested collections). In this case you would need to propagate the changes to all collections containing the changed entry, to all collections containing those collections, and so on. In this way, for a single changed entry, the number of collections that need to be updated may grow exponentially, and keeping the database synchronized could become a computationally demanding task.

It depends on what kind of aggregated properties you have, but for the aggregated properties I can think of, you would not need to visit the members of other collections when you update one of the collections in a super-collection. So unless the changed entry occurs in more than 10,000 or so collections, I do not think it would be a problem to update one entry and all the collections it is in.

collection_1 = {
	'A': {'x': 0, 'y': 0},  # A, B, and C are low-level entries
	'B': {'x': 0, 'y': 1},
	'C': {'x': 0, 'y': 2},
	'aggregated_fields': {  # these are the fields that have been chosen to be aggregated
		'x': [0],
		'y': [0, 1, 2]
	}
}

In the example above, you have placed the data ('A': {'x': 0, 'y': 0}) directly inside your collection. This looks a bit strange, as normally a collection would contain a reference to the structure.

I would discourage not aggregating all the fields, as you do in the second example, since the aggregated fields are then inconsistent. A user could look for a collection with 'y': 3 but would not find it.
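The inconsistency can be shown directly. Using collection_2 from the second example above (which aggregates only 'x' and 'z'), a query evaluated against the aggregated fields misses the collection even though every one of its entries has y = 3. The `matches` helper below is an assumption about how such a filter would be evaluated, not specified behaviour:

```python
# A collection query evaluated against aggregated fields only.
# collection_2 is taken from the second example above.

collection_2 = {
    "entries": {
        "D": {"x": 1, "y": 3, "z": 0},
        "E": {"x": 1, "y": 3, "z": 1},
        "F": {"x": 1, "y": 3, "z": 2},
    },
    "aggregated_fields": {"x": [1], "z": [0, 1, 2]},  # 'y' was not aggregated
}

def matches(collection, field, value):
    """Hypothetical filter: check aggregated fields only, not the entries."""
    return value in collection["aggregated_fields"].get(field, [])

print(matches(collection_2, "y", 3))  # False, despite every entry having y = 3
```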

JPBergsma avatar Nov 08 '21 21:11 JPBergsma

@jvita

Thanks again for the work on this. I'm seeing some need for collections in my own work and would like this PR to be merged. I see that the PR has been idling for a while (and I very much understand being busy with other things). Do you plan to address the concerns eventually, or would you be OK with me pushing commits to address the outstanding issues?

rartino avatar Jun 17 '22 07:06 rartino

Hi @rartino; please feel free to make any changes that you wish -- I'm not sure that I'll be able to dedicate much time to this right now, unfortunately.

Thanks!

jvita avatar Jun 17 '22 23:06 jvita

As this PR was made from @jvita's fork, perhaps we should remake it in a branch that others have access to and address the initial comments above? I'm happy to make the PR but can't really commit to owning it overall right now (maybe in a few weeks).

ml-evs avatar Jun 18 '22 20:06 ml-evs

@ml-evs

No need to remake the branch - if someone forks a GitHub repo and then creates a PR against it from a branch, the default is to allow the maintainers of the original repo push access to that branch for precisely this reason.

I've pushed the edits I wanted to do. Please review.

rartino avatar Jun 21 '22 08:06 rartino

@ml-evs

No need to remake the branch - if someone forks a GitHub repo and then creates a PR against it from a branch, the default is to allow the maintainers of the original repo push access to that branch for precisely this reason.

I've pushed the edits I wanted to do. Please review.

Great! I must have been misremembering why we had to add collaborator access on forks outside of the org in the past; I guess it was probably before the PR had been made.

ml-evs avatar Jun 21 '22 08:06 ml-evs

@merkys Thanks for the feedback. I had the idea that the contents of a collection should all go in a "contents" relationship, i.e., that the key directly under relationships should be contents rather than structures, but then I messed up the example.

After seeing your comments and re-reading the specification, I realize that not all output formats need to support the facility to "name" relationships the way that JSON API allows, so we shouldn't rely on that feature (or, we should add that feature to the definition of a relationship). You also remark that the relationship "name" must be the name of the entry type (is that in the specification?).

So, I've re-done this now, so that "any" "normal" "named-as-the-entry-type" relationship defines the set of entries in the collection.

rartino avatar Jun 28 '22 06:06 rartino

I wanted to share an example use-case from the ColabFit project (the project I was working on when I first started this PR) in case it helps guide the development direction of the collections endpoint. We've been working on the structure of our database for a while now, but it has become relatively stable at this point, so this should be a good example of our intended usage. I'll also ping @EFuem, as he's now the lead developer at ColabFit and would be better suited than me for continued participation in this PR (though I don't know if I'm able to @ someone who isn't currently in the repo...).

Some background: The ColabFit project is working on building a database of first-principles training datasets. To do this, it has two fundamental low-level entry types ("Configurations" and "Properties") for storing raw data (e.g., atomic structures and first-principles calculations), and two higher-level entry types ("ConfigurationSets" and "Datasets"). The two higher-level entry types have three main purposes:

  1. For defining collections of Configurations and Properties
  2. For storing additional meta-data about the collections
  3. For storing aggregated values that are generated by performing map-reduce operations on their sub-entries.
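Purpose 3 can be sketched as a conventional map-reduce over sub-entries, producing collection-level values resembling the `aggregated_info` shown further below. The field names follow the example; the code itself is an illustration, not ColabFit's implementation:

```python
# Map-reduce aggregation over Configuration-like entries.
# Illustrative only; field names mirror the aggregated_info example below.

from functools import reduce

configurations = [
    {"elements": ["Al", "Ni"], "nsites": 10},
    {"elements": ["Al", "Ti"], "nsites": 4},
]

def mapper(config):
    # map: pull out the fields to be aggregated from each entry
    return {"elements": set(config["elements"]), "nsites": config["nsites"]}

def reducer(acc, item):
    # reduce: combine per-entry results into collection-level values
    return {
        "elements": acc["elements"] | item["elements"],
        "nsites": acc["nsites"] + item["nsites"],
        "nconfigurations": acc["nconfigurations"] + 1,
    }

initial = {"elements": set(), "nsites": 0, "nconfigurations": 0}
agg = reduce(reducer, map(mapper, configurations), initial)
print(sorted(agg["elements"]), agg["nsites"], agg["nconfigurations"])
# ['Al', 'Ni', 'Ti'] 14 2
```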

Below is an example of what a Configuration entry and a ConfigurationSet entry look like (copied directly from our Mongo database):

Configuration:

{
    _id: ObjectId("629e66b5059cdfec36c0f1d2"),
    'short-id': 'CO_585066926276_000',
    atomic_numbers: [
      13, 13, 13, 13, 13,
      13, 28, 28, 28, 28
    ],
    cell: [ [ 0, 2.86805, 0 ], [ 2.86805, 0, 0 ], [ 0, 0, -15.6918 ] ],
    chemical_formula_anonymous: 'A3B2',
    chemical_formula_hill: 'Al6Ni4',
    chemical_formula_reduced: 'Al3Ni2',
    dimension_types: [ 1, 1, 1 ],
    elements: [ 'Al', 'Ni' ],
    elements_ratios: [ 0.6, 0.4 ],
    labels: [ 'active_learning' ],
    last_modified: '2022-06-06T15:42:24Z',
    names: [ 'train_1st_stage_0' ],
    nelements: 2,
    nperiodic_dimensions: 3,
    nsites: 10,
    pbc: [ 1, 1, 1 ],
    positions: [
      [ 0, 0, -15.2024 ],
      [ 1.43403, 1.43403, -1.56918 ],
      [ 0, 0, -3.62776 ],
      [ 0, 0, -6.53437 ],
      [ 0, 0, -9.41506 ],
      [ 0, 0, -12.2957 ],
      [ 1.43403, 1.43403, -5.07094 ],
      [ 1.43403, 1.43403, -7.97012 ],
      [ 1.43403, 1.43403, -10.86 ],
      [ 1.43403, 1.43403, -13.7592 ]
    ],
    relationships: {
        property_instances: [
          'PI_732493115336_000',
          'PI_168548420957_000',
          'PI_123842619300_000'
        ],
        configuration_sets: [ 'CS_834475110856_000', 'CS_930594991086_000' ]
    }
}

ConfigurationSet:

{
  _id: ObjectId("629e66db059cdfec36c11bc6"),
  'short-id': 'CS_834475110856_000',
  aggregated_info: {
    nconfigurations: 2666,
    nsites: 24851,
    nelements: 3,
    chemical_systems: [
      'AlNiTi', 'Ni',
      'AlTi',   'AlNi',
      'Al',     'Ti',
      'NiTi'
    ],
    elements: [ 'Al', 'Ni', 'Ti' ],
    individual_elements_ratios: {
      Al: [
        0.25,  0.4, 0.33, 0.43, 0.54, 0.67,  0.5,
        0.38, 0.56, 0.75, 0.42,    1, 0.92, 0.58,
        0.18,  0.6, 0.93, 0.11, 0.94, 0.27, 0.44,
        0.91, 0.36, 0.86, 0.78,  0.7,  0.2, 0.45,
        0.62, 0.12, 0.29, 0.17,  0.1, 0.71, 0.22,
        0.21, 0.79, 0.09, 0.88,  0.3,  0.8, 0.55,
        0.64, 0.14, 0.89, 0.06, 0.08, 0.73, 0.57,
        0.82,  0.9, 0.07, 0.24, 0.41, 0.83
      ],
      Ni: [
        0.75,  0.5, 0.67, 0.58, 0.57, 0.14,  0.2,
        0.33, 0.12,  0.4, 0.25,    1, 0.92, 0.42,
        0.43, 0.18,  0.6, 0.93, 0.11, 0.27, 0.91,
        0.44, 0.36, 0.86, 0.04, 0.78,  0.7, 0.45,
        0.62, 0.29, 0.17,  0.1, 0.71, 0.22, 0.21,
        0.46, 0.79, 0.09, 0.54, 0.38, 0.88,  0.3,
        0.55,  0.8, 0.64, 0.89, 0.08, 0.56, 0.73,
        0.82,  0.9, 0.07, 0.24, 0.96, 0.41, 0.83
      ],
      Ti: [
        0.25, 0.67, 0.43,  0.4, 0.46,  0.5, 0.44,
        0.75, 0.36,  0.6, 0.42, 0.92, 0.59, 0.58,
           1, 0.18, 0.93, 0.11, 0.94, 0.27, 0.91,
        0.86, 0.04, 0.78,  0.7,  0.2, 0.45, 0.62,
        0.12, 0.29, 0.17, 0.09, 0.22, 0.71,  0.1,
        0.79, 0.21, 0.96, 0.38, 0.88,  0.3,  0.8,
        0.55, 0.14, 0.89, 0.64, 0.06, 0.08, 0.56,
        0.73, 0.57, 0.82,  0.9, 0.07, 0.83, 0.33
      ]
    },
    total_elements_ratios: {
      Al: 0.2960846646010221,
      Ni: 0.3825600579453543,
      Ti: 0.3213552774536236
    },
    labels: [ 'active_learning' ],
    labels_counts: [ 2666 ],
    chemical_formula_reduced: [
      'AlTi4',     'Al3Ni5Ti',  'Al2Ti3',    'Al5Ni6',     'Al7Ti3',
      'AlTi6',     'Ni7Ti6',    'Al2Ni3',    'Al2Ni5Ti',   'AlTi3',
      'NiTi3',     'Al3Ni7',    'Al5Ni9',    'Al4NiTi2',   'Ni5Ti24',
      'AlNi11',    'Al6Ti5',    'AlNi13',    'Al3Ni7Ti2',  'Al5NiTi6',
      'AlNi4',     'Al6Ni5',    'Al7Ni',     'Al11Ti',     'Al5NiTi3',
      'NiTi6',     'AlNi6Ti5',  'Al7Ni2Ti3', 'Al15Ti',     'AlNi4Ti',
      'AlNiTi7',   'AlNi2Ti',   'AlNi7',     'Ni11Ti3',    'Al9Ti',
      'Al5Ni2Ti',  'Al3Ni5Ti4', 'Ni2Ti7',    'Al3Ni4Ti3',  'AlNi5Ti6',
      'Al5Ni5Ti2', 'Al7Ni4Ti',  'Ni5Ti3',    'Al9Ni',      'Al4Ni2Ti',
      'Al4Ni',     'Ni10Ti',    'Al2Ni9Ti',  'Al13Ti',     'Ni3Ti5',
      'Al5Ni',     'Al3Ni2',    'Ni5Ti7',    'AlNiTi',     'Al4Ni5Ti',
      'Al6Ni5Ti',  'Al2NiTi3',  'Ni7Ti5',    'AlNi2Ti7',   'Ni4Ti7',
      'Ni7Ti4',    'Al4Ni3Ti',  'Al3Ni5',    'Ni5Ti9',     'Al5Ni2Ti3',
      'Al2Ti',     'Al3Ni2Ti4', 'Al7Ni5',    'Al5Ti3',     'AlNiTi2',
      'Al3NiTi3',  'Al6NiTi',   'Al3NiTi2',  'Ni12Ti17',   'AlTi15',
      'AlNi4Ti5',  'Al3Ni8',    'Al4Ni3Ti2', 'Ni5Ti',      'AlNi2Ti4',
      'Al2NiTi2',  'Al7Ni2',    'NiTi',      'Al7Ni16Ti6', 'Al8Ni',
      'Al5NiTi',   'Al4Ni3Ti5', 'Al2Ni5Ti5', 'Al7Ni2Ti',   'Ni4Ti3',
      'AlTi8',     'Ni5Ti2',    'Al9Ni2',    'Al7Ti6',     'Al12Ti17',
      'Ni4Ti5',    'Al3NiTi4',  'Al5Ti24',   'Al2Ti5',     'AlNi8Ti',
      ... 178 more items
    ],
    chemical_formula_anonymous: [
      'A7B3C2',  'A4B3C2', 'A5B3',   'A17B12', 'A10B',
      'A5B4C3',  'A4B2C',  'A4B3',   'A7B6',   'A7B2C',
      'A4B3C3',  'A6B5',   'AB',     'A2BC',   'A9B',
      'A13B',    'A7B5',   'A4B3C',  'A5B2C',  'A15B',
      'A3B3C2',  'ABC',    'A5B5C2', 'A2B',    'A9B2',
      'A7B4C',   'A3BC',   'A6B',    'A3B3C',  'A5B2',
      'A22B',    'A7B3',   'A',      'A3B',    'A4B',
      'A9B2C',   'A6B2C',  'A8B2C',  'A11B',   'A3B2',
      'A10BC',   'A24B5',  'A8BC',   'A9B5',   'A5B4C',
      'A8B3C',   'A4BC',   'A7B4',   'A8B3',   'A5BC',
      'A16B7C6', 'A8B',    'A5B3C2', 'A6B5C',  'A5B2C2',
      'A5B3C',   'A7BC',   'A3B2C',  'A6BC',   'A4B4C',
      'A7B',     'A6B3C',  'A5B',    'A7B2',   'A2B2C',
      'A11B3',   'A5B4'
    ],
    chemical_formula_hill: [
      'Al3Ni5Ti',   'Al5Ni6',     'Al7Ti3',    'Al10Ti8',   'Al2Ni10',
      'AlNi11',     'Al5NiTi6',   'AlNi6Ti5',  'AlNi4Ti',   'AlNiTi7',
      'AlNi2Ti',    'Al2Ni2',     'Al3Ni5Ti4', 'Ni2Ti7',    'Ni10Ti',
      'Al2Ni9Ti',   'Al13Ti',     'Ni5Ti9',    'Al10Ni6',   'Ni12Ti17',
      'Ni3Ti9',     'Al8Ni4Ti2',  'Ni10Ti2',   'Al4Ni2',    'Ni5Ti',
      'Al3Ni3',     'Ni2Ti4',     'Al2Ni6',    'NiTi',      'Al7Ni16Ti6',
      'Ni4Ti3',     'Al9Ni2',     'Ni4Ti5',    'Ni13Ti',    'Al2Ni4Ti4',
      'Al6Ti8',     'Ni10Ti8',    'Al4Ni8Ti8', 'Al2Ni2Ti2', 'Al2Ti6',
      'AlNi9Ti2',   'AlNi2',      'Al2Ti4',    'Al4Ni3',    'AlTi11',
      'Ni',         'Al4Ni10Ti2', 'Ni3Ti8',    'Al7Ti5',    'Al2Ni3Ti',
      'Al6Ni2Ti2',  'Al6Ni6',     'Al2Ni9',    'Al2Ni8',    'NiTi2',
      'Al2Ni6Ti',   'Al4Ni2Ti10', 'AlNiTi8',   'Al2Ti12',   'Al4Ni2Ti3',
      'Al2Ni10Ti4', 'Al3Ti9',     'Ni6Ti5',    'Al2Ni7Ti',  'Ni5Ti5',
      'AlNi9',      'Ni11Ti',     'Al5Ti7',    'Al3Ti5',    'Al2Ti7',
      'AlTi2',      'Al4Ni4Ti8',  'Ni8Ti16',   'Al5Ni3Ti4', 'Al2Ni7Ti3',
      'AlNi2Ti2',   'Ni44Ti2',    'Al2Ni12',   'Al5Ni2Ti2', 'Al5Ni4',
      'Ni4Ti8',     'Al7Ni6',     'Al3Ni11',   'Al2Ni6Ti2', 'AlNi3Ti6',
      'Al2Ti2',     'Ni6Ti10',    'Ni2Ti8',    'Al4Ti7',    'Ni6Ti2',
      'Ni7Ti2',     'Ni12Ti4',    'Ni2Ti2',    'Ni16Ti8',   'Ni5Ti4',
      'Al4Ni4Ti4',  'Al2',        'Al6Ti10',   'Al5Ti',     'Ni10Ti6',
      ... 331 more items
    ],
    nperiodic_dimensions: [ 3 ],
    dimension_types: [ [ 1, 1, 1 ] ]
  },
  description: 'Configurations generated using active learning by iteratively fitting a MTP model, identifying configurations that required the MTP to extrapolate, re-computing the energies/forces/structures of those configurations with DFT, then retraining the MTP model.',
  last_modified: '2022-06-06T15:43:07Z',
  relationships: {
    configurations: [
      'CO_103498874191_000', 'CO_123714783337_000', 'CO_286413493956_000',
      'CO_361853595308_000', 'CO_845864595717_000', 'CO_819703423836_000',
      'CO_104666227259_000', 'CO_126098619494_000', 'CO_809287109563_000',
      'CO_862630349690_000', 'CO_801903344573_000', 'CO_128815263518_000',
      'CO_226198246937_000', 'CO_550511261995_000', 'CO_102983245620_000',
      'CO_114881964944_000', 'CO_955577704605_000', 'CO_257557268903_000',
      'CO_122874373122_000', 'CO_501869001861_000', 'CO_776895673283_000',
      'CO_930848594787_000', 'CO_244173601732_000', 'CO_925418254023_000',
      'CO_600905758462_000', 'CO_487194788646_000', 'CO_812238954873_000',
      'CO_954530262738_000', 'CO_989694254312_000', 'CO_955448816493_000',
      'CO_505857471181_000', 'CO_182758564210_000', 'CO_296421260748_000',
      'CO_692693619819_000', 'CO_133234157101_000', 'CO_103332963818_000',
      'CO_288265507703_000', 'CO_100495156883_000', 'CO_499047226593_000',
      'CO_561554865265_000', 'CO_627559791683_000', 'CO_893221265575_000',
      'CO_968579001413_000', 'CO_105238792074_000', 'CO_510575348303_000',
      'CO_942148128430_000', 'CO_938646901128_000', 'CO_147361514387_000',
      'CO_118470011262_000', 'CO_405465410536_000', 'CO_357952885936_000',
      'CO_673030292495_000', 'CO_130260939580_000', 'CO_132085738871_000',
      'CO_851881108086_000', 'CO_305960046761_000', 'CO_294395214233_000',
      'CO_131520055055_000', 'CO_951126679130_000', 'CO_530070474421_000',
      'CO_347149919415_000', 'CO_127970152634_000', 'CO_380576273334_000',
      'CO_399031038109_000', 'CO_854965187388_000', 'CO_775591143593_000',
      'CO_133487129091_000', 'CO_131652369417_000', 'CO_845506464814_000',
      'CO_111083618699_000', 'CO_570176503189_000', 'CO_629723970586_000',
      'CO_111131394669_000', 'CO_893676209939_000', 'CO_941481952794_000',
      'CO_407827566958_000', 'CO_659415313177_000', 'CO_856193683182_000',
      'CO_962940011645_000', 'CO_101507918432_000', 'CO_586158704401_000',
      'CO_337921169873_000', 'CO_109525958723_000', 'CO_241797560321_000',
      'CO_535732713249_000', 'CO_557224521434_000', 'CO_114181710530_000',
      'CO_216631133540_000', 'CO_776524570077_000', 'CO_685546368290_000',
      'CO_791391514643_000', 'CO_630441142338_000', 'CO_177355754902_000',
      'CO_110192275836_000', 'CO_107560021239_000', 'CO_636112115595_000',
      'CO_535867816114_000', 'CO_820687078851_000', 'CO_109305047769_000',
      'CO_152627356945_000',
      ... 2566 more items
    ],
    datasets: [ 'DS_555294579000_000' ]
  }
}

Some things to note:

  • Something like an aggregated_info field is very important to us, as much of the critical information in our collections is generated through map-reduce operations on their entries
  • Details for how to perform the map-reduce operation for a given entry type are (currently) defined outside of the database itself -- in our case, through the definition of a Python class which sub-classes our base entry type and defines its own reduction operation.
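A sketch of that subclassing pattern (the class and method names are hypothetical, not ColabFit's actual API): the reduction for each entry type lives in a Python subclass of a shared base entry type, outside the database itself.

```python
# Per-entry-type reduction defined via subclassing a base entry type.
# All names here are made up for illustration.

class BaseEntry:
    def reduce(self, entries):
        raise NotImplementedError  # each entry type defines its own reduction

class ConfigurationEntry(BaseEntry):
    def reduce(self, entries):
        # aggregate the fields this entry type cares about
        return {
            "nconfigurations": len(entries),
            "nsites": sum(e["nsites"] for e in entries),
            "elements": sorted({el for e in entries for el in e["elements"]}),
        }

agg = ConfigurationEntry().reduce([
    {"nsites": 10, "elements": ["Al", "Ni"]},
    {"nsites": 4, "elements": ["Ni", "Ti"]},
])
print(agg)  # {'nconfigurations': 2, 'nsites': 14, 'elements': ['Al', 'Ni', 'Ti']}
```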

Some additional assumptions that we've made:

  • Sub-entries will change relatively infrequently, so the contents of aggregated_fields rarely (if ever) change.
  • Collections will consist of large numbers (millions? tens of millions?) of sub-entries

jvita avatar Jun 28 '22 15:06 jvita

@jvita

Something like an aggregated_info field is very important to us, as much of the critical information in our collections is generated through map-reduce operations on their entries

Right, indeed - I removed your aggregated_info field definitions from the PR not because I thought this kind of aggregated information isn't useful to make available via OPTIMADE, but because the aggregations at some level must reference things that are database-specific. In that case it makes sense to make this aggregated information available via database-prefixed fields in the collections endpoint, e.g., _colabfit_aggregated_info. If we release a new OPTIMADE version with the present PR merged, that is something you could do directly with the exact field definitions I took out of the PR; you just prefix them with _colabfit_. However, adding these definitions to the joint standard seems like quite a bit of complexity for something that doesn't seem to help interoperability.

Also: with the property definitions PR now in OPTIMADE, a detailed description of these fields can still be made available over your info endpoint.

rartino avatar Jun 28 '22 15:06 rartino

Re-pinging @merkys and @ml-evs, who've commented on/reviewed this before, to approve or comment.

rartino avatar Jun 29 '22 06:06 rartino

@rartino

I had the idea that the contents of a collections should all go in a "contents" relationship, i.e., that the tag directly under relationship should be contents rather than structures, but then I messed up the example.

After seeing your comments and re-reading the specification, I realize that not all output formats need to support the facility to "name" relationships the way that JSON API allows, so we shouldn't rely on that feature (or, we should add that feature to the definition of a relationship). You also remark that the relationship "name" must be the name of the entry type (is that in the specification?).

So, I've re-done this now, so that "any" "normal" "named-as-the-entry-type" relationship defines the set of entries in the collection.

I seemingly misunderstood the whole issue with relationship naming. I also thought that JSON API had some requirements for relationship names (apparently it does not). So now I envisage three suggestions:

  1. Do not mandate the relationship names (this is what the PR does now). The advantage of this option is that we do not need to make any JSON API-specific provisions in the specification, but having no recommendation apart from an example may leave implementers of both clients and servers wondering.
  2. Use a contents relationship name for JSON API. This would put all related entries into the same list.
  3. Separate relationships per entry type (as in the example). Not sure if this is actually beneficial.

merkys avatar Jun 30 '22 07:06 merkys

@merkys

I think we first need to clarify this ambiguity of JSON API relationship naming in general...

The most straightforward solution seems to be to edit 5.3.2 Entry Listing JSON Response Schema and, for the definition of the "relationships" field, say that: "Relationships MUST be grouped into a single JSON API relationships object for each entry type and placed in the relationships dictionary with that entry type name as key (e.g., "structures")."

If we rather want to preserve the JSON API relationship "name" as a means to group related things - possibly of different types - together, this feature of a 'relationship name' needs to be added to the output-format-agnostic description of relationships in 3.6 Relationships, preferably with some hint in the implementers note about how one can encode these names in more straightforward key-value-based output formats (e.g., "type name + '_' + relationship name"?). It is probably also a good idea to add a clarification in 5.3.2 that the key in this dictionary field represents the relationship name.

  3. Separate relationships per entry type (as in example). Not sure if this is actually beneficial.

The most straightforward solution above means going with this choice.
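For concreteness, a small sketch of what the "straightforward solution" would look like in a JSON response: one JSON API relationships object per entry type, keyed by the entry type name. This is illustrative of the proposal only, not (yet) anything the specification mandates; the IDs are made up.

```python
# Relationships grouped per entry type, keyed by the entry type name,
# as proposed above. Written as a Python dict mirroring the JSON response.

import json

collection_entry = {
    "id": "example_collection",
    "type": "collections",
    "relationships": {
        # each key is an entry type name
        "structures": {
            "data": [
                {"type": "structures", "id": "structure_1"},
                {"type": "structures", "id": "structure_2"},
            ]
        },
        "references": {
            "data": [{"type": "references", "id": "reference_1"}]
        },
    },
}

# Every relationship key names the entry type of its related resources:
for key, rel in collection_entry["relationships"].items():
    assert all(item["type"] == key for item in rel["data"])
print(json.dumps(sorted(collection_entry["relationships"])))  # ["references", "structures"]
```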

rartino avatar Jun 30 '22 07:06 rartino

I drafted an issue (probably years ago now) about our non-compliance with JSON:API relationships (but never got around to posting it). The Fetching relationships part of the JSON:API spec states that we should also serve relationships at e.g., https://example.com/v1/collections/42/relationships/structures.

Hmm... Reading Section 1 I'd say that strictly speaking, this construct is indeed required to be supported:


"The API specification described in this document builds on top of the JSON API v1.0 specification. In particular, the JSON API specification is assumed to apply wherever it is stricter than what is formulated in this document. Exceptions to this rule are stated explicitly (e.g. non-compliant responses are tolerated if a non-standard response format is explicitly requested)."


This is actually part of a bigger non-compliance, as JSON:API mandates in Fetching resources that you should be able to pick out particular attributes of an entry via e.g., https://example.com/v1/structures/123/elements. This would break our allowed IDs (which can have slashes, as I currently do with odbx/1... hence why I never posted the issue :sweat_smile:). Looking at my draft (which I am happy to post), my suggestion was to just add a note in the specification that this aspect of JSON:API is not supported, but maybe it is now actually required for collections functionality to work properly...

This is quite something to drop on this poor PR :-), and IMO a very important issue, but with very little to do with the acceptance or not of this PR, given that we already generally allow relationships - it should just be posted as a general issue. (In relation to what I commented above about included: I didn't even reflect over any other way of fetching the resources than by specifically querying them from their respective endpoints from the list of ids; so I wouldn't say this feature is essential for collections.)

Nevertheless, strictly formally, based on the quote above, I think the implication is that we MUST support this. How your implementation works out the separation between your "/"-based IDs and these fields is up to you, but you can certainly do it by, e.g., relying on your IDs having a fixed format, so you know the first slash is always part of the ID. That said, the realization here is that one should probably avoid "/" in IDs... (Very helpful of us to give examples showing "/"-based IDs, though.)

rartino avatar Jul 06 '22 18:07 rartino

This is quite something to drop on this poor PR :-), and IMO a very important issue, but with very little to do with the acceptance or not of this PR, given that we already generally allow relationships - it should just be posted as a general issue. (In relation to what I commented above about included: I didn't even reflect over any other way of fetching the resources than by specifically querying them from their respective endpoints from the list of ids; so I wouldn't say this feature is essential for collections.)

Indeed, I'll dig out my draft issue and post it. I think I mentioned it at a meeting once but perhaps I did not explain myself well.

ml-evs avatar Jul 06 '22 20:07 ml-evs

I don't think a discussion of #420 is blocking this PR, if we are happy with the format otherwise.

ml-evs avatar Jul 06 '22 20:07 ml-evs

I don't think a discussion of https://github.com/Materials-Consortia/OPTIMADE/issues/420 is blocking this PR, if we are happy with the format otherwise.

I agree - but I think this PR is now blocked by #419, so if it may be helpful if you have any thoughts on that...

rartino avatar Jul 07 '22 08:07 rartino

Just another quick thought that there is some enforced directionality to the collections relationship, in that each member of the collection does not have a relationship back to its collection. If it did, you would be able to do /structures?filter=collections.id = "Si-potential". Do we need to enforce either mono- or bi-directional links, or should we just suggest that each database is consistent in its application (e.g., if one structure has a link to its collection, then all other structures that are part of a collection in that database should also have the same link?)

ml-evs avatar Jul 08 '22 09:07 ml-evs

Just another quick thought that there is some enforced directionality to the collections relationship, in that each member of the collection does not have a relationship back to its collection. If it did, you would be able to do /structures?filter=collections.id = "Si-potential". Do we need to enforce either mono- or bi-directional links, or should we just suggest that each database is consistent in its application (e.g., if one structure has a link to its collection, then all other structures that are part of a collection in that database should also have the same link?)

The suggested direction of the mono-directional relationships could also be reversed, i.e. instead of collections->relationships->structures, we enforce structures->relationships->collections (and ban the opposite direction). Each relationship then becomes one-to-few (a structure will be in fewer collections than the number of structures per collection) and the overall collection metadata is still accessible at the collections endpoint.
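As a sketch of what that reversed, mono-directional direction could look like on the wire: a structure entry carrying a JSON:API relationship pointing at the collection(s) it belongs to. The field layout below is illustrative only ("Si-potential" is taken from the example above; the ids and attributes are invented), not an agreed format.

```python
import json

# Illustrative structures -> collections relationship: the structure
# entry itself lists the few collections it belongs to.
structure_entry = {
    "id": "example/1",
    "type": "structures",
    "attributes": {"chemical_formula_reduced": "Si"},
    "relationships": {
        "collections": {
            "data": [{"type": "collections", "id": "Si-potential"}]
        }
    },
}

print(json.dumps(structure_entry, indent=2))
```

The one-to-few shape is visible here: the `data` array stays short regardless of how many structures the collection contains.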

ml-evs avatar Jul 08 '22 09:07 ml-evs

Do we need to enforce either mono- or bi-directional links, or should we just suggest that each database is consistent in its application (e.g., if one structure has a link to its collection, then all other structures that are part of a collection in that database should also have the same link?)

Just to throw in my two cents...

In our work with ColabFit we originally started off only requiring mono-directional links (in the collections->relationships->structures direction), but we quickly found that this made it unwieldy/inefficient to do many important queries that would have been simple if we had instead just used bi-directional links. The trade-off is of course that it might be more challenging for data providers to maintain the bi-directional links, especially if the given application requires that changes be propagated along the links whenever content at one end of the link changes.

So to me, the question of whether to use mono-/bi-directional links, and in which direction if you're only doing mono-directional, is just too application-dependent. Each application would want to make a different choice depending on 1) what types of queries they want to support, and 2) what direction information needs to flow if they're doing data aggregation.

My suggestion:

  1. Don't enforce any constraints on the link directions, but add some method of letting users specify which directions are available for their given application
  2. OR just choose one of the MONO-directional links

Personally, option 1 seems better to me. Though I can't think of an example off the top of my head, I could imagine that there might be some application where it's easier for the data provider to maintain one direction than it is to maintain the other, in which case option 2 would be overly restrictive.

jvita avatar Jul 08 '22 17:07 jvita

The way I edited the PR, it is meant to say that the direction collection -> members is mandatory. The way I read the relationships part of the OPTIMADE specification, databases are allowed (but not required) to also provide relationships in the other direction.

I don't like the idea of exposing clients to the possibility of databases providing only members -> collection, since it effectively means they have to run one query for every endpoint of the database before they are sure to have all the members of a collection.

With enforced collection -> members you get all members from a single query.
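For illustration, a sketch of the mandated collection -> members direction: a single entry from the collections endpoint already enumerates every member across entry types, so a client needs no per-endpoint follow-up queries just to list them. All ids below are invented.

```python
# Illustrative /collections/<id> entry; JSON:API groups relationships
# per entry type, so members of different types sit in separate lists.
collection_entry = {
    "id": "Si-potential",
    "type": "collections",
    "relationships": {
        "structures": {
            "data": [
                {"type": "structures", "id": "example/1"},
                {"type": "structures", "id": "example/2"},
            ]
        },
        "references": {
            "data": [{"type": "references", "id": "ref/1"}]
        },
    },
}

# A client can collect all member ids from this single response.
member_ids = [
    member["id"]
    for rel in collection_entry["relationships"].values()
    for member in rel["data"]
]
print(member_ids)  # ['example/1', 'example/2', 'ref/1']
```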

Even if your database deep down encodes collections as member -> collection, I expect you to have implemented some efficient way of collecting all members in a collection (otherwise, what are you doing with your collections?...), so you should be able to answer OPTIMADE queries of this form efficiently.

rartino avatar Jul 08 '22 18:07 rartino

@ml-evs I was about to say that with #438 merged this should be good to go. But, I realize your concerns about very large collections aren't resolved yet.

It seems issue #428, created in response of the need to sharing very large sets as part of trajectories in #377, should be relevant here; but I don't see how it can be directly applied to JSON:API relationships.

What do you think: should we re-edit collections to not use JSON:API relationships and instead use a regular attribute named, e.g., ranged_members which uses the mechanisms in #428 to access the contents?

rartino avatar Dec 19 '22 14:12 rartino

Using a ranged_members field instead of JSON:API relationships would enable returning all the ids and types of the members of the collection, even when there are more than ~ 100,000 members.

It would also solve the included problem. If we used the normal JSON:API relationships the client could specify in the include field that the structures belonging to the collection should be returned as well. This would result in a very large response. (I did notice in the specification that the included field is optional, but I'm wondering whether it would be strange if we support it for the references but not for e.g. structures.)

Instead, the entries in the collection could be retrieved via a query that checks if the id of an entry is in a list of ids supplied by the client. This would use the same paging mechanism that we already use. A disadvantage could however be that there may be a maximum size for the query parameter (depending on the software and settings, it could be as small as 1024B). So the number of entries per query could be limited.
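A client-side sketch of that workaround: split the member id list into several filter queries, each kept under a chosen byte budget. The 1024-byte budget and base URL are illustrative assumptions, not values from any specification.

```python
from urllib.parse import quote

BASE = "https://example.com/v1/structures?filter="
MAX_QUERY_BYTES = 1024  # assumed server-side limit; purely illustrative

def chunked_filters(ids):
    """Yield OPTIMADE filter strings 'id="a" OR id="b" ...' whose
    percent-encoded length stays within the byte budget."""
    clauses, size = [], 0
    for entry_id in ids:
        clause = f'id="{entry_id}"'
        extra = len(quote(clause)) + (len(quote(" OR ")) if clauses else 0)
        if clauses and size + extra > MAX_QUERY_BYTES:
            yield " OR ".join(clauses)
            clauses, size = [], 0
            extra = len(quote(clause))
        clauses.append(clause)
        size += extra
    if clauses:
        yield " OR ".join(clauses)

# Build one URL per chunk for a hypothetical 500-member collection.
queries = [BASE + quote(f) for f in chunked_filters(f"id_{i}" for i in range(500))]
print(len(queries), "requests needed")
```

The client then pages through each request as usual, so the mechanism composes with the existing pagination.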

An alternative solution would be to make bidirectional associations and query the individual entries for membership.

So in summary, I think it would be better to use a separate variable for storing the members, as this gives more flexibility to handle them.

JPBergsma avatar Dec 20 '22 13:12 JPBergsma

It would also solve the included problem. [...]

To try to stay on point, I do not think we need to solve the "included problem" to add collections to the specification. The specification clearly allows an implementation not to ever include anything in a collections response, and then there is no problem. If someone wants to improve the general handling of this (which applies throughout all of OPTIMADE), it can go in a separate PR.

(I did notice in the specification that the included field is optional, but I'm wondering whether it would be strange if we support it for the references but not for e.g. structures.)

It is specifically encouraged for references because it is a sensitive issue to include data without sourcing it. That only applies to references. I'd be happy with implementations never using or supporting included beyond that.

Instead, the entries in the collection could be retrieved via a query that checks if the id of an entry is in a list of ids supplied by the client. This would use the same paging mechanism that we already use. A disadvantage could however be that there may be a maximum size for the query parameter (depending on the software and settings, it could be as small as 1024B). So the number of entries per query could be limited.

I was trying to force MUST level support for POST in OPTIMADE using the exact same semantics as for GET, to avoid issues like this. I'm sad that was rejected. Nevertheless, a client can work around the limitations by splitting up its request, so the issue can be worked around. I think that is good enough to accept the proposed version of collections.

An alternative solution would be to make bidirectional associations and query the individual entries for membership.

I am very strongly opposed to a MUST level requirement for bi-directional collections (and, if clients cannot rely on it being there, i.e, a MUST, then it is not very useful). In the httk implementation we generally want to view stored data as immutable, and hence adding something existing to a collection would need some kind of side storage / querying tricks to handle the bi-directionality.

So in summary, I think it would be better to use a separate variable for storing the members, as this gives more flexibility to handle them.

Right, so, let's get the solution of #428 implemented and merged, and then use it here.

rartino avatar Dec 20 '22 14:12 rartino

Instead, the entries in the collection could be retrieved via a query that checks if the id of an entry is in a list of ids supplied by the client. This would use the same paging mechanism that we already use. A disadvantage could however be that there may be a maximum size for the query parameter (depending on the software and settings, it could be as small as 1024B). So the number of entries per query could be limited.

I was trying to force MUST level support for POST in OPTIMADE using the exact same semantics as for GET, to avoid issues like this. I'm sad that was rejected. Nevertheless, a client can work around the limitations by splitting up its request, so the issue can be worked around. I think that is good enough to accept the proposed version of collections.

This is yet another argument in favor of supporting POST (#114). It may be worth revisiting the discussion there.

merkys avatar Dec 21 '22 07:12 merkys

I note that our merging of #467 probably should have closed #428. I don't see why we cannot complete and merge this PR now. (It would of course be nice to also have #481, but it isn't necessary). I'll take a look on the text and check if I can update it.

rartino avatar Jan 09 '24 14:01 rartino

After discussions: we realize that the partial data format in OPTIMADE may not cover the case of a relationship list too long to fit in a response. So we have to do one of the following:

  • Put the members of the collection in the data part instead
  • Amend the partial data handling to work with an overly long relationship list
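A sketch of what the first option could look like: members returned in the data part of a sub-resource response and paginated like any other entry listing, so an arbitrarily long member list never has to fit into one relationships object. The URL layout and field names are purely illustrative.

```python
# Illustrative paginated member listing for a collection; "next" follows
# the usual OPTIMADE links pattern for fetching further pages.
page_one = {
    "data": [
        {"type": "structures", "id": "example/1"},
        {"type": "structures", "id": "example/2"},
    ],
    "links": {
        "next": "https://example.com/v1/collections/Si-potential/members?page_offset=2"
    },
    "meta": {"data_returned": 2},
}

def has_more_pages(response):
    """A client keeps fetching while a 'next' link is present."""
    return response.get("links", {}).get("next") is not None

print(has_more_pages(page_one))  # True
```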

rartino avatar Jan 17 '24 14:01 rartino