Adding the trajectories endpoint to OPTIMADE

During the OPTIMADE meeting in June 2021, we discussed in room 2 how the OPTIMADE standard could be expanded to allow the sharing of trajectory data. Based on the discussions with, among others, @giovannipizzi, @gmrigna, Adam Hospital, Tristan Bereau, Amit Gupta, Corey Oses, Ellad Tadmor, Maithilee Motlag, and Daniel S Karls, I have written a first draft for adding a trajectory endpoint. This will hopefully be a good base for further discussion.

The PR adds a trajectories section (section 7.3 in this PR) that defines how a client can query and retrieve data from trajectories and how the server can share this data. We have tried to do this in a general way so that many different kinds of trajectories can be shared.

Jul 21 '21 14:07 JPBergsma

Can we add an endpoint for bulk downloads?

Sep 09 '21 10:09 giovannipizzi

Can we add an endpoint for bulk downloads?

What would you want to do with that endpoint that could not be done with the endpoints that are already under discussion? The server is already allowed to send back as many trajectories as it wants in one go unless the client specifies a page limit.

Sep 12 '21 17:09 JPBergsma

Can we add an endpoint for bulk downloads?

What would you want to do with that endpoint that could not be done with the endpoints that are already under discussion? The server is already allowed to send back as many trajectories as it wants in one go unless the client specifies a page limit.

Both the client and the server can impose limits. What I'm thinking is that the server might have to impose limits: we're returning data in JSON, which in most implementations has to be fully loaded in memory to be parsed, so the server will have to paginate. However, there are other formats that would allow a simple file download (possibly also with streaming, reading one line at a time). Or the server could simply serve the file in the format stored in the underlying database.

Therefore, I think having a way to allow bulk downloads is very important. Otherwise, for big trajectories, the server will have to convert to JSON and paginate, and the client will have to perform many paginated requests, put everything together, and reconstruct the trajectory in some other format. A bulk download would be much faster.
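To make the cost of the paginated route concrete, here is a minimal sketch of what a client has to do: follow the `links.next` field of each response until it is missing or null, and collect the entries along the way. The `fetch` callable, the page names, and the shape of the entries in `data` are assumptions made for illustration; they are not taken from the PR.

```python
from typing import Callable, Iterator


def iter_entries(first_url: str, fetch: Callable[[str], dict]) -> Iterator[dict]:
    """Yield every entry in `data` across all pages, following `links.next`."""
    url = first_url
    while url:
        page = fetch(url)
        yield from page.get("data", [])
        next_link = (page.get("links") or {}).get("next")
        # `next` may be a plain URL string or a link object with an `href`.
        if isinstance(next_link, dict):
            next_link = next_link.get("href")
        url = next_link


# Hypothetical in-memory "server" with three pages (illustration only).
PAGES = {
    "page1": {"data": [{"frame": 0}, {"frame": 1}], "links": {"next": "page2"}},
    "page2": {"data": [{"frame": 2}], "links": {"next": {"href": "page3"}}},
    "page3": {"data": [{"frame": 3}], "links": {"next": None}},
}
requests_made = []


def fake_fetch(url: str) -> dict:
    requests_made.append(url)
    return PAGES[url]


frames = list(iter_entries("page1", fake_fetch))
```

Every page costs one round trip and one full JSON parse, which is the overhead a single bulk download would avoid.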

Sep 14 '21 14:09 giovannipizzi

We could allow references/links that point directly to the (compressed) output files that contain the trajectory data, such as xyz, pdb, etc. This would, however, also apply to the structures endpoint, where databases may also want to supply the original pdb and CIF files. So I do not see this as an issue specifically for the trajectories.

There is also PR #360, which suggests a files endpoint. If that PR gets merged, we can also update our description for the trajectory endpoint and add a piece about how to share the trajectory data in these file types.

Edit: The server could store the files in a preprocessed format, so it only needs to read the parts of the files that contain the properties that are needed. As long as the requested data is contiguous, very little processing would be needed.

The structure of the JSON file would be quite predictable, so I do not think that a server would have a problem with the file being larger than the amount of memory it has. It can simply generate each data package just before it needs to be sent, so it would never need to have the whole file in memory. Similarly, if we supply the necessary information to the client, the client could also start processing and storing the data before the entire JSON file is received. In that case, this information must be sent at the start of the returned JSON file. I have not written the current standard with this in mind, so I may have to go over it, but I think it should be possible to do this.
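A rough sketch of that server-side idea: a generator that emits the JSON document piece by piece, with the metadata first, so that only one frame is held in memory at a time. The key names `meta` and `frames` and the helper `generate_frames` are hypothetical and chosen only for this example; they are not the layout defined in the PR.

```python
import json
from typing import Iterable, Iterator


def stream_trajectory_json(meta: dict, frames: Iterable[list]) -> Iterator[str]:
    """Yield a JSON document in chunks: metadata first, then one frame at a time."""
    # The metadata goes first so a client can start processing before the end.
    yield '{"meta": ' + json.dumps(meta) + ', "frames": ['
    for index, frame in enumerate(frames):
        if index:
            yield ","
        yield json.dumps(frame)
    yield "]}"


def generate_frames(nframes: int) -> Iterator[list]:
    """Hypothetical frame source: one site whose z coordinate grows per frame."""
    for i in range(nframes):
        yield [[0.0, 0.0, float(i)]]


chunks = list(stream_trajectory_json({"nframes": 3}, generate_frames(3)))
# Joining the chunks is only done here to check that the result is valid JSON;
# a real server would write each chunk to the response as it is produced.
document = json.loads("".join(chunks))
```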

Sep 14 '21 17:09 JPBergsma

This is a large addition to the specification, albeit well-written. I personally do not deal with trajectories, so my comments may look naive, but wouldn't it be simpler and sufficient to define a trajectory by a sequence (collection maybe, #386) of OPTIMADE structures? I am aware that the current proposal introduces means to "pack" values using the frame_serialization_format mechanism, but I wonder whether the added complexity vs. benefit ratio is optimal.

Thank you for your comments. There are several reasons why a trajectory cannot be stored well as a series of structures.

An MD trajectory can easily contain 100,000 time steps. To save storage space, not all the fields will be stored for every frame. In a project I did as a student, I stored the constraint force every 5 time steps, yet I stored the positions only every 100 steps. cartesian_site_positions is a mandatory field, so I would not be able to create a proper structure for all the time steps where I have data on the constraint force. Even if I had that data, storing that many structures would take up a lot of hard drive space.
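The numbers in this example can be worked out with a short sketch. It assumes that both properties are sampled starting at step 0; the variable names are made up for the illustration.

```python
N_STEPS = 100_000       # time steps in the trajectory
POSITION_STEP = 100     # positions stored every 100 steps
FORCE_STEP = 5          # constraint force stored every 5 steps


def stored_steps(n_steps: int, step: int) -> range:
    """Time steps at which a property is stored (assumed to start at step 0)."""
    return range(0, n_steps, step)


positions = stored_steps(N_STEPS, POSITION_STEP)
forces = stored_steps(N_STEPS, FORCE_STEP)

n_positions = len(positions)
n_forces = len(forces)
# Steps that have a constraint force but no positions, and therefore could not
# be expressed as a structure with the mandatory cartesian_site_positions.
forces_without_positions = sum(1 for s in forces if s % POSITION_STEP != 0)

print(n_positions, n_forces, forces_without_positions)
```

Under these assumptions, only 1,000 of the 20,000 steps with a constraint force also have positions, leaving 19,000 steps that could not be turned into valid structures.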

There would also be so many structures that queries with multiple properties would be very slow. These structures would also drown out structures which occur only once. So I think we would have to somehow distinguish them from the normal structures anyway.

To keep things as close as possible to the current OPTIMADE, we have therefore implemented the reference frame. So you can still query trajectories in the same way as you would with a normal structure. There are still some rare cases where this could cause problems, but I think it should be sufficient for the vast majority of trajectories.

Dec 16 '21 10:12 JPBergsma

Merkys: Both requesting frame ranges via URL query parameters and returning their values inside trajectory entries also seem to deviate from OPTIMADE-ish and JSON:API-ish way of doing things. Maybe frame range could be made a subresource of a trajectory? For example, instead of requesting /trajectories/1?first_frame=10&last_frame=100&frame_step=20 we could request /trajectories/1/10-20-100? This would then return fields describing the frame range, but not the trajectory.

@giovannipizzi requested that I include the frame ranges within the response. If I remember correctly, he found it clearer to return them with the trajectory property, as it may not be immediately clear what the first frame and last frame are when the frame_serialization_format is explicit_regular_sparse.

The method you suggest for selecting frames is quite universal. It should also apply to other future endpoints that could have some form of indexing. So we will need to consider such future use cases as well.
I think that 10-20-100 would at present be a valid id. So we would need to add a rule like: "IDs cannot contain a dash '-'".
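As a rough illustration of the suggested path segment, the sketch below parses a `first-step-last` string such as 10-20-100. The function and class names are hypothetical, and treating the last frame as inclusive is an assumption; the proposal does not settle either point.

```python
import re
from typing import List, NamedTuple, Optional


class FrameRange(NamedTuple):
    first_frame: int
    frame_step: int
    last_frame: int


_RANGE = re.compile(r"(\d+)-(\d+)-(\d+)")


def parse_frame_range(segment: str) -> Optional[FrameRange]:
    """Parse 'first-step-last'; return None if the segment is not a valid range."""
    match = _RANGE.fullmatch(segment)
    if match is None:
        return None
    first, step, last = map(int, match.groups())
    if step < 1 or last < first:
        return None
    return FrameRange(first, step, last)


def selected_frames(frame_range: FrameRange) -> List[int]:
    """Frame indices covered by the range (last frame assumed inclusive)."""
    return list(
        range(frame_range.first_frame, frame_range.last_frame + 1, frame_range.frame_step)
    )


parsed = parse_frame_range("10-20-100")
frames_selected = selected_frames(parsed)
not_a_range = parse_frame_range("mp-1234")
```

The string on its own does not tell a parser whether it is an id or a frame range, which is why some rule would be needed to keep the two apart.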

Do I understand correctly that by "This would then return fields describing the frame range, but not the trajectory." you mean that you would not return the fields available_properties, reference_structures, nframes and reference_frame when only a subrange of the trajectory is requested?

Sep 29 '22 16:09 JPBergsma
