Comparing with lists
While implementing a client, I am having a hard time querying for and retrieving, as an example, all 2-dimensional structures.
In trying to figure this out, I found that there is no way to query for property = value when the property is a list, which is maybe okay?
But can be a problem in some instances.
Specifically for dimension_types this can be solved with a query possibility of counting a value in a list, i.e., "how many of X is there in property_list Y". With this, one could check for the number of zeros (and/or ones) in dimension_types and deduce the dimensionality of the structure.
If there are some clever tricks in the current filter language to be combined and get unique results for these issues I would be very interested to know of them.
Good catch! The closest one can get currently is limiting the results to 1D-2D structures:
dimension_types HAS 0 AND dimension_types HAS 1
But there seems to be no way to tell 1D from 2D... As we're in feature freeze now, shall we think about "how many of X is there in property_list Y" queries for v1.1?
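To make the limitation concrete, here is a small Python sketch (not part of the specification; the helper name has_zero_and_one is made up for illustration) that evaluates the filter above against all eight possible dimension_types values and groups the matches by their number of periodic directions:

```python
from itertools import product


def has_zero_and_one(dimension_types):
    """Emulates: dimension_types HAS 0 AND dimension_types HAS 1."""
    return 0 in dimension_types and 1 in dimension_types


# Group every dimension_types value that passes the filter
# by its number of periodic directions.
matches = {}
for dims in product((0, 1), repeat=3):
    if has_zero_and_one(dims):
        matches.setdefault(sum(dims), []).append(list(dims))

for periodic_count, values in sorted(matches.items()):
    print(periodic_count, values)
```

Both the 1D and the 2D patterns end up in the result, while 0D and 3D are excluded, which is exactly the ambiguity described above.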
Right, the lack of a "list is equal to" operator has been observed, but we couldn't think of an example where it was needed with what we had standardized so far...
But I wonder if this isn't rather a sign that the format of dimension_types could have been chosen better. We could have gone with, e.g.,
periodic_dimensions = [0, 2]
and then periodic_dimensions LENGTH 2 would do the trick for finding 2D structures.
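As a rough Python illustration of that alternative representation (periodic_dimensions is only the hypothetical property suggested here, and both function names are assumptions), the conversion and the length check could look like this:

```python
def periodic_dimensions(dimension_types):
    """Indices of the periodic directions, e.g. [1, 0, 1] -> [0, 2]."""
    return [i for i, flag in enumerate(dimension_types) if flag == 1]


def is_2d(dimension_types):
    # Emulates the hypothetical filter: periodic_dimensions LENGTH 2
    return len(periodic_dimensions(dimension_types)) == 2


print(periodic_dimensions([1, 0, 1]), is_2d([1, 0, 1]))
```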
Nevertheless, to solve the problem at hand, we could simply standardize a new property, e.g., periodic_dimensionality, exactly for this.
For the more general question of list-equal (if we need it), I note that our present zip-syntax would be sufficient if one also has access to a "range"-type list = [0, 1, 2]
range:dimension_types HAS ALL 0:0, 1:1, 2:1 OR range:dimension_types HAS ALL 0:1, 1:0, 2:1 OR range:dimension_types HAS ALL 0:1, 1:1, 2:0
But it would surely be more readable with something like
dimension_types = [0, 1, 1] OR dimension_types = [1, 0, 1] OR dimension_types = [1, 1, 0]
OR
dimension_types COUNT 0 = 1 AND dimension_types COUNT 1 = 2
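A minimal Python sketch of how I read these three formulations, to check that they select the same entries. None of this is specified behaviour: the "range"-type list, list equality and COUNT are all proposals at this point, and the function names are made up for illustration.

```python
from itertools import product

PATTERNS_2D = ([0, 1, 1], [1, 0, 1], [1, 1, 0])


def zip_has_all(list_a, list_b, pairs):
    """Emulates: list_a:list_b HAS ALL a1:b1, a2:b2, ...

    True when every requested pair occurs among the zipped elements.
    """
    zipped = set(zip(list_a, list_b))
    return all(pair in zipped for pair in pairs)


def is_2d_zip(dimension_types):
    index = range(len(dimension_types))  # the assumed "range"-type list [0, 1, 2]
    return any(
        zip_has_all(index, dimension_types, list(enumerate(pattern)))
        for pattern in PATTERNS_2D
    )


def is_2d_equal(dimension_types):
    # Emulates: dimension_types = [0, 1, 1] OR ... (ordered list equality)
    return list(dimension_types) in PATTERNS_2D


def is_2d_count(dimension_types):
    # Emulates: dimension_types COUNT 0 = 1 AND dimension_types COUNT 1 = 2
    return dimension_types.count(0) == 1 and dimension_types.count(1) == 2


for dims in product((0, 1), repeat=3):
    print(dims, is_2d_zip(dims), is_2d_equal(dims), is_2d_count(dims))
```

Pairing each element with its index is what turns HAS ALL into an ordered, element-by-element comparison.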
I really like the last option here, and it seems to me to be much less invasive than any of your other suggestions.
Thinking more about it, I originally thought dimension_types = [0, 1, 1] was natural. But now it seems to me that it is not the right solution and would be too incompatible with the rest of the filter language. Furthermore, there is then the question of order in the lists, i.e., is [0, 1, 1] the same as [1, 0, 1]? For a Python list, this is not the case, and I believe for a JSON list/array it is the same, but it simply adds an extra aspect to "take care of".
However, I would definitely expect it to be possible to filter for how many of a given quantity there are in a list, i.e., dimension_types COUNT 0 = 1 makes a lot of sense to me. It will solve the problem of distinction without having to add new fields or change the determined value of current ones, and may be useful for other list-valued fields to quickly determine/infer certain aspects of that field.
However, I am not sure how fast this kind of query will be in a backend, since it, e.g., will be going through value content of a database column, and not simply comparing column values - if that makes sense? I am not a database expert, so I don't know if this kind of query is generally (significantly) slower or not?
I like the COUNT option too. But I'd like to see it as an extension of the current LENGTH operator, which essentially queries the total number of elements in a list:
dimension_types LENGTH = 3
I suggest extending it by inserting the single-element condition in between the Property and LENGTH operator, viz:
dimension_types = 1 LENGTH = 2
This says "dimension_types list has exactly two 1s in it".
Performance might be an issue for certain backends, but we already have features that are problematic (at least to us, MySQLers). Thus we may mark the support of this new feature as OPTIONAL for all properties except the dimension_types.
Furthermore, then there's the question of order in the lists, i.e., is [0, 1, 1] the same as [1, 0, 1]. For a Python list, this is not the case, and I believe for a JSON list/array it is the same, but it simply adds an extra aspect to "take care of".
The relevant question is not Python or JSON, but the OPTiMaDe data type called list. I'm quite sure we are clear that it is ordered; to me, that is at least implicit in the name "list" vs "set".
However, I am not sure how fast this kind of query will be in a backend, since it, e.g., will be going through value content of a database column, and not simply comparing column values - if that makes sense? I am not a database expert, so I don't know if this kind of query is generally (significantly) slower or not?
SQL has a COUNT function, mongodb has a db.collection.count(). But, I'm not sure we should make it mandatory. I can imagine weaker database engines.
I'm not enthusiastic about
dimension_types = 1 LENGTH = 2
because the meaning of that construct is not very human-readable, which has been a goal of our language design so far. But it may be fixable with more carefully selected keywords.
Another alternative or complement to COUNT I can think of is an additional list-type operator.
dimension_types HAS DISTINCT 0, 1, 1
Where the distinction between HAS ALL and HAS DISTINCT is that the latter requires separate elements in the list to match the values/inequalities on the right hand side.
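One possible reading of that distinction, sketched in Python for plain equality values only (inequalities on the right hand side would need a proper matching step and are not covered here). The function names are made up, and treating HAS DISTINCT as multiset containment is my assumption, not an agreed definition:

```python
from collections import Counter


def has_all(values, wanted):
    """HAS ALL: every requested value occurs somewhere in the list."""
    return all(w in values for w in wanted)


def has_distinct(values, wanted):
    """HAS DISTINCT (assumed semantics): each requested value must be
    matched by its own, separate element of the list."""
    # Counter subtraction drops non-positive counts, so an empty result
    # means nothing requested was left unmatched.
    return not (Counter(wanted) - Counter(values))


for dims in ([0, 0, 1], [0, 1, 1]):
    print(dims, has_all(dims, [0, 1, 1]), has_distinct(dims, [0, 1, 1]))
```

With this reading, [0, 0, 1] satisfies HAS ALL 0, 1, 1 but not HAS DISTINCT 0, 1, 1, because the single 1 cannot be used twice.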
I agree that dimension_types = 1 LENGTH = 2 is a bit wonky with respect to being human-readable, and I prefer COUNT over HAS DISTINCT, mainly for its simplicity, but also because I think HAS DISTINCT is not completely clear. How would you do inequality on the right hand side? Per value? What if the property on the left hand side is a longer list and the values on the right hand side match a subset of it, including the correct order?
I'm not saying it couldn't work, just that it demands more explaining and seems not as clear as COUNT.
I have mentioned this before, so I will just summarize another concept very briefly and hope that this time it will get more support :)
- All the comparison operations happen element-wise (=, <, >, AND, OR)
- "Functions" can be defined to reduce all the dimensions to scalar values
- Disclaimer: all these changes relate only to arrays, and most of them are optional
list HAS value -> ANY(list = value)
list HAS ALL [value1, value2, ...] -> ANY(list = value1) AND ANY(list = value2) AND ...
list HAS ANY [value1, value2, ...] -> ANY(list = value1) OR ANY(list = value2) OR ...
list LENGTH value -> LENGTH(list) = value
list HAS ONLY [value1, value2, ...] -> ALL(list = value1 OR list = value2 OR ...)
NOT list HAS inverse -> NOT ANY(list = inverse)
I'm not sure that I managed to interpret the zip syntax properly, but I tried to do my best:
list1:list2:... HAS val1:val2:... -> ANY(list1 = val1 AND list2 = val2 AND ...)
list1:list2:... HAS ALL val1:val2:... -> ALL(list1 = val1 AND list2 = val2 AND ...) ???
list1:list2:... HAS ANY val1:val2:... -> ANY(list1 = val1 OR list2 = val2 OR ...)
list1:list2:... HAS ONLY val1:val2:... -> ANY(list1 = val1 AND list2 = val2 AND ...) ???
For example:
matches all entries for which list contains at least one element that is less than three:
list HAS < 3
ANY(list < 3)
matches only those entries for which list simultaneously contains at least one element less than three and one element greater than three:
list HAS ALL < 3, > 3
ANY(list < 3) AND ANY(list > 3)
matches all entries for which list contains at least one element that is between the values 2 and 5:
list:list HAS >=2:<=5
ANY(list >= 2 AND list <= 5)
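In Python terms, the three examples above could be sketched roughly as follows. This is only an illustration of the element-wise comparison plus reducer idea; the function names are made up, and Python's built-in any() stands in for the proposed ANY function.

```python
def has_lt_3(values):
    # list HAS < 3  ->  ANY(list < 3)
    return any(v < 3 for v in values)


def has_all_lt_3_gt_3(values):
    # list HAS ALL < 3, > 3  ->  ANY(list < 3) AND ANY(list > 3)
    return any(v < 3 for v in values) and any(v > 3 for v in values)


def has_between_2_and_5(values):
    # list:list HAS >=2:<=5  ->  ANY(list >= 2 AND list <= 5)
    # Both conditions are applied to the same element.
    return any(v >= 2 and v <= 5 for v in values)


for values in ([1, 7], [3], [6, 8]):
    print(values, has_lt_3(values), has_all_lt_3_gt_3(values), has_between_2_and_5(values))
```

The list [1, 7] shows why the last form matters: it has an element >= 2 and an element <= 5, but no single element that satisfies both.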
And now back to the original issue:
2D only
COUNT(dimension_types = 1) = 2
1D and 2D
COUNT(dimension_types = 1) <= 2
where COUNT (or SUM) is the same type of "dimension reducer" function as ANY, ALL or LENGTH.
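A short Python sketch of COUNT as such a reducer (the names count and periodic_count are made up for illustration, and COUNT itself is only a proposal):

```python
def count(mask):
    """COUNT/SUM reducer: number of true entries in an element-wise result."""
    return sum(1 for hit in mask if hit)


def periodic_count(dimension_types):
    # COUNT(dimension_types = 1)
    return count(flag == 1 for flag in dimension_types)


for dims in ([0, 0, 0], [0, 0, 1], [1, 0, 1], [1, 1, 1]):
    n = periodic_count(dims)
    print(dims, n, n == 2, n <= 2)
```

One thing the sketch makes visible: read literally, the <= 2 condition also accepts [0, 0, 0], so excluding 0D entries would need an additional lower bound such as COUNT(dimension_types = 1) >= 1.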
I'm a little bit biased by Python, but I think most of our intended users are as well. The current definition is very compact (which is very nice), but I would sacrifice a little of that compactness if it made the new definition easier to understand. It is a lot of work, especially implementing the details, but I would be happy to prepare the changes for the grammar specification.
I completely agree that defining a property in another representation is a "good" solution right now, but this concept could be implemented in a future release ...
I think the idea is interesting, and in general I find your suggestion probably a bit more intuitive than the current syntax (though both take some time to grasp anyway). However, I am a bit scared that this change of grammar is not something that can be done in a few days: many clients are implementing the existing grammar, and it would require some testing to make sure the new grammar doesn't introduce different difficulties... With the aim of releasing 1.0 very soon, I don't know what to suggest.
Probably we should check whether this new grammar is "better" in some sense (easier to use or to implement, or more powerful...), and work on it for v2? Ideally it should be a superset, and we should come up with an automatic query converter from the old to the new syntax.
But other opinions are welcome, I am not an expert of grammars!