objectiv-analytics
objectiv-analytics copied to clipboard
Remove non-deterministic ordering from Bach
Currently, quite a few Bach functions produce non-ordered or semi-ordered results. This does not necessarily have to be a problem, but it can cause unexpected behaviour if those results are to be used in consecutive analysis. This especially comes into play if results are to be expected to have the same order between different data stores, which is very often not the case.
@thijs-obj has proposed a solution to this problem:
- Whenever sorting is specified (via
sort_index
,sort_values
), we also sort on the fields not specified explicitely in the sort order in a deterministic way. - When sorting is not specified but it's absolutely needed for useful results, we fail with an exception: window functions,
mode
,drop_duplicates
,fillna
, etc (akin https://github.com/objectiv/objectiv-analytics/pull/1090/files) - When sorting is not specified but it's not directly a problem but might be later (
head
,to_pandas
, etc comes to mind), we sort on the fields in a deterministic way. This might include sorting onindex
first, and on all remaining fields after.
Parameters
We propose to use the above rules unless sort_index
, sort_values
are called with a specific parameter to "not care". Proposal: force_deterministic
defaulting to True
, but overrideable in case the user wants to go YOLO (or needs the least sorting complexity possible)
Documentation We will document the behaviour fully in a seperate documentation page, and everywhere sorting matters, we will refer to that. Related: https://github.com/objectiv/objectiv-analytics/issues/498
- When sorting is not specified but it's absolutely needed for useful results, we fail with an exception: window functions,
mode
,drop_duplicates
,fillna
, etc (akin https://github.com/objectiv/objectiv-analytics/pull/1090/files)
I think we should just force the ordering in that case. I created a small branch to demonstrate what I mean, and implemented this for drop_duplicates there: https://github.com/objectiv/objectiv-analytics/pull/1096/files
Note tho:
- branch is very much incomplete
- tests don't pass. Some of them could actually be a problem
- it's just a demo
Although I guess failing with an error is also fine, maybe even better :thinking:. That forces the user to think about what he is doing.
Hendrik paraphrased:
- Okay
- Error is nice
- Regarding
head
,to_pandas
, etc; they don't have to be deterministic. If there is an index, we should use that to sort on, and whether the rest of the field are use is something that could be controlled with a global flag to enable / disable that but not have it by default (due to possible slowness / complexity).
Tom: we need to figure out added query complexity. See if it's really bad or not. Determinism is nice, speed too.