Add option to preserve input order on table joins
Description
joins sort the resulting table by the join key. This is not documented in the actual API documentation of join here: https://docs.astropy.org/en/stable/api/astropy.table.join.html
and just mentioned in passing here: https://docs.astropy.org/en/stable/table/operations.html#join
This is undesired in many circumstances. There should be an option to preserve the input order, at least for left/right joins.
Additional context
Example:
from astropy.table import Table, join
import numpy as np
import astropy.units as u
rng = np.random.default_rng(0)
N_EVENTS = 10
N_TELESCOPES = 3
events = Table({
'event_id': np.arange(N_EVENTS),
'tel_id': rng.integers(0, 3, N_EVENTS),
})
telescopes = Table({
'tel_id': np.arange(N_TELESCOPES),
'x': rng.uniform(-100, 100, N_TELESCOPES) * u.m,
'y': rng.uniform(-100, 100, N_TELESCOPES) * u.m,
})
events = join(events, telescopes, keys='tel_id', join_type='left')
print(events)
results in:
event_id tel_id x y
m m
-------- ------ ------------------ -----------------
3 0 82.55111545554433 8.724998293084568
4 0 82.55111545554433 8.724998293084568
5 0 82.55111545554433 8.724998293084568
6 0 82.55111545554433 8.724998293084568
7 0 82.55111545554433 8.724998293084568
8 0 82.55111545554433 8.724998293084568
1 1 21.327155153435967 87.01448475755365
2 1 21.327155153435967 87.01448475755365
0 2 45.899312196799684 63.17071082430644
9 2 45.899312196799684 63.17071082430644
Fair enough. As a workaround you can do something like this:
events['__index__'] = np.arange(len(events))
events_tel = join(events, telescopes, keys='tel_id', join_type='left')
events_tel.sort('__index__')
del events['__index__']
If this were implemented, my guess for the most natural ordering is:
inner: left side (you can always swap left/right on input if desired)left: left sideright: right sideouter: no unambiguous ordering is possible so do nothing
Does sound reasonable? Maybe a new keep_order parameter?
If you'd like to take a crack at this, have a look at https://github.com/astropy/astropy/blob/5f8866e41164293e463a3cc53fadb5566ba488cc/astropy/table/operations.py#L395 for an example of putting in a temporary index to capture the sort order.
@taldcroft https://github.com/fact-project/aict-tools/pull/156/files ;)
What is status of this issue? I just faced this problem and I would be really happy to see it implemented.
In the meantime, I urge you to update the API documentation and specifically mention that the original order won't be preserved and the result will be sorted.
@astrobatty Thanks for reviving this issue. I've started working on the implementation of the keep_order argument suggested by @taldcroft. The hope is that the patch, once polished, can land in astropy 6.1
Actually, thinking more about this, I think we can do better than a boolean keep_order argument. Here are my arguments against it:
keep_order=Trueis not 100% clear in the case of the inner join: as mentioned we'd have to hard code a preference for one side, and the only way to reverse the preference from the user perspective would be to flip the first two arguments, which isn't very intuitive.keep_order=Falsemight turn out to be an outwright "lie" in cases where the join keys are already ordered (so there wouldn't be any observable difference withkeep_order=True)- a boolean argument isn't easily extensible without changing its type somehow, so it's harder to change our minds about it later
- as mentioned,
keep_order=Truewould still be ambiguous withjoin_type="outer"
I think a better solution would be to expose a sort_by: str argument that would accept "left", "right" or "keys" (which would be the default value). This would neatly avoid all four problems I just mentioned. I also don't think it would add much complexity with respect to the simpler keep_order argument. In fact it would merely expose the complexity we need in all cases. What do you think @taldcroft ?
@neutrinoceros - that sounds like a good solution.
The one thing I often do is look for precedent in pandas. Although astropy is not even close to API compatible, it can reduce friction to be similar where possible. Pandas appears to have a single bool sort keyword that looks similar to keep_order: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html. In the case of pandas they define the default sort order separately for each join type (defined in the how keyword).
Your idea might be more clear and better in the end, but have a look at pandas and see what you think.
Ok, will do ! I don't think it matters to much at this point as I think I'll implement the bulk of the feature the same way in both cases, it's really just about of how we expose it, so we can discuss the exact API in my future PR !