pandas
pandas copied to clipboard
merge_ordered() should accept left_index and right_index args
Code Sample, a copy-pastable example if possible
INDEX_COLS = ['group', 'timestamp']
left = pandas.DataFrame({
'group': ['a', 'a', 'b', 'b'],
'timestamp': [0, 1, 2, 5],
'obs1': [10, 11, 12, 15]
}).set_index(INDEX_COLS)
right = pandas.DataFrame({
'group': ['a', 'a', 'b', 'b'],
'timestamp': [0, 1, 3, 4],
'obs2': [20, 21, 23, 24]
}).set_index(INDEX_COLS)
pandas.merge_ordered(left, right)
# Traceback (most recent call last):
# File "<stdin>", line 1, in <module>
# File "python3.6/site-packages/pandas/core/reshape/merge.py", line 226, in merge_ordered
# result = _merger(left, right)
# File "python3.6/site-packages/pandas/core/reshape/merge.py", line 212, in _merger
# how=how)
# File "python3.6/site-packages/pandas/core/reshape/merge.py", line 1252, in __init__
# sort=True # factorize sorts
# File "python3.6/site-packages/pandas/core/reshape/merge.py", line 524, in __init__
# self._validate_specification()
# File "python3.6/site-packages/pandas/core/reshape/merge.py", line 1033, in _validate_specification
# lidx=self.left_index, ridx=self.right_index))
# pandas.errors.MergeError: No common columns to perform merge on. Merge options: left_on=None, right_on=None, left_index=False, right_index=False
Problem description
The error message above implies that I should be able to merge these two timeseries DataFrames using left_index=True
and right_index=True
. But these are not arguments to the merge_ordered()
function like they are to other merging functions.
I could work around this by doing the following:
pandas.merge_ordered(left, right, left_on=INDEX_COLS, right_on=INDEX_COLS).set_index(INDEX_COLS)
but that seems unnecessarily verbose, and possibly less performant than if I could do
pandas.merge_ordered(left, right, left_index=True, right_index=True)
Proposal
It looks like merge_ordered()
uses _OrderedMerge
under the hood which can take the left_index
and right_index
arguments (and the code example above works as expected if those arguments are passed through). Would you consider adding these arguments? I can create a PR if there's appetite.
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None python: 3.6.8.final.0 python-bits: 64 OS: Darwin OS-release: 17.7.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8
pandas: 0.24.2 pytest: 4.2.0 pip: 19.0.1 setuptools: 40.7.3 Cython: None numpy: 1.16.2 scipy: 1.2.1 pyarrow: None xarray: None IPython: 7.2.0 sphinx: 1.8.4 patsy: None dateutil: 2.8.0 pytz: 2018.9 blosc: None bottleneck: None tables: None numexpr: None feather: None matplotlib: None openpyxl: None xlrd: None xlwt: None xlsxwriter: None lxml.etree: None bs4: None html5lib: None sqlalchemy: None pymysql: None psycopg2: None jinja2: 2.10 s3fs: None fastparquet: None pandas_gbq: None pandas_datareader: None gcsfs: None
Hey - I know that I'm resurrecting a pretty old issue here, but I was wondering whether this is something that might be addressed at some point? I had originally assumed that the default interaction with pd.merge
would be the same as pd.merge_ordered
but I'm finding that index merging support doesn't seem to be in parity with the more widely-used counterpart. Very easy to work around so it's not pressing, but figured that I would bump this issue to resurface it.
I would find this useful too. One of the issues I am having is that many of the ways to do this kind of thing create intermediate states with missing values (NaN) which results in int columns being converted to float as part of the process. I am currently using concat with sort_index, which is very inefficient if both the input dataframes are sorted. This look like it could work for my use case if it could use the index directly.