Problem merging tables with overlapping broadcast relationships
I'm having trouble merging sets of tables with overlapping broadcast relationships.
For example, these combinations run:
- broadcasts from A -> B and A -> C, merge tables A, B, C
- broadcasts from A -> B and B -> C, merge tables A, B, C
But this combination raises an error:
- broadcasts from A -> B, A -> C, B -> C, merge tables A, B, C
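One way to see what sets the failing combination apart is to treat each broadcast as a directed edge and count the routes by which a table can reach the merge target. This is only a rough sketch in plain Python: `count_routes` is a hypothetical helper, not part of orca, and it assumes the broadcasts contain no cycles. It says nothing about how orca resolves merges internally.

```python
# Hypothetical helper, not part of orca: count the routes by which each
# table can reach the merge target through the registered broadcasts.
def count_routes(broadcasts, target):
    """broadcasts: iterable of (cast, onto) pairs, assumed to be acyclic."""
    onto_map = {}
    for cast, onto in broadcasts:
        onto_map.setdefault(cast, []).append(onto)

    def routes(table):
        # The target itself counts as one completed route.
        if table == target:
            return 1
        return sum(routes(nxt) for nxt in onto_map.get(table, []))

    tables = {t for pair in broadcasts for t in pair} - {target}
    return {t: routes(t) for t in sorted(tables)}


chain = [("a", "b"), ("b", "c")]
overlap = [("a", "b"), ("a", "c"), ("b", "c")]

print(count_routes(chain, "c"))    # {'a': 1, 'b': 1}
print(count_routes(overlap, "c"))  # {'a': 2, 'b': 1}
```

In the chain case every table has exactly one route to C. With the extra A -> C broadcast, A has two routes (directly, and through B), which is the overlap described above.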
This came up in real-world use (https://github.com/ual/urbansim_parcel_bayarea/issues/11), but here's a stand-alone demonstration that you can paste into a Python script:
import orca
import pandas as pd
a = pd.DataFrame({'ix': [1, 2], 'val_a': ['a1', 'a2']})
b = pd.DataFrame({'ix': [1, 2], 'val_b': ['b1', 'b2'], 'a': [1, 2]})
c = pd.DataFrame({'ix': [1, 2], 'val_c': ['c1', 'c2'], 'a': [1, 2], 'b': [1, 2]})
orca.add_table('a', a.set_index('ix'))
orca.add_table('b', b.set_index('ix'))
orca.add_table('c', c.set_index('ix'))
orca.broadcast(cast='a', onto='b', cast_index=True, onto_on='a')
orca.broadcast(cast='b', onto='c', cast_index=True, onto_on='b')
df = orca.merge_tables(target='c', tables=['c', 'b', 'a'])  # runs: only A -> B and B -> C are registered
orca.broadcast(cast='a', onto='c', cast_index=True, onto_on='a')
df = orca.merge_tables(target='c', tables=['c', 'b', 'a']) # throws error
Here is the error:
Twin-Clouds-iMac:Desktop maurer$ python test.py
Traceback (most recent call last):
  File "test.py", line 19, in <module>
    df = orca.merge_tables(target='c', tables=['c', 'b', 'a']) # throws error
  File "/Users/maurer/Dropbox/Git-imac/udst/orca/orca/orca.py", line 1799, in merge_tables
    cast_table = frames[cast]
KeyError: 'a'
This is a bug, right? I can see how the merge is potentially ambiguous, since A can reach C either directly or through B, but if we resolve it in a consistent way it seems like a supportable use case. Overlapping broadcasts are helpful if you want to do different merge combinations at different times with maximum efficiency.
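To make "resolve it in a consistent way" concrete, here is one possible rule, sketched in plain Python: walk outward from the target and let each cast table use the first (shortest) route that reaches it, breaking ties alphabetically. The function `plan_merges` and the rule itself are assumptions for illustration only. This is not how orca works today, and other rules would be equally defensible.

```python
from collections import deque


# Hypothetical helper, not orca's API: choose one broadcast per cast table
# so that every table reaches the target by exactly one route.
def plan_merges(broadcasts, target):
    # Index the registered broadcasts by the table they land on.
    casts_onto = {}
    for cast, onto in broadcasts:
        casts_onto.setdefault(onto, []).append(cast)

    plan = []          # chosen (cast, onto) steps, nearest to the target first
    seen = {target}
    queue = deque([target])
    while queue:
        onto = queue.popleft()
        # Sorting makes the tie-breaking deterministic.
        for cast in sorted(casts_onto.get(onto, [])):
            if cast not in seen:   # the first (shortest) route wins
                seen.add(cast)
                plan.append((cast, onto))
                queue.append(cast)
    return plan


overlap = [("a", "b"), ("a", "c"), ("b", "c")]
chain = [("a", "b"), ("b", "c")]

print(plan_merges(overlap, "c"))  # [('a', 'c'), ('b', 'c')]
print(plan_merges(chain, "c"))    # [('b', 'c'), ('a', 'b')]
```

Under this rule the overlapping case merges A straight onto C and ignores the A -> B broadcast, while the plain chain is unchanged.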
I don't see an obvious source for the error, but will dig into it more when I have a chance.
I'm running Orca 1.5.1 and pandas 0.22.0.