Disdat loses index when returning DataFrame
When pipe_run returns a DataFrame whose index is not the default (0, 1, 2, ...), the original index is lost by the time the DataFrame is loaded into memory for the next task or accessed through the API. The code below reproduces the issue:
import disdat.api as dsdt
from disdat.pipe import PipeTask
import pandas as pd


class TestIndex(PipeTask):
    def pipe_requires(self, pipeline_input=None):
        self.set_bundle_name('test_index')

    def pipe_run(self, pipeline_input=None):
        data = {
            'a': [2, 3, 4],
            'b': [5, 6, 7]
        }
        df = pd.DataFrame(data)
        print('Index should be 0, 1, 2')
        print(df)
        # Replace the default index with a custom one before returning.
        df.index = [7, 8, 9]
        print('Index should be 7, 8, 9')
        print(df)
        return df


if __name__ == '__main__':
    dsdt.apply('tt', '-', '-', 'TestIndex', params={}, force=True)
    # Expected [7 8 9]; the DataFrame loaded back from disdat shows [0 1 2].
    print('Index: {}'.format(dsdt.search('tt', 'test_index')[0].data.index.values))
This code correctly updates and prints the index before pipe_run returns. However, once the DataFrame is retrieved through disdat, the index has reverted to the default (0, 1, 2).
Output:
2018-12-18 15:24:25,151 - luigi-interface - INFO - Loaded []
2018-12-18 15:24:25,175 - luigi-interface - INFO - Informed scheduler that task DriverTask_True______a92c94fa32 has status PENDING
2018-12-18 15:24:25,175 - luigi-interface - INFO - Informed scheduler that task TestIndex_None_None____71e4869f25 has status PENDING
2018-12-18 15:24:25,175 - luigi-interface - INFO - Done scheduling tasks
2018-12-18 15:24:25,175 - luigi-interface - INFO - Running Worker with 1 processes
2018-12-18 15:24:25,176 - luigi-interface - INFO - [pid 53226] Worker Worker(salt=715866385, workers=1, host=sdgl141c3d83b, username=srowan, pid=53226) running TestIndex(closure_bundle_proc_name_root=None, closure_bundle_uuid_root=None, output_tags={})
Index should be 0, 1, 2
   a  b
0  2  5
1  3  6
2  4  7
Index should be 7, 8, 9
   a  b
7  2  5
8  3  6
9  4  7
2018-12-18 15:24:25,211 - disdat.pipe_base - WARNING - __main__.TestIndex: Source file /Users/srowan/Development/ds/turbotiles/turbotiles/pipeline/test_index.py not under git version control
2018-12-18 15:24:25,237 - luigi-interface - INFO - [pid 53226] Worker Worker(salt=715866385, workers=1, host=sdgl141c3d83b, username=srowan, pid=53226) done TestIndex(closure_bundle_proc_name_root=None, closure_bundle_uuid_root=None, output_tags={})
2018-12-18 15:24:25,238 - luigi-interface - INFO - Informed scheduler that task TestIndex_None_None____71e4869f25 has status DONE
2018-12-18 15:24:25,239 - luigi-interface - INFO - [pid 53226] Worker Worker(salt=715866385, workers=1, host=sdgl141c3d83b, username=srowan, pid=53226) running DriverTask(input_bundle=-, output_bundle=-, pipe_params={}, pipe_cls=TestIndex, input_tags={}, output_tags={}, force=True)
2018-12-18 15:24:25,240 - luigi-interface - INFO - [pid 53226] Worker Worker(salt=715866385, workers=1, host=sdgl141c3d83b, username=srowan, pid=53226) done DriverTask(input_bundle=-, output_bundle=-, pipe_params={}, pipe_cls=TestIndex, input_tags={}, output_tags={}, force=True)
2018-12-18 15:24:25,240 - luigi-interface - INFO - Informed scheduler that task DriverTask_True______a92c94fa32 has status DONE
2018-12-18 15:24:25,245 - luigi-interface - INFO - Worker Worker(salt=715866385, workers=1, host=sdgl141c3d83b, username=srowan, pid=53226) was stopped. Shutting down Keep-Alive thread
2018-12-18 15:24:25,248 - luigi-interface - INFO -
===== Luigi Execution Summary =====
Scheduled 2 tasks of which:
* 2 ran successfully:
- 1 DriverTask(...)
- 1 TestIndex(...)
This progress looks :) because there were no failed tasks or missing external dependencies
===== Luigi Execution Summary =====
Index: [0 1 2]
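Until the index is preserved, one possible workaround is to move the index into an ordinary column before returning the DataFrame and to restore it after loading. The sketch below uses only pandas and has not been verified against disdat itself. The helper names index_to_column and column_to_index and the column name original_index are hypothetical, and the reset_index(drop=True) step only simulates a round trip that discards the index.

```python
import pandas as pd

# Hypothetical column name used to carry the index through storage.
INDEX_COLUMN = 'original_index'


def index_to_column(df, name=INDEX_COLUMN):
    """Return a new DataFrame with the index stored as a regular column."""
    return df.rename_axis(name).reset_index()


def column_to_index(df, name=INDEX_COLUMN):
    """Return a new DataFrame with the stored column restored as the index."""
    return df.set_index(name).rename_axis(None)


df = pd.DataFrame({'a': [2, 3, 4], 'b': [5, 6, 7]}, index=[7, 8, 9])

# What pipe_run would return instead of df.
flat = index_to_column(df)

# Simulate a store/load round trip that discards the index (assumption).
stored = flat.reset_index(drop=True)

# What the downstream task would do after loading the DataFrame.
restored = column_to_index(stored)
print(restored.index.values)  # [7 8 9]
```

The cost of this approach is that every consumer of the bundle has to know about the extra column, so it is a stopgap rather than a fix.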