
Disdat loses index when returning DataFrame

Open seanr15 opened this issue 6 years ago • 0 comments

When a DataFrame with a non-default index (anything other than 0, 1, 2, ...) is returned from pipe_run, the original index is lost: it is not present when the DataFrame is loaded into memory for the next task or accessed through the API. The issue can be reproduced with the code below:

import disdat.api as dsdt
from disdat.pipe import PipeTask
import pandas as pd


class TestIndex(PipeTask):
    def pipe_requires(self, pipeline_input=None):
        self.set_bundle_name('test_index')

    def pipe_run(self, pipeline_input=None):

        data = {
            'a': [2, 3, 4],
            'b': [5, 6, 7]
        }

        df = pd.DataFrame(data)
        print('Index should be 0, 1, 2')
        print(df)

        # Replace the default index with a custom one
        df.index = [7, 8, 9]
        print('Index should be 7, 8, 9')
        print(df)

        return df

if __name__ == '__main__':
    dsdt.apply('tt', '-', '-', 'TestIndex', params={}, force=True)
    # Load the bundle back through the API and inspect the index
    print('Index: {}'.format(dsdt.search('tt', 'test_index')[0].data.index.values))

This code correctly updates and prints the index before returning. However, once the DataFrame is retrieved from disdat, the index has reverted to the default (0, 1, 2).

Output:

2018-12-18 15:24:25,151 - luigi-interface - INFO - Loaded []
2018-12-18 15:24:25,175 - luigi-interface - INFO - Informed scheduler that task   DriverTask_True______a92c94fa32   has status   PENDING
2018-12-18 15:24:25,175 - luigi-interface - INFO - Informed scheduler that task   TestIndex_None_None____71e4869f25   has status   PENDING
2018-12-18 15:24:25,175 - luigi-interface - INFO - Done scheduling tasks
2018-12-18 15:24:25,175 - luigi-interface - INFO - Running Worker with 1 processes
2018-12-18 15:24:25,176 - luigi-interface - INFO - [pid 53226] Worker Worker(salt=715866385, workers=1, host=sdgl141c3d83b, username=srowan, pid=53226) running   TestIndex(closure_bundle_proc_name_root=None, closure_bundle_uuid_root=None, output_tags={})
Index should be 0, 1, 2
   a  b
0  2  5
1  3  6
2  4  7
Index should be 7, 8, 9
   a  b
7  2  5
8  3  6
9  4  7
2018-12-18 15:24:25,211 - disdat.pipe_base - WARNING - __main__.TestIndex: Source file /Users/srowan/Development/ds/turbotiles/turbotiles/pipeline/test_index.py not under git version control
2018-12-18 15:24:25,237 - luigi-interface - INFO - [pid 53226] Worker Worker(salt=715866385, workers=1, host=sdgl141c3d83b, username=srowan, pid=53226) done      TestIndex(closure_bundle_proc_name_root=None, closure_bundle_uuid_root=None, output_tags={})
2018-12-18 15:24:25,238 - luigi-interface - INFO - Informed scheduler that task   TestIndex_None_None____71e4869f25   has status   DONE
2018-12-18 15:24:25,239 - luigi-interface - INFO - [pid 53226] Worker Worker(salt=715866385, workers=1, host=sdgl141c3d83b, username=srowan, pid=53226) running   DriverTask(input_bundle=-, output_bundle=-, pipe_params={}, pipe_cls=TestIndex, input_tags={}, output_tags={}, force=True)
2018-12-18 15:24:25,240 - luigi-interface - INFO - [pid 53226] Worker Worker(salt=715866385, workers=1, host=sdgl141c3d83b, username=srowan, pid=53226) done      DriverTask(input_bundle=-, output_bundle=-, pipe_params={}, pipe_cls=TestIndex, input_tags={}, output_tags={}, force=True)
2018-12-18 15:24:25,240 - luigi-interface - INFO - Informed scheduler that task   DriverTask_True______a92c94fa32   has status   DONE
2018-12-18 15:24:25,245 - luigi-interface - INFO - Worker Worker(salt=715866385, workers=1, host=sdgl141c3d83b, username=srowan, pid=53226) was stopped. Shutting down Keep-Alive thread
2018-12-18 15:24:25,248 - luigi-interface - INFO - 
===== Luigi Execution Summary =====

Scheduled 2 tasks of which:
* 2 ran successfully:
    - 1 DriverTask(...)
    - 1 TestIndex(...)

This progress looks :) because there were no failed tasks or missing external dependencies

===== Luigi Execution Summary =====

Index: [0 1 2]
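
A possible workaround, sketched under the assumption that the index is simply dropped when the DataFrame is persisted (the actual cause inside disdat has not been confirmed here): move the index into a regular column before returning from pipe_run, and restore it after loading. The helpers pack_index, unpack_index, and roundtrip_without_index are hypothetical names used only for illustration. The round-trip function is a pandas-only stand-in for a storage layer that discards the index; it is not disdat's code.

```python
import io

import pandas as pd


def roundtrip_without_index(df):
    """Stand-in for storage that drops the index (an assumption, not disdat's code)."""
    buf = io.StringIO()
    df.to_csv(buf, index=False)
    buf.seek(0)
    return pd.read_csv(buf)


def pack_index(df, name="orig_index"):
    """Move the index into a regular column so it survives the round trip."""
    return df.rename_axis(name).reset_index()


def unpack_index(df, name="orig_index"):
    """Restore the saved column as the index and clear the index name."""
    return df.set_index(name).rename_axis(None)


df = pd.DataFrame({'a': [2, 3, 4], 'b': [5, 6, 7]}, index=[7, 8, 9])

# Without the workaround, the custom index is replaced by the default one.
lost = roundtrip_without_index(df)

# With the workaround, the index travels as a column and is restored afterwards.
restored = unpack_index(roundtrip_without_index(pack_index(df)))

print(lost.index.values)
print(restored.index.values)
```

In a pipeline, pack_index would be applied just before `return df` in pipe_run, and unpack_index by whichever task or API caller loads the DataFrame. This costs one extra column in the stored data and requires every consumer to know the column name.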

seanr15, Dec 18 '18 23:12