
Ideas for future code optimization

Open naterush opened this issue 2 years ago • 4 comments

  1. Start with a pass that tries to combine all the "like" steps together, so we get all the renames in one place.

  2. A right_combine, which is useful for a DeleteDataframe step: it loops backwards and checks for right combines. We can then easily handle every case in one location on this step, which is really intuitive. I wonder about this though...

  3. Make required params and execution data to the CodeChunks - and also add a code coverage report.

  4. How to do out of order things?

  5. How do we go about fuzz testing this?
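A rough sketch of what the backwards pass in item 2 could look like. This is not Mito's actual implementation: the `Step` class, the `kind` strings, and `optimize_deleted_dataframes` are hypothetical names, and it assumes each step touches exactly one sheet and that sheet indexes stay stable after a delete.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Step:
    # Hypothetical stand-in for a code chunk; not the real mitosheet type.
    kind: str          # e.g. "add_column", "rename_column", "delete_dataframe"
    sheet_index: int   # assumed stable, and assumed to be the only sheet touched


def optimize_deleted_dataframes(steps: List[Step]) -> List[Step]:
    """Walk the steps backwards and drop any step that a later
    delete_dataframe on the same sheet makes pointless."""
    kept_reversed: List[Step] = []
    deleted_sheets = set()
    for step in reversed(steps):
        if step.kind == "delete_dataframe":
            deleted_sheets.add(step.sheet_index)
            kept_reversed.append(step)
        elif step.sheet_index in deleted_sheets:
            # This step is "right combined" into the later delete: skip it.
            continue
        else:
            kept_reversed.append(step)
    kept_reversed.reverse()
    return kept_reversed


steps = [
    Step("add_column", 0),
    Step("rename_column", 0),
    Step("add_column", 1),
    Step("delete_dataframe", 0),
]
optimized = optimize_deleted_dataframes(steps)
```

Here only the edit to sheet 1 and the delete itself survive. Handling steps that read one sheet and write another (merges, pivots) would need the dependency information discussed further down the thread.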

naterush avatar Apr 06 '22 19:04 naterush

There are three main areas of code optimization that we discuss:

  1. Removing unneeded code. This just requires using persistent IDs for things (dataframes, columns), and having a way of detecting dependencies between steps.
  2. Combining different steps together. For example, combine an Add Column + Rename into a single step.
  3. Separating code generated while exploring the dataset from code that the user actually wants to preserve.
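To make item 2 concrete, here is a minimal sketch of folding a rename into the add column step right before it. `AddColumn`, `RenameColumn`, and `combine_adjacent` are made-up names for illustration, and the sketch only looks at directly adjacent steps.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class AddColumn:
    # Hypothetical step type, not the real mitosheet class.
    sheet_index: int
    header: str


@dataclass(frozen=True)
class RenameColumn:
    # Hypothetical step type, not the real mitosheet class.
    sheet_index: int
    old_header: str
    new_header: str


StepType = Union[AddColumn, RenameColumn]


def combine_adjacent(steps: List[StepType]) -> List[StepType]:
    """Fold a rename into the add column step directly before it,
    when both refer to the same column on the same sheet."""
    out: List[StepType] = []
    for step in steps:
        prev = out[-1] if out else None
        if (
            isinstance(step, RenameColumn)
            and isinstance(prev, AddColumn)
            and prev.sheet_index == step.sheet_index
            and prev.header == step.old_header
        ):
            # Replace the add with one that uses the final header.
            out[-1] = AddColumn(prev.sheet_index, step.new_header)
        else:
            out.append(step)
    return out


combined = combine_adjacent([
    AddColumn(0, "Test"),
    RenameColumn(0, "Test", "Total"),
    RenameColumn(0, "Total", "Sum"),
])
```

Because the combined step is written back into the output list, a chain of renames collapses into a single add column with the final header.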

We could probably take a pretty similar backend approach to 1 and 3. We generally think of removing unnecessary code as deleting code that edits variables that the user ended up deleting, i.e., if you create a pivot table and then delete the pivot table. Let's say we accomplish this via a function called delete_variable(sheet_index, column_id=None). When only a sheet index is provided, the function deletes the entire sheet and all previous steps that were used to create that sheet. When a sheet index and a column_id are provided, the function deletes the column_id and the steps that created/edited that column_id.
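A minimal sketch of delete_variable as described above, operating on a plain list of steps. The `Step` class and the extra `steps` argument are assumptions for illustration; a real version would also need the persistent IDs and dependency detection mentioned in item 1.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Step:
    # Hypothetical record of what a step touched; not the real mitosheet type.
    name: str
    sheet_index: int
    column_id: Optional[str] = None  # None means the step affects the whole sheet


def delete_variable(
    steps: List[Step], sheet_index: int, column_id: Optional[str] = None
) -> List[Step]:
    """Return the steps that remain after deleting a sheet or a column."""
    kept: List[Step] = []
    for step in steps:
        if step.sheet_index != sheet_index:
            kept.append(step)      # a different sheet is untouched
        elif column_id is None:
            continue               # whole sheet deleted: drop every step on it
        elif step.column_id == column_id:
            continue               # drop steps that created/edited this column
        else:
            kept.append(step)
    return kept


steps = [
    Step("pivot", 1),
    Step("add_column", 1, "C"),
    Step("rename_column", 0, "A"),
    Step("set_formula", 1, "C"),
]
without_sheet = delete_variable(steps, 1)
without_column = delete_variable(steps, 1, "C")
```

Deleting sheet 1 leaves only the rename on sheet 0, while deleting column C keeps the pivot that created the sheet.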

Number 3 is just the inverse of that: only keep code that is needed to create some variable. Setting aside the frontend and how we know which variables the user wants to keep, the backend is pretty similar.

  • Let's say that there is some set of steps that the user wants to keep, K.
  • Build a dependency graph and then for any (sheet_index) or (sheet_index, column_id) from a previous step that is not part of the dependency graph, call the delete_variable function on it.

In order to do this properly, we need to know the affected output of each step. For example, a filter affects the entire sheet_index even though it's only applied to one column.
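One way this could look, as a sketch only: each step declares what it reads and what it writes, where a variable is either (sheet_index,) or (sheet_index, column_id), and we walk backwards from the variables the user wants to keep. The `Step` tuple, `overlaps`, and `steps_to_keep` are hypothetical names, and the reads/writes for each step are assumed to be known.

```python
from typing import FrozenSet, List, NamedTuple, Set, Tuple

# A variable is (sheet_index,) for a whole sheet or (sheet_index, column_id).
Variable = Tuple


class Step(NamedTuple):
    # Hypothetical step record; not the real mitosheet type.
    name: str
    reads: FrozenSet[Variable]
    writes: FrozenSet[Variable]


def overlaps(a: Variable, b: Variable) -> bool:
    """Same sheet, and either one is the whole sheet or both are the same column."""
    if a[0] != b[0]:
        return False
    return len(a) == 1 or len(b) == 1 or a[1] == b[1]


def steps_to_keep(steps: List[Step], keep: Set[Variable]) -> List[Step]:
    """Walk backwards, keeping each step that writes something still needed."""
    needed = set(keep)
    kept: List[Step] = []
    for step in reversed(steps):
        if any(overlaps(w, n) for w in step.writes for n in needed):
            kept.append(step)
            needed |= step.reads
    kept.reverse()
    return kept


steps = [
    Step("add_column_B", frozenset({(0, "A")}), frozenset({(0, "B")})),
    # A filter on column A rewrites every row, so it writes the whole sheet.
    Step("filter_A", frozenset({(0, "A")}), frozenset({(0,)})),
    Step("pivot_to_sheet_1", frozenset({(0,)}), frozenset({(1,)})),
    Step("add_column_C", frozenset({(0, "A")}), frozenset({(0, "C")})),
]
for_pivot = [s.name for s in steps_to_keep(steps, {(1,)})]
for_column_c = [s.name for s in steps_to_keep(steps, {(0, "C")})]
```

Keeping only column C still retains the filter, because the filter's affected output is the whole sheet, while the unrelated column B and the pivot drop out.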

aarondr77 avatar Apr 08 '22 17:04 aarondr77

That being said, if we do 1 and 2 then it might heavily reduce the pain point that 3 resolves anyway, as users would be able to clean up their generated code and get rid of exploration steps just by deleting the exploration work they no longer want!

aarondr77 avatar Apr 08 '22 17:04 aarondr77

Tots agree. I also think there's probably a 1 day amount of work that doesn't involve sheet ids that would do a lot (think like 70%) of the sheet deleting stuff - and I think we should do it next week and see how we feel about the pain point after!

naterush avatar Apr 08 '22 20:04 naterush

This test passes (and it shouldn't):

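import pandas as pd
# Assumed import path for the test helper; adjust to wherever create_mito_wrapper lives.
from mitosheet.tests.test_utils import create_mito_wrapper

def test_edit_merge_optimizes_after_other_edit():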
    df1 = pd.DataFrame({'A': [2], 'B': [2]})
    df2 = pd.DataFrame({'A': [1, 2], 'B': [1, 2]})
    mito = create_mito_wrapper(df1, df2)
    mito.merge_sheets('inner', 0, 1, [['A', 'A']], ['A', 'B'], ['A', 'B'])
    mito.add_column(0, 'Test')
    mito.merge_sheets('inner', 0, 1, [['A', 'A']], ['A', 'B'], ['A'], destination_sheet_index=2)

    assert mito.transpiled_code == [
        'from mitosheet.public.v3 import *',
        '',
        "df_merge = df1.merge(df2, left_on=['A'], right_on=['A'], how='inner', "
        "suffixes=['_df1', '_df2'])",
        '',
        "df1.insert(2, 'Test', 0)",
        '',
        "df1_tmp = df1.drop(['Test'], axis=1)",
        "df2_tmp = df2.drop(['B'], axis=1)",
        "df_merge = df1_tmp.merge(df2_tmp, left_on=['A'], right_on=['A'], "
        "how='inner', suffixes=['_df1', '_df2'])",
        '',
    ]

marthacryan avatar Dec 15 '23 21:12 marthacryan