autonormalize icon indicating copy to clipboard operation
autonormalize copied to clipboard

Variable types #10

Open j-grover opened this issue 4 years ago • 12 comments

  • Adding variable types as parameters to auto_entityset, make_entityset
  • Test in tests/test_normalize
  • Updated README
  • Resolves #10

j-grover avatar Oct 18 '19 00:10 j-grover

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

CLAassistant avatar Oct 18 '19 00:10 CLAassistant

Generally I think this PR looks pretty good, but I have a question about how we want this to behave for the ID columns. If you look at the original entityset (entityset) generated in test_variable_types before normalizing, the customer_id is of type Numeric. However, if you look at this same column of the session_id entity in the normalized entityset, this same customer_id column now is now of type Id.

@rwedge Is this behavior acceptable? I know we will have some differences in variable types where a column in the original non-normalized entityset gets set as an index in a normalized entity, but wasn't sure how we wanted non-index columns to be treated throughout.

We probably should also update the test to test all of the columns we expect to have the same variable types throughout instead of just the single zip_code column.

thehomebrewnerd avatar Mar 12 '20 14:03 thehomebrewnerd

@thehomebrewnerd changing the variable we normalized on to Id type is standard featuretools behavior. I think the logic is that we now know this variable can be treated as a foreign key, so we label it as a special categorical type, Id.

I agree the test should test all of the columns, and I think we should test that make_entityset and auto_entityset use variable_types as expected

rwedge avatar Mar 13 '20 16:03 rwedge

@j-grover sorry for the delay on the review, are you interested in updating the tests?

rwedge avatar Mar 13 '20 18:03 rwedge

@rwedge Updated test to check all columns

j-grover avatar Mar 15 '20 05:03 j-grover

@j-grover Thanks for the quick response and updates to the test. Since we have now modified the parameters for the auto_entityset and make_entityset to add the optional variable_types parameter, it would also be good to add specific tests for these two functions to make sure they behave properly and return the expected result when using this optional parameter.

Is that something you would be able to do as well?

thehomebrewnerd avatar Mar 16 '20 15:03 thehomebrewnerd

@j-grover Thanks for the quick response and updates to the test. Since we have now modified the parameters for the auto_entityset and make_entityset to add the optional variable_types parameter, it would also be good to add specific tests for these two functions to make sure they behave properly and return the expected result when using this optional parameter.

Is that something you would be able to do as well?

What sort of things are we looking to test for these two methods. For example one case with default values and one with custom args?

j-grover avatar Mar 22 '20 04:03 j-grover

@j-grover Thanks for the quick response and updates to the test. Since we have now modified the parameters for the auto_entityset and make_entityset to add the optional variable_types parameter, it would also be good to add specific tests for these two functions to make sure they behave properly and return the expected result when using this optional parameter. Is that something you would be able to do as well?

What sort of things are we looking to test for these two methods. For example one case with default values and one with custom args?

Yes, that is what I was thinking...tests very similar to the test you added for normalize_entity, just to make sure those methods return the expected results when passing a variable_types parameter. Since these methods are part of the public API, it would be good for test coverage to have specific tests that cover their possible use cases as well.

thehomebrewnerd avatar Mar 23 '20 12:03 thehomebrewnerd

@thehomebrewnerd I've added the initial tests for make_entityset and auto_entityset. There is overlap between the tests as auto_entityset makes use of make_entityset internally. When testing auto_entityset, I've found that for entity 0 (refer to line 330 in test_normalize.py) the name changes between 'jersey_num_team' and 'team_jersey_num' for different runs. This causes the following to fail occasionally: assert normalized_entityset.entities[0].variable_types['jersey_num_team'] == Index How would you go about testing this?

j-grover avatar Apr 13 '20 04:04 j-grover

@thehomebrewnerd I've added the initial tests for make_entityset and auto_entityset. There is overlap between the tests as auto_entityset makes use of make_entityset internally. When testing auto_entityset, I've found that for entity 0 (refer to line 330 in test_normalize.py) the name changes between 'jersey_num_team' and 'team_jersey_num' for different runs. This causes the following to fail occasionally: assert normalized_entityset.entities[0].variable_types['jersey_num_team'] == Index How would you go about testing this?

@j-grover Thanks for creating these tests. If we expect this behavior - where the name is not deterministic - we could set a variable for the name that is actually returned and then use that variable in the tests. One way that comes to mind would be to check if team_jersey_num is in the variable_types dictionary keys and if it is not we would use the other option of jersey_num_team. Something like this:

# Index name is not always the same - checking what was returned
index_vname = "jersey_num_team"
if index_vname not in normalized_entityset.entities[0].variable_types.keys():
    index_vname = "team_jersey_num"
assert normalized_entityset.entities[0].variable_types[index_vname] == Index

I think you would also need to rename one of the dataframe columns for the df.equals test to pass reliably as well.

I don't know enough about the details of this code to know if expect to get the same name back every time, but I can look into that a bit more in the meantime to make sure this isn't highlighting some other issue.

thehomebrewnerd avatar Apr 13 '20 13:04 thehomebrewnerd

@j-grover After reviewing the code with @rwedge , the non-deterministic nature of column names is to be expected as the new names are created by joining an unsorted list of column names. Issue #24 was created to fix this problem, so for this PR I suggest we go ahead and implement the tests as we have described above, and then we can update later after issue #24 is closed.

thehomebrewnerd avatar Apr 17 '20 18:04 thehomebrewnerd

@j-grover After reviewing the code with @rwedge , the non-deterministic nature of column names is to be expected as the new names are created by joining an unsorted list of column names. Issue #24 was created to fix this problem, so for this PR I suggest we go ahead and implement the tests as we have described above, and then we can update later after issue #24 is closed.

Sorting the list sounds good. I have change the index names accordingly to jersey_num_team. @rwedge I've added a pytest fixture for that particular example

j-grover avatar Apr 18 '20 05:04 j-grover