autonormalize
autonormalize copied to clipboard
Variable types #10
- Adding variable types as parameters to auto_entityset, make_entityset
- Test in tests/test_normalize
- Updated README
- Resolves #10
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.
Generally I think this PR looks pretty good, but I have a question about how we want this to behave for the ID columns. If you look at the original entityset (entityset
) generated in test_variable_types
before normalizing, the customer_id
is of type Numeric
. However, if you look at this same column of the session_id
entity in the normalized entityset, this same customer_id
column now is now of type Id
.
@rwedge Is this behavior acceptable? I know we will have some differences in variable types where a column in the original non-normalized entityset gets set as an index in a normalized entity, but wasn't sure how we wanted non-index columns to be treated throughout.
We probably should also update the test to test all of the columns we expect to have the same variable types throughout instead of just the single zip_code
column.
@thehomebrewnerd changing the variable we normalized on to Id
type is standard featuretools behavior. I think the logic is that we now know this variable can be treated as a foreign key, so we label it as a special categorical type, Id
.
I agree the test should test all of the columns, and I think we should test that make_entityset
and auto_entityset
use variable_types
as expected
@j-grover sorry for the delay on the review, are you interested in updating the tests?
@rwedge Updated test to check all columns
@j-grover Thanks for the quick response and updates to the test. Since we have now modified the parameters for the auto_entityset
and make_entityset
to add the optional variable_types
parameter, it would also be good to add specific tests for these two functions to make sure they behave properly and return the expected result when using this optional parameter.
Is that something you would be able to do as well?
@j-grover Thanks for the quick response and updates to the test. Since we have now modified the parameters for the
auto_entityset
andmake_entityset
to add the optionalvariable_types
parameter, it would also be good to add specific tests for these two functions to make sure they behave properly and return the expected result when using this optional parameter.Is that something you would be able to do as well?
What sort of things are we looking to test for these two methods. For example one case with default values and one with custom args?
@j-grover Thanks for the quick response and updates to the test. Since we have now modified the parameters for the
auto_entityset
andmake_entityset
to add the optionalvariable_types
parameter, it would also be good to add specific tests for these two functions to make sure they behave properly and return the expected result when using this optional parameter. Is that something you would be able to do as well?What sort of things are we looking to test for these two methods. For example one case with default values and one with custom args?
Yes, that is what I was thinking...tests very similar to the test you added for normalize_entity
, just to make sure those methods return the expected results when passing a variable_types
parameter. Since these methods are part of the public API, it would be good for test coverage to have specific tests that cover their possible use cases as well.
@thehomebrewnerd
I've added the initial tests for make_entityset
and auto_entityset
. There is overlap between the tests as auto_entityset
makes use of make_entityset
internally. When testing auto_entityset, I've found that for entity 0 (refer to line 330 in test_normalize.py
) the name changes between 'jersey_num_team' and 'team_jersey_num' for different runs. This causes the following to fail occasionally:
assert normalized_entityset.entities[0].variable_types['jersey_num_team'] == Index
How would you go about testing this?
@thehomebrewnerd I've added the initial tests for
make_entityset
andauto_entityset
. There is overlap between the tests asauto_entityset
makes use ofmake_entityset
internally. When testing auto_entityset, I've found that for entity 0 (refer to line 330 intest_normalize.py
) the name changes between 'jersey_num_team' and 'team_jersey_num' for different runs. This causes the following to fail occasionally:assert normalized_entityset.entities[0].variable_types['jersey_num_team'] == Index
How would you go about testing this?
@j-grover Thanks for creating these tests. If we expect this behavior - where the name is not deterministic - we could set a variable for the name that is actually returned and then use that variable in the tests. One way that comes to mind would be to check if team_jersey_num
is in the variable_types
dictionary keys and if it is not we would use the other option of jersey_num_team
. Something like this:
# Index name is not always the same - checking what was returned
index_vname = "jersey_num_team"
if index_vname not in normalized_entityset.entities[0].variable_types.keys():
index_vname = "team_jersey_num"
assert normalized_entityset.entities[0].variable_types[index_vname] == Index
I think you would also need to rename one of the dataframe columns for the df.equals
test to pass reliably as well.
I don't know enough about the details of this code to know if expect to get the same name back every time, but I can look into that a bit more in the meantime to make sure this isn't highlighting some other issue.
@j-grover After reviewing the code with @rwedge , the non-deterministic nature of column names is to be expected as the new names are created by joining an unsorted list of column names. Issue #24 was created to fix this problem, so for this PR I suggest we go ahead and implement the tests as we have described above, and then we can update later after issue #24 is closed.
@j-grover After reviewing the code with @rwedge , the non-deterministic nature of column names is to be expected as the new names are created by joining an unsorted list of column names. Issue #24 was created to fix this problem, so for this PR I suggest we go ahead and implement the tests as we have described above, and then we can update later after issue #24 is closed.
Sorting the list sounds good. I have change the index names accordingly to jersey_num_team
.
@rwedge I've added a pytest fixture for that particular example