NVTabular
[FEA] Restrict DropNA/Filter with api changes
Is your feature request related to a problem? Please describe. The new API we're using lets you build up a graph of operations, but the downside is that operators that change the number of rows (like DropNA/Filter) won't work well if they are combined with other chains of operators that don't change the number of rows.
One potential solution here is to restrict operator chains that contain Filter/DropNA, and not allow them to be combined with other chains using the '+' operator.
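A minimal sketch of what such a restriction could look like, written in plain Python rather than against NVTabular itself. The `ColumnChain` class, the `changes_row_count` flag, and the toy `Filter`/`Normalize` operators are all hypothetical names used only for illustration; they are not NVTabular's actual classes or attributes.

```python
class ColumnChain:
    """Toy stand-in for a chain of operators over some columns (hypothetical)."""

    def __init__(self, columns, changes_row_count=False):
        self.columns = list(columns)
        self.changes_row_count = changes_row_count

    def __rshift__(self, op):
        # Appending an operator keeps the columns; the chain is marked as
        # row-changing as soon as any operator in it can drop rows.
        flag = self.changes_row_count or getattr(op, "changes_row_count", False)
        return ColumnChain(self.columns, flag)

    def __add__(self, other):
        # Proposed restriction: refuse to combine chains with '+' when either
        # side may change the number of rows.
        if self.changes_row_count or other.changes_row_count:
            raise ValueError(
                "cannot combine chains with '+' when one of them changes "
                "the number of rows (e.g. Filter/DropNA)"
            )
        return ColumnChain(self.columns + other.columns)


class Filter:
    """Toy row-dropping operator (hypothetical, not nvtabular.ops.Filter)."""
    changes_row_count = True

    def __init__(self, predicate):
        self.predicate = predicate


class Normalize:
    """Toy operator that keeps the row count unchanged (hypothetical)."""
    changes_row_count = False


filtered = ColumnChain(["item", "timestamp"]) >> Filter(lambda row: row["timestamp"] > 0)
other = ColumnChain(["user"]) >> Normalize()

try:
    output = other + filtered
except ValueError as err:
    print(f"rejected: {err}")
```

Under this assumption the error surfaces when the graph is built, instead of the filter being silently ignored or misapplied when the workflow runs.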
Feedback from a user:
Filter has been a big point of contention and many folks have run into issues there. Like, if you want to do an operation that looks like:
# pseudocode
filtered = [item, timestamp] >> Filter(timestamp > a_year_ago)
... some more nvt operations
output = [other_columns] + filtered + ...
And if the output is run on a dataset, the normal expectation is that the filter would apply to all of the data, when in reality the dataset wouldn't be filtered at all. I learned that the filter needs to run on every branch.
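The row-count mismatch behind this can be illustrated without NVTabular. The sketch below uses plain Python lists of dicts; the sample rows and the helpers `filter_rows` and `select` are made up for illustration and say nothing about how NVTabular executes a graph internally.

```python
from datetime import datetime

# Made-up sample data for illustration.
rows = [
    {"user": "a", "item": 1, "timestamp": datetime(2019, 1, 1)},
    {"user": "b", "item": 2, "timestamp": datetime(2021, 1, 1)},
    {"user": "c", "item": 3, "timestamp": datetime(2021, 6, 1)},
]
cutoff = datetime(2020, 6, 1)


def filter_rows(data, predicate):
    """Keep only the rows for which predicate(row) is true."""
    return [row for row in data if predicate(row)]


def select(data, columns):
    """Project each row onto the given columns."""
    return [{col: row[col] for col in columns} for row in data]


def is_recent(row):
    return row["timestamp"] > cutoff


# Filtering a single branch: the two branches no longer have the same number
# of rows, so they cannot be lined up column-wise.
filtered_branch = select(filter_rows(rows, is_recent), ["item", "timestamp"])
other_branch = select(rows, ["user"])
print(len(filtered_branch), len(other_branch))

# Filtering before branching: every branch sees the same surviving rows,
# so the columns can be recombined row by row.
kept = filter_rows(rows, is_recent)
output = [
    {**left, **right}
    for left, right in zip(select(kept, ["user"]), select(kept, ["item", "timestamp"]))
]
print(output)
```

Applying the filter once, before the columns are split into branches, is one way to read "the filter needs to run on every branch": each branch then operates on the same set of rows.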