Feature: handling empty values
In real-world data, an element is sometimes absent for reasons such as data corruption, a failure to load the information, or incomplete extraction. Handling these missing values is one of the greatest challenges analysts face, because the choice of how to handle them determines how robust the resulting data models are.
This pull request adds the ability to handle empty values to the DataFrame project.
Consider the example below:
```smalltalk
df := DataFrame withRows: #(
	#( Barcelona 1.609 nil 3 )
	#( nil nil true 4 )
	#( London 8.788 false 1 )
	#( Tokyo 5.785 nil 5 )
	#( Beijing nil false 6 ) ).
df rowNames: #( A B C D E ).
df columnNames: #( City Population BeenThere Position ).
```
Methods such as `replaceNils: anObject`, `replaceNilsWithZero`, `replaceNilsWithMean`, `replaceNilsWithMedian`, and `replaceNilsWithMode` are self-explanatory. Below are examples of the remaining methods.
```smalltalk
df numberOfNils.
```
Returns a Dictionary that maps each column name to the number of nil values in that column.
| Key | Value |
|---|---|
| #City | 1 |
| #Population | 2 |
| #BeenThere | 2 |
| #Position | 0 |
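The counts in the table can be checked by hand. The following is only a sketch of the counting logic on plain Arrays, not the DataFrame implementation from this PR; the variable names `rows`, `columnNames`, and `counts` are made up for illustration.

```smalltalk
| rows columnNames counts |
"Same data as the example DataFrame, as plain nested Arrays."
rows := #(
	#( Barcelona 1.609 nil 3 )
	#( nil nil true 4 )
	#( London 8.788 false 1 )
	#( Tokyo 5.785 nil 5 )
	#( Beijing nil false 6 ) ).
columnNames := #( City Population BeenThere Position ).

"For each column, count the rows whose cell in that column is nil."
counts := Dictionary new.
columnNames withIndexDo: [ :name :index |
	counts
		at: name
		put: (rows count: [ :row | (row at: index) isNil ]) ].
counts
```

A column has nils exactly when its count is greater than zero, which is the per-column true/false answer that `hasNilsByColumn` reports.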
```smalltalk
df hasNilsByColumn.
```
Returns a Dictionary that maps each column name to a Boolean indicating whether that column contains any nil values.
| Key | Value |
|---|---|
| #City | true |
| #Population | true |
| #BeenThere | true |
| #Position | false |
```smalltalk
df hasNils.
```
Returns true when a nil value is present anywhere in the DataFrame, and false otherwise.
```smalltalk
df removeRowsWithNils.
```
Returns a modified DataFrame with every row that contains a nil removed.
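To see which rows survive in the example, here is a minimal sketch of the same filtering on plain Arrays. It is an illustration only and does not use the DataFrame API; `rows` and `complete` are hypothetical names.

```smalltalk
| rows complete |
"Same data as the example DataFrame, as plain nested Arrays."
rows := #(
	#( Barcelona 1.609 nil 3 )
	#( nil nil true 4 )
	#( London 8.788 false 1 )
	#( Tokyo 5.785 nil 5 )
	#( Beijing nil false 6 ) ).

"Keep only the rows that contain no nil cell."
complete := rows reject: [ :row | row includes: nil ].
complete
```

With this data only the London row (row C) has no nil cells, so it is the only row left.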
```smalltalk
df replaceNilsWithPreviousRowValue.
```
Propagates the last valid observation in each column forward, much like `ffill()` in pandas.
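As an illustration of forward filling, here is a sketch on a single column held in a plain Array (the `BeenThere` values from the example). It is not the method's actual implementation, and the names `beenThere`, `last`, and `filled` are invented for the sketch. It assumes that a nil with no earlier value stays nil, as in pandas; the PR description does not say how the DataFrame method treats that case.

```smalltalk
| beenThere last filled |
"The BeenThere column of the example, rows A to E."
beenThere := #( nil true false nil false ).

"Walk down the column, remembering the last non-nil value seen so far
 and using it wherever the current cell is nil."
last := nil.
filled := beenThere collect: [ :each |
	each isNil ifFalse: [ last := each ].
	last ].
filled
```

Here the nil in row D is replaced by the `false` from row C, while the nil in row A has nothing before it and remains nil.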
This is a very useful addition, thanks! I left a few comments above, and then there's the CI failure that I don't quite understand, but overall this is very nice!
The CI failure is not related. An alternative to one of the methods added here has been added to DataFrame in the meantime. I'll merge this PR and open another one to deprecate the other way.