rapids
Data cleaning with multiple platforms
Hi @JulioV, I created a branch named data_cleaning/multiple_platforms
based on our discussion. Could you review the code when you are free? Thanks!
The following things are different from our conversation:
- Besides the timestamp and os columns, I also added the device_id column to the platforms output. We need this column to run the readable_datetime.R script, so that local_datetime is computed with the correct timezone rather than UTC.
- When running the readable_datetime.R script to convert timestamp to local_date_time, I set device_type to phone_platforms, which skips the filter_wanted_dates step (start_date and end_date will be NA).
- In the data cleaning script, I do not assign the majority class of the platforms. Instead, I treat every time segment with multiple platforms as iOS. The reason is that all iOS features can also be extracted from Android devices, but some Android features are not available on iOS devices. Selected event features are imputed with 0 in the following two steps: (1) features that can be extracted from both Android and iOS devices: impute all rows directly; (2) features that can only be extracted from Android devices: select these rows and impute.
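
To make the last point easier to review, here is a rough sketch of the two imputation steps in plain Python. It is not the code in the branch: the function names, the `platforms` key, and the feature names are all hypothetical, and it assumes that "select these rows" in step (2) means the rows whose resolved platform is Android.

```python
def resolve_platform(platforms):
    """Return one platform label per time segment.

    Assumption from the discussion: a segment that saw more than one
    platform is treated as iOS instead of using the majority class.
    """
    unique = {p.lower() for p in platforms}
    if not unique:
        return None  # no platform information for this segment
    return unique.pop() if len(unique) == 1 else "ios"


def impute_event_features(rows, shared_features, android_only_features):
    """Impute missing (None) event features with 0 in two steps."""
    cleaned = []
    for row in rows:
        new_row = dict(row)  # do not mutate the caller's data
        platform = resolve_platform(row["platforms"])
        new_row["platform"] = platform

        # Step 1: features available on both platforms, impute every row.
        for feature in shared_features:
            if new_row.get(feature) is None:
                new_row[feature] = 0

        # Step 2: Android-only features, impute only Android rows (assumed).
        if platform == "android":
            for feature in android_only_features:
                if new_row.get(feature) is None:
                    new_row[feature] = 0

        cleaned.append(new_row)
    return cleaned


# Hypothetical feature names, for illustration only.
segments = [
    {"platforms": ["android"], "calls_count": None, "app_opens": None},
    {"platforms": ["android", "ios"], "calls_count": None, "app_opens": None},
]
result = impute_event_features(segments, ["calls_count"], ["app_opens"])
```

With this reading, the mixed-platform segment keeps its Android-only features as missing, because it is handled as an iOS segment.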
I need to review this PR more carefully; I think we can make some optimizations.