rapids
UnicodeDecodeError in rule phone_applications_foreground_python_features
Hi @JulioV and @Meng6, when trying to process phone applications foreground features for some participants, I get the following error:
[Wed Oct 19 17:39:19 2022]
rule phone_applications_foreground_python_features:
input: data/raw/1023/phone_applications_foreground_with_datetime_with_categories.csv, data/interim/1023/phone_app_episodes_resampled_with_datetime.csv, data/interim/time_segments/1023_time_segments_labels.csv
output: data/interim/1023/phone_applications_foreground_features/phone_applications_foreground_python_rapids.csv
jobid: 11
wildcards: pid=1023, provider_key=rapids
RAPIDS: Processing phone_applications_foreground rapids hourly0000
Traceback (most recent call last):
File "pandas/_libs/parsers.pyx", line 1119, in pandas._libs.parsers.TextReader._convert_tokens
File "pandas/_libs/parsers.pyx", line 1244, in pandas._libs.parsers.TextReader._convert_with_dtype
File "pandas/_libs/parsers.pyx", line 1259, in pandas._libs.parsers.TextReader._string_convert
File "pandas/_libs/parsers.pyx", line 1450, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 3: invalid continuation byte
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/rapids/.snakemake/scripts/tmpesfg1fnp.entry.py", line 21, in <module>
sensor_features = fetch_provider_features(provider, provider_key, sensor_key, sensor_data_files, time_segments_file)
File "/rapids/src/features/utils/utils.py", line 109, in fetch_provider_features
features = feature_function(sensor_data_files, time_segment, provider, filter_data_by_segment=filter_data_by_segment, chunk_episodes=chunk_episodes)
File "src/features/phone_applications_foreground/rapids/main.py", line 116, in rapids_features
apps_events_data = pd.read_csv(sensor_data_files["sensor_data"])
File "/opt/anaconda3/envs/rapids/lib/python3.7/site-packages/pandas/io/parsers.py", line 688, in read_csv
return _read(filepath_or_buffer, kwds)
File "/opt/anaconda3/envs/rapids/lib/python3.7/site-packages/pandas/io/parsers.py", line 460, in _read
data = parser.read(nrows)
File "/opt/anaconda3/envs/rapids/lib/python3.7/site-packages/pandas/io/parsers.py", line 1198, in read
ret = self._engine.read(nrows)
File "/opt/anaconda3/envs/rapids/lib/python3.7/site-packages/pandas/io/parsers.py", line 2157, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 847, in pandas._libs.parsers.TextReader.read
File "pandas/_libs/parsers.pyx", line 862, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas/_libs/parsers.pyx", line 941, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 1073, in pandas._libs.parsers.TextReader._convert_column_data
File "pandas/_libs/parsers.pyx", line 1126, in pandas._libs.parsers.TextReader._convert_tokens
File "pandas/_libs/parsers.pyx", line 1244, in pandas._libs.parsers.TextReader._convert_with_dtype
File "pandas/_libs/parsers.pyx", line 1259, in pandas._libs.parsers.TextReader._string_convert
File "pandas/_libs/parsers.pyx", line 1450, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 3: invalid continuation byte
After investigating, it seems that the error occurs with pd.read_csv() on both lines 116 and 124 in src/features/phone_applications_foreground/rapids/main.py and is related to a special character (specifically, "é") in the name of one of the apps used by this participant. Data for participants who did not use this app are processed successfully. Explicitly specifying an encoding (e.g., encoding = "ISO-8859-1") in pd.read_csv() seems to work as a temporary fix.
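For anyone hitting the same error, here is a minimal sketch of the failure and the temporary fix. It assumes the file was written in ISO-8859-1 (Latin-1), where "é" is the single byte 0xe9. The app name, the column values, and the helper read_apps_csv are made up for illustration and are not part of RAPIDS.

```python
import io

import pandas as pd

# Hypothetical sample data: "Café Finder" is an invented app name.
# Encoding it as ISO-8859-1 stores "é" as the single byte 0xe9,
# which is not a valid sequence on its own in UTF-8.
raw_bytes = (
    "application_name,timestamp\n"
    "Café Finder,1666200000000\n"
    "Maps,1666200060000\n"
).encode("ISO-8859-1")


def read_apps_csv(csv_bytes, encoding="utf-8"):
    """Read CSV bytes with an explicit encoding (illustrative helper)."""
    return pd.read_csv(io.BytesIO(csv_bytes), encoding=encoding)


# The default UTF-8 decoding fails on the 0xe9 byte.
try:
    read_apps_csv(raw_bytes)
    utf8_failed = False
except UnicodeDecodeError as err:
    utf8_failed = True
    print(f"UTF-8 read failed: {err}")

# Passing the encoding the file was actually written in succeeds.
apps = read_apps_csv(raw_bytes, encoding="ISO-8859-1")
print(apps["application_name"].tolist())
```

Note that ISO-8859-1 maps every byte to some character, so this read never raises, but it will silently produce the wrong characters if the file is really in a different encoding.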
We've encountered this issue running the latest version of RAPIDS (commit d255f2de) on two machines (macOS Monterey version 12.4 and Ubuntu version 20.04.1).
@Meng6 a good place to sanitize the input would be the beginning of the main script for this provider, no?
Or we could add the file encoding as a provider parameter.
Hi @JulioV, besides the locations you mentioned above, is it possible to handle it in an R script? For example, in the pull_phone_data rule.
Hard coding the encoding is ok to process this particular dataset.
The medium-term solution is to use a mutation script to force the problematic columns of the app foreground provider into UTF-8 with stringi::stri_enc_toutf8. We can publish this fix.
The long-term solution is to move away from CSV and to feather or parquet files, but this will take more work and time.
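To illustrate the idea behind the medium-term solution (normalizing text to UTF-8 before the feature scripts read it), here is a rough Python sketch. It is not the stringi-based mutation script and does not reproduce what stri_enc_toutf8 does; the function names, the ISO-8859-1 fallback, and the sample rows are all assumptions for illustration.

```python
import os
import tempfile


def to_utf8_text(raw, fallback_encoding="ISO-8859-1"):
    """Decode bytes as UTF-8; if that fails, use an assumed legacy encoding."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback_encoding)


def rewrite_as_utf8(src_path, dst_path, fallback_encoding="ISO-8859-1"):
    """Copy a text file line by line, writing every line as UTF-8."""
    with open(src_path, "rb") as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for raw_line in src:
            dst.write(to_utf8_text(raw_line, fallback_encoding))


# Usage with a temporary file holding one Latin-1 row and one UTF-8 row.
with tempfile.TemporaryDirectory() as tmp_dir:
    src_path = os.path.join(tmp_dir, "apps_raw.csv")
    dst_path = os.path.join(tmp_dir, "apps_utf8.csv")
    with open(src_path, "wb") as handle:
        handle.write(b"application_name\n")
        handle.write("Café Finder\n".encode("ISO-8859-1"))  # invalid as UTF-8
        handle.write("Pokémon GO\n".encode("utf-8"))  # already valid UTF-8
    rewrite_as_utf8(src_path, dst_path)
    with open(dst_path, encoding="utf-8") as handle:
        cleaned_lines = handle.read().splitlines()

print(cleaned_lines)
```

Decoding line by line lets valid UTF-8 rows pass through unchanged while only the offending rows use the fallback. The fallback encoding is a guess about how the data was exported, so it should be checked against the actual source.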
Thank you both! @JulioV, for the medium-term solution you suggested, would I create a new mutation script in src/data/streams/mutations/phone/aware and update the aware_*/format.yaml files in src/data/streams? Also, I know we don't currently have providers for phone applications crashes or notifications features, but given that those sensors also have the problematic application_name column, should we implement this fix for them as well in addition to phone applications foreground?
Yeah, that's the right location for the script and the edits to the format.yaml files. We don't need to implement this fix for the other application sensors for now. When we create providers for them, we can always update the streams.
Sounds good. Thank you, @JulioV!