Test ideas for data quality control
This issue records the data quality problems we identified while running models trained on tee_208.RDS on new data.
These should be converted into test cases for checking data quality before running predictions.
- Duplicate values
- Missing or late values
- Columns read in a different order
- Only the date is returned when the number of days requested is > 270
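As a starting point, two of the checks above (missing values and column order) could be sketched as plain Python functions. This is only an illustration: the 30-minute sampling interval, the `EXPECTED_COLUMNS` list, and both function names are assumptions, not something confirmed in this issue.

```python
from datetime import datetime, timedelta

# Assumed column order; adjust to whatever the trained model actually expects.
EXPECTED_COLUMNS = ["timestamp", "electricity_consumption", "sensor_id"]


def find_missing_timestamps(timestamps, step=timedelta(minutes=30)):
    """Return the timestamps that are absent between the first and last reading.

    `step` is the assumed sampling interval (30 minutes here).
    """
    seen = set(timestamps)
    if not seen:
        return []
    current, end = min(seen), max(seen)
    missing = []
    while current <= end:
        if current not in seen:
            missing.append(current)
        current += step
    return missing


def columns_in_expected_order(columns, expected=EXPECTED_COLUMNS):
    """Return True only if the columns match the expected names and order."""
    return list(columns) == list(expected)
```

A test case would then assert that `find_missing_timestamps(...)` returns an empty list and that `columns_in_expected_order(...)` is true before predictions are run.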
Duplicate values
The duplicates are visible in utc_energy_data with the following query:
SELECT timestamp, electricity_consumption, sensor_id, COUNT(*)
FROM utc_energy_data
WHERE sensor_id = 16
GROUP BY timestamp, electricity_consumption, sensor_id
HAVING COUNT(*) > 1
ORDER BY timestamp DESC;
The duplicated values run from 2021-05-31 00:30:00 to 2021-06-02 00:00:00.
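The query above could be turned into an automated check along the following lines. This is a minimal sketch that uses an in-memory SQLite database as a stand-in for the real one; the table schema, the sample rows, and the `find_duplicates` and `demo` helpers are made up for illustration, and the production database may need a different driver and placeholder syntax.

```python
import sqlite3

# Same grouping as the query in this issue, with the sensor id as a parameter.
DUPLICATE_QUERY = """
SELECT timestamp, electricity_consumption, sensor_id, COUNT(*) AS n
FROM utc_energy_data
WHERE sensor_id = ?
GROUP BY timestamp, electricity_consumption, sensor_id
HAVING COUNT(*) > 1
ORDER BY timestamp DESC
"""


def find_duplicates(conn, sensor_id):
    """Return one row per duplicated reading, with its occurrence count."""
    return conn.execute(DUPLICATE_QUERY, (sensor_id,)).fetchall()


def demo():
    """Build a tiny table with one duplicated reading (invented sample data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE utc_energy_data ("
        "timestamp TEXT, electricity_consumption REAL, sensor_id INTEGER)"
    )
    rows = [
        ("2021-05-31 00:00:00", 1.0, 16),
        ("2021-05-31 00:30:00", 1.5, 16),
        ("2021-05-31 00:30:00", 1.5, 16),  # duplicate of the previous row
        ("2021-05-31 00:30:00", 1.5, 17),  # different sensor, not a duplicate
    ]
    conn.executemany("INSERT INTO utc_energy_data VALUES (?, ?, ?)", rows)
    return find_duplicates(conn, 16)
```

A data quality test would expect `find_duplicates` to return an empty list for every sensor before predictions are run.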