
Test ideas for data quality control

Open myyong opened this issue 4 years ago • 1 comments

This issue records the data quality problems we identified while running models trained on tee_208.RDS against new data.

These should be converted into test cases for checking data quality before running predictions.

  1. Duplicate values
  2. Missing/late values
  3. Columns read in a different order
  4. Only the date returned when the number of days requested is > 270
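
As a starting point for items 2 and 3, here is a minimal Python sketch of two such checks. It is not the project's actual test code. The column names in `EXPECTED_COLUMNS` are taken from the query in the comment below, the 30-minute reading interval is an assumption, and both function names are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical expected column order; the real schema may differ.
EXPECTED_COLUMNS = ["timestamp", "electricity_consumption", "sensor_id"]


def columns_in_expected_order(columns, expected=EXPECTED_COLUMNS):
    """Return True only if the columns match the expected names and order."""
    return list(columns) == list(expected)


def missing_timestamps(timestamps, step=timedelta(minutes=30)):
    """Return timestamps absent from a regular grid between the min and max.

    The 30-minute step is an assumption, not something the issue confirms.
    """
    present = set(timestamps)
    if not present:
        return []
    current, end = min(present), max(present)
    missing = []
    while current <= end:
        if current not in present:
            missing.append(current)
        current += step
    return missing
```

Both checks could run before prediction and fail fast, so that a reordered or gappy input never reaches the model.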

myyong avatar Nov 08 '21 10:11 myyong

Duplicated values

The duplicates are visible in utc_energy_data:

SELECT timestamp, electricity_consumption, sensor_id, COUNT(*)
FROM utc_energy_data
WHERE sensor_id = 16
GROUP BY timestamp, electricity_consumption, sensor_id
HAVING COUNT(*) > 1
ORDER BY timestamp DESC;

The duplicated values run from 2021-05-31 00:30:00 till 2021-06-02 00:00:00.
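
The same check can be run on rows already loaded in memory before prediction. The sketch below mirrors the GROUP BY / HAVING logic of the query above. It assumes each row is a `(timestamp, electricity_consumption, sensor_id)` tuple, and `find_duplicates` is a hypothetical helper name, not an existing function in the project.

```python
from collections import Counter


def find_duplicates(rows):
    """Return {row: count} for every row that occurs more than once.

    Each row is assumed to be a hashable tuple of
    (timestamp, electricity_consumption, sensor_id),
    matching the GROUP BY columns of the SQL query.
    """
    counts = Counter(rows)
    return {row: n for row, n in counts.items() if n > 1}
```

A test case would then assert that `find_duplicates(rows)` is empty for the data being scored.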

myyong avatar Nov 08 '21 10:11 myyong