Multiple bugs in data pre-processing in ARIMA R code
The following bugs were found in the module code/cleandata.R, which performs pre-processing of the training data.
Bug 1
The bug is in the hourly_av_sensor function. This function averages the temperature and humidity for a given sensor based on the timestamps of the observations. Specifically, it takes the observations whose timestamps fall within 15 minutes of a full hour and averages them together. The problem arises with timestamps close to midnight. The problematic bit of code is the following:
# Find the mean temp and RH for this new HourPM
trh_ph <- plyr::ddply(trh_sub, .(DatePM, HourPM),
                      summarise,
                      Temperature = mean(temperature, na.rm = T),
                      Humidity = mean(humidity, na.rm = T))
Here, the data are first split by DatePM, then by HourPM, and the temperature and humidity are averaged within each group. The issue is that timestamps within 15 minutes before midnight are assigned one DatePM, while timestamps within 15 minutes after midnight are assigned the DatePM of the following day. So the two blocks are not averaged together (as they should be); instead, they get averaged with timestamps about 24 hours away that share the same DatePM.
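One possible way around this, sketched below in base R, is to group on a single rounded timestamp rather than on a separate date and hour, so the two halves of the midnight window cannot end up with different keys. This is only an illustration and not the fix adopted in cleandata.R: the data frame obs, its column names and the values in it are made up for the example, and it assumes all timestamps are POSIXct values in UTC.

```r
# Hypothetical toy data (not from the project): readings around midnight, in UTC.
obs <- data.frame(
  timestamp = as.POSIXct(c("2021-01-01 23:50:00", "2021-01-02 00:10:00",
                           "2021-01-02 00:30:00", "2021-01-02 00:50:00"),
                         tz = "UTC"),
  temperature = c(10, 12, 99, 14),
  humidity = c(50, 54, 99, 60)
)

# Round every timestamp to the nearest full hour. The 23:50 and 00:10
# readings both map to 2021-01-02 00:00:00, so they share one key.
obs$hour_bin <- as.POSIXct(round(obs$timestamp, "hours"))

# Keep only readings within 15 minutes of their full hour.
offset_mins <- abs(as.numeric(difftime(obs$timestamp, obs$hour_bin, units = "mins")))
near_hour <- obs[offset_mins <= 15, ]

# Average temperature and humidity per rounded hour.
hourly <- aggregate(cbind(temperature, humidity) ~ hour_bin,
                    data = near_hour, FUN = mean)
print(hourly)
```

With this toy input, the two readings on either side of midnight are averaged together in the 00:00 bin, and the 00:30 reading is dropped because it is 30 minutes away from the nearest full hour.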
Bug 2
In large datasets, the timestamp column of the data matrices retrieved from the database contains several time zones (e.g. GMT and BST), and the code does not handle this well. The lubridate hour function is used to retrieve the hour element of a timestamp as an integer for further processing, but it does not return timezone information (e.g. it returns the integer 12 for both 12:00:00 GMT and 12:00:00 BST).
This has implications when using the as.Date function, which does take the timezone information of its input into account. Since the hour extracted with hour and the date extracted with as.Date are combined to create a new timestamp column (as in the line of code below), the newly generated timestamps are inconsistent.
trh$Timestamp <- as.POSIXct(paste(trh$Date, trh$Hour),tz="UTC",format="%Y-%m-%d %H")
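A minimal sketch of one way to keep the two components consistent is to derive both the date and the hour from the same explicit UTC representation before pasting them back together. The variable names and the example timestamp below are hypothetical, the sketch uses base R format() with its tz argument rather than the project's own helpers, and it assumes the timestamps are already POSIXct values.

```r
# Hypothetical example: one instant recorded during BST (UTC+1).
ts <- as.POSIXct("2021-06-01 00:30:00", tz = "Europe/London")

# Take BOTH the date and the hour from the UTC view of the same instant,
# so they cannot disagree about which day the hour belongs to.
date_utc <- format(ts, "%Y-%m-%d", tz = "UTC")
hour_utc <- format(ts, "%H", tz = "UTC")

# Rebuild the hourly timestamp in the same way as the original line of code.
rebuilt <- as.POSIXct(paste(date_utc, hour_utc),
                      tz = "UTC", format = "%Y-%m-%d %H")
print(rebuilt)
```

Here 00:30 BST on 1 June is 23:30 UTC on 31 May, so both components come out as 31 May and hour 23. Taking the hour from the local clock and the date from a UTC conversion would instead pair hour 0 with 31 May, which is the kind of inconsistency described above.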