nowcasting_dataset Discussion: For testing, should we use "fake" data or a small amount of real data?

Open JackKelly opened this issue 2 years ago • 3 comments

(Let's not worry about this now... just making a note to discuss in early 2022!)

As we all know, for "fake" data to be useful for testing, it needs to accurately capture almost all of the structure of "real" data. Otherwise, the "fake" data could lead us to incorrect conclusions when debugging and testing our code (as happened when debugging the OpticalFlowDatasource tests).

Creating really "realistic" fake data is probably quite a lot of effort (for example, see issue #511).

I suppose I'm curious whether it might actually be less work to use a small amount of real data for testing, stored in the nowcasting_dataset/tests/data/ folder, instead of maintaining code to create "fake" data on the fly?

Strictly speaking, we're not allowed to share some of our data sources. Maybe it wouldn't be too much work to obfuscate a small amount of "real" data? For example, PV locations could be set to the LSOA locations that we're allowed to share publicly and, for other data sources, we could add a small amount of random noise to all the data.
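
To make the random-noise idea concrete, here is a rough sketch of what the obfuscation step might look like. The helper name `obfuscate_with_noise`, its `relative_noise` parameter, and the stand-in array are all hypothetical (nothing like this exists in nowcasting_dataset as far as this issue says), and whether this amount of noise is enough to make the data shareable is a separate question.

```python
import numpy as np


def obfuscate_with_noise(values, relative_noise=0.01, seed=0):
    """Return a copy of `values` with small Gaussian noise added.

    Hypothetical helper: the noise is scaled to the spread of the data,
    so the overall structure of the sample is largely preserved.
    """
    rng = np.random.default_rng(seed)  # fixed seed keeps the test data reproducible
    values = np.asarray(values, dtype=np.float64)
    scale = relative_noise * np.nanstd(values)
    noise = rng.normal(loc=0.0, scale=scale, size=values.shape)
    return values + noise


# Stand-in for a small sample of real data (e.g. 2 sites x 24 timesteps).
real = np.linspace(0.0, 100.0, num=48).reshape(2, 24)
obfuscated = obfuscate_with_noise(real)
```

The obfuscated array could then be saved once into nowcasting_dataset/tests/data/ rather than being regenerated on the fly.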

Nov 29 '21 14:11 JackKelly
