Figure out where to get 'fake data'

Open karlicoss opened this issue 4 years ago • 5 comments

It would be nice to have a public repository of raw data from different services, so it would be easy to test and demonstrate HPI without having to give up your own data. Does such a thing exist?

P.S. maybe this issue rather belongs here, and I'll transfer it.

karlicoss avatar Apr 12 '20 10:04 karlicoss

Not sure about the existence of this raw data but I think you could use faker to generate all the data you need in a predictable way (seeding).
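
To illustrate what "predictable" means here, below is a minimal sketch of seeded generation. It deliberately uses the standard library's random.Random as a stand-in rather than faker itself (faker documents its own Faker.seed() call for the same purpose). The function name fake_entries, the record fields, and the app names are all made up for the example.

```python
import random
from datetime import datetime, timedelta


def fake_entries(seed: int, count: int = 5) -> list:
    """Generate `count` fake activity records; the same seed gives the same records."""
    rng = random.Random(seed)  # private generator, independent of the global random state
    current = datetime(2020, 1, 1)
    apps = ["browser", "editor", "terminal", "chat"]  # made-up values
    entries = []
    for _ in range(count):
        # Move time forward so the records come out in chronological order.
        current += timedelta(minutes=rng.randint(1, 120))
        entries.append({
            "time": current.isoformat(),
            "app": rng.choice(apps),
            "duration_s": rng.randint(10, 3600),
        })
    return entries


# Two runs with the same seed produce identical data.
assert fake_entries(42) == fake_entries(42)
```

Because the output depends only on the seed, a test can assert on exact values without shipping any data files.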

felubra avatar May 09 '20 14:05 felubra

@felubra very nice, thanks! Ideally it would be good to get hold of real data (I might just make some of mine public), but that's super helpful too.

karlicoss avatar May 09 '20 14:05 karlicoss

Briefly tried faker (I think the Hypothesis testing framework is also using it). Had some issues with lots of duplicate data (similar to what's reported here), but haven't investigated yet.

And also there is mimesis, which claims to be faster.

I guess generally the problem is that random data doesn't quite work for the demos, because real data has some sort of 'narrative', and causal structure. But anyway it's certainly useful to generate lots of it, and then filter out the datapoints so that it starts making some causal sense.
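
A rough sketch of that generate-then-filter idea, using only the standard library. The activities and the hours at which each one counts as plausible are invented for illustration; real rules would be much richer than a per-hour lookup.

```python
import random

# Hypothetical plausibility rules: hours (0-23) at which each activity makes sense.
PLAUSIBLE_HOURS = {
    "sleep": set(range(0, 7)) | {23},
    "work": set(range(9, 18)),
    "exercise": set(range(7, 9)) | set(range(18, 21)),
}


def random_events(rng, count):
    """Uniformly random (hour, activity) pairs, with no structure at all."""
    activities = sorted(PLAUSIBLE_HOURS)
    return [(rng.randrange(24), rng.choice(activities)) for _ in range(count)]


def plausible(events):
    """Keep only the events whose activity fits the hour it was assigned."""
    return [(hour, act) for hour, act in events if hour in PLAUSIBLE_HOURS[act]]


rng = random.Random(0)  # seeded so the run is repeatable
raw = random_events(rng, 1000)
filtered = sorted(plausible(raw))
```

Most of the raw events are thrown away, which is fine because generating them is cheap; what remains at least no longer has people working at 3am.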

karlicoss avatar Sep 19 '20 00:09 karlicoss

In terms of organizing the code, etc: it seems that data generation belongs naturally in the data access layers (DALs).

The idea is that the code that parses raw data and the code that generates fake raw data live close together, so they don't go out of sync (that also gives you CI for data parsing for free: just run the parser against the fake data).
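
As a sketch of what keeping the two together could look like, here is a toy DAL in a single module. The raw JSON format, the Entry fields, and the names parse and generate_fake_raw are assumptions made up for this example, not the layout of any real export.

```python
import json
import random
from datetime import datetime, timezone
from typing import NamedTuple


class Entry(NamedTuple):
    dt: datetime
    duration_s: int
    activity: str


def parse(raw: str) -> list:
    """Parse a raw JSON export (hypothetical format) into Entry objects."""
    return [
        Entry(
            dt=datetime.fromtimestamp(row["ts"], tz=timezone.utc),
            duration_s=row["duration"],
            activity=row["activity"],
        )
        for row in json.loads(raw)
    ]


def generate_fake_raw(rows: int = 100, seed: int = 0) -> str:
    """Produce fake raw data in the same format that parse() expects."""
    rng = random.Random(seed)
    ts = 1_577_836_800  # 2020-01-01T00:00:00Z
    out = []
    for _ in range(rows):
        ts += rng.randint(60, 3600)  # strictly increasing timestamps
        out.append({
            "ts": ts,
            "duration": rng.randint(1, 600),
            "activity": rng.choice(["code", "mail", "web"]),
        })
    return json.dumps(out)


def test_parse_fake_data() -> None:
    """The 'free' CI check: the parser must accept what the generator emits."""
    entries = parse(generate_fake_raw(rows=50))
    assert len(entries) == 50
    assert all(a.dt < b.dt for a, b in zip(entries, entries[1:]))
```

If the raw format changes, the generator and the parser have to be updated in the same place, and the test fails as soon as one is changed without the other.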

Then, the corresponding HPI module uses the DAL to generate fake data and set it as inputs: https://github.com/karlicoss/HPI/blob/28fcc1d9b6f64f57c7a05ba3aaffef2fade04f9a/my/rescuetime.py#L78-L84

It works as a context manager, e.g.

with my.rescuetime.fake_data():
    # rescuetime module will run against fake data now
    ...

Here's an example: https://github.com/karlicoss/dashboard/blob/623555e09647cce20bcc60f8ba6e9f5e932d32a2/src/dashboard/tabs.py#L103-L116
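
For readers who don't want to follow the links, here is a self-contained sketch of how a fake_data() helper of this kind could be built. It is not the code from the linked commit: the config class, its export_path attribute, the entries() reader, and the generated rows are hypothetical stand-ins that only show the pattern of swapping inputs and restoring them afterwards.

```python
import json
import tempfile
from contextlib import contextmanager
from pathlib import Path


class config:
    """Stand-in for a user config; export_path is where raw exports live."""
    export_path = Path("/path/to/real/exports")


def entries() -> list:
    """Read every JSON export found under config.export_path."""
    rows = []
    for f in sorted(config.export_path.glob("*.json")):
        rows.extend(json.loads(f.read_text()))
    return rows


@contextmanager
def fake_data(rows: int = 10):
    """Temporarily point config.export_path at a directory of generated data."""
    original = config.export_path
    with tempfile.TemporaryDirectory() as td:
        tdir = Path(td)
        fake = [{"ts": i, "activity": "fake"} for i in range(rows)]
        (tdir / "fake.json").write_text(json.dumps(fake))
        config.export_path = tdir
        try:
            yield
        finally:
            config.export_path = original  # always restore the real inputs


with fake_data(rows=5):
    # entries() now reads the generated file instead of real exports
    assert len(entries()) == 5
```

The try/finally matters: if the code inside the with block raises, the module still goes back to the real inputs, and the temporary directory is cleaned up either way.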

And the end result: a Rescuetime data heatmap generated from completely fake data, with everything running on CI! https://karlicoss.github.io/dashboard/rescuetime.html

The snippets are a bit awkward at the moment, but I'll fix a couple of minor caveats, and I feel like this could work really well!

karlicoss avatar Sep 19 '20 00:09 karlicoss

Some test data I uploaded myself:

  • https://github.com/karlicoss/hpi-testdata
  • https://github.com/karlicoss/mydata

karlicoss avatar May 31 '22 13:05 karlicoss