synthetic
R package for dataset generation and benchmarking
One of the biggest selling points for me of [fst](https://github.com/fstpackage/fst) is the ability to randomly access (and append in [v0.9.2](https://github.com/fstpackage/fst/milestone/23)!) disk stored data. That is, load specific data into an...
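As a rough illustration of that random access, the sketch below writes a small table to disk and reads back only a row range of two columns. It assumes the `fst` package is installed; the table, column names and temporary file are made up for the example.

```r
# Sketch: read a slice of an fst file instead of loading the whole table.
# The data frame and file path are illustrative assumptions.
library(fst)

path <- tempfile(fileext = ".fst")
df <- data.frame(
  id    = 1:1000,
  value = rnorm(1000),
  group = sample(letters, 1000, replace = TRUE)
)
write_fst(df, path, compress = 50)

# Load rows 101 to 200 of two columns only; the rest stays on disk
slice <- read_fst(path, columns = c("id", "value"), from = 101, to = 200)
```

Only the requested rows and columns are materialized in memory, which is what makes this attractive for benchmarking large generated datasets.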
A numeric vector can have a limited number of _levels_ that are replicated: ```r # 10 'levels' vec_levels
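The idea of a numeric vector with a limited number of levels can be sketched in base R, without the `synthetic` generators. The names `level_pool` and `vec_sketch` are hypothetical and not part of the package; "levels" is assumed here to mean a small pool of distinct values that is sampled with replacement.

```r
# Base R sketch (not the synthetic API): a long numeric vector
# built from a small pool of distinct values.
set.seed(42)

n_levels   <- 10
level_pool <- runif(n_levels, min = 0, max = 100)  # the distinct 'levels'

# Replicate the levels by sampling from the pool with replacement
vec_sketch <- sample(level_pool, size = 1e4, replace = TRUE)

# Number of distinct values actually present (at most n_levels)
length(unique(vec_sketch))
```

Vectors like this compress very differently from fully random ones, which is presumably why the number of levels is worth controlling in a benchmark.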
Advanced feature to generate dataset samples from a source dataset with the correlations between column vectors retained: ``` r dt
And generate new sample data from that. Correlations between columns can be retained: ``` r dt
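One way to picture correlation-preserving sampling is the base R sketch below. It is not the method used by `synthetic`; it swaps in a plain multivariate-normal approximation (Cholesky factor of the source covariance), so it only retains means and linear correlations. `source_dt` and `sample_correlated()` are hypothetical names for illustration.

```r
# Sketch under a Gaussian assumption, not the package's own algorithm.
set.seed(1)

# Illustrative source data with correlated numeric columns
source_dt   <- data.frame(x = rnorm(500))
source_dt$y <- 0.8 * source_dt$x + rnorm(500, sd = 0.5)
source_dt$z <- -0.5 * source_dt$x + rnorm(500)

sample_correlated <- function(src, n) {
  mu    <- colMeans(src)
  sigma <- cov(src)
  # chol() returns upper-triangular R with t(R) %*% R == sigma
  r <- chol(sigma)
  # Independent standard normals, then impose the covariance structure
  z   <- matrix(rnorm(n * ncol(src)), nrow = n)
  out <- sweep(z %*% r, 2, mu, "+")
  colnames(out) <- names(src)
  as.data.frame(out)
}

new_dt <- sample_correlated(source_dt, 5000)

# Compare the two correlation matrices
round(cor(source_dt), 2)
round(cor(new_dt), 2)
```

The new sample can be any size, and its correlation matrix should be close to that of the source, though non-normal marginals and non-linear dependencies are lost with this simplification.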
Using appropriate generators and the `synthetic` infrastructure
For example:

``` r
synthetic_bench() %>%
  bench_tables(generator, column_mode = "single column") %>%
  bench_streamers(rds_streamer, fst_streamer, parquet_streamer, feather_streamer) %>%
  bench_rows(1e7, 5e7) %>%
  bench_compression(50, 80) %>%
  compute()
```

Parameter _column\_mode_ could specify the...
Although the arrow package doesn't directly support selection of the number of threads (I think)
See [here](https://arrow.apache.org/blog/)
As it also tracks memory allocations. It would be nice to benchmark memory usage across packages as well.
For long benchmarks, the user should not lose data when an error is encountered (or the system goes down). Instead, we can use a file to save temp results and...
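A minimal base R sketch of such a checkpoint file is shown below. `run_with_checkpoint()` and its arguments are hypothetical and not part of `synthetic`; the assumption is simply that each benchmark step has an id and that results so far are written with `saveRDS()` after every step.

```r
# Sketch: persist partial results so a crashed run can be resumed.
# Function and argument names are illustrative assumptions.
run_with_checkpoint <- function(task_ids, run_task, checkpoint_path) {
  # Resume from earlier results if a checkpoint file exists
  results <- if (file.exists(checkpoint_path)) {
    readRDS(checkpoint_path)
  } else {
    list()
  }

  for (id in task_ids) {
    key <- as.character(id)
    if (!is.null(results[[key]])) next  # already done in a previous run

    results[[key]] <- run_task(id)
    saveRDS(results, checkpoint_path)   # save after every finished task
  }
  results
}

# Usage: time three small sorts, checkpointing after each one
checkpoint <- tempfile(fileext = ".rds")
timings <- run_with_checkpoint(
  1:3,
  function(id) system.time(sort(runif(1e5)))[["elapsed"]],
  checkpoint
)
```

If a task throws an error, everything finished before it is already on disk, and calling the function again with the same file skips the completed tasks.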