make save_output_files create files with all comment lines at the start
A simple preprocessing step with grep -v '^#' is one way to work around this issue. Fixing it inside cmdstanr may not be simple if it is tied to how these files are written during sampling, but I'm raising it in case it is.
The files created by save_output_files have some comment lines at the start, some more in between the column headers (parameter names) and the values, and some more at the end.
This is too much for poor data.table::fread() to handle: pending its long-awaited comment.char argument, it can only reliably skip lines that come together at the start of the file. Since data.table::fread() is the go-to for huge CSV files, it would be nice if all the comment lines were put together at the start of the file, so that these files can be read as-is by fread().
Example
library(data.table)
library(cmdstanr)
code <- "
data {
int N;
vector[N] x;
vector[N] y;
}
parameters {
real m;
real c;
real sigma;
}
model {
y ~ normal(m * x + c, sigma);
}
"
file <- write_stan_file(code)
model <- cmdstan_model(file)
samples <- model$sample(data = list(N = 1, x = 1, y = 1), iter_sampling = 10, iter_warmup = 10)
samples$save_output_files("~/", basename = "foo", timestamp = FALSE, random = FALSE)
df_ <- fread("~/foo-1.csv")
gives
Warning messages:
1: In fread("~/foo-1.csv") :
Detected 3 column names but the data has 10 columns (i.e. invalid file). Added 7 extra default column names at the end.
2: In fread("~/foo-1.csv") :
Stopped early on line 63. Expected 10 fields but found 1. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<# >>
and df_ is
# 1 1 1 V4 V5 V6 V7 V8 V9 V10
<num> <num> <num> <int> <int> <int> <num> <num> <num> <num>
1: -4.76425 0.999885 2.74896 7 127 0 6.27136 184.340 -244.86300 95.1088
2: -4.66059 0.999970 2.74896 6 63 0 5.03261 213.799 -171.11100 96.2315
3: -5.09712 0.981220 2.74896 6 66 1 6.39704 166.980 -125.97500 158.4170
4: -6.11659 0.999955 2.74896 7 127 0 6.23478 281.608 -7.43179 252.5850
5: -6.35803 0.999807 2.74896 8 255 0 9.37554 242.192 -442.52700 538.0940
6: -7.64953 0.999985 2.74896 10 1023 0 8.23180 444.349 -868.81500 2055.1400
7: -7.33027 1.000000 2.74896 10 1023 0 8.37176 -251.299 922.23900 1348.7000
8: -6.84893 1.000000 2.74896 10 1023 0 7.82562 -1589.190 1803.51000 917.7360
9: -6.91315 0.999973 2.74896 8 511 0 7.26564 -1565.040 1840.16000 965.7080
10: -6.92680 0.999974 2.74896 9 831 0 7.81856 -2004.660 2241.34000 990.8020
Sorry I'm just seeing this issue now, not sure why I missed it before. Unfortunately this is the way CmdStan itself writes the CSV files during sampling, not something that the R package decides or modifies (we also use fread inside the R package and have to get around this issue too). I don't think we want to mess with the CSV files that CmdStan creates since there's already a lot of code that assumes that they are the way they are, but I agree it's suboptimal. I guess we could consider adding an argument to save_output_files() that can be turned on to strip comments from the CSV files?
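For anyone who wants the layout requested in the title without touching what CmdStan writes, a post-processing step on the saved copies is one option. The following is only a rough sketch: move_comments_to_top() is a hypothetical helper, not a cmdstanr function, and the toy file merely imitates the comment/header/comment/values/comment layout described above. Code that parses CmdStan CSV files may rely on the original position of the comments, so this is best applied to copies only.

```r
# Hypothetical helper (not part of cmdstanr): rewrite a CSV file so that all
# lines starting with "#" come first, followed by the header and the draws.
move_comments_to_top <- function(path, out = path) {
  lines <- readLines(path)
  is_comment <- startsWith(lines, "#")
  # Comments keep their relative order, and so do the header and data lines.
  writeLines(c(lines[is_comment], lines[!is_comment]), out)
  invisible(out)
}

# Toy file imitating the layout of a CmdStan output file (made-up values).
tmp <- tempfile(fileext = ".csv")
writeLines(
  c(
    "# model = foo",
    "lp__,m,c",
    "# Adaptation terminated",
    "-4.7,0.1,0.2",
    "-4.6,0.3,0.4",
    "# Elapsed Time: 0.01 seconds"
  ),
  tmp
)

move_comments_to_top(tmp)
reordered <- readLines(tmp)
```

After this step the comment lines form a single block at the start of the file, which is the layout the original request says fread() can skip.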
Or just rely on fread(cmd = paste("grep -v '^#'", my_file)); that's OK too. I was just wondering if there was a simple fix.
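Where grep is not available (for example on a plain Windows setup), roughly the same workaround can be done in R alone. This is a minimal sketch: read_csv_without_comments() is a hypothetical helper name, and the toy file only imitates the layout of a CmdStan output file. It reads the whole file with readLines() first, so for very large files the grep pipeline is likely to be the better choice.

```r
library(data.table)

# Hypothetical helper: drop every line starting with "#" in R,
# then let fread() parse the remaining header and data lines.
read_csv_without_comments <- function(path) {
  lines <- readLines(path)
  fread(text = lines[!startsWith(lines, "#")])
}

# Toy file imitating the layout of a CmdStan output file (made-up values).
tmp <- tempfile(fileext = ".csv")
writeLines(
  c(
    "# model = foo",
    "lp__,m,c",
    "# Adaptation terminated",
    "-4.7,0.1,0.2",
    "-4.6,0.3,0.4",
    "# Elapsed Time: 0.01 seconds"
  ),
  tmp
)

draws <- read_csv_without_comments(tmp)
```

With the file from the example above, the call would be read_csv_without_comments("~/foo-1.csv").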