cmdstanr icon indicating copy to clipboard operation
cmdstanr copied to clipboard

Escaped regex characters in variable_dims()

Open VisruthSK opened this issue 2 months ago • 10 comments

This PR escapes regex metacharacters in variable names when computing dimensions in variable_dims(), so names like OMEGA(1,1) don't error. This PR also swaps variables which have parentheses to brackets in repair_variable_names(). Notably, these don't get recovered in unrepair_variable_names(), so users will have to use the bracketed name in further calls. read_csv_metadata() now relies on data.table::fread() to parse the CSV header and generate the correctly quoted variable names.

Fixes #1116.

Copyright and Licensing

Visruth Srimath Kandali

By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:

  • Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause)
  • Documentation: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/)

VisruthSK avatar Dec 03 '25 22:12 VisruthSK

Thanks @VisruthSK!

@yizhang-yiz can you check if this works for you?

jgabry avatar Dec 03 '25 23:12 jgabry

I do wonder if anything else will break downstream if we have parameter names with parenthesis in them. We don't have tests for all the post-processing functions with non-standard parameter names, but it might be ok.

An alternative could maybe be to convert these names to names with brackets instead of trying to allow names with parenthesis? But let's see if the solution in this PR works before going down that path.

jgabry avatar Dec 03 '25 23:12 jgabry

Codecov Report

:white_check_mark: All modified and coverable lines are covered by tests. :white_check_mark: Project coverage is 87.18%. Comparing base (1d58078) to head (5527cdb). :warning: Report is 4 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1120      +/-   ##
==========================================
+ Coverage   87.15%   87.18%   +0.02%     
==========================================
  Files          14       14              
  Lines        5973     5984      +11     
==========================================
+ Hits         5206     5217      +11     
  Misses        767      767              

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov-commenter avatar Dec 03 '25 23:12 codecov-commenter

After playing around a bit, this alone won't solve the issue in #1116. Unfortunately there are other issues caused by variable names with parentheses. I'm thinking it's best to just not support names with parentheses, especially since they can be converted before trying to read the CSV files and then everything should work.

jgabry avatar Dec 05 '25 22:12 jgabry

Should swapping parens with branckets be added to repair_variable_names() and unrepair_variable_names()? Or maybe just to repair to normalize output to always be brackets?

VisruthSK avatar Dec 05 '25 23:12 VisruthSK

If changing repair_variable_names() and unrepair_variable_names() is all that's needed to get this working and the change is simple, I'm OK with adding it.

But if other things also need to be changed (I'm not sure) then I would lean towards making the user do the conversion ahead of time and not changing so many things in CmdStanR. CmdStanR was designed to handle CSV files written by CmdStan. If other CSV files can be converted to be in that format then I'm happy for them to be used with CmdStanR, but I would be hesitant to make the package more complicated in order to accommodate the rare edge case of CSV files using other conventions. Does that seem reasonable?

Some additional background info: CmdStan actually writes the variable names using periods (Sigma.1.1) and we convert them to brackets (Sigma[1,1]) after reading the CSVs.

jgabry avatar Dec 06 '25 00:12 jgabry

Yeah definitely, especially as this is somewhat rare user behaviour I don't see the point of adding more complexity to cmdstanr. Do you happen to have an example/test I could run against? I got a test from an LLM but I don't know enough about cmdstanr to know if it makes sense and/or is testing the right things.

I just pushed changes where this test passes. (but not changing unrepair so the names are permantly swapped to using brackets, basically the same as the user doing so manually)

test_that("as_cmdstan_fit handles variable names with parentheses", {
  csv_file <- tempfile(fileext = ".csv")
  writeLines(c(
    "# stan_version_major = 2",
    "# stan_version_minor = 33",
    "# stan_version_patch = 0",
    "# model = norm_model",
    "# method = sample (Default)",
    "#   sample",
    "#     num_samples = 2",
    "#     num_warmup = 0",
    "#     save_warmup = 0",
    "#     thin = 1",
    "#   random",
    "#     seed = 123",
    "#   algorithm = hmc",
    "#     metric = diag_e",
    "#     stepsize = 1",
    "# id = 1",
    "THETA4,SIGMA(1,1)",
    "2.00000E+00,2.00000E+00",
    "2.00000E+00,2.00000E+00"
  ), con = csv_file)

  expect_no_error({
    fit <- as_cmdstan_fit(csv_file, check_diagnostics = FALSE, format = "draws_matrix")
  })

  draws <- fit$draws()
  vars  <- posterior::variables(draws)

  expect_equal(posterior::ndraws(draws), 2L)
  expect_true(any(grepl("THETA4", vars)))
  expect_true(any(grepl("SIGMA", vars)))
})

VisruthSK avatar Dec 06 '25 00:12 VisruthSK

Thanks. If you can get that test to pass then you're probably on the right track, but we should test this with parameters that have more than one element. I just modified the variable names in a CSV written by CmdStan to include Sigma(1,1), Sigma(1,2), Sigma(2,1), Sigma(2,2):

logistic-202512051745-1-989cfc.csv

I wasn't sure if the names with parentheses needed to be in quotes (probably?). CmdStan doesn't write the names in quotes, but maybe with the parentheses they need to be. I guess you can modify the names in the CSV file in various ways to test it. When I try this file I get an error where it thinks there are duplicate parameter names (there aren't):

Error: Duplicate variable names are not allowed in draws objects.
The following variable names are duplicates:
{'"Sigma[1]', '1"', '"Sigma[2]', '2"'}

I wouldn't spend too much time on this since it's an edge case, but if there's a simple fix we can do it.

jgabry avatar Dec 06 '25 00:12 jgabry

Thanks for working on this! I agree this is an edge case and there’s workaround (bracket for example) on my side. Originally I was just curious why it didn’t work as fread does. Please feel free to close #1116.

yizhang-yiz avatar Dec 06 '25 03:12 yizhang-yiz

Just swapped the metadata parsing to use fread instead of a string split on commas which was messing up the variable names. I think this should bring behaviour more inline with fread. I think this PR is ready for review, predicated on the tests passing.

VisruthSK avatar Dec 06 '25 04:12 VisruthSK