
Alphabetical order in testing environment different than in regular R environment?

Open swhalemwo opened this issue 7 months ago • 2 comments

I'm developing a package for project-specific data processing. One step is checking whether a number of names are really distinct, or whether similar names refer to the same person. For this I first generate from a database a data.table of pairs that are similar based on string similarity, and compare this to a data.table of pairs for which I have manually checked whether they refer to the same person. If all similar-sounding names are covered in my manually compiled list, the test passes.

I do this via a negative join with data.table:

dt_redux <- dt_pairs_from_db[!dt_manually_checked_pairs, on = .(name1, name2)]
expect_true(nrow(dt_redux)==0)

This test did pass when calling test_all or build_install_test, but failed in R CMD check.

After some searching I tracked it down to the name order in dt_pairs_from_db. Here the pairs are generated by a string similarity function, which creates two entries for each couple (name1, name2 and name2, name1). To avoid having to check each couple twice, I only cover the cases where name1 > name2. However, for one couple, "İnan Kıraç" and "Suna Kıraç", the alphabetical order differs between the normal R environment and the testing environment: in the normal R environment, expect_true("İnan Kıraç" > "Suna Kıraç") fails, but in the testing environment (in my test_package.R file), expect_true("İnan Kıraç" > "Suna Kıraç") passes.
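For anyone wanting to reproduce this outside the package, here is a minimal sketch that evaluates the same comparison under an explicitly chosen collation. `compare_under` is a hypothetical helper name, not part of tinytest or base R, and the names are written with Unicode escapes so the snippet does not depend on the file encoding. Under "C" the comparison should come down to the raw bytes, where "İ" starts with 0xC4 and "S" is 0x53, so I would expect TRUE in a UTF-8 session, but I have not checked that on every platform.

```r
# Hypothetical helper: evaluate a > b under a given LC_COLLATE setting,
# then restore whatever collation was active before the call.
compare_under <- function(a, b, collate) {
  old <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  Sys.setlocale("LC_COLLATE", collate)
  a > b
}

inan <- "\u0130nan K\u0131ra\u00e7"  # "İnan Kıraç"
suna <- "Suna K\u0131ra\u00e7"       # "Suna Kıraç"

# Result under bytewise ("C") collation
res_c <- compare_under(inan, suna, "C")
print(res_c)

# The bytes behind the two leading characters, for reference
first_bytes <- charToRaw(enc2utf8("\u0130"))  # c4 b0
print(first_bytes)
print(charToRaw("S"))                         # 53
```

Calling `compare_under(inan, suna, Sys.getlocale("LC_COLLATE"))` gives the result for the session's own locale, which is the value that differed for me between the console and the check run.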

This difference in alphabetical order led to a dt_pairs_from_db being generated that didn't match the order of the pairs in my dt_manually_checked_pairs, which caused the test to fail.

I've now fixed it by just adding this particular couple in both orders to my dt_manually_checked_pairs, but I'm curious what caused this; any ideas?

swhalemwo, Dec 04 '23 10:12

I think/vaguely remember that R CMD check uses the 'C' collation order, so bytewise sorting. You could try to set the LC_COLLATE variable to C with Sys.setenv in your test file so it is always used.
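A rough sketch of what pinning the collation in a test file could look like. Note that it uses `Sys.setlocale` rather than `Sys.setenv`: that is a swap on my part, on the assumption that changing the environment variable inside an already running session does not change how that session compares strings. The toy vector is only there to show the effect.

```r
# Top of the test file: remember the current collation, then switch to "C"
old_collate <- Sys.getlocale("LC_COLLATE")
invisible(Sys.setlocale("LC_COLLATE", "C"))

# Anything that depends on string ordering now uses bytewise comparison.
# Toy data: under "C", uppercase ASCII letters sort before lowercase ones.
x <- c("b", "A", "a", "B")
sorted <- sort(x)
print(sorted)  # "A" "B" "a" "b"

# End of the test file: put the original collation back
invisible(Sys.setlocale("LC_COLLATE", old_collate))
```

Restoring the old value at the end keeps the change from leaking into whatever runs after the test file in the same session.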

markvanderloo, Dec 06 '23 18:12

See also the 'details' section in ?run_test_dir

markvanderloo, Dec 06 '23 18:12