nhanes
nhanes copied to clipboard
Detection of numeric columns in `nhanesTranslate()`
While looking at nhanesTranslate()
, I came across this example which puzzled me for a while:
> d <- nhanesTranslate("LAB04", data = nhanes("LAB04", translated = FALSE))
Warning message:
In nhanesTranslate("LAB04", data = nhanes("LAB04", translated = FALSE)) :
No columns were translated
This is surprising because the table has categorical variables that should be translated, e.g., https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/LAB04.htm#LBDWBFLC.
The reason is that the current code mistakes these for numeric variables because one of the value descriptions is "Detectable Result and Exceeds the Calibrated Range of Assay", which includes the word "Range", and hence is flagged here:
https://github.com/cjendres1/nhanes/blob/master/R/nhanes_translate.R#L151
Should we change
nskip <- grep('Range', translations)
to
nskip <- grep('Range of values', translations)
?
Do we have a strategy for this related issue?: https://github.com/cjendres1/nhanes/issues/21#issue-1835636708
That's a new one for me. I could find 235 similar instances. This write-up has details and some possible strategies:
https://github.com/ccb-hms/nhanes-exploration/blob/main/check-numeric-categorical.md
Thanks for looking into that so thoroughly! That's quite a few cases. If possible, I think it's preferable to detect key phrases. Your suggestion of using "Range of Values" is more precise. Similarly, we could detect "or more", "or older", as well as " to " in CodeOrValue. Need to work out what the preferred outcome is in those cases, but it seems we should be able to handle those cases.
Yes, that strategy should capture most of these, though it doesn't find everything, such as
137972 OSQ020a OSQ_H 1 1 71
137973 OSQ020a OSQ_H 2 2 9
137974 OSQ020a OSQ_H 3 3 2
137975 OSQ020a OSQ_H 7777 Refused 0
137976 OSQ020a OSQ_H 9999 Don't know 1
137977 OSQ020a OSQ_H . Missing 3732
More importantly, we would have to check if this might flag legitimate categorical variables. I will try to run that check.
But I'm also a bit wary of this kind of automated flagging that are not verified manually. It may be preferable to maintain a manually curated list. This approach has the drawback that we would periodically need to check all tables manually, which would generally be difficult. But on the other hand, it's quite easy with the database backend, or even with local caching.
This needs more discussion.
FYI, I took a look at variable codes which have descriptions containing "less than" or "more than", and there seem to be quite a few false positives. Even if we restrict to descriptions that only start with "less than", the following examples come up, many of which are probably not numeric. Making this approach work will require a lot of manual fiddling, so I don't think it's going to be that useful.
[64] "Less Than High School"
[65] "Less Than 9th Grade"
[66] "Less Than 5th Grade"
[67] "Less than 1 year"
[68] "Less than usual"
[69] "Less than 1 canister"
[70] "Less than Half the Days"
[71] "Less than Half"
[72] "Less than half the days"
[73] "Less than 1 year ago"
[74] "Less than 2 weeks"
[75] "Less than 6 months ago"
[76] "Less than 1 month"
[77] "Less than weekly"
[78] "Less than 5-1/2 lbs. (2500 g)?"
[79] "Less than 9 lbs. (4100 g)?"
[80] "Less than one year"
I changed 'Range' to 'Range of Values' as you suggested (Version 0.9). Looping in Robert and Laha cause these translation "edge cases" are of general interest and would like to reach a consensus on how to handle.