nhanes icon indicating copy to clipboard operation
nhanes copied to clipboard

Detection of numeric columns in `nhanesTranslate()`

Open deepayan opened this issue 1 year ago • 6 comments

While looking at nhanesTranslate(), I came across this example which puzzled me for a while:

> d <- nhanesTranslate("LAB04", data = nhanes("LAB04", translated = FALSE))
Warning message:
In nhanesTranslate("LAB04", data = nhanes("LAB04", translated = FALSE)) :
  No columns were translated

This is surprising because the table has categorical variables that should be translated, e.g., https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/LAB04.htm#LBDWBFLC.

The reason is that the current code mistakes these for numeric variables because one of the value descriptions is "Detectable Result and Exceeds the Calibrated Range of Assay", which includes the word "Range", and hence is flagged here:

https://github.com/cjendres1/nhanes/blob/master/R/nhanes_translate.R#L151

Should we change

nskip <- grep('Range', translations)

to

nskip <- grep('Range of values', translations)

?

deepayan avatar Dec 03 '23 02:12 deepayan

Do we have a strategy for this related issue?: https://github.com/cjendres1/nhanes/issues/21#issue-1835636708

cjendres1 avatar Dec 03 '23 13:12 cjendres1

That's a new one for me. I could find 235 similar instances. This write-up has details and some possible strategies:

https://github.com/ccb-hms/nhanes-exploration/blob/main/check-numeric-categorical.md

deepayan avatar Dec 03 '23 17:12 deepayan

Thanks for looking into that so thoroughly! That's quite a few cases. If possible, I think it's preferable to detect key phrases. Your suggestion of using "Range of Values" is more precise. Similarly, we could detect "or more", "or older", as well as " to " in CodeOrValue. Need to work out what the preferred outcome is in those cases, but it seems we should be able to handle those cases.

cjendres1 avatar Dec 05 '23 13:12 cjendres1

Yes, that strategy should capture most of these, though it doesn't find everything, such as

137972  OSQ020a     OSQ_H           1                1    71
137973  OSQ020a     OSQ_H           2                2     9
137974  OSQ020a     OSQ_H           3                3     2
137975  OSQ020a     OSQ_H        7777          Refused     0
137976  OSQ020a     OSQ_H        9999       Don't know     1
137977  OSQ020a     OSQ_H           .          Missing  3732

More importantly, we would have to check if this might flag legitimate categorical variables. I will try to run that check.

But I'm also a bit wary of this kind of automated flagging that are not verified manually. It may be preferable to maintain a manually curated list. This approach has the drawback that we would periodically need to check all tables manually, which would generally be difficult. But on the other hand, it's quite easy with the database backend, or even with local caching.

This needs more discussion.

deepayan avatar Dec 05 '23 14:12 deepayan

FYI, I took a look at variable codes which have descriptions containing "less than" or "more than", and there seem to be quite a few false positives. Even if we restrict to descriptions that only start with "less than", the following examples come up, many of which are probably not numeric. Making this approach work will require a lot of manual fiddling, so I don't think it's going to be that useful.

 [64] "Less Than High School"                                       
 [65] "Less Than 9th Grade"                                         
 [66] "Less Than 5th Grade"                                         
 [67] "Less than 1 year"                                            
 [68] "Less than usual"                                             
 [69] "Less than 1 canister"                                        
 [70] "Less than Half the Days"                                     
 [71] "Less than Half"                                              
 [72] "Less than half the days"                                     
 [73] "Less than 1 year ago"                                        
 [74] "Less than 2 weeks"                                           
 [75] "Less than 6 months ago"                                      
 [76] "Less than 1 month"                                           
 [77] "Less than weekly"                                            
 [78] "Less than 5-1/2 lbs. (2500 g)?"                              
 [79] "Less than 9 lbs. (4100 g)?"                                  
 [80] "Less than one year"                                          

deepayan avatar Dec 06 '23 22:12 deepayan

I changed 'Range' to 'Range of Values' as you suggested (Version 0.9). Looping in Robert and Laha cause these translation "edge cases" are of general interest and would like to reach a consensus on how to handle.

cjendres1 avatar Jan 01 '24 15:01 cjendres1