rvtests

Empty string in phenotype file not handled as a missing value

Open dpanyard opened this issue 4 years ago • 0 comments

Hi, I'm not sure whether this is something to fix or something to just clarify explicitly in the documentation, but I discovered an issue while running rvtests using a space-delimited phenotype file where an empty string is used for missing values. It appears that rvtests treats consecutive spaces as a single delimiter instead of treating the empty field between them as a missing value.

For instance, if my phenotype file contained...

fid iid fatid matid sex pheno1 pheno2 pheno3 pheno4 pheno5
1 1 3 4 1 17 39 20 933 0.4
2 2 5 6 1 93  84 588 0.3

[note that there are two spaces between 93 and 84 above]

...then it seems that for the second row of data (iid 2) rvtests would use the value for pheno3 (84) for the pheno2 value, use the value for pheno4 (588) for the pheno3 value, use the value for pheno5 (0.3) for the pheno4 value, and then treat pheno5 as missing. However, the desired outcome here would be that pheno2 would be the missing value and that pheno3, pheno4, and pheno5 would be read in appropriately.
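To illustrate the shift described above, here is a minimal Python sketch. It is not rvtests' actual parsing code; it only assumes that the parser collapses runs of whitespace, which Python's `str.split()` with no argument also does, and contrasts that with splitting on every single space.

```python
# Illustration only: this mimics the suspected behaviour, not rvtests itself.
header = "fid iid fatid matid sex pheno1 pheno2 pheno3 pheno4 pheno5".split()
row = "2 2 5 6 1 93  84 588 0.3"  # two spaces between 93 and 84

# Runs of whitespace count as ONE delimiter: the empty field disappears.
collapsed = row.split()

# Every single space is a delimiter: the empty field is preserved as "".
strict = row.split(" ")

# zip() stops at the shorter sequence, so pheno5 has no value in `shifted`.
shifted = dict(zip(header, collapsed))
intended = dict(zip(header, strict))

print(len(collapsed), len(strict))
print(shifted["pheno2"], repr(intended["pheno2"]))
```

With collapsing, the row has one field fewer than the header and every value after pheno1 moves one column to the left, which matches the "short of columns" warning.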

I discovered that this issue was happening by seeing the following warning message in the log file: [WARN] Skip line 813 (short of columns) in phenotype file

I easily fixed the issue by ensuring all missing values are instead encoded as the string "NA", but it would be worth considering having rvtests handle the case of a missing value encoded as an empty string (""). It might also be helpful to throw an error when there are too few columns on certain rows such as when this issue is happening.
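For anyone hitting the same problem, the workaround can be sketched as a small preprocessing step. This is only a sketch under my own assumptions: the file is delimited by exactly one space per field, the first line is the header, and `fill_missing` is a hypothetical helper name, not part of rvtests.

```python
def fill_missing(lines, missing="NA"):
    """Replace empty fields with `missing` in single-space-delimited lines.

    Assumes lines[0] is the header. Raises ValueError on any row whose
    field count differs from the header, rather than silently skipping it.
    """
    lines = [line.rstrip("\n") for line in lines]
    n_cols = len(lines[0].split(" "))
    out = [lines[0]]
    for num, line in enumerate(lines[1:], start=2):
        # Split on every single space so that "93  84" yields an empty field.
        fields = [f if f != "" else missing for f in line.split(" ")]
        if len(fields) != n_cols:
            raise ValueError(
                f"line {num}: expected {n_cols} columns, found {len(fields)}"
            )
        out.append(" ".join(fields))
    return out


cleaned = fill_missing([
    "fid iid fatid matid sex pheno1 pheno2 pheno3 pheno4 pheno5",
    "1 1 3 4 1 17 39 20 933 0.4",
    "2 2 5 6 1 93  84 588 0.3",
])
print("\n".join(cleaned))
```

Raising on a column-count mismatch is a deliberate choice here: it surfaces the problem immediately instead of leaving it in a log warning.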

The documentation could be more explicit as well. The documentation currently reads...

In phenotype file, missing values can be denoted by NA or any non-numeric values.

It would be helpful to explicitly mention that the empty string is not currently supported as a missing value and will produce incorrect results.

dpanyard, May 27 '20 21:05