
Unicode characters in data column names throw an error in naWhere

Open drag05 opened this issue 2 years ago • 3 comments

I have the following data

> head(htc, 2)
      25 µL      50 µL     75 µL    100 µL  Accession
1: 1.265836 0.02575365 0.1428066 0.2107820 A0A024R6I7
2:       NA 0.01566025 0.1481060 0.2069585 A0A075B6K4

> dim(htc)
[1] 269   5

> htc[, colSums(is.na(.SD))]
    25 µL     50 µL     75 µL    100 µL Accession 
      200         0         3         0         0 

with the associated naWhere, varp and varn objects:

> naWhere[1:4, ]
     25 µL 50 µL 75 µL 100 µL Accession
[1,] FALSE FALSE FALSE  FALSE     FALSE
[2,]  TRUE FALSE FALSE  FALSE     FALSE
[3,]  TRUE FALSE FALSE  FALSE     FALSE

> dim(naWhere)
[1] 269   5

> colSums(naWhere)
    25 µL     50 µL     75 µL    100 µL Accession 
      200         0         3         0         0 

> varp <- unique(unlist(vars))
> varp
[1] "50 μL"     "75 μL"     "100 μL"    "Accession" "25 μL"   ## maybe apply gtools::mixedsort ?

> varn
[1] "25 μL" "75 μL"

Calculating the left-out columns throws the following error:

leftOut <- !varp %in% varn & colSums(naWhere[, varp]) > 0

"Error in naWhere[, varp] : subscript out of bounds"

Checking varp against colnames(naWhere):

identical(varp, colnames(naWhere))
FALSE

> intersect(varp, colnames(naWhere))
[1] "Accession"

> varp %in% colnames(naWhere)
[1] FALSE FALSE FALSE  TRUE FALSE

> which(varp %in% colnames(naWhere)) ## only "Accession" matches
[1] 4
> which(colnames(naWhere) %in% varp) ## only "Accession" matches
[1] 5
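
One way to tell whether this is an encoding problem or a difference in the characters themselves is to compare code points. The following is a minimal sketch, not taken from miceRanger: `col_name` and `var_name` are hypothetical stand-ins, and it assumes that the column names contain U+00B5 (micro sign) while varp contains U+03BC (Greek small letter mu). That is how the names appear in the printed output above, but it has not been confirmed on the actual data.

```r
# Hypothetical stand-ins for one column name of naWhere and one entry of varp.
# Assumption: the two strings differ only in the "mu" character.
col_name <- "25 \u00b5L"   # U+00B5, micro sign
var_name <- "25 \u03bcL"   # U+03BC, Greek small letter mu

# The strings look alike when printed but are not the same string.
same_string <- identical(col_name, var_name)

# Code points show where they differ (181 is U+00B5, 956 is U+03BC).
cp_col <- utf8ToInt(col_name)
cp_var <- utf8ToInt(var_name)

# Declared encodings, useful to rule out a pure encoding-mark mismatch.
enc <- Encoding(c(col_name, var_name))

print(same_string)
print(cp_col)
print(cp_var)
print(enc)
```

Running `utf8ToInt()` on the real `colnames(naWhere)` and `varp` entries would show whether the same thing is happening here.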

Comparing varp against varn still seems to work:

> !varp %in% varn
[1]  TRUE FALSE  TRUE  TRUE FALSE

The error seems to be caused by the presence of Unicode characters in the names, although they pose no problem when comparing varp against varn, as shown by the last line of code above. However, using either positional indexing with seq(along = varp) or base::enc2native seems to remove the error:

leftOut <- !varp %in% varn & colSums(naWhere[, seq(along=varp)]) > 0

> leftOut
    25 µL     50 µL     75 µL    100 µL Accession 
     TRUE     FALSE      TRUE     FALSE     FALSE 

> varp = enc2native(varp)
> leftOut <- !varp %in% varn & colSums(naWhere[, varp]) > 0
> leftOut
    50 µL     75 µL    100 µL Accession     25 µL 
    FALSE      TRUE     FALSE     FALSE      TRUE 
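
If the mismatch does come from two different "mu" characters, both workarounds above may hide it rather than fix it: positional indexing relies on the columns of naWhere being in the same order as varp, and converting only varp leaves it out of step with varn. A possible alternative is to map both vectors to the character used in the column names before indexing. This is only a sketch under that assumption: the toy `naWhere`, `varp` and `varn` below are hypothetical reconstructions, and `to_micro` is a helper made up for illustration, not part of miceRanger.

```r
# Hypothetical toy reconstruction of the objects in this issue.
# Assumption: column names use U+00B5, while varp and varn use U+03BC.
cols <- c("25 \u00b5L", "50 \u00b5L", "75 \u00b5L", "100 \u00b5L", "Accession")
naWhere <- matrix(FALSE, nrow = 3, ncol = 5, dimnames = list(NULL, cols))
naWhere[2:3, 1] <- TRUE   # missing values in the first column
naWhere[3, 3] <- TRUE     # one missing value in the third column

varp <- c("50 \u03bcL", "75 \u03bcL", "100 \u03bcL", "Accession", "25 \u03bcL")
varn <- c("25 \u03bcL", "75 \u03bcL")

# Illustrative helper: replace Greek mu with the micro sign.
to_micro <- function(x) gsub("\u03bc", "\u00b5", x, fixed = TRUE)

varp <- to_micro(varp)
varn <- to_micro(varn)

# Name-based indexing now finds every column, and varp still lines up with varn.
all_found <- all(varp %in% colnames(naWhere))
leftOut <- !varp %in% varn & colSums(naWhere[, varp]) > 0

print(all_found)
print(leftOut)
```

In this toy setup the only columns with missing values are also in varn, so no column is flagged as left out.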

Please advise, thank you!

drag05, Apr 29 '22 18:04