GenomicRanges icon indicating copy to clipboard operation
GenomicRanges copied to clipboard

If invalid colum names are found when converting from `GRanges` to `data.frame` at least a warning should be printed

Open lbeltrame opened this issue 1 year ago • 1 comments

As it stands, when converting GRanges objects to data.frame, it is not possible to skip check.names as it is not passed when the data.frame is constructed and when mcols are handled (https://github.com/Bioconductor/GenomicRanges/blob/2b7bf7d519a2091652ceef6d6c47e3aaa1030900/R/GenomicRanges-class.R#L264) therefore names of a column will get converted if numeric or with other non allowed characters.

This actually can cause problems to downstream projects when the GRanges metadata needs to be joined with different data (https://github.com/cruk-mi/mesa/issues/30). Using correct column names instead should be the right thing to do, however this conversion in GRanges happens completely silently. In my opinion at least a warning should be printed in this case, to tell that user what's going on.

Minimal reproducer:

test <- GRanges(seqnames=c("chr1"), ranges=IRanges(start=1000001, end = 2000000), `1345`="bar") 

test
GRanges object with 1 range and 1 metadata column:
      seqnames          ranges strand |        1345
         <Rle>       <IRanges>  <Rle> | <character>
  [1]     chr1 1000001-2000000      * |         bar
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

as.data.frame(test)
  seqnames   start     end   width strand X1345
1     chr1 1000001 2000000 1000000      *   bar

lbeltrame avatar Oct 08 '24 08:10 lbeltrame

Hi,

Using correct column names instead should be the right thing to do,

I could not agree more with that but note that this is not what as.data.frame() does in base R.

however this conversion in GRanges happens completely silently.

This is also the case in base R:

> x <- list("A"=11:14, "1345"=letters[1:4])
> as.data.frame(x)  # no warning!
   A X1345
1 11     a
2 12     b
3 13     c
4 14     d

In base R, whether as.data.frame() should preserve the original names/colnames or pass them thru make.names() is controlled by the optional argument:

> as.data.frame(x, optional=TRUE)  # preserves the original names
   A 1345
1 11    a
2 12    b
3 13    c
4 14    d

The as.data.frame() methods defined in Bioconductor for various vector-like objects are expected to follow base R semantics as closely as possible. For example:

> hits <- Hits(1:3, 8:6, nLnode=3, nRnode=9, `1345`=letters[1:3])

> hits
Hits object with 3 hits and 1 metadata column:
           from        to |        1345
      <integer> <integer> | <character>
  [1]         1         8 |           a
  [2]         2         7 |           b
  [3]         3         6 |           c
  -------
  nLnode: 3 / nRnode: 9

> as.data.frame(hits)
  from to X1345
1    1  8     a
2    2  7     b
3    3  6     c

> as.data.frame(hits, optional=TRUE)
  from to 1345
1    1  8    a
2    2  7    b
3    3  6    c

However, it appears that the method for GenomicRanges objects was not following this convention.

This is now fixed in GenomicRanges 1.56.2 (BioC 3.19, current release) and 1.57.2 (BioC 3.20, current devel). See commits 8930ae8cb4c803bba3206d0cb4f86900e2477660 and 3f98a49a88350b8f8e09d984fe29bcc1be4d2a9f.

With these new versions:

> as.data.frame(test)
  seqnames   start     end   width strand X1345
1     chr1 1000001 2000000 1000000      *   bar

> as.data.frame(test, optional=TRUE)
  seqnames   start     end   width strand 1345
1     chr1 1000001 2000000 1000000      *  bar

In all fairness, the case could be made to ignore the optional argument and to always propagate the original metadata colnames for Bioconductor vector-like objects. I was never a big fan of the name mangling game in R, to say the least :wink: (see https://github.com/r-dbi/RSQLite/issues/259 and https://github.com/Bioconductor/SummarizedExperiment/issues/2). However this would deviate slightly from base R semantics so I'm hesitant to cross that bridge for now. In any case it should not be done lightly and would require some discussion with other members of the commnunity.

GenomicRanges 1.56.2 and 1.57.2 should become available via BiocManager::install() in the next 48 hours or so.

Best, H.

hpages avatar Oct 08 '24 22:10 hpages

Closing this now. Please open a new issue if you think we should further discuss the change I suggested above, that is, that as.data.frame() methods in Bioconductor should ignore the optional argument and always propagate the original metadata colnames (i.e. no name mangling). Thanks!

hpages avatar Oct 28 '24 19:10 hpages