
Inconsistent behaviour and unnecessary warning when entrez_link called with single ID and by_id = TRUE

Open johnomics opened this issue 2 years ago • 1 comments

Thanks for developing rentrez, it is excellent.

The man page for entrez_link says when by_id = TRUE, a list of elink objects will be returned, one for each ID in id. This works for the example shown in the tutorial:

> all_links_sep  <- entrez_link(db="protein", dbfrom="gene", id=c("93100", "223646"), by_id=TRUE)
> all_links_sep
List of 2 elink objects,each containing
  $links: IDs for linked records from NCBI
> all_links_sep[[1]]$links$gene_protein
 [1] "1387845369" "1387845338" "1370513171" "1370513169" "1034662000" "1034661998" "1034661996" "1034661994" "1034661992" "558472750"  "545685826" 
[12] "194394158"  "166221824"  "154936864"  "122346659"  "119602646"  "119602645"  "119602644"  "119602643"  "119602642"  "37787309"   "37787307"  
[23] "37787305"   "33991172"   "21619615"   "10834676"  
> all_links_sep[[1]]$links$gene_protein_refseq
 [1] "1387845369" "1387845338" "1370513171" "1370513169" "1034662000" "1034661998" "1034661996" "1034661994" "1034661992" "558472750"  "194394158" 

But this is what happens with only a single ID:

> one_link_sep  <- entrez_link(db="protein", dbfrom="gene", id="93100", by_id=TRUE)
Warning message:
In entrez_link(db = "protein", dbfrom = "gene", id = "93100", by_id = TRUE) :
  Some IDs appear to be invalid. Result containg no information for the following IDs: 93100 , 
> one_link_sep
elink object with contents:
 $links: IDs for linked records from NCBI
> one_link_sep$links$gene_protein
 [1] "1387845369" "1387845338" "1370513171" "1370513169" "1034662000" "1034661998" "1034661996" "1034661994" "1034661992" "558472750"  "545685826" 
[12] "194394158"  "166221824"  "154936864"  "122346659"  "119602646"  "119602645"  "119602644"  "119602643"  "119602642"  "37787309"   "37787307"  
[23] "37787305"   "33991172"   "21619615"   "10834676"  
> one_link_sep$links$gene_protein_refseq
 [1] "1387845369" "1387845338" "1370513171" "1370513169" "1034662000" "1034661998" "1034661996" "1034661994" "1034661992" "558472750"  "194394158" 

The result is returned as a bare elink object rather than as a list containing one elink object. A spurious warning is also raised, even though the link data come back intact.

Please could this be returned as a list containing a single link, instead of just the single link, and the warning removed? I realise this is a slightly odd request - why use by_id with only one ID? It's because I'm running upstream queries that return different (unknown) numbers of IDs, sometimes returning only a single ID, and I want the output to always be a list so I can process it consistently. Otherwise, I need to check every return value of entrez_link to see whether it returned a single value or a list, and I need to suppress the warning, as the output is fine.

johnomics avatar Sep 21 '21 11:09 johnomics

It doesn't look like the code in this repo will be updated any time soon, so you'll have to use a workaround.
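One possible workaround, as a minimal sketch: wrap entrez_link() so the result is always list-like and the spurious warning is muffled only when a single ID was supplied. The names `as_elink_list`, `entrez_link_list` and the `link_fun` argument are hypothetical helpers, not part of rentrez, and the class vector `c("elink_list", "list")` is an assumption about how rentrez tags its list results.

```r
# Hypothetical helper (not part of rentrez): coerce either return type
# of entrez_link() into a list of elink objects.
as_elink_list <- function(x) {
  if (inherits(x, "elink_list")) {
    return(x)
  }
  if (inherits(x, "elink")) {
    # Assumption: rentrez tags its list results with these classes.
    return(structure(list(x), class = c("elink_list", "list")))
  }
  stop("Expected an 'elink' or 'elink_list' object")
}

# Hypothetical wrapper: always calls with by_id = TRUE and always returns
# a list. `link_fun` defaults to rentrez::entrez_link and is only looked
# up when the wrapper is actually called.
entrez_link_list <- function(id, ..., link_fun = rentrez::entrez_link) {
  res <- withCallingHandlers(
    link_fun(id = id, by_id = TRUE, ...),
    warning = function(w) {
      # Muffle only the "invalid IDs" warning, and only for a single ID.
      is_spurious <- length(id) == 1 &&
        grepl("appear to be invalid", conditionMessage(w), fixed = TRUE)
      if (is_spurious) invokeRestart("muffleWarning")
    }
  )
  as_elink_list(res)
}
```

With this in place, `entrez_link_list(id = "93100", dbfrom = "gene", db = "protein")[[1]]$links` should be usable the same way as in the two-ID case. Note that the sketch would also hide the warning for a single ID that really is invalid, so check that `$links` is non-empty if that matters.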

rentrez returns entrez_link() results with a different class depending on whether there is a single result or a list of results. I handled this when extracting PubMed IDs by defining S3 methods for the two outputs: elink (single result) and elink_list (list of results). See https://github.com/allenbaron/DO.utils/blob/632093c8ea37ac46a18ae559a4a0ea59395edbd4/R/extract.R#L77-L188.
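The S3 approach can be sketched roughly as follows. This is an illustration of the idea, not the DO.utils code: the generic `linked_ids` is a hypothetical name, and it assumes the two classes are `elink` and `elink_list` and that each elink object stores its IDs under `$links`, as in the output above.

```r
# Hypothetical generic: return linked IDs as a list of character vectors,
# one element per queried ID, whatever entrez_link() returned.
linked_ids <- function(x, linkname, ...) {
  UseMethod("linked_ids")
}

# Single result: wrap the IDs so the return shape matches the list case.
linked_ids.elink <- function(x, linkname, ...) {
  list(x$links[[linkname]])
}

# List of results: pull the same link set out of every element.
linked_ids.elink_list <- function(x, linkname, ...) {
  lapply(x, function(el) el$links[[linkname]])
}
```

Calling `linked_ids(one_link_sep, "gene_protein")` and `linked_ids(all_links_sep, "gene_protein")` would then both give a list, so downstream code does not need to check which class it received.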

I created a fork of rentrez to fix another bug (see PR #174). You're welcome to submit a pull request there. I'd be happy to merge it.

allenbaron avatar Jul 14 '22 20:07 allenbaron