[BUG] Duplicated metadata when querying metadata for single run accession

Open kpj opened this issue 3 years ago • 4 comments

Describe the bug: In some cases, when using SRAweb.sra_metadata with a single run accession, multiple metadata rows are returned. It would seem more sensible to return only the metadata for the requested run accession. This is problematic, for example, when retrieving metadata for a list of samples and expecting the number of rows to equal the number of queried samples.

To Reproduce Execute the following code:

>>> from pysradb.sraweb import SRAweb

>>> db = SRAweb()
>>> db.sra_metadata('SRR12169246', detailed=True)  # returns metadata for both SRR12169246 and SRR12169247
#   run_accession study_accession experiment_accession  ...                                                                       ena_fastq_ftp ena_fastq_ftp_1 ena_fastq_ftp_2
# 0  SRR12169247   SRP270837       SRX8684079           ...  [email protected]:vol1/fastq/SRR121/047/SRR12169247/SRR12169247.fastq.gz  N/A             N/A           
# 1  SRR12169246   SRP270837       SRX8684079           ...  [email protected]:vol1/fastq/SRR121/046/SRR12169246/SRR12169246.fastq.gz  N/A             N/A           

[2 rows x 32 columns]
>>> db.sra_metadata('SRR12169247', detailed=True)  # returns metadata for both SRR12169246 and SRR12169247
#   run_accession study_accession experiment_accession  ...                                                                       ena_fastq_ftp ena_fastq_ftp_1 ena_fastq_ftp_2
# 0  SRR12169247   SRP270837       SRX8684079           ...  [email protected]:vol1/fastq/SRR121/047/SRR12169247/SRR12169247.fastq.gz  N/A             N/A           
# 1  SRR12169246   SRP270837       SRX8684079           ...  [email protected]:vol1/fastq/SRR121/046/SRR12169246/SRR12169246.fastq.gz  N/A             N/A           

[2 rows x 32 columns]

Desktop:

  • OS: Linux
  • Python version: 3.8.5
  • pysradb version: 0.11.2-dev0

kpj avatar Dec 16 '20 12:12 kpj

Thanks for the bug report @kpj! I think two runs are returned because the same thing happens when you search for this accession on the NCBI SRA website. For example, see: https://www.ncbi.nlm.nih.gov/sra/?term=SRR12169246 That said, it can be handled internally. I will get to it this week.

saketkc avatar Dec 16 '20 13:12 saketkc

Thanks! I came across a similar issue when fetching metadata manually and ended up subsetting the dataframe.

Maybe there's a better way of handling this.
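
The subsetting workaround mentioned above might look roughly like the following. This is a minimal sketch, not pysradb's own behaviour: it uses a small hand-built DataFrame as a stand-in for the result of `db.sra_metadata(...)` so that it runs without network access, and the helper name `keep_requested_runs` is hypothetical.

```python
import pandas as pd


def keep_requested_runs(metadata, run_accessions):
    """Keep only rows whose run_accession was explicitly requested.

    `keep_requested_runs` is a hypothetical helper, not part of pysradb.
    """
    # Accept a single accession string as well as a list of accessions.
    if isinstance(run_accessions, str):
        run_accessions = [run_accessions]
    mask = metadata["run_accession"].isin(run_accessions)
    return metadata[mask].reset_index(drop=True)


# Stand-in for the two-row result shown in the bug report.
metadata = pd.DataFrame(
    {
        "run_accession": ["SRR12169247", "SRR12169246"],
        "study_accession": ["SRP270837", "SRP270837"],
        "experiment_accession": ["SRX8684079", "SRX8684079"],
    }
)

subset = keep_requested_runs(metadata, "SRR12169246")
print(subset)
```

With this kind of filter, querying a list of run accessions yields at most one row per requested run, provided each run appears only once in the returned metadata.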

kpj avatar Dec 16 '20 14:12 kpj

For now, I would recommend the fix you have in place. It is slightly tricky to deal with this internally, given that the passed-in argument could be anything (SRP/SRR/SRX/GSM etc.). The origin of this is not on pysradb's end, but in what the NCBI search itself returns (see my comment above).

saketkc avatar Dec 25 '20 19:12 saketkc

Is the main issue to figure out which column to detect duplicates in/which column to select the accessions from? In that case it might be an idea to add a parameter such as duplicate_accession_removal_column which would be run_accession when input accessions are of the form ERR4413803.

This is certainly not very elegant and maybe there are other issues making this more difficult, so I am happy either way :)
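
One way to pick the column automatically could be to look at the accession prefix. The sketch below is an illustration only: the function name `guess_accession_column` and the prefix-to-column mapping are assumptions, not part of pysradb, and identifiers such as GSM are deliberately left unhandled. `GSM1234567` is a placeholder ID.

```python
import re

# Assumed mapping from the accession type letter to a metadata column.
# The column names appear in the output above; the mapping is an assumption.
_COLUMN_BY_TYPE = {
    "R": "run_accession",
    "X": "experiment_accession",
    "P": "study_accession",
}

# Matches accessions such as SRR..., ERX..., DRP...: an archive letter
# (S, E or D), then "R", then the type letter, then digits.
_ACCESSION_PATTERN = re.compile(r"^[SED]R([RXP])\d+$")


def guess_accession_column(accession):
    """Return the column to filter on for this accession, or None if unknown."""
    match = _ACCESSION_PATTERN.match(accession)
    if match is None:
        return None  # e.g. GSM identifiers, which this sketch does not handle
    return _COLUMN_BY_TYPE[match.group(1)]


print(guess_accession_column("ERR4413803"))
print(guess_accession_column("SRX8684079"))
print(guess_accession_column("GSM1234567"))
```

The returned name could serve as the default for a parameter such as duplicate_accession_removal_column. Identifiers that return None here have no single obvious column, which is part of why the general case is tricky.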

kpj avatar Dec 25 '20 22:12 kpj

I ran into the same issue, and I am confused about the relationship between multiple SRR IDs within a single SRX ID. Are these SRR IDs technical replicates from a shared sequencing library? The NCBI manual left me really confused, and I would appreciate it if you could share your understanding of this.

fatyang799 avatar Feb 22 '23 01:02 fatyang799

Yes, SRRs for the same SRX are technical replicates. Here are some slides that might help: https://f1000research.com/slides/8-1183

saketkc avatar Feb 22 '23 01:02 saketkc

Many thanks for your quick reply!!

In passing, I would like to raise another problem that I encountered while using pysradb. The metadata I fetch with pysradb metadata --detailed does not include some important information.

For example, I want to obtain the antibody information for a ChIP-seq experiment ([SRX027872](https://www.ncbi.nlm.nih.gov/sra/SRX027872[accn])). On the NCBI website, I can see the antibody information (in the Experiment attributes section), but there is no related information in the metadata I fetch with pysradb.

fatyang799 avatar Feb 22 '23 01:02 fatyang799

@sheep-liu thanks for bringing it to my attention. I have pushed https://github.com/saketkc/pysradb/commit/7da562f86fe759f737b25f6581a8c44a9437b5b4, which enables fetching the experiment protocol. It will be in the next release (you can install the develop version from GitHub for now).

In the future, please create a new issue. I will close this for now, as I think the original issue is best handled downstream.

saketkc avatar Feb 22 '23 02:02 saketkc

Roger! And thanks a lot.

fatyang799 avatar Feb 22 '23 04:02 fatyang799