Question: fastaq for release 1

Open galud27 opened this issue 6 years ago • 3 comments

Hi, thanks for putting together all the code and data analysis for EMP Release 1. It's great! I have a question: I am very interested in performing a smaller study on the data, but using a different pipeline for the microbial community analysis. I was able to get fastaq files for all the studies, but the data are only available as single reads in ENA. Are there any studies with paired-end fastaq sequences?

Thanks

galud27, Feb 14 '18 22:02

Hi! Thanks for your message. We hope the resource is useful to you.

For EMP Release 1, some of the studies did not have Read 2 data, and for those that did, we did not use it. This is because the amplicon size is ~253bp, and only the more recent studies, with read lengths 150-151bp, would be long enough to merge. Read 2 tends to have a higher error rate than Read 1, and merging the Read 1 and Read 2 sequences can also introduce uncertainty, which we wanted to avoid.

For some of the more recent studies, we do have Read 2 data, and it should be available in Qiita (qiita.ucsd.edu) for some of them. We will try to come up with a list of which studies have these data available. Note that it will only be possible to merge sequences for studies with ~150bp reads; the read length is noted in the Release 1 mapping files, and studies sequenced since 2015 that are in Qiita will also have ~150bp reads. We know that people are interested in the Read 2 data, and we will try to make this more accessible in the future!
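As a rough illustration of the read-length arithmetic: two reads of length r sequenced from opposite ends of an amplicon of length a overlap by about 2r - a bases. The sketch below is only that, a sketch. The function names and the 10-base minimum overlap are assumptions chosen for illustration, not part of the EMP pipeline, and the 90 and 100bp values are just example shorter read lengths.

```python
def expected_overlap(read_length: int, amplicon_length: int = 253) -> int:
    """Approximate overlap (in bases) between Read 1 and Read 2.

    A negative value means the two reads leave a gap and cannot be merged.
    """
    return 2 * read_length - amplicon_length


def can_merge(read_length: int, amplicon_length: int = 253,
              min_overlap: int = 10) -> bool:
    # min_overlap is an assumed threshold; merging tools set their own defaults.
    return expected_overlap(read_length, amplicon_length) >= min_overlap


for r in (90, 100, 150, 151):
    print(r, expected_overlap(r), can_merge(r))
```

With a ~253bp amplicon, 150bp reads overlap by roughly 47 bases, while 100bp reads leave a gap of about 53 bases, which is why only the ~150bp studies are candidates for merging.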

Thanks! Luke

Cc: @antgonza @ackermag @walterst

cuttlefishh, Feb 19 '18 21:02

I assume "fastaq" was a typo for FASTQ.

Could you give one or two specific examples of studies with paired-end data (R1 and R2) suitable for overlap merging? Thanks!

peterjc, Feb 17 '20 13:02

These studies appear to have R1 and R2 data (plus index) and should be suitable for merging:

https://qiita.ucsd.edu/emp/study/description/10561
https://qiita.ucsd.edu/emp/study/description/10533

The studies numbered >10000 were sequenced after the start of 2015 and should have longer reads (150bp). If the reverse reads are present, it should be possible to merge forward and reverse reads.
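To confirm the read length of a downloaded file before attempting a merge, something like the following could be used. This is a minimal sketch using only the Python standard library; the function name is hypothetical, and it assumes the common four-line FASTQ layout (header, sequence, `+`, qualities), optionally gzip-compressed.

```python
import gzip
from collections import Counter
from itertools import islice


def read_length_counts(path, max_reads=10000):
    """Count sequence lengths among the first max_reads records of a FASTQ file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    counts = Counter()
    with opener(path, "rt") as handle:
        # Assumes 4 lines per record, so the sequence sits on lines 1, 5, 9, ...
        # (0-based); wrapped multi-line FASTQ is not handled.
        for seq_line in islice(handle, 1, 4 * max_reads, 4):
            counts[len(seq_line.strip())] += 1
    return counts
```

Calling `read_length_counts` on the forward and reverse files of a study (whatever they are named in your download) should show lengths around 150 if the reads are long enough to merge.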

cuttlefishh, Feb 17 '20 15:02