grep exercise unrealistic
Arizona Bug BBQ - In general we dislike the current set of exercises using grep. It is quite artificial and not relevant to the pipeline that we are working through with the learners. We suggest dropping grep and piping entirely from this lesson unless someone comes up with an exercise that is relevant to the current data set and is something learners would use in their actual workflow.
Additionally, most bioinformatic tools don't take advantage of piping.
I agree, it is not directly relevant to a full workshop, and the workshop would profit from trimming down the material. However, whenever I teach this lesson as "stand alone", I never skip this because the output of many bioinformatic tools I use needs to be redirected to a file. I would suggest we make this an optional episode under 'Extras'. What do others think?
I was trying to come up with a useful exercise with fastq files and grep, but yeah, the lesson is kind of artificial. If the lesson was done on a set of fasta files (transcripts, etc) it would be easier to come up with relevant examples for grep, piping and other things, but that would mean too much work I guess.
Still, grep and piping are very useful in downstream processing of results, and I also think it would be good to have these exercises in the 'Extra' episode.
Learning about grep and redirect is useful in many cases.
In order to "mimic" an AWS instance for local (laptop) teaching, I first used Ubuntu (20.04 LTS) within Docker to follow the lessons, as Ubuntu is what is shown on the AWS "splash" screen of the introduction lesson 01. I thought that there was an error in the grep exercises of Lesson 4 | Redirection because I was getting a count of 537 lines for the "bad" reads with 10 Ns, rather than 802 as in the lesson.
grep -B1 -A2 NNNNNNNNNN SRR098026.fastq | wc
537 1073 23217
However, if I used the same command on my macOS, I would get 802, as it is written in the lesson. I then tried Docker instances of Alpine and CentOS 7, and these also resulted in 537. The difference is that on the Linux distros it is GNU grep, while on the Mac it is BSD grep.
After some searching I figured out that the difference comes from the -- separator lines that grep writes between groups of output when context options such as -B and -A are used. GNU grep on Linux writes only one, towards the end, while BSD grep on the Mac writes 266 of them:
# On macOS:
grep -B1 -A2 NNNNNNNNNN SRR098026.fastq | wc -l
802
grep -B1 -A2 NNNNNNNNNN SRR098026.fastq | egrep '^--' | wc -l
266
I am not sure if/how this is a bug, but there is definitely a problem and an inconsistency. I don't understand why GNU grep would print only one. I also checked the "end-of-line" characters to make sure that the file had a Unix format.
Was the original course developed on a Linux distro or a BSD-derived system?
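One way to get a count that does not depend on which grep is installed is to drop the -- separator lines before counting. This is only a minimal sketch: it assumes no FASTQ line consists of exactly --, and it uses a tiny made-up file (toy.fastq, a hypothetical name) in place of SRR098026.fastq:

```shell
# Build a tiny, made-up FASTQ file: reads 1 and 3 are "bad" (10 Ns), read 2 is fine.
printf '%s\n' \
  '@read1' 'NNNNNNNNNNACGT' '+' '!!!!!!!!!!IIII' \
  '@read2' 'ACGTACGTACGTAC' '+' 'IIIIIIIIIIIIII' \
  '@read3' 'NNNNNNNNNNTTTT' '+' '!!!!!!!!!!IIII' \
  > toy.fastq

# Drop group-separator lines ("--") before counting, so only FASTQ lines remain.
# tr removes the leading spaces that some wc implementations print.
bad_lines=$(grep -B1 -A2 NNNNNNNNNN toy.fastq | grep -v '^--$' | wc -l | tr -d ' ')

# Each FASTQ record has 4 lines, so dividing by 4 gives the number of bad reads.
bad_reads=$((bad_lines / 4))
echo "bad lines: ${bad_lines}, bad reads: ${bad_reads}"
```

With GNU grep, the --no-group-separator option has the same effect as the second grep, but filtering the output keeps the command independent of the grep implementation.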
I agree, grep is a useful tool. I have a suggestion for a relevant lesson, based on something I am currently doing in my dissertation with fastq files. I originally had BAM files; I stripped them of the reference genome and then had to separate the paired-end fastq reads to re-align them to a new reference genome. The 'for loop' code I used in my Mac terminal to separate the files is:
First, separate the paired-end reads into 1 and 2.
for f in *.fastq; do
  # --no-group-separator is a GNU grep option; it suppresses the -- lines between matches
  grep -A 3 --no-group-separator '^@.*/1$' "${f}" > "PreAligned_Fastq/${f}_R1.fastq"
  grep -A 3 --no-group-separator '^@.*/2$' "${f}" > "PreAligned_Fastq/${f}_R2.fastq"
done
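Since --no-group-separator is a GNU grep option, the same split could also be sketched with awk, which avoids depending on the grep implementation. This is an illustration, not the code used above: it assumes every record is exactly 4 lines and that header lines end in /1 or /2, and the file names (toy_pairs.fastq, toy_R1.fastq, toy_R2.fastq) are made up:

```shell
# Tiny made-up FASTQ file containing one /1 read and one /2 read.
printf '%s\n' \
  '@pairA/1' 'ACGTACGT' '+' 'IIIIIIII' \
  '@pairA/2' 'TTTTCCCC' '+' 'IIIIIIII' \
  > toy_pairs.fastq

# On each header line (line 1 of every 4-line record) decide whether to keep
# the record; every line is then printed while "keep" is true.
awk 'NR % 4 == 1 { keep = ($0 ~ /\/1$/) } keep' toy_pairs.fastq > toy_R1.fastq
awk 'NR % 4 == 1 { keep = ($0 ~ /\/2$/) } keep' toy_pairs.fastq > toy_R2.fastq

# Each output file should hold one complete 4-line record.
wc -l toy_R1.fastq toy_R2.fastq
```

Deciding only on header lines also avoids accidentally matching a quality line that happens to start with @.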
Good comment on the type of grep command used. Lesson should be updated.