FASTA parsing experiment
Hi @fubark,
Thanks again for your awesome language. I played around with Cyber a bit today for FASTA parsing to see how it might fare against some other languages (inspiration here). My results are here if you are interested in taking a look. Right now Python is ahead by roughly two orders of magnitude. I know Cyber is designed for embedded systems, but I thought I might get lucky with some fast I/O as well :).
This is a really promising language that's been fun to use; thank you. zach cp
time python3 readfq.py < GCA_013297495.1_ASM1329749v1_genomic.fna
real 0m1.065s
time ./cyber readfq.cy < GCA_013297495.1_ASM1329749v1_genomic.fna
real 2m24.335s
time ./cyber readfq2.cy < GCA_013297495.1_ASM1329749v1_genomic.fna
real 2m30.641s
Thanks for providing readfq2. It helped me narrow down the perf bottleneck quickly. readLine was meant for reading user input from the command line, not for bulk reads from stdin. For that reason, I deprecated readLine in favor of getInput. For bulk reads from stdin, you can now do the following in readfq2:
import os 'os'
--- minimal parse. don't use object or fastq
--- '@+>' is 64 / 43 / 62
func is_fastx(chr) bool:
    if chr == 64:
        return true
    if chr == 62:
        return true
    return false
n = 0
slen = 0
qlen = 0
for os.stdin.streamLines() as line:
    if is_fastx(line.charAt(0)):
        n += 1
    else:
        slen += line.len()
print 'There are {slen} bases from {n} records in this file.'
On my Linux machine, this is now twice as fast as the python3 version (there is still much room for improvement, but it's now a fairer comparison with regard to reading lines from stdin). The Python script does seem to be doing more work, though... I'm going to see which functions are missing and also flesh out more of the new File API.
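For a like-for-like baseline, here is a minimal Python sketch of the same counting logic. It is not the readfq.py used in the timings above; the function name `count_fastx` is made up for illustration, and it assumes line lengths should be counted without the trailing newline (whether Cyber's streamLines() strips the newline is not stated here, so totals may differ slightly).

```python
import io

def count_fastx(stream):
    """Count header lines and sequence bytes in a binary FASTA/FASTQ-like stream.

    Hypothetical helper: mirrors the minimal Cyber script, not readfq.py.
    """
    n = 0      # number of header lines starting with '@' or '>'
    slen = 0   # total length of all other lines
    for line in stream:
        line = line.rstrip(b"\r\n")   # assumption: newlines are not counted
        if not line:
            continue                  # skip blank lines instead of indexing into them
        if line[0] in (64, 62):       # 64 is '@', 62 is '>'
            n += 1
        else:
            slen += len(line)
    return n, slen

# Small in-memory demo; no expected output is claimed for real genome files.
demo = io.BytesIO(b">r1\nACGT\nAC\n>r2\nGG\n")
n, slen = count_fastx(demo)
print(f"There are {slen} bases from {n} records in this file.")
```

To run it on real input, pass `sys.stdin.buffer` instead of the in-memory stream. Reading in binary mode avoids text decoding, which keeps the comparison focused on line splitting.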
Boom shakalaka! Amazing work.
Note: if Cyber can compete favorably on these benchmarks, I think you might unlock a bioinformatics market segment.
# Same for me on MacOS!
time python3 readfq.py < GCA_013297495.1_ASM1329749v1_genomic.fna
There are 341540 records and 161512289 bases
real 0m0.898s
user 0m0.794s
sys 0m0.072s
time ./cyber readfq3.cy < GCA_013297495.1_ASM1329749v1_genomic.fna
There are 163709211 bases from 341540 records in this file.
real 0m0.393s
user 0m0.323s
sys 0m0.062s
I just made the same script even faster by using SIMD to find the newline character. You can also now pass a read buffer size to streamLines(). It defaults to 4096 bytes, but I've found that 4 MB works well for larger files. Between this and SIMD (mostly SIMD), I'm seeing almost another 2x performance gain.
It's also worth mentioning that the same SIMD technique is now available in string.indexChar().
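To illustrate the general idea (this is not Cyber's implementation), here is a rough Python sketch of a line streamer with a configurable read buffer. The name `stream_lines` and the `buf_size` parameter are hypothetical. It uses `bytes.find` to locate each newline, which runs in native code rather than a per-byte Python loop, standing in for the SIMD search.

```python
import io

def stream_lines(stream, buf_size=4096):
    """Yield lines without the trailing newline, reading buf_size bytes at a time.

    Hypothetical sketch of buffered line streaming; not Cyber's streamLines().
    """
    tail = b""                            # partial line carried over between reads
    while True:
        chunk = stream.read(buf_size)
        if not chunk:
            break                         # end of input
        buf = tail + chunk
        start = 0
        while True:
            end = buf.find(b"\n", start)  # native search for the next newline
            if end == -1:
                break                     # no complete line left in this buffer
            yield buf[start:end]
            start = end + 1
        tail = buf[start:]                # keep the unfinished line for the next read
    if tail:
        yield tail                        # final line with no trailing newline

# Demo with a deliberately tiny buffer so lines span several reads.
data = io.BytesIO(b">r1\nACGT\nAC\n>r2\nGG")
for line in stream_lines(data, buf_size=3):
    print(line)
```

A larger buffer, such as `4 * 1024 * 1024`, means fewer read calls and fewer partial lines carried across reads, which is likely where some of the gain on large files comes from.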