fasten
numcpus does not work
Need to add functionality for numcpus
@lskatz Would you mind sharing the method you're planning to use for multi-threading here?
(Also, cool repo :-) I really like the unixy pipe style of the one liners, and I've been thinking about making/adopting something similar to fasten but for amino-acid sequences)
I had ideas of making threads but never got around to it. I am open to getting help on it though! If/when we submit this to JOSS, I will be open to giving coauthorship to contributions like this!
And re: proteins, I think there shouldn't be any problem using it with protein sequences. Let me know!
Oh, and on the extent to which I was thinking of multithreading: I didn't think any particular Rust implementation would matter. I was just thinking of picking some smart number of reads per chunk (100k?) and letting each thread run its thing. The "thing" would depend on what the executable actually does.
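A minimal sketch of the chunking idea above, assuming plain four-line FASTQ input (no wrapped sequences). `Record` and `next_chunk` are hypothetical names used for illustration only; they are not taken from fasten's code.

```rust
use std::io::{self, BufRead};

/// One FASTQ record. Assumes the simple four-line layout:
/// header, sequence, "+" separator, quality.
#[derive(Debug, Clone, PartialEq)]
pub struct Record {
    pub id: String,
    pub seq: String,
    pub qual: String,
}

/// Read one line without its trailing newline; `None` at end of input.
fn read_trimmed<R: BufRead>(reader: &mut R) -> io::Result<Option<String>> {
    let mut line = String::new();
    if reader.read_line(&mut line)? == 0 {
        return Ok(None);
    }
    Ok(Some(line.trim_end().to_string()))
}

/// Read up to `reads_per_chunk` records from `reader`.
/// Returns an empty vector once the input is exhausted, so the
/// caller can loop until it sees an empty chunk. Works on pipes
/// because the input is only ever read once, front to back.
pub fn next_chunk<R: BufRead>(
    reader: &mut R,
    reads_per_chunk: usize,
) -> io::Result<Vec<Record>> {
    let mut chunk = Vec::new();
    while chunk.len() < reads_per_chunk {
        let id = match read_trimmed(&mut *reader)? {
            Some(line) => line,
            None => break, // end of input
        };
        let seq = read_trimmed(&mut *reader)?.unwrap_or_default();
        let _plus = read_trimmed(&mut *reader)?; // separator line, ignored
        let qual = read_trimmed(&mut *reader)?.unwrap_or_default();
        chunk.push(Record { id, seq, qual });
    }
    Ok(chunk)
}
```

Each chunk returned this way could then be handed to a thread, with the per-chunk work depending on the executable.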
I am open to getting help on it though! If/when we submit this to JOSS, I will be open to giving coauthorship to contributions like this!
Super! I'm a total novice in Rust but I'd be happy to help as much as I can.
re:re: proteins, much of the core functionality should work 'as is'. I started forking around fasten yesterday and I think the main TODOs are adding amino acid alphabet validation, disabling paired-end mode for the protein mode, and not trying to 'reverse complement' protein sequences. I could try to get to these if you wouldn't mind.
re: threading, that sounds good and straightforward; I think rayon would fit the bill. As for the number of reads per chunk: interesting. If the number of records/reads to be processed is known beforehand (not very fitting to Unix piping, but still), what do you think about splitting them into chunks more or less equally between (NUMCPUs - 1) threads? I'm assuming a bit here, like no overhead for picking the actual chunks, and as I noted, I'm really new to Rust, so sorry in advance if I misunderstood the actual inner workings of fasten.
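The equal split proposed above comes down to a little integer arithmetic. A hedged sketch using only the standard library (rayon is not used here); `split_evenly` is a hypothetical helper, and it assumes the total read count is already known, which is not the case for piped input.

```rust
use std::ops::Range;

/// Split `total_reads` into contiguous index ranges, one per worker,
/// using (numcpus - 1) workers so one CPU stays free (assumption: it
/// would be used for reading input). Always uses at least one worker.
/// Range lengths differ by at most one read.
pub fn split_evenly(total_reads: usize, numcpus: usize) -> Vec<Range<usize>> {
    let workers = numcpus.saturating_sub(1).max(1);
    let base = total_reads / workers; // every worker gets at least this many
    let extra = total_reads % workers; // the first `extra` workers get one more
    let mut ranges = Vec::with_capacity(workers);
    let mut start = 0;
    for i in 0..workers {
        let len = base + if i < extra { 1 } else { 0 };
        ranges.push(start..start + len);
        start += len;
    }
    ranges
}
```

For example, `split_evenly(10, 4)` uses three workers and yields the ranges `0..4`, `4..7`, and `7..10`.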
Unrelated: I think you forgot the 'L' in Chandler 😄 but at least it's a consistent typo (both in the pics dir and fasten_kmer help arg).
Thank you for catching those typos! I want to remove all the Friends references but keep finding more :)
Any PRs you have, I would appreciate and review! Your plan for multithreading is as good as any and I would love to see it :)
I think that reading a file twice however (once to see how many reads there are and then again to process) might be too much overhead and maybe even impossible with pipes, and so I would prefer chunks but I have an open mind if you can benchmark it.
Thanks for being so open! it's really refreshing :-)
Yeah, knowing how many reads need to be processed only to set the chunk size doesn't really justify reading the file twice. I was thinking of maybe checking the file size and, if the input format is indicative enough, estimating the number of records from that (but that won't work when the input is piped). I think your initial suggestion of 100k reads per chunk would work great. I guess it's better to have more chunks than threads, and just queue the chunks in memory until a thread has finished working on a chunk and is ready to process the next one. So, assuming most people (like me) would usually process a total of >1M reads using 6-8 threads, the chunks-to-threads ratio will be fine.
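The "queue the chunks until a thread is free" idea could look roughly like the sketch below. It uses only `std::thread` and `std::sync::mpsc` rather than rayon, and `process_chunks` is a hypothetical name, not fasten's API. Chunks are modeled as `Vec<String>` for brevity, and results are re-sorted by chunk index so the output order matches the input order.

```rust
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

/// Run `work` on every chunk using `numcpus` worker threads.
/// Chunks wait in a bounded queue until a worker is free.
pub fn process_chunks<I, F>(chunks: I, numcpus: usize, work: F) -> Vec<Vec<String>>
where
    I: IntoIterator<Item = Vec<String>>,
    F: Fn(Vec<String>) -> Vec<String> + Send + Sync + 'static,
{
    let numcpus = numcpus.max(1);
    // Bounded queue: the producer blocks once 2 * numcpus chunks are waiting,
    // which caps how many unprocessed chunks sit in memory.
    let (chunk_tx, chunk_rx) = mpsc::sync_channel::<(usize, Vec<String>)>(numcpus * 2);
    let chunk_rx = Arc::new(Mutex::new(chunk_rx));
    let (result_tx, result_rx) = mpsc::channel::<(usize, Vec<String>)>();
    let work = Arc::new(work);

    let handles: Vec<_> = (0..numcpus)
        .map(|_| {
            let chunk_rx = Arc::clone(&chunk_rx);
            let result_tx = result_tx.clone();
            let work = Arc::clone(&work);
            thread::spawn(move || loop {
                // Hold the lock only while taking the next chunk off the queue.
                let next = chunk_rx.lock().unwrap().recv();
                match next {
                    Ok((idx, chunk)) => {
                        result_tx.send((idx, work(chunk))).unwrap();
                    }
                    Err(_) => break, // queue closed and drained
                }
            })
        })
        .collect();
    drop(result_tx); // only the workers hold result senders now

    for (idx, chunk) in chunks.into_iter().enumerate() {
        chunk_tx.send((idx, chunk)).unwrap();
    }
    drop(chunk_tx); // signal "no more chunks" to the workers

    // Ends once every worker has exited and dropped its sender.
    let mut results: Vec<(usize, Vec<String>)> = result_rx.iter().collect();
    for handle in handles {
        handle.join().unwrap();
    }
    results.sort_by_key(|(idx, _)| *idx);
    results.into_iter().map(|(_, out)| out).collect()
}
```

With more chunks than threads, a slow chunk only delays the worker that holds it; the others keep pulling from the queue.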
It might be a while (month+) but I'll let you know if I get to actually making changes (I also want to read some more on MT in rust first).
Thanks again and have a happy new year!
I added a method to fasten_regex which seems to speed it up. I also added multithreading to fasten_trim, which actually slows it down, and so I added a warning to that script. Both additions are in the branch concurrency. For both, I did not listen to my own advice to @UriNeri 🙈
@UriNeri were you able to try adding multithreading on your end at all?
Local test on random but large fastq file:
for i in 1 2 4 8; do time zcat longtest2.fastq.gz | ./target/release/fasten_regex --numcpus $i --regex TTTTG > /dev/null; done;
real 0m11.189s
user 0m11.703s
sys 0m0.491s
real 0m7.382s
user 0m14.317s
sys 0m1.326s
real 0m6.365s
user 0m20.285s
sys 0m5.059s
real 0m5.865s
user 0m30.755s
sys 0m10.695s
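To read the timings above as speedups, one can divide the single-thread wall-clock time by each run's wall-clock time. The sketch below assumes the four `real` values correspond to `--numcpus` 1, 2, 4, and 8 in loop order; `speedup`, `REAL_SECS`, and `report` are illustrative names only.

```rust
/// Returns (speedup, parallel efficiency) for a run on `threads` threads,
/// relative to the single-thread wall-clock time.
pub fn speedup(baseline_secs: f64, secs: f64, threads: u32) -> (f64, f64) {
    let s = baseline_secs / secs;
    (s, s / f64::from(threads))
}

/// Wall-clock ("real") times from the run above, keyed by --numcpus.
/// Assumption: the outputs appear in the same order as the loop values.
pub const REAL_SECS: [(u32, f64); 4] =
    [(1, 11.189), (2, 7.382), (4, 6.365), (8, 5.865)];

/// One summary line per run.
pub fn report() -> Vec<String> {
    let baseline = REAL_SECS[0].1;
    REAL_SECS
        .iter()
        .map(|&(threads, secs)| {
            let (s, e) = speedup(baseline, secs, threads);
            format!("numcpus={threads}: speedup {s:.2}x, efficiency {e:.2}")
        })
        .collect()
}
```

Under that assumption, the speedup is roughly 1.5x at 2 threads and 1.9x at 8 threads, while `user` and `sys` time keep growing, so most of the gain arrives with the first extra thread.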
Hi @lskatz !
No, sorry, I haven't come up with anything usable =\
I started looking into two things: one was reading up on rayon, and the other was focused on the fasten_kmer code, converting some of the variables to chash maps. I got to that after my initial, naive attempt had workers in conflict, so that even if the time decreased, the output was bad (just plain wrong). I sort of left it after that attempt, but I did come across some potentially helpful crates, i.e. mapreduce via pipelines-rs.
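One way to avoid workers conflicting over a shared map, in the map-reduce spirit mentioned above, is to give each worker its own local `HashMap` and merge the partial counts at the end. This is only a sketch under stated assumptions: it uses `std::thread::scope` (Rust 1.63+) instead of rayon, chashmap, or pipelines-rs, it assumes ASCII sequences, and `count_kmers_in` and `count_kmers_parallel` are hypothetical names that do not reflect how fasten_kmer is actually written.

```rust
use std::collections::HashMap;
use std::thread;

/// Count k-mers in a slice of sequences on the current thread.
/// Assumes ASCII sequences, so byte offsets equal character offsets.
fn count_kmers_in(seqs: &[String], k: usize) -> HashMap<String, u64> {
    let mut counts = HashMap::new();
    for seq in seqs {
        if k == 0 || seq.len() < k {
            continue;
        }
        for i in 0..=seq.len() - k {
            *counts.entry(seq[i..i + k].to_string()).or_insert(0) += 1;
        }
    }
    counts
}

/// Map step: each worker counts its own slice into a private map,
/// so no locking is needed. Reduce step: the partial maps are
/// merged on the calling thread by summing the counts.
pub fn count_kmers_parallel(seqs: &[String], k: usize, numcpus: usize) -> HashMap<String, u64> {
    let numcpus = numcpus.max(1);
    // Ceiling division, so there are at most `numcpus` slices.
    let chunk_size = ((seqs.len() + numcpus - 1) / numcpus).max(1);

    let partials: Vec<HashMap<String, u64>> = thread::scope(|scope| {
        let handles: Vec<_> = seqs
            .chunks(chunk_size)
            .map(|chunk| scope.spawn(move || count_kmers_in(chunk, k)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });

    let mut total = HashMap::new();
    for partial in partials {
        for (kmer, n) in partial {
            *total.entry(kmer).or_insert(0) += n;
        }
    }
    total
}
```

Because addition is commutative, the merged counts do not depend on which worker finishes first, which sidesteps the wrong-output problem at the cost of holding one partial map per worker in memory.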
I'm still interested in seeing how you tackle this! The time vs. threads scaling for fasten_regex seems promising!