fasten numcpus does not work

Need to add functionality for numcpus

Apr 23 '20 14:04 lskatz

@lskatz Would you mind sharing the method you're planning to use for multi-threading here?
(Also, cool repo :-) I really like the unixy pipe style of the one liners, and I've been thinking about making/adopting something similar to fasten but for amino-acid sequences)

Dec 17 '20 18:12 UriNeri

I had ideas of making threads but never got around to it. I am open to getting help on it though! If/when we submit this to JOSS, I will be open to giving coauthorship to contributions like this!

Dec 17 '20 23:12 lskatz

And re: proteins, I think there shouldn't be any problem using it with protein sequences. Let me know!

Dec 17 '20 23:12 lskatz

Oh, and on the extent to which I was thinking of multithreading: I didn't think any particular Rust implementation would matter. I was just thinking of picking some smart number of reads per chunk (100k?) and letting each thread run its thing. The "thing" would depend on what the executable actually does.
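A minimal sketch of that chunking idea, using only the standard library. The names `process_read` and `process_in_chunks` are hypothetical and not part of fasten; `process_read` stands in for whatever a given executable does, and the reads are assumed to be already in memory.

```rust
use std::thread;

/// Hypothetical per-read operation; a stand-in for the real work of an executable.
fn process_read(read: &str) -> usize {
    read.len()
}

/// Split `reads` into chunks of `chunk_size` and spread the chunks over
/// `numcpus` threads. Results are returned in the original read order.
fn process_in_chunks(reads: &[String], chunk_size: usize, numcpus: usize) -> Vec<usize> {
    assert!(chunk_size > 0 && numcpus > 0);
    let chunks: Vec<&[String]> = reads.chunks(chunk_size).collect();
    let mut per_chunk: Vec<Vec<usize>> = vec![Vec::new(); chunks.len()];

    thread::scope(|s| {
        let handles: Vec<_> = (0..numcpus)
            .map(|worker| {
                let chunks = &chunks;
                // Worker `w` takes chunks w, w + numcpus, w + 2 * numcpus, ...
                s.spawn(move || {
                    chunks
                        .iter()
                        .enumerate()
                        .skip(worker)
                        .step_by(numcpus)
                        .map(|(i, chunk)| {
                            let out: Vec<usize> = chunk.iter().map(|r| process_read(r)).collect();
                            (i, out)
                        })
                        .collect::<Vec<_>>()
                })
            })
            .collect();

        // Put each chunk's output back at its original position.
        for handle in handles {
            for (i, out) in handle.join().unwrap() {
                per_chunk[i] = out;
            }
        }
    });

    per_chunk.into_iter().flatten().collect()
}

fn main() {
    let reads: Vec<String> = ["ACGT", "AC", "A", "ACGTA", "AAA"]
        .iter()
        .map(|s| s.to_string())
        .collect();
    let lengths = process_in_chunks(&reads, 2, 3);
    println!("{lengths:?}");
}
```

Keeping the chunk index next to each result is what preserves output order, which matters for tools whose output is piped onward.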

Dec 17 '20 23:12 lskatz

> I am open to getting help on it though! If/when we submit this to JOSS, I will be open to giving coauthorship to contributions like this!

Super! I'm a total novice in Rust but I'd be happy to help as much as I can.

re:re: proteins, much of the core functionality should work 'as is'. I started forking around fasten yesterday and I think the main TODOs are adding amino acid alphabet validation, disabling paired-end mode for the protein mode, and not trying to 'reverse complement' protein sequences. I could try to get to these if you wouldn't mind.
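As a rough illustration of the alphabet-validation TODO, here is a sketch under stated assumptions: the function name `is_valid_protein` and the `allow_extended` switch are hypothetical, not existing fasten options, and the accepted alphabet is the 20 standard one-letter codes plus, optionally, `X` (unknown residue) and `*` (stop).

```rust
/// The 20 standard amino-acid one-letter codes.
const STANDARD_AA: &[u8] = b"ACDEFGHIKLMNPQRSTVWY";

/// Returns true if `seq` is non-empty and every character is an accepted
/// amino-acid code (case-insensitive). With `allow_extended`, `X` and `*`
/// are accepted as well; that choice is an assumption of this sketch.
fn is_valid_protein(seq: &str, allow_extended: bool) -> bool {
    !seq.is_empty()
        && seq.bytes().all(|b| {
            let up = b.to_ascii_uppercase();
            STANDARD_AA.contains(&up) || (allow_extended && (up == b'X' || up == b'*'))
        })
}

fn main() {
    for seq in ["MKVLA", "mkvla", "MKXV*", "MK1V"] {
        println!("{seq}: strict={} extended={}",
                 is_valid_protein(seq, false),
                 is_valid_protein(seq, true));
    }
}
```

Note that `ACGT` also passes this check, because A, C, G and T are all valid amino-acid codes, so validation alone cannot tell nucleotide input from protein input; an explicit protein mode would still be needed.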

re: threading, that sounds good and straightforward, and I think rayon would fit the bill. As for the number of reads per chunk - interesting. If the number of records/reads to be processed is known beforehand (not very fitting to Unix piping, but still), what do you think about splitting the reads more or less equally into (NUMCPUs - 1) chunks? I'm assuming a bit here, like no overhead for picking the actual chunks, and as I noted, I'm really new to Rust, so sorry in advance if I misunderstood the actual inner workings of fasten.
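To make the equal-split idea concrete, a small sketch of the arithmetic, assuming the total read count is already known. `even_chunk_sizes` is a hypothetical helper, and it uses plain integer math rather than rayon.

```rust
/// Sizes of `workers` chunks that together cover `total_reads`,
/// differing from each other by at most one read.
fn even_chunk_sizes(total_reads: usize, workers: usize) -> Vec<usize> {
    assert!(workers > 0, "need at least one worker");
    let base = total_reads / workers;
    let extra = total_reads % workers;
    // The first `extra` chunks each take one additional read.
    (0..workers).map(|i| base + usize::from(i < extra)).collect()
}

fn main() {
    let numcpus = 8;
    // Leave one CPU free, as suggested above.
    let workers = numcpus - 1;
    let sizes = even_chunk_sizes(1_000_000, workers);
    println!("{sizes:?}");
}
```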

Unrelated: I think you forgot the 'L' in Chandler 😄 but at least it's a consistent typo (both in the pics dir and fasten_kmer help arg).

Dec 18 '20 11:12 UriNeri

Thank you for catching those typos! I want to remove all the Friends references but keep finding more :)

Any PRs you have, I would appreciate and review! Your plan for multithreading is as good as any and I would love to see it :)

I think that reading a file twice however (once to see how many reads there are and then again to process) might be too much overhead and maybe even impossible with pipes, and so I would prefer chunks but I have an open mind if you can benchmark it.
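A sketch of what single-pass chunking could look like on piped input. It assumes strict four-line FASTQ records (no wrapped sequence lines); `next_chunk` is a hypothetical helper, not a function from fasten.

```rust
use std::io::{self, BufRead};

/// Read up to `reads_per_chunk` four-line FASTQ records from `input`.
/// Returns an empty vector once the input is exhausted, so the stream is
/// only ever read once and works with pipes.
fn next_chunk<R: BufRead>(input: &mut R, reads_per_chunk: usize) -> io::Result<Vec<[String; 4]>> {
    let mut chunk = Vec::new();
    while chunk.len() < reads_per_chunk {
        let mut record: [String; 4] = Default::default();
        for (i, line) in record.iter_mut().enumerate() {
            if input.read_line(line)? == 0 {
                if i == 0 {
                    // Clean end of input between records.
                    return Ok(chunk);
                }
                return Err(io::Error::new(
                    io::ErrorKind::UnexpectedEof,
                    "truncated FASTQ record",
                ));
            }
            // Strip the trailing newline (and carriage return, if any).
            while line.ends_with('\n') || line.ends_with('\r') {
                line.pop();
            }
        }
        chunk.push(record);
    }
    Ok(chunk)
}

fn main() -> io::Result<()> {
    let stdin = io::stdin();
    let mut input = stdin.lock();
    let mut n_chunks = 0;
    loop {
        // 100k reads per chunk, as floated earlier in the thread.
        let chunk = next_chunk(&mut input, 100_000)?;
        if chunk.is_empty() {
            break;
        }
        n_chunks += 1;
        eprintln!("chunk {n_chunks}: {} reads", chunk.len());
    }
    Ok(())
}
```

Because the reader never needs the total read count, it behaves the same for a file and for `zcat ... |` input.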

Jan 04 '21 14:01 lskatz

Thanks for being so open! It's really refreshing :-)

Yeah, knowing how many reads need to be processed, only to set the chunk size, doesn't really justify reading the file twice. I was thinking of maybe checking the file size and, if the input format is indicative enough, estimating the number of records from that (but that won't work when the input is piped). I think your initial suggestion of 100k reads per chunk would work great - I guess it's better to have more chunks than threads, and just queue the chunks in memory until a thread has finished working on a chunk and is ready to process the next one. So, assuming most people (like me) would usually process a total of >1Mil reads using 6-8 threads, the chunks-to-threads ratio will be fine.
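The "queue the chunks until a thread is free" part could be sketched with a bounded channel from the standard library. Everything here is illustrative: `count_bases_queued` and `queue_len` are hypothetical names, and counting bases stands in for the real per-chunk work.

```rust
use std::sync::mpsc::sync_channel;
use std::sync::{Arc, Mutex};
use std::thread;

/// Feed `chunks` through a bounded queue to `numcpus` worker threads and
/// return the total number of bases seen. At most `queue_len` chunks wait
/// in memory; the producer blocks while the queue is full.
fn count_bases_queued(chunks: Vec<Vec<String>>, numcpus: usize, queue_len: usize) -> usize {
    assert!(numcpus > 0, "need at least one worker");
    let (tx, rx) = sync_channel::<Vec<String>>(queue_len);
    // std's Receiver cannot be cloned, so the workers share it behind a mutex.
    let rx = Arc::new(Mutex::new(rx));

    let workers: Vec<_> = (0..numcpus)
        .map(|_| {
            let rx = Arc::clone(&rx);
            thread::spawn(move || {
                let mut total = 0usize;
                loop {
                    // The lock is released at the end of this statement,
                    // before the chunk is processed.
                    let msg = rx.lock().unwrap().recv();
                    match msg {
                        Ok(chunk) => total += chunk.iter().map(|r| r.len()).sum::<usize>(),
                        Err(_) => break, // sender dropped: no more chunks
                    }
                }
                total
            })
        })
        .collect();

    for chunk in chunks {
        tx.send(chunk).unwrap();
    }
    drop(tx); // signal end of input to the workers

    workers.into_iter().map(|w| w.join().unwrap()).sum()
}

fn main() {
    let chunks = vec![
        vec!["ACGT".to_string(), "AC".to_string()],
        vec!["A".to_string()],
        vec!["GGG".to_string()],
    ];
    println!("{} bases", count_bases_queued(chunks, 2, 4));
}
```

A free worker picks up the next chunk as soon as it finishes the previous one, so uneven chunk costs balance out without knowing the read count in advance.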

It might be a while (month+) but I'll let you know if I get to actually making changes (I also want to read some more on MT in rust first).

Thanks again and have a happy new year!

Jan 04 '21 15:01 UriNeri

I added a method to fasten_regex which seems to speed it up. I also added multithreading to fasten_trim which actually slows it down and so I added a warning to that script. Both additions are in the branch concurrency. For both, I did not listen to my own advice to @UriNeri :see_no_evil:

@UriNeri were you able to try adding multithreading on your end at all?

Jun 22 '21 02:06 lskatz

Local test on a random but large fastq file:

 for i in 1 2 4 8; do time zcat longtest2.fastq.gz | ./target/release/fasten_regex --numcpus $i --regex TTTTG > /dev/null; done;

  real	0m11.189s
  user	0m11.703s
  sys	0m0.491s
  
  real	0m7.382s
  user	0m14.317s
  sys	0m1.326s
  
  real	0m6.365s
  user	0m20.285s
  sys	0m5.059s
  
  real	0m5.865s
  user	0m30.755s
  sys	0m10.695s
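
For reference, the scaling implied by the `real` times above can be worked out as follows. The timings are copied from this test; `scaling` is just a hypothetical helper for the arithmetic. Relative to one thread, the wall-clock speedup is roughly 1.5x at 2 threads, 1.8x at 4, and 1.9x at 8, while `user` time grows from about 11.7 s to about 30.8 s.

```rust
/// Speedup relative to the single-thread run, and speedup per thread
/// (parallel efficiency).
fn scaling(baseline_secs: f64, secs: f64, threads: u32) -> (f64, f64) {
    let speedup = baseline_secs / secs;
    (speedup, speedup / f64::from(threads))
}

fn main() {
    // (threads, wall-clock seconds) taken from the `real` lines above.
    let runs = [(1u32, 11.189), (2, 7.382), (4, 6.365), (8, 5.865)];
    let baseline = runs[0].1;
    for (threads, secs) in runs {
        let (speedup, efficiency) = scaling(baseline, secs, threads);
        println!(
            "{threads} threads: {speedup:.2}x speedup, {:.0}% efficiency",
            efficiency * 100.0
        );
    }
}
```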

Jun 22 '21 02:06 lskatz

Hi @lskatz !
No, sorry, I haven't come up with anything usable =\
I started looking into two things: one was reading up on rayon, and the other was focused on the fasten_kmer code, converting some of the variables to chash maps. I got to that after my initial, naive attempt had workers in conflict, so that even if the run time decreased, the output was bad (just plain wrong). I sort of left it after that attempt, but I did come across some potentially helpful crates, e.g. mapreduce via pipelines-rs.
I'm still interested in seeing how you tackle this! The time X threads scaling for fasten_regex seems promising!
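One way to avoid workers conflicting over a shared k-mer table is to give each thread its own map and merge the maps afterwards (a map-reduce shape). The sketch below uses only the standard library rather than rayon, chashmap, or pipelines-rs, and `count_kmers` and `parallel_kmer_counts` are hypothetical names, not the fasten_kmer code.

```rust
use std::collections::HashMap;
use std::thread;

/// Add the k-mers of one read to `counts`.
fn count_kmers(seq: &str, k: usize, counts: &mut HashMap<String, u64>) {
    if k == 0 || seq.len() < k {
        return;
    }
    for window in seq.as_bytes().windows(k) {
        let kmer = String::from_utf8_lossy(window).into_owned();
        *counts.entry(kmer).or_insert(0) += 1;
    }
}

/// Count k-mers with up to `numcpus` threads. Each thread fills a private
/// map (no shared mutable state), and the maps are summed at the end.
fn parallel_kmer_counts(reads: &[String], k: usize, numcpus: usize) -> HashMap<String, u64> {
    assert!(numcpus > 0, "need at least one worker");
    // Ceiling division, so at most `numcpus` slices are created.
    let per_worker = ((reads.len() + numcpus - 1) / numcpus).max(1);

    let partials: Vec<HashMap<String, u64>> = thread::scope(|s| {
        let handles: Vec<_> = reads
            .chunks(per_worker)
            .map(|slice| {
                s.spawn(move || {
                    let mut local = HashMap::new();
                    for read in slice {
                        count_kmers(read, k, &mut local);
                    }
                    local
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });

    // Reduce step: merge the per-thread maps on the main thread.
    let mut merged = HashMap::new();
    for partial in partials {
        for (kmer, n) in partial {
            *merged.entry(kmer).or_insert(0) += n;
        }
    }
    merged
}

fn main() {
    let reads = vec!["ACGTAC".to_string(), "ACG".to_string()];
    let counts = parallel_kmer_counts(&reads, 3, 2);
    println!("{counts:?}");
}
```

Since addition is order-independent, the merged counts match a single-threaded run regardless of how the reads are split.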

Jun 22 '21 08:06 UriNeri
