Macs peak calling
Description
Add macs2/macs3 peak calling
Utilize a multi-threaded approach separated by cluster within prep_macs_inputs(). Allow users to choose a specific step to run (ie "prep-inputs", "run-macs", "read-outputs") to allow for executing macs calls in external slurm cluster.
Also change behaviour in InsertionIterator to not keep insertions in memory after moving to next fragment.
Tests
- Added test cases for prep_macs() parameterized by cluster existence, threading, and insertion_mode
Here's some initial comments I came up with from a read-through. Overall looks like a pretty good structure and approach. I'm focusing on some higher-level comments here.
There are also several small things that I think could be tweaked around choices of naming, docs, default arguments, organization etc. Rather than dictate solutions to those while you're still drafting, I'll let you finish up and do your own polishing pass before we discuss those.
C++ comments:
- I'm pretty sure that
writeInsertionBedByPseudobulkshould write out duplicate coordinates rather than explicitly skipping printing repeats - Once we get a chance to do a performance profile, we should check if
gzprintfis taking up a bunch of time (and comparing with/without a.gzfile ending to isolate if the compression or theprintfpart is slow)- We might also want to take a quick look to see if
macs3can support any other kinds of compressed input that would be faster to write
- We might also want to take a quick look to see if
- We will definitely need to debug the memory issue you identified in
InsertionIteratoras well
R comments
- With the option to split up execution using
step, I'm a bit concerned about errors that might happen if users edit the input folder between steps, or try calling a later step with arguments that are mismatched/incompatible with their earlier step- This might mean making
cell_namesnon-nullable, and maybe evenfragments, though that might not be worth it given the usability trade-off.
- This might mean making
prep-inputsshould probably write the macs command to a.shfile (one-per-cluster). Also returning the commands is as a vector is fine, but later steps could then read from the.shfiles to simplify- Also we usually have a few more arguments present, including e.g.
--shift. You might want to check the ArchR source a bit more for that as I think arguments are listed in at least two separate places
- Also we usually have a few more arguments present, including e.g.
- For
mclapplywe probably also want to always use the argumentmc.preschedule=FALSE
There are a few more tests to be added that directly compare writing insertions from fragments in ArchR vs BPCells. I might just copy the output files to eliminate the dependence of building an entire ArchR project.
Otherwise, I think I've addressed most of your comments, and the feature is in a relatively polished state.
Looking pretty good! Most remaining comments are style-related, though I did end up with a large number of those in call_macs_peaks(). Trying out a bulleted list here in rough order of the code rather than using Github's inline comment feature since it's a bit easier for me to prep.
Good job setting up the priority queue InsertionIterator -- I'm fine leaving it as a stopgap for this PR but I'd really like to get the radix sort algorithm fixed up and running soon after we merge this.
InsertionIterator
- Very cool that you were able to compare performance between the priority queue and the sorting version!
- Minor style note but
*(starts + i)is the same asstarts[i]
writeInsertionBedByPseudobulk()
- Given that this writes out just one pseudobulk, I'm not sure why both the
cellsandgroup_of_interestarguments are needed. Arguably, you could even just pass infragmentsto already be subset to just the cells of interest from the R side withselect_cells()- At least filtering prior to
InsertionIteratorwill save some sorting work
- At least filtering prior to
- Unclear if
keep_dupsargument is needed (more discussion below). This also might eliminate the need for thecleanup at end of chromosomesection - Error messages shouldn't be prefixed with
writeBedgraph:anymore - I'd suggest just calling this function just
writeInsertionBedfor simplicity. As a cleanup task I can editwriteInsertionBedgraphto just write one pseudobulk at a time anyhow as I think that will make the most sense - Either this file name will need an update or this function should go into its own file because it doesn't handle
Bedgraphfiles. (I'd be happy to rename to maybeBedWritersince bedgraph is kinda just another type of bed file)
prep_macs_inputs()
- Does this need to be a separate exported function? Seems like it could mostly slot in to
call_macs_peaksor at least be private
call_macs_peaks()
- Make docs and parameter names consistent with
call_peaks_tile(cell_groups and effective_genome_size arguments). Can use@inheritParamsto save on re-typing docs if the text will be identical.- You might also consider copying the default initialization of
cell_groups, though I'm not sure how that would work if we allowfragmentsto be NULL
- You might also consider copying the default initialization of
pathdocs. Put directory structure in@detailssection. Specify inputs are<path>/input/<cluster>.bed(.gz)and shell commands are<path>/input/<cluster>.shinsertion_modedocs don't quite read naturally. If you want consistency withwrite_insertion_bedgraphI'd update both to say maybe "for coverage calculation" rather than "for insertion counts calculation"- Do we need the
keep_dupsoption? I was just looking through the ArchR code and I don't think it would ever de-duplicate (and as mentioned previously I think it's likely undesirable to remove duplicate insertion sites in a peak calling context). Though if your testing indicates otherwise we can discuss stepargument should default toallmacs_versioncould probably bemacs_executableinstead, and allow the user to pass a full path if desired. (And I'd say default to macs3 at this point, or make the default NULL and try to auto-detect macs2 vs macs3 with testsystem()calls)- Either way, we probably want to check if macs can be run before prepping any inputs in the case of
step = all
- Either way, we probably want to check if macs can be run before prepping any inputs in the case of
additionalParamsshould be snake case like you have it in the docs. It also might be nice if the name made it clear that these are parameters getting passed to MACS- Still don't have a
--shiftargument being passed, which might make sense to add here
- Still don't have a
quiet: If we multi-threadrun-macs, then we might want to write the macs stdout/stderr to a log file inoutputs/and have this just print when we start writing each input file and when we start running each copy of macs (maybe also printing when each finishes?). Also, for consistency this should flip and be namedverboseuse_gzdocs would be helpful if they could give some absolute file sizes + time to write with /without compression for a small-ish example dataset (e.g. one of the public 10x pbmc datasets)threadsargument incall_macs_peaksshould probably also be used in therun-macssection- Return value should be a single data frame, with an extra column specifying the group that the cluster was called in. Try to place the group column early in the ordering (probably either first or right before name?). A list of data frames is a little awkward to deal with as output
- Will the
system2call work properly when there are spaces in cluster names? Might want to add some quoting tomacs_call_template. Also maybe get rid of any/character in a cluster names? - In docs,
bedfileshould be split up as two words for style
tests
- Looks reasonable to me. I like your tests that the insertion bed file has the right contents. If
prep_macs_inputsstops being a public/separate function, you could adjust to just callcall_macs_peaks(step="prep-inputs")in that test
I think I'm happy with the current state, but please let me know if you see any more glaring changes I should be making. It will be a little bit of a journey attempting to extract insertion beds from ArchR, while removing the hardwired pseudobulk replicates and tn5 bias steps. To make sure I can continue progressing, I suggest that I table creating a testcase from comparing outputs from ArchR. Getting ArchR to skip those steps will require a decent amount of software engineering time by itself. Would that be reasonable?
Ok it should be ready! I added in the test comparing Archr and BPCells results. The only change required is setting devtools::load_all() to target your bpcells/archr install.
Overall, I like how this is going and fundamentally it's quite close to being ready. A lot of the details I asked for from the last round are fixed. I like the ordering you chose for the call_macs_peaks parameters, and making write_insertion_bed have a matching interface to write_insertion_bedgraph. The performance testing and comparison against ArchR outputs I imagine were tricky and time-consuming, but good job on them.
There are several more detail-oriented issues remaining though, several from my last round of suggestions which were either forgotten or only partially addressed. I've also included a few new ones that came up as I was able to try out the function more.
Sorry for having taken so much of the week to get back to you on this, I hope it hasn't been too much of a blocker. I'd be happy to do a call early next week to discuss any questions/clarification about this round of comments.
Follow-ups still to be addressed from last comment round
writeInsertionBed
- After removing the
keep_dupsargument, there's no reason to have the variablelast_baseand hence the end-of-chromosome cleanup step.
call_macs_peaks()
pathdocs in details section.- For input files, I'd describe contents/purpose of the files. E.g just "shell commands" doesn't say what they are for. "Bed files" doesn't explain what the contents are.
- For outputs, for the files BPCells doesn't use, we can just link to the macs3 documentation on output files.
- Very useful to say we only read the
narrowPeaksfile, but you need an extra line break so it doesn't get swallowed into the last bullet when rendered as a webpage - In the docs for the
pathargument, you should add a (see Details for directory structure) or equivalent message
macs_versionwas not renamed in the function declaration, only in the docs. Also, usingmatch.argmeans that it's not possible for the user to pass a custom path for a macs executableuse_gzdocs would be helpful to give numbers for the time trade-off as well. Most straightforward would be to give both time + input folder size for with and without compression.threadsargument incall_macs_peaksshould probably also be used in therun-macssection- Return value "cluster" column would ideally not be the last column
- Will the
system2call work properly when there are spaces in cluster names? Might want to add some quoting tomacs_call_template. Also maybe get rid of any/character in a cluster names?
New comments:
Verbose logging setup:
-
log_messagevariable will not be defined in "run-macs" section if run standalone (and I also don't think it's necessary to include that information in the log file) -
In
write_insertion_bed(), the logging should be inside themclapplycall so it prints at the correct time. (Despite the theoretical race condition, I think this basically always works out fine even in very tight loops)-
The style of log you use in
run-macsis better also, rather than printing out cluster names on individual lines -
Example of the current (sub-optimal) output
[1] "2024-08-23 16:10:15.508477 Writing bed file for clusters:" [1] "consonant" [1] "vowel" [1] "2024-08-23 16:10:15.532154 Finished writing bed files"
-
-
I'd suggest
format(Sys.time())to avoid having the fractional seconds get printed -
I think using
message()rather thanprint()will be better to avoid the[1]prefixes getting printed
run-macs and error checking:
-
macs_executablenow has a slightly weird combination of both auto-detect and a default value.I'd say the best option is to default NULL, and in the case of NULL we can do your auto-detect procedure. If not null, we just check that
macscan be found else we error out- Also, the existing
stop_runandlog_messagevariables can probably be eliminated since I don't think there's a need / easy way to hang on tolog_messagefor use in later steps.
- Also, the existing
-
The existing check on
macs_executableshould check the output ofmacs3 --versionto make sure it contains the textmacs-- otherwise the check will pass for almost any executable even if it's not macs -
In
run-macs, it's probably better forsystem2to write stdout/stderr directly to the log file so that logs are written as the command is running rather than all at the end. It's OK if this prevents writing the command as it appears MACS already prints the input command at the top of the log file -
Would be good to check in
run-macsthat the command ran without error (had a 0 exit code)
Other:
- The test for
macs_e2e_worksnow introduces a system dependency on havingmacs2installed. Can we make it skip instead ifmacsis not installed? --call-summitsprobably can go in--additional-params, as it seems to just affect the peak calling method used by macs while leaving the output file format consistent- in
write_insertion_bed,cell_groupsshould probably come afterpathfor consistency withwrite_insertion_bedgraph, andcell_groupsshould maybe default to NULL for consistency.
Questions/discussion areas (can do via call/slack)
- Is there a better name possible for
use_gz? I'm just realizing the name doesn't really give any hints about what is getting compressed, so it would be nice to have a name that would be more specific to the macs bed insertion inputs. Let me know if you have a suggestion here - What do you think of the adjusting column names/values in the output? Reading the [UCSC narrowPeak format docs](https://genome.ucsc.edu/FAQ/FAQformat.html#format12), I think it's kinda weird that
pValueis given as -log10(p), and peak/pointSource is given relative tostart. On the one hand it's nice to maintain a direct correspondence with the macs output, but some parts of it are kind of non-intuitive