BPCells icon indicating copy to clipboard operation
BPCells copied to clipboard

Macs peak calling

Open immanuelazn opened this issue 1 year ago • 3 comments

Description

Add macs2/macs3 peak calling Utilize a multi-threaded approach separated by cluster within prep_macs_inputs(). Allow users to choose a specific step to run (ie "prep-inputs", "run-macs", "read-outputs") to allow for executing macs calls in external slurm cluster.
Also change behaviour in InsertionIterator to not keep insertions in memory after moving to next fragment.

Tests

  • Added test cases for prep_macs() parameterized by cluster existence, threading, and insertion_mode

immanuelazn avatar Aug 13 '24 07:08 immanuelazn

Here's some initial comments I came up with from a read-through. Overall looks like a pretty good structure and approach. I'm focusing on some higher-level comments here.

There are also several small things that I think could be tweaked around choices of naming, docs, default arguments, organization etc. Rather than dictate solutions to those while you're still drafting, I'll let you finish up and do your own polishing pass before we discuss those.

C++ comments:

  • I'm pretty sure that writeInsertionBedByPseudobulk should write out duplicate coordinates rather than explicitly skipping printing repeats
  • Once we get a chance to do a performance profile, we should check if gzprintf is taking up a bunch of time (and comparing with/without a .gz file ending to isolate if the compression or the printf part is slow)
    • We might also want to take a quick look to see if macs3 can support any other kinds of compressed input that would be faster to write
  • We will definitely need to debug the memory issue you identified in InsertionIterator as well

R comments

  • With the option to split up execution using step, I'm a bit concerned about errors that might happen if users edit the input folder between steps, or try calling a later step with arguments that are mismatched/incompatible with their earlier step
    • This might mean making cell_names non-nullable, and maybe even fragments, though that might not be worth it given the usability trade-off.
  • prep-inputs should probably write the macs command to a .sh file (one-per-cluster). Also returning the commands is as a vector is fine, but later steps could then read from the .sh files to simplify
    • Also we usually have a few more arguments present, including e.g. --shift. You might want to check the ArchR source a bit more for that as I think arguments are listed in at least two separate places
  • For mclapply we probably also want to always use the argument mc.preschedule=FALSE

bnprks avatar Aug 13 '24 23:08 bnprks

There are a few more tests to be added that directly compare writing insertions from fragments in ArchR vs BPCells. I might just copy the output files to eliminate the dependence of building an entire ArchR project.
Otherwise, I think I've addressed most of your comments, and the feature is in a relatively polished state.

immanuelazn avatar Aug 16 '24 04:08 immanuelazn

Looking pretty good! Most remaining comments are style-related, though I did end up with a large number of those in call_macs_peaks(). Trying out a bulleted list here in rough order of the code rather than using Github's inline comment feature since it's a bit easier for me to prep.

Good job setting up the priority queue InsertionIterator -- I'm fine leaving it as a stopgap for this PR but I'd really like to get the radix sort algorithm fixed up and running soon after we merge this.

InsertionIterator
  • Very cool that you were able to compare performance between the priority queue and the sorting version!
  • Minor style note but *(starts + i) is the same as starts[i]
writeInsertionBedByPseudobulk()
  • Given that this writes out just one pseudobulk, I'm not sure why both the cells and group_of_interest arguments are needed. Arguably, you could even just pass in fragments to already be subset to just the cells of interest from the R side with select_cells()
    • At least filtering prior to InsertionIterator will save some sorting work
  • Unclear if keep_dups argument is needed (more discussion below). This also might eliminate the need for the cleanup at end of chromosome section
  • Error messages shouldn't be prefixed with writeBedgraph: anymore
  • I'd suggest just calling this function just writeInsertionBed for simplicity. As a cleanup task I can edit writeInsertionBedgraph to just write one pseudobulk at a time anyhow as I think that will make the most sense
  • Either this file name will need an update or this function should go into its own file because it doesn't handle Bedgraph files. (I'd be happy to rename to maybe BedWriter since bedgraph is kinda just another type of bed file)
prep_macs_inputs()
  • Does this need to be a separate exported function? Seems like it could mostly slot in to call_macs_peaks or at least be private
call_macs_peaks()
  1. Make docs and parameter names consistent with call_peaks_tile (cell_groups and effective_genome_size arguments). Can use @inheritParams to save on re-typing docs if the text will be identical.
    • You might also consider copying the default initialization of cell_groups, though I'm not sure how that would work if we allow fragments to be NULL
  2. path docs. Put directory structure in @details section. Specify inputs are <path>/input/<cluster>.bed(.gz) and shell commands are <path>/input/<cluster>.sh
  3. insertion_mode docs don't quite read naturally. If you want consistency with write_insertion_bedgraph I'd update both to say maybe "for coverage calculation" rather than "for insertion counts calculation"
  4. Do we need the keep_dups option? I was just looking through the ArchR code and I don't think it would ever de-duplicate (and as mentioned previously I think it's likely undesirable to remove duplicate insertion sites in a peak calling context). Though if your testing indicates otherwise we can discuss
  5. step argument should default to all
  6. macs_version could probably be macs_executable instead, and allow the user to pass a full path if desired. (And I'd say default to macs3 at this point, or make the default NULL and try to auto-detect macs2 vs macs3 with test system() calls)
    • Either way, we probably want to check if macs can be run before prepping any inputs in the case of step = all
  7. additionalParams should be snake case like you have it in the docs. It also might be nice if the name made it clear that these are parameters getting passed to MACS
    • Still don't have a --shift argument being passed, which might make sense to add here
  8. quiet: If we multi-thread run-macs, then we might want to write the macs stdout/stderr to a log file in outputs/ and have this just print when we start writing each input file and when we start running each copy of macs (maybe also printing when each finishes?). Also, for consistency this should flip and be named verbose
  9. use_gz docs would be helpful if they could give some absolute file sizes + time to write with /without compression for a small-ish example dataset (e.g. one of the public 10x pbmc datasets)
  10. threads argument in call_macs_peaks should probably also be used in the run-macs section
  11. Return value should be a single data frame, with an extra column specifying the group that the cluster was called in. Try to place the group column early in the ordering (probably either first or right before name?). A list of data frames is a little awkward to deal with as output
  12. Will the system2 call work properly when there are spaces in cluster names? Might want to add some quoting to macs_call_template. Also maybe get rid of any / character in a cluster names?
  13. In docs, bedfile should be split up as two words for style
tests
  • Looks reasonable to me. I like your tests that the insertion bed file has the right contents. If prep_macs_inputs stops being a public/separate function, you could adjust to just call call_macs_peaks(step="prep-inputs") in that test

bnprks avatar Aug 16 '24 08:08 bnprks

I think I'm happy with the current state, but please let me know if you see any more glaring changes I should be making. It will be a little bit of a journey attempting to extract insertion beds from ArchR, while removing the hardwired pseudobulk replicates and tn5 bias steps. To make sure I can continue progressing, I suggest that I table creating a testcase from comparing outputs from ArchR. Getting ArchR to skip those steps will require a decent amount of software engineering time by itself. Would that be reasonable?

immanuelazn avatar Aug 20 '24 08:08 immanuelazn

Ok it should be ready! I added in the test comparing Archr and BPCells results. The only change required is setting devtools::load_all() to target your bpcells/archr install.

immanuelazn avatar Aug 22 '24 00:08 immanuelazn

Overall, I like how this is going and fundamentally it's quite close to being ready. A lot of the details I asked for from the last round are fixed. I like the ordering you chose for the call_macs_peaks parameters, and making write_insertion_bed have a matching interface to write_insertion_bedgraph. The performance testing and comparison against ArchR outputs I imagine were tricky and time-consuming, but good job on them.

There are several more detail-oriented issues remaining though, several from my last round of suggestions which were either forgotten or only partially addressed. I've also included a few new ones that came up as I was able to try out the function more.

Sorry for having taken so much of the week to get back to you on this, I hope it hasn't been too much of a blocker. I'd be happy to do a call early next week to discuss any questions/clarification about this round of comments.

Follow-ups still to be addressed from last comment round

writeInsertionBed

  • After removing the keep_dups argument, there's no reason to have the variable last_base and hence the end-of-chromosome cleanup step.

call_macs_peaks()

  • path docs in details section.
    • For input files, I'd describe contents/purpose of the files. E.g just "shell commands" doesn't say what they are for. "Bed files" doesn't explain what the contents are.
    • For outputs, for the files BPCells doesn't use, we can just link to the macs3 documentation on output files.
    • Very useful to say we only read the narrowPeaks file, but you need an extra line break so it doesn't get swallowed into the last bullet when rendered as a webpage
    • In the docs for the path argument, you should add a (see Details for directory structure) or equivalent message
  • macs_version was not renamed in the function declaration, only in the docs. Also, using match.arg means that it's not possible for the user to pass a custom path for a macs executable
  • use_gz docs would be helpful to give numbers for the time trade-off as well. Most straightforward would be to give both time + input folder size for with and without compression.
  • threads argument in call_macs_peaks should probably also be used in the run-macs section
  • Return value "cluster" column would ideally not be the last column
  • Will the system2 call work properly when there are spaces in cluster names? Might want to add some quoting to macs_call_template. Also maybe get rid of any / character in a cluster names?

New comments:

Verbose logging setup:
  • log_message variable will not be defined in "run-macs" section if run standalone (and I also don't think it's necessary to include that information in the log file)

  • In write_insertion_bed(), the logging should be inside the mclapply call so it prints at the correct time. (Despite the theoretical race condition, I think this basically always works out fine even in very tight loops)

    • The style of log you use in run-macs is better also, rather than printing out cluster names on individual lines

    • Example of the current (sub-optimal) output

      [1] "2024-08-23 16:10:15.508477 Writing bed file for clusters:"
      [1] "consonant"
      [1] "vowel"
      [1] "2024-08-23 16:10:15.532154 Finished writing bed files"
      
  • I'd suggest format(Sys.time()) to avoid having the fractional seconds get printed

  • I think using message() rather than print() will be better to avoid the [1] prefixes getting printed

run-macs and error checking:
  • macs_executable now has a slightly weird combination of both auto-detect and a default value.

    I'd say the best option is to default NULL, and in the case of NULL we can do your auto-detect procedure. If not null, we just check that macs can be found else we error out

    • Also, the existing stop_run and log_message variables can probably be eliminated since I don't think there's a need / easy way to hang on to log_message for use in later steps.
  • The existing check on macs_executable should check the output of macs3 --version to make sure it contains the text macs -- otherwise the check will pass for almost any executable even if it's not macs

  • In run-macs, it's probably better for system2 to write stdout/stderr directly to the log file so that logs are written as the command is running rather than all at the end. It's OK if this prevents writing the command as it appears MACS already prints the input command at the top of the log file

  • Would be good to check in run-macs that the command ran without error (had a 0 exit code)

Other:
  • The test for macs_e2e_works now introduces a system dependency on having macs2 installed. Can we make it skip instead if macs is not installed?
  • --call-summits probably can go in --additional-params, as it seems to just affect the peak calling method used by macs while leaving the output file format consistent
  • in write_insertion_bed, cell_groups should probably come after path for consistency with write_insertion_bedgraph, and cell_groups should maybe default to NULL for consistency.
Questions/discussion areas (can do via call/slack)
  • Is there a better name possible for use_gz? I'm just realizing the name doesn't really give any hints about what is getting compressed, so it would be nice to have a name that would be more specific to the macs bed insertion inputs. Let me know if you have a suggestion here
  • What do you think of the adjusting column names/values in the output? Reading the [UCSC narrowPeak format docs](https://genome.ucsc.edu/FAQ/FAQformat.html#format12), I think it's kinda weird that pValue is given as -log10(p), and peak/pointSource is given relative to start. On the one hand it's nice to maintain a direct correspondence with the macs output, but some parts of it are kind of non-intuitive

bnprks avatar Aug 24 '24 00:08 bnprks