
our WDL files are too long

Open a-frantz opened this issue 8 months ago • 0 comments

I'll single out the 3 worst offenders: util.wdl, samtools.wdl, and picard.wdl have so many tasks in them that it becomes difficult to find the one you're looking for. At least that's my experience. They are each >700 lines long, which I think in just about every other language would be considered a behemoth for maintenance. Most languages have recommended file lengths (guessing an average consensus would be around 500 lines?), and I propose we adopt something similar for WDL/this repo. I don't want to base it on line count, though. I feel that can encourage some sloppy coding when the file in question is around the limit, e.g. collapsing lines that should be separated in order to keep below the length limit.

The indented section below is me thinking out loud and realizing that all my ideas have some fatal flaw. I'm stumped as to how to solve the problem. Feel free to skip the indented section, or read it to see the thoughts I've had.

A saner approach to me is a task limit. The exact number might require some trial and error. Rough gut feeling: 5 seems too strict, 10 seems too lenient. I'd say we start looking in the 6-9 range for our task number limit.
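To make the task-limit idea concrete, here is a minimal sketch of how it could be checked. It assumes every task starts on a line shaped like `task <name> {`, and the regex is only an approximation: it would also match such a line inside a command block. `task_names`, `over_limit`, and the default limit of 9 are hypothetical names and values for illustration, not anything that exists in the repo.

```python
import re
from typing import List

# Assumption: a task declaration looks like `task <name> {` at the start of a line.
# This is a regex approximation, not a real WDL parser.
TASK_RE = re.compile(r"^[ \t]*task\s+([A-Za-z][A-Za-z0-9_]*)\s*\{", re.MULTILINE)


def task_names(wdl_source: str) -> List[str]:
    """Return task names in the order they appear in the WDL source."""
    return TASK_RE.findall(wdl_source)


def over_limit(wdl_source: str, limit: int = 9) -> bool:
    """True if the file declares more tasks than the (hypothetical) limit."""
    return len(task_names(wdl_source)) > limit


# Tiny stand-in file; the bodies are empty because only the declarations matter here.
SAMPLE = """\
version 1.1

task download_taxonomy {
}

task download_library {
}

task build_db {
}
"""

print(task_names(SAMPLE))
print(over_limit(SAMPLE, limit=2))
```

Run over every `.wdl` file in the repo, something like this would at least show how many files sit above each candidate limit in the 6-9 range before we commit to a number.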

"But we currently organize our files by tool. Are we throwing that scheme out?" No! (At least that's not my initial proposal. I'd hear someone out if they have an alternative.) I say we start with organizing our files by tool, and then once they grow past 6-9 tasks, we make a split. What that split is will depend on some context. For example, picard.wdl could be split into picard-qc.wdl and picard-manipulation.wdl. picard-qc would have all the Picard tasks which generate a report of some kind and don't change the BAM file. picard-manipulation would have all the Picard tasks which modify BAM files.

samtools.wdl could be split into... Alright I don't see a great way to split this file. Let's try util.wdl: could be split into util-python-scripts.wdl, and gosh this is proving more difficult than I expected.

Pivot: what about sorting our tasks? That would also accomplish the goal of making it easier to find a specific task in a long file.

Let's start with a file whose order I like: kraken2.wdl. It's ordered so well I know it off the top of my head: download_taxonomy, download_library, create_library_from_fastas, build_db, kraken. It flows in the order that the tasks would be used. It's logical. This seems to be a special case that I don't see a way to generalize. Unfortunate...

ngsderive.wdl is roughly in the order that the task/subcommand was created. Chronological ordering makes sense, although I'd say that knowledge is pretty esoteric. I doubt anyone besides me and Clay could rattle off the order that commands were added to ngsderive. So that works for me but is probably not an ordering we should stand by. Now that I think about it, I think we always (or nearly always) add new tasks to the bottom of the file. So really most of our files are ordered in this chronological way. Kinda works for helping us regular maintainers find what we're looking for, but not helpful for anyone else.

Alphabetize? Are there any other sorting choices? Would alphabetized tasks be an improvement? It wouldn't be of very much help to me. My brain is not great at alphabetizing. Not "can't do it" bad, but also it wouldn't be trivial for me to locate what I'm looking for in that sort. I imagine the situation would be roughly the same for me. Maybe a small improvement?

At first I thought this would be a silly suggestion but I don't hate it: shorter tasks at the top of the file, longer tasks at the bottom of the file. Con: really really annoying to get that sorted and maintain that sort (assuming we don't automate it). Pro: it's the closest to "locate by vibe" that exists 😆
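On the "assuming we don't automate it" part: a rough sketch of what automating the check might look like. It approximates a task's length as the number of lines from its declaration up to the next task declaration (or end of file), so trailing blank lines and anything else between tasks get counted too. `task_lengths` and `is_sorted_by_length` are hypothetical helpers, and the `task <name> {` line shape is an assumption.

```python
import re
from typing import List, Tuple

# Assumption: a task declaration is a line shaped like `task <name> {`.
TASK_RE = re.compile(r"^\s*task\s+(\w+)\s*\{")


def task_lengths(wdl_source: str) -> List[Tuple[str, int]]:
    """Return (task name, approximate line count) pairs in file order."""
    lines = wdl_source.splitlines()
    starts = []
    for index, line in enumerate(lines):
        match = TASK_RE.match(line)
        if match:
            starts.append((index, match.group(1)))
    # Each task is treated as ending where the next one starts (or at end of file).
    ends = [index for index, _ in starts[1:]] + [len(lines)]
    return [(name, end - start) for (start, name), end in zip(starts, ends)]


def is_sorted_by_length(wdl_source: str) -> bool:
    """True if tasks appear shortest-first under the approximation above."""
    lengths = [length for _, length in task_lengths(wdl_source)]
    return lengths == sorted(lengths)


# Made-up example file with the longer task first.
SAMPLE = """\
version 1.1

task long_task {
    command <<<
        echo one
        echo two
    >>>
}

task short_task {
    command <<< echo hi >>>
}
"""

print(task_lengths(SAMPLE))
print(is_sorted_by_length(SAMPLE))
```

This only detects a bad order; it doesn't rewrite the file. Reordering tasks safely would need a real WDL parser rather than line matching.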

So I'm stumped. I still think our files are too long and they should be broken up. Or sorted, although I'd prefer a scheme for splitting files into smaller chunks. But I don't like any specific implementation I can come up with.

The best thing I thought of is alphabetizing tasks. I don't love it because my brain is wired in a way that makes finding things in an alphabetic sort harder than it should be. But it's probably an overall improvement (especially while we lack any viable alternatives). So, do we want to start alphabetizing our WDL files with many tasks? All our task files? Is there a threshold under which it's not worth the effort? Would it look strange to have some files sorted and some not?
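If we do go with alphabetizing, the order could be checked rather than maintained by hand. A minimal sketch, assuming task declarations look like `task <name> {` and that we compare names case-insensitively (both are assumptions, and `out_of_order` is a hypothetical helper):

```python
import re
from typing import List, Tuple

# Assumption: a task declaration looks like `task <name> {` at the start of a line.
TASK_RE = re.compile(r"^[ \t]*task\s+([A-Za-z][A-Za-z0-9_]*)\s*\{", re.MULTILINE)


def out_of_order(wdl_source: str) -> List[Tuple[str, str]]:
    """Return adjacent (earlier, later) task pairs that break alphabetical order."""
    names = TASK_RE.findall(wdl_source)
    return [
        (first, second)
        for first, second in zip(names, names[1:])
        if first.lower() > second.lower()
    ]


# The kraken2.wdl task order described above, with empty bodies as stand-ins.
KRAKEN2_ORDER = """\
task download_taxonomy {
}
task download_library {
}
task create_library_from_fastas {
}
task build_db {
}
task kraken {
}
"""

for first, second in out_of_order(KRAKEN2_ORDER):
    print(f"{first} should come after {second}")
```

On the kraken2.wdl order this flags three of the four adjacent pairs, which shows the trade-off: alphabetizing would undo the usage-order flow in the one file where that order works well.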

Opening the floor for proposals!

a-frantz, Oct 11 '23 22:10