Optional order for publishDir.

Open srynobio opened this issue 5 years ago • 12 comments

New feature

publishDir is noted as operating in an 'asynchronous manner', which is often the preferred method of transfer. However, this can cause issues when transferring a file and its respective index.

Example: A process creates a VCF file and a tabix index file. The process uses publishDir to 'copy' both files to a final directory. However, because the .tbi file copies much more quickly (it is smaller), it receives an older timestamp than the VCF file once it transfers. Many tools that require the .tbi file alongside the VCF will error out because index files are not allowed to have an older timestamp than the original VCF.

I have also noticed this issue with BAM & BAM indexes.
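
To make the failure condition concrete, here is a minimal sketch of the kind of check that downstream tools are described as performing: comparing the index's last-modified time against that of its data file. The class name `IndexFreshness`, the method `isIndexStale`, and the file names are hypothetical and chosen only for illustration; real tools may apply a different rule.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;

public class IndexFreshness {

    // Returns true when the index was last modified before its data file,
    // i.e. the situation described above after an out-of-order publish.
    // (Hypothetical helper, not part of Nextflow or any genomics tool.)
    static boolean isIndexStale(Path dataFile, Path indexFile) {
        try {
            FileTime dataTime = Files.getLastModifiedTime(dataFile);
            FileTime indexTime = Files.getLastModifiedTime(indexFile);
            return indexTime.compareTo(dataTime) < 0;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("publish-demo");
        Path vcf = Files.createFile(dir.resolve("sample.vcf.gz"));
        Path tbi = Files.createFile(dir.resolve("sample.vcf.gz.tbi"));

        // Simulate the race: the small index finishes copying a minute
        // before the large VCF does.
        Files.setLastModifiedTime(tbi, FileTime.fromMillis(1_600_000_000_000L));
        Files.setLastModifiedTime(vcf, FileTime.fromMillis(1_600_000_060_000L));

        System.out.println("index stale: " + isIndexStale(vcf, tbi));
    }
}
```

A check like this can also be run over a published directory to find index files that would need to be touched or re-copied.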

Suggest implementation

Possibly an additional parameter that allows the order to be specified when needed, i.e. a synchronous manner.

publishDir "my/path/", mode: 'copy', pattern: "*.gz", order: 1
publishDir "my/path/", mode: 'copy', pattern: "*.tbi", order: 2

srynobio avatar Feb 08 '20 19:02 srynobio

This is out of the scope of publishDir, which is designed to store workflow result data. If you need to synchronise the data output with the execution of other tasks/tools, a Nextflow process has to be used instead.

pditommaso avatar Feb 09 '20 14:02 pditommaso

Okay, given that my example shows the files running through publishDir were "store workflow data result[s]", they appeared to me to fall within this scope, but passing them to another process was second on my list of possible actions.

Thanks!

srynobio avatar Feb 10 '20 19:02 srynobio

@pditommaso If publishDir copied the last modification time from the original files, this would not be a problem at all. This is what cp does when run like this: cp --preserve=timestamps original_file new_file

ghuls avatar Jun 17 '21 13:06 ghuls

@srynobio @pditommaso @lindenb Several bioinformatics tools output both the vcf.gz and vcf.gz.tbi files. As described above, the present publishDir copy mode often results in a vcf.gz.tbi file that is dated before the vcf.gz file, which is problematic for downstream processes. The copy is not preserving the file times. Simply using cp -p as mentioned by @ghuls will solve the problem and maintain the 'asynchronous manner'. This issue has also been mentioned in #3002 for bams. Often this issue is not a problem when running pipelines on small test files, but it becomes an issue when testing production runs on wgs crams or large vcf.gz files with a larger number of samples.

The solution suggested above, to index the vcf.gz file with a subsequent process, would result in wasteful reindexing in this case. Also, tabix indexing is very I/O intensive. Having 1000s of processes that index vcf.gz files at the same time will have a major negative impact on I/O for the whole cluster. The user running the pipeline will be scolded by the cluster system administrator, who will then limit the number of their running jobs.

The publishDir Groovy code uses FileHelper.copyPath() to copy. The call to static Path copyPath(Path source, Path target, CopyOption... options) just needs to set the CopyOption to COPY_ATTRIBUTES to resolve the timestamp issues with cram, bam and vcf.gz files.

Could this issue be reopened?
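
As an illustration of the COPY_ATTRIBUTES suggestion, the sketch below uses the plain JDK `Files.copy` with `StandardCopyOption.COPY_ATTRIBUTES`. It does not call Nextflow's `FileHelper.copyPath()`, and it assumes both source and target are on a local file system whose provider supports carrying over the last-modified time; the JDK leaves the exact set of copied attributes platform dependent. The class name `PreserveTimestampCopy`, the helper `copyPreservingAttributes`, and the file names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.attribute.FileTime;

public class PreserveTimestampCopy {

    // Copies source to target and asks the provider to carry over file
    // attributes, which minimally includes the last-modified time when the
    // source and target file stores support it.
    // (Hypothetical helper; not the Nextflow implementation.)
    static Path copyPreservingAttributes(Path source, Path target) {
        try {
            return Files.copy(source, target,
                    StandardCopyOption.COPY_ATTRIBUTES,
                    StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path workDir = Files.createTempDirectory("work");
        Path publishDir = Files.createTempDirectory("published");

        // A work-directory file with a known modification time in the past.
        Path source = Files.createFile(workDir.resolve("sample.vcf.gz.tbi"));
        Files.setLastModifiedTime(source, FileTime.fromMillis(1_600_000_000_000L));

        // Default copy: the target's timestamp reflects when the copy ran.
        Path plain = Files.copy(source, publishDir.resolve("plain.tbi"));
        // Attribute-preserving copy: the target keeps the source's timestamp.
        Path kept = copyPreservingAttributes(source, publishDir.resolve("kept.tbi"));

        System.out.println("source: " + Files.getLastModifiedTime(source));
        System.out.println("plain:  " + Files.getLastModifiedTime(plain));
        System.out.println("kept:   " + Files.getLastModifiedTime(kept));
    }
}
```

With timestamps preserved, the relative order of a data file and its index in the publish directory matches their order in the work directory, regardless of which copy finishes first.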

jjfarrell avatar Feb 21 '23 12:02 jjfarrell

Reopened this issue for a discussion and response to @jjfarrell question.

srynobio avatar Feb 21 '23 17:02 srynobio

Having the same issue as the "published" index file being older than the vcf file. Anyone found a solution yet? Thanks!

ssllff avatar Apr 30 '23 20:04 ssllff

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Oct 15 '23 14:10 stale[bot]

@pditommaso Multiple posters are waiting for a resolution to this issue. Why not simply preserve the timestamp of the working directory files when copying published files? This avoids breaking multiple downstream genomics software tools. The timestamp is preserved when a link is used but not with a copy.

The publishDir code uses FileHelper.copyPath() for the copy. The call just needs to set the CopyOption to COPY_ATTRIBUTES.

jjfarrell avatar Oct 15 '23 15:10 jjfarrell

Hi @jjfarrell, feel free to submit a PR if you believe you have found an appropriate fix.

bentsherman avatar Nov 08 '23 15:11 bentsherman

Hi @bensherman @pditommaso. Is there any timeline for merging this pull request? It has been over a couple of months since the PR was submitted. It would be great to run our Nextflow WGS pipelines on a newly released batch of 20k crams with this issue resolved.

jjfarrell avatar Jan 15 '24 15:01 jjfarrell

Hi @jjfarrell,

It looks like the PR is waiting for you to respond to a comment. When this is addressed I'll try to push for it to be merged quickly as it is so small.

Phil

ewels avatar Jan 16 '24 08:01 ewels

Hi @bensherman

I think you meant @bentsherman.

bensherman avatar Jan 16 '24 17:01 bensherman