amazon-genomics-cli icon indicating copy to clipboard operation
amazon-genomics-cli copied to clipboard

--inputsFile is not being passed to miniwdl run

Open nh13 opened this issue 2 years ago • 13 comments

The parameters specified in the JSON given to --inputsFile are not being propagated to miniwdl run

I am using AGC release1.5 with WDL and miniWDL. I have a directory of WDL files, so in my sourceURL in my agc-project.yaml points to the top-level directory. In that directory, I have a MANIFEST.json with the top level workflow:

{
	"mainWorkflowURL": "./workflows/top_level_workflow.wdl"
}

I have a local JSON file that contains all the inputs needed by the WDL workflow, both parameters and S3 URIs.

I then call:

agc workflow run \
    <workflow-name> \
    --context <context-name> \
    --inputsFile /local/path/to/inputs.json;

My Batch job fails.

  | 2022-06-01T15:00:15.088-07:00 | {
  | 2022-06-01T15:00:15.088-07:00 | "error": "InputError",
  | 2022-06-01T15:00:15.088-07:00 | "message": "missing required inputs for <workflow-name>: fastq_1, fastq_2, threads, ...
  | 2022-06-01T15:00:15.088-07:00 | }

Executing miniwdl locally works fine when I change the S3 URIs to local file paths in the inputs.json.

If I add the below to the MANIFEST.json, it runs fine.

    "inputFileURLs": [
        "inputs.json"
    ]

Looking at the docs, inputFileURLs should be optional while --inputsFile says it can be used to provide inputs to the workflow. What am I missing?

nh13 avatar Jun 01 '22 22:06 nh13

I've observed this too

mlin avatar Jun 02 '22 05:06 mlin

Investigating

wleepang avatar Jun 02 '22 17:06 wleepang

Re-reading the above, I see an expectation that --inputsFile as a command line option should blend with or take precedence over contents of MANIFEST.json. I think that should be the case as well and need to verify if the code actually does that.

wleepang avatar Jun 03 '22 05:06 wleepang

If inputsFile is specified on the command line and manifest, the behavior is undefined. In this case, it’s not specified in the manifest, so I’d expect it to use the one given in the command line. Hopefully that helps!

nh13 avatar Jun 03 '22 14:06 nh13

Confirmed that this results in EXECUTOR_ERROR using the read workflow in the demo-wdl-project submitting with the command:

agc workflow run -c miniContext read --inputsFile ./workflows/read/read.inputs.json

and the following cases for MANIFEST.json

case 1

{
  "mainWorkflowURL": "main.wdl"
}

case 2:

{
  "mainWorkflowURL": "main.wdl",
  "inputFileURLs": []
}

wleepang avatar Jun 04 '22 03:06 wleepang

Working on a fix for this. There's an open question about what should happen when inputsFile is declared in the MANIFEST and on command line. I see three logical options

  1. We send them all to the engine and the engine processes them according to it's rules
  2. We merge them. Any duplicates are resolved by giving precedence to the inputs file declared on the command line
  3. Any inputs file declared by the MANIFEST is totally ignored if the command line argument is used

Out of these I think:

  • 1 or 3 are the easiest to implement.
  • 1 works well with engines like Cromwell that can handle this but would not be a great experience for engines that cannot
  • 2 makes it easy to over-ride only a few inputs and have your defaults present in the file referenced by the manifest.
  • 3 is very simple but puts the burden on the user to do any merging or copy/ paste. It does make it very explicit what will happen (e.g. no surprise configuration in that other inputs file you forgot about, no failed overrides due to typos)

Which solution do people prefer?

markjschreiber avatar Jun 07 '22 20:06 markjschreiber

I prefer 2 as I have static genomic reference data which could be in the manifest, and per-run sample genomic data and metadata that need to be custom for each run. Having the former in the manifest, and the latter in a json on the command line makes that split easy to understand.

nh13 avatar Jun 07 '22 23:06 nh13

I have added additional logging statements to the agc workflow run logic so that --verbose output reveals what is going on. It seems that having an inputs.json defined in both the MANIFEST and on command line leads to a double handling with both sets of inputs going to S3. Inspecting the output shows that miniwdl uses the inputs defined by the inputs.json defined by the MANIFEST.

The resolution is then to merge all input sources with command line taking precedence and work from that final source to upload the consolidated artifacts. We probably have to update the uploaded MANIFEST (if any) to point to this new resolved copy of the consolidated inputs.json

agc workflow run -c miniContext words-with-vowels --inputsFile /tmp/words-with-vowels.inputs.json --verbose

2022-06-08T13:22:51-04:00 ↓  Checking AGC version...
2022-06-08T13:22:51-04:00 𝒊  Running workflow. Workflow name: 'words-with-vowels', InputsFile: '/tmp/words-with-vowels.inputs.json', OptionFile: '', Context: 'miniContext'
2022-06-08T13:22:51-04:00 ↓  reading project specification
2022-06-08T13:22:51-04:00 ↓  reading specification of 'words-with-vowels' workflow
2022-06-08T13:22:51-04:00 ↓  workflow type: 'wdl' version: '1.0', workflow source url: 'workflows/words'
2022-06-08T13:22:51-04:00 ↓  current user id: 'mrschre4GqyMA'
2022-06-08T13:22:51-04:00 ↓  obtaining spec for context 'miniContext'
2022-06-08T13:22:51-04:00 ↓  using engine 'miniwdl' from context 'miniContext'
2022-06-08T13:22:51-04:00 ↓  checking deployment status of 'miniContext' context
2022-06-08T13:22:54-04:00 ↓  using output bucket 'agc-1234567890123-us-east-1'
2022-06-08T13:22:54-04:00 ↓  parsed workflow location as 'workflows/words'
2022-06-08T13:22:54-04:00 ↓  workflow location is local? 'true', upload is required? 'true'
2022-06-08T13:22:54-04:00 ↓  workflow upload base object key is 'project/Demo/userid/mrschre4GqyMA/context/miniContext/workflow/words-with-vowels'
2022-06-08T13:22:54-04:00 ↓  workflow path is '/Users/mrschre/devel/amazon-genomics-cli/examples/demo-wdl-project/workflows/words
2022-06-08T13:22:54-04:00 ↓  workflow '' path is a directory, packing contents ...
2022-06-08T13:22:54-04:00 ↓  recursively copying content of '/Users/mrschre/devel/amazon-genomics-cli/examples/demo-wdl-project/workflows/words' to '/var/folders/9j/zdk45gtj5znf3670v17tnj2h0000gt/T/workflow_2114126158'
2022-06-08T13:22:54-04:00 ↓  updating file references and loading packed content to 'agc-1234567890123-us-east-1/project/Demo/userid/mrschre4GqyMA/context/miniContext/workflow/words-with-vowels'
2022-06-08T13:22:54-04:00 ↓  reading manifest in '/var/folders/9j/zdk45gtj5znf3670v17tnj2h0000gt/T/workflow_2114126158
2022-06-08T13:22:54-04:00 ↓  manifest declares '1' input files
2022-06-08T13:22:54-04:00 ↓  reading content of input file at '/var/folders/9j/zdk45gtj5znf3670v17tnj2h0000gt/T/workflow_2114126158/words-with-vowels.inputs.json'
2022-06-08T13:22:54-04:00 ↓  content of '/var/folders/9j/zdk45gtj5znf3670v17tnj2h0000gt/T/workflow_2114126158/words-with-vowels.inputs.json' is 
{
  "Words_With_Vowels.word_file": "./mieliestronk-words.txt"
}
2022-06-08T13:22:54-04:00 ↓  inspecting key value pair, 'Words_With_Vowels.word_file: ./mieliestronk-words.txt'
2022-06-08T13:22:54-04:00 ↓  input value './mieliestronk-words.txt' can be resolved to a file at '/Users/mrschre/devel/amazon-genomics-cli/examples/demo-wdl-project/workflows/words/./mieliestronk-words.txt'
2022-06-08T13:22:54-04:00 ↓  loading 'mieliestronk-words.txt' to 'project/Demo/userid/mrschre4GqyMA/context/miniContext/workflow/words-with-vowels/mieliestronk-words.txt'
2022-06-08T13:22:56-04:00 ↓  updated reference '0' to 's3://agc-581254785006-us-east-1/project/Demo/userid/mrschre4GqyMA/context/miniContext/workflow/words-with-vowels/mieliestronk-words.txt'
2022-06-08T13:22:56-04:00 ↓  key value pair updated to 'Words_With_Vowels.word_file: s3://agc-581254785006-us-east-1/project/Demo/userid/mrschre4GqyMA/context/miniContext/workflow/words-with-vowels/mieliestronk-words.txt'
2022-06-08T13:22:56-04:00 ↓  updloading '/var/folders/9j/zdk45gtj5znf3670v17tnj2h0000gt/T/workflow_2489931138' to 's3://agc-581254785006-us-east-1/project/Demo/userid/mrschre4GqyMA/context/miniContext/workflow/words-with-vowels/workflow.zip
2022-06-08T13:22:56-04:00 ↓  cleaning up '/var/folders/9j/zdk45gtj5znf3670v17tnj2h0000gt/T/workflow_2489931138'
2022-06-08T13:22:56-04:00 ↓  workflow artifacts at 's3://agc-581254785006-us-east-1/project/Demo/userid/mrschre4GqyMA/context/miniContext/workflow/words-with-vowels/workflow.zip' will be used to run the workflow
2022-06-08T13:22:56-04:00 ↓  Input file override URL: /tmp/words-with-vowels.inputs.json
2022-06-08T13:22:56-04:00 ↓  content is:
'{
  "Words_With_Vowels.word_file": "./words.txt"
}
'
2022-06-08T13:22:56-04:00 ↓  moving local inputs from '/tmp' to s3://agc-581254785006-us-east-1/project/Demo/userid/mrschre4GqyMA/data and replacing paths with S3 paths
2022-06-08T13:22:56-04:00 ↓  inspecting key value pair, 'Words_With_Vowels.word_file: ./words.txt'
2022-06-08T13:22:56-04:00 ↓  input value './words.txt' can be resolved to a file at '/tmp/./words.txt'
2022-06-08T13:22:56-04:00 ↓  loading 'words.txt' to 'project/Demo/userid/mrschre4GqyMA/data/words.txt'
2022-06-08T13:22:56-04:00 ↓  updated reference '0' to 's3://agc-581254785006-us-east-1/project/Demo/userid/mrschre4GqyMA/data/words.txt'
2022-06-08T13:22:56-04:00 ↓  key value pair updated to 'Words_With_Vowels.word_file: s3://agc-581254785006-us-east-1/project/Demo/userid/mrschre4GqyMA/data/words.txt'
2022-06-08T13:22:56-04:00 ↓  arguments are: '{"Words_With_Vowels.word_file":"s3://agc-581254785006-us-east-1/project/Demo/userid/mrschre4GqyMA/data/words.txt"}'
2022-06-08T13:22:56-04:00 ↓  using context infrastructure from cloudformation stack 'Agc-Context-Demo-mrschre4GqyMA-miniContext'
2022-06-08T13:22:58-04:00 ↓  workflow will be submitted to wes endpoint at 'https://w2xbju8tb0.execute-api.us-east-1.amazonaws.com/prod/'
2022-06-08T13:22:58-04:00 ↓  constructing API client for WES endpoint at 'https://w2xbju8tb0.execute-api.us-east-1.amazonaws.com/prod/'
2022-06-08T13:22:58-04:00 ↓  EstablishWesConnection(https://w2xbju8tb0.execute-api.us-east-1.amazonaws.com/prod/ga4gh/wes/v1)
2022-06-08T13:22:58-04:00 ↓  Ping WES endpoint until we get an answer:
2022-06-08T13:22:58-04:00 ↓  Attempt 1 of 3
2022-06-08T13:23:00-04:00 ↓  Connected to WES endpoint
2022-06-08T13:23:00-04:00 ↓  saved attachment for argument '{"Words_With_Vowels.word_file":"s3://agc-123456789012-us-east-1/project/Demo/userid/mrschre4GqyMA/data/words.txt"}' to '/var/folders/9j/zdk45gtj5znf3670v17tnj2h0000gt/T/words-with-vowels.inputs.json_2587074133'
2022-06-08T13:23:00-04:00 ↓  workflow parameter of 'workflowInputs' is 'words-with-vowels.inputs.json_2587074133'
2022-06-08T13:23:00-04:00 ↓  running workflow at 's3://agc-123456789012-us-east-1/project/Demo/userid/mrschre4GqyMA/context/miniContext/workflow/words-with-vowels/workflow.zip' with language 'wdl' and version '1.0' using attachments '[/var/folders/9j/zdk45gtj5znf3670v17tnj2h0000gt/T/words-with-vowels.inputs.json_2587074133]', workflow parameters 'map[workflowInputs:words-with-vowels.inputs.json_2587074133]' and engine parameters 'map[]'
2022-06-08T13:23:01-04:00 ↓  recording workflow run metadata for workflow run id 'fb6c4633-4b29-4eb8-896f-f60a998cc8c9' to DynamodDB
2022-06-08T13:23:02-04:00 ↓  cleaning up '/var/folders/9j/zdk45gtj5znf3670v17tnj2h0000gt/T/words-with-vowels.inputs.json_2587074133'

markjschreiber avatar Jun 08 '22 17:06 markjschreiber

BTW, code with additional verbose logging is available at https://github.com/aws/amazon-genomics-cli/tree/markjschreiber/issue-474 branch

markjschreiber avatar Jun 08 '22 17:06 markjschreiber

I think this also blocks running multiple workflows concurrently in the same context, since the MANIFEST.json with a per-workflow inputFileURLs specified in the MANIFEST.json is packaged with the workflow (workflow.zip). The workflow.zip is 1:1 with the context, and not with the agc workflow run. This is certainly unexpected, as I'd expect that agc workflow run would not overwrite any previous definition of the workflow. So I think this is a bigger issue than just convenience.

Edit: see #497

nh13 avatar Jul 20 '22 16:07 nh13

Interesting observation @nh13, indeed there is only one workflow.zip per context. Do you know how this affects the concurrent execution of workflows with Cromwell? I have launched four workflows one after the other with a delay of between 1 and 10 minutes and they all seem to have picked the right input. They all use the same MANIFEST and I then pass the sample-specific input file with the -inputsFile parameter. From what I have seen, workflow.zip in s3 is rewritten after each agc workflow run but perhaps this file is only used transiently? If so, is there a recommended time we should wait between agc workflow run commands?

AsierGonzalez avatar Aug 02 '22 12:08 AsierGonzalez

I think this also blocks running multiple workflows concurrently in the same context, since the MANIFEST.json with a per-workflow inputFileURLs specified in the MANIFEST.json is packaged with the workflow (workflow.zip). The workflow.zip is 1:1 with the context, and not with the agc workflow run. This is certainly unexpected, as I'd expect that agc workflow run would not overwrite any previous definition of the workflow. So I think this is a bigger issue than just convenience.

Please escalate this to its own issue

wleepang avatar Aug 03 '22 01:08 wleepang

I am also seeing this behavior with the nextflow engine as well.

spitfiredd avatar Feb 02 '23 21:02 spitfiredd