amazon-genomics-cli
amazon-genomics-cli copied to clipboard
--inputsFile is not being passed to miniwdl run
The parameters specified in the JSON given to --inputsFile
are not being propagated to miniwdl run
I am using AGC release1.5 with WDL and miniWDL. I have a directory of WDL files, so in my sourceURL
in my agc-project.yaml
points to the top-level directory. In that directory, I have a MANIFEST.json with the top level workflow:
{
"mainWorkflowURL": "./workflows/top_level_workflow.wdl"
}
I have a local JSON file that contains all the inputs needed by the WDL workflow, both parameters and S3 URIs.
I then call:
agc workflow run \
<workflow-name> \
--context <context-name> \
--inputsFile /local/path/to/inputs.json;
My Batch job fails.
| 2022-06-01T15:00:15.088-07:00 | {
| 2022-06-01T15:00:15.088-07:00 | "error": "InputError",
| 2022-06-01T15:00:15.088-07:00 | "message": "missing required inputs for <workflow-name>: fastq_1, fastq_2, threads, ...
| 2022-06-01T15:00:15.088-07:00 | }
Executing miniwdl locally works fine when I change the S3 URIs to local file paths in the inputs.json.
If I add the below to the MANIFEST.json, it runs fine.
"inputFileURLs": [
"inputs.json"
]
Looking at the docs, inputFileURLs
should be optional while --inputsFile
says it can be used to provide inputs to the workflow. What am I missing?
I've observed this too
Investigating
Re-reading the above, I see an expectation that --inputsFile
as a command line option should blend with or take precedence over contents of MANIFEST.json
. I think that should be the case as well and need to verify if the code actually does that.
If inputsFile is specified on the command line and manifest, the behavior is undefined. In this case, it’s not specified in the manifest, so I’d expect it to use the one given in the command line. Hopefully that helps!
Confirmed that this results in EXECUTOR_ERROR
using the read
workflow in the demo-wdl-project
submitting with the command:
agc workflow run -c miniContext read --inputsFile ./workflows/read/read.inputs.json
and the following cases for MANIFEST.json
case 1
{
"mainWorkflowURL": "main.wdl"
}
case 2:
{
"mainWorkflowURL": "main.wdl",
"inputFileURLs": []
}
Working on a fix for this. There's an open question about what should happen when inputsFile is declared in the MANIFEST and on command line. I see three logical options
- We send them all to the engine and the engine processes them according to it's rules
- We merge them. Any duplicates are resolved by giving precedence to the inputs file declared on the command line
- Any inputs file declared by the MANIFEST is totally ignored if the command line argument is used
Out of these I think:
- 1 or 3 are the easiest to implement.
- 1 works well with engines like Cromwell that can handle this but would not be a great experience for engines that cannot
- 2 makes it easy to over-ride only a few inputs and have your defaults present in the file referenced by the manifest.
- 3 is very simple but puts the burden on the user to do any merging or copy/ paste. It does make it very explicit what will happen (e.g. no surprise configuration in that other inputs file you forgot about, no failed overrides due to typos)
Which solution do people prefer?
I prefer 2 as I have static genomic reference data which could be in the manifest, and per-run sample genomic data and metadata that need to be custom for each run. Having the former in the manifest, and the latter in a json on the command line makes that split easy to understand.
I have added additional logging statements to the agc workflow run
logic so that --verbose
output reveals what is going on. It seems that having an inputs.json
defined in both the MANIFEST and on command line leads to a double handling with both sets of inputs going to S3. Inspecting the output shows that miniwdl uses the inputs defined by the inputs.json
defined by the MANIFEST.
The resolution is then to merge all input sources with command line taking precedence and work from that final source to upload the consolidated artifacts. We probably have to update the uploaded MANIFEST (if any) to point to this new resolved copy of the consolidated inputs.json
agc workflow run -c miniContext words-with-vowels --inputsFile /tmp/words-with-vowels.inputs.json --verbose
2022-06-08T13:22:51-04:00 ↓ Checking AGC version...
2022-06-08T13:22:51-04:00 𝒊 Running workflow. Workflow name: 'words-with-vowels', InputsFile: '/tmp/words-with-vowels.inputs.json', OptionFile: '', Context: 'miniContext'
2022-06-08T13:22:51-04:00 ↓ reading project specification
2022-06-08T13:22:51-04:00 ↓ reading specification of 'words-with-vowels' workflow
2022-06-08T13:22:51-04:00 ↓ workflow type: 'wdl' version: '1.0', workflow source url: 'workflows/words'
2022-06-08T13:22:51-04:00 ↓ current user id: 'mrschre4GqyMA'
2022-06-08T13:22:51-04:00 ↓ obtaining spec for context 'miniContext'
2022-06-08T13:22:51-04:00 ↓ using engine 'miniwdl' from context 'miniContext'
2022-06-08T13:22:51-04:00 ↓ checking deployment status of 'miniContext' context
2022-06-08T13:22:54-04:00 ↓ using output bucket 'agc-1234567890123-us-east-1'
2022-06-08T13:22:54-04:00 ↓ parsed workflow location as 'workflows/words'
2022-06-08T13:22:54-04:00 ↓ workflow location is local? 'true', upload is required? 'true'
2022-06-08T13:22:54-04:00 ↓ workflow upload base object key is 'project/Demo/userid/mrschre4GqyMA/context/miniContext/workflow/words-with-vowels'
2022-06-08T13:22:54-04:00 ↓ workflow path is '/Users/mrschre/devel/amazon-genomics-cli/examples/demo-wdl-project/workflows/words
2022-06-08T13:22:54-04:00 ↓ workflow '' path is a directory, packing contents ...
2022-06-08T13:22:54-04:00 ↓ recursively copying content of '/Users/mrschre/devel/amazon-genomics-cli/examples/demo-wdl-project/workflows/words' to '/var/folders/9j/zdk45gtj5znf3670v17tnj2h0000gt/T/workflow_2114126158'
2022-06-08T13:22:54-04:00 ↓ updating file references and loading packed content to 'agc-1234567890123-us-east-1/project/Demo/userid/mrschre4GqyMA/context/miniContext/workflow/words-with-vowels'
2022-06-08T13:22:54-04:00 ↓ reading manifest in '/var/folders/9j/zdk45gtj5znf3670v17tnj2h0000gt/T/workflow_2114126158
2022-06-08T13:22:54-04:00 ↓ manifest declares '1' input files
2022-06-08T13:22:54-04:00 ↓ reading content of input file at '/var/folders/9j/zdk45gtj5znf3670v17tnj2h0000gt/T/workflow_2114126158/words-with-vowels.inputs.json'
2022-06-08T13:22:54-04:00 ↓ content of '/var/folders/9j/zdk45gtj5znf3670v17tnj2h0000gt/T/workflow_2114126158/words-with-vowels.inputs.json' is
{
"Words_With_Vowels.word_file": "./mieliestronk-words.txt"
}
2022-06-08T13:22:54-04:00 ↓ inspecting key value pair, 'Words_With_Vowels.word_file: ./mieliestronk-words.txt'
2022-06-08T13:22:54-04:00 ↓ input value './mieliestronk-words.txt' can be resolved to a file at '/Users/mrschre/devel/amazon-genomics-cli/examples/demo-wdl-project/workflows/words/./mieliestronk-words.txt'
2022-06-08T13:22:54-04:00 ↓ loading 'mieliestronk-words.txt' to 'project/Demo/userid/mrschre4GqyMA/context/miniContext/workflow/words-with-vowels/mieliestronk-words.txt'
2022-06-08T13:22:56-04:00 ↓ updated reference '0' to 's3://agc-581254785006-us-east-1/project/Demo/userid/mrschre4GqyMA/context/miniContext/workflow/words-with-vowels/mieliestronk-words.txt'
2022-06-08T13:22:56-04:00 ↓ key value pair updated to 'Words_With_Vowels.word_file: s3://agc-581254785006-us-east-1/project/Demo/userid/mrschre4GqyMA/context/miniContext/workflow/words-with-vowels/mieliestronk-words.txt'
2022-06-08T13:22:56-04:00 ↓ updloading '/var/folders/9j/zdk45gtj5znf3670v17tnj2h0000gt/T/workflow_2489931138' to 's3://agc-581254785006-us-east-1/project/Demo/userid/mrschre4GqyMA/context/miniContext/workflow/words-with-vowels/workflow.zip
2022-06-08T13:22:56-04:00 ↓ cleaning up '/var/folders/9j/zdk45gtj5znf3670v17tnj2h0000gt/T/workflow_2489931138'
2022-06-08T13:22:56-04:00 ↓ workflow artifacts at 's3://agc-581254785006-us-east-1/project/Demo/userid/mrschre4GqyMA/context/miniContext/workflow/words-with-vowels/workflow.zip' will be used to run the workflow
2022-06-08T13:22:56-04:00 ↓ Input file override URL: /tmp/words-with-vowels.inputs.json
2022-06-08T13:22:56-04:00 ↓ content is:
'{
"Words_With_Vowels.word_file": "./words.txt"
}
'
2022-06-08T13:22:56-04:00 ↓ moving local inputs from '/tmp' to s3://agc-581254785006-us-east-1/project/Demo/userid/mrschre4GqyMA/data and replacing paths with S3 paths
2022-06-08T13:22:56-04:00 ↓ inspecting key value pair, 'Words_With_Vowels.word_file: ./words.txt'
2022-06-08T13:22:56-04:00 ↓ input value './words.txt' can be resolved to a file at '/tmp/./words.txt'
2022-06-08T13:22:56-04:00 ↓ loading 'words.txt' to 'project/Demo/userid/mrschre4GqyMA/data/words.txt'
2022-06-08T13:22:56-04:00 ↓ updated reference '0' to 's3://agc-581254785006-us-east-1/project/Demo/userid/mrschre4GqyMA/data/words.txt'
2022-06-08T13:22:56-04:00 ↓ key value pair updated to 'Words_With_Vowels.word_file: s3://agc-581254785006-us-east-1/project/Demo/userid/mrschre4GqyMA/data/words.txt'
2022-06-08T13:22:56-04:00 ↓ arguments are: '{"Words_With_Vowels.word_file":"s3://agc-581254785006-us-east-1/project/Demo/userid/mrschre4GqyMA/data/words.txt"}'
2022-06-08T13:22:56-04:00 ↓ using context infrastructure from cloudformation stack 'Agc-Context-Demo-mrschre4GqyMA-miniContext'
2022-06-08T13:22:58-04:00 ↓ workflow will be submitted to wes endpoint at 'https://w2xbju8tb0.execute-api.us-east-1.amazonaws.com/prod/'
2022-06-08T13:22:58-04:00 ↓ constructing API client for WES endpoint at 'https://w2xbju8tb0.execute-api.us-east-1.amazonaws.com/prod/'
2022-06-08T13:22:58-04:00 ↓ EstablishWesConnection(https://w2xbju8tb0.execute-api.us-east-1.amazonaws.com/prod/ga4gh/wes/v1)
2022-06-08T13:22:58-04:00 ↓ Ping WES endpoint until we get an answer:
2022-06-08T13:22:58-04:00 ↓ Attempt 1 of 3
2022-06-08T13:23:00-04:00 ↓ Connected to WES endpoint
2022-06-08T13:23:00-04:00 ↓ saved attachment for argument '{"Words_With_Vowels.word_file":"s3://agc-123456789012-us-east-1/project/Demo/userid/mrschre4GqyMA/data/words.txt"}' to '/var/folders/9j/zdk45gtj5znf3670v17tnj2h0000gt/T/words-with-vowels.inputs.json_2587074133'
2022-06-08T13:23:00-04:00 ↓ workflow parameter of 'workflowInputs' is 'words-with-vowels.inputs.json_2587074133'
2022-06-08T13:23:00-04:00 ↓ running workflow at 's3://agc-123456789012-us-east-1/project/Demo/userid/mrschre4GqyMA/context/miniContext/workflow/words-with-vowels/workflow.zip' with language 'wdl' and version '1.0' using attachments '[/var/folders/9j/zdk45gtj5znf3670v17tnj2h0000gt/T/words-with-vowels.inputs.json_2587074133]', workflow parameters 'map[workflowInputs:words-with-vowels.inputs.json_2587074133]' and engine parameters 'map[]'
2022-06-08T13:23:01-04:00 ↓ recording workflow run metadata for workflow run id 'fb6c4633-4b29-4eb8-896f-f60a998cc8c9' to DynamodDB
2022-06-08T13:23:02-04:00 ↓ cleaning up '/var/folders/9j/zdk45gtj5znf3670v17tnj2h0000gt/T/words-with-vowels.inputs.json_2587074133'
BTW, code with additional verbose logging is available at https://github.com/aws/amazon-genomics-cli/tree/markjschreiber/issue-474 branch
I think this also blocks running multiple workflows concurrently in the same context, since the MANIFEST.json with a per-workflow inputFileURLs
specified in the MANIFEST.json is packaged with the workflow (workflow.zip). The workflow.zip is 1:1 with the context, and not with the agc workflow run
. This is certainly unexpected, as I'd expect that agc workflow run
would not overwrite any previous definition of the workflow. So I think this is a bigger issue than just convenience.
Edit: see #497
Interesting observation @nh13, indeed there is only one workflow.zip
per context. Do you know how this affects the concurrent execution of workflows with Cromwell?
I have launched four workflows one after the other with a delay of between 1 and 10 minutes and they all seem to have picked the right input. They all use the same MANIFEST and I then pass the sample-specific input file with the -inputsFile
parameter. From what I have seen, workflow.zip
in s3 is rewritten after each agc workflow run
but perhaps this file is only used transiently? If so, is there a recommended time we should wait between agc workflow run
commands?
I think this also blocks running multiple workflows concurrently in the same context, since the MANIFEST.json with a per-workflow
inputFileURLs
specified in the MANIFEST.json is packaged with the workflow (workflow.zip). The workflow.zip is 1:1 with the context, and not with theagc workflow run
. This is certainly unexpected, as I'd expect thatagc workflow run
would not overwrite any previous definition of the workflow. So I think this is a bigger issue than just convenience.
Please escalate this to its own issue
I am also seeing this behavior with the nextflow
engine as well.