wdl
wdl copied to clipboard
reading in Struct's and Array[Struct] from TSV/json to replace `read_objects`
I'm sitting down to try to summarize the convo about getting data read into an array of Structs. The situation arose from the removal of the read_objects function, which could read in a TSV with a header and create an Array[Object] that made each row into a named object based on the header of the file. This often was used to then scatter over when a group of Files/Strings were used together in a series of tasks to ensure that the right pairings of variables. This without having to do a bunch of parsing of the file header and such or to require users to pre-process input files and manifests to workflows that are human readable and user friendly, to a workflow friendly input. Json's also are obviously important too though, but they are more clear on how to do this.
I think the goal would be to have a multi column TSV or a json as the input. The file would be read in and the contents (each row or equivalent portion of the json) checked to conform to the defined Struct specified. If not, throw an error. Then stuff the contents of each row of the TSV or portion of the json into an Array[Struct] if there are more than 1 row/group.
The call could look like: read_TSV(file, Struct) or read_json(file, Struct). Thoughts/feelings/concerns?
We already have a read_json call: https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#mixed-read_jsonstringfile.
It can be used to read in a struct. We do that for example here: https://github.com/biowdl/germline-DNA/blob/99903edd6b826392a5db0d80551dabd415d77095/germline.wdl#L76
In this particular case we wrote a python script to convert the TSV in a structure that was easy to process in WDL. Outputted the structure in JSON and read it in as a WDL struct.
I think the advantage of JSON is that it is unambiguous about types (String, Int, List, Map), while a TSV is in essence a Array[Array[Any]]. TSV is extremely easy to parse in any language that supports some sort of line-by-line iteration of a file. So in my opinion it is easier to let the users do the tsv -> json conversion themselves and then read the json in WDL.
From reading the spec it was not clear to me that read_json would read into a struct, NOR were the docs on Structs clear to me that you could nest a struct. Perhaps more doc's to clarify more about how structs can be used?
I still don't understand then why read_tsv can't read a TSV with a header into a struct and then fail if the contents don't match what is expected. Why is it that a JSON is unambiguous about types, whereas a TSV with a header is not? I don't see a difference between those two. If the TSV does NOT have a header, then sure, it's totally ambiguous, but with a header, it's the same thing as a JSON and should be treated as such.
Sorry for the multiple replies here, but are you using Cromwell to run these WDLs? What does your python parsing script do that fixes the issue of reading in a json array?
This issue/comment from last year: https://github.com/broadinstitute/cromwell/issues/4518#issuecomment-454047093
I wrote a task to parse my TSV and read it into a struct similar to yours (a struct that is an array[anotherStruct]), and it gave me the old Failed to read_json... No coercion defined... error.
Has all of this been fixed and I just haven't followed along on the discussion or what am I missing?
Why is it that a JSON is unambiguous about types, whereas a TSV with a header is not? I don't see a difference between those two. If the TSV does NOT have a header, then sure, it's totally ambiguous, but with a header, it's the same thing as a JSON and should be treated as such.
Well in json {"ID": 5} and {"ID": "5"} are an Int and a string respectively. You can't do that with a tsv. There is not a way to distinguish 5 from "5".
Also Structs are objects and JSON stands for JavaScript Object Notation. This notation is very easy for defining objects. But importantly it is also easy for developers to implement. The TSV type casting you describe (determine type based on the struct members it is supposed to go into) is hard. Especially because TSV does not support lists and dictionaries. It will involve a lot of guessing what the user meant. Personally I think the only 'correct' way to parse a tsv is to cast it to an Array[Array[String]]. But it is easier to just write a small python script and do the json casting yourself.
What does your python parsing script do that fixes the issue of reading in a json array?
It uses a struct with a value samples that then contains the list of samples. It is a workaround, but it works.
I wrote a task to parse my TSV and read it into a struct similar to yours (a struct that is an array[anotherStruct]), and it gave me the old Failed to read_json... No coercion defined... error.
If you would post the JSON and the code and the full error message it would be much easier to help you.
Has all of this been fixed and I just haven't followed along on the discussion or what am I missing?
No this hasn't all been fixed. I guess what happened is that people had workarounds (like ours) that were only a very slight nuisance to implement. Then after that other more urgent issues take priority.
@vortexing I agree with you that the spec is ambiguous about what the type of the left-hand side of read_json may be. My assumption is that read_json always returns an object/array/primitive (depending on the contents of the file). The coercion table does not define any implicit coercions from object to struct, but it would not surprise me if Cromwell supports this as its feature set is a superset of the spec. The problem with this is that if you code your workflow assuming it will be run by Cromwell, and use Cromwell-specific features, then it is not portable because runtimes that code strictly to the spec will not run it.
All this is to say that we need to make a change to the spec that either: 1) explicitly says that implicit object to struct coercions are supported, or 2) adds an optional type parameter to read_json (and possibly also read_tsv - I really like this idea) allowing the user to specify the type of the struct they want returned.
I will write up a formal proposal in a separate issue.
Here is the proposal - please comment https://github.com/openwdl/wdl/issues/389
The spec now explicitly states that a coercion from read_json return value to a specific type is requires. In WDL 1.2 we will add the optional second parameter to read_tsv to specify the header. I think that should address everything in this issue.