PyVCF
PyVCF copied to clipboard
model.py: add_field function added in Call class, for adding or updating...
I think it could be a good improvement to allow the PyVCF module to be able to add or update information fields in Call objects, that is, in the FORMAT fields of the .vcf files. This feature could make possible to perform other interesting functions like merging of .vcf files coming from different variant callers, etc. I will be glad to include some of these functionalities in future releases.
Changes performed: model.py: add_field function added in Call class, for adding or updating a field of a call object, that is, it allow to modify calls
Thanks @gitanoqevaporelmundoentero for the pull request. I have also seen your earlier comment, but haven't had time to respond to it. Let me just link to #82 as a reference for now so the information here is complete.
Thanks, this looks good but I'm not really willing to merge it without any tests.
Can you add some, please?
Also, idea is normalising whitespace which is a bit irritating. But it has moved comments to the wrong indentation which is a bit annoying!
Hi, @jamescasbon! I'd like to come back to this pull-request. Regarding your request on tests, I'm thinking of adding 2 vcf files from different sources (GATK and VarScan, with different INFO and FORMAT fields) and include a short script for performing the merge of these 2 files using the PyVCF library. I would include the template files and script in the "scripts" folder. Is this ok?
Please let me know your opinion.
Added test files (vct/test/gatk_sample.vcf, vcf/test/varscan_sample.vcf, also gzipped and tabix indexed). Created script for merging files (scripts/vcf_merge.py). Included example of merge command (scripts/merge_test.txt).
Hi, thanks for adding tests. I will check this out now with a view to merging. ccing @martijnvermaat in case he gets there first.
Yes, we should try to merge this.
It would take some time for me to go through it, but just skimming I see a number of issues I would like to see fixed before a merge:
- Whitespace/indentation irregularities and some very long lines.
- If the info in
scripts/merge_test.txt
is really needed, can we think of another place for it? - Use of deprecated
optparse
module (our other scripts already useargparse
). - Can we replace the use of global variables with function arguments?
- What are the
sender
andreceivers
variables for? - Many unused imports.
- I think the
--verbose
and--debug
arguments are nice during development, but probably don't belong in the resulting script. - The
LICENSE: This script belongs to Sistemas Genomicos S.L.
line should at least be reworded, since I don't think it's compatible with our license (or any Open Source license). I suggest to just state copyright for you and the company.
All in all this PR needs some work in my opinion.
Sorry for not giving some guidance on this earlier. We do want the feature in PyVCF and I really appreciate all the work!
Some workflow related points:
-
I don't think you should have merged with current master during your work. Since we don't have any guidelines on this, I really cannot blame you (I'd rather blame Git), but I much prefer rebasing for this (or leave it to the maintainers).
-
The PR now consists of 24 commits, many of which could be squashed to make things more understandable for others. If you want to give it a shot, you could try Git's interactive rebase:
git rebase -i b8c0af7e # this was your original master git push origin master -f # since you're rewriting history, you need -f
Perhaps we should add some contribution guidelines to our documentation, especially on the recommended Git workflow. Or just link to a good external document for this.
In the end it is not all that important though, since we could always just squash the entire PR into one commit when merging.
Sorry for having submitted all my changes & tests without a prior exhaustive revision. I'm checking all the points you listed in your last 2 comments, and will come back with new checked changes as soon as possible.
I think I'm ready with the changes for the moment. A brief list of the changes:
- I've formatted the scripts and checked them with 'pep8 --max-line-length=100'
- I've removed file scripts/merge_test.txt, and included the example command in the vcf_merge.py script
- I've removed not necessary elements from script, and added necessary functions
- I've rebased to the original commit, I hope successfully, I'm not very used to "rebase" with git...
I will be waiting for further suggestions of changes for the PR. Thanks, Oscar
"Print" statements have been substituted to "sys.stderr.write" for compatibility with python 3
Since I haven't received any more feedback, I've decided to go through my pull request. I've made more consistent changes, regarding private methods being called from outside propietary classes and simplifying the code. To summarize:
- In model.py: Record class: changes in merge function and add_call function added
- In parser.py: Reader class: merge method added. Method merge_info_from_vcf_files added in parser.py I hope to get these changes included soon. Please let me know if I can collaborate in any other way.
New changes included:
- parser.py: Writer class: add_attr method created, for adding infos, formats, etc.
- parser.py: Writer class: init method: now this method does not write header until first record will be written, since further information may be added. Changes in _write_header and write_record were therefore needed.
Hope to hear from you soon again...
Sorry for not getting here earlier. I think there are still some issues to be resolved before this can be merged, but a full review will take me a lot of time. Let's see how we can best get there.
The main problem is the size of the pull request. I think it can be split into several smaller logically self-contained changes, which will be much easier to discuss and get merged.
Particularly, I think the main subject of this PR (adding and editing call data) can be isolated from for example the VCF merge script. The latter depends on the first, but not the other way around.
I noted two more things:
- Adding and editing call data also still needs some unit tests, directly on the call data mechanics. The current test(data) in this PR works via the VCF merge script and is therefore more an integration test and too high level (but also useful).
- Another thing that's not clear to me is how (or if) this approach differs from the quick mockup James gave in this comment.
I guess it will again take some work, but separating concerns will help focus the discussion. Also, I have to thank you for the updates, the code has already improved a lot.
Hello, are you still interested in merging this PR? I was trying to bring some light into my changes with an git-interactive rebase, but I'm not coming through. If you are still interested, I would prefer to start from scratch in a new fork, and include there my changes. Is it ok?
Anyway I've finally performed a review of the code included in my PR, and I hope that my changes look a bit better (I've made some rebasing on them). If still interested in merging this PR, please let me know how should we continue. Thanks!
Finally I've completely rebuilt merge utilities, by creating a new class in parser: MergeReader. This class allows to parse multiple sorted vcf files at a time, returning merged records whenever it's possible. Merge script (vcf_merge.py) and parser functions get quite simplified and consistent. I've also included some unit tests for af (allelic frequencies) property. And I've also tried to rebase my commit history (although I think something is still wrong with it, any advices are welcome).
Please let me know how should I go further with this PR.
Thanks!
I've read the discussion above but am still unclear -- is there currently a way to add new fields to the genotype?
Is this dead on the vine?