gcp-variant-transforms
Dedicated flags for merge-vcf-headers pipeline
While running VT on large inputs, we need to provision enough computational power to meet the input's requirements (for example, --worker_machine_type=n1-highmem-64 and --num_workers=64).
Using the same settings for the merge-vcf-headers pipeline is overkill and causes both higher cost and longer run time. We might need to provide dedicated flags for merge-vcf-headers (or at least do so when --optimize_for_large_inputs is set).
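For illustration, a minimal sketch of the proposed split. The --worker_machine_type and --num_workers values come from the issue above; the separate merge-headers machine type and its flag name are hypothetical, not existing VT options:

```shell
# Flags from the issue, sized for the main large-input pipeline.
MAIN_ARGS="--worker_machine_type=n1-highmem-64 --num_workers=64"

# Hypothetical dedicated override for the merge-vcf-headers stage
# (flag name is a sketch, not an existing VT option); a small default
# machine should suffice for header merging.
MERGE_HEADERS_ARGS="--merge_headers_machine_type=n1-standard-1"

echo "main pipeline: ${MAIN_ARGS}"
echo "merge-vcf-headers: ${MERGE_HEADERS_ARGS}"
```

The idea is simply that the two pipelines would read their worker settings from different flags instead of sharing one set.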
Isn't autoscaling handling this properly? Can you provide links to two Dataflow runs for the same set of VCFs where the merge-headers part took significantly longer with more resources? (I agree that the cost might be higher, but I think the cost of header merging is only a tiny part of the total cost for large inputs.)
Sorry, I meant --worker_machine_type and --disk_size_gb (including --num_workers was wrong). All I am saying is that I don't see any reason to use the same machine type for the main pipeline and the merge-vcf-headers pipeline.

I should also mention that I filed this issue when we were speculating that our workers were failing due to lack of memory. Later we realized it was actually because of disk. Since disk is not too expensive, I guess it's fine to use the same amount of disk for both pipelines. On the other hand, for machine_type I am not 100% sure whether we need larger machines when processing large inputs; if that's the case, then we had better distinguish between the machine types of the different pipelines.