cycledash
cycledash copied to clipboard
Implement attribute histograms
Again.
with stats as ( select min("sample:DP"::INTEGER) as min,
max("sample:DP"::INTEGER) as max
from genotypes where vcf_id = 222 and sample_name = 'TUMOR' ),
histogram as (
select width_bucket("sample:DP"::INTEGER, min, max, 9) as bucket,
int4range(min("sample:DP"::INTEGER), max("sample:DP"::INTEGER), '[]') as range,
count(*) as freq
from genotypes, stats where vcf_id = 222 and sample_name = 'TUMOR'
group by bucket order by bucket
)
select bucket, range, freq, repeat('*', (freq::float / max(freq) over() * 30)::int) as bar
from histogram;
This gets you part of the way there. The problem is that this uses evenly spaced bars. Instead, we want to truncate buckets early, if, say, we have a lot of outliers. I.e. if we'd have buckets between [1,100], [100, 200], [200, 300], [300,400], etc, but 95% of the variants fall within the first bucket, we'd probably want to have buckets more like [1, 10], [10, 20], …, [100, infinity]. Should be able to do this by modifying the stats CTE + using functions like median etc, rather than min/max.