cycledash icon indicating copy to clipboard operation
cycledash copied to clipboard

Implement attribute histograms

Open ihodes opened this issue 10 years ago • 1 comments

Again.

ihodes avatar Nov 20 '14 22:11 ihodes

with stats as ( select min("sample:DP"::INTEGER) as min, 
                       max("sample:DP"::INTEGER) as max 
                from genotypes where vcf_id = 222 and sample_name = 'TUMOR' ),
     histogram as (
         select width_bucket("sample:DP"::INTEGER, min, max, 9) as bucket,
                int4range(min("sample:DP"::INTEGER), max("sample:DP"::INTEGER), '[]') as range,
                count(*) as freq
         from genotypes, stats where vcf_id = 222 and sample_name = 'TUMOR'
         group by bucket order by bucket
     )
select bucket, range, freq, repeat('*', (freq::float / max(freq) over() * 30)::int) as bar 
from histogram;

This gets you part of the way there. The problem is that this uses evenly spaced bars. Instead, we want to truncate buckets early, if, say, we have a lot of outliers. I.e. if we'd have buckets between [1,100], [100, 200], [200, 300], [300,400], etc, but 95% of the variants fall within the first bucket, we'd probably want to have buckets more like [1, 10], [10, 20], …, [100, infinity]. Should be able to do this by modifying the stats CTE + using functions like median etc, rather than min/max.

ihodes avatar Mar 05 '15 18:03 ihodes