
Investigate encoding gnomAD, frequency and pathogenicity data separately in AlleleProperties proto

Open · julesjacobsen opened this issue 3 years ago • 3 comments

Currently all allele data is stored as a map<string, float> in an AlleleProperties message. The on-disk size of this data is about 60 GB when unpacked (and even that is lightly compressed) and 40 GB when compressed for distribution.

message AlleleProperties {
    string rs_id = 1;
    map<string, float> properties = 2;
    ClinVar clinVar = 3;
}

A single gnomAD entry requires about 16 bytes to encode: a float is 4 bytes and a string is 1 byte per character, so a key/value pair such as GNOMAD_G_EAS = 0.0056 is 12 bytes + 4 bytes = 16 bytes per variant (see the protobuf encoding docs for details).
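
For illustration, a quick size check along these lines could confirm the figure, assuming the Java classes generated from the schema above are available under the AlleleProto outer class referenced later in this thread (a minimal sketch, not existing Exomiser code; import of the generated class omitted):

// Sketch: measure the wire size of a single map entry in the current encoding.
AlleleProto.AlleleProperties props = AlleleProto.AlleleProperties.newBuilder()
        .putProperties("GNOMAD_G_EAS", 0.0056f)
        .build();
// Expect around 21 bytes: 12 key characters + 4 float bytes of payload,
// plus ~5 bytes of protobuf tag/length framing around the map entry.
System.out.println(props.getSerializedSize());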

An alternative would be to encode the gnomAD data as a message using enum and int32 types, which protobuf encodes more compactly. This would require roughly 5-7 bytes to encode the equivalent of an existing key: value pair, whilst adding valuable information about the allele count, allele number and number of homozygotes. The allele frequency can be calculated from the allele count and allele number.


message AlleleProperties {
    string rs_id = 1;
    map<string, float> properties = 2;
    ClinVar clinVar = 3;
    repeated GnomadData gnomadData = 4;
}

message GnomadData {
    SampleSet sampleSet = 1; // 1 byte (Varint)
    Population pop = 2; // 1 byte (Varint)
    // AF doesn't need storing as it is calculated as AC/AN
    int32 ac = 3; // 1 byte (Varint)
    int32 an = 4; // 1-3 bytes (varint: 3 bytes holds values up to 2^21 - 1 = 2,097,151)
    int32 hom = 5; // 1 byte
}

enum SampleSet {
    EXOME = 0;
    GENOME = 1;
}

enum Population {
    AFR = 0;
    AMI = 1;
    AMR = 2;
    ASJ = 3;
    EAS = 4;
    FIN = 5;
    NFE = 6;
    MID = 7;
    SAS = 8;
    OTH = 9;
}
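
For comparison, the message-based encoding can be checked the same way. The sketch below assumes the generated classes from the schema above under the same AlleleProto outer class; exact totals depend on tag bytes per field and on proto3 dropping default values such as hom = 0:

// Sketch: build one GnomadData entry and print its wire size.
// Setter names follow standard protobuf-java codegen for the fields above.
AlleleProto.GnomadData gnomadEas = AlleleProto.GnomadData.newBuilder()
        .setSampleSet(AlleleProto.SampleSet.GENOME)
        .setPop(AlleleProto.Population.EAS)
        .setAc(9)
        .setAn(1600) // 1600 needs a 2-byte varint
        .setHom(0)   // proto3 drops default values, so this costs nothing on the wire
        .build();
// Prints the per-entry size for comparison with the ~16 byte map entry above
// (single-digit bytes for typical AC/AN values, before the repeated-field framing).
System.out.println(gnomadEas.getSerializedSize());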

Encoding the gnomAD data as a message like this could save about 50% of the required disk space, whilst adding more useful data, due to the way float, int32 and string values are encoded in protobuf.

gnomAD 3.1 has 644,267,978 variants passing quality filters observed in 76,156 genome samples. Assuming an average of 3 alleles per variant:

644,267,978 * 3 * 16 = 30924862944 B = 30.9 GB
644,267,978 * 3 * 7  = 13529627538 B = 13.5 GB

This is a substantial saving! Potentially there are speed gains to be had too, as these would be easy to convert to Frequency objects, although that class would also require changes.

Other gains would come from removing the 1000 Genomes, ExAC and TOPMed data sets, as these have been included in gnomAD.

Re-designing the pathogenicity scores could also help, but these will still require the 'float' for the score.

Downsides

Older Exomiser versions will not be able to use the new data, and this will not be apparent to the casual user who just updates the data but doesn't read the README; in this case the data will simply appear to be missing. To mitigate this, deliberately breaking the protobuf field numbering should hopefully produce nonsense data, resulting in a crash rather than silently incorrect results.

Ideally all frequency data would conform to this format, but it doesn't: many (most?) sources only report the frequency, so these would need to be encoded using something closer to the existing Frequency class:

message Frequency {
    string frequencySource = 1; // or define a FrequencySource Enum
    float frequency = 2; 
}

message AlleleProperties {
    string rs_id = 1;
    repeated GnomadData gnomadData = 2; // note: numbering deliberately broken to prevent old versions reading this field.
    repeated Frequency frequencies = 4;
    repeated PathogenicityScore pathogenicityScores = 5;
    ClinVar clinVar = 3;
}

So the AlleleProtoAdaptor and AlleleConverter would need adapting to this, but it should be more efficient as the models will be more closely aligned. However the serialisation code (AlleleConverter and Allele) plus more in data-genome might need more substantial changes.

Alternatives

Carry on as before: suck up ever-increasing disk usage and data egress costs, and miss out on richer data enabling things such as a more accurate PM2 ACMG assignment (where hom == 0 is required for AR/XR disorders, along with a low allele frequency).

Just remove the 1000 Genomes, ExAC and TOPMed data: benefit from backwards compatibility but miss out on the improved information content.

julesjacobsen · Jun 08 '22 11:06

The UK10K and ALFA data also include AN and AC, but not homozygote counts, so the GnomadData message may be more generally applicable. These data could then also be used to produce a pop max aggregate frequency.
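
A sketch of how that aggregate could be derived from the counts (a hypothetical helper, not existing Exomiser code; GnomadData as defined in the message above):

// Hypothetical helper: pop max allele frequency derived from the per-population AC/AN counts.
static float popMaxFrequency(java.util.List<AlleleProto.GnomadData> gnomadData) {
    float popMax = 0f;
    for (AlleleProto.GnomadData population : gnomadData) {
        if (population.getAn() != 0) {
            // AF = AC / AN for this population; keep the highest value seen.
            popMax = Math.max(popMax, (float) population.getAc() / population.getAn());
        }
    }
    return popMax;
}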

julesjacobsen · Jun 09 '22 13:06

Also, the UKB data (https://www.decode.com/ukbsummary/), whatever format that's in.

UKB is now part of gnomAD v4.0 - see issue #528

julesjacobsen · Jul 21 '22 11:07

This is a long comment - scroll to the conclusions section at the end if you don't want the details...

Using the following as the new schema:

message AlleleProperties {
    string rs_id = 1;
    map<string, float> properties = 2 [deprecated = true];
    ClinVar clinVar = 3;
    repeated Frequency frequencies = 4;
    repeated PathogenicityScore pathogenicityScores = 5;
}

message Frequency {
    FrequencySource frequencySource = 1; // 1 byte (Varint) + 2 bytes to embed in the Frequency
    // Population population = 2; // 1 byte (Varint) not used  as the overhead of another embedded object was another 2 bytes 
    // AF doesn't need storing as it is calculated as AC/AN
    int32 ac = 3; // 1 byte (Varint)
    int32 an = 4; // 1-3 bytes (varint: 3 bytes holds values up to 2^21 - 1 = 2,097,151)
    int32 hom = 5; // 1 byte
    float frequency = 6; // 4 bytes // only required if the AN/AC isn't available for a dataset, takes zero space if not present
    string dataSource = 7; // ? bytes
}

enum FrequencySource {
    UNSPECIFIED_FREQUENCY_SOURCE = 0;

    KG = 1 [deprecated = true];
    TOPMED = 2 [deprecated = true];
    UK10K = 3;

    ESP_EA = 4;
    ESP_AA = 5;
    ESP_ALL = 6;

    GNOMAD_E_AFR = 14;
    GNOMAD_E_AMR = 15;
    // ... etc

    GNOMAD_G_AFR = 22;
    GNOMAD_G_AMR = 23;
    // ... etc
}

enum Population {
    ALL = 0;
    AFR = 1;
    AMI = 2;
    // ... etc
}

message PathogenicityScore {
    PathogenicitySource pathogenicitySource = 1;
    float score = 2;
}

enum PathogenicitySource {
    UNKNOWN_PATH_SOURCE = 0;
    VARIANT_EFFECT = 1;
    // ... plus others
}
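
For reference, converting one of these records back into an allele frequency is then a simple calculation, falling back to the stored float for frequency-only sources. This is only a sketch against the schema above (whether the float is stored as a fraction or a percentage is a decision left open here; the sketch assumes the same scale as AC/AN):

// Sketch: AF = AC/AN when the counts are present, otherwise fall back to the
// stored frequency value (frequency-only sources such as ESP or the deprecated KG/TOPMED).
static float alleleFrequency(AlleleProto.Frequency freq) {
    if (freq.getAn() != 0) {
        return (float) freq.getAc() / freq.getAn();
    }
    return freq.getFrequency();
}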

Disk Usage

gnomAD exomes allele frequencies - gzipped VCF 5.3 GB

encoding    raw proto (MB)    gzip compressed proto (MB)    MVStore size (includes key and rsid) (MB)
original    858               250                           478
new         589               227                           448

So there is only a dramatic space saving if the database isn't compressed (which it is). A good 50% of the database is actually the key set and rsid, which come to 255 MB alone.

Similarly for the new pathogenicity score encoding of all of dbNSFP (the raw and gzip-compressed protobuf sizes were not measured):

encoding    raw proto (MB)    gzip compressed proto (MB)    MVStore size (includes key and rsid) (GB)
original    n/a               n/a                           4.2
new         n/a               n/a                           4.0

There is only a modest 200 MB space saving for the compressed database.

The biggest gains come from not including any of the 1000 Genomes, UK10K or TOPMed data - removing these (assuming the data is present in gnomAD) while retaining the original encoding reduces the database by ~50%. Further removing all variants which only have an rsid (i.e. no frequency or pathogenicity data) reduces the data by another ~50%.

2202_hg19                                                          MVStore size (GB)
Original release                                                   62.4
Remove(TOPMed, 1K, ExAC), new path encoding                        32.6
Remove(TOPMed, 1K, ExAC, rsid only variants), new path encoding    14.4

Application Performance

Decoding throughput performance - converting 15.6 million variants in the gnomAD exomes dataset to Frequency objects and calculating the most common allele frequency:

encoding    time (s)
original    19.5
new         17.6

So the conversion into domain Frequency objects is a bit faster using the new encoding. However, Exomiser doesn't store all the frequencies as objects; instead they are stored as primitive arrays inside a FrequencyData object. Here we might lose out on overall memory usage when the application is running, which would be bad, as we're now storing 1 float as 2 ints plus an extra int for the homs: 4 bytes vs 12 bytes...

encoding                        time (s)    memory usage (MB)
original FrequencyData          27          2052
new FrequencyData               23          2046
AlleleProto.AlleleProperties    >60         >4096

With a bit of optimisation (storing the rsid as a byte[], compressing all the AN, AC, HOM data into a single int[], and using a builder without intermediate Frequency object creation) it's possible to make the data fit into about the same (or slightly less) memory and gain a few seconds when creating 15.5 million objects. Contrast this with just wrapping and using the AlleleProperties objects directly, which took more than the 4 GB available to run the test and over 60 s due to continuous GC.
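
As an illustration of the packing idea (a hypothetical layout, not the actual FrequencyData internals): interleaving each record's source, AC, AN and HOM into one flat int[] keeps everything in a single array per allele and recomputes the frequency on demand.

// Hypothetical sketch of the single-int[] packing: [sourceOrdinal, ac, an, hom] per record.
class PackedFrequencies {
    private static final int STRIDE = 4;
    // layout: sourceOrdinal, ac, an, hom, sourceOrdinal, ac, an, hom, ...
    private final int[] data;

    PackedFrequencies(int[] data) {
        this.data = data;
    }

    int size() {
        return data.length / STRIDE;
    }

    int sourceOrdinal(int i) {
        return data[i * STRIDE];
    }

    float frequency(int i) {
        // AF recomputed on demand rather than stored as a float.
        int ac = data[i * STRIDE + 1];
        int an = data[i * STRIDE + 2];
        return an == 0 ? 0f : (float) ac / an;
    }

    int homs(int i) {
        return data[i * STRIDE + 3];
    }
}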

Conclusions

Using the new protobuf schema alone will result in a small gain in speed and disk usage, at the cost of breaking changes to the database and codebase, plus losing the ability to represent a population frequency as a single float value.

Huge (~50%-75%) space savings will come from simply removing all alleles with only an rsID. This seems like a reasonable thing to save on, but it will lessen the value of the KnownVariantFilter() (not part of the standard setup) which uses the rsID to filter out alleles in dbSNP without known frequencies.

All the frequency-only data (TOPMed and 1K) and the rsIDs without known frequencies are extracted from dbSNP, and the TOPMed, 1K and ExAC data are all contained in the gnomAD 2.1+ and ALFA releases. As a bonus, gnomAD 3.1 also includes pre-computed SpliceAI scores.

So, apart from the downsides of breaking changes, the inability to represent frequency-only data and the loss of rsIDs without any associated frequency data, I'd say that on balance it will be worth making these changes. @damiansm @pnrobinson would you agree?

julesjacobsen · Sep 06 '22 13:09