Identify common (sample) metadata values, and replace them with unique integers

Open fedarko opened this issue 4 years ago • 6 comments

@kwcantrell brought this up in this morning's meeting. Currently sample metadata is stored as follows in the HTML (this is from the moving pictures dataset, formatting modified for ease of reading):

["barcode-sequence", "body-site", "year", "month", "day", "subject", "reported-antibiotic-usage", "days-since-experiment-start"],
[["AGTGCGATGCGT", "gut", "2009.0", "3.0", "17.0", "subject-1", "No", "140.0"],
 ["ATGGCAGCTCTA", "gut", "2008.0", "10.0", "28.0", "subject-2", "Yes", "0.0"],
 ["CTGAGATACGCG", "gut", "2009.0", "1.0", "20.0", "subject-2", "No", "84.0"],
 ["CCGACTGAGATG", "gut", "2009.0", "3.0", "17.0", "subject-2", "No", "140.0"],
 ["CCTCTCGTGATC", "gut", "2009.0", "4.0", "14.0", "subject-2", "No", "168.0"],
 ["ACACACTATGGC", "gut", "2009.0", "1.0", "20.0", "subject-1", "No", "84.0"],
 ["ACTACGTGTGGT", "gut", "2009.0", "2.0", "17.0", "subject-1", "No", "112.0"],
 ["AGCTGACTAGTC", "gut", "2008.0", "10.0", "28.0", "subject-1", "Yes", "0.0"],
 ["ACGATGCGACCA", "left palm", "2009.0", "1.0", "20.0", "subject-1", "No", "84.0"],
 ["AGCTATCCACGA", "left palm", "2009.0", "2.0", "17.0", "subject-1", "No", "112.0"],
 ["ATGCAGCTCAGT", "left palm", "2009.0", "3.0", "17.0", "subject-1", "No", "140.0"],
 ["CACGTGACATGT", "left palm", "2009.0", "4.0", "14.0", "subject-1", "No", "168.0"],
 ["CATATCGCAGTT", "left palm", "2008.0", "10.0", "28.0", "subject-2", "Yes", "0.0"],
 ["CGTGCATTATCA", "left palm", "2009.0", "1.0", "20.0", "subject-2", "No", "84.0"],
 ["CTAACGCAGTCA", "left palm", "2009.0", "3.0", "17.0", "subject-2", "No", "140.0"],
 ["CTCAATGACTCA", "left palm", "2009.0", "4.0", "14.0", "subject-2", "No", "168.0"],
 ["ACAGTTGCGCGA", "right palm", "2008.0", "10.0", "28.0", "subject-1", "Yes", "0.0"],
 ["CACGACAGGCTA", "right palm", "2009.0", "1.0", "20.0", "subject-1", "No", "84.0"],
 ["AGTGTCACGGTG", "right palm", "2009.0", "2.0", "17.0", "subject-1", "No", "112.0"],
 ["CAAGTGAGAGAG", "right palm", "2009.0", "3.0", "17.0", "subject-1", "No", "140.0"],
 ["CATCGTATCAAC", "right palm", "2009.0", "4.0", "14.0", "subject-1", "No", "168.0"],
 ["ATCGATCTGTGG", "right palm", "2008.0", "10.0", "28.0", "subject-2", "Yes", "0.0"],
 ["GCGTTACACACA", "right palm", "2009.0", "3.0", "17.0", "subject-2", "No", "140.0"],
 ["GAACTGTATCTC", "right palm", "2009.0", "4.0", "14.0", "subject-2", "No", "168.0"],
 ["CTCGTGGAGTAG", "right palm", "2009.0", "1.0", "20.0", "subject-2", "No", "84.0"],
 ["CAGTGTCAGGAC", "tongue", "2008.0", "10.0", "28.0", "subject-1", "Yes", "0.0"],
 ["ATCTTAGACTGC", "tongue", "2009.0", "1.0", "20.0", "subject-1", "No", "84.0"],
 ["CAGACATTGCGT", "tongue", "2009.0", "2.0", "17.0", "subject-1", "No", "112.0"],
 ["CGATGCACCAGA", "tongue", "2009.0", "3.0", "17.0", "subject-1", "No", "140.0"],
 ["CTAGAGACTCTT", "tongue", "2009.0", "4.0", "14.0", "subject-1", "No", "168.0"],
 ["CTGGACTCATAG", "tongue", "2008.0", "10.0", "28.0", "subject-2", "Yes", "0.0"],
 ["GAGGCTCATCAT", "tongue", "2009.0", "1.0", "20.0", "subject-2", "No", "84.0"],
 ["GATACGTCCTGA", "tongue", "2009.0", "3.0", "17.0", "subject-2", "No", "140.0"],
 ["GATTAGCACTCT", "tongue", "2009.0", "4.0", "14.0", "subject-2", "No", "168.0"]]

It would be useful to identify common string values in the metadata, map these to unique integers, and then replace these values in the metadata. Then, when looking up sample metadata info in the BIOM table or something, numeric values would be replaced with their original string value. (This'd work because all metadata is stored as strings in Empress right now.)

An example of what this might look like, by just replacing ten nonunique values I arbitrarily picked:

{"gut": 0, "left palm": 1, "right palm": 2, "tongue": 3, "2008.0": 4, "2009.0": 5, "subject-1": 6, "subject-2": 7, "No": 8, "Yes": 9},
["barcode-sequence", "body-site", "year", "month", "day", "subject", "reported-antibiotic-usage", "days-since-experiment-start"],
[["AGTGCGATGCGT", 0, 5, "3.0", "17.0", 6, 8, "140.0"],
 ["ATGGCAGCTCTA", 0, 4, "10.0", "28.0", 7, 9, "0.0"],
 ["CTGAGATACGCG", 0, 5, "1.0", "20.0", 7, 8, "84.0"],
 ["CCGACTGAGATG", 0, 5, "3.0", "17.0", 7, 8, "140.0"],
 ["CCTCTCGTGATC", 0, 5, "4.0", "14.0", 7, 8, "168.0"],
 ["ACACACTATGGC", 0, 5, "1.0", "20.0", 6, 8, "84.0"],
 ["ACTACGTGTGGT", 0, 5, "2.0", "17.0", 6, 8, "112.0"],
 ["AGCTGACTAGTC", 0, 4, "10.0", "28.0", 6, 9, "0.0"],
 ["ACGATGCGACCA", 1, 5, "1.0", "20.0", 6, 8, "84.0"],
 ["AGCTATCCACGA", 1, 5, "2.0", "17.0", 6, 8, "112.0"],
 ["ATGCAGCTCAGT", 1, 5, "3.0", "17.0", 6, 8, "140.0"],
 ["CACGTGACATGT", 1, 5, "4.0", "14.0", 6, 8, "168.0"],
 ["CATATCGCAGTT", 1, 4, "10.0", "28.0", 7, 9, "0.0"],
 ["CGTGCATTATCA", 1, 5, "1.0", "20.0", 7, 8, "84.0"],
 ["CTAACGCAGTCA", 1, 5, "3.0", "17.0", 7, 8, "140.0"],
 ["CTCAATGACTCA", 1, 5, "4.0", "14.0", 7, 8, "168.0"],
 ["ACAGTTGCGCGA", 2, 4, "10.0", "28.0", 6, 9, "0.0"],
 ["CACGACAGGCTA", 2, 5, "1.0", "20.0", 6, 8, "84.0"],
 ["AGTGTCACGGTG", 2, 5, "2.0", "17.0", 6, 8, "112.0"],
 ["CAAGTGAGAGAG", 2, 5, "3.0", "17.0", 6, 8, "140.0"],
 ["CATCGTATCAAC", 2, 5, "4.0", "14.0", 6, 8, "168.0"],
 ["ATCGATCTGTGG", 2, 4, "10.0", "28.0", 7, 9, "0.0"],
 ["GCGTTACACACA", 2, 5, "3.0", "17.0", 7, 8, "140.0"],
 ["GAACTGTATCTC", 2, 5, "4.0", "14.0", 7, 8, "168.0"],
 ["CTCGTGGAGTAG", 2, 5, "1.0", "20.0", 7, 8, "84.0"],
 ["CAGTGTCAGGAC", 3, 4, "10.0", "28.0", 6, 9, "0.0"],
 ["ATCTTAGACTGC", 3, 5, "1.0", "20.0", 6, 8, "84.0"],
 ["CAGACATTGCGT", 3, 5, "2.0", "17.0", 6, 8, "112.0"],
 ["CGATGCACCAGA", 3, 5, "3.0", "17.0", 6, 8, "140.0"],
 ["CTAGAGACTCTT", 3, 5, "4.0", "14.0", 6, 8, "168.0"],
 ["CTGGACTCATAG", 3, 4, "10.0", "28.0", 7, 9, "0.0"],
 ["GAGGCTCATCAT", 3, 5, "1.0", "20.0", 7, 8, "84.0"],
 ["GATACGTCCTGA", 3, 5, "3.0", "17.0", 7, 8, "140.0"],
 ["GATTAGCACTCT", 3, 5, "4.0", "14.0", 7, 8, "168.0"]]

The metadata looks a lot smaller (and I haven't even replaced nonunique stuff in the month/day/etc. fields). For massive datasets with lots of metadata (e.g. the EMP) this could be really useful.
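
For reference, a minimal sketch of what building that mapping could look like in Python is below (the function and variable names are hypothetical, not existing Empress code):

from collections import Counter

# Hypothetical sketch: map every value that occurs more than once to an
# integer code, and leave unique values (e.g. the barcodes) as strings.
def encode_sample_metadata(rows):
    # rows: list of per-sample lists of metadata values (all strings)
    counts = Counter(v for row in rows for v in row)
    value_to_int = {}
    for value, count in counts.items():
        if count > 1:
            value_to_int[value] = len(value_to_int)
    encoded_rows = [[value_to_int.get(v, v) for v in row] for row in rows]
    return value_to_int, encoded_rows

# On lookup, an integer code can be translated back to its original string by
# inverting the mapping: int_to_value = {i: v for v, i in value_to_int.items()}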

fedarko avatar Aug 18 '20 20:08 fedarko

This kind of encoding will be greatly beneficial for taxonomy since a lot of nodes will share level 1/2/... values.

It might be better to reverse {"gut": 0, "left palm": 1, "right palm": 2, "tongue": 3, "2008.0": 4, "2009.0": 5, "subject-1": 6, "subject-2": 7, "No": 8, "Yes": 9}

and make it {0: "gut", 1: "left palm", 2: "right palm", 3: "tongue", 4: "2008.0", 5: "2009.0", 6: "subject-1", 7: "subject-2", 8: "No", 9: "Yes"}

That way, converting from a number to its value would be easier.

Also, do you think it would be worthwhile to have some condition such as: if > 20% of the values in a metadata field share a value, then encode that field using the above suggested method? The reason being that, for example, the majority (or all?) of the samples have a unique barcode-sequence in the above example, so encoding them would add memory, since we would still need to store all of the sequences plus an additional number to represent each one.

kwcantrell avatar Aug 19 '20 01:08 kwcantrell

Agreed that this'll help a lot. I think we could go even a bit further and just store the encodings in an array, similar to how we handled treeData:

["gut", "left palm", "right palm", "tongue", "2008.0", "2009.0", "subject-1", "subject-2", "No", "yes"]

... which can be accessed the same way (0 -> "gut", etc.). As an added bonus, if we set the encodings so that the first element is the most common value, the second is the next most common value, etc., then it'll be really easy to interpret this line in the HTML and say "what strings are most common throughout this metadata?".
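
A rough sketch of that frequency-ordered array idea, again with hypothetical names:

from collections import Counter

def build_encoding_array(rows, min_count=2):
    counts = Counter(v for row in rows for v in row)
    # most_common() sorts by descending frequency, so index 0 ends up being
    # the most common string across the whole metadata table.
    encoding_array = [v for v, c in counts.most_common() if c >= min_count]
    value_to_index = {v: i for i, v in enumerate(encoding_array)}
    encoded_rows = [[value_to_index.get(v, v) for v in row] for row in rows]
    return encoding_array, encoded_rows

# Decoding is then just an index lookup: encoding_array[0] -> "gut", etc.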

Also, do you think it would be worthwhile to have some condition such as: if > 20% of the values in a metadata field share a value, then encode that field using the above suggested method? The reason being that, for example, the majority (or all?) of the samples have a unique barcode-sequence in the above example, so encoding them would add memory, since we would still need to store all of the sequences plus an additional number to represent each one.

This is a good question - there's probably a fancy way of determining if it's "worth" encoding a given value or not (probably involves looking at the value's length, and how many times it occurs, versus the added memory / space to encode it). As a first pass, though, I think it makes sense to just encode all values that occur, say, more than once (or maybe more than twice).
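
As a very rough illustration, that check might look something like the following (the 2-character quoting overhead and the 5-digit code size are guesses, not anything Empress currently does):

def worth_encoding(value, count):
    # Cost of leaving the value alone: the quoted string, once per occurrence.
    plain_cost = count * (len(value) + 2)
    # Cost of encoding it: one copy of the quoted string in the lookup
    # structure, plus an integer code (assume <= 5 digits) at each occurrence.
    encoded_cost = (len(value) + 2) + count * 5
    return encoded_cost < plain_cost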

in a metadata field

I'm not confident about this, but I think it makes sense to allow "duplicates" across metadata fields. A reason being that if, say, a metadata file has like 5 different Year fields, then it makes sense to lump those all together.

I think this is going to be pretty useful :)

fedarko avatar Aug 19 '20 01:08 fedarko

Agreed that this'll help a lot. I think we could go even a bit further and just store the encodings in an array, similar to how we handled treeData:

["gut", "left palm", "right palm", "tongue", "2008.0", "2009.0", "subject-1", "subject-2", "No", "yes"]

I like this solution.

I'm not confident about this, but I think it makes sense to allow "duplicates" across metadata fields. A reason being that if, say, a metadata file has like 5 different Year fields, then it makes sense to lump those all together.

That's a good point. It would make a lot more sense to encode, for example, "gut" the same across all metadata fields.

This is a good question - there's probably a fancy way of determining if it's "worth" encoding a given value or not (probably involves looking at the value's length, and how many times it occurs, versus the added memory / space to encode it). As a first pass, though, I think it makes sense to just encode all values that occur, say, more than once (or maybe more than twice).

In that case, maybe as a first pass, we should just encode all values regardless of their occurrence.

kwcantrell avatar Aug 19 '20 01:08 kwcantrell

The more general solution to this issue would be to support data compression in Python and JS. For example, the JSON string representing the mapping file could be zipped and the byte stream encoded in base64 for JavaScript to read, decompress, and load from JSON. It seems like there's a JS library for handling zipped data: https://gildas-lormeau.github.io/zip.js/core-api.html

Zipping the EMP mapping file we've been using for testing makes the data go from 27 MB to 2.3 MB.
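
For illustration, the Python side of that could be as simple as the standard-library zlib + base64 combination sketched below (this is just a sketch, not existing Empress code; the JS side would base64-decode and decompress with zip.js, pako, or a similar library):

import base64
import json
import zlib

def compress_for_html(obj):
    # Serialize to JSON, compress, then base64-encode so the result can be
    # embedded directly in the HTML as text.
    raw = json.dumps(obj).encode("utf-8")
    return base64.b64encode(zlib.compress(raw, 9)).decode("ascii")

def decompress_from_html(b64_text):
    # Mirrors what the JS side would do after reading the embedded string.
    return json.loads(zlib.decompress(base64.b64decode(b64_text)))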

ElDeveloper avatar Aug 19 '20 15:08 ElDeveloper

@ElDeveloper that is probably the better solution, because we wouldn't have to refactor the JS. All we would need to do is implement the compression/decompression.

kwcantrell avatar Aug 19 '20 16:08 kwcantrell

I took a bit of time and put together an early version of the Python compression code for this. Here's what the code produces on the moving pictures sample metadata, with every non-unique value compressed (compare with the stuff above on the same dataset):

["barcode-sequence", "body-site", "year", "month", "day", "subject", "reported-antibiotic-usage",
 "days-since-experiment-start"],
["2009.0", "No", "subject-1", "subject-2", "17.0", "right palm", "tongue", "gut", "3.0", "140.0",
 "1.0", "20.0", "84.0", "left palm", "2008.0", "10.0", "28.0", "Yes", "0.0", "4.0", "14.0",
 "168.0", "2.0", "112.0"],
[["AGTGCGATGCGT", 7, 0, 8, 4, 2, 1, 9],
 ["ATGGCAGCTCTA", 7, 14, 15, 16, 3, 17, 18],
 ["CTGAGATACGCG", 7, 0, 10, 11, 3, 1, 12],
 ["CCGACTGAGATG", 7, 0, 8, 4, 3, 1, 9],
 ["CCTCTCGTGATC", 7, 0, 19, 20, 3, 1, 21],
 ["ACACACTATGGC", 7, 0, 10, 11, 2, 1, 12],
 ["ACTACGTGTGGT", 7, 0, 22, 4, 2, 1, 23],
 ["AGCTGACTAGTC", 7, 14, 15, 16, 2, 17, 18],
 ["ACGATGCGACCA", 13, 0, 10, 11, 2, 1, 12],
 ["AGCTATCCACGA", 13, 0, 22, 4, 2, 1, 23],
 ["ATGCAGCTCAGT", 13, 0, 8, 4, 2, 1, 9],
 ["CACGTGACATGT", 13, 0, 19, 20, 2, 1, 21],
 ["CATATCGCAGTT", 13, 14, 15, 16, 3, 17, 18],
 ["CGTGCATTATCA", 13, 0, 10, 11, 3, 1, 12],
 ["CTAACGCAGTCA", 13, 0, 8, 4, 3, 1, 9],
 ["CTCAATGACTCA", 13, 0, 19, 20, 3, 1, 21],
 ["ACAGTTGCGCGA", 5, 14, 15, 16, 2, 17, 18],
 ["CACGACAGGCTA", 5, 0, 10, 11, 2, 1, 12],
 ["AGTGTCACGGTG", 5, 0, 22, 4, 2, 1, 23],
 ["CAAGTGAGAGAG", 5, 0, 8, 4, 2, 1, 9],
 ["CATCGTATCAAC", 5, 0, 19, 20, 2, 1, 21],
 ["ATCGATCTGTGG", 5, 14, 15, 16, 3, 17, 18],
 ["GCGTTACACACA", 5, 0, 8, 4, 3, 1, 9],
 ["GAACTGTATCTC", 5, 0, 19, 20, 3, 1, 21],
 ["CTCGTGGAGTAG", 5, 0, 10, 11, 3, 1, 12],
 ["CAGTGTCAGGAC", 6, 14, 15, 16, 2, 17, 18],
 ["ATCTTAGACTGC", 6, 0, 10, 11, 2, 1, 12],
 ["CAGACATTGCGT", 6, 0, 22, 4, 2, 1, 23],
 ["CGATGCACCAGA", 6, 0, 8, 4, 2, 1, 9],
 ["CTAGAGACTCTT", 6, 0, 19, 20, 2, 1, 21],
 ["CTGGACTCATAG", 6, 14, 15, 16, 3, 17, 18],
 ["GAGGCTCATCAT", 6, 0, 10, 11, 3, 1, 12],
 ["GATACGTCCTGA", 6, 0, 8, 4, 3, 1, 9],
 ["GATTAGCACTCT", 6, 0, 19, 20, 3, 1, 21]]

Even on this small dataset, the space saving is pretty clear -- ls -ahlt puts the sample metadata info above at 1.8K and the old sample metadata info at 2.8K. For the EMP empress.html file (without feature metadata since I don't have it), this takes it down from 119 MB to 108 MB.

I think this method may have some merit besides (or in addition to) using zip data compression; with that, as far as I can tell, we'd still need to uncompress the original data -- which would involve loading a lot of redundant strings. Here, even the uncompressed data takes up less space, since it's mostly numbers. (Also, while implementing zip.js might take some careful thinking and refactoring, fumbling my way through this solution has gone pretty quickly ...)

As @kwcantrell mentioned, this approach should be applicable to feature metadata as well (and would likely be even more useful there, since there are gonna be lots of "k__Bacteria"s and so on).

fedarko avatar Aug 30 '20 03:08 fedarko