corona
Using Natural Language Transformers for Classification
Glad I stumbled upon this project - was working on a theory using the same base dataset.
Since proteins/genes are essentially sequences of letters, that led me to the idea of using Transformer models like BERT to classify sequences by their structure. If that theory holds, I'd want to try a multi-task approach: pairing the valid treatment sequence with the virus sequence and looking at whether the model can predict the treatment sequence given the input virus sequence.
I haven't studied the structure as much as you guys probably have - so I'd defer to you on whether this would be plausible/feasible given what we know so far.
Here are a few other starting points I've looked at:
ReSimNet: Drug Response Similarity Prediction using Siamese Neural Networks (Jeon and Park et al., 2018)
https://github.com/dmis-lab/ReSimNet
BERN is a BioBERT-based multi-type NER tool that also supports normalization of extracted entities.
https://github.com/dmis-lab/bern
Hmm, so I don't know what you mean by "treatment sequence." Usually, I've seen these transformer models trained as big unsupervised predictors of the next character.
The idea would be modeling it after something like the SQuAD/SWAG datasets for question answering, where you typically have a large body of text as the initial context (the virus sequence), followed by the answer and the positions of the spans for that answer, if found in the text (the vaccine/cure sequence).
Example of a BioBERT dataset formatted for SQuAD: https://storage.googleapis.com/ce-covid-public/BioASQ-6b/train/Full-Abstract/BioASQ-train-factoid-6b-full-annotated.json
Additional dataset from BioASQ: https://storage.cloud.google.com/ce-covid-public/2ndYearDatasetTask2b.json
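To make that concrete, a single record in SQuAD-style layout would look roughly like this - the sequences, question text, ID, and answer offset below are placeholders, not real data:

# Placeholder sketch of a SQuAD-style record with a virus sequence as context
# and a candidate treatment sub-sequence as the answer span.
example_record = {
    "title": "coronavirus_example",
    "paragraphs": [{
        "context": "ATGGAGAGAATAAAAGAACTG...",  # full virus sequence as the "document"
        "qas": [{
            "id": "cov_0001",
            "question": "Which sub-sequence corresponds to the candidate treatment?",
            "answers": [{
                "text": "GAACTG",       # treatment/answer span (placeholder)
                "answer_start": 15      # character offset of the span within the context
            }]
        }]
    }]
}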
I also compiled additional sequence data which may or may not overlap with the download script you had.
https://drive.google.com/drive/folders/18aAuP3OhGMLKV8jZpt_8vpLY5JSqOS9E?usp=sharing
There are 3 sets: Coronaviruses, Influenza viruses, and SARS-related. The jsonl files are the raw metadata, compiled by filtering for complete sequences and for the virus families; the accession codes from those were then used to download the sequences themselves, which are the json files, so they should match the same format as your allseq.json file (a rough sketch of that download step is below the counts).
- 11132 sequences for Influenza
- 3002 sequences for Coronavirus
- 2023 sequences for SARS
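For reference, one way to do that accession-to-sequence download step is through Biopython's Entrez utilities; the email address, accession ID, and function name here are placeholders, and the actual script may have done this differently.

from Bio import Entrez, SeqIO

Entrez.email = "you@example.com"  # NCBI asks for a contact email when using E-utilities

def fetch_sequence(accession):
    # Download a single nucleotide record by accession code and return its sequence string
    handle = Entrez.efetch(db="nucleotide", id=accession, rettype="fasta", retmode="text")
    record = SeqIO.read(handle, "fasta")
    handle.close()
    return str(record.seq)

seq = fetch_sequence("NC_045512.2")  # SARS-CoV-2 reference genome, just as an example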
@trisongz I downloaded the files and put something together. Let me know if it's similar to what you were suggesting. By the way, I am familiar with the transformers library, and I don't think you can use the pre-trained language models (vocabulary) for these types of sequences. Anyway, here's the Colab link of what I put together - let me know if it's related!
@amoux That's pretty awesome! I hadn't thought of using a node graph, mainly because I don't work with them as often as I'd like to.
So I've been messing around with different methods, and out of the box, transformers won't necessarily work. You pointed out the first issue, which is creating the vocabulary. There wasn't a single number that every sequence was divisible by, so what I did instead was process each sequence to find the lowest prime number for that given sequence, and split the sequence by that prime.
## working file - covseq.json
Total Non-Unique Primes: 8297
Total Unique Primes: 1998
Unique Primes: [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31,
35, 37, 41, 43, 45, 47, 49, 50, 53, 54, 57, 59, 61, 62, 63, 67, 71, 73,
77, 79, 83, 85, 89, 91, 95, 97, 100, 101, 103, 106, 107, 108, 109, 113,
115, 119, 121, 123, 124, 125, 126, 127, 129, 131, 133, 135, 137, 139,
143, 145, 149, 151, 155, 157, 161, 163, 167, 171, 173, 175, 179, 181,
183, 187, 189, 191, 193, 197, 199, 200, 201, 203, 205, 209, 211..]
Afterwards, I compiled all the split sequence chunks into a list and deduplicated it, leaving a list of unique sequence chunks (a simplified sketch of this splitting/dedup step is below).
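Roughly, the splitting/dedup step looks something like this (simplified; here "lowest prime" is taken as the smallest prime dividing the sequence length, each sequence is split into that many equal pieces, and `sequences` stands in for the data loaded from the json files):

def smallest_prime_factor(n):
    # Smallest prime that divides n (returns n itself if n is prime); assumes n >= 2
    p = 2
    while p * p <= n:
        if n % p == 0:
            return p
        p += 1
    return n

def chunk_sequence(seq):
    # Split the sequence into p equal pieces, where p is the smallest prime dividing its length
    p = smallest_prime_factor(len(seq))
    size = len(seq) // p
    return [seq[i:i + size] for i in range(0, len(seq), size)]

all_chunks = []
for seq in sequences:  # `sequences` stands in for the sequences loaded from the json files
    all_chunks.extend(chunk_sequence(seq))
unique_tokens = sorted(set(all_chunks))  # deduplicated chunks = candidate vocabulary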
fluseq.json has 251607 tokens
covseq.json has 215855 tokens
sarseq.json has 96971 tokens
Total Non-Unique Tokens: 564433
Total Unique Tokens: 208565
ATGGAGAGAATAAAAGAACTGAGAGATCTAATGTCGCAGTCCCGCACTCGCGAGATACTCACTAAGACCACTGTGGACCATATGGCCATAATCAAAAAGTACACATCAGGAAGGCAAGAGAAGAACCCCGCACTCAGAATGAAGTGGATGATGGCAATGAGATACCCAATTACAGCAGACAAGAGAATAATGGACATGATTCCAGAGAGGAATGAACAAGGACAAACCCTCTGGAGCAAAACAAACGATGCTGGATCAGACCGAGTGATGGTATCACCTCTGGCCGTAACATGGTGGAATAGGAATGGCCCAACAACAAGTACAGTTCATTACCCTAAGGTATATAAAACTTATTTCGAAAAGGTCGAAAGGTTGAAACATGGTACCTTCGGCCCTGTCCACTTCAGAAATCAAGTTAAAATAAGGAGGAGAGTTGATACAAACCCTGGCCATGCAGATCTCAGTGCCAAGGAGGCACAGGATGTGATTATGGAAGTTGTTTTCCCAAATGAAGTGGGGGCAAGAATACTGACATCAGAGTCACAGCTGGCAATAACAAAAGAGAAGAAAGAAGAGCTCCAGGATTGTAAAATTGCTCCCTTGATGGTGGCGTACATGCTAGAAAGAGAATTGGTCCGTAAAACAAGGTTTCTCCCAGTAGCCGGCGGAACAGGCAGTGTTTATATTGAAGTGTTGCACTTAACCCAAGGGACGTGCTGGGAGCAGATGTACACTCCAGGAGGAGAAGTGAGAAATGATGATGTTGACCAAAGTTTGATTATCGCTGCTAGAAACATAGTAAGAAGAGCAGCAGTGTCAGCAGACCCATTAGCATCTCTCTTGGAAATGTGCCACAGCACACAGATTGGAGGAGTAAGGATGGTGGACATCCTTAGACAGAATCCAACTGAGGAACAAGCCGTAGACATATGCAAGGCAGCAATAGGGTTGAGGATTAGCTCATCTTTCAGTTTTGGTGGGTTCACTTTCAAAAGGACAAGCGGATCATCAGTCAAGAAAGAAGAAGAAGTGCTAACGGGCAACCTCCAAACACTGAAAATAAGAGTACATGAAGGGTATGAAGAATTCACAATGGTTGGGAGAAGAGCAACAGCTATTCTCAGAAAGGCAACCAGGAGA
Still a massive vocab for most models, so I tried using XLNet (the values are a bit messed up here - I realized I had counted 1 as a prime, as seen in the list above, which led to a much smaller size).
import torch
from transformers import XLNetTokenizer, XLNetModel

tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
model = XLNetModel.from_pretrained('xlnet-base-cased')

# complete_tokens is the list of deduplicated sequence chunks from above
num_added_toks = tokenizer.add_tokens(complete_tokens)
print('We have added', num_added_toks, 'tokens')

# Resize the embedding matrix so the newly added tokens get their own (randomly initialized) rows
model.resize_token_embeddings(len(tokenizer))
>> We have added 65134 tokens
>> Embedding(97134, 768)
This is where I'm currently at. My first goal is to attempt Sequence Classification/Entailment, but I'm stuck on how to pre-process the data into the correct format for that task.
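Roughly, I'm imagining something like the following for the pair encoding (placeholder sequences and labels; XLNetForSequenceClassification and a recent transformers API assumed) - the part I'm stuck on is getting from the raw json files to this shape at scale:

import torch
from transformers import XLNetForSequenceClassification

# Placeholder example: classify whether a (virus sequence, candidate treatment sequence)
# pair is a valid match (1) or not (0). Both strings would be chunked with the same
# scheme used to build the added vocabulary; `tokenizer` is the extended one from above.
cls_model = XLNetForSequenceClassification.from_pretrained('xlnet-base-cased', num_labels=2)
cls_model.resize_token_embeddings(len(tokenizer))  # account for the added tokens

virus_seq = "ATGGAGAGAATAAAA GAACTGAGAGAT ..."   # placeholder chunked virus sequence
treatment_seq = "GAACTGAGAGAT ..."               # placeholder candidate treatment sequence

inputs = tokenizer(virus_seq, treatment_seq,     # encoded as a sentence pair, NLI-style
                   truncation=True, max_length=512, return_tensors='pt')
labels = torch.tensor([1])                       # 1 = "valid pairing" in this toy labeling
outputs = cls_model(**inputs, labels=labels)
loss, logits = outputs[:2]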
Also - I realized that the flu dataset is a lot smaller than it should be, so I'll reupload the updated version in the folder soon.