
Missing MSAs in the provided raw MSA data

Open tpdmskim opened this issue 1 year ago • 0 comments

Hi,

I downloaded the raw MSA files you provided using the following commands:

wget https://boltz1.s3.us-east-2.amazonaws.com/rcsb_raw_msa.tar
tar -xf rcsb_raw_msa.tar
rm rcsb_raw_msa.tar

After extracting the archive, I noticed that MSAs are missing for some sequences, even though structural data for these sequences exists in rcsb_processed_targets/structures/*.npz.

Upon checking, I found approximately 16,130 sequences that are present in the structure files but have no corresponding raw MSA data.

To illustrate the issue, here are some sequences that appear to be missing from the raw MSA dataset:

id,sequence
25194588de88b5cded80db552b93c98f00928c8d73fa69bc76edfce6581b8f70,GSPEFSLDVRQEELGAVVDKEMAATSAAIEDAVRRIEDMMNQARHASSGVKLEVNERILNSCTDLMKAIRLLVTTSTSLQKEIVESGRGAATQQEFYAKNSRWTEGLISASKAVGWGATQLVEAADKVVLHTGKYEELIVCSHEIAASTAQLVAASKVKANKHSPHLSRLQECSRTVNERAANVVASTKSGQEQIEDRDTMDFSGL
8d86d4a88063d08a61a75672326b45dae86206057c63052bb343a2a29a930603,MSLKDVSLSSFDAHDLDLDKFPEVVRDRLTQFLDAQELTIADIGAPVTDAVAHLRSFVLNGGKRIRPLYAWAGFLAAQGHKNSSEKLESVLDAAASLEFIQACALIHDDIIDSSDTRRGAPTVHRAVEADHRANNFEGDPEHFGVSVSILAGDMALVWAEDMLQDSGLSAEALARTRDAWRGMRTEVIGGQLLDIYLESHANESVELADSVNRFKTAAYTIARPLHLGASIAGGSPQLIDALLHYGHDIGIAFQLRDDLLGVFGDPAITGKPAGDDIREGKRTVLLALALQRADKQSPEAATAIRAGVGKVTSPEDIAVITEHIRATGAEEEVEQRISQLTESGLAHLDDVDIPDEVRAQLRALAIRSTERREGHHHHHH
88bd00693617e61f245336c86cdcaa313cb9103d8dfb1417c451eb48581b1027,MKIEEGKLVIWINGDKGYNGLAEVGKKFEKDTGIKVTVEHPDKLEEKFPQVAATGDGPDIIFWAHDRFGGYAQSGLLAEITPAAAFQDKLYPFTWDAVRYNGKLIAYPIAVEALSLIYNKDLLPNPPKTWEEIPALDKELKAKGKSALMFNLQEPYFTWPLIAADGGYAFKYENGKYDIKDVGVDNAGAKAGLTFLVDLIKNKHMNADTDYSIAEAAFNKGETAMTINGPWAWSNIDTSAVNYGVTVLPTFKGQPSKPFVGVLSAGINAASPNKELAKEFLENYLLTDEGLEAVNKDKPLGAVALKSYEEELAKDPRIAATMENAQKGEIMPNIPQMSAFWYAVRTAVINAASGRQTVDAALAAAQTNAAAMARFEDPTRRPYKLPDLCTELNTSLQDIEITCVYCKTVLELTEVFEFARKDLFVVYRDSIPHAACHKCIDFYSRIRELRHYSDSVYGDTLEKLTNTGLYNLLIRCLRCQKPLNPAEKLRHLNEKRRFHNIAGHYRGQCHSCCNRARQERLQRGSAAAESSELTFQELLGERR
022c5e46aa4d58f82ea0b1dcb834398f4f1195826519611b0447b8b5e3536ef3,MERDGCAGGGSRGGEGGRAPREGPAGNGRDRGRSHAAEAPGDPQAAASLLAPMDVGEEPLEKAARARTAKDPNTYKVLSLVLSVCVLTTILGCIFGLKPSCAKEVKSCKGRCFERTFGNCRCDAACVELGNCCLDYQETCIEPEHIWTCNKFRCGEKRLTRSLCACSDDCKDKGDCCINYSSVCQGEKSWVEEPCESINEPQCPAGFETPPTLLFSLDGFRAEYLHTWGGLLPVISKLKKCGTYTKNMRPVYPTKTFPNHYSIVTGLYPESHGIIDNKMYDPKMNASFSLKSKEKFNPEWYKGEPIWVTAKYQGLKSGTFFWPGSDVEINGIFPDIYKMYNGSVPFEERILAVLQWLQLPKDERPHFYTLYLEEPDSSGHSYGPVSSEVIKALQRVDGMVGMLMDGLKELNLHRCLNLILISDHGMEQGSCKKYIYLNKYLGDVKNIKVIYGPAARLRPSDVPDKYYSFNYEGIARNLSCREPNQHFKPYLKHFLPKRLHFAKSDRIEPLTFYLDPQWQLALNPSERKYCGSGFHGSDNVFSNMQALFVGYGPGFKHGIEADTFENIEVYNLMCDLLNLTPAPNNGTHGSLNHLLKNPVYTPKHPKEVHPLVQCPFTRNPRDNLGCSCNPSILPIEDFQTQFNLTVAEEKIIKHETLPYGRPRVLQKENTICLLSQHQFMSGYSQDILMPLWTSYTVDRNDSFSTEDFSNCLYQDFRIPLSPVHKCSFYKNNTKVSYGFLSPPQLNKNSSGIYSEALLTTNIVPMYQSFQVIWRYFHDTLLRKYAEERNGVNVVSGPVFDFDYDGRCDSLENLRQKRRVIRNQEILIPTHFFIVLTSCKDTSQTPLHCENLDTLAFILPHRTDNSESCVHGKHDSSWVEELLMLHRARITDVEHITGLSFYQQRKEPVSDILKLKTHLPTFSQED
e7c9b9684782cf14c45d492f2db59bb891e75fa420a0f9dc20006e8ee4a4f341,MGSSHHHHHHSSGLVPRGSHMRMLPSFLALLLGSGLAFNAQANTSTLKVCAASDEMPYSNKQQEGFENQLAKILADTMDRELEFVWSDKAAIFLVTEKLLKNQCDVVMGVDKGDPRVATSDPYYKSGYAFIYPADKGLDIKNWQSPALKDMSKFAIVPGSPSEVMLREIDKYEGNFNYTMSLIGFKSRRNQYVRYAPDLLVSEVVSGKADIAHIWAPEAARYVKSASVPLKMVVSEEIAPTRDGEGVRQQFEQSIAVRSDDQELLKEINTALHKADPKIKAVLKDEGIPLL
c34e68840e1b514f02e2b06a18e871b663f9aeeb29769cbcbf0ab10269e0cc12,QLQLVETGGGLVKPGGSLRLSCVVSGFTFDDYRMAWVRQAPGKELEWVSSIDSWSINTYYEDSVKGRFTISTDNAKNTLYLQMSSLKPEDTAVYYCAAEDRLGVPTINAHPSKYDYNYWGQGTQVTVSS
6df645d3a8b0f9462ae8babcb971fb4200f2d7952a30e05ca5d8bd84b246f7ca,VERDKYANFTINFTMENQIHTGMEYDNGRFIGVKFKSVTFKDSVFKECYFEDVTSSNTFFRNCTFINTVFYNTDLFEYKFVNSRLINSTFLHNKEGTSPSASGGS
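
A minimal sketch of how such a comparison could be reproduced; it is not the exact script behind the count above. It assumes the ids and sequences are saved in a CSV file with an `id,sequence` header like the listing above, and that each raw MSA file name starts with the sequence id followed by an extension. The function name `find_missing_msa_ids` and the file layout are hypothetical.

```python
import csv
from pathlib import Path


def find_missing_msa_ids(csv_path, msa_dir):
    """Return (id, sequence) pairs from the CSV that have no MSA file in msa_dir."""
    # Assumption: every MSA file is named "<id>.<extension>", so the text
    # before the first "." is taken as the sequence id.
    available = {
        path.name.split(".", 1)[0]
        for path in Path(msa_dir).iterdir()
        if path.is_file()
    }
    with open(csv_path, newline="") as handle:
        reader = csv.DictReader(handle)  # expects an "id,sequence" header row
        return [
            (row["id"], row["sequence"])
            for row in reader
            if row["id"] not in available
        ]
```

Calling `find_missing_msa_ids("sequences.csv", "rcsb_raw_msa")` would return the rows that have no matching file; both arguments are placeholder paths and should point to wherever the CSV and the extracted MSA files actually live.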

Is this expected behavior? Could you please confirm whether these MSAs were intentionally excluded, or whether there is an error in the dataset?

Thank you!

tpdmskim, Feb 11 '25 10:02