
Recommendation for eukaryotic species

Open kullrich opened this issue 1 year ago • 2 comments

Hi,

Are there any recommendations for eukaryotic species?

I am currently comparing two highly similar eukaryotic genome sequences, but I get no synteny and no rearrangements at all.

wget http://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
wget http://ftp.ensembl.org/pub/current_fasta/pan_troglodytes/dna/Pan_troglodytes.Pan_tro_3.0.dna.toplevel.fa.gz
gunzip Pan_troglodytes.Pan_tro_3.0.dna.toplevel.fa.gz
smashpp -n 32 -m 5000 -f 10000 -fs L -r Homo_sapiens.GRCh38.dna.primary_assembly.fa -t Pan_troglodytes.Pan_tro_3.0.dna.toplevel.fa

The results are empty; however, I would expect to see some differences between human and chimp.

====[ PREPARE DATA ]==================================
[+] Homo_sapiens.GRCh38.dna.primary_assembly.fa (FASTA) -> Homo_sapiens.GRCh38.dna.primary_assembly.seq (seq) finished.
[+] Pan_troglodytes.Pan_tro_3.0.dna.toplevel.fa (FASTA) -> Pan_troglodytes.Pan_tro_3.0.dna.toplevel.seq (seq) finished.

====[ REGULAR MODE ]==================================
[+] Creating model of Homo_sapiens.GRCh38.dna.primary_assembly.fa done.
[+] Filtering Pan_troglodytes.Pan_tro_3.0.dna.toplevel.fa done => 0 segments

====[ INVERTED MODE ]=================================
[+] Creating model of Homo_sapiens.GRCh38.dna.primary_assembly.fa done.
[+] Filtering Pan_troglodytes.Pan_tro_3.0.dna.toplevel.fa done => 0 segments

Thank you in anticipation

Best regards

Kristian

kullrich, Sep 07 '22 10:09

Dear Kristian,

First, let's understand the characteristics of the data. I followed your instructions and got this:

-rw-rw-r-- 1 x x 504569856 set 16 10:47 Homo_sapiens.GRCh38.dna.primary_assembly.fa
-rw-rw-r-- 1 x x 3151425857 jun 4 09:50 Homo_sapiens.GRCh38.dna.primary_assembly.fa_bk

It seems that you are using this Homo_sapiens.GRCh38.dna.primary_assembly.fa file, which contains less than 500 MB of sequence (while Homo_sapiens.GRCh38.dna.primary_assembly.fa_bk seems to contain all the data).

Is that intended? What does this sequence represent?
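
One way to answer that question is to summarize each FASTA file before running the comparison. The following is only a minimal sketch using awk; `fasta_summary` is a hypothetical helper name, not part of smashpp. It counts header lines, total bases, and N bases, which is usually enough to tell a full assembly from a partial or heavily masked one.

```shell
# Hypothetical helper (not part of smashpp): summarize a FASTA file.
# Prints the number of records, the total number of bases, and the
# number of N/n bases found in the sequence lines.
fasta_summary() {
  awk '
    /^>/ { records++; next }        # a header line starts a new record
    {
      gsub(/[ \t\r]/, "")           # drop stray whitespace and carriage returns
      bases += length($0)           # count every remaining character as a base
      ns += gsub(/[Nn]/, "")        # gsub returns how many N/n it replaced
    }
    END { printf "records=%d bases=%d n_bases=%d\n", records, bases, ns }
  ' "$1"
}
```

Running `fasta_summary Homo_sapiens.GRCh38.dna.primary_assembly.fa` on both machines and comparing the three numbers should show whether the two of us are working with the same input.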

Best regards, Diogo

pratas, Sep 16 '22 09:09

Hi, when I run it I get the following: about 840 MB for the .gz file, and the unzipped file is 3006 MB. So in my case the full reference genome seems to be present?

-bash-4.2$ wget http://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
--2022-09-17 10:01:48--  http://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
Resolving ftp.ensembl.org (ftp.ensembl.org)... 193.62.193.139
Connecting to ftp.ensembl.org (ftp.ensembl.org)|193.62.193.139|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 881211416 (840M) [application/x-gzip]
Saving to: ‘Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz’

100%[======================================================>] 881,211,416 43.6MB/s   in 20s    

2022-09-17 10:02:11 (41.8 MB/s) - ‘Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz’ saved [881211416/881211416]

-bash-4.2$ gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
-bash-4.2$ ls -al --block-size=M Homo_sapiens.GRCh38.dna.primary_assembly.fa
3006M Jun  4 10:50 Homo_sapiens.GRCh38.dna.primary_assembly.fa
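
To rule out a truncated download on either side, the compressed file can be checked before it is decompressed. This is a sketch only; `check_download` is a hypothetical helper name, and the expected size is the `Length` value that wget reported above (881211416 bytes).

```shell
# Hypothetical helper: verify a downloaded .gz file before running gunzip.
# $1 = path to the .gz file, $2 = expected size in bytes (e.g. from wget's "Length:")
check_download() {
  dl_file=$1
  dl_expected=$2
  dl_actual=$(wc -c < "$dl_file" | tr -d ' ')   # byte count of the file on disk
  if [ "$dl_actual" -ne "$dl_expected" ]; then
    echo "size mismatch: got $dl_actual, expected $dl_expected"
    return 1
  fi
  # gzip -t tests the compressed stream without writing any output
  if ! gzip -t "$dl_file" 2>/dev/null; then
    echo "gzip integrity check failed"
    return 1
  fi
  echo "ok"
}
```

For the human assembly this would be called as `check_download Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz 881211416`, run right after wget and before gunzip.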

Best regards, Kristian

kullrich, Sep 17 '22 08:09