chromeister
chromeister copied to clipboard
coordinates from the events file
Hi Esteban,
I'm interested in pulling the specific chromosomal coordinates identified in the events file after running run_and_plot_chromeister.sh. I believe they are reported globally across the genome (the genome has 21 chromosomes in the fasta) in this file, nor does the file specify the chromosome from which they are derived. What is the best way to covert global coordinate to local chromosomal coordinates for the events identified?
Thanks in advance,
Jeff
Sorry, I should be more specific. The events.txt file has multiple descriptions of the syntenic blocks, I'm specifically interested in the rearranged block (i.e., inversions, transposition, etc.) and not the conserved blocks. Thanks, Jeff
Hello Jeff!
You are correct in your guess: the coordinates are always global (internally the fasta headers > sequence ...
get removed). In case you want these to be local there are two approaches:
- You could re-run everything with every chromosome in a separate file (this takes a bit more time but it is still fast).
- You can convert them by using the following script that I just pushed into the repository (remember to update your repository). It is located in the
bin
folder and namedglobal_to_local.sh
. To use it, you have to runbin/global_to_local.sh comparison.csv comparison.events.txt > localEvents.txt
. Important: the two provided arguments are thecomparison.csv
which is the auto-generated csv file (which contains coordinates) and the events file. The new local coordinates per chromosome will be in thelocalEvents.txt
file. Disclaimer: I have not had much time to test it so be careful with it and check that it works (e.g. if you notice that coordinates are too big or negative, then that would be a sign that it does not work). Also, remember that the coordinates in the events file are a coarse-grained approximation!
Note: The script uses a naive approach which makes it slow if you the fasta files contain many sequences or if the events file contains many events. It is naive because it will check every event with all the lengths of the sequences until one that is bigger is found. This could be easily improved with a binary search, but it should work without further problem if you have e.g. 30 by 30 chromosomes.
Btw: I am no longer able to maintain this repository on a daily basis as I recently switched jobs. Still I will try to help when possible!
Hope this helps, Esteban
Esteban,
Thanks so much for you help - very much appreciated - I'll give you script a try now!
A couple more quick questions - is there a way to export the figure as a higher resolution file? I'm looking for something that I could import into illustrator to make manuscript quality figures. Lastly, I'm still trying to wrap my head around the level argument. It appears that it should be set to 4, except when using with large plant genome where you indicate that it could be set to 1. What exactly is this argument doing?
I know you've switched jobs and really appreciate you still helping with chromeister.
Jeff
Hey there @pjm43 ,
Thanks so much for you help - very much appreciated
Thanks for using CHROMEISTER! :)
I'll give you script a try now!
Let me know if it works alright!
is there a way to export the figure as a higher resolution file?
I think this should be possible. In essence, what CHROMEISTER plots is basically a matrix of 0's and 1's after it has computed some stuff. This is the matrix called dotplot.mat.raw.txt
(although the first two lines include the sequence lengths). I have tried something out that might work out for you. First, remove the two length lines from the file by doing:
tail -n +3 dotplot.mat.raw.txt > dotplot.nolen
Then copy this python code:
import matplotlib.pyplot as plt
import numpy as np
import matplotlib
matplotlib.use('Agg')
fig = plt.figure(figsize=(16, 16))
matrix = np.loadtxt('dotplot.nolen')
plt.imshow(matrix, cmap='Greys', interpolation='nearest')
plt.savefig("dotplot.svg")
Note: Remember to change the dotplot.nolen
to the name of your comparison file.
Note 2: You may need to set up a virtual env and install python packages with pip.
Then you can run the above python script and it will save the matrix to an svg
file.
If you are happy with the output, you can then add the axes names and titles (check out the matplotlib library for this).
This is the resulting svg
for a normal comparison:
(Note: you should be able to download the svg
by right clicking on it, I am not sure how github will render it).
This is not a very clean solution but it might work for what you need.
IMPORTANT: saving dotplots from CHROMEISTER as vector graphics file was not planned originally because the png
file should provide with enough detail - especially considering that CHROMEISTER is very, very heuristic. This means that if you make the dotplot of too much resolution then some boundary pixels might show signal when there is none in reality, i.e. CHROMEISTER will give you a good coarse-grain preview, but going into the fine-grain detail might produce some artifacts (remember that every pixel in the dotplot typically corresponds to around 100,000 base pairs in sequences of length 100 MBp!). Always check that the results are correct and that they make sense!
I'm still trying to wrap my head around the level argument. It appears that it should be set to 4, except when using with large plant genome where you indicate that it could be set to 1. What exactly is this argument doing?
I would not give too much though to this parameter, but in essence it controls how many nucleotides are skipped to calculate the hash function. This means that if you input z=4
(the default) only 1 in every 4 nucleotides will be used to compute the hash of a word. If you use z=1
, all of them will be used, thus leading to less collisions. Therefore, a low z
value is recommended when comparing very large sequences with many repeats (such as plants) to avoid "too many" collisions. But in my experience this affects the output by very little and only in some comparisons, so I think you are probably safe leaving to 4.
Hope it helps! Esteban