Cogent
When running Cogent, most sequences ended up in the tucked category
Hi! I have four populations that I have been trying to run Cogent on. Three worked really well and I didn't have any issues; the fourth I have been struggling with. First I tried the instructions for family finding for a small dataset: 8,969 of the sequences ended up going into one partition (and didn't seem to have much sequence similarity), and within this bin there were many folders titled split followed by a number. I thought the problem might be that I had more than 20,000 input sequences, so I switched to the instructions for the large dataset. I'm not sure what happened there, but only 3,432 bins were created in the precluster_out directory and a large number of sequences (15,174) ended up in the tucked category. What are the tucked sequences, and what do you think I am doing incorrectly? (For my other populations, around 9,000 bins were created and there wasn't this same issue of so many sequences going into one partition.) Thank you!
Hi @mprotas69, The "tucked" sequences are subsequences of other, larger sequences. Since subsequences give no additional information about the coding regions of the genome, they are put (tucked) aside as a way of dealing with larger datasets.
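For illustration, the idea behind tucking can be sketched like this (a minimal mock-up of the concept, not Cogent's actual code, which among other things would need to handle strands and be far more efficient):

```python
# Sketch only: set aside any sequence that is an exact substring of a longer
# one, since it adds no new coding-region information.

def tuck_subsequences(seqs):
    """Split a {name: sequence} dict into (kept, tucked) by substring containment."""
    # Longest first, so each sequence only needs comparing against longer ones.
    ordered = sorted(seqs.items(), key=lambda kv: len(kv[1]), reverse=True)
    kept, tucked = {}, {}
    for name, seq in ordered:
        if any(seq in longer for longer in kept.values()):
            tucked[name] = seq   # contained in a kept sequence -> tuck it aside
        else:
            kept[name] = seq
    return kept, tucked

seqs = {"t1": "ATCGGATCCA", "t2": "CGGATC", "t3": "TTTTAAAA"}
kept, tucked = tuck_subsequences(seqs)
print(sorted(kept))    # ['t1', 't3']
print(sorted(tucked))  # ['t2'] -- t2 sits entirely inside t1
```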
I'm not sure about the 8,969 sequences initially ending up in one giant bin. Could the cutoff have been too relaxed?
Is this an unusual transcriptome from an exotic species, or does it have some repetitiveness?
My defaults in Cogent are tuned based on how k-mer similarities are shared among human genes (based on GENCODE).
The two defaults you may want to play with would be:
```
usage: process_kmer_to_graph.py [-h] [-c COUNT_FILENAME]
                                [--sim_threshold SIM_THRESHOLD]
                                [--ncut_threshold NCUT_THRESHOLD] [--version]
                                fasta_filename dist_filename output_dir
                                output_prefix

positional arguments:
  fasta_filename
  dist_filename
  output_dir
  output_prefix

optional arguments:
  -h, --help            show this help message and exit
  -c COUNT_FILENAME, --count_filename COUNT_FILENAME
                        Count filename (if not given, assume all weight is 1)
  --sim_threshold SIM_THRESHOLD
                        similarity threshold (default: 0.05)
  --ncut_threshold NCUT_THRESHOLD
                        ncut threshold (default: 0.2)
  --version             show program's version number and exit
```
The --sim_threshold is set at 0.05, meaning the k-mer profiles of two transcripts only need to share 5% similarity. This seems low, but when I looked at GENCODE human genes, two genes of different families rarely shared more than 5%. You may want to increase this number.
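For intuition, here is a toy version of that similarity check (a sketch only; Cogent's actual profile metric may be weighted or defined differently — this one scores shared k-mers against the smaller profile):

```python
import random

def kmer_set(seq, k=30):
    """All k-mers of a sequence, as a set."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def kmer_similarity(seq_a, seq_b, k=30):
    """Fraction of shared k-mers, relative to the smaller profile (sketch)."""
    a, b = kmer_set(seq_a, k), kmer_set(seq_b, k)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def random_seq(n):
    return "".join(random.choice("ACGT") for _ in range(n))

# Two transcripts sharing one 60 bp block but otherwise unrelated:
random.seed(1)
block = random_seq(60)
sim = kmer_similarity(block + random_seq(300), block + random_seq(300))
print(round(sim, 2))  # ~0.09: connected at the 0.05 default, separated at 0.2
```

So raising --sim_threshold from 0.05 to something like 0.2 disconnects pairs that share only a short repetitive block, which is exactly the kind of spurious edge that can glue many transcripts into one giant bin.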
--ncut_threshold is the threshold used by the normalized cut algorithm, as implemented by skimage (https://scikit-image.org/docs/stable/auto_examples/segmentation/plot_ncut.html#id1) based on this paper (https://people.eecs.berkeley.edu/~malik/papers/SM-ncut.pdf). It's been a while, but if I remember correctly, the lower you set --ncut_threshold, the more partitioned the graph will be, so you could also make it lower. The default is 0.2, again based on human gene results, but you can set it lower (the original paper used it for image segmentation and used 0.06).
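If it helps to see the criterion concretely, here is a self-contained sketch of the Ncut score from the Shi & Malik paper (not Cogent's code): for a proposed two-way split of a weighted similarity graph, Ncut(A,B) = cut(A,B)/assoc(A,V) + cut(A,B)/assoc(B,V), and that score is compared against --ncut_threshold when deciding whether to keep splitting.

```python
import networkx as nx

def ncut_value(G, part_a):
    """Ncut(A,B) = cut(A,B)/assoc(A,V) + cut(A,B)/assoc(B,V), per Shi & Malik."""
    part_a = set(part_a)
    # Total weight of edges crossing between A and B:
    cut = sum(d["weight"] for u, v, d in G.edges(data=True)
              if (u in part_a) != (v in part_a))
    # assoc(A,V): total weighted degree of A's nodes (and likewise for B):
    deg = dict(G.degree(weight="weight"))
    assoc_a = sum(deg[n] for n in part_a)
    assoc_b = sum(deg[n] for n in G if n not in part_a)
    return cut / assoc_a + cut / assoc_b

# Two tight triangles (think: two gene families) joined by one weak edge:
G = nx.Graph()
for fam in (("a1", "a2", "a3"), ("b1", "b2", "b3")):
    for i, u in enumerate(fam):
        for v in fam[i + 1:]:
            G.add_edge(u, v, weight=1.0)
G.add_edge("a1", "b1", weight=0.5)

print(round(ncut_value(G, {"a1", "a2", "a3"}), 3))
# ~0.154: between the paper's 0.06 and Cogent's default of 0.2, so those two
# threshold settings would treat this candidate split differently.
```

To try both knobs together on the command line (file names here are hypothetical), something like: process_kmer_to_graph.py --sim_threshold 0.2 --ncut_threshold 0.1 transcripts.fasta transcripts.dist output_dir myprefix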
Let me know how it goes. If it still looks odd and you want to share data with me, let me know and give me an email to request confidential data upload. -Liz
Hi Liz,
Thanks so much for the suggestions. It is an unusual species (a crustacean) and likely does have some sort of repetitiveness.
I changed the sim_thershold to .2. This seemed to work well as the fake genome generated had a similar BUSCO profile to that of the sequences before they were run through cogent (except the number of complete and single-copy BUSCOs was larger now and the Complete and duplicated BUSCOs was smaller, as expected). It does puzzle me though that the other three populations of the same species worked well with the sim_threshold of .05 and did not run into the same issue. Do you have any ideas about why that might be?
Thanks!
Meredith
Hi @mprotas69 (Meredith),
I'm...not sure why that was OK for the other populations! Can you do a BLAST of the Cogent results to see if the genes in this strange population also showed up in the other ones? Maybe run Cogent with the new parameter on the other ones as well?
-Liz