ccsmeth
Extract features file structure
Hi, I used the ccsmeth extract command to extract features. How is the output structured when I open it as a TSV file in Python? Could you please give some information about the file structure? Thank you so much.
Hi @RahelehSalehi , the features TSV file is in the following format; each row represents the features of a CpG site:
chrom, position_in_chrom, strand, read_id, position_in_read,
seq_of_fwd_kmer, no_of_fwd_subreads, ipd_mean_of_fwd_kmer, ipd_std_of_fwd_kmer(deprecated), pw_mean_of_fwd_kmer, pw_std_of_fwd_kmer(deprecated), qual_of_fwd_kmer, mapq_of_fwd_kmer,
seq_of_rev_kmer, no_of_rev_subreads, ipd_mean_of_rev_kmer, ipd_std_of_rev_kmer(deprecated), pw_mean_of_rev_kmer, pw_std_of_rev_kmer(deprecated), qual_of_rev_kmer, mapq_of_rev_kmer,
methy_label
Best, Peng
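For reading such a file in Python, a minimal sketch using only the standard library could look like the following. It is not part of ccsmeth. It assumes the file has no header line and exactly 22 tab-separated columns in the order listed above; the column names are taken from that list, with the "(deprecated)" notes dropped and the last two reverse-strand columns named `qual_of_rev_kmer` and `mapq_of_rev_kmer` by assumption. All values are kept as strings, and the example row is a made-up placeholder, not real ccsmeth output.

```python
import csv
import io

# Column names follow the list above; the two *_rev_kmer quality/mapq names
# are an assumption made for this sketch.
FEATURE_COLUMNS = [
    "chrom", "position_in_chrom", "strand", "read_id", "position_in_read",
    "seq_of_fwd_kmer", "no_of_fwd_subreads", "ipd_mean_of_fwd_kmer",
    "ipd_std_of_fwd_kmer", "pw_mean_of_fwd_kmer", "pw_std_of_fwd_kmer",
    "qual_of_fwd_kmer", "mapq_of_fwd_kmer",
    "seq_of_rev_kmer", "no_of_rev_subreads", "ipd_mean_of_rev_kmer",
    "ipd_std_of_rev_kmer", "pw_mean_of_rev_kmer", "pw_std_of_rev_kmer",
    "qual_of_rev_kmer", "mapq_of_rev_kmer",
    "methy_label",
]


def read_features(handle):
    """Yield one dict per row (one CpG site per row), values kept as strings."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for line_no, row in enumerate(reader, start=1):
        if len(row) != len(FEATURE_COLUMNS):
            raise ValueError(
                f"line {line_no}: expected {len(FEATURE_COLUMNS)} columns, "
                f"got {len(row)}"
            )
        yield dict(zip(FEATURE_COLUMNS, row))


# Made-up placeholder row (v0 ... v21), only to show the mapping to names.
example_row = "\t".join(f"v{i}" for i in range(len(FEATURE_COLUMNS)))
records = list(read_features(io.StringIO(example_row + "\n")))
print(records[0]["chrom"], records[0]["methy_label"])
```

For a real file, the same function can be given a handle from `open(path, newline="")`, where `path` is wherever the extract output was written.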
Thank you so much for your response. Regarding the extracted features: when the signals are normalized, what is the difference between the 'zscore', 'min-max', 'min-mean', and 'mad' normalization methods? In the features I extracted from my data, some of the mean IPD values are negative. Do you know why this happens?
@RahelehSalehi , it is because of the zscore normalization. If you check the zscore formula, you will see that any value below the mean becomes negative after normalization. The related code is here: https://github.com/PengNi/ccsmeth/blob/master/ccsmeth/extract_features.py#L169.
Best, Peng
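To illustrate why z-score normalization produces negative values, here is a small sketch of the textbook formula, z = (x - mean) / std. It is not the ccsmeth implementation linked above; the use of the population standard deviation and the toy IPD values are assumptions made for the example.

```python
import statistics


def zscore(values):
    """Textbook z-score: subtract the mean, divide by the standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std; an assumption here
    if std == 0:
        # All values identical: every z-score is defined as 0 in this sketch.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Toy IPD values (made up). All inputs are positive.
ipd_values = [2.0, 4.0, 6.0, 8.0]
normalized = zscore(ipd_values)
print(normalized)
```

Every input above is positive, yet the two values below the mean (2.0 and 4.0) come out negative, because the normalized values are centered on zero.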
@PengNi I find this interesting. I am currently exploring this tool for my data, and it has been a bit technical for me as I have only a little bioinformatics background. I have used your trained model for calling modifications and for extracting the features. How do I use the output of ccsmeth extract in the deep neural network of ccsmeth described in your paper on arXiv?
I also want to know whether it would be worthwhile to train my own model.
@olaraym , hi, you can just follow the steps in the quick start to call modifications and frequencies. To train a new model, please check the ccsmeth train or ccsmeth trainm commands. If your data is non-human, it is worth a try.
@PengNi thank you very much for your response, I appreciate it. My data is non-human and I will definitely try it out.
Hi PengNi, thank you so much for sharing your code with us. I extracted the features from my data with the extracted_features code. The features are: chrom, position_in_chrom, strand, read_id, position_in_read, seq_of_fwd_kmer, no_of_fwd_subreads, ipd_mean_of_fwd_kmer, ipd_std_of_fwd_kmer(deprecated), pw_mean_of_fwd_kmer, pw_std_of_fwd_kmer(deprecated), qual_of_fwd_kmer, mapq_of_fwd_kmer, seq_of_rev_kmer, no_of_rev_subreads, ipd_mean_of_rev_kmer, ipd_std_of_rev_kmer(deprecated), pw_mean_of_rev_kmer, pw_std_of_rev_kmer(deprecated), qual_of_fwd_kmer, mapq_of_fwd_kmer, methy_label. Could you tell me what strand is? When is it positive and when is it negative? Thanks.
Hi @RahelehSalehi, if the read is mapped to the reverse strand of the reference (SAM FLAG 0x10), then strand is -.
Could you please explain a little more? Each row has a sequence for the forward strand and one for the reverse strand. If the value is -, does the strand in the features list refer to the reverse strand?
The strand value is based on whether 0x10 is set in the FLAG field of an alignment segment. Ref: https://samtools.github.io/hts-specs/SAMv1.pdf
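As a small illustration of that rule, the sketch below tests the 0x10 bit of a SAM FLAG value. The function name `strand_from_flag` is hypothetical and this is not ccsmeth code; returning + when the bit is not set is an assumption based on the usual +/- convention.

```python
REVERSE_STRAND_BIT = 0x10  # SAM FLAG bit: read is reverse complemented


def strand_from_flag(flag):
    """Return '-' if the 0x10 bit is set in the SAM FLAG, otherwise '+'."""
    return "-" if flag & REVERSE_STRAND_BIT else "+"


# A few common FLAG values and the strand they would give.
for flag in (0, 16, 99, 147):
    print(flag, strand_from_flag(flag))
```

FLAG 16 is exactly the 0x10 bit, and 147 (128 + 16 + 2 + 1) also contains it, so both give -. FLAG 0 and 99 (64 + 32 + 2 + 1) do not contain it and give +.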
Hi Peng, I have a question about the extract command. There are two parameters, mapq and identity. Could you please explain them a bit? I'd like to know when I should change the defaults. Thank you so much. Best, Raheleh
Hi Raheleh,
mapq and identity are for removing low-quality reads. They represent the mapping quality and the identity of an alignment item (read-to-reference alignment), respectively. The defaults of the two params are 1 and 0.0, respectively, which generally keep all the reads for feature extraction.
Best, Peng
Hi Peng, thanks a lot for your response. Could you please explain the difference between setting mapq to 1 and setting it to 0? Is it mandatory to set mapq to 1 if I want to align my dataset? Could you also explain what each of the following settings does? 1) mapq=0, identity=0; 2) mapq=0, identity=1; 3) mapq=1, identity=0; 4) mapq=1, identity=1. Thank you so much.
Hi Raheleh, mapq is an integer ranging from 0 to 255 (check https://samtools.github.io/hts-specs/SAMv1.pdf); identity is a decimal ranging from 0 to 1 (check https://www.differencebetween.com/difference-between-similarity-and-identity-in-sequence-alignment/). For both params, higher values mean higher read quality, so raising them removes more reads and may lead to better prediction with only the high-quality reads.
Best, Peng
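The effect of the four settings asked about above can be sketched with a toy filter. This is not ccsmeth code: the function `keep_read`, the "greater than or equal to" comparison, and the toy reads are all assumptions made for illustration.

```python
def keep_read(read_mapq, read_identity, min_mapq, min_identity):
    """Keep a read only if it meets both thresholds (>= is an assumption)."""
    return read_mapq >= min_mapq and read_identity >= min_identity


# Made-up reads: (name, mapping quality, alignment identity).
toy_reads = [
    ("r1", 0, 0.80),
    ("r2", 60, 0.99),
    ("r3", 60, 1.00),
    ("r4", 0, 1.00),
]


def kept_names(min_mapq, min_identity):
    """Names of the toy reads that pass the given thresholds."""
    return [
        name
        for name, mapq, identity in toy_reads
        if keep_read(mapq, identity, min_mapq, min_identity)
    ]


# The four settings from the question above.
for min_mapq, min_identity in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(min_mapq, min_identity, kept_names(min_mapq, min_identity))
```

Under these assumptions, mapq=0 with identity=0 keeps every read, mapq=1 drops the reads whose mapping quality is 0, and identity=1 keeps only reads that match the reference perfectly, which would usually remove most real reads.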