Extract features file structure

Open RahelehSalehi opened this issue 1 year ago • 15 comments

Hi, I used the ccsmeth extract command to extract features. How is the output structured when it is opened as a TSV file in Python? Could you please give some information about the file structure? Thank you so much.

RahelehSalehi avatar Mar 15 '23 13:03 RahelehSalehi

Hi @RahelehSalehi, the features TSV file is in the following format; each row represents the features of one CpG site:

chrom, position_in_chrom, strand, read_id, position_in_read,
seq_of_fwd_kmer, no_of_fwd_subreads, ipd_mean_of_fwd_kmer, ipd_std_of_fwd_kmer(deprecated), pw_mean_of_fwd_kmer, pw_std_of_fwd_kmer(deprecated), qual_of_fwd_kmer, mapq_of_fwd_kmer,
seq_of_rev_kmer, no_of_rev_subreads, ipd_mean_of_rev_kmer, ipd_std_of_rev_kmer(deprecated), pw_mean_of_rev_kmer, pw_std_of_rev_kmer(deprecated), qual_of_rev_kmer, mapq_of_rev_kmer,
methy_label
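
For readers who want to open the file in Python, here is a minimal sketch using only the standard library. It assumes the file is tab-separated with no header row and that the columns appear in the order listed above; the names `qual_of_rev_kmer` and `mapq_of_rev_kmer` for the last two reverse-strand columns, the helper `read_features`, and the example row are illustrative assumptions, not taken from ccsmeth itself.

```python
import csv
import io

# Column names in the order listed above (assumption: no header row in the file).
COLUMNS = [
    "chrom", "position_in_chrom", "strand", "read_id", "position_in_read",
    "seq_of_fwd_kmer", "no_of_fwd_subreads", "ipd_mean_of_fwd_kmer",
    "ipd_std_of_fwd_kmer", "pw_mean_of_fwd_kmer", "pw_std_of_fwd_kmer",
    "qual_of_fwd_kmer", "mapq_of_fwd_kmer",
    "seq_of_rev_kmer", "no_of_rev_subreads", "ipd_mean_of_rev_kmer",
    "ipd_std_of_rev_kmer", "pw_mean_of_rev_kmer", "pw_std_of_rev_kmer",
    "qual_of_rev_kmer", "mapq_of_rev_kmer",
    "methy_label",
]


def read_features(handle):
    """Yield one dict per CpG site from an open tab-separated file (hypothetical helper)."""
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} fields, got {len(row)}")
        yield dict(zip(COLUMNS, row))


# Made-up row for illustration only; real rows hold k-mer sequences and signal values.
example_row = "\t".join(["chr1", "10468", "+", "read_1", "25"] + ["."] * 16 + ["1"])
sites = list(read_features(io.StringIO(example_row + "\n")))
print(sites[0]["chrom"], sites[0]["strand"], sites[0]["methy_label"])
```

With a real file, pass `open(path, newline="")` instead of the `io.StringIO` object. All values come back as strings, so numeric fields need converting before use.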

Best, Peng

PengNi avatar Mar 17 '23 01:03 PengNi

Thank you so much for your response. Regarding the extracted features: when the signals are normalized, what is the difference between the 'zscore', 'min-max', 'min-mean', and 'mad' normalization methods? Also, some of the mean IPD values in the features I extracted from my data are negative. Do you know why this happens?

RahelehSalehi avatar Mar 23 '23 11:03 RahelehSalehi

@RahelehSalehi, it is because of the zscore normalization. If you check the zscore formula, you will see that values below the mean become negative after normalization. The related code is here: https://github.com/PengNi/ccsmeth/blob/master/ccsmeth/extract_features.py#L169.
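
To see why negative values appear, here is a minimal sketch of a textbook z-score, (x - mean) / std, applied to made-up positive IPD values. The function `zscore` and the sample numbers are illustrative assumptions; the linked ccsmeth code may differ in details such as how the standard deviation is computed.

```python
import statistics


def zscore(values):
    """Textbook z-score: subtract the mean, divide by the population std."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # All values identical: every z-score is defined here as 0.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


raw_ipd = [10.0, 20.0, 30.0, 40.0]  # made-up, all positive
normalized = zscore(raw_ipd)
print(normalized)  # values below the mean (25.0) come out negative
```

Every raw value is positive, yet the two values below the mean map to negative numbers, which is the behaviour seen in the extracted features.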

Best, Peng

PengNi avatar Mar 23 '23 11:03 PengNi

@PengNi I find this interesting. I am currently exploring this tool for my data, and it has been a bit technical as I have little bioinformatics background. I have used your trained model for calling modifications and for extracting the features. How do I use the output of ccsmeth extract in the deep neural network of ccsmeth, as described in your paper on arXiv?

olaraym avatar Mar 28 '23 01:03 olaraym

I also want to know whether it would be worthwhile to train my own model.

olaraym avatar Mar 28 '23 01:03 olaraym

@olaraym, hi, you can just follow the steps in the quick start to call modifications and frequencies. To train a new model, please check the ccsmeth train or ccsmeth trainm commands. If your data is non-human, it is worth a try.

PengNi avatar Mar 28 '23 01:03 PengNi

@PengNi thank you very much for your response, I appreciate it. My data is non-human and I will definitely try it out.

olaraym avatar Mar 28 '23 01:03 olaraym

Hi PengNi, thank you so much for sharing your code with us. I extracted the features from my data with the extract_features code. The features are: chrom, position_in_chrom, strand, read_id, position_in_read, seq_of_fwd_kmer, no_of_fwd_subreads, ipd_mean_of_fwd_kmer, ipd_std_of_fwd_kmer(deprecated), pw_mean_of_fwd_kmer, pw_std_of_fwd_kmer(deprecated), qual_of_fwd_kmer, mapq_of_fwd_kmer, seq_of_rev_kmer, no_of_rev_subreads, ipd_mean_of_rev_kmer, ipd_std_of_rev_kmer(deprecated), pw_mean_of_rev_kmer, pw_std_of_rev_kmer(deprecated), qual_of_rev_kmer, mapq_of_rev_kmer, methy_label. Could you tell me what strand is? When should it be positive and when should it be negative? Thanks.

RahelehSalehi avatar Jul 13 '23 13:07 RahelehSalehi

Hi @RahelehSalehi, if the read is mapped to the reverse strand of the reference (SAM FLAG 0x10), then strand is -.

PengNi avatar Jul 14 '23 08:07 PengNi

Could you please explain a little more? Suppose we have a sequence that involves both a forward strand and a reverse strand. If the value is -, is the strand in the feature list the reverse strand?

RahelehSalehi avatar Jul 14 '23 08:07 RahelehSalehi

The strand value is based on whether 0x10 is set in the FLAG field of an alignment segment. Ref: https://samtools.github.io/hts-specs/SAMv1.pdf
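
The FLAG check can be sketched as a single bitwise test. The function name `strand_from_flag` and the sample FLAG values are illustrative assumptions; only the meaning of bit 0x10 comes from the SAM specification.

```python
REVERSE_FLAG = 0x10  # SAM FLAG bit: read is reverse complemented


def strand_from_flag(flag):
    """Return '-' when bit 0x10 is set in the SAM FLAG, otherwise '+'."""
    return "-" if flag & REVERSE_FLAG else "+"


# 0: forward; 16: reverse; 2048: supplementary, forward; 2064: supplementary, reverse
for flag in (0, 16, 2048, 2064):
    print(flag, strand_from_flag(flag))
```

The other bits in FLAG do not affect the result, so a supplementary alignment on the reverse strand (2064 = 2048 + 16) is still reported as -.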

PengNi avatar Jul 14 '23 09:07 PengNi

Hi Peng, I have a question about the extracted file. There are two parameters, mapq and identity. Could you please explain a bit about them? I'd like to know when I should change the defaults. Thank you so much. Best, Raheleh

RahelehSalehi avatar Oct 24 '23 10:10 RahelehSalehi

Hi Raheleh,

mapq and identity are for removing low-quality reads; they represent the mapping quality and the identity of an alignment item (read-to-reference alignment), respectively. The defaults of the two params are 1 and 0.0, respectively, which generally keep all the reads for feature extraction.

Best, Peng

PengNi avatar Oct 25 '23 09:10 PengNi

Hi Peng, thanks a lot for your response. Could you please explain the difference between setting mapq to 1 and setting it to 0? Is it mandatory to set mapq to 1 if I want to align my dataset? Could you also explain what each of the following settings does? 1) mapq=0, identity=0; 2) mapq=0, identity=1; 3) mapq=1, identity=0; 4) mapq=1, identity=1. Thank you so much.

RahelehSalehi avatar Oct 25 '23 09:10 RahelehSalehi

Hi Raheleh, mapq is an integer ranging from 0 to 255 (check https://samtools.github.io/hts-specs/SAMv1.pdf); identity is a decimal ranging from 0 to 1 (check https://www.differencebetween.com/difference-between-similarity-and-identity-in-sequence-alignment/). For both params, higher values mean higher read quality, so higher thresholds remove more reads and may lead to better predictions using only the high-quality reads.
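
As a rough illustration of how the two thresholds interact, here is a minimal sketch that applies the four settings from the question to three made-up alignments. It assumes a read is kept when both values are greater than or equal to their thresholds; the `Alignment` type, the `keep` function, the sample reads, and the `>=` comparison are all assumptions, and ccsmeth's actual filtering may differ.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    """Hypothetical record holding the two values used for filtering."""
    read_id: str
    mapq: int        # mapping quality, 0-255
    identity: float  # alignment identity, 0-1


def keep(aln, min_mapq, min_identity):
    """Assumed rule: keep a read only if it meets both thresholds."""
    return aln.mapq >= min_mapq and aln.identity >= min_identity


alignments = [
    Alignment("read_a", 0, 0.80),
    Alignment("read_b", 20, 0.95),
    Alignment("read_c", 60, 1.00),
]

for min_mapq, min_identity in [(0, 0.0), (0, 1.0), (1, 0.0), (1, 1.0)]:
    kept = [a.read_id for a in alignments if keep(a, min_mapq, min_identity)]
    print(f"mapq>={min_mapq}, identity>={min_identity}: {kept}")
```

Under this assumed rule, raising mapq from 0 to 1 drops only reads whose mapping quality is 0, while setting identity to 1 keeps only reads that match the reference exactly.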

Best, Peng

PengNi avatar Oct 25 '23 10:10 PengNi