GRNsight

Extend the degradation rates and production rates data from Neymotin to all genes, not just regulatory transcription factors

Open kdahlquist opened this issue 4 months ago • 2 comments

@kdahlquist needs to provide an update to the degradation rates and production rates for all genes in the Neymotin data, not just the regulatory transcription factors.

kdahlquist avatar Nov 05 '25 23:11 kdahlquist

I found the file with the extended list of degradation rates from Neymotin et al. 2014. I did a QA visual inspection of the data and found the following:

  • There is a total of 5380 records.
  • 72 records are noncoding RNA as follows:
    systematic_name  standard_name
    YNCI0012         ICR1
    YNCM0032C        RNA170
    YNCE0007C        RPR1
    YNCF0003C        RUF20
    YNCA0003W        snR18
    YNCM0026C        snR24
    YNCK0009W        snR38
    YNCG0014C        snR39
    YNCG0014C        snR39B
    YNCN0001W        snR40
    YNCP0015C        snR41
    YNCP0015C        snR47
    YNCG0026W        snR48
    YNCO0007W        snR50
    YNCP0013C        snR51
    YNCE0020C        snR52
    YNCE0003W        snR53
    YNCL0042C        snR55
    YNCB0003W        snR56
    YNCO0003C        snR58
    YNCP0003W        snR59
    YNCJ0009C        snR60
    YNCL0041C        snR61
    YNCO0015C        snR62
    YNCK0001W        snR64
    YNCN0014W        snR66
    YNCE0002W        snR67
    YNCK0013W        snR69
    YNCH0014W        snR71
    YNCM0017W        snR72
    YNCM0015W        snR74
    YNCM0014W        snR75
    YNCM0013W        snR76
    YNCM0012W        snR77
    YNCM0011W        snR78
    YNCL0005C        snR79
    YNCE0001C        snR80
    YNCO0006W        snR81
    YNCM0002C        snR85
    YNCE0017W        SRG1
    YNCD0012W        tD(GUC)D
    YNCK0005C        tE(UUC)K
    YNCB0002W        tF(GAA)B
    YNCF0008C        tG(GCC)F2
    YNCP0009W        tG(GCC)P1
    YNCG0011W        tH(GUG)G2
    YNCG0033W        tI(AAU)G
    YNCP0020W        tI(AAU)P1
    YNCE0009C        tK(CUU)E1
    YNCG0004W        tK(UUU)G1
    YNCL0047W        tK(UUU)L
    YNCO0016W        tK(UUU)O
    YNCD0031C        tL(CAA)D
    YNCN0007W        tL(CAA)N
    YNCD0008W        tL(UAA)D
    YNCL0049C        tL(UAA)L
    YNCL0032C        tL(UAG)L1
    YNCP0019W        tN(GUU)P
    YNCH0013C        SUF8
    YNCO0031W        tP(UGG)O3
    YNCJ0008W        tR(ACG)J
    YNCD0019C        tS(AGA)D2
    YNCD0028W        tS(AGA)D3
    YNCG0025C        tS(AGA)G
    YNCG0046W        tT(UGU)G2
    YNCG0018C        tV(AAC)G1
    YNCG0002C        tV(AAC)G3
    YNCD0022C        tV(CAC)D
    YNCH0016C        tV(CAC)H
    YNCG0009C        tW(CCA)G1
    YNCK0010W        tW(CCA)K
    YNCP0001C        tW(CCA)P
  • They did not have the correct systematic name, so I looked them up at SGD. SUF8 also did not have the correct standard name, so I fixed that, too.
  • Changed the standard names DUR1,2 to DUR12; ARG5,6 to ARG56; and ADE5,7 to ADE57, as per the updated standard names for these genes.
  • Changed the standard name IMP2' to IMP21.
  • TOA1 was duplicated in the original data from Neymotin. The values were the same, so I deleted one record. The total is now 5379 records.
  • Updated the standard names of many more genes that have been assigned standard names since 2014 (I forgot to record how many; on the order of hundreds).
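The renaming and de-duplication steps in the list above could be scripted roughly as follows. This is only a minimal sketch, not the procedure actually used (the QA was done by visual inspection): the column names `systematic_name`, `standard_name`, and `degradation_rate` are assumptions, and the sample rows and rate values are made up for illustration.

```python
import csv
import io

# Renames taken from the list above; other corrections would be added by hand.
STANDARD_NAME_FIXES = {
    "DUR1,2": "DUR12",
    "ARG5,6": "ARG56",
    "ADE5,7": "ADE57",
    "IMP2'": "IMP21",
}


def clean_records(rows):
    """Apply the renames and drop exact duplicate records, keeping the first copy."""
    seen = set()
    cleaned = []
    for row in rows:
        fixed = dict(row)
        fixed["standard_name"] = STANDARD_NAME_FIXES.get(
            row["standard_name"], row["standard_name"]
        )
        key = tuple(fixed.items())
        if key in seen:
            continue  # identical record already kept (the TOA1 situation)
        seen.add(key)
        cleaned.append(fixed)
    return cleaned


# Illustrative rows only; the rate values are invented.
sample = io.StringIO(
    "systematic_name,standard_name,degradation_rate\n"
    "YOR194C,TOA1,0.05\n"
    "YOR194C,TOA1,0.05\n"
    'YBR208C,"DUR1,2",0.03\n'
)
records = clean_records(csv.DictReader(sample))
for record in records:
    print(record)
```

Only records that match in every column are dropped, so two rows with the same ID but different values would both survive and still need a manual look.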

New version of the degradation rates table is found here: https://lmu.box.com/s/yllyj5ds2ndfq6vtf93gh9kcm28egr6f

kdahlquist avatar Nov 07 '25 23:11 kdahlquist

The production rates table has been updated. We are using 2X the degradation rate as the initial guess for each production rate, so I took the revised degradation rate table with corrected IDs and computed the new production rates. The file can be found here: https://lmu.box.com/s/yllyj5ds2ndfq6vtf93gh9kcm28egr6f
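For reference, the 2X computation amounts to something like the sketch below. The column names (`gene_id`, `degradation_rate`, `production_rate`) are assumptions about the table layout, and the sample values are invented.

```python
PRODUCTION_FACTOR = 2.0  # "2X degradation rate", as described above


def production_rows(degradation_rows):
    """Build production-rate rows from degradation-rate rows.

    Rows with an empty degradation rate are skipped, since there is no
    value to scale.
    """
    result = []
    for row in degradation_rows:
        raw = (row.get("degradation_rate") or "").strip()
        if not raw:
            continue
        result.append(
            {
                "gene_id": row["gene_id"],
                "production_rate": PRODUCTION_FACTOR * float(raw),
            }
        )
    return result


# Invented example values.
example = [
    {"gene_id": "YAL001C", "degradation_rate": "0.05"},
    {"gene_id": "YAL002W", "degradation_rate": ""},
]
print(production_rows(example))
```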

@ntran18, would you please update the Expression database with these two tables? We do not need to keep the old versions of the data; this is a replacement.

kdahlquist avatar Nov 07 '25 23:11 kdahlquist

I modified the scripts so that the production rates and degradation rates in the expression database can be updated easily. In terms of logic, I didn't change anything. I made the code more organized and added command-line arguments so we can easily update different tables. I also added a line of code that clears the table each time we load new production rates and degradation rates.
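As a rough illustration of that shape (not the actual loader.py), a script with per-table flags that clears each selected table before reloading might look like this. The `--prod` and `--deg` flags and the table names come from the command and output reported later in this thread; everything else, including the function names, is hypothetical.

```python
import argparse

# Flag name -> table name. The mapping itself is an assumption.
TABLES = {"prod": "production_rate", "deg": "degradation_rate"}


def build_parser():
    parser = argparse.ArgumentParser(
        description="Emit SQL that reloads the selected expression tables."
    )
    parser.add_argument("--prod", action="store_true", help="reload production rates")
    parser.add_argument("--deg", action="store_true", help="reload degradation rates")
    return parser


def clear_statements(args):
    """Return one TRUNCATE statement per table selected on the command line."""
    return [
        f"TRUNCATE TABLE {table};"
        for flag, table in TABLES.items()
        if getattr(args, flag)
    ]


# Parse an explicit argument list here so the sketch runs without a real command line.
args = build_parser().parse_args(["--prod", "--deg"])
for statement in clear_statements(args):
    print(statement)
```

Printing the SQL to standard output lets the result be piped into psql, and the statements for the actual data load would follow each TRUNCATE.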

The next step is for @ceciliazaragoza, @MilkaZek, and @Amelie1253 to run the update locally, following my updated documentation, to verify that the code works for everyone and that the documentation is clear.

PR

ntran18 avatar Nov 14 '25 07:11 ntran18

I received this error when updating my database:

python3 loader.py --prod --deg | psql postgresql://localhost/postgres

TRUNCATE TABLE
ERROR:  insert or update on table "production_rate" violates foreign key constraint "production_rate_gene_id_taxon_id_fkey"
DETAIL:  Key (gene_id, taxon_id)=(Q0045, 559292) is not present in table "gene".
TRUNCATE TABLE
ERROR:  insert or update on table "degradation_rate" violates foreign key constraint "degradation_rate_gene_id_taxon_id_fkey"
DETAIL:  Key (gene_id, taxon_id)=(Q0045, 559292) is not present in table "gene".

ceciliazaragoza avatar Nov 20 '25 22:11 ceciliazaragoza

The error occurs because the new production rate and degradation rate files contain genes that aren't in our current gene table. The production rate and degradation rate tables are linked to the gene table by a foreign key, so inserting a rate for a gene that is not in the gene table raises an error.

The question is: do we just want to drop a gene if it's not in the expression gene table, or do we need to download information from AllianceMine for that gene?
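The behavior can be reproduced in miniature. The sketch below uses Python's built-in sqlite3 as a stand-in for PostgreSQL (an assumption made purely so it runs anywhere); the composite key (gene_id, taxon_id) mirrors the constraint named in the error message, and the rate value is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.execute(
    "CREATE TABLE gene (gene_id TEXT, taxon_id TEXT, PRIMARY KEY (gene_id, taxon_id))"
)
conn.execute(
    """CREATE TABLE production_rate (
           gene_id TEXT,
           taxon_id TEXT,
           production_rate REAL,
           FOREIGN KEY (gene_id, taxon_id) REFERENCES gene (gene_id, taxon_id)
       )"""
)
conn.execute("INSERT INTO gene VALUES ('YAL001C', '559292')")


def try_insert(gene_id):
    """Return True if the rate row is accepted, False on a constraint violation."""
    try:
        conn.execute(
            "INSERT INTO production_rate VALUES (?, '559292', 0.1)", (gene_id,)
        )
        return True
    except sqlite3.IntegrityError:
        return False


print(try_insert("YAL001C"))  # gene exists in the gene table
print(try_insert("Q0045"))    # gene is missing from the gene table
```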

ntran18 avatar Nov 21 '25 00:11 ntran18

Since the data sources here do not correspond and may also be from different points in time (e.g., 2020 vs. 2025), I’d say the first step is to produce a list of the missing genes and present them to @kdahlquist for inspection. Her domain knowledge will help determine which course of action is most appropriate, and it’s even possible that different actions will be appropriate for different genes.
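Producing that list is essentially a set difference between the gene IDs in the rate files and those in the gene table. A minimal sketch, assuming the rate file is a CSV with a `gene_id` column (a hypothetical name) and one record per line:

```python
import csv
import io


def find_missing_genes(rate_rows, known_gene_ids):
    """Return (line_number, gene_id) pairs for rows whose gene is not known.

    Line numbers are 1-based and count the header as line 1, so they match
    what a text editor shows for the CSV file. Empty IDs are reported too.
    """
    missing = []
    for line_number, row in enumerate(rate_rows, start=2):
        gene_id = (row.get("gene_id") or "").strip()
        if gene_id not in known_gene_ids:
            missing.append((line_number, gene_id))
    return missing


# Invented sample data; the real gene table would be queried from the database.
known = {"YAL001C", "YAL002W"}
rates = io.StringIO(
    "gene_id,taxon_id,degradation_rate\n"
    "YAL001C,559292,0.05\n"
    "Q0045,559292,0.02\n"
    ",559292,0.01\n"
)
print(find_missing_genes(csv.DictReader(rates), known))
```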

dondi avatar Nov 21 '25 01:11 dondi

Yes, please send me the list of genes. My immediate guess is that some of them had a name change. I will be able to investigate and decide what we should do.

kdahlquist avatar Nov 21 '25 05:11 kdahlquist

Here is the file containing all the missing genes from the new degradation rates and production rates. Both degradation rates and production rates files have the same missing genes. On line 4053, there is an empty Gene ID too.

missing_genes.csv

ntran18 avatar Nov 21 '25 05:11 ntran18

If I am understanding correctly, there is a separate gene table in the Expression database? Would you please post that table to Box? I looked there and only the degradation/production rate data is there.

kdahlquist avatar Nov 21 '25 15:11 kdahlquist

If I remember correctly, @ntran18 said that all of the genes in the gene table of the expression database were found in the gene table of the network database, so we can use the gene table from the network database in the expression database. Please correct me if I'm wrong.

I remade the degradation rates and production rates tables from scratch. As before, there was one duplicate entry, which I removed. There are 72 noncoding RNAs. I corrected the standard name for 226 genes, either because Neymotin did not report one for that gene or because an alias was used instead. While I used Yeastract to check the IDs, I then did a visual inspection of all the changed genes. The errors @ntran18 found before were likely introduced when I used Yeastract to convert the IDs. However, I'm pretty certain that everything is correct now. The two tables are attached.

DegradationRates_2025-12-01.csv ProductionRates_2025-12-01.csv

kdahlquist avatar Dec 01 '25 22:12 kdahlquist

I see there are two rows for YNCP0019W, but they have different values. Are those considered duplicates or a unique row?

Image

ntran18 avatar Dec 03 '25 05:12 ntran18

Note to myself:

  • If we are going to use the gene table from the network database, we need to change the expression schema. All the tables would then reference the gene_regulatory_network.genes table through a foreign key.
  • Don't need to keep checking the missing genes in production_rates or degradation rates (or maybe we should).
  • Since we are using the gene_regulatory_network.genes table, we need to make sure new developers create the schema and populate the data of the network first, before that of gene expression.
  • Should we just have a single gene table that both PPI, GRN and gene expression can access?

ntran18 avatar Dec 03 '25 05:12 ntran18

> I see there are two rows for YNCP0019W, but they have different values. Are those considered duplicates or a unique row?
>
> Image

I'm not sure I'm seeing what you are seeing; these are different IDs. One ends in 19W and the other ends in 20W.

kdahlquist avatar Dec 03 '25 16:12 kdahlquist

> Don't need to keep checking the missing genes in production_rates or degradation rates (or maybe we should).

We absolutely need to keep checking for missing genes. Even though we have expanded the list, not all genes have a degradation/production rate. Or am I misunderstanding what you mean here?

kdahlquist avatar Dec 03 '25 16:12 kdahlquist

> Should we just have a single gene table that both PPI, GRN and gene expression can access?

I don't know, but I'm hoping that we can be conservative in our changes because we are trying to get ready to publish.

kdahlquist avatar Dec 03 '25 16:12 kdahlquist

Sorry, this is the correct screenshot of the duplication:

Image

ntran18 avatar Dec 03 '25 18:12 ntran18

@dondi and @ntran18 will meet during office hours to walk through the script so that @dondi can get a full, detailed picture of how the data is loaded.

@ntran18, in case I have time to review the code beforehand, please indicate which files/functions are most relevant to this sequence.

dondi avatar Dec 03 '25 18:12 dondi

One of the two rows above had the wrong gene ID. I have fixed it and re-uploaded the degradation rate and production rate files to Box. I re-sorted on the systematic name, so the two rows are no longer next to each other in the file. The new ID is YNCG0011W. https://lmu.box.com/s/3ax0ezy1c5rtbseht3ywjsria2yuteb2 https://lmu.box.com/s/oxzcfk5q2inrsox27jo9e4ldg6wk4rwa
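A repeated ID like this can be caught before loading by grouping rows on the systematic name. The sketch below assumes column names `systematic_name` and `degradation_rate`; the rate values are invented.

```python
from collections import defaultdict


def repeated_ids(rows, id_field="systematic_name", value_field="degradation_rate"):
    """Map each ID that appears on more than one row to the values it carries."""
    values_by_id = defaultdict(list)
    for row in rows:
        values_by_id[row[id_field]].append(row[value_field])
    return {
        gene_id: values
        for gene_id, values in values_by_id.items()
        if len(values) > 1
    }


# Invented values; only the shape of the check matters.
sample_rows = [
    {"systematic_name": "YNCP0019W", "degradation_rate": "0.02"},
    {"systematic_name": "YNCP0019W", "degradation_rate": "0.04"},
    {"systematic_name": "YNCP0020W", "degradation_rate": "0.03"},
]
print(repeated_ids(sample_rows))
```

An ID that shows up with different values most likely signals a mislabeled row rather than a true duplicate, so it needs a manual decision instead of automatic removal.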

kdahlquist avatar Dec 03 '25 22:12 kdahlquist

I drilled down on this with @ntran18 and we traced the entire import sequence. We found one bug: the gene table was actually being loaded after the production and degradation rates. Fixing that didn’t completely resolve everything, however, because an incremental data update needs a pre-check on whether each gene in the delta update is already in the gene table. The current code does not perform this check and thus risks a duplicate key error when it tries to insert the gene.
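The pre-check could look roughly like the sketch below. It is only an illustration of the idea, not the loader's code: the function name is hypothetical, and it assumes rows are keyed by (gene_id, taxon_id) as in the foreign key from the earlier error message, with the existing keys already fetched from the gene table.

```python
def split_delta(delta_rows, existing_keys):
    """Partition delta rows into genes to insert and genes already present.

    existing_keys is a collection of (gene_id, taxon_id) tuples that are
    already in the gene table.
    """
    to_insert = []
    already_present = []
    seen = set(existing_keys)
    for row in delta_rows:
        key = (row["gene_id"], row["taxon_id"])
        if key in seen:
            already_present.append(row)
        else:
            seen.add(key)  # also guards against repeats within the delta itself
            to_insert.append(row)
    return to_insert, already_present


existing = {("YAL001C", "559292")}
delta = [
    {"gene_id": "YAL001C", "taxon_id": "559292"},
    {"gene_id": "Q0045", "taxon_id": "559292"},
    {"gene_id": "Q0045", "taxon_id": "559292"},
]
new_rows, skipped = split_delta(delta, existing)
print(len(new_rows), len(skipped))
```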

@ntran18 and I looked into ways to do this with minimum effort and found that the PostgreSQL COPY command has an ON_ERROR option that allows it to skip records that encounter errors without aborting the entire sequence. However, this option is only available in a newer version of PostgreSQL than the one we currently have.

Without this built-in option, we may need to perform the check ourselves in the script. That remains a sizable amount of effort, so upon further thought, pursuing ON_ERROR as follows may still result in less overall work:

  • Update the local PostgreSQL version (on @ntran18’s machine) to the version that supports ON_ERROR (17, iirc).
  • Test ON_ERROR locally.
  • Once we are more confident that ON_ERROR works, look into updating the PostgreSQL version on the production server. I think this will amount to a lower overall level of effort than coding the duplication check manually, as long as we have confidence that it does work.
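For the local test, the COPY statement could be generated with the option added only when the server is new enough. This is a sketch under assumptions: the helper name and the third column name are hypothetical, and the version threshold relies on server_version_num being 170000 for PostgreSQL 17.0. One thing to verify during the test: as far as I can tell, the PostgreSQL 17 documentation describes ON_ERROR ignore in terms of rows whose values fail data type conversion, so whether it also skips rows that violate key constraints is exactly what the local run needs to establish.

```python
ON_ERROR_MIN_VERSION = 170000  # server_version_num reported by PostgreSQL 17.0


def copy_statement(table, columns, server_version_num):
    """Build a COPY ... FROM STDIN statement, adding ON_ERROR only if supported."""
    options = ["FORMAT csv", "HEADER true"]
    if server_version_num >= ON_ERROR_MIN_VERSION:
        options.append("ON_ERROR ignore")
    column_list = ", ".join(columns)
    option_list = ", ".join(options)
    return f"COPY {table} ({column_list}) FROM STDIN WITH ({option_list});"


# Column names: gene_id and taxon_id come from the error message; the third is assumed.
columns = ["gene_id", "taxon_id", "production_rate"]
print(copy_statement("production_rate", columns, 170000))
print(copy_statement("production_rate", columns, 160004))
```

The server's value can be read with SHOW server_version_num in psql before deciding which form to emit.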

@ntran18, when you get a chance, please look at the above process and chime in on feasibility.

dondi avatar Dec 09 '25 19:12 dondi