depmap_omics icon indicating copy to clipboard operation
depmap_omics copied to clipboard

New post processing for creating master files of model --> mc --> Profile --> CDS

Open siyer-broad opened this issue 9 months ago • 5 comments

First round for code review. Please be brutal. Will address bugs first and then efficiency and good practice recommendations after

siyer-broad avatar Mar 24 '25 13:03 siyer-broad

Could we all hold off on reviewing this PR? It overlaps with my work on the depmap data manifest and Terra workspace generation.

dpmccabe avatar Mar 24 '25 15:03 dpmccabe

Could we all hold off on reviewing this PR? It overlaps with my work on the depmap data manifest and Terra workspace generation.

Sure thing

GoldenLA avatar Mar 24 '25 15:03 GoldenLA

Let us blacklist them or give them some other data type then. Coding for edge cases like this is not a good idea.

On Tue, Mar 25, 2025 at 11:52 AM Simone Zhang @.***> wrote:

@.**** commented on this pull request.

In model_to_omics_data_config.json https://github.com/broadinstitute/depmap_omics/pull/253#discussion_r2012421460 :

  •        "exclusion_filters": {
    
  •            "_comment": "Exclude rows where database column 'omics_sequencing.blacklist' has value 1(is True)",
    
  •            "columns": ["blacklist"],
    
  •            "blacklist": [1],
    
  •            "filter_mode": "all"
    
  •        },
    
  •        "inclusion_filters": {
    
  •            "_comment": "Include rows where db column 'omics_sequencing.expected_type' is rna,wgs or wes",
    
  •            "columns": ["expected_type"],
    
  •            "expected_type": ["rna", "wgs", "wes"],
    
  •            "filter_mode": "any"
    
  •        }
    
  •    },
    
  •    "omics_profile": {
    
  •        "_comment": "Gumbo table name 'omics_profile' is block key. Table name MUST exactly match database table name",
    
  •        "exclusion_filters": {
    

Here we should also exclude profiles where the profile_source is "taiga". Those are legacy profiles that we don't have bams for

— Reply to this email directly, view it on GitHub https://github.com/broadinstitute/depmap_omics/pull/253#pullrequestreview-2714310034, or unsubscribe https://github.com/notifications/unsubscribe-auth/BIKWZBMELK4RHZT6UPAKPIT2WF3UHAVCNFSM6AAAAABZU6JCQKVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDOMJUGMYTAMBTGQ . You are receiving this because you authored the thread.Message ID: @.***>

siyer-broad avatar Mar 25 '25 17:03 siyer-broad

When would we not process data that has been sequenced and given a release date?

On Tue, Mar 25, 2025 at 3:04 PM Simone Zhang @.***> wrote:

@.**** commented on this pull request.

In model_to_omics_data_config.nocomments.json https://github.com/broadinstitute/depmap_omics/pull/253#discussion_r2012768989 :

@@ -0,0 +1,173 @@ +{

  • "table_filters": {
  •    "omics_sequencing": {
    
  •        "exclusion_filters": {
    
  •            "columns": ["blacklist"],
    
  •            "blacklist": [1],
    
  •            "filter_mode": "all"
    
  •        },
    
  •        "inclusion_filters": {
    

Another condition would be processed_sequence == True so we newly loaded/not processed samples don't get nominated as main sequencing ids

— Reply to this email directly, view it on GitHub https://github.com/broadinstitute/depmap_omics/pull/253#pullrequestreview-2714894270, or unsubscribe https://github.com/notifications/unsubscribe-auth/BIKWZBNFEIH76HZK7JZHNZL2WGSFBAVCNFSM6AAAAABZU6JCQKVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDOMJUHA4TIMRXGA . You are receiving this because you authored the thread.Message ID: @.***>

siyer-broad avatar Mar 25 '25 19:03 siyer-broad

That’s a good point. I’ll address that. On Tue, Mar 25, 2025 at 3:42 PM Simone Zhang @.***> wrote:

@.**** commented on this pull request.

In model_to_omics_data_config.json https://github.com/broadinstitute/depmap_omics/pull/253#discussion_r2012822277 :

  •            "columns": ["blacklist"],
    
  •            "blacklist": [1],
    
  •            "filter_mode": "all"
    
  •        },
    
  •        "inclusion_filters": {
    
  •            "_comment": "Include rows where db column 'omics_sequencing.expected_type' is rna,wgs or wes",
    
  •            "columns": ["expected_type"],
    
  •            "expected_type": ["rna", "wgs", "wes"],
    
  •            "filter_mode": "any"
    
  •        }
    
  •    },
    
  •    "omics_profile": {
    
  •        "supercede_filter": {
    
  •            "_comment": "Include rows where atleast one of the columns 'internal_release_date', 'consortium_release_date' or 'public_release_date' is not None",
    
  •            "columns": ["internal_release_date", "consortium_release_date", "public_release_date"],
    
  •            "internal_release_date": ["None"],
    

I don't really know if it actually happens, but is release date != None enough? Like do ops ever assign a release date in the future, so we'll perhaps have to do release date < today?

— Reply to this email directly, view it on GitHub https://github.com/broadinstitute/depmap_omics/pull/253#pullrequestreview-2714991492, or unsubscribe https://github.com/notifications/unsubscribe-auth/BIKWZBM3YPOKG5GF4MKQN532WGWR5AVCNFSM6AAAAABZU6JCQKVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDOMJUHE4TCNBZGI . You are receiving this because you authored the thread.Message ID: @.***>

siyer-broad avatar Mar 25 '25 19:03 siyer-broad