Upgrade scripts used for clinical data ingestion to python3

Open rxu17 opened this issue 4 months ago • 0 comments

Problem:

For the iAtlas to cBioportal project, we need to run our processing pipeline for the clinical and maf datasets under python 3. To avoid rewriting code on our end, we use several scripts from the datahub-study-curation-tools repo, namely:

oncotree mapping (to map to CANCER_TYPE and CANCER_TYPE_DESCRIPTION) using our clinical files' ONCOTREE_CODE values
add clinical header (this is the required format for cbioportal ingestion for clinical files)
generate metadata files (required files for cbioportal ingestion for clinical files)
generate caselists (required files for cbioportal ingestion for clinical files)

However, these scripts are written for python 2.

Solution:

Here we port these scripts from python 2 to python 3 so that we can use them in our pipeline.

Main changes are the following:

Updated print statements to the print() function syntax of python 3
Removed the U mode in open(); it is deprecated in python 3, where universal newlines handling is the default behavior in text mode
Updated to use the urllib library in python 3 in place of urllib2, see https://docs.python.org/2/library/urllib2.html and https://docs.python.org/3/library/urllib.request.html#module-urllib.request
Removed uses of the built-in unicode function, which no longer exists in python 3, where all strings are unicode by default
Fixed some mixed indentation (tabs and spaces) in the code
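
For reference, the kinds of changes listed above can be sketched as follows. This is only a minimal illustration, not the actual code from the scripts: the function names (read_lines, fetch_text, to_text, report) and the utf-8 encoding are assumptions made for the example.

```python
import urllib.request


def read_lines(path):
    """Read a text file and return its lines without line endings."""
    # python 2 version: open(path, "rU")
    # In python 3 text mode, universal newlines handling is the default,
    # so "\r\n" and "\r" are translated to "\n" without the "U" flag.
    # The utf-8 encoding is an assumption for this sketch.
    with open(path, "r", encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def fetch_text(url):
    """Download a URL and return the body as text."""
    # python 2 version: urllib2.urlopen(url).read()
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def to_text(value):
    """Return value as a python 3 str."""
    # python 2 version: unicode(value, "utf-8")
    # python 3 has no unicode built-in; bytes are decoded explicitly
    # and everything else goes through str().
    if isinstance(value, bytes):
        return value.decode("utf-8")
    return str(value)


def report(message):
    """Print a status message."""
    # python 2 version: print message
    print(message)
```

Mixed indentation has no equivalent snippet: python 3 raises a TabError when tabs and spaces are mixed inconsistently, so the fix is to re-indent the affected blocks with one style.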

Testing:

Tested on the iAtlas data to cbioportal project; the results were successfully ingested into cbioportal and validated.

Aug 27 '25 06:08 rxu17