datahub-study-curation-tools
Upgrade scripts used for clinical data ingestion to python3
Problem:
For the iAtlas to cBioportal project, we need to run our processing pipeline for the clinical and MAF datasets under Python 3. To avoid rewriting code on our end, we rely on several scripts from the datahub-study-curation-tools repo, namely:
- oncotree mapping (to map to `CANCER_TYPE` and `CANCER_TYPE_DESCRIPTION`) using our clinical files' `ONCOTREE_CODE` values
- add clinical header (this is the required format for cbioportal ingestion for clinical files)
- generate metadata files (required files for cbioportal ingestion for clinical files)
- generate caselists (required files for cbioportal ingestion for clinical files)
However, these scripts are written for Python 2.
Solution:
These changes port the scripts from Python 2 to Python 3 so that we can use them in our pipeline.
Main changes are the following:
- Updated `print` statements to the `print()` function syntax in Python 3
- Removed the deprecated `U` mode in `open()`; universal newlines are now the default behavior
- Updated to use the `urllib` library in Python 3, see https://docs.python.org/2/library/urllib2.html and https://docs.python.org/3/library/urllib.request.html#module-urllib.request
- Removed uses of the built-in `unicode` function; in Python 3 all strings are Unicode by default
- Fixed some mixed indentation in the code
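
As a rough illustration of the kinds of changes listed above, here is a minimal sketch of the Python 3 idioms involved. The helper names (`read_lines`, `to_text`, `build_request`) and the example header are hypothetical and are not taken from the scripts in this repo; they only show the pattern each change follows.

```python
import urllib.request  # Python 2 code would have used `import urllib2`


def read_lines(path):
    """Read a text file, returning its lines without trailing newlines.

    Python 2 code typically used open(path, "rU"). In Python 3, text mode
    already translates \r\n and \r to \n, so plain "r" is enough.
    """
    with open(path, "r") as handle:
        return [line.rstrip("\n") for line in handle]


def to_text(value):
    """Return `value` as a Python 3 str.

    Python 2 code typically called unicode(value, "utf-8"). Python 3 has no
    `unicode` built-in: bytes are decoded explicitly, everything else goes
    through str().
    """
    if isinstance(value, bytes):
        return value.decode("utf-8")
    return str(value)


def build_request(url):
    """Build (but do not send) an HTTP request object.

    Python 2: urllib2.Request(url, ...)
    Python 3: urllib.request.Request(url, ...)
    The Accept header here is only an illustrative assumption.
    """
    return urllib.request.Request(url, headers={"Accept": "application/json"})
```

For example, `print(to_text(b"BRCA"))` uses the function-call form of `print` required in Python 3, where Python 2 code would have written `print "BRCA"`. Sending the request would go through `urllib.request.urlopen(build_request(url))` in place of `urllib2.urlopen(...)`.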
Testing:
- Tested on the iAtlas data for the iAtlas to cBioportal project; the results were successfully ingested into cbioportal and validated