GEOparse
Suggestion for an improvement of the GEOparse.utils.smart_open() function
I found that some GEO files contain carriage return characters in their metadata, which causes an exception (GEOparse.GEOTypes.DataIncompatibilityException). To reproduce the error, load the "GPL10740" dataset as follows:
gpl = GEOparse.get_GEO(geo="GPL10740", silent=True, include_data=True, destdir=".")
(<class 'GEOparse.GEOTypes.DataIncompatibilityException'>, DataIncompatibilityException('\nData columns do not match columns description index in GSM1530106\nColumns in table are: )\nIndex in columns are: ID_REF, VALUE, DETECTION P-VALUE\n',), <traceback object at 0x7f1fee64be48>)
The columns variable, taken from GEOparse.parse_columns(soft), is:
Index(['ID_REF', 'VALUE', 'DETECTION P-VALUE'], dtype='object')
The table_data.columns variable, taken from GEOparse.parse_table_data(soft), is:
Index([')'], dtype='object')
This is caused by a metadata line that contains a carriage return (shown as ^M):
!Sample_relation = Alternative to: GSM1530054 (gene-level analysis^M)
!Sample_series_id = GSE62617
!Sample_series_id = GSE70707
#ID_REF =
#VALUE = RMA normalized signal intensity
#DETECTION P-VALUE =
!sample_table_begin
ID_REF VALUE DETECTION P-VALUE
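The line-splitting effect can be reproduced without GEOparse. The following is a minimal sketch using only the standard library. It assumes Python 3 text mode, and the three-line sample (including the tab-separated header) is a hypothetical miniature of the excerpt above, not the real file. With the default universal-newlines handling, the lone "\r" is treated as a line ending, so ")" lands on a line of its own, which is consistent with ')' showing up as the only table column. With newline="\n", the "\r" stays inside the metadata line.

```python
import os
import tempfile

# Hypothetical miniature of the SOFT excerpt: a stray "\r" sits inside a metadata line.
RAW = (
    b"!Sample_relation = Alternative to: GSM1530054 (gene-level analysis\r)\n"
    b"!sample_table_begin\n"
    b"ID_REF\tVALUE\tDETECTION P-VALUE\n"
)


def read_lines(path, **open_kwargs):
    """Return the lines of a text file as Python's text layer splits them."""
    with open(path, "r", encoding="utf-8", **open_kwargs) as handle:
        return list(handle)


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "sample.soft")
    with open(path, "wb") as handle:
        handle.write(RAW)

    # Default (newline=None): "\r", "\n" and "\r\n" all end a line.
    default_lines = read_lines(path)
    # newline="\n": only "\n" ends a line; "\r" is left untouched.
    strict_lines = read_lines(path, newline="\n")

print(len(default_lines), repr(default_lines[1]))
print(len(strict_lines), repr(strict_lines[0]))
```

In this sketch the default read yields four lines and the newline="\n" read yields three, so only the second keeps the metadata line intact.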
I suggest a small modification to the GEOparse.utils.smart_open() function so that it can handle such datasets, as follows:
@contextmanager
def smart_open(filepath, **open_kwargs):
    """Open file intelligently depending on the source and python version.

    Args:
        filepath (:obj:`str`): Path to the file.

    Yields:
        Context manager for file handle.

    """
    if "errors" not in open_kwargs:
        open_kwargs["errors"] = "ignore"
    if filepath[-2:] == "gz":
        open_kwargs["mode"] = "rt"
        fopen = gzip.open
    else:
        open_kwargs["mode"] = "r"
        fopen = open
    open_kwargs["newline"] = "\n"
    # I do not know why there is an "if" statement here, because it always
    # calls fopen with the same parameters.
    if sys.version_info[0]