cmapPy
parse_gctx: don't sort returned values
Hi @oena @levlitichev
I was thinking about opening a pull request that modifies parse_gctx so that it does not return the dataframes sorted by index/column. The reason: if you read the metadata out in the order it appears in the file, you can choose the entries you are interested in, work out their positional indices, and then use the ridx/cidx option to load them, which is much faster.
Also, we could make the sort an option. What do you think?
Hi @dllahr! Not sure I totally follow. Do you mean just for the metadata only options? Otherwise the IDs are subsetted before hyperslab selection occurs.
Sorry, no. I mean that right now when you get the metadata back (and I think when you get everything back) all of the IDs have been sorted. The use case I ran into was:
- got just the row metadata back
- identified the overlap between the genes I wanted and those that were present
- identified the indices of the genes I wanted in the row metadata
- attempted to load using the ridx option and got a completely different set of genes
- realized that the row metadata had been sorted, rather than returned as it appears in the file
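To make the mismatch concrete, here is a minimal sketch in plain Python. It does not call cmapPy; the gene IDs and variable names are hypothetical, and list indexing stands in for loading rows by position with ridx.

```python
# Hypothetical row IDs in the order they are stored in the file.
file_order_ids = ["gene_c", "gene_a", "gene_d", "gene_b"]

# What the caller sees if the row metadata comes back sorted by ID.
sorted_ids = sorted(file_order_ids)

# The genes we actually want to load.
wanted = {"gene_a", "gene_b"}

# Positions computed against the sorted metadata...
sorted_positions = [i for i, rid in enumerate(sorted_ids) if rid in wanted]

# ...pick out different rows when applied to the file's real order.
loaded_from_sorted = [file_order_ids[i] for i in sorted_positions]

# Positions computed against file order pick out the intended rows.
file_positions = [i for i, rid in enumerate(file_order_ids) if rid in wanted]
loaded_from_file = [file_order_ids[i] for i in file_positions]

print(loaded_from_sorted)  # ['gene_c', 'gene_a']
print(loaded_from_file)    # ['gene_a', 'gene_b']
```

With sorted metadata, the positions 0 and 1 point at whatever happens to be first in the file, which is why a positional load returns an unrelated set of genes.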
Ok, gotcha. That does seem like a useful thing to do. Maybe we can start with having it as an option and see how things go?
@oena Has this been taken up? I was thinking of working on this.
@dllahr did you follow up on this? No worries if not, just checking
Sorry for the late reply. I did not get around to doing anything with this.