spikeinterface
spikeinterface copied to clipboard
Force `_main_ids` to be strings
Related to the serialization thread (#1397), after discussing with @h-mayorquin, we thought that casting Base._main_ids as str always may solve several issues with keys not being json serializable.
Allowing ids to be integers is mainly to maintain a similar behavior with spikeextractors, but now that ids and indices are clearly separated I think we can just use strings.
@samuelgarcia thoughts?
I think this choice has to be unrelated to json stuff. I am not against having str only for ids. Having integer was also convinient in some cases. We have to see this like Dataframe.index which can be int, str and many more no ?
Lets maturate this idea more.
@magland : I think you use to have opinion on this : does channel_ids/unit_ids must str only or int/str ?
I understand why forcing IDs to be str is desirable. However, I believe this would break many of the recordings/sortings that I work with, that use integer IDs. So not sure what is the best way forward here.
@magland could you elaborate why they would break? If you use, for instance, channel IDs to index or slice an array, that is not the way it should be because we have a clear api distinction between IDs and indices now. Do you have other breaking examples?
I have a lot of existing code that uses integers for IDs when creating recordings/sortings. Frank lab also has recordings/sortings stored to disk in custom formats that I believe use integers for IDs.
For example, here's a line of code that I just wrote:
recording_signal = se.NumpyRecording([signal], sampling_frequency=recording_sub_whitened.sampling_frequency, channel_ids=np.arange(1, KK + 1, dtype=np.int32))
It would be cumbersome to convert that code to use strings instead of ints.
No we would do the conversion in the base object and under the hood. So if the downstream code does not depend on the IDs being int there's no problem
So we'd be able to pass in ints for any of the API functions and they get translated to strings under the hood?
Well normally you would then the channel_ids from the object and pass them around!
I believe this would still break my code in various places (some unexpected), like when I get the channel IDs and compare them to integers... or check to see if an integer is among the channel IDs. But if you decide to go that route, I guess I'll just have to deal with it. From a design perspective I think it is the right choice... but the fact is, it will cause some things to break for me.
I see! I agree it's better going forward. Probably also better to switch sooner than later!
Maybe it should wise to deprecate explcitly in next release (actual master) and remove ids integer in the following one. So anyone could figure out which part of code will be broken in advance. This would be more smart no ?
Maybe it should wise to deprecate explcitly in next release (actual master) and remove ids integer in the following one. So anyone could figure out which part of code will be broken in advance. This would be more smart no ?
I think that would be problematic, since you'd get a bunch of deprecation warnings for things that aren't really going to be problems. Like if I set the unit IDs as integers, that is still going to be supported according to @alejoe91. What you can't test for is how the user interprets the output of .unit_ids or .channel_ids.
Yep ;)
I not sure to support the magical and hidden convertion : int to str in the base class. Magic stuff are futurs bugs!
Now that I think more about this, I think the problem is not that int IDs are supported... the problem is that int IDs are used as the keys in dicts. A different solution would be for you to always use the stringified versions of IDs when creating dicts. When a user passes in a dict with ID keys, they need to make sure they are string keys. Similarly when a function outputs a dict, it needs to make sure the keys are strings. Then if a user reads a dict they would need to do X[str(id)] ...
I not sure to support the magical and hidden convertion : int to str in the base class. Magic stuff are futurs bugs!
If you don't support int's being passed in, it's going to break a lot of my existing code as well as in spyglass.
No worry, the idea is not to break codes! The key as int in dict is one aspect one the problem, the other is mainly to have something clear without ambiguity for a new user.
For me key as int and str is OK. Key as str only would be better. I think that key that are transformed along the way (in base or when making dict or whenever) is a bad solution.
So the 2 options I like are:
- keep the actual behavior (int+str), we lived with this since a while without major drawback
- str only everwhere since init of the class (but this break code)
All other solutions that make hidden and magic cast to str are a design error I think.
Regarding json serialization, the following could be a better strategy
Replace things like this:
{
1: [...],
2: [...],
...
}
with things like this
{
'locations': [
{'id': 1, 'location': [...]},
{'id': 2, 'location': [...]},
...
}
}
This would require a big refactoring and I don't think it's needed. We have functions to make dictionaries serializable so it's now a solved problem. I think that by design, it's better to move forward with strings only (you would still be able to pass in integers in the Numpy constructor). I don't expect many things to break honestly, but we can make a separate branch that you can test against (same with spyglass). @khl02007 let us hear your opinion too!
I guess we never did agree on this one, did we?
Can we do this with a very slow deprecation cycle. Start taking throwing deprecation warnings for ints when setting channel_ids and unit_ids?
Honestly, having unit_ids as int for sorting is really really convinient in many sututation. For instance in Numpy.from_time_label where the labels become ids.
Is it possible to introduce a getter _get_main_ids(dtype="int") so representation is str under the hood but can grab as int if required / make an easy find & replace on existing code?
Honestly, having unit_ids as int for sorting is really really convinient in many sututation. For instance in Numpy.from_time_label where the labels become ids.
Can't we just use ids_to_indices there or is it more complicated than that?