
Aligned UMAP: clarification on 'overlapping' points through time?

Open drscotthawley opened this issue 2 years ago • 2 comments

Hi! Thanks very much for developing these wonderful tools. I've used UMAP for a little while and now I'm very excited to try out Aligned UMAP. The instructions (https://umap-learn.readthedocs.io/en/latest/aligned_umap_basic_usage.html) provide an example which is not merely "contrived": it makes it hard to see how to use actual time-dependent data.

In the example, we have 10 digits and time evolution is put into 10 time steps (the coincidence of the two different 10's really slowed me down lol), but... somehow you make it so that data points are shared between time steps? I don't see how to adapt that to the (common?) case in which all the data points change at each time step.

It's still not clear to me how we're supposed to build some sort of "overlapping" amount of points. (Are we expected to insert "glue frames" in between our time slices, for which we grab half the points from the previous time and half from the next time?)

My current application is that I have 6 different classes of points, with 360 examples for each class, where each example is a 64-dimensional vector that evolves over 512 time steps. I'm ok with downsampling the 512 to, lol, maybe 8 or 16 for starters. But... still the "make them overlap" isn't clear to me -- they all change.

Could someone please clarify? I'd be happy to contribute to the documentation... once I understand how this is supposed to work.

It is generally the case that the indices of points that are supposed to align will persist from time step to time step. Is there a mode whereby we can make use of that?

Thanks.

drscotthawley avatar Mar 18 '23 22:03 drscotthawley

Perhaps the more complex, but also more realistic example based on US congressional voting may help a little.

Generally the goal is to either have some specific identity that persists over multiple timesteps (e.g. a given member of congress, who votes differently each year, but can be identified from one year to the next), or to bin the data with an overlapping binning strategy. Without knowing more about your specific data I can't say whether option one makes sense (perhaps it does?). To achieve the latter we can do a kind of downsampling with overlapping bins. To give a concrete example, given 512 time steps, we could create a sequence of datasets where the first dataset has data from timesteps [0,1,2,3,4], the second dataset from timesteps [2,3,4,5,6], the third dataset from timesteps [4,5,6,7,8], and so on. So now you have overlapping data, because there are data points from timesteps 2, 3, and 4 that are in both the first dataset and the second dataset, and so on.
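The overlapping-bin scheme described above can be sketched in NumPy roughly as follows. This is only an illustrative sketch, not library code: the helper name overlapping_windows, the (n_timesteps, n_points, n_features) array layout, and the small stand-in data are assumptions. It also assumes that the same points appear in the same order at every timestep, that stride < window, and that each relation dict maps row indices of one dataset to row indices of the next.

```python
import numpy as np

def overlapping_windows(data, window=5, stride=2):
    """Split (n_timesteps, n_points, n_features) data into overlapping 2D datasets.

    datasets[k] stacks the points of `window` consecutive timesteps into a
    (window * n_points, n_features) array; relations[k] maps row indices of
    datasets[k] to row indices of datasets[k + 1].
    """
    n_timesteps, n_points, n_features = data.shape
    starts = range(0, n_timesteps - window + 1, stride)
    datasets = [
        data[s:s + window].reshape(window * n_points, n_features) for s in starts
    ]
    # A shared observation sits `stride * n_points` rows further down in the
    # earlier window than it does in the following window.
    offset = stride * n_points
    n_shared = (window - stride) * n_points
    relation = {i + offset: i for i in range(n_shared)}
    relations = [dict(relation) for _ in range(len(datasets) - 1)]
    return datasets, relations

# Hypothetical stand-in data: 16 timesteps, 30 points, 64 features.
rng = np.random.default_rng(0)
data = rng.normal(size=(16, 30, 64))
datasets, relations = overlapping_windows(data)
```

With window=5 and stride=2, the first dataset covers timesteps [0,1,2,3,4] and the second covers [2,3,4,5,6], so the rows for timesteps 2, 3, and 4 are the ones linked by the relation dict.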

Does this help?


lmcinnes avatar Mar 20 '23 14:03 lmcinnes

Thank you so much for your reply! I was unaware of the congressional voting example so I'm looking at that too.

I wanted to wait to write back until I'd tried to implement your suggestions:

So in your example above, if I understand correctly, after implementing the time-overlapping as you describe, I might end up with a "slices" array with a shape like (254, 5, ...), where the 254 is what we get from stepping across 512 with a stride of 2 and a window length of 5, and the "..." would in my case be the 360 data points of 64 dimensions each, giving (254, 5, 360, 64). (I'll do just one class instead of the 6 I mentioned at first.)

So slices.shape == (254, 5, 360, 64). Then the relation_dict variables would be

relation_dict = {i+2:i for i in range(len(slices[0]))}
relation_dicts = [relation_dict.copy() for i in range(len(slices) - 1)]

I'll give that a shot!

...Darn. When I try running

aligned_mapper = umap.AlignedUMAP().fit(slices, relations=relation_dicts)

I get ValueError: Found array with dim 3. None expected <= 2. It doesn't seem to want to receive arrays with 3 or more dimensions, but isn't that what we naturally get with the overlapping time slices? Thanks for your time!
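One possible way around the error, sketched under assumptions: the message suggests each dataset has to be a 2D (n_samples, n_features) array, so the timestep and point axes of every window could be merged before fitting. The relation keys would then have to index rows of the flattened arrays rather than timesteps, which means the offset becomes 2 * n_points instead of 2. The tiny array sizes below are stand-ins for (254, 5, 360, 64), and the variable names are illustrative only.

```python
import numpy as np

# Stand-in for the (254, 5, 360, 64) array: 4 windows, 5 timesteps, 6 points, 3 features.
slices = np.zeros((4, 5, 6, 3))
n_windows, window, n_points, n_features = slices.shape

# Merge the timestep and point axes so each window becomes one 2D dataset.
datasets = [w.reshape(window * n_points, n_features) for w in slices]

# With a stride of 2 timesteps, row i + 2 * n_points of one window is the same
# observation as row i of the next window.
stride = 2
relation = {i + stride * n_points: i for i in range((window - stride) * n_points)}
relations = [dict(relation) for _ in range(n_windows - 1)]
```

Here datasets is a list of 2D arrays and relations has one dict per consecutive pair, which is the form passed to AlignedUMAP().fit(datasets, relations=relations) in the attempt above.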

PS- A different strategy I can imagine would be to pass in the full (512, 360, 64) array -- or perhaps downsample in time a bit first, maybe (16, 360, 64), and then use the "identity" relations

relation_dict = {i:i for i in range(len(slices[0]))}
relation_dicts = [relation_dict.copy() for i in range(len(slices) - 1)]

(i.e. no overlapping at all, but telling it that points are supposed to match across time steps?)

When I try that, the execution takes... a very long time. Still waiting to see if/when it completes.
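The identity-relation strategy from this PS could be set up along these lines. Again this is a sketch under assumptions: the step of 32 (giving 16 frames out of 512), the reduced point count, and the random stand-in data are illustrative choices, and it assumes row i refers to the same point at every timestep.

```python
import numpy as np

# Hypothetical stand-in: 512 timesteps, 36 points (instead of 360), 64 features.
rng = np.random.default_rng(0)
data = rng.normal(size=(512, 36, 64))

step = 32
downsampled = data[::step]        # keeps timesteps 0, 32, ..., 480
datasets = list(downsampled)      # one 2D (n_points, n_features) array per kept timestep

# Row i of one frame is the same point as row i of the next frame.
n_points = data.shape[1]
identity = {i: i for i in range(n_points)}
relations = [dict(identity) for _ in range(len(datasets) - 1)]
```

Because each dataset is a single timestep, every array is already 2D, and a coarser step gives fewer frames to fit, which may help with the long run time.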

drscotthawley avatar Mar 22 '23 23:03 drscotthawley