Cucat Featurization base

Open tanmoyio opened this issue 2 years ago • 8 comments

Starter script

import pandas as pd
import cudf  # not used directly below; cu_cat relies on the cudf/RAPIDS stack
import graphistry

# Load the 50k-row red-team auth log sample and its labels
df = pd.read_csv('https://gist.githubusercontent.com/silkspace/c7b50d0c03dc59f63c48d68d696958ff/raw/31d918267f86f8252d42d2e9597ba6fc03fcdac2/redteam_50k.csv', index_col=0)
red_team = pd.read_csv('https://gist.githubusercontent.com/silkspace/5cf5a94b9ac4b4ffe38904f20d93edb1/raw/888dabd86f88ea747cf9ff5f6c44725e21536465/redteam_labels.csv', index_col=0)

# Build one text column to featurize from the categorical fields
df['feats'] = df.src_computer + ' ' + df.dst_computer + ' ' + df.auth_type + ' ' + df.logontype
tdf = pd.concat([red_team.reset_index(), df.reset_index()])
tdf['node'] = range(len(tdf))  # unique node id per row

# Bind the node id column, then featurize the same column with each engine
g = graphistry.nodes(tdf, 'node')
g1 = g.umap(X=['feats'], feature_engine='cu_cat')
print(g1._node_features)
g2 = g.umap(X=['feats'], feature_engine='dirty_cat')
print(g2._node_features)

tanmoyio avatar May 15 '23 15:05 tanmoyio

@tanmoyio can we close this, or is it still live and in need of review?

lmeyerov avatar Jun 15 '23 00:06 lmeyerov

  • For cu_cat itself (DT3 branch), I have worked out the dynamic memory handling for T4 vs. A100 flexibility.
  • Also worked out datetime passthrough. However, this needs to bypass cudf dataframing in both cu_cat and pygraphistry so that g.plotter infers datetime correctly and provides the time series box. Currently I accomplish this in a hacky way: I bind the datetime column to the embeddings after transforming but before plotting, which avoids the cudf requirement.
  • I have now refactored the code to require only the gapencoder and tablevectorizer files/functions (DT4 branch, forked from DT3).
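A minimal sketch of the datetime workaround described above, using plain pandas. The helper name `bind_datetime`, the column names, and the sample data are hypothetical; this is not the code in the DT3/DT4 branches, only an illustration of re-attaching a datetime column to an embedding frame after the transform step and before plotting.

```python
import pandas as pd


def bind_datetime(embedding: pd.DataFrame, original: pd.DataFrame, time_col: str) -> pd.DataFrame:
    """Re-attach a datetime column to an embedding frame, aligned by index."""
    out = embedding.copy()
    # Parse on the pandas side so the column has a datetime64 dtype
    # rather than strings/objects.
    out[time_col] = pd.to_datetime(original.loc[out.index, time_col])
    return out


# Hypothetical data: two events and a 2-D embedding that share an index.
events = pd.DataFrame(
    {"time": ["2023-07-19 02:07:00", "2023-07-21 03:07:00"]}, index=[10, 11]
)
emb = pd.DataFrame({"x": [0.1, 0.9], "y": [0.4, 0.2]}, index=[10, 11])

plot_ready = bind_datetime(emb, events, "time")
```

The resulting frame stays in pandas, so the datetime column never has to pass through cudf.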

dcolinmorgan avatar Jul 19 '23 02:07 dcolinmorgan

Awesome - is the plan to start landing, or more first?

And would it make sense to start reviewing any PRs? If they form a sequence, can you stack them & point out the order so it's clear?

lmeyerov avatar Jul 19 '23 07:07 lmeyerov

landing would be wonderful -- before the end of July is my dream

DT4 is the latest cu_cat PR branch, which passes many pytests and works as expected in every demo I've done in the last few months

dcolinmorgan avatar Jul 21 '23 03:07 dcolinmorgan

ok @tanmoyio can you help double-check the tests, take it for a test drive, and land it first in cu_cat and then here?

After, can you help add to main graphistry (https://github.com/graphistry/graphistry/blob/master/compose/dockerfiles/base/05-nvidia.Dockerfile) ? I think we should keep default-off for now, and should test that it's truly default off -- that existence doesn't (yet) trigger it to be used, only explicit use.
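One way to picture the "default off" requirement is a small resolver that returns cu_cat only when the caller asks for it by name. This is a sketch under assumptions: `resolve_engine`, `DEFAULT_ENGINE`, and the `is_installed` hook are hypothetical names, not pygraphistry's actual selection logic.

```python
import importlib.util

# Assumption for this sketch: the CPU engine stays the default.
DEFAULT_ENGINE = "dirty_cat"


def resolve_engine(requested=None, is_installed=None):
    """Pick a featurization engine; cu_cat is used only on explicit request."""
    if is_installed is None:
        # Check availability without importing the package.
        is_installed = lambda name: importlib.util.find_spec(name) is not None
    if requested is None:
        # cu_cat being installed must not change the default.
        return DEFAULT_ENGINE
    if requested == "cu_cat" and not is_installed("cu_cat"):
        raise ImportError("cu_cat was requested explicitly but is not installed")
    return requested
```

A test for "truly default off" would then call the resolver with no argument in an environment where cu_cat is present and check that the default is unchanged.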

lmeyerov avatar Jul 23 '23 00:07 lmeyerov

test-full-ai test L395 seems to be getting hung up by 1 of 3 features being exactly reproduced

  • when first discussing with @silkspace -- this is exactly what we realized approximate estimation would likely return, and the user must make sure the features make sense, just like with dirty_cat
  • likely need to test a few so that 2/3 are always reproduced across several estimations, rather than requiring one case of 3/3 reproduction
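The relaxed check proposed above could look roughly like this. The helper names and the feature labels are hypothetical, and the threshold of 2 out of 3 is taken from the comment, not from the actual test at L395.

```python
def reproduced_count(expected, actual):
    """Number of expected feature names that also appear in one run."""
    return len(set(expected) & set(actual))


def mostly_reproduced(expected, runs, min_hits=2):
    """True if every run reproduces at least `min_hits` expected features."""
    return all(reproduced_count(expected, run) >= min_hits for run in runs)


# Hypothetical feature names from three separate estimations.
expected = ["src_computer", "dst_computer", "auth_type"]
runs = [
    ["src_computer", "dst_computer", "logontype"],
    ["auth_type", "src_computer", "dst_computer"],
    ["dst_computer", "auth_type", "logontype"],
]

ok = mostly_reproduced(expected, runs)
```

This trades an exact-match assertion on a single run for a threshold that must hold on every run, which fits an approximate estimator better.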

dcolinmorgan avatar Jul 26 '23 10:07 dcolinmorgan

  • @silkspace wrt https://github.com/graphistry/pygraphistry/pull/486#issuecomment-1651436514 , you may understand this better?

lmeyerov avatar Jul 29 '23 06:07 lmeyerov

this seems to have drifted a bit from main, see merge conflict

i'm unsure about the cu_cat bits here, but if this pr replaces a bunch of import cudf with dynamic "cudf.dataframe" in str(module(df)) checks to avoid slow imports, sounds useful & overdue...
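For illustration, a check of this kind can be written without importing cudf at all, by inspecting the module name of the object's type. The function name `is_cudf_frame` is hypothetical, and the prefix match is an assumption about where cudf defines its classes; the PR may use a different string test.

```python
def is_cudf_frame(obj) -> bool:
    """Guess whether `obj` comes from cudf without importing cudf."""
    # type(obj).__module__ is a plain string such as "cudf.core.dataframe",
    # so reading it does not trigger the slow cudf import.
    module = type(obj).__module__ or ""
    return module == "cudf" or module.startswith("cudf.")


# Stand-in class for the demo: only its module name mimics cudf.
FakeCudfFrame = type("DataFrame", (), {"__module__": "cudf.core.dataframe"})

looks_like_cudf = is_cudf_frame(FakeCudfFrame())
looks_like_list = is_cudf_frame([1, 2, 3])
```

Because the check is a string comparison, CPU-only code paths never pay the import cost, and the real `import cudf` can be deferred to the branch that needs it.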

lmeyerov avatar Jul 04 '24 05:07 lmeyerov