cu_cat featurization base
Starter script
import pandas as pd
import cudf  # required for the cu_cat (GPU) feature engine
import graphistry

# 50k-row auth-log sample plus red-team labels
df = pd.read_csv('https://gist.githubusercontent.com/silkspace/c7b50d0c03dc59f63c48d68d696958ff/raw/31d918267f86f8252d42d2e9597ba6fc03fcdac2/redteam_50k.csv', index_col=0)
red_team = pd.read_csv('https://gist.githubusercontent.com/silkspace/5cf5a94b9ac4b4ffe38904f20d93edb1/raw/888dabd86f88ea747cf9ff5f6c44725e21536465/redteam_labels.csv', index_col=0)

# concatenate the categorical columns into a single text feature
df['feats'] = df.src_computer + ' ' + df.dst_computer + ' ' + df.auth_type + ' ' + df.logontype
tdf = pd.concat([red_team.reset_index(), df.reset_index()])
tdf['node'] = range(len(tdf))

g = graphistry.nodes(tdf)

# featurize + UMAP with the GPU engine
g1 = g.umap(X=['feats'], feature_engine='cu_cat')
print(g1._node_features)

# same pipeline with the CPU engine for comparison
g2 = g.umap(X=['feats'], feature_engine='dirty_cat')
print(g2._node_features)
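A quick sanity check one might append to the script above (this is an assumed usage, not part of the original snippet): both engines should return one feature row per node, while column counts can differ since the encoders are approximate.

```python
# Sanity check, assuming the starter script above ran to completion:
# both engines should produce one feature row per node; the number of
# columns can differ between the cu_cat and dirty_cat encodings.
print(g1._node_features.shape, g2._node_features.shape)
assert g1._node_features.shape[0] == g2._node_features.shape[0] == len(tdf)

# g1.plot()  # visualize the cu_cat-featurized UMAP layout
```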
@tanmoyio can we close this, or is it still live and in need of review?
- For cu_cat itself (the DT3 branch), I have worked out dynamic memory handling for T4 vs. A100 flexibility.
- I have also worked out datetime pass-through. However, this needs to bypass cudf DataFrame conversion in both cu_cat AND pygraphistry so that the plotter infers datetimes correctly and provides the time-series box. Currently I accomplish this in a hacky way by binding the datetime column to the embeddings after transforming but before plotting, thus avoiding the cudf requirement (see the sketch after this list).
- I have now refactored the code so it only requires the GapEncoder and TableVectorizer files/functions (the DT4 branch, forked from DT3).
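A minimal sketch of that hacky binding, assuming node order is preserved through `umap()`; the column names and the tiny DataFrame are made up for illustration:

```python
# Sketch of the datetime workaround: featurize/UMAP on the text column only,
# then re-attach the raw pandas datetime column before plotting so the
# plotter sees a real datetime dtype (no cudf conversion involved).
import pandas as pd
import graphistry

nodes = pd.DataFrame({
    'node': range(4),
    'feats': ['a b', 'b c', 'c d', 'd a'],
    'time': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04']),
})

g = graphistry.nodes(nodes, 'node')
g2 = g.umap(X=['feats'], feature_engine='cu_cat')  # datetimes left out of featurization

# bind the untouched pandas datetime column back onto the embedded node table
g3 = g2.nodes(g2._nodes.assign(time=nodes['time'].values))
# g3.plot()  # the time-series box should now pick up 'time'
```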
Awesome -- is the plan to start landing, or is there more to do first?
And would it make sense to start reviewing any PRs? If there's a sequence, can you stack them and point it out so it's clear?
Landing would be wonderful -- before the end of July is my dream.
DT4 is the latest cu_cat PR branch; it passes many pytests and works as expected in every demo I've run over the last few months.
OK @tanmoyio, can you help double-check the tests, take it for a test drive, and land it first in cu_cat and then here?
After that, can you help add it to main graphistry (https://github.com/graphistry/graphistry/blob/master/compose/dockerfiles/base/05-nvidia.Dockerfile)? I think we should keep it default-off for now, and we should test that it's truly default-off -- that merely being installed doesn't (yet) trigger it to be used, only explicit use does.
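A hypothetical test along those lines -- attribute names like `_feature_params` are my guess at where the resolved engine might be recorded, not a confirmed API:

```python
import pandas as pd
import pytest
import graphistry

cu_cat = pytest.importorskip('cu_cat')  # the check only matters when cu_cat is installed

def test_cu_cat_is_default_off():
    df = pd.DataFrame({'node': [0, 1, 2], 'feats': ['a b', 'b c', 'c a']})
    g = graphistry.nodes(df, 'node').featurize(X=['feats'])  # no feature_engine requested
    # assumption: the plottable records which engine was resolved; adjust to
    # whatever attribute the PR actually exposes
    resolved = getattr(g, '_feature_params', {}).get('nodes', {}).get('feature_engine')
    assert resolved != 'cu_cat'
```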
The test-full-ai test at L395 seems to be getting hung up on only 1 of 3 features being exactly reproduced:
- When first discussing this with @silkspace, this is exactly what we realized approximate estimation would likely return, and the user must make sure the features make sense, just like with dirty_cat.
- We likely need to test several estimations so that 2 of 3 features are always reproduced, rather than requiring one case of exact 3/3 reproduction (see the sketch after this list).
- @silkspace, w.r.t. https://github.com/graphistry/pygraphistry/pull/486#issuecomment-1651436514, you may understand this better?
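A sketch of that relaxation, with hypothetical helper names (the real test at L395 would plug in its own estimator and reference features):

```python
import numpy as np

def n_features_reproduced(got, expected, rtol=1e-2):
    """Count how many feature columns are approximately reproduced."""
    return sum(
        np.allclose(got[:, i], expected[:, i], rtol=rtol)
        for i in range(expected.shape[1])
    )

def assert_majority_reproduced(estimate_fn, expected, runs=5, need=2):
    # Run the approximate estimation several times and require that at least
    # `need` of the features match each time, instead of exact 3/3 once.
    for _ in range(runs):
        assert n_features_reproduced(estimate_fn(), expected) >= need
```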
This seems to have drifted a bit from main; see the merge conflict.
I'm unsure about the cu_cat bits here, but if this PR replaces a bunch of `import cudf` statements with dynamic `'cudf.dataframe' in str(module(df))`-style checks to avoid slow imports, that sounds useful and overdue...
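For reference, my reading of the pattern being described (not the exact code in the PR) is a string check on the object's defining module, so CPU-only environments never pay the cudf import cost:

```python
from inspect import getmodule

def is_cudf_df(obj) -> bool:
    # String-match the defining module instead of doing `import cudf`.
    # e.g. is_cudf_df(pd.DataFrame()) -> False;
    #      is_cudf_df(cudf.DataFrame()) -> True (when cudf is installed)
    return 'cudf' in str(getmodule(obj))
```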