Brian Hie
@cai-lw here's a STOC paper describing random projection trees: http://cseweb.ucsd.edu/~dasgupta/papers/rptree-stoc.pdf Could this be what you're looking for?
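For intuition, here's a minimal sketch of the core idea from that paper: recursively split the data by projecting onto a random unit direction and cutting near the median. This is an illustrative simplification (the paper also perturbs the split point; the function name and dict-based tree layout here are my own, not from any library):

```python
import numpy as np

def rp_tree_split(X, max_leaf_size=10, rng=None):
    """Build a random projection tree over the rows of X.

    Returns a nested dict; leaves hold arrays of row indices.
    """
    rng = np.random.default_rng() if rng is None else rng

    def build(indices):
        if len(indices) <= max_leaf_size:
            return {'leaf': indices}
        # Draw a random unit direction and project the points onto it.
        d = rng.normal(size=X.shape[1])
        d /= np.linalg.norm(d)
        proj = X[indices] @ d
        # Split at the median projection (Dasgupta & Freund additionally
        # perturb the split point; a plain median keeps this sketch short).
        thresh = np.median(proj)
        left = indices[proj <= thresh]
        right = indices[proj > thresh]
        if len(left) == 0 or len(right) == 0:  # degenerate split
            return {'leaf': indices}
        return {'dir': d, 'thresh': thresh,
                'left': build(left), 'right': build(right)}

    return build(np.arange(len(X)))
```

Querying then just means walking the tree with the same projections, which gives approximate nearest-neighbor candidates in the reached leaf.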
It is possible to modify the package to use other language models, but this is not currently supported and would require some implementation effort.
The notebook uses an old version of the esm package (`fair-esm==0.4.0`) that supports parsing non-amino acid characters. If you want to fix the error (and reproduce the paper results), you...
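If you want to match the pinned version mentioned above (assuming a standard pip-based environment):

```shell
pip install fair-esm==0.4.0
```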
Can you try using the `sketch` parameter to reduce memory consumption? Maybe set `sketch` to a value low enough to fit in memory, if you are indeed getting...
Perhaps it's killed due to memory usage? Can you try using sparse matrices? E.g.,
```python
import scipy.sparse

X_coexpr = scipy.sparse.csr_matrix(X_coexpr)
```
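To illustrate why this helps, here's a self-contained demonstration with a mostly-zero stand-in matrix (the variable name `X_coexpr` is reused from above; the data here is synthetic): CSR format stores only the nonzero entries plus index arrays, so memory drops roughly in proportion to the fill fraction.

```python
import numpy as np
import scipy.sparse

# Synthetic stand-in for a co-expression matrix: ~99% zeros.
rng = np.random.default_rng(0)
X_coexpr = rng.normal(size=(1000, 1000))
X_coexpr[rng.random((1000, 1000)) < 0.99] = 0.0

X_sparse = scipy.sparse.csr_matrix(X_coexpr)

dense_bytes = X_coexpr.nbytes
sparse_bytes = (X_sparse.data.nbytes
                + X_sparse.indices.nbytes
                + X_sparse.indptr.nbytes)
print(dense_bytes, sparse_bytes)  # CSR stores only the nonzeros
```

Most downstream operations (matrix products, slicing) work on the CSR matrix directly, so conversion is usually the only change needed.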
For now there's no anndata writer for the language model itself. You can try doing `adata.uns['model'] = None` before saving (i.e., don't save the model) and that should probably work.
The model is only needed for velocity score computation, not any of the downstream analysis. I would compute the velocities on the HPC, save it without the model, then just...
@pan-genome we were able to just use the standard HuggingFace sampling API (e.g., loading with `AutoModelForCausalLM.from_pretrained()`, sampling with `model.generate()`) to generate 500k+ tokens on an 80 GB GPU.
Something like
```python
model_config = AutoConfig.from_pretrained(
    'togethercomputer/evo-1-131k-base',
    trust_remote_code=True,
    revision="1.1_fix",
)
model_config.max_seqlen = 500_000

model = AutoModelForCausalLM.from_pretrained(
    'togethercomputer/evo-1-131k-base',
    config=model_config,
    trust_remote_code=True,
    revision="1.1_fix",
)

outputs = model.generate(
    input_ids,
    max_new_tokens=500_000,
    temperature=1.,
    top_k=4,
)
```
This may be a bug in the way we handle sequences longer than the ESM context length; we'll investigate.