cleora
cleora copied to clipboard
pyo3 instegration and adding support for parquet output and s3 stores
I'd like to add a few features.
-
Integration with pyo3 bindings which will enable to publish library as a python package and use without using subprocess
-
Support for a parquet output persistence:
output_format="parquet"
because writing to parquet row by row is inefficient thus and additional parameter will be required to write with the chunks:chunk_size=3000
-
Support for a s3 as a input and output store
Example usage:
import cleora
output_dir = 's3://output'
fb_cleora_input_clique_filename = "s3://input/fb_cleora_input_clique.txt"
fb_cleora_input_star_filename = "s3://input/fb_cleora_input_star.txt"
cleora.run(
input=[fb_cleora_input_clique_filename],
type_name="tsv",
dimension=1024,
max_iter=5,
seed=10,
prepend_field=False,
log_every=1000,
in_memory_embedding_calculation=True,
cols_str="complex::reflexive::CliqueNode",
output_dir=output_dir,
output_format="parquet",
relation_name="emb",
chunk_size=3000,
)