cleora pyo3 integration and adding support for parquet output and s3 stores

Open qooba opened this issue 2 years ago • 0 comments

I'd like to add a few features.

Integration with pyo3 bindings, which will make it possible to publish the library as a Python package and use it without spawning a subprocess
Support for Parquet output persistence: output_format="parquet". Because writing to Parquet row by row is inefficient, an additional parameter will be required to write in chunks: chunk_size=3000
Support for S3 as an input and output store
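
To illustrate why a chunk_size parameter helps, here is a minimal sketch of buffering embedding rows and flushing them in batches. It is not cleora's implementation: iter_chunks, write_in_chunks and the write_chunk callback are hypothetical names, and the callback stands in for whatever Parquet writer would actually be used.

```python
from typing import Callable, Iterable, Iterator, List, Sequence

Row = Sequence[float]  # one embedding vector (assumed layout)


def iter_chunks(rows: Iterable[Row], chunk_size: int) -> Iterator[List[Row]]:
    """Group rows into lists of at most chunk_size items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    buffer: List[Row] = []
    for row in rows:
        buffer.append(row)
        if len(buffer) == chunk_size:
            yield buffer
            buffer = []
    if buffer:
        # Flush the final, possibly smaller, chunk.
        yield buffer


def write_in_chunks(
    rows: Iterable[Row],
    chunk_size: int,
    write_chunk: Callable[[List[Row]], None],
) -> int:
    """Call write_chunk once per chunk and return the number of chunks written."""
    flushed = 0
    for chunk in iter_chunks(rows, chunk_size):
        write_chunk(chunk)  # hypothetical sink, e.g. one Parquet row group
        flushed += 1
    return flushed
```

With chunk_size=3000, the sink is called once per 3000 rows instead of once per row, which is the saving the proposal is after.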

Example usage:

import cleora

output_dir = 's3://output'
fb_cleora_input_clique_filename = "s3://input/fb_cleora_input_clique.txt"
fb_cleora_input_star_filename = "s3://input/fb_cleora_input_star.txt"

cleora.run(
    input=[fb_cleora_input_clique_filename],
    type_name="tsv",
    dimension=1024,
    max_iter=5,
    seed=10,
    prepend_field=False,
    log_every=1000,
    in_memory_embedding_calculation=True,
    cols_str="complex::reflexive::CliqueNode",
    output_dir=output_dir,
    output_format="parquet",
    relation_name="emb",
    chunk_size=3000,
)
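
The s3:// paths in the example imply that the library has to tell S3 locations apart from local files. A rough sketch of that dispatch, using only the Python standard library, could look like the following. StoreLocation and parse_location are hypothetical helpers for illustration, not part of cleora.

```python
from typing import NamedTuple
from urllib.parse import urlparse


class StoreLocation(NamedTuple):
    scheme: str  # "s3" or "file"
    bucket: str  # empty for local paths
    key: str     # object key for S3, the original path for local files


def parse_location(path: str) -> StoreLocation:
    """Split an s3://bucket/key URI; treat anything else as a local path."""
    parsed = urlparse(path)
    if parsed.scheme == "s3":
        return StoreLocation("s3", parsed.netloc, parsed.path.lstrip("/"))
    return StoreLocation("file", "", path)
```

For the paths above, "s3://input/fb_cleora_input_clique.txt" would resolve to bucket "input" and key "fb_cleora_input_clique.txt", while 's3://output' would resolve to bucket "output" with an empty key.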

Aug 24 '22 23:08 qooba
