Automatically upload traces to Hugging Face
Make tools to simplify adding agent traces to an ever-growing Hugging Face dataset.
- Create 2 datasets on Hugging Face:
  - one that would be an index, to easily retrieve traces based on attributes, similar to the dataframe we get when we run `load_result_df`
  - one that contains the actual zipped traces, which can be retrieved from a pointer in the index
- Write code to upload a study trace by trace, with an easy way to group the traces by study in the index.
- Legality:
  - only allow adding traces from whitelisted domains (e.g. our benchmarks or a subset of them)
  - attribute a specific license to each trace, based on which LLM and which benchmark it comes from
We can leverage `exp_args.exp_id` (a UUID) as a unique reference for each trace.
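As a rough illustration of the two-dataset idea above, here is a minimal sketch using only the Python standard library. Everything in it is an assumption made for illustration: the `IndexRow` fields, the `traces/<study>/<exp_id>.zip` path layout, and the helper names are hypothetical and not an agreed design. It does not call any Hugging Face API.

```python
from __future__ import annotations

import io
import uuid
import zipfile
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexRow:
    """One row of the (hypothetical) index dataset."""

    exp_id: str      # unique reference for the trace (a UUID string)
    study_name: str  # lets us group traces by study
    benchmark: str
    llm: str
    trace_path: str  # pointer into the (hypothetical) traces dataset


def zip_trace(files: dict[str, bytes]) -> bytes:
    """Pack the files of one trace into an in-memory zip archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name in sorted(files):  # sorted for a stable member order
            archive.writestr(name, files[name])
    return buffer.getvalue()


def make_index_row(exp_id: str, study_name: str, benchmark: str, llm: str) -> IndexRow:
    """Build the index row; the path layout is an assumed convention."""
    trace_path = f"traces/{study_name}/{exp_id}.zip"
    return IndexRow(exp_id, study_name, benchmark, llm, trace_path)


def group_by_study(rows: list[IndexRow]) -> dict[str, list[str]]:
    """Map each study name to the exp_ids of its traces."""
    groups: dict[str, list[str]] = {}
    for row in rows:
        groups.setdefault(row.study_name, []).append(row.exp_id)
    return groups


if __name__ == "__main__":
    exp_id = str(uuid.uuid4())
    row = make_index_row(exp_id, "study_1", "benchmark_a", "llm_x")
    payload = zip_trace({"step_0.json": b"{}"})
    print(row.trace_path, len(payload))
```

Keeping the pointer as a plain path derived from `exp_id` means the index can be filtered without downloading any archive, and only the selected traces need to be fetched afterwards.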
So @recursix, can I work on this if you don't mind?
That would be awesome as we've been running out of time to work on this.
I have something specific in mind, and there are other stakeholders that might have opinions on how it will be designed. You probably also have an idea of how you want to design it. So we should probably start with a more elaborated set of specs / API. Would you want to start with what you have in mind?
@RohitP2005, still interested?
Yeah, I just need some more time. Is that OK with you?
Yes it's good. Would you like to meet next week?
Yeah, sounds good @recursix
From my side, I’m thinking of structuring the design as follows:
Dataset Structure
- Index Dataset: Stores metadata (`exp_id`, study name, LLM, benchmark, license).
- Traces Dataset: Stores zipped trace files, referenced by `exp_id`.
API Functionality
- Trace Upload API: Uploads traces with metadata, ensuring only whitelisted domains/benchmarks are added.
- Index Query API: Queries index dataset to retrieve trace pointers based on attributes.
- License Management: Automatically assigns and validates licenses based on benchmark and LLM.
Legal Compliance
- Integrates checks for domain whitelisting and license attribution to ensure data integrity and compliance.
Looking forward to refining these specs and aligning with everyone’s input! Let me know if there's anything you'd like to add!
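To make the proposed API a bit more concrete, here is a minimal in-memory sketch of the upload, query, and license steps. It is only a starting point for discussion: `TraceStore`, `UploadRejected`, `ALLOWED_BENCHMARKS`, `LICENSE_RULES`, and all benchmark, LLM, and license names are hypothetical placeholders, and the two Hugging Face datasets are stood in for by a list and a dict.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Placeholder whitelist and license table; real values are still to be decided.
ALLOWED_BENCHMARKS = {"benchmark_a", "benchmark_b"}
LICENSE_RULES = {
    ("benchmark_a", "llm_x"): "license-placeholder-1",
    ("benchmark_b", "llm_x"): "license-placeholder-2",
}


class UploadRejected(ValueError):
    """Raised when a trace fails the whitelist or license checks."""


def resolve_license(benchmark: str, llm: str) -> str:
    """Look up the license for a (benchmark, LLM) pair, or reject."""
    try:
        return LICENSE_RULES[(benchmark, llm)]
    except KeyError:
        raise UploadRejected(f"no license rule for ({benchmark!r}, {llm!r})") from None


@dataclass
class TraceStore:
    """In-memory stand-in for the index dataset and the traces dataset."""

    index: list[dict] = field(default_factory=list)
    blobs: dict[str, bytes] = field(default_factory=dict)

    def upload_trace(self, exp_id: str, study_name: str, benchmark: str,
                     llm: str, zipped: bytes) -> dict:
        """Validate one trace, store its archive, and add an index row."""
        if benchmark not in ALLOWED_BENCHMARKS:
            raise UploadRejected(f"benchmark {benchmark!r} is not whitelisted")
        license_name = resolve_license(benchmark, llm)
        if exp_id in self.blobs:
            raise UploadRejected(f"exp_id {exp_id!r} was already uploaded")
        self.blobs[exp_id] = zipped
        row = {"exp_id": exp_id, "study_name": study_name,
               "benchmark": benchmark, "llm": llm, "license": license_name}
        self.index.append(row)
        return row

    def query(self, **filters: str) -> list[dict]:
        """Return index rows whose fields equal every given filter."""
        return [row for row in self.index
                if all(row.get(key) == value for key, value in filters.items())]
```

All checks run before anything is stored, so a rejected trace leaves neither an archive nor an index row behind. A call such as `store.query(study_name="study_1")` would return the rows for one study, whose `exp_id` values then point at the zipped traces.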
Sorry for the late reply. That sounds good overall. I would still like to discuss this with you, e.g. over Zoom or somewhere else we can chat. Can you contact me by email? [email protected]
Yeah sure @recursix, I will contact you through email.