
Automatically upload traces to Hugging Face

Open recursix opened this issue 1 year ago • 10 comments

Build tools that simplify adding agent traces to an ever-growing Hugging Face dataset.

  • create two datasets on Hugging Face:

    • an index that makes it easy to retrieve traces by attribute, similar to the dataframe we get when we run load_result_df
    • one that contains the actual zipped traces, each retrievable from a pointer in the index
  • write code to upload a study trace by trace, with an easy way to group the traces by study in the index.

  • legality:

    • only accept traces from whitelisted domains (e.g. our benchmarks or a subset of them)
    • attribute a specific license to each trace based on which LLM and which benchmark it involves.
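
A minimal sketch of the "zipped trace plus index pointer" idea, assuming a trace is a directory of files on disk. The helper names (`zip_trace_dir`, `make_index_row`, `list_archive`) and the `archive_path` field are hypothetical and not part of AgentLab; uploading the resulting bytes to Hugging Face is left out.

```python
import io
import zipfile
from pathlib import Path


def zip_trace_dir(trace_dir: Path) -> bytes:
    """Pack every file under trace_dir into an in-memory zip archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(trace_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to the trace root so the archive is relocatable.
                archive.write(path, arcname=path.relative_to(trace_dir).as_posix())
    return buffer.getvalue()


def make_index_row(exp_id: str, study_name: str, archive_path: str, **attributes) -> dict:
    """Build one index row; archive_path is the pointer into the traces dataset."""
    return {
        "exp_id": exp_id,
        "study_name": study_name,  # lets traces be grouped by study
        "archive_path": archive_path,
        **attributes,
    }


def list_archive(data: bytes) -> list[str]:
    """Return the sorted file names stored in a zipped trace."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return sorted(archive.namelist())
```

Keeping the index rows small and the archives separate means the index can be filtered cheaply before any trace is downloaded.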

recursix avatar Oct 07 '24 17:10 recursix

We can use exp_args.exp_id (a UUID) as the unique reference for each trace.
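
As a rough illustration, the exp_id could double as the pointer from the index into the traces dataset. This sketch assumes the id is available as a string; the `trace_path_for` helper and the `traces/<exp_id>.zip` layout are hypothetical.

```python
import uuid


def trace_path_for(exp_id: str, prefix: str = "traces") -> str:
    """Map an exp_id to a deterministic archive path, rejecting non-UUID strings."""
    # uuid.UUID raises ValueError on malformed input and normalizes the
    # text form (lowercase, hyphenated), so equal ids map to equal paths.
    canonical = str(uuid.UUID(exp_id))
    return f"{prefix}/{canonical}.zip"
```

Because the path is derived from the id alone, the index only needs to store the exp_id to locate a trace.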

recursix avatar Oct 11 '24 18:10 recursix

So @recursix, can I work on this if you don't mind?

RohitP2005 avatar Jan 07 '25 03:01 RohitP2005

That would be awesome as we've been running out of time to work on this.

I have something specific in mind, and there are other stakeholders that might have opinions on how it will be designed. You probably also have an idea of how you want to design it. So we should probably start with a more elaborated set of specs / API. Would you want to start with what you have in mind?

recursix avatar Jan 07 '25 03:01 recursix

@RohitP2005, still interested?

recursix avatar Jan 10 '25 18:01 recursix

Yeah, I just need some more time. Is that OK with you?

RohitP2005 avatar Jan 10 '25 18:01 RohitP2005

Yes it's good. Would you like to meet next week?

recursix avatar Jan 10 '25 19:01 recursix

Yeah, sounds good @recursix

RohitP2005 avatar Jan 10 '25 21:01 RohitP2005

From my side, I’m thinking of structuring the design as follows:

Dataset Structure

  • Index Dataset: Stores metadata (exp_id, study name, LLM, benchmark, license).
  • Traces Dataset: Stores zipped trace files, referenced by exp_id.

API Functionality

  • Trace Upload API: Uploads traces with metadata, ensuring only whitelisted domains/benchmarks are added.
  • Index Query API: Queries index dataset to retrieve trace pointers based on attributes.
  • License Management: Automatically assigns and validates licenses based on benchmark and LLM.
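
To make the upload and query APIs above concrete, here is a small in-memory sketch. Everything in it is an assumption for illustration: the `IndexRecord` and `TraceIndex` names, the field names, and the whitelist entries are hypothetical, and a real version would write to the Hugging Face datasets instead of a dict.

```python
from dataclasses import dataclass

# Placeholder whitelist, not a decision made in this thread.
ALLOWED_BENCHMARKS = {"benchmark_a", "benchmark_b"}


@dataclass(frozen=True)
class IndexRecord:
    exp_id: str
    study_name: str
    llm: str
    benchmark: str
    license: str


class TraceIndex:
    """In-memory stand-in for the index dataset."""

    def __init__(self) -> None:
        self._records: dict[str, IndexRecord] = {}

    def upload(self, record: IndexRecord) -> None:
        """Register a trace, refusing benchmarks that are not whitelisted."""
        if record.benchmark not in ALLOWED_BENCHMARKS:
            raise ValueError(f"benchmark {record.benchmark!r} is not whitelisted")
        self._records[record.exp_id] = record

    def query(self, **filters: str) -> list[str]:
        """Return the exp_ids whose record matches every given attribute."""
        return [
            record.exp_id
            for record in self._records.values()
            if all(getattr(record, key) == value for key, value in filters.items())
        ]
```

For example, `index.query(study_name="study_a")` would return the pointers for every trace in that study, which covers the "group by study" requirement from the original issue.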

Legal Compliance

  • Integrates checks for domain whitelisting and license attribution to ensure data integrity and compliance.
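
One possible shape for the license check, sketched under the assumption that licenses are looked up from a (benchmark, LLM) table and that unknown pairs are rejected rather than given a default. The table entries, `ComplianceError`, and both function names are hypothetical placeholders.

```python
# Hypothetical table; real entries would come from the benchmark and LLM terms.
LICENSE_TABLE = {
    ("benchmark_a", "model_x"): "license_1",
    ("benchmark_a", "model_y"): "license_2",
}


class ComplianceError(Exception):
    """Raised when a trace cannot be legally attributed a license."""


def resolve_license(benchmark: str, llm: str) -> str:
    """Look up the license for a (benchmark, LLM) pair; fail closed if unknown."""
    try:
        return LICENSE_TABLE[(benchmark, llm)]
    except KeyError:
        raise ComplianceError(f"no license rule for {benchmark!r} with {llm!r}") from None


def validate_license(benchmark: str, llm: str, declared: str) -> None:
    """Check that the license declared on a record matches the table."""
    expected = resolve_license(benchmark, llm)
    if declared != expected:
        raise ComplianceError(f"declared {declared!r}, expected {expected!r}")
```

Failing closed means a new benchmark or model cannot be uploaded until someone adds an explicit rule for it.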

Looking forward to refining these specs and aligning with everyone’s input! Let me know if there's anything you'd like to add!

RohitP2005 avatar Jan 10 '25 21:01 RohitP2005

Sorry for the late reply. That sounds good overall. I would still like to discuss this with you, e.g. over Zoom, or find a place to chat. Can you contact me by email? [email protected]

recursix avatar Jan 15 '25 02:01 recursix

Yeah sure @recursix, I will contact you by email.

RohitP2005 avatar Jan 15 '25 20:01 RohitP2005