
feat: new SDK PoC

Open frascuchon opened this issue 1 year ago • 7 comments

Description

This is a draft PR for the new Core SDK definition. The main goal of this PR is to define the foundations of the new Python SDK module/package with (temporary) full backward compatibility with the current SDK components.

The PR defines a new argilla.sdk module where all these changes occur. The main interaction class is argilla.sdk.Dataset, which defines the new methods and attributes to work with Argilla feedback datasets in a new way.

This class currently extends RemoteFeedbackDataset, which is the entry point for backward compatibility.
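
The compatibility layer can be pictured roughly like this (a minimal sketch; both class bodies are invented stand-ins, not the PR's real code):

```python
# Hypothetical sketch of the backward-compatibility approach: the new
# sdk.Dataset subclasses the legacy RemoteFeedbackDataset, so old call
# sites keep working while new-style methods are layered on top.
# Both class bodies are invented stand-ins, not the PR's real code.

class RemoteFeedbackDataset:
    """Stand-in for the legacy class and its existing API."""

    def __init__(self, name: str):
        self.name = name

    def fetch_records(self):  # legacy method, still available
        return []


class Dataset(RemoteFeedbackDataset):
    """Stand-in for the new argilla.sdk.Dataset entry point."""

    def publish(self) -> str:  # new-style method added on top
        return f"{self.name} is ready"


ds = Dataset(name="my-ds")
ds.fetch_records()  # legacy API keeps working
ds.publish()        # new API is available on the same object
```

Because the new class is a subclass, any code typed against the legacy class accepts the new instances unchanged.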

Here is a sample code snippet showing how to work with this new definition:

from argilla import sdk

sdk.init(...) # This line replaces the rg.init

ws = sdk.Workspace.from_name("argilla") # This is a shortcut for the rg.Workspace class

ds_config = sdk.DatasetConfiguration(guidelines="My guidelines")
ds_config.fields.append(
   sdk.Field(
      name="field1", 
      title="Field 1", 
      required=True, 
      settings=sdk.TextFieldSettings(use_markdown=False)
   )
)

dataset = sdk.Dataset.create(name="my-ds", workspace=ws, config=ds_config)

dataset.config.fields.create(
   sdk.Field(name="field2", title="Field 2", required=False, settings=sdk.TextFieldSettings(use_markdown=True))
)

dataset.config.fields["field1"].update(title="Field 1 Updated")
dataset.config.fields["field2"].delete()

dataset.config.questions.create(
   sdk.Question(
      name="question1",
      title="Question 1",
      required=True,
      # builder to define a rating question (instead of creating the different options)
      settings=sdk.RatingQuestionSettings.build(1, 6)
   )
)

dataset.publish() # Will move the dataset from draft to ready. This flow is not yet covered in the UI but is quite useful 
dataset.config.to_dict() # this will return a serializable representation of the configuration

Closes #...

Type of change

(Please delete options that are not relevant. Remember to title the PR according to the type of change)

  • [ ] New feature (non-breaking change which adds functionality)
  • [X] Refactor (change restructuring the codebase without changing functionality)
  • [X] Improvement (change adding some improvement to an existing functionality)

How Has This Been Tested

(Please describe the tests that you ran to verify your changes. And ideally, reference tests)

  • [ ] Test A
  • [ ] Test B

Checklist

  • [ ] I added relevant documentation
  • [ ] I followed the style guidelines of this project
  • [ ] I did a self-review of my code
  • [ ] I made corresponding changes to the documentation
  • [ ] My changes generate no new warnings
  • [ ] I have added tests that prove my fix is effective or that my feature works
  • [ ] I filled out the contributor form (see text above)
  • [ ] I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)

frascuchon avatar Dec 14 '23 16:12 frascuchon

SDK Comments

Looking great already @frascuchon :)

  • maybe we can keep using rg in communication: from argilla import sdk as rg.

  • align init, get_item, set_item, update, delete, create, from_name, and by_name usage for all classes. Workspace and User work with from_name, but ideally we want similar methods for all classes. For example, for Question and Field we also use by_id and by_name_and_workspace.

ds.config.fields["field_name"] = rg.Field(...) # weird flow because name needs to align with key but can be an easy check.
ds.config.fields.create(rg.Field(...))
ds.config.fields.append(rg.Field(...))
ds.config.fields.by_name('field_name')
ds.config.fields['field_name']
ds.config.fields.by_id(0000)
  • will we allow the operations defined above for *Config and *Settings too?

  • will we also allow aggregate operations?

ds.config.fields = [rg.Field(...), rg.Field(...)] | FieldsCollection
  • I don't like introducing/communicating the build method myself and would rather have init and/or create

  • I would maybe align the usage of *Config and *Settings to use one or the other: RatingQuestionConfig and DatasetConfig, or RatingQuestionSettings and DatasetSettings. I think it is a bit confusing to have both.

  • Dataset.publish should have clear validation, testing things like: do we have the required fields/questions? Is the Dataset in a valid state? etc.

  • I think aligning the Field and Question attributes by adding description to Field would be nice. This would unify the Field and Question classes and make them easier to work with, and allow unifying them further, e.g. via the settings (see comment below). https://github.com/argilla-io/argilla/issues/3565

  • Did you use settings for each individual field and question to allow for better maintainability compared to explicit *Field classes? I like it.

  • The update method does not contain the required kwarg.

  • Within fields and questions, create still returns argilla.sdk._api._fields.Field, but I think it would be better to use in-place operations across the board: fields.create(...) should return fields, and fields.append(...) should return fields too.

  • We should add a new __repr__ to the Dataset class so we can get rid of references to the old structure in the __repr__ of the RemoteFeedbackDataset class.

  • DatasetConfiguration still contains the allow_extra_metadata attribute.
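
One way to picture the in-place collection behaviour suggested above (a sketch with invented names; create/append return the collection itself, and items can be looked up by name, key, or id):

```python
# Hypothetical sketch of a unified fields/questions collection, where
# create/append mutate the collection in place and return it for
# chaining, and items can be looked up via by_name, by_id, or [].
from dataclasses import dataclass, field
from itertools import count

_ids = count(1)  # stand-in id generator; a real SDK would get ids from the backend

@dataclass
class Field:
    name: str
    title: str = ""
    required: bool = False
    id: int = field(default_factory=lambda: next(_ids))

class FieldsCollection:
    def __init__(self):
        self._items: dict[str, Field] = {}

    def create(self, item: Field) -> "FieldsCollection":
        # a real SDK would also call the backend here
        self._items[item.name] = item
        return self  # in-place: the collection itself is returned

    append = create  # alias so both spellings behave the same

    def by_name(self, name: str) -> Field:
        return self._items[name]

    def by_id(self, id: int) -> Field:
        return next(f for f in self._items.values() if f.id == id)

    def __getitem__(self, name: str) -> Field:
        return self.by_name(name)

fields = FieldsCollection()
fields.create(Field(name="field1")).append(Field(name="field2"))  # chains
```

Returning the collection (rather than the created item) makes chained creation natural while keeping dict-style access for lookups.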

davidberenstein1957 avatar Dec 19 '23 15:12 davidberenstein1957

Great job!!

I like it a lot!

I subscribe to most of the comments from @davidberenstein1957 and would highlight these two:

  • Align usage of *Config and *Settings to use one or the other RatingQuestionConfig and DatasetConfig or RatingQuestionSettings and DatasetSettings. I think it is a bit confusing to have both.

I think I prefer Config everywhere; it's easier to write down.

I would even consider using RatingConfig instead of RatingQuestionConfig, and having LabelConfig, MultiLabelConfig, RankingConfig, if you don't think that's too weird. We already have very long class names.

  • I think aligning Field and Question attributes by adding description to Field would be nice.

I think this is a nice idea if it's not much hassle. Beyond aligning the API, I think it could be useful in the UI too.

dvsrepo avatar Dec 19 '23 16:12 dvsrepo

This is out of the scope of the PoC, but it covers dataset creation, publication, and update. Any initial thoughts about retrieving a dataset (the old load, or the new from_argilla)?

cc @frascuchon
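
One shape the retrieval flow could take (entirely hypothetical names, with an in-memory registry standing in for the Argilla backend):

```python
# Hypothetical sketch of retrieval in the new SDK: a classmethod that
# looks a dataset up by name, optionally scoped to a workspace, replacing
# the old rg.load / from_argilla flows. The in-memory _registry is a
# stand-in for the backend; nothing here is the PR's real code.

class NotFoundError(Exception):
    pass

class Dataset:
    _registry: dict = {}  # (workspace, name) -> Dataset, stand-in for the backend

    def __init__(self, name: str, workspace: str = "default"):
        self.name = name
        self.workspace = workspace
        Dataset._registry[(workspace, name)] = self

    @classmethod
    def from_argilla(cls, name: str, workspace: str = "default") -> "Dataset":
        try:
            return cls._registry[(workspace, name)]
        except KeyError:
            raise NotFoundError(f"{workspace}/{name}") from None

Dataset(name="old-dataset", workspace="argilla")
ds = Dataset.from_argilla("old-dataset", workspace="argilla")
```

Raising a NotFoundError for a missing dataset (instead of returning None) keeps the behaviour consistent with the Dataset.get flow sketched later in this thread.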

dvsrepo avatar Dec 19 '23 16:12 dvsrepo

This looks great! I tried some of the functionality, but it returned errors. I don't know if these are already known, but I wanted to share:

  • sdk.init does not accept workspace like the old one did.
  • When I create a dataset again with the same name, it returns an error. A warning might be better for that.
  • guidelines could be optional, or a default version could be used.
  • For DatasetConfiguration, when it is first defined with fields being a single sdk.Field object, like below:
ds_config = sdk.DatasetConfiguration(fields=field_question,
                                     guidelines="Guidelines for the annotation of the dataset",
                                     allow_extra_metadata=False,
                                     )

append does not work because fields is not a list, e.g. ds_config.fields.append(sdk.Field(...)) fails. This also applies to questions.

  • The problem above also prevented using ds_config.to_dict()
  • I expected to update the config directly via dataset.config.update(ds_config), but it did not work.
  • I expected to pass labels to LabelQuestionSettings with options, but it caused errors. Specifically, it created the question without validating it, and then returned an error later while adding it to the dataset.
  • I queried a non-existent field with ds_new.config.fields["non-existent"] and it did not return anything. It could raise an error or a warning instead.
  • For QuestionSettings, one way would be to pass the question type as an argument, like QuestionSettings("label", ...)
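
The type-as-argument idea from the last bullet could look roughly like this (a sketch; all class and parameter names are invented, not the SDK's actual API):

```python
# Hypothetical sketch of QuestionSettings taking the question type as a
# string argument and dispatching to per-type settings classes internally.
from dataclasses import dataclass, field

@dataclass
class LabelSettings:
    labels: list = field(default_factory=list)

@dataclass
class RatingSettings:
    lower: int = 1
    upper: int = 5

_SETTINGS_BY_TYPE = {"label": LabelSettings, "rating": RatingSettings}

def QuestionSettings(question_type: str, **kwargs):
    """Factory: pick the settings class from the question type string."""
    try:
        settings_cls = _SETTINGS_BY_TYPE[question_type]
    except KeyError:
        raise ValueError(f"unknown question type: {question_type!r}") from None
    return settings_cls(**kwargs)

settings = QuestionSettings("label", labels=["positive", "negative"])
```

A factory like this keeps one entry point for users while still validating the type string eagerly, instead of failing later when the question is added to the dataset.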

kursathalat avatar Dec 19 '23 16:12 kursathalat

This is my suggestion. It doesn't include advanced flows like creating records in bulk, but it has an abstraction over the basic/core functionality called DatasetBuilder:

from argilla.sdk import (
  Workspace,
  Dataset,
  Field,
  TextFieldSettings,
  Question,
  DatasetBuilder,
)

# Omitting sdk.init but it should be the same one that Paco defined.

# Creating a workspace
workspace = Workspace(name="Workspace")
# This is requesting backend and replacing `workspace` instance with the content returned by the request.
workspace.create()

# Creating a dataset
dataset = Dataset(name="Dataset", workspace=workspace)
# This is requesting backend and replacing `dataset` instance with the content returned by the request.
dataset.create()
# Ok, so now the dataset should have an `id` right?
print(dataset.id)
# > 6efed84c-790c-4b58-a882-0cb15fd18582

# Ouch, I forgot the guidelines! Let's update the dataset with them
dataset.guidelines = "The dataset guidelines that I forgot"
# This will request the backend and replace the `dataset` instance with the content from the update response
dataset.update()
# Let's see if the instance is really updated...
print(dataset.guidelines)
# > "The dataset guidelines that I forgot"
# Coolio, let's continue...

# Let's add some fields to the dataset
field = Field(
  name="Field",
  title="Field Title",
  required=False,
  settings=TextFieldSettings(use_markdown=False),
  dataset=dataset,
)
# Nice, let's create it!
field.create()
# Oops, it should be required, let's change it
field.required = True
field.update()
# Nice, let's see if it's really updated
print(field.required)
# > True
# Ok, let's see if the field has been really created for the dataset
# The next line will request the backend to list the fields for that dataset
Field.list(dataset=dataset)
# > A list of fields for that specific dataset is returned (with the one we created only)

# Let's add a question now so annotators can add responses
question = Question(
  name="Question",
  title="Question Title",
  description="Question Description",
  required=True,
  dataset=dataset,
)
# Again, this method creates the question by making a request to the backend and replacing the question instance
question.create()
# Ok so now I can try to retrieve a list of questions for the dataset and check if my question is there
Question.list(dataset=dataset)
# > A list of questions for that specific dataset is returned (with the one we created only)

# Ok, I'm almost finished. Let's set the dataset as published
dataset.publish()
# Perfect, so the dataset should now be ready to start working on it, let's see...
print(dataset.status)
# > "ready"
# Nice!

#######

# I want to get an old dataset that I created some time ago
old_dataset = Dataset.get_by_name("Old Dataset")
# Cool, let's see when I created it
print(old_dataset.inserted_at)
# > inserted_at is printed
# Ok, and when it was the last time we worked on it?
print(old_dataset.last_activity_at)
# > last_activity_at is printed
# Let's delete it because it's not useful anymore
old_dataset.delete()
# The previous delete line will request the backend and delete the dataset, but because the dataset is returned in the response, the instance will be replaced too

# Let's check if it's really deleted
Dataset.get(old_dataset.id)
# > A NotFoundError exception is raised

#######

# This last section is pure experimentation; we should check and think about a better abstraction,
# or even discard it completely.

# Ok, all of this is fine but I need to do a lot of interactions with the sdk.
# Maybe we can offer a simpler flow so I can create the dataset in a single step...
dataset_build = DatasetBuilder(
    name="Dataset",
    guidelines="The dataset guidelines that I forgot",
    workspace=workspace,
    fields=[
        Field(
            name="Field",
            title="Field Title",
            required=True,
            settings=TextFieldSettings(use_markdown=False),
        )
    ],
    questions=[
        Question(
            name="Question",
            title="Question Title",
            description="Question Description",
            required=True,
        )
    ]
)

# I should add more fields!
dataset_build.fields.append(
  Field(
    name="Field B",
    title="Field Title B",
    required=False,
    settings=TextFieldSettings(use_markdown=False)
  )
)

# Ok, let's create it!
dataset = dataset_build.create(publish=False)
# Good, let's publish it; since it's a dataset instance, I can call .publish()
dataset.publish()

# Ok, this is good, but I would like to have a JSON template so I can share it with my friends
json_dataset_template = dataset_build.to_json()
# Nice! Now I can share it so my friends can load it like this
dataset_build = DatasetBuilder.from_json(json_dataset_template)
# Let's change the name so we don't have conflicts with the previous one
dataset_build.name = "A Dataset from a JSON template"
# Ok, let's create and publish it
dataset = dataset_build.create()
dataset.publish()

jfcalvo avatar Dec 20 '23 16:12 jfcalvo

Hi @jfcalvo, I see what you mean, but I assumed that the code @frascuchon wrote could also be initialized through a default __init__ as we had before, where we would be able to pass all arguments at once.
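
That one-shot initialization could also normalize a single field into a list, which would address the append problem @kursathalat hit above (a sketch with invented names, not the PR's code):

```python
# Hypothetical sketch: DatasetConfiguration accepting everything at once
# in a default __init__, and normalizing a single Field into a list so
# that config.fields.append(...) always works afterwards.
from dataclasses import dataclass

@dataclass
class Field:
    name: str

class DatasetConfiguration:
    def __init__(self, fields=None, questions=None, guidelines=None,
                 allow_extra_metadata=True):
        # accept a single item, a list/tuple, or nothing
        self.fields = self._as_list(fields)
        self.questions = self._as_list(questions)
        self.guidelines = guidelines
        self.allow_extra_metadata = allow_extra_metadata

    @staticmethod
    def _as_list(value):
        if value is None:
            return []
        return list(value) if isinstance(value, (list, tuple)) else [value]

config = DatasetConfiguration(fields=Field(name="field1"))
config.fields.append(Field(name="field2"))  # works: fields is always a list
```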

davidberenstein1957 avatar Jan 03 '24 11:01 davidberenstein1957

@frascuchon, I would also opt for renaming MetadataProperty to MetadataSettings or something similar, as with Fields and Questions, to align the API usage a bit more in that regard too.

davidberenstein1957 avatar Jan 03 '24 11:01 davidberenstein1957