feat: new SDK PoC
Description
This is a draft PR for the new Core SDK definition. The main goal of this PR is to define the foundations of the new Python SDK module/package with (temporary) full backward compatibility with the current SDK components.
The PR defines a new `argilla.sdk` module where all these changes occur. The main interaction class is `argilla.sdk.Dataset`, which defines the new methods and attributes for working with Argilla feedback datasets in a new way. This class currently extends `RemoteFeedbackDataset`, which is the entry point for the backward compatibility.
Here is a sample code snippet showing how to work with this new definition:
```python
from argilla import sdk

sdk.init(...)  # This line replaces rg.init

ws = sdk.Workspace.from_name("argilla")  # This is a shortcut for the rg.Workspace class

ds_config = sdk.DatasetConfiguration(guidelines="My guidelines")
ds_config.fields.append(
    sdk.Field(
        name="field1",
        title="Field 1",
        required=True,
        settings=sdk.TextFieldSettings(use_markdown=False)
    )
)

dataset = sdk.Dataset.create(name="my-ds", workspace=ws, config=ds_config)

dataset.config.fields.create(
    sdk.Field(name="field2", title="Field 2", required=False, settings=sdk.TextFieldSettings(use_markdown=True))
)
dataset.config.fields["field1"].update(title="Field 1 Updated")
dataset.config.fields["field2"].delete()

dataset.config.questions.create(
    sdk.Question(
        name="question1",
        title="Question 1",
        required=True,
        # builder to define a rating question (instead of creating the different options)
        settings=sdk.RatingQuestionSettings.build(1, 6)
    )
)

dataset.publish()  # Will move the dataset from draft to ready. This flow is not yet covered in the UI but is quite useful.

dataset.config.to_dict()  # This will return a serializable representation of the configuration.
```
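The `RatingQuestionSettings.build(1, 6)` builder used above can be understood as expanding a numeric range into the rating options so the caller does not write them out by hand. Here is a minimal self-contained sketch of that idea; the class name mirrors the snippet, but the internals (an `options` list of `{"value": ...}` dicts, an inclusive range) are assumptions, not the actual SDK implementation:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RatingQuestionSettings:
    """Hypothetical sketch: rating settings holding explicit option dicts."""
    options: List[dict] = field(default_factory=list)

    @classmethod
    def build(cls, lower: int, upper: int) -> "RatingQuestionSettings":
        # Builder: expand an inclusive range into option dicts instead of
        # requiring the caller to enumerate every rating value.
        return cls(options=[{"value": v} for v in range(lower, upper + 1)])


settings = RatingQuestionSettings.build(1, 6)
print([o["value"] for o in settings.options])  # -> [1, 2, 3, 4, 5, 6]
```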
Closes #...
Type of change
(Please delete options that are not relevant. Remember to title the PR according to the type of change)
- [ ] New feature (non-breaking change which adds functionality)
- [X] Refactor (change restructuring the codebase without changing functionality)
- [X] Improvement (change adding some improvement to an existing functionality)
How Has This Been Tested
(Please describe the tests that you ran to verify your changes, and ideally reference the tests.)
- [ ] Test A
- [ ] Test B
Checklist
- [ ] I added relevant documentation
- [ ] I followed the style guidelines of this project
- [ ] I did a self-review of my code
- [ ] I made corresponding changes to the documentation
- [ ] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] I filled out the contributor form (see text above)
- [ ] I have added relevant notes to the `CHANGELOG.md` file (see https://keepachangelog.com/)
SDK Comments
Looking great already @frascuchon :)
- Maybe we can keep using `rg` in communication: `from argilla import sdk as rg`.
- Align `init`, `get_item`, `set_item`, `update`, `delete`, `create`, `from_name`, and `by_name` usage for all classes. `Workspace` and `User` work with `from_name`, but ideally we want to use similar methods for all classes. For example, for `Question` and `Field` we also use `by_id` and `by_name_and_workspace`.
```python
ds.config.fields["field_name"] = rg.Field(...)  # weird flow because the name needs to align with the key, but it can be an easy check
ds.config.fields.create(rg.Field(...))
ds.config.fields.append(rg.Field(...))
ds.config.fields.by_name('field_name')
ds.config.fields['field_name']
ds.config.fields.by_id(0000)
```
- Will we allow doing the operations defined above for `*Config` and `*Settings` too?
- Will we also allow aggregate operations? `ds.config.fields = [rg.Field(...), rg.Field(...)] | FieldsCollection`
- I don't like introducing/communicating the `build` method myself and would rather have `__init__` and/or `create`.
- I would align the usage of `*Config` and `*Settings` to one or the other: `RatingQuestionConfig` and `DatasetConfig`, or `RatingQuestionSettings` and `DatasetSettings`. I think it is a bit confusing to have both.
- `Dataset.publish` should have clear validation, testing things like: do we have the required fields/questions? Is the `Dataset` in a valid state? etc.
- I think aligning `Field` and `Question` attributes by adding `description` to `Field` would be nice. This would unify the `Field` and `Question` classes and make it easier to work with them, and also unify them further, e.g. with the settings (see comment below). https://github.com/argilla-io/argilla/issues/3565
- Did you add `settings` to each individual field and question to allow for better maintainability compared to explicit `*Field` classes? I like it.
- The `update` method does not contain the `required` kwarg.
- Within `fields` and `questions`, `create` still returns `argilla.sdk._api._fields.Field`, but I think it would be better to use in-place operations across the board: `fields.create(...)` should return `fields`, and `fields.append(...)` should return `fields` too.
- We should add a new `__repr__` to the `Dataset` class to make it possible to get rid of references to the old structure in the `__repr__` of the `RemoteFeedbackDataset` class.
- `DatasetConfiguration` still contains the `allow_extra_metadata` attribute.
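To make the accessor-alignment and "in-place operations across the board" points concrete, here is a minimal self-contained sketch of a collection that aligns `create`, `append`, `by_name`, `by_id`, and `__getitem__`, and returns the collection itself from mutating calls. All names are illustrative, not the actual SDK API, and the backend call is replaced by local storage:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Field:
    name: str
    title: str
    id: Optional[int] = None


class FieldsCollection:
    """Illustrative collection with aligned accessors and chainable mutators."""

    def __init__(self) -> None:
        self._items: List[Field] = []

    def create(self, item: Field) -> "FieldsCollection":
        # In the real SDK this would call the backend; here we just store it.
        item.id = len(self._items)
        self._items.append(item)
        return self  # return the collection, not the created item

    append = create  # `append` and `create` behave the same in this sketch

    def by_name(self, name: str) -> Field:
        return self[name]

    def by_id(self, item_id: int) -> Field:
        return next(f for f in self._items if f.id == item_id)

    def __getitem__(self, name: str) -> Field:
        try:
            return next(f for f in self._items if f.name == name)
        except StopIteration:
            raise KeyError(name)  # fail loudly instead of returning nothing


fields = FieldsCollection().create(Field("field1", "Field 1"))
fields.append(Field("field2", "Field 2"))  # chainable, mutates in place
print(fields.by_name("field2").title)  # -> Field 2
```

Returning the collection from `create`/`append` keeps the fluent style, and raising `KeyError` on a missing name addresses the silent-lookup issue reported further down in this thread.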
Great job!! I like it a lot!
I subscribe to most of the comments from @davidberenstein1957 and highlight these two:
- Align the usage of `*Config` and `*Settings` to one or the other: `RatingQuestionConfig` and `DatasetConfig`, or `RatingQuestionSettings` and `DatasetSettings`. I think it is a bit confusing to have both.

I prefer `Config` everywhere; it's easier to write down. I would even consider using `RatingConfig` instead of `RatingQuestionConfig`, and having `LabelConfig`, `MultiLabelConfig`, and `RankingConfig`, if you don't think it's too weird. We already have very long class names.
- I think aligning `Field` and `Question` attributes by adding `description` to `Field` would be nice.

I think this is a nice idea if it's not much hassle. Beyond aligning the API, I think it could be useful in the UI.
This is out of scope for the PoC, but it covers dataset creation, publication, and update. Any initial thoughts about retrieving a dataset (the old `load`, or a new `from_argilla`)?
cc @frascuchon
This looks great! I tried some of the functionality but it returned errors. I don't know if these are already known, but I wanted to share them:
- `sdk.init` does not accept `workspace` like the old one.
- When I try to create a dataset again with the same name, it returns an error. Instead, there could be a warning for that.
- `guidelines` could be optional, or a default version could be used.
- For the `DatasetConfiguration`, when it is first defined with `fields` being an `sdk.Field` object like below:

  ```python
  ds_config = sdk.DatasetConfiguration(
      fields=field_question,
      guidelines="Guidelines for the annotation of the dataset",
      allow_extra_metadata=False,
  )
  ```

  `append` does not work because `fields` is not a list, as in `ds_config.fields.append(sdk.Field(...))`. This also applies to questions.
- The problem above also did not allow using `ds_config.to_dict()`.
- I expected to update the config directly via `dataset.config.update(ds_config)`, but it did not work.
- I expected to give labels to `LabelQuestionSettings` with options, but it caused errors. Specifically, it created the question but did not validate it, and then later returned an error while adding it to the dataset.
- I queried a non-existing field with `ds_new.config.fields["non-existent"]` and it did not return anything. Instead, it could raise an error or a warning.
- For `QuestionSettings`, one way would be using the question type as an argument, like `QuestionSettings("label"...)`.
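Several of the issues above (a single `Field` breaking `append`, and `to_dict` failing as a consequence) come down to input normalization in the configuration constructor. A small self-contained sketch of that fix, with illustrative names only, not the actual SDK code:

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Field:
    name: str
    title: str


class DatasetConfiguration:
    """Sketch: normalize `fields` so a single Field and a list both work."""

    def __init__(self, fields: Union[Field, List[Field], None] = None,
                 guidelines: str = ""):
        if fields is None:
            fields = []
        elif isinstance(fields, Field):
            fields = [fields]  # wrap a single field instead of failing later
        self.fields = list(fields)
        self.guidelines = guidelines

    def to_dict(self) -> dict:
        return {
            "guidelines": self.guidelines,
            "fields": [{"name": f.name, "title": f.title} for f in self.fields],
        }


cfg = DatasetConfiguration(fields=Field("text", "Text"), guidelines="Annotate.")
cfg.fields.append(Field("label", "Label"))  # works: always a list internally
print(len(cfg.to_dict()["fields"]))  # -> 2
```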
This is my suggestion. It doesn't include advanced flows like creating records in bulk, but it has an abstraction over the basic/core functionality called `DatasetBuilder`:
```python
from argilla.sdk import (
    Workspace,
    Dataset,
    Field,
    TextFieldSettings,
    Question,
    DatasetBuilder,
)

# Omitting sdk.init, but it should be the same one that Paco defined.

# Creating a workspace
workspace = Workspace(name="Workspace")
# This requests the backend and replaces the `workspace` instance with the content returned by the request.
workspace.create()

# Creating a dataset
dataset = Dataset(name="Dataset", workspace=workspace)
# This requests the backend and replaces the `dataset` instance with the content returned by the request.
dataset.create()

# Ok, so now the dataset should have an `id`, right?
print(dataset.id)
# > 6efed84c-790c-4b58-a882-0cb15fd18582

# Ouch, I forgot the guidelines! Let's update the dataset with them
dataset.guidelines = "The dataset guidelines that I forgot"
# This will request the backend and replace the `dataset` instance with the content from the update response
dataset.update()

# Let's see if the instance is really updated...
print(dataset.guidelines)
# > "The dataset guidelines that I forgot"
# Coolio, let's continue...

# Let's add some fields to the dataset
field = Field(
    name="Field",
    title="Field Title",
    required=False,
    settings=TextFieldSettings(use_markdown=False),
    dataset=dataset,
)
# Nice, let's create it!
field.create()

# Silly me, it should be required; let's change it
field.required = True
field.update()

# Nice, let's see if it's really updated
print(field.required)
# > True

# Ok, let's see if the field has really been created for the dataset.
# The next line will request the backend, listing the fields for that dataset
Field.list(dataset=dataset)
# > A list of fields for that specific dataset is returned (with only the one we created)

# Let's add a question now so annotators can add responses
question = Question(
    name="Question",
    title="Question Title",
    description="Question Description",
    required=True,
    dataset=dataset,
)
# Again, this method creates the question by making a request to the backend and replacing the question instance
question.create()

# Ok, so now I can retrieve the list of questions for the dataset and check if my question is there
Question.list(dataset=dataset)
# > A list of questions for that specific dataset is returned (with only the one we created)

# Ok, I'm almost finished. Let's set the dataset as published
dataset.publish()

# Perfect, so the dataset should now be ready to start working on it. Let's see...
print(dataset.status)
# > "ready"
# Nice!

#######

# I want to get an old dataset that I created some time ago
old_dataset = Dataset.get_by_name("Old Dataset")

# Cool, let's see when I created it
print(old_dataset.inserted_at)
# > inserted_at is printed

# Ok, and when was the last time we worked on it?
print(old_dataset.last_activity_at)
# > last_activity_at is printed

# Let's delete it because it's not useful anymore
old_dataset.delete()
# The previous line will request the backend and delete the dataset, but because we return the dataset in the response, we replace the instance too

# Let's check if it's really deleted
Dataset.get(old_dataset.id)
# > A NotFoundError exception is raised

#######

# This last section is pure experimentation and we should check and think about a better abstraction,
# or even discard it completely.

# Ok, all of this is fine, but it requires a lot of interactions with the SDK.
# Maybe we can offer a simpler flow so I can create the dataset in a single step...
dataset_build = DatasetBuilder(
    name="Dataset",
    guidelines="The dataset guidelines that I forgot",
    workspace=workspace,
    fields=[
        Field(
            name="Field",
            title="Field Title",
            required=True,
            settings=TextFieldSettings(use_markdown=False),
        )
    ],
    questions=[
        Question(
            name="Question",
            title="Question Title",
            description="Question Description",
            required=True,
        )
    ]
)

# I should add more fields!
dataset_build.fields.append(
    Field(
        name="Field B",
        title="Field Title B",
        required=False,
        settings=TextFieldSettings(use_markdown=False)
    )
)

# Ok, let's create it!
dataset = dataset_build.create(publish=False)

# Good; since it's a dataset instance, I can do .publish()
dataset.publish()

# Ok, this is good, but I would like a JSON template so I can share it with my friends
json_dataset_template = dataset_build.to_json()

# Nice! Now I can share it so my friends can load it like this
dataset_build = DatasetBuilder.from_json(json_dataset_template)

# Let's change the name so we don't have conflicts with the previous one
dataset_build.name = "A Dataset from a JSON template"

# Ok, let's create and publish it
dataset = dataset_build.create()
dataset.publish()
```
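The `to_json`/`from_json` template round-trip at the end of that proposal can be sketched with plain dataclasses, independent of any backend. The names below mirror the proposal, but the serialization details are assumptions for illustration only:

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Field:
    name: str
    title: str
    required: bool = False


@dataclass
class DatasetBuilder:
    """Sketch: a serializable dataset template, as in the proposal above."""
    name: str
    guidelines: str = ""
    fields: List[Field] = field(default_factory=list)

    def to_json(self) -> str:
        # Serialize the whole template so it can be shared as a file.
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, raw: str) -> "DatasetBuilder":
        data = json.loads(raw)
        data["fields"] = [Field(**f) for f in data.get("fields", [])]
        return cls(**data)


template = DatasetBuilder(name="Dataset", guidelines="Guidelines",
                          fields=[Field("Field", "Field Title", True)])
copy = DatasetBuilder.from_json(template.to_json())
copy.name = "A Dataset from a JSON template"  # rename to avoid conflicts
print(copy.fields[0].title)  # -> Field Title
```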
Hi @jfcalvo, I see what you mean, but I assumed that the code @frascuchon wrote could also be initialized through a default `__init__`, as we had before, where we would be able to pass all arguments at once.
@frascuchon, I also opt for renaming `MetadataProperty` to `MetadataSettings` or something similar, as with `Fields` and `Questions`, to align the API usage a bit more in that regard too.