feat: python-rq integration using datasets reindex as proof of concept

Open jfcalvo opened this issue 1 year ago • 2 comments

Description

This PR includes changes as a proof of concept to check how to integrate the rq background job processor with Argilla.

The changes also include two new endpoints:

  • PUT /api/v1/datasets/:dataset_id/reindex
    • This endpoint will return an HTTP 202 (Accepted) status.
    • A background job will be enqueued to reindex the dataset.
    • The response body will include the id of the job and its status (queued in this case, if everything went fine).
    • Users can use the id of the job to check its status.
  • GET /api/v1/jobs/:job_id
    • This endpoint is used to obtain information about one specific job (returning the id and status).
    • Jobs are not currently stored in the database; I'm using the rq API to get information about their status.
    • rq keeps job information in Redis for 500 seconds, so after a job finishes or fails the user has 500 seconds to get information about it.
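
The flow of the two endpoints above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `reindex_dataset_job`, `enqueue_reindex` and `get_job` are hypothetical names, the 404 response for an unknown job is an assumption, and `queue` is assumed to behave like an `rq.Queue` (its `enqueue` and `fetch_job` methods, plus `job.id` and `job.get_status()`).

```python
from http import HTTPStatus
from typing import Any, Dict, Tuple

Response = Tuple[int, Dict[str, str]]


def reindex_dataset_job(dataset_id: str) -> None:
    """Hypothetical job body; in the PR the work is delegated to Reindexer."""


def _job_payload(job: Any) -> Dict[str, str]:
    # Accept the status either as an enum member or as a plain string.
    status = job.get_status()
    return {"id": job.id, "status": getattr(status, "value", status)}


def enqueue_reindex(queue: Any, dataset_id: str) -> Response:
    """Sketch of PUT /api/v1/datasets/:dataset_id/reindex."""
    # Enqueue the job and answer immediately with 202 (Accepted).
    job = queue.enqueue(reindex_dataset_job, dataset_id)
    return int(HTTPStatus.ACCEPTED), _job_payload(job)


def get_job(queue: Any, job_id: str) -> Response:
    """Sketch of GET /api/v1/jobs/:job_id."""
    # fetch_job is expected to return None once the job is unknown or expired.
    job = queue.fetch_job(job_id)
    if job is None:
        # Assumed behaviour: report a 404 for unknown or expired jobs.
        return int(HTTPStatus.NOT_FOUND), {"detail": f"Job `{job_id}` not found"}
    return int(HTTPStatus.OK), _job_payload(job)
```

With rq itself, `queue` would be something like `Queue(connection=Redis.from_url(...))`. Passing the queue in as an argument keeps the sketch testable without a running Redis instance.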

Possible improvements:

  • Define a proper Redis connection using a pool of connections and getting settings from environment variables. Redis uses a pool of connections by default, and I have added a new environment variable to set the connection (ARGILLA_REDIS_URL).
  • Define a better way to store our jobs, maybe using a new jobs table in the Argilla database and allowing job results to be saved there. We will start with the current approach of using rq results stored in Redis; in the future, for more complex flows, we will consider adding some data to our database if necessary.
  • Define an rq queue only for search engine purposes. We will use the default queue for now.
  • Once the Reindexer class code is merged from the PR adding the reindex CLI task, we can remove it from the code in this PR. We already merged the PR adding the reindex CLI task, and the jobs now import and use it.
  • Add a result field to the Job schema so we can include the result of the job in it (useful to know if there are errors or additional information about the process).
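
As an illustration of the first item, the connection could be resolved from ARGILLA_REDIS_URL roughly like this. It is a sketch under assumptions: the helper names and the fallback URL are hypothetical and not taken from this PR; only the environment variable name comes from the description above.

```python
import os
from typing import Any, Mapping, Optional

# Assumed fallback for local development; not a value defined by this PR.
DEFAULT_REDIS_URL = "redis://localhost:6379/0"


def resolve_redis_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the Redis URL from ARGILLA_REDIS_URL, or the assumed default."""
    env = os.environ if env is None else env
    return env.get("ARGILLA_REDIS_URL") or DEFAULT_REDIS_URL


def build_default_queue(env: Optional[Mapping[str, str]] = None) -> Any:
    """Build an rq queue on the default queue name.

    Imports are deferred so this module loads even without redis/rq installed.
    """
    from redis import Redis
    from rq import Queue

    # Redis.from_url creates a client backed by a connection pool.
    return Queue(connection=Redis.from_url(resolve_redis_url(env)))
```

Passing `env` explicitly makes it possible to check the resolution logic without touching the process environment.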

Things to investigate/discuss:

  • How to use Redis in our Docker images, especially with the QuickStart images on HF.
  • Alternatives to running Redis, such as using the fakeredis Python library instead.

Type of change

(Please delete options that are not relevant. Remember to title the PR according to the type of change)

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [x] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [ ] Refactor (change restructuring the codebase without changing functionality)
  • [ ] Improvement (change adding some improvement to an existing functionality)
  • [ ] Documentation update

How Has This Been Tested

(Please describe the tests that you ran to verify your changes. And ideally, reference tests)

  • [ ] Test A
  • [ ] Test B

Checklist

  • [ ] I added relevant documentation
  • [ ] My code follows the style guidelines of this project
  • [ ] I did a self-review of my code
  • [ ] I made corresponding changes to the documentation
  • [ ] My changes generate no new warnings
  • [ ] I have added tests that prove my fix is effective or that my feature works
  • [ ] I filled out the contributor form (see text above)
  • [ ] I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)

jfcalvo avatar Dec 18 '23 17:12 jfcalvo

The URL of the deployed environment for this PR is https://argilla-quickstart-pr-4427-ki24f765kq-no.a.run.app

github-actions[bot] avatar Dec 19 '23 12:12 github-actions[bot]

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Comparison is base (6630d7b) 90.13% compared to head (de3721e) 91.21%. Report is 578 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #4427      +/-   ##
===========================================
+ Coverage    90.13%   91.21%   +1.07%     
===========================================
  Files          233      351     +118     
  Lines        12493    19912    +7419     
===========================================
+ Hits         11261    18163    +6902     
- Misses        1232     1749     +517     
Flag Coverage Δ
pytest ?

Flags with carried forward coverage won't be shown. Click here to find out more.


codecov[bot] avatar Dec 19 '23 12:12 codecov[bot]