Add API to Turkle

Open cash opened this issue 5 years ago • 21 comments

Currently, we have a few scripts that parse HTML, which is less than ideal. We should replace those with an API similar to mTurk's.

Documentation on their API: https://docs.aws.amazon.com/AWSMechTurk/latest/AWSMturkAPI/ApiReference_OperationsArticle.html

I suggest we prototype a subset of the methods around HIT management (create, list, delete).

I'd like to try to maintain parameter compatibility with mTurk so that our scripts/clients are compatible. This likely means leaving several parameters per method as dummy parameters for things like managing paying turkers.

We would also need to implement the same style of user authentication for scripts to be interoperable.

cash avatar Apr 18 '19 15:04 cash

I notice that mTurk uses XML as the data format, but Turkle uses CSV with JSON-formatted data. Will this create a problem for us when it comes to reusing a prebuilt mTurk client? It would be necessary to either replace the CSV format or to write a data emulation layer, i.e. for 2-way XML <-> CSV+JSON translation. Personally I hate XML and prefer working with CSV+JSON due to its simplicity, but it depends on your project goals: how much do you want to clone mTurk in every aspect?

https://micropyramid.com/blog/how-to-convert-xml-content-into-json-using-xmltodict/
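To make the "data emulation layer" idea concrete, here is a rough sketch of a 2-way translation using only the standard library. The `Task` root tag, the flat one-level record, and the single `data` CSV column holding a JSON object are all assumptions made up for illustration; this is not Turkle's actual CSV layout or mTurk's XML schema.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET


def xml_to_row(xml_text):
    """Flatten one flat XML record into a dict of field name -> text."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "") for child in root}


def row_to_xml(row, root_tag="Task"):
    """Rebuild a flat XML record from a dict (root_tag is hypothetical)."""
    root = ET.Element(root_tag)
    for name, value in row.items():
        ET.SubElement(root, name).text = value
    return ET.tostring(root, encoding="unicode")


def rows_to_csv(rows):
    """Write each row as one JSON-encoded cell under a 'data' header."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["data"])
    for row in rows:
        writer.writerow([json.dumps(row, sort_keys=True)])
    return buf.getvalue()


def csv_to_rows(csv_text):
    """Read the CSV back into a list of dicts."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [json.loads(record["data"]) for record in reader]
```

Nested XML (which mTurk's question formats use) would need more than this flat mapping, so the real cost of an emulation layer is probably higher than the sketch suggests.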

cfortune avatar Apr 18 '19 20:04 cfortune

Usually in Django (using the Django Rest Framework or DRF), your API resources are mapped to your Models through serializers. I believe we would normally map out the existing models like this, but Turkle's application design seems really different than mTurk:

  • class Task
  • class TaskAssignment
  • class Batch
  • class Project

Example:

  • GET /tasks # Returns a list of tasks
  • GET /tasks/<id> # Returns information for a specific task
  • POST /tasks # Create a new task
  • PUT /tasks/<id> # Completely modifies a specific task
  • PATCH /tasks/<id> # Partially updates a specific task
  • DELETE /tasks/<id> # Remove a specific task
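As a rough illustration of what those six endpoints would do, here is an in-memory stand-in written with the standard library only. The method names follow the action names DRF uses on its ViewSets (list, retrieve, create, update, partial_update, destroy), but the `Task` fields and the `TaskResource` class are hypothetical and are not Turkle's models.

```python
import dataclasses
from itertools import count


@dataclasses.dataclass
class Task:
    # Hypothetical fields, chosen only for the example.
    id: int
    batch: str
    input: dict
    completed: bool = False


class TaskResource:
    """In-memory stand-in for the /tasks endpoints."""

    def __init__(self):
        self._tasks = {}
        self._ids = count(1)

    def list(self):                          # GET /tasks
        return [dataclasses.asdict(t) for t in self._tasks.values()]

    def retrieve(self, task_id):             # GET /tasks/<id>
        return dataclasses.asdict(self._tasks[task_id])

    def create(self, data):                  # POST /tasks
        task = Task(id=next(self._ids), **data)
        self._tasks[task.id] = task
        return dataclasses.asdict(task)

    def update(self, task_id, data):         # PUT /tasks/<id>
        if task_id not in self._tasks:
            raise KeyError(task_id)
        # Full replacement: fields missing from data revert to defaults.
        self._tasks[task_id] = Task(id=task_id, **data)
        return dataclasses.asdict(self._tasks[task_id])

    def partial_update(self, task_id, data):  # PATCH /tasks/<id>
        # Partial update: only the supplied fields change.
        self._tasks[task_id] = dataclasses.replace(self._tasks[task_id], **data)
        return dataclasses.asdict(self._tasks[task_id])

    def destroy(self, task_id):              # DELETE /tasks/<id>
        del self._tasks[task_id]
```

The main thing to notice is the PUT versus PATCH difference: a PUT that omits `completed` resets it, while a PATCH leaves untouched fields alone.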

So, we would need to build a bunch of custom serializers using the DRF, as an application layer which would respond to mTurk client requests, in order to access the existing models.

cfortune avatar Apr 18 '19 20:04 cfortune

mTurk uses XML for the question parameter when creating a HIT using their API (https://docs.aws.amazon.com/AWSMechTurk/latest/AWSMturkAPI/ApiReference_CreateHITOperation.html). Have you noticed it anywhere else?

cash avatar Apr 18 '19 20:04 cash

These API actions require sending and/or receiving XML payloads:

  • CreateHIT
  • CreateHITWithHITType
  • CreateQualificationType
  • GetAssignmentsForHIT
  • GetQualificationRequests
  • GetQualificationType
  • UpdateQualificationType

Here is a list of data structures which are in XML. https://docs.aws.amazon.com/AWSMechTurk/latest/AWSMturkAPI/ApiReference_SchemaLocationArticle.html

We would need to decide how much of this to support / not support. For example, there is a pretty elaborate xml templating language which we may not support (in Formatted Content: XHTML doc), because plain HTML/javascript templates work pretty well anyway...

cfortune avatar Apr 18 '19 22:04 cfortune

We're not interested in qualifications. Is the same true for your use cases?

GetAssignmentsForHIT is deprecated so that leaves CreateHIT and CreateHITWithHITType. Right now, we are only using html/javascript templates and are happy with them.

cash avatar Apr 18 '19 23:04 cash

Turkle currently cares about Group membership, which might be functionally equivalent to the mTurk notion of a Qualification. Only Users who (are part of a particular Group|have a particular Qualification) can work on a particular task.

charman avatar Apr 19 '19 11:04 charman

But, for our use cases, Group membership is always assigned by an Admin. We don't have a need for a process where Users complete a Qualification Task that is then reviewed to determine if they would be (assigned a Qualification|added to a Group).

I do think we want the API to support the programmatic creation of Tasks that are restricted to specific Groups, so limited API support for "Qualifications" (to the extent that they provide the same functionality as Groups) may be worthwhile.
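A minimal sketch of how "Qualifications as Groups" could look on the server side, assuming the API accepts an mTurk-style QualificationRequirements list and treats each QualificationTypeId as a Group name. That mapping, both function names, and the rule that an empty group set means "unrestricted" are assumptions for illustration, not something Turkle does today.

```python
def groups_from_qualifications(requirements):
    """Translate mTurk-style requirements into a set of group names.

    Only the 'Exists' comparator is accepted, since group membership
    is a yes/no property in this sketch.
    """
    groups = set()
    for req in requirements:
        if req.get("Comparator") != "Exists":
            raise ValueError("unsupported comparator: %r" % req.get("Comparator"))
        groups.add(req["QualificationTypeId"])
    return groups


def can_work_on(user_groups, task_groups):
    """Return True if the user may work on the task.

    An empty task_groups set is treated as unrestricted (an assumption).
    """
    if not task_groups:
        return True
    return bool(set(user_groups) & set(task_groups))
```

Anything beyond "Exists" (scores, locales, and so on) would be rejected, which keeps the supported subset small and matches the point that Group membership is always assigned by an Admin.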

charman avatar Apr 19 '19 12:04 charman

I've been assuming that one motivation for implementing the mturk API is using boto as the client. Looks like boto has a schema file per service: https://github.com/boto/botocore/blob/develop/botocore/data/mturk/2017-01-17/service-2.json

The schema file is then used by a validator. This is fine as long as we use a subset of their API. But if we add to it by adding a Group parameter to CreateHIT, it won't validate. I haven't actually tried this yet.

cash avatar Apr 19 '19 12:04 cash

Seen in the clear light of day, the more I read the mTurk specs, the more I see an impedance mismatch with Turkle's specs. How would these Turkle script operations map to the mTurk API operations?

  • add_user
  • import_users
  • upload_tasks
  • download_results

I'm thinking that, maybe we should just let Turkle "be who he is", meaning we can write a custom api and client that allows us our feature set with the minimal fuss. This may be the shortest development path, even with the added burden of writing our own customized api client software. Thoughts?

cfortune avatar Apr 23 '19 22:04 cfortune

add_user and import_users are both operations that are not supported on mTurk. They have a different user management approach.

upload_tasks is the same as create HITs. download_results is the same as get assignments.

I'm not quite ready to develop our own API, but am certainly open to that. I'd like to see if I can pass a custom service definition to boto and use it. I have a lot of meetings today, but will try to squeeze that in.

cash avatar Apr 24 '19 13:04 cash

I've been working on an issue that we discovered related to unicode characters. I hope to get back to this soon...

cash avatar Apr 30 '19 17:04 cash

I've been able to get the boto client to work with my mock mturk site. I had to pass an endpoint_url to the client. It also required a fake region and a fake AWS access token and secret. The access token and secret are used for authentication, so we would have to implement the same authentication system in Turkle.

I added a parameter not in their spec and as expected, the boto validator failed. I haven't seen a way to turn that off. I then grabbed the mturk service definition, modified it, set an environment variable, and it worked. So we would be able to add parameters and methods to their API without much trouble.
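For anyone following along, here is a standard-library sketch of the idea: a validator that rejects parameters missing from the operation's input shape, and a helper that patches the definition to add one. The `SERVICE` dict is a heavily trimmed stand-in that only mimics the operations/shapes layout of the service-2.json file, and `validate` and `add_parameter` are hypothetical helpers, not botocore functions. The `Group` parameter is the one discussed earlier in this thread.

```python
import copy

# Trimmed stand-in for a service definition; not the real mturk file.
SERVICE = {
    "operations": {"CreateHIT": {"input": {"shape": "CreateHITRequest"}}},
    "shapes": {
        "CreateHITRequest": {
            "type": "structure",
            "required": ["Title"],
            "members": {"Title": {"shape": "String"}},
        },
        "String": {"type": "string"},
    },
}


def validate(service_def, operation, params):
    """Return a list of error strings for unknown or missing parameters."""
    shape_name = service_def["operations"][operation]["input"]["shape"]
    shape = service_def["shapes"][shape_name]
    unknown = sorted(set(params) - set(shape["members"]))
    missing = sorted(set(shape.get("required", [])) - set(params))
    errors = ["Unknown parameter: %s" % name for name in unknown]
    errors += ["Missing required parameter: %s" % name for name in missing]
    return errors


def add_parameter(service_def, operation, name, shape):
    """Return a patched copy of the definition with one extra input member."""
    patched = copy.deepcopy(service_def)
    shape_name = patched["operations"][operation]["input"]["shape"]
    patched["shapes"][shape_name]["members"][name] = {"shape": shape}
    return patched
```

With the unpatched definition, a call carrying `Group` fails validation; with the patched copy it passes. The comment above does not say which environment variable was set. As far as I know, botocore reads extra model directories from `AWS_DATA_PATH`, so that is probably the one, but treat it as a guess.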

Still not sure this is worth it. I'm checking up on the authentication code next - hoping it is some standard like OAuth.

cash avatar May 01 '19 21:05 cash

Hi Cash and Craig,

I hope you are both well. I've been studying up on Django APIs and clients. I think we could possibly use code generators to do the heavy lifting for both the server and client development. The generated code would have a nice standard design across all models and views, etc. It could eliminate weeks of trial and error, if all goes as promised....

  • Django Rest Framework (DRF) is the de facto API module, with the best support and best documentation. It can do everything you would want a Django REST API to do, including auth. I think it makes sense to use it rather than try to roll our own. https://github.com/encode/django-rest-framework
  • There is a project for generating DRF api code! https://github.com/Brobin/drf-generators
  • Here is a project to generate Swagger API documentation from a DRF project (in just a few additional lines), https://github.com/marcgibbons/django-rest-swagger/
  • Swagger client generator can read the generated docs which then allows us to automatically generate Client code, in python and or multiple other languages, too. https://github.com/swagger-api/swagger-codegen

Thoughts?

cfortune avatar May 02 '19 07:05 cfortune

Thanks @cfortune - I took a look at drf-generators and it seems to want to blow away the views.py file in the turkle app to do its magic. Maybe it is intended for API-only sites? Using the generator may not be possible for an HTML-first site.

I'm starting to read up on DRF - specifically applications that already have HTML views and want to add an api.

cash avatar May 02 '19 21:05 cash

Hi @cash, drf-generators lets you choose which types of serializers to generate, but I think it assumes you will generate your files at the beginning, in an empty project, so, ya, it would blow away existing files. Maybe the way to use it is to let it blow away all the files, then merge those files with the existing Turkle functions. Git merge tools should allow for that. It would be great to make contact with the authors of the drf-generators project to get their input on modifying Turkle.

cfortune avatar May 02 '19 23:05 cfortune

drf-generators created an API for CRUD operations on projects, batches, tasks, and task assignments. I believe we would only want to keep the methods for projects and batches. I'm not expecting the workers to use the API, so having methods for working with tasks and task assignments doesn't make as much sense to me.

Maybe it's possible to create another app called api that imports the models from turkle for projects and batches and creates the API using those.

cash avatar May 03 '19 14:05 cash

Maybe it's possible to create another app called api that imports the models from turkle for projects and batches and creates the API using those.

I think that is probably the right approach.

I assume we could import the user and group models too, from django admin, for use by drf auth, and limit actions via a drf group, or add drf permissions (read only, read/write, no access).

cfortune avatar May 03 '19 17:05 cfortune

brobin, author of drf-generators wrote this:

If the files already exist (urls, views, etc.) they would overwrite existing code. It will warn you before overwriting. You could always run it and then merge back your existing stuff.

cfortune avatar May 07 '19 15:05 cfortune

After looking through the generated code, I'm less interested in this. It was really simple code that doesn't save that much time over doing it yourself with DRF.

I'd like to get a list of design requirements for the API. On our side we want support for:

  • Managing user accounts and groups
  • Creating projects and batches
  • Monitoring progress
  • Downloading results

The above list does not include

  • Assigning a task to a specific user
  • CRUD operations on tasks (right now this is done at the batch level)
  • Completing assignments

@charman Do you have any comments on the above list?

@cfortune What are your highest priority items?

cash avatar May 07 '19 16:05 cash

That's too bad about drf-generators, I thought they would have more introspection of the models and would generate more code. It still may be worthwhile to use them in order to create a nicely scoped scaffold initially.

The highest priority item for us is batch/task management. We can create projects and batches and do user management manually via CRUD, because they won't change much over time. Our AI system will be hitting the API day and night, though. I would be interested in the ability to do REST operations on individual tasks rather than on a batch of tasks as a whole. For example:

  • put one or more tasks to an existing batch
  • get and delete (archive) all completed tasks (in one step).
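Roughly, the two operations above could behave like the following in-memory sketch. The `Batch` class, its method names, and the task dict layout are all hypothetical and only meant to pin down the semantics; in a real Django implementation the collect-and-archive step would presumably need to run inside a database transaction to be a true single step.

```python
class Batch:
    """In-memory stand-in for a batch that accepts new tasks over time."""

    def __init__(self, name):
        self.name = name
        self.active = []    # tasks still visible to workers
        self.archived = []  # completed tasks already handed to the client

    def add_tasks(self, inputs):
        """Append one or more tasks to this existing batch."""
        inputs = list(inputs)
        for fields in inputs:
            self.active.append({"input": fields, "answer": None})
        return len(inputs)

    def complete(self, index, answer):
        """Record a worker's answer for the active task at this index."""
        self.active[index]["answer"] = answer

    def collect_completed(self):
        """Return all completed tasks and archive them in the same call."""
        done = [t for t in self.active if t["answer"] is not None]
        self.active = [t for t in self.active if t["answer"] is None]
        self.archived.extend(done)
        return done
```

A second call to `collect_completed()` returns only tasks finished since the previous call, which is the behavior a client polling day and night would want.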

Can we simply reuse existing batches, or does the program assume that each new collection of tasks will need a new batch?

cfortune avatar May 07 '19 23:05 cfortune

Using the html interface, you are restricted to one time batch creation. It doesn't support adding new tasks to a batch - at least not currently.

The mTurk API is set up to work like you describe. It doesn't have the concept of batches. I'm looking through the code to see what assumptions we made on this.

cash avatar May 08 '19 18:05 cash