Add API to Turkle
Currently, we have a few scripts that parse HTML, which is less than ideal. We should replace those with an API similar to mTurk's.
Documentation on their API: https://docs.aws.amazon.com/AWSMechTurk/latest/AWSMturkAPI/ApiReference_OperationsArticle.html
I suggest we prototype a subset of the methods around HIT management (create, list, delete).
I'd like to try to maintain parameter compatibility with mTurk so that our scripts/clients are compatible. This likely means leaving several parameters per method as dummy parameters for things like managing paying turkers.
We would also need to implement the same style of user authentication for scripts to be interoperable.
I notice that mTurk uses XML as the data format, but Turkle uses CSV with JSON-formatted data. Will this create a problem for us when it comes to reusing a prebuilt mTurk client? It would be necessary either to replace the CSV format or to write a data emulation layer, i.e. for 2-way XML <-> CSV+JSON translation. Personally I hate XML and prefer working with CSV+JSON due to its simplicity, but it depends on your project goals -- how much do you want to clone mTurk in every aspect?
https://micropyramid.com/blog/how-to-convert-xml-content-into-json-using-xmltodict/
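For a sense of scale, here is a minimal sketch of what a 2-way translation layer could look like using only the Python standard library. The `<Tasks>`/`<Task>`/`<Field>` element names and the `type="json"` marker are made up for illustration; they are not mTurk's or Turkle's actual schema, and the sketch assumes well-formed CSV rows.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET


def _looks_like_json(cell):
    """Return True if the cell holds a JSON object or array."""
    if not cell.lstrip().startswith(("{", "[")):
        return False
    try:
        json.loads(cell)
    except ValueError:
        return False
    return True


def csv_to_xml(csv_text):
    """Convert CSV text into a hypothetical <Tasks> XML document."""
    root = ET.Element("Tasks")
    for row in csv.DictReader(io.StringIO(csv_text)):
        task = ET.SubElement(root, "Task")
        for column, cell in row.items():
            cell = cell or ""
            field = ET.SubElement(task, "Field", name=column)
            if _looks_like_json(cell):
                # Keep the JSON as an opaque string, but flag it for consumers.
                field.set("type", "json")
            field.text = cell
    return ET.tostring(root, encoding="unicode")


def xml_to_csv(xml_text):
    """Convert the hypothetical <Tasks> XML document back into CSV text."""
    root = ET.fromstring(xml_text)
    rows = [
        {f.get("name"): f.text or "" for f in task.findall("Field")}
        for task in root.findall("Task")
    ]
    if not rows:
        return ""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

The JSON cells are passed through untouched, so the layer only has to care about the row/column structure. Mapping to the real mTurk XML schemas would be considerably more work than this.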
Usually in Django (using the Django Rest Framework or DRF), your API resources are mapped to your Models through serializers. I believe we would normally map out the existing models like this, but Turkle's application design seems really different than mTurk:
- class Task
- class TaskAssignment
- class Batch
- class Project
Example:
GET /tasks # Returns a list of tasks
GET /tasks/<id> # Returns information for a specific task
POST /tasks # Create a new task
PUT /tasks/<id> # Completely modifies a specific task
PATCH /tasks/<id> # Partially updates a specific task
DELETE /tasks/<id> # Remove a specific task
So, we would need to build a bunch of custom serializers using the DRF, as an application layer which would respond to mTurk client requests, in order to access the existing models.
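To make the route table in the example concrete, here is a framework-free sketch of the verb-to-action mapping. The action names are the ones DRF's `ModelViewSet` uses (`list`, `retrieve`, `create`, `update`, `partial_update`, `destroy`); the resource names and the `resolve` helper are hypothetical, and in a real project DRF's router would do this work for us.

```python
import re

# DRF ModelViewSet action names, keyed by (HTTP method, "is one item addressed?").
ACTIONS = {
    ("GET", False): "list",
    ("POST", False): "create",
    ("GET", True): "retrieve",
    ("PUT", True): "update",
    ("PATCH", True): "partial_update",
    ("DELETE", True): "destroy",
}

# Hypothetical resource names, one per Turkle model we might expose.
ROUTE = re.compile(r"^/(?P<resource>tasks|batches|projects)(?:/(?P<id>\d+))?/?$")


def resolve(method, path):
    """Return (resource, action, item_id), or None if the route is unsupported."""
    match = ROUTE.match(path)
    if match is None:
        return None
    has_id = match["id"] is not None
    action = ACTIONS.get((method.upper(), has_id))
    if action is None:
        return None
    item_id = int(match["id"]) if has_id else None
    return match["resource"], action, item_id
```

For example, `resolve("PATCH", "/tasks/7")` gives `("tasks", "partial_update", 7)`, while `resolve("DELETE", "/tasks")` gives `None` because deleting the whole collection is not in the table.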
mTurk uses xml for the question parameter when creating a hit using their API (https://docs.aws.amazon.com/AWSMechTurk/latest/AWSMturkAPI/ApiReference_CreateHITOperation.html). Have you noticed it anywhere else?
These API actions require sending and/or receiving XML payloads:
- CreateHIT
- CreateHITWithHITType
- CreateQualificationType
- GetAssignmentsForHIT
- GetQualificationRequests
- GetQualificationType
- UpdateQualificationType
Here is a list of data structures which are in XML. https://docs.aws.amazon.com/AWSMechTurk/latest/AWSMturkAPI/ApiReference_SchemaLocationArticle.html
We would need to decide how much of this to support / not support. For example, there is a pretty elaborate xml templating language which we may not support (in Formatted Content: XHTML doc), because plain HTML/javascript templates work pretty well anyway...
We're not interested in qualifications. Is the same true for your use cases?
GetAssignmentsForHIT is deprecated so that leaves CreateHIT and CreateHITWithHITType. Right now, we are only using html/javascript templates and are happy with them.
Turkle currently cares about Group membership, which might be functionally equivalent to the mTurk notion of a Qualification. Only Users who (are part of a particular Group|have a particular Qualification) can work on a particular task.
But, for our use cases, Group membership is always assigned by an Admin. We don't have a need for a process where Users complete a Qualification Task that is then reviewed to determine if they would be (assigned a Qualification|added to a Group).
I do think we want the API to support the programmatic creation of Tasks that are restricted to specific Groups, so limited API support for "Qualifications" (to the extent that they provide the same functionality as Groups) may be worthwhile.
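One way this limited support could look, as a rough sketch: accept mTurk-style `QualificationRequirements` entries (the `QualificationTypeId` and `Comparator` keys are from the mTurk data structure) but only honor the `Exists` comparator, and translate each one into a Turkle Group name. The lookup table, the IDs in it, and the "user must be in every required group" rule are assumptions for illustration, not current Turkle behavior.

```python
# Hypothetical lookup table: mTurk-style QualificationTypeId -> Turkle Group name.
GROUP_FOR_QUALIFICATION = {
    "QUAL-ANNOTATORS": "annotators",
    "QUAL-REVIEWERS": "reviewers",
}


def groups_from_requirements(requirements):
    """Translate QualificationRequirements entries into Turkle Group names."""
    groups = []
    for requirement in requirements:
        comparator = requirement.get("Comparator")
        if comparator != "Exists":
            # Score-based comparators have no Group equivalent, so reject them.
            raise ValueError(f"Unsupported comparator: {comparator!r}")
        type_id = requirement["QualificationTypeId"]
        if type_id not in GROUP_FOR_QUALIFICATION:
            raise ValueError(f"Unknown qualification: {type_id!r}")
        groups.append(GROUP_FOR_QUALIFICATION[type_id])
    return groups


def user_may_work(user_groups, required_groups):
    """Assumed rule: the user must belong to every required group."""
    return set(required_groups) <= set(user_groups)
```

Anything involving qualification tests or review workflows would simply be rejected, which matches the fact that Group membership is always assigned by an Admin.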
I've been assuming that one motivation for implementing the mturk API is using boto as the client. Looks like boto has a schema file per service: https://github.com/boto/botocore/blob/develop/botocore/data/mturk/2017-01-17/service-2.json
The schema file is then used by a validator. This is fine as long as we use a subset of their API. But if we add to it by adding a Group parameter to CreateHIT, it won't validate. I haven't actually tried this yet.
Seen in the clear light of day, the more I read the mTurk specs, the more I see an impedance mismatch with Turkle's specs. How would these Turkle script operations map to the mTurk API operations?
- add_user
- import_users
- upload_tasks
- download_results
I'm thinking that, maybe we should just let Turkle "be who he is", meaning we can write a custom api and client that allows us our feature set with the minimal fuss. This may be the shortest development path, even with the added burden of writing our own customized api client software. Thoughts?
add_user and import_users are both operations that are not supported on mTurk. They have a different user management approach.
upload_tasks is the same as create HITs. download_results is the same as get assignments.
I'm not quite ready to develop our own API, but am certainly open to that. I'd like to see if I can pass a custom service definition to boto and use it. I have a lot of meetings today, but will try to squeeze that in.
I've been working on an issue that we discovered related to unicode characters. I hope to get back to this soon...
I've been able to get the boto client to work with my mock mturk site. I had to pass an endpoint_url to the client. It also required a fake region and a fake AWS access key and secret. The access key and secret are used for authentication, so we would have to implement the same authentication system in Turkle.
I added a parameter not in their spec and as expected, the boto validator failed. I haven't seen a way to turn that off. I then grabbed the mturk service definition, modified it, set an environment variable, and it worked. So we would be able to add parameters and methods to their API without much trouble.
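Roughly, the steps described above could be scripted like this. The environment variable is not named above; this sketch assumes it is botocore's `AWS_DATA_PATH`, and that the directory layout under it is `<dir>/mturk/2017-01-17/service-2.json`, matching the path in the botocore repository. The `Group` parameter, the helper names, and the fake credentials are all illustrative.

```python
import copy
import json
from pathlib import Path


def add_parameter(model, operation, name, shape="String"):
    """Return a copy of a botocore service model with one extra input member."""
    extended = copy.deepcopy(model)
    input_shape = extended["operations"][operation]["input"]["shape"]
    extended["shapes"].setdefault(shape, {"type": "string"})
    extended["shapes"][input_shape]["members"][name] = {"shape": shape}
    return extended


def write_model(model, data_dir, service="mturk", api_version="2017-01-17"):
    """Write the model where botocore would look for it under data_dir."""
    target = Path(data_dir) / service / api_version / "service-2.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(model))
    return target


def make_client(endpoint_url):
    """Build a boto3 mturk client pointed at a mock site (needs boto3)."""
    import boto3  # third-party; imported lazily so the helpers above work alone

    return boto3.client(
        "mturk",
        endpoint_url=endpoint_url,
        region_name="us-east-1",          # fake region
        aws_access_key_id="fake-key",     # fake credentials
        aws_secret_access_key="fake-secret",
    )
```

Usage would be something like: load the stock `service-2.json`, call `add_parameter(model, "CreateHIT", "Group")`, write it out with `write_model(extended, "/some/dir")`, set `AWS_DATA_PATH=/some/dir`, then create the client. I have not checked how this interacts with the other model files botocore ships for the service.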
Still not sure this is worth it. I'm checking up on the authentication code next - hoping it is some standard like OAuth.
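A note on what the server side might involve, assuming boto signs mturk requests with AWS Signature Version 4 (the `signatureVersion` entry in the service definition's metadata should confirm or refute this). In that scheme the secret is never sent; the server looks up the secret for the access key in the request, recomputes the signature, and compares. The sketch below covers only the signing-key derivation and the comparison. Building the canonical request and string-to-sign is omitted, and the function names are my own.

```python
import hashlib
import hmac


def _hmac_sha256(key, message):
    """HMAC-SHA256 of a text message under a bytes key."""
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).digest()


def signing_key(secret_key, date_stamp, region, service):
    """Derive the SigV4 signing key (date_stamp is YYYYMMDD)."""
    k_date = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    k_region = _hmac_sha256(k_date, region)
    k_service = _hmac_sha256(k_region, service)
    return _hmac_sha256(k_service, "aws4_request")


def sign(secret_key, date_stamp, region, service, string_to_sign):
    """Hex signature of an already-built string-to-sign."""
    key = signing_key(secret_key, date_stamp, region, service)
    return hmac.new(key, string_to_sign.encode("utf-8"), hashlib.sha256).hexdigest()


def signature_matches(claimed, secret_key, date_stamp, region, service, string_to_sign):
    """Server-side check, using a constant-time comparison."""
    expected = sign(secret_key, date_stamp, region, service, string_to_sign)
    return hmac.compare_digest(claimed, expected)
```

So Turkle would need to store a per-user secret it can read back, which is a different model from password hashes or OAuth tokens.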
Hi Cash and Craig,
I hope you are both well. I've been studying up on Django APIs and clients. I think we could possibly use code generators to do the heavy lifting for both the server and client development. The generated code would have a nice, standard design across all models and views, etc. It could eliminate weeks of trial and error, if all goes as promised....
- Django Rest Framework (DRF) is the de facto API module, with the best support and best documentation. It can do everything you would want a Django REST API to do, including auth. I think it makes sense to use it rather than try to roll our own. https://github.com/encode/django-rest-framework
- There is a project for generating DRF api code! https://github.com/Brobin/drf-generators
- Here is a project to generate Swagger API documentation from a DRF project (in just a few additional lines), https://github.com/marcgibbons/django-rest-swagger/
- The Swagger client generator can read the generated docs, which then allows us to automatically generate client code in Python and/or multiple other languages, too. https://github.com/swagger-api/swagger-codegen
Thoughts?
Thanks @cfortune - I took a look at drf-generators and it seems to want to blow away the views.py file in the turkle app to do its magic. Maybe it is intended for API-only sites? Using the generator may not be possible for an HTML-first site.
I'm starting to read up on DRF - specifically applications that already have HTML views and want to add an api.
Hi @cash , drf-generator lets you choose which types of serializers to generate, but I think it assumes you will generate your files at the beginning on an empty project, so, ya, it would blow away existing files. Maybe the way to use it is to let it blow away all the files, then we merge those files with the existing Turkle functions. Git merge tools should allow for that. It would be great to make contact with the authors of drf-generators project to get their input on modifying Turkle.
drf-generators created an API for CRUD operations on projects, batches, tasks, and task assignments. I believe we would only want to keep the methods for projects and batches. I'm not expecting the workers to use the API, so having methods for working with tasks and task assignments doesn't make as much sense to me.
Maybe it's possible to create another app called api that imports the models from turkle for projects and batches and creates the API using those.
I think that is probably the right approach.
I assume we could import the user and group models too, from django admin, for use by drf auth, and limit actions via a drf group, or add drf permissions (read only, read/write, no access).
brobin, author of drf-generators wrote this:
If the files already exist (urls, views, etc.) they would overwrite existing code. It will warn you before overwriting. You could always run it and then merge back your existing stuff.
After looking through the generated code, I'm less interested in this. It was really simple code that doesn't save that much time over doing it yourself with DRF.
I'd like to get a list of design requirements for the API. On our side we want support for:
- Managing user accounts and groups
- Creating projects and batches
- Monitoring progress
- Downloading results
The above list does not include
- Assigning a task to a specific user
- CRUD operations on tasks (right now this is done at the batch level)
- Completing assignments
@charman Do you have any comments on the above list?
@cfortune What are your highest priority items?
That's too bad about drf-generators, I thought they would have more introspection of the models and would generate more code. It still may be worthwhile to use them in order to create a nicely scoped scaffold initially.
The highest priority item for us is batch/task management. We can create projects and batches and do user management manually via CRUD, because they won't change much over time. Our AI system will be hitting the API day and night, though. I would be interested in the ability to do REST operations on individual tasks rather than on a batch of tasks as a whole. For example:
- put one or more tasks to an existing batch
- get and delete (archive) all completed tasks (in one step).
Can we simply reuse existing batches, or does the program assume that each new collection of tasks will need a new batch?
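To illustrate the two operations I have in mind, here is an in-memory sketch. The `Task` and `Batch` classes here are stand-ins with made-up fields, not Turkle's actual models, and a real implementation would presumably do the fetch-and-archive inside a single database transaction so nothing is returned twice.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    id: int
    input: dict
    completed: bool = False
    archived: bool = False


@dataclass
class Batch:
    name: str
    tasks: list = field(default_factory=list)

    def add_tasks(self, inputs):
        """Append one or more new tasks to this existing batch."""
        next_id = max((task.id for task in self.tasks), default=0) + 1
        new_tasks = [Task(next_id + offset, data) for offset, data in enumerate(inputs)]
        self.tasks.extend(new_tasks)
        return new_tasks

    def pop_completed(self):
        """Return all completed, unarchived tasks and archive them in one step."""
        done = [task for task in self.tasks if task.completed and not task.archived]
        for task in done:
            task.archived = True
        return done
```

With this shape, our system could keep one long-lived batch, keep pushing tasks into it, and poll `pop_completed()` without ever seeing the same result twice.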
Using the html interface, you are restricted to one time batch creation. It doesn't support adding new tasks to a batch - at least not currently.
The mTurk API is set up to work like you describe. It doesn't have the concept of batches. I'm looking through the code to see what assumptions we made on this.