Add a text area on the batch import page to allow raw JSONL
Problem
Currently we have an endpoint, https://openlibrary.org/admin/imports/add, which takes a list of archive.org identifiers (ocaids). We want patrons to use the
https://openlibrary.org/import/batch/new endpoint instead, which we should rename to use imports (rather than import) for consistency with /admin/imports/add and /imports.
A clear and concise description of what you want to happen
One should be able to import a new item by entering raw JSON into a text area in the batch import endpoint at /import/batch/new (https://openlibrary.org/import/batch/new).
Once the JSONL is submitted, the same validation that happens with an uploaded JSONL file should be run.
Additional Context
See #8122, which added the existing endpoint. This issue is to extend that by, e.g., adding a <textarea> where the JSONL can be entered instead of attaching it as a file.
It was probably a mistake to have batch_import take bytes here, as this tightly couples the implementation to a file upload: https://github.com/internetarchive/openlibrary/blob/e7f11e7c41b1a9317814c0e96cc1c9bf905c8b67/openlibrary/core/batch_imports.py#L73
Instead, this should likely take a list or perhaps a generator. In any event, by changing the function signature here it should be possible to have the form used for submitting raw JSONL input plug directly into this function, unless the form data comes in as bytes, which I think it will not by default. The batch_imports endpoint will need to be updated as well to account for this change away from bytes: https://github.com/internetarchive/openlibrary/blob/e7f11e7c41b1a9317814c0e96cc1c9bf905c8b67/openlibrary/plugins/openlibrary/code.py#L502-L504.
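A minimal sketch of what moving away from bytes could look like, assuming the parsing step only needs an iterable of text lines. The names iter_jsonl_records, lines_from_upload, and lines_from_textarea are hypothetical and do not exist in the codebase; the real batch_import also runs import-record validation, which is left out here.

```python
from __future__ import annotations

import json
from collections.abc import Iterable, Iterator


def iter_jsonl_records(
    lines: Iterable[str],
) -> Iterator[tuple[int, dict | None, str | None]]:
    """Yield (line_number, record, error) for each non-blank JSONL line.

    Hypothetical helper: it only checks that each line is a JSON object.
    """
    for line_number, line in enumerate(lines, start=1):
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines, but keep the original numbering
        try:
            record = json.loads(stripped)
        except json.JSONDecodeError as e:
            yield line_number, None, f"invalid JSON: {e.msg}"
            continue
        if not isinstance(record, dict):
            yield line_number, None, "expected a JSON object"
            continue
        yield line_number, record, None


def lines_from_upload(raw: bytes) -> list[str]:
    """Adapter for the existing file-upload path (bytes in, lines out)."""
    return raw.decode("utf-8").splitlines()


def lines_from_textarea(text: str) -> list[str]:
    """Adapter for the proposed <textarea> path (str in, lines out)."""
    return text.splitlines()
```

With this shape, both the file upload and the text area reduce to the same iterable of lines, so neither caller needs to know how the other one received its data.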
Proposal & Constraints
No response
Leads
Related files
Stakeholders
@mekarpeles
Instructions for Contributors
- Please run these commands to ensure your repository is up to date before creating a new branch to work on this issue and each time after pushing code to Github, because the pre-commit bot may add commits to your PRs upstream.
I would like to work on this issue and try to fix it. I have experience working with various python and web development libraries including some JSON manipulations and traversal. Assign me this issue
@Devansh-Kushwaha did you have any questions about how to proceed? Since it's been two weeks, if you're not working on this issue we'd like to give someone else a chance :)
Yes, I apologize for the delay. I am having problems in setting up the project. I am kinda new to docker.
@Devansh-Kushwaha, if you share any questions you have perhaps we can help.
Can I please be assigned this issue?
Let's also have a way to validate the input (e.g. pass in ?validate=true) and if flag exists, raise and don't import after validating.
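One possible reading of the validate flag, sketched under assumptions: when the flag is set, the input is checked and nothing is queued; invalid input raises in either mode. submit_jsonl, BatchValidationError, and the enqueue callback are hypothetical names, and JSON parsing stands in for the real record validation.

```python
import json


class BatchValidationError(Exception):
    """Hypothetical error carrying (line_number, message) pairs."""

    def __init__(self, errors):
        super().__init__(f"{len(errors)} invalid line(s)")
        self.errors = errors


def submit_jsonl(lines, enqueue, validate_only=False):
    """Validate every line first; queue records only if all lines pass.

    Returns the number of records queued (0 when validate_only is set).
    """
    records, errors = [], []
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as e:
            errors.append((line_number, e.msg))

    if errors:
        # Nothing has been queued yet, so a bad batch has no side effects.
        raise BatchValidationError(errors)
    if validate_only:
        return 0  # e.g. the request carried ?validate=true
    for record in records:
        enqueue(record)
    return len(records)
```

Validating the whole batch before queueing anything keeps a partially valid submission from being half imported.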
Give it a try @slimkevo! Can you reply with your approach and any blockers you're hitting as you get set up?
@mekarpeles I am experiencing issues with uploading raw JSON data through the import functionality. When I attempt to upload JSON data, the import fails with validation errors or SQL exceptions. Specifically, I receive errors indicating issues with the JSON format or missing fields in the database, as well as SQL errors like 'column "submitter" of relation "import_item" does not exist', which indicate discrepancies between the JSON data and the database schema.
It sounds as if we may need to add that column to the local development environment SQL schema.
In the interim, something like this should at least resolve the error about the submitter column, @slimkevo:
❯ docker compose exec db bash
WARN[0000] The "HOST" variable is not set. Defaulting to a blank string.
root@29b2d94b9e8d:/# psql -U openlibrary
psql (9.3.25)
Type "help" for help.
openlibrary=# ALTER TABLE public.import_item ADD COLUMN submitter text;
ALTER TABLE
openlibrary=# \d import_item
                                       Table "public.import_item"
   Column    |            Type             |                        Modifiers
-------------+-----------------------------+----------------------------------------------------------
 id          | integer                     | not null default nextval('import_item_id_seq'::regclass)
 batch_id    | integer                     |
 added_time  | timestamp without time zone | default timezone('utc'::text, now())
 import_time | timestamp without time zone |
 status      | text                        | default 'pending'::text
 error       | text                        |
 ia_id       | text                        |
 data        | text                        |
 ol_key      | text                        |
 comments    | text                        |
 submitter   | text                        |
... more output omitted ...
Please let us know how that goes, and whether more must be done to use the endpoint.
This file contains a pair of uncommented records that should not have validation errors: two_item_import.txt