Add a text area on the batch import page to allow raw JSONL

scottbarnes opened this issue 1 year ago

Problem

Currently we have an endpoint, https://openlibrary.org/admin/imports/add, which takes a list of archive.org identifiers (ocaids). We want patrons to use the https://openlibrary.org/import/batch/new endpoint instead, which we should rename as imports to be consistent with /admin/imports/add and /imports.

A clear and concise description of what you want to happen

One should be able to import a new item by entering raw JSONL into a text area in the batch import endpoint at /import/batch/new (https://openlibrary.org/import/batch/new).

Once the JSONL is submitted, the same validation that happens with an uploaded JSONL file should be run.

Additional Context

See #8122, which added the existing endpoint. This issue is to extend that by, e.g., adding a <textarea> where the JSONL can be entered instead of attaching it as a file.

It was probably a mistake to have batch_import take bytes here, as this tightly couples the implementation to a file upload: https://github.com/internetarchive/openlibrary/blob/e7f11e7c41b1a9317814c0e96cc1c9bf905c8b67/openlibrary/core/batch_imports.py#L73

Instead, this should likely take a list or perhaps a generator. In any event, changing the function signature here should make it possible for the form used to submit raw JSONL input to plug directly into this function, unless the form data comes in as bytes, which I think it will not by default. The batch_imports endpoint will need to be updated as well to account for this change away from bytes: https://github.com/internetarchive/openlibrary/blob/e7f11e7c41b1a9317814c0e96cc1c9bf905c8b67/openlibrary/plugins/openlibrary/code.py#L502-L504.
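As a rough illustration only (batch_import here is a stand-in for the real function in batch_imports.py, and lines_from_request is a hypothetical helper, not existing openlibrary code), a decoupled signature might look something like this:

from __future__ import annotations

import json
from typing import Iterable


def batch_import(lines: Iterable[str]) -> list[dict]:
    """Parse an iterable of JSONL lines into import records.

    Accepting lines of text (rather than bytes) lets a file upload and a
    <textarea> submission share the same validation path.
    """
    records, errors = [], []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines rather than flagging them as errors
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as e:
            errors.append(f"line {line_number}: {e}")
    if errors:
        raise ValueError("; ".join(errors))
    return records


def lines_from_request(uploaded_bytes: bytes | None, textarea_text: str | None) -> list[str]:
    """Normalize either input source into the list of lines batch_import expects."""
    if uploaded_bytes is not None:
        return uploaded_bytes.decode("utf-8").splitlines()
    return (textarea_text or "").splitlines()

The real function would presumably still apply the existing record-level validation; the sketch only shows the change in how the raw input is accepted.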

Proposal & Constraints

No response

Leads

Related files

Stakeholders


@mekarpeles

Instructions for Contributors

  • Please run these commands to ensure your repository is up to date before creating a new branch to work on this issue, and again each time after pushing code to GitHub, because the pre-commit bot may add commits to your PRs upstream.

scottbarnes avatar Jun 25 '24 17:06 scottbarnes

I would like to work on this issue and try to fix it. I have experience working with various Python and web development libraries, including some JSON manipulation and traversal. Please assign me this issue.

Devansh-Kushwaha avatar Jun 29 '24 12:06 Devansh-Kushwaha

@Devansh-Kushwaha did you have any questions about how to proceed? Since it's been two weeks, if you're not working on this issue we'd like to give someone else a chance :)

mekarpeles avatar Jul 12 '24 01:07 mekarpeles

Yes, I apologize for the delay. I am having problems setting up the project; I am kinda new to Docker.

Devansh-Kushwaha avatar Jul 12 '24 03:07 Devansh-Kushwaha

@Devansh-Kushwaha, if you share any questions you have perhaps we can help.

scottbarnes avatar Jul 12 '24 04:07 scottbarnes

Can I please be assigned this issue?

slimkevo avatar Aug 02 '24 16:08 slimkevo

Let's also have a way to validate the input (e.g. pass in ?validate=true): if the flag is present, validate only, raising on errors and not importing anything.
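A minimal sketch of what honoring such a flag could look like, assuming the raw JSONL has already been read from the form; handle_batch_import and parse_jsonl below are hypothetical names, not existing openlibrary code:

import json


def parse_jsonl(raw_jsonl: str) -> list[dict]:
    # Stand-in for the same parsing/validation used by the file-upload path:
    # parse each non-blank line and raise on malformed JSON.
    return [json.loads(line) for line in raw_jsonl.splitlines() if line.strip()]


def handle_batch_import(raw_jsonl: str, validate_only: bool) -> dict:
    records = parse_jsonl(raw_jsonl)  # raises before anything is queued
    if validate_only:
        # ?validate=true: report the result without importing anything.
        return {"status": "validated", "records": len(records)}
    # ...otherwise queue `records` for import here (omitted)...
    return {"status": "queued", "records": len(records)}

In a web.py handler the flag could presumably be read with something like web.input(validate="false").validate == "true", but the exact plumbing is left to the implementer.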

mekarpeles avatar Aug 02 '24 18:08 mekarpeles

Give it a try @slimkevo! Can you reply with your approach and any blockers you're hitting as you get set up?

mekarpeles avatar Aug 02 '24 18:08 mekarpeles

@mekarpeles I am experiencing issues with uploading raw JSON data through the import functionality. When I attempt to upload JSON data, the import fails with validation errors or SQL exceptions: I receive errors indicating problems with the JSON format or missing fields in the database. For example, SQL errors like 'column "submitter" of relation "import_item" does not exist' point to discrepancies between the JSON data and the database schema.

slimkevo avatar Aug 29 '24 18:08 slimkevo

It sounds as if we may need to add that column to the local development environment SQL schema.

In the interim, something like this should at least resolve the error about the submitter column, @slimkevo:

❯ docker compose exec db bash
WARN[0000] The "HOST" variable is not set. Defaulting to a blank string. 
root@29b2d94b9e8d:/# psql -U openlibrary
psql (9.3.25)
Type "help" for help.

openlibrary=# ALTER TABLE public.import_item ADD COLUMN submitter text;
ALTER TABLE
openlibrary=# \d import_item
                                      Table "public.import_item"
   Column    |            Type             |                        Modifiers                         
-------------+-----------------------------+----------------------------------------------------------
 id          | integer                     | not null default nextval('import_item_id_seq'::regclass)
 batch_id    | integer                     | 
 added_time  | timestamp without time zone | default timezone('utc'::text, now())
 import_time | timestamp without time zone | 
 status      | text                        | default 'pending'::text
 error       | text                        | 
 ia_id       | text                        | 
 data        | text                        | 
 ol_key      | text                        | 
 comments    | text                        | 
 submitter   | text                        | 
... more output omitted ...

Please let us know how that goes, and whether anything more needs to be done to use the endpoint.

This file contains a pair of uncommented records that should not produce validation errors: two_item_import.txt
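(The attached file isn't reproduced here. Purely as an illustration of the shape such records might take, a two-line JSONL submission could look something like the following; every field name is an assumption about the import record schema rather than the actual contents of two_item_import.txt.)

{"title": "Example Title One", "authors": [{"name": "Jane Example"}], "publish_date": "2001", "source_records": ["ia:example_identifier_one"]}
{"title": "Example Title Two", "authors": [{"name": "John Sample"}], "publish_date": "2010", "source_records": ["ia:example_identifier_two"]}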

scottbarnes avatar Aug 30 '24 00:08 scottbarnes