hydrus
CLI: add command to upload data from a JSON or CSV file
I'm submitting a
- [X] feature request.
Current Behaviour:
To bulk-load data into the server, the only option is to send individual PUT requests to the endpoint.
Expected Behaviour:
Data can be loaded by running a command that points to a local text file.
@Mec-iS Is it okay to add `pandas` as an additional dependency? It would help read from a variety of formats (`xlsx`, `pkl`, `json`, `csv`, `tsv`, SQL queries and a lot more).
However, it also bundles a number of data-processing modules that we don't need (keeping in mind that hydrus was meant to be "lightweight").
Always use standard library tools. The standard library has `csv` and `json` packages. The use of `pandas` is not justified at the moment.
When you say data, you mean instances of objects that the API serves, right?
I think it could be handy to have some text file having some preloaded data that can load instances right away. But we would need to define the format of such a file.
Yeah, by data I mean instances/objects that are actually served by the interface (the ones we store using the PUT method on the `Items` endpoint).
> But we would need to define the format of such a file.
Using standard formats is always the way to go. Besides JSON, it would probably be better to also support the different serialization formats for triples, for backward compatibility with older Knowledge Bases.
@vaibhavchellani ^^
@xadahiya @Mec-iS I think this issue is not solved yet? I can continue the work from https://github.com/HTTP-APIs/hydrus/pull/168, which was closed. Could you tell me why that PR was closed?
This is something needed, @sameshl. Also pinging @vedangj044, as I think he is working on it.
@sameshl #168 looks good, but it is specific to one case: it handles data for a particular resource, and only if the data doesn't contain any abstract property.
I am working on a generic preloading script that maps the column names of a CSV file to the resource names of a Hydra Doc.
Starting with resources that have 0 abstract properties, we load the data using the `crud.insert` function.
Then, for resources that need to link to other resources' properties, we first need to get the resource ID using a get function (this is not yet implemented).
This way, the data can be loaded from any CSV file.
Also, a broader solution should cover all types of sources, even Relational Databases. We can discuss 2 approaches for RDBs:
- Generic preloading script which automatically maps data to hydrus-generated database.
- Introducing a keyword named `SQL` in the Hydra vocab. This keyword would contain the SQL query needed to get data from the source database and populate the hydrus-generated DB. Example: `SQL: SELECT * FROM ARTIST;`. hydrus would then run this query and populate its database accordingly. I think this approach is much safer from a developer's point of view.
Why are batch endpoints useful? How can we add them to the existing REST API?
@Mec-iS Before working on this issue, should we set up a database config file and a `db_parser.py` for hydrus in a separate PR?
The workflow will be easier and more manipulation can be done with the database, as we will have a unified way to connect to the DB:
```python
from db_parser import get_db_url

DB_URL = get_db_url()
```
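A minimal sketch of what such a `db_parser.py` might contain, assuming an environment variable override and a JSON config file (the file name, key names, and fallback URL are all assumptions, not existing hydrus conventions):

```python
# db_parser.py (proposed): resolve the database URL from the environment
# or a config file, falling back to a local SQLite file database.
import json
import os

def get_db_url(config_path="db_config.json"):
    """Return the DB connection URL, preferring the environment."""
    url = os.environ.get("DB_URL")
    if url:
        return url
    if os.path.exists(config_path):
        with open(config_path) as f:
            # Assumed config shape: {"db_url": "..."}
            return json.load(f)["db_url"]
    # Fall back to the file database (SQLite) hydrus currently uses.
    return "sqlite:///database.db"
```

Centralizing the URL resolution this way would let every module connect the same way regardless of which backend is configured.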
Everything about the code is in the code. For now we only use a file database (SQLite). Ask your colleagues in Slack for general directions.
Is anyone working on this, at the moment?
I want to work on this issue!