Scribe-Data icon indicating copy to clipboard operation
Scribe-Data copied to clipboard

Decouple `convert` from `get` in Scribe-Data CLI

Open andrewtavis opened this issue 1 year ago β€’ 6 comments

Terms

Description

As of now the convert command is coupled to the get command in the Scribe-Data CLI. The goal of this part of the CLI was that if a user first got JSON data, then they could convert it easily into another file format with the CLI. The current way to do this is:

scribe-data get -lang English -dt verbs -od ./output_data -ot csv

But this would run the Wikidata query again, thus adding to the load on the Wikidata Query Service. It would be great if the user could first get JSON data down and then run:

scribe-data convert --input-file FILE_PATH --output-type csv --output-directory FILE_PATH

Contribution

Happy to support with this and review as is needed 😊 Let's first map out the changes that need to happen, and then we can plan the work from there!

andrewtavis avatar Oct 03 '24 09:10 andrewtavis

Hey @andrewtavis . I'm an outreachy W'24 applicant and I am interested in this.

john-thuo1 avatar Oct 03 '24 14:10 john-thuo1

Hey @john-thuo1! Thanks for your interest in working with us! Assigning now, and please feel free to ask questions here. I'd suggest checking out the CLI code for get and convert, with each having their own files. From there we can make convert a more distinct process :)

CC @mhmohona for questions 😊

andrewtavis avatar Oct 03 '24 21:10 andrewtavis

Hi @mhmohona @andrewtavis ,

I’ve managed to set up the following command in regard to this issue:
scribe-data convert --lan French --data-type nouns --input-file ./fli/French/nouns.json --output-dir ./converted/French --output-type csv

However, I’m having trouble understanding how the CSV outputs should look, given that the JSON files for a particular language don’t follow a consistent format.

For example, in French, there are six JSON files (e.g., verbs.json, nouns.json), each structured differently. Could you provide some clarity on how the CSV output should be structured for each file?

The idea I had is to build these CSV files based on the dtype specified by the user within the convert_to_csv_or_tsv function e.g if dtype == "verbs": writer.writerow(["verb", "tense", "form"]) # pass elif dtype == "emoji": writer.writerow(["key", "emoji", "is_base", "rank"])

This will be for JSON Files, then I have to do sth similar for other file types. Eventually, this might add too much code. Any other approach you might have?

john-thuo1 avatar Oct 08 '24 10:10 john-thuo1

so if you check the json file, it will be easier to find the format. For example, for this section-

{
"3D print": {
"presSimp": "3D print",
"presTPS": "3D prints",
"presPart": "3D printing",
"presFPSCont": "am 3D printing",
"prePluralCont": "are 3D printing",
"presTPSCont": "is 3D printing",
"presPerfSimp": "have 3D printed",
"presPerfTPS": "has 3D printed",
"presPerfSimpCont": "have been 3D printing",
"presPerfTPSCont": "has been 3D printing",
"pastSimp": "3D printed",
"pastSimpCont": "was 3D printing",
"pastSimpPluralCont": "were 3D printing",
"pastPerf": "had 3D printed",
"pastPerfCont": "had been 3D printing",
"futSimp": "will 3D print",
"futCont": "will be 3D printing",
"futPerf": "will have 3D printed",
"futPerfCont": "will have been 3D printing",
"condSimp": "would 3D print",
"condCont": "would be 3D printing",
"condPerf": "would have 3D printed",
"condPerfCont": "would have been 3D printing"
},

You can name header as verb, then under it, can keep the value print, then next header can be presSimp and its value will be 3D print; like this the rest should be formatted. Similarly for other parts of speech, it should be the same way.

@andrewtavis, am I correct? What do you think about it?

mhmohona avatar Oct 09 '24 10:10 mhmohona

I would say that we should do a flat return of the columns. So the above from @mhmohona would have a verb column and then each of the conjugations should be a column in the CSV as well. So the structure of the JSON should determine the output columns.

andrewtavis avatar Oct 09 '24 10:10 andrewtavis

@mhmohona @andrewtavis , More clearer now, thank you!

john-thuo1 avatar Oct 09 '24 21:10 john-thuo1

@andrewtavis, Regarding the get functionality returning only JSON outputs, contrary to the documentation, which indicates the outputs can also be in other formats, should I proceed to work out how we can have the outputs in different formats, or will what I submitted in the PR(conversion functionality) suffice for now?

john-thuo1 avatar Oct 22 '24 16:10 john-thuo1

We should also be able to get these outputs directly via the get command, but this should be by getting a JSON and then using convert on the output file :) Before we had "export CSV/TSV" functionality, but we should just have "export JSON" and then conversions for everything else.

andrewtavis avatar Oct 22 '24 22:10 andrewtavis

@andrewtavis @mhmohona , in the first command I ran, after modifying the codes, I was able to get the output as JSON first, then have it converted directly to CSV using the get command.

With the second command , I was able to convert the JSON file to TSV using the convert command. Is this the desired functionality? image

john-thuo1 avatar Oct 23 '24 11:10 john-thuo1

Yes that's perfect, @john-thuo1! Let's make sure that in the first case there's no JSON sitting around that they didn't ask for, but aside from that sounds perfect 😊

andrewtavis avatar Oct 23 '24 11:10 andrewtavis

Yes that's perfect, @john-thuo1! Let's make sure that in the first case there's no JSON sitting around that they didn't ask for, but aside from that sounds perfect 😊

@andrewtavis If I get you correctly, the data will still be queried as JSON in the first command, it is just that we don't need to output the JSON file. Just the CSV?!

john-thuo1 avatar Oct 23 '24 11:10 john-thuo1

Yes, basically the JSON should be generated in the current directory and then removed in the background.

andrewtavis avatar Oct 23 '24 11:10 andrewtavis

Yes, basically the JSON should be generated in the current directory and then removed in the background.

Alright. I will add that part then and push on the same PR I had opened earlier if that's ok!?

john-thuo1 avatar Oct 23 '24 12:10 john-thuo1

Please do :)

andrewtavis avatar Oct 23 '24 12:10 andrewtavis

Please do :)

@andrewtavis, I have updated the pr. Let me know if it's good.

john-thuo1 avatar Oct 23 '24 15:10 john-thuo1

Closed by #338 πŸš€ Thanks for all the hard work here, @john-thuo1! We can look into getting the tests back to their perfect state soon. The language metadata file changed, so this is likely why some tests were having issues. Let's work from a new issue for any further work :)

andrewtavis avatar Oct 24 '24 22:10 andrewtavis