
Specify unique id

Open magick93 opened this issue 3 years ago • 3 comments

Is your feature request related to a problem? Please describe.

I would like to be able to mark a column as holding a unique value. This would make the downloaded data more convenient to use, including across subsequent runs.

Describe the solution you'd like

When persisting to disk, currently each scrape run creates a new folder, and each item (row in a table) has a randomly generated, unique ID.

If I were able to specify which column in the table had a unique ID, then:

  • rather than creating a new folder, the same folder could be used
  • rather than creating a new file for each table row, the same file could be used (eg, similar to upsert)
  • when a new item is added, then a new file is created (eg, similar to an insert)
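The behaviour described in the list above could look roughly like the following. This is only a sketch of the idea, not spatula's code: `upsert_item`, the `id_field` parameter, and the `.json` naming are all hypothetical, and it assumes the unique ID is safe to use as a filename.

```python
import json
import tempfile
from pathlib import Path


def upsert_item(output_dir: Path, item: dict, id_field: str = "id") -> Path:
    """Write item to <output_dir>/<unique id>.json, replacing any earlier version.

    Hypothetical helper: assumes item[id_field] is unique and filesystem-safe.
    """
    output_dir.mkdir(parents=True, exist_ok=True)
    path = output_dir / f"{item[id_field]}.json"
    path.write_text(json.dumps(item, indent=2, sort_keys=True))
    return path


# Simulate several saves into the same folder.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "scrape"
    upsert_item(out, {"id": "a1", "name": "first run"})
    upsert_item(out, {"id": "a1", "name": "second run"})  # same ID: file is overwritten
    upsert_item(out, {"id": "b2", "name": "new item"})    # new ID: new file is created
    files = sorted(p.name for p in out.iterdir())
    latest = json.loads((out / "a1.json").read_text())
```

Because the filename is derived from the ID rather than a random UUID, repeated runs converge on one file per item.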

Describe alternatives you've considered

Alternatives include using a database, which would involve a lot more development overhead. This solution is, IMO, more lightweight, as it basically uses the local file system as the database. It does depend on the data provider having some kind of unique ID, and on the scraper developer being able to identify and use this ID.

Additional context

Additionally, it would be good if, on subsequent runs/scrapes, spatula could read in the already persisted JSON, compare it to what is being scraped, and only persist if there are differences.

  • Then, for example, if scrape results are committed to git, only those that have been changed will be committed.
  • Also, it would be easy to see, using the filesystem modified field, which items have been recently updated.
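A minimal sketch of the compare-before-write idea, assuming each item is stored as one JSON file. `save_if_changed` is a hypothetical name, not an existing spatula function, and treating an unreadable file as "changed" is just one possible choice.

```python
import json
import tempfile
from pathlib import Path


def save_if_changed(path: Path, item: dict) -> bool:
    """Write item as JSON only if it differs from what is on disk.

    Returns True when the file was written, False when it was left untouched
    (so its modified time does not change).
    """
    if path.exists():
        try:
            if json.loads(path.read_text()) == item:
                return False
        except json.JSONDecodeError:
            pass  # existing file is not valid JSON: fall through and rewrite it
    path.write_text(json.dumps(item, indent=2, sort_keys=True))
    return True


# Simulate three consecutive scrapes of the same item.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "a1.json"
    first = save_if_changed(target, {"id": "a1", "name": "x"})   # no file yet
    second = save_if_changed(target, {"id": "a1", "name": "x"})  # identical data
    third = save_if_changed(target, {"id": "a1", "name": "y"})   # data changed
```

Comparing the parsed data rather than the raw text means formatting differences alone do not trigger a rewrite.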

magick93 avatar Jun 20 '21 05:06 magick93

thanks for this, I like this idea & think it is definitely worth exploring

jamesturk avatar Jun 21 '21 19:06 jamesturk

I'm thinking of tackling this in 3 pieces:

  • [ ] allow specification of "primary key" & use primary key to name files
  • [ ] allow using same folder for each scrape
  • [ ] modify save code to only save if there are changes (so that modified date doesn't update)

I'm realizing that 1 and 2 are actually already nearly possible, but the UX here could be better (or at least better documented).

  • If you define get_filename on your output type, that will be used instead of the UUID.
  • If you pass --output-dir to spatula scrape, that directory will be used instead of a randomly generated one. (Right now, however, spatula will not run if the output directory isn't empty.)
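To illustrate the first point, here is a rough sketch of an output type that defines get_filename. The `Person` class and its fields are made up for illustration, and `choose_filename` is a hypothetical stand-in for whatever spatula does internally when saving; whether spatula appends a file extension to the returned name is not stated here, so the sketch returns a bare name.

```python
from dataclasses import dataclass


@dataclass
class Person:
    # Hypothetical output type; the field names are assumptions.
    person_id: str
    name: str

    def get_filename(self) -> str:
        # Derive a stable name from the item's unique ID instead of a UUID.
        return f"person-{self.person_id}"


def choose_filename(item: object, fallback: str) -> str:
    """Hypothetical stand-in for the save logic: prefer get_filename if defined."""
    getter = getattr(item, "get_filename", None)
    return getter() if callable(getter) else fallback


named = choose_filename(Person("42", "Ada"), fallback="random-uuid")
unnamed = choose_filename(object(), fallback="random-uuid")
```

With this shape, an item's primary key only needs to appear in one place, the get_filename method.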

I'm wondering if you have any thoughts about the UX here, is get_filename a suitable option instead of defining a specific primary key? Perhaps get_primary_key should be available instead?

I'm thinking at least the first two pieces of this can fit into 0.9. I'm undecided on the modified date piece since it seems like it'll come with a whole new set of edge cases, and the mentioned use case of checking results into git already works as long as the output itself doesn't change.

jamesturk avatar Jun 22 '21 17:06 jamesturk

Thanks @jamesturk

I'm wondering if you have any thoughts about the UX here, is get_filename a suitable option instead of defining a specific primary key? Perhaps get_primary_key should be available instead?

Yes, I think get_filename is a suitable option.

Another option, though it may not always be available, is using the column name for the primary key.

magick93 avatar Jun 22 '21 20:06 magick93