Auto-tagging

Open jqnatividad opened this issue 10 years ago • 2 comments

There is already a beta implementation at https://github.com/REEEP/ckanext-climate-tagger that uses the Reegle tagging API.

The beta implementation only auto-tags the Title/Description metadata fields. It'd be awesome if it could also tag the datasets themselves.

We're currently using the Reegle API to auto-tag PDFs and it's quite powerful.

jqnatividad, Oct 14 '14 20:10

Any update on this capability? It looks like it would help us solve the challenge below.

What is the best way to: 1. Search a CKAN instance for the word "agua" or "Water" and then 2. Add the Tag "Water" to any of the datasets that are found?

Similarly, I am trying to add the same datasets to a Group called Hydro.
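One possible way to approach the two steps above is through the CKAN Action API. The sketch below is not from this thread. It assumes the standard `package_search` and `package_patch` actions, an API token sent in the `Authorization` header, and a group whose URL name is `hydro` (the group name, site URL, and token are placeholders). It runs one search per term instead of relying on a particular Solr query syntax, collects all matches first, and then patches only the datasets that lack the tag or group.

```python
import json
import urllib.request


def make_action_caller(site_url, api_token):
    """Return a function that POSTs JSON to the CKAN Action API.

    Assumes the API lives at <site_url>/api/3/action/<name> and that the
    token is accepted in the Authorization header.
    """
    def call_action(name, payload):
        request = urllib.request.Request(
            f"{site_url.rstrip('/')}/api/3/action/{name}",
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json",
                     "Authorization": api_token},
        )
        with urllib.request.urlopen(request) as response:
            return json.load(response)["result"]
    return call_action


def find_datasets(call_action, queries, page_size=100):
    """Collect every dataset matching any of the queries, de-duplicated by id."""
    found = {}
    for query in queries:
        start = 0
        while True:
            page = call_action(
                "package_search",
                {"q": query, "rows": page_size, "start": start},
            )
            results = page["results"]
            for dataset in results:
                found[dataset["id"]] = dataset
            start += len(results)
            if not results or start >= page["count"]:
                break
    return list(found.values())


def tag_and_group_matches(call_action, queries, tag, group=None):
    """Add `tag` (and optionally `group`) to each match; return updated names."""
    updated = []
    for dataset in find_datasets(call_action, queries):
        tag_names = [t["name"] for t in dataset.get("tags", [])]
        group_names = [g["name"] for g in dataset.get("groups", [])]
        patch = {}
        # package_patch replaces the whole list, so existing entries are resent.
        if tag not in tag_names:
            patch["tags"] = [{"name": n} for n in tag_names + [tag]]
        if group and group not in group_names:
            patch["groups"] = [{"name": n} for n in group_names + [group]]
        if patch:
            call_action("package_patch", {"id": dataset["id"], **patch})
            updated.append(dataset["name"])
    return updated
```

With a real instance this would be called roughly as `tag_and_group_matches(make_action_caller("https://ckan.example.org", "<api-token>"), ["agua", "water"], "Water", group="hydro")`. The group has to exist already, and the token's user needs permission to edit the datasets and the group.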

RichFrazier, May 01 '17 17:05

Based on the May 10, 2019 discussion on the CKAN Gitter channel:

Joel Natividad @jqnatividad 12:37 @davidread we need automatic advanced search :) - ckan/ideas-and-roadmap#228

David Read @davidread 12:48 @jqnatividad yes that looks like a quick win to get a very useful feature added, so +1 to that. But I reckon there's also much to be done for the average user as well. If you type "environment" you'd expect results for deforestation or solar power, as well as datasets from the "Department for Environment", and you don't currently, but you would on Google. Also if you misspell it, you shouldn't have to know to add "*" on the end to incorporate similar words via Levenshtein distance.

Joel Natividad @jqnatividad 12:55 In a previous project, we toyed with https://github.com/REEEP/ckanext-climate-tagger. It worked, but it only did classification on the metadata, and that was a while back. (Independently, the ClimateTagger API (api.reegle.info) is quite impressive, especially with PDF files.) It'd be great if we could leverage some of the cloud-based classification services to automatically add additional tags to datasets based on the dataset content itself, so we can use the existing Solr mechanism to search intelligently within datasets.

David Read @davidread 12:58 Great! Yeah, this would rely on a rich dataset of topics and their relationships, like REEEP but for all subjects. Maybe there is something in the DBpedia space along these lines.

Alex Harding @hardingalexh 13:19 that would be an amazing add. My team is working on an extension that parses csv/tsv files for very specifically formatted genomics data and makes that content searchable, but the only way it's been feasible is by having a very narrow scope with assumptions about how to find data within the resource

Joel Natividad @jqnatividad 13:22 Though REEEP was domain-specific, it was great at extracting concepts like places and persons.

I hear shades of Linked Data/SemTech :) Maybe this is a way to get out of the "LD Winter" (echoing AI Winter), now that we have all the memory/horsepower required for auto-classification. Now if only AWS, Azure, or Google Cloud would enable the service ;)

It seems a lot of the existing cloud-based classification services (AWS Macie, etc.) are about classifying data for security. Perhaps, this is a way to not only protect data (for PII, confidential info, etc.) but also to classify the data (e.g. this data is about energy, etc.).

One way to implement it would be to compile descriptive stats about datasets (ckan/ideas-and-roadmap#196) and only do classification on the table schema and the top N values for each column...
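As a rough illustration of that idea (not an existing CKAN feature), the sketch below reads a CSV resource, keeps the column names and the top N most frequent non-empty values per column, and flattens them into a short text document. The function names are hypothetical, and the classifier that would consume the text is left out, since no specific service was chosen in the discussion.

```python
import csv
import io
from collections import Counter


def summarize_table(csv_text, top_n=3):
    """Return the column names and the top_n most frequent values per column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames or []
    counters = {name: Counter() for name in columns}
    for row in reader:
        for name in columns:
            # Short rows yield None for missing cells; treat them as empty.
            value = (row.get(name) or "").strip()
            if value:
                counters[name][value] += 1
    return {
        "schema": list(columns),
        "top_values": {
            name: [value for value, _ in counters[name].most_common(top_n)]
            for name in columns
        },
    }


def classifier_input(summary):
    """Flatten the summary into one small text document, one column per line."""
    lines = []
    for name in summary["schema"]:
        values = ", ".join(summary["top_values"][name])
        lines.append(f"{name}: {values}")
    return "\n".join(lines)
```

Sending only this summary keeps the payload small regardless of how many rows the resource has, at the cost of missing concepts that appear only in rare values.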

Anyways.... let me take this off gitter and add it to ideas-and-roadmap :)

cc @davidread @hardingalexh

jqnatividad, May 10 '19 17:05