the_od_bods

Auto-categorise datasets

Open KarenJewell opened this issue 3 years ago • 5 comments

Ideally we would identify the dataset category using keywords in the dataset title and description.

Suggested reading: Topic modelling
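
To make the keyword idea concrete, here is a minimal sketch of matching keywords against the title and description. The keyword lists and the function name `categorise_by_keywords` are made up for illustration and are not part of the project; the real categories would come from the project's own configuration.

```python
import re

# Hypothetical keyword lists, for illustration only.
CATEGORY_KEYWORDS = {
    "Transportation": ["bus", "road", "parking", "cycle"],
    "Education": ["school", "pupil", "nursery"],
    "Health and Social Care": ["health", "hospital", "care home"],
}


def categorise_by_keywords(title, description, keywords=CATEGORY_KEYWORDS):
    """Return categories whose keywords appear as whole words in title + description."""
    text = f"{title} {description}".lower()
    matches = []
    for category, words in keywords.items():
        # \b anchors make "bus" match "bus stops" but not "business".
        if any(re.search(rf"\b{re.escape(word)}\b", text) for word in words):
            matches.append(category)
    # Fall back to a catch-all when nothing matches.
    return matches or ["Uncategorised"]
```

Whole-word matching avoids false hits such as "bus" inside "business", but it also misses plurals like "schools", which is one reason topic modelling may be worth a look.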

KarenJewell avatar Oct 02 '22 14:10 KarenJewell

Had a little look into this. Doing unsupervised topic modelling in R:

image

For easier integration, this may be best written as a Python script, perhaps as a step after merged_data.py.

Not sure the category will always be identified consistently, so I'll compare the manual categories against the topic model output to assess this.
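
A rough sketch of how that comparison could be scored, assuming the manual and model-assigned categories are available as two parallel lists (the function name `category_agreement` and the sample data are illustrative only):

```python
from collections import Counter


def category_agreement(manual, predicted):
    """Return (share of matching datasets, Counter of (manual, predicted) disagreements)."""
    if len(manual) != len(predicted):
        raise ValueError("manual and predicted must be the same length")
    if not manual:
        return 0.0, Counter()
    disagreements = Counter((m, p) for m, p in zip(manual, predicted) if m != p)
    agreement = 1 - sum(disagreements.values()) / len(manual)
    return agreement, disagreements


# Made-up example data: one of four datasets is categorised differently.
manual = ["Education", "Tourism", "Transportation", "Education"]
predicted = ["Education", "Arts / Culture / History", "Transportation", "Education"]
rate, diffs = category_agreement(manual, predicted)
print(rate)
print(diffs.most_common())
```

The disagreement counts show which pairs of categories get confused most often, which is probably more useful than the single agreement figure.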

fozy81 avatar Nov 10 '22 22:11 fozy81

No limitation to just Python. If R is the best language to use to implement it then feel free to use R 😊

JackGilmore avatar Nov 11 '22 12:11 JackGilmore

Thanks for the steer on this @JackGilmore. Obviously, R is always the best language 😉 but we'll see. Still quite far off preparing a PR and working out exactly where it'll fit with the existing process. But it's an interesting rabbit hole, topic modelling for the win! 🥇

fozy81 avatar Nov 11 '22 17:11 fozy81

Just wanted to make sure I wasn't giving you a bum steer after I mentioned last weekend we'd abandoned my wonderful C# code to use Python instead 😉. Looks like some promising work so far. I'm looking forward to seeing more!

JackGilmore avatar Nov 11 '22 18:11 JackGilmore

@KarenJewell @JackGilmore Suggest that instead of topic modelling, we could use Hugging Face 'open source' AI to auto-categorize the title + description strings based on the default categories.

ODSCategories.json would still provide the broad categories required, but there would be no need to maintain a keyword list for each category.

Pros:

  • No keywords to maintain
  • AI seems pretty good at using the current default 16 categories and successfully classifying the title + description
  • Could be a cleaner approach than topic modelling, which involves keyword lists

Cons:

  • The model is non-commercial, so it isn't really open source (but there may be other options)...
  • Not sure it will use 'uncategorised' if it can't categorize.
  • It'll probably guess the closest matching categories (but will probably get this wrong sometimes)
  • Relies on an external API (need to log/notify if not working)
  • Needs password and username to be passed in from GitHub secrets (may make local testing more difficult)
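
One way the 'uncategorised' and external API concerns might be handled is a thin wrapper around whatever does the classification. This is only a sketch: `categorise_with_fallback`, the logger name and the `classify` callable are assumptions, not existing project code.

```python
import logging

logger = logging.getLogger("auto_categorise")  # hypothetical logger name


def categorise_with_fallback(text, classify, fallback="Uncategorised"):
    """Call classify(text); on any failure or empty result, log it and return the fallback."""
    try:
        categories = classify(text)
    except Exception:
        # Covers network errors, login failures, etc. from the external service.
        logger.exception("Category lookup failed; using fallback")
        return [fallback]
    if not categories:
        logger.warning("No category returned for %r; using fallback", text[:60])
        return [fallback]
    return list(categories)
```

Passing the classifier in as a function also helps with local testing, since a stub can stand in for the real API call without needing credentials.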

I've tested out some pseudocode which works in principle:

```python
from hugchat import hugchat
from hugchat.login import Login


# Wrapped in a function (the name is illustrative) so that the final return is valid
def categorise(str_title_description):
    email = "your_email@example.com"  # ...pass in secret variable from GitHub
    passwd = "your_password"          # ...pass in secret variable from GitHub
    sign = Login(email, passwd)
    cookies = sign.login()
    sign.saveCookies()  # Save cookies to usercookies/<email>.json

    # Create a ChatBot
    chatbot = hugchat.ChatBot(cookies=cookies.get_dict())

    # Create prompt
    prompt = (
        "Using only the following categories: 'Food and Environment', 'Council and Government', "
        "'Elections / Politics', 'Planning and Development', 'Housing and Estates', "
        "'Parks / Recreation', 'Sport and Leisure', 'Education', 'Transportation', "
        "'Law and Licensing', 'Business and Economy', 'Arts / Culture / History', 'Tourism', "
        "'Budget / Finance', 'Health and Social Care', 'Public Safety', "
        "tell me very briefly only the categories that best match the following description: '"
        + str_title_description + "'."
    )

    # Return string from chatbot which should contain the categories we wish to identify,
    # even if the original string (title + description) didn't mention the categories:
    bot_categories = chatbot.chat(prompt, is_retry=True, retry_count=5)
    # match_categories() is assumed to be defined elsewhere
    categories_result = match_categories(bot_categories)
    return categories_result
```
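
`match_categories` is not defined in the snippet above. One possible version, offered as an assumption rather than the actual implementation, scans the chatbot's reply for the known category names and falls back to 'Uncategorised' when none are found:

```python
# The 16 default categories listed in the prompt above.
CATEGORIES = [
    "Food and Environment", "Council and Government", "Elections / Politics",
    "Planning and Development", "Housing and Estates", "Parks / Recreation",
    "Sport and Leisure", "Education", "Transportation", "Law and Licensing",
    "Business and Economy", "Arts / Culture / History", "Tourism",
    "Budget / Finance", "Health and Social Care", "Public Safety",
]


def match_categories(bot_reply, categories=CATEGORIES):
    """Return the known categories named in the reply, or ['Uncategorised'] if none are."""
    reply = bot_reply.lower()
    # Case-insensitive substring match against each exact category name.
    found = [category for category in categories if category.lower() in reply]
    return found or ["Uncategorised"]
```

This only catches replies that repeat a category name exactly, so a reply written as "Parks/Recreation" without the spaces would be missed. Some normalisation of spacing and punctuation would likely be needed in practice.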

fozy81 avatar Jun 20 '23 21:06 fozy81