Investigate how to use machine-learning to classify bookmarks
When you submit a link, Bookie will go to the URL and parse data such as:
- Title
- Description
- Image
- Body
The Body contains the actual article text (e.g. when a link to a news article is submitted). Based on this data, it would be cool if Bookie could classify the text/URL and auto-tag the bookmark for the user.
For example, I submit a lot of information security URLs, everything from tweets to videos. If Bookie could classify a bookmark as being about Hacking or Command & Control based on the body of the data, it would create a really cool user experience.
The purpose of this issue is to determine which software to use and how to classify the bookmarks.
Would love to work on this
@AmanPriyanshu Go for it!
So far I have created a very basic tagger that can only assign basic tags (I will continue to work on it). The tags will then be used to classify the bookmarks. My issue is that it does not seem to work well for media sites like YouTube, which could be a huge chunk of bookmarks for some people.
@AmanPriyanshu How does the tagger work? Can you push your code to a branch so I can check it out?
But yeah, media sites can be a bit tricky. I was wondering if one could use an algorithm that "sees" that there is a video embedded in the body and tags the bookmark as "Video". Another possibility is to simply auto-tag certain bookmarks with Video. YouTube links only contain videos, so Bookie could tag them with Video immediately (see https://github.com/mjdubell/Bookie/issues/40).
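Both ideas above can be sketched with the standard library only. This is a minimal illustration, not Bookie's actual code: the names `VIDEO_HOSTS`, `is_video_host`, `body_has_embedded_video` and `auto_tags` are hypothetical, and the host list is an assumption that would need extending.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Assumed list of hosts that only serve videos; extend as needed.
VIDEO_HOSTS = {"youtube.com", "youtu.be", "vimeo.com"}


def is_video_host(url):
    """Return True if the URL's host is a known video host or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == h or host.endswith("." + h) for h in VIDEO_HOSTS)


class _VideoFinder(HTMLParser):
    """Sets .found when the HTML has a <video> tag or an iframe from a video host."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag == "video":
            self.found = True
        elif tag == "iframe":
            src = dict(attrs).get("src") or ""
            if is_video_host(src):
                self.found = True


def body_has_embedded_video(html):
    finder = _VideoFinder()
    finder.feed(html)
    return finder.found


def auto_tags(url, body_html=""):
    """Rule-based tags: 'Video' for known video hosts or bodies with embedded video."""
    tags = set()
    if is_video_host(url) or body_has_embedded_video(body_html):
        tags.add("Video")
    return tags
```

For example, `auto_tags("https://www.youtube.com/watch?v=abc")` returns `{"Video"}` without looking at the body at all, while an ordinary article is only tagged if its HTML contains a `<video>` element or an iframe pointing at one of the listed hosts.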
Can I have a buffer period of 12 hours? I need to work on this a little more and make a few adjustments. Also, I am kind of new to open source, so I don't know whether all issues have to be assigned or not.
Sure, no problem :) Take your time! I can assign this issue to you.
I'm sorry I haven't been keeping you informed, but I just finished improving the tagger. It now has weights; I will add a cost function, and the existing weights will then change accordingly. I plan on using Leaky ReLU or just a linear activation, since the tags have already been given points and the tagger only needs to sort them in order of points. Please review my code. I would love to hear any criticism, since this is my first time dealing with open source projects.
I'm not that familiar with ML algorithms, could you elaborate on why "Leaky ReLU" is a good choice for classifying text and how it works?
OK, so currently, as you can see in the code, the tags are given points based on parameters such as the frequency distribution of the tag, the frequency of its synonyms, links which offer the same tags, etc. This allows us to identify the tags accurately. But deciding how much weight to give each parameter requires a lot of manual tweaking. For example, if I only use the frequency distribution, the tag 'js' rises to the top: it wasn't even written on the page, but it was frequent because of the JavaScript in the web page. So by running the tagger on training cases, we can let machine learning decide how many points, or how much weight, to give each parameter. For finding these weights I'm using Leaky ReLU, which is suited to predicting values that vary linearly, like the points given to bookmarks. I have looked at other activation functions, but this one seemed appropriate; however, I'd love to incorporate any other activation function if you would like.
Also, I understand that I wrote the code poorly, and I will definitely try to improve its quality. How would you like me to implement classes? Do I create one class with all these functions? I will also correct the code where you have advised.
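To make the idea concrete, here is a rough sketch of weighted tag scoring with weights learned from training cases. It is not the code from the branch: the feature names (`frequency`, `synonym_frequency`, `link_support`), the toy numbers and the function names are all assumptions for illustration. Note that Leaky ReLU, f(x) = x for x > 0 and αx otherwise, is only the activation applied to the weighted sum; the weights themselves are found by minimising a cost function, here squared error with plain gradient descent.

```python
def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: identity for positive x, small slope alpha otherwise."""
    return x if x > 0 else alpha * x


def tag_score(features, weights, alpha=0.01):
    """Points for one tag: activation of the weighted sum of its features."""
    z = sum(weights[name] * value for name, value in features.items())
    return leaky_relu(z, alpha)


def rank_tags(candidates, weights):
    """Sort candidate tags by score, highest first."""
    return sorted(candidates, key=lambda t: tag_score(candidates[t], weights),
                  reverse=True)


def train(examples, weights, lr=0.1, epochs=200, alpha=0.01):
    """Fit weights by gradient descent on squared error.

    examples: list of (features, target_score) pairs (hypothetical format).
    """
    weights = dict(weights)  # do not mutate the caller's dict
    for _ in range(epochs):
        for features, target in examples:
            z = sum(weights[n] * v for n, v in features.items())
            error = leaky_relu(z, alpha) - target
            slope = 1.0 if z > 0 else alpha  # derivative of Leaky ReLU at z
            for n, v in features.items():
                weights[n] -= lr * error * slope * v
    return weights


# Toy data: 'js' is frequent only because of scripts in the page.
candidates = {
    "js": {"frequency": 0.9, "synonym_frequency": 0.0, "link_support": 0.0},
    "security": {"frequency": 0.4, "synonym_frequency": 0.6, "link_support": 0.5},
}
initial = {"frequency": 1.0, "synonym_frequency": 0.0, "link_support": 0.0}
examples = [(candidates["js"], 0.0), (candidates["security"], 1.0)]
learned = train(examples, initial)
```

With frequency as the only weighted parameter, `rank_tags(candidates, initial)` puts 'js' first; after training on the two labelled cases, `rank_tags(candidates, learned)` puts 'security' first because the frequency weight has been pushed down and the other two up. A real training set would of course need many more cases than this.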
@AmanPriyanshu I haven't had time yet to fully read up on your algorithm, but I will try to do it next week.
Regarding classes: try to break down your script into components that each do one thing. For example, if your algorithm does some sort of special computation, create a function called def special_computation(args):, and do that for all components. The idea is basically to break your code down into smaller pieces, and from there we can build a class that performs your algorithm.
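As a rough illustration of that structure (not the actual tagger), the sketch below splits a toy frequency-based tagger into three small functions and then wraps them in a class. All names here (`tokenize`, `count_terms`, `score_tags`, `Tagger`) are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def count_terms(tokens):
    """Count how often each token occurs."""
    return Counter(tokens)


def score_tags(term_counts, known_tags):
    """Relative frequency of each known tag that appears in the text."""
    total = sum(term_counts.values()) or 1  # avoid dividing by zero
    return {tag: term_counts[tag] / total
            for tag in known_tags if term_counts[tag] > 0}


class Tagger:
    """Ties the small steps together behind a single method."""

    def __init__(self, known_tags, max_tags=3):
        self.known_tags = set(known_tags)
        self.max_tags = max_tags

    def tag(self, body):
        scores = score_tags(count_terms(tokenize(body)), self.known_tags)
        # Highest score first; ties broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        return ranked[:self.max_tags]
```

Each function can be tested and replaced on its own, so swapping the simple frequency score for a weighted one later only touches `score_tags`. Usage would look like `Tagger({"hacking", "malware"}).tag(body)`.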