
Getting data

Open santiagobasulto opened this issue 5 years ago • 7 comments

Hello, I maintain a Dataset in Kaggle with HN posts and points per category: https://www.kaggle.com/santiagobasulto/all-hacker-news-posts-stories-askshow-hn-polls

It might be useful. The source is available here

santiagobasulto avatar Nov 13 '19 12:11 santiagobasulto

That's awesome. Soon as I can I'll try to retrain the NN with the new data.

victorqribeiro avatar Nov 13 '19 13:11 victorqribeiro

👍 great! I'll update it this afternoon, I run a script periodically to have the latest data in it.

santiagobasulto avatar Nov 13 '19 13:11 santiagobasulto

Boy, my computer is having a hard time processing this much data. I don't think I'll be able to train the NN with such a huge amount of data.

victorqribeiro avatar Nov 13 '19 18:11 victorqribeiro

😂 you can use colab or other platforms with GPU/CPU. What do you need to extract from it?

santiagobasulto avatar Nov 13 '19 20:11 santiagobasulto

I only need to extract the title and the score. Then I have to tokenize the words, turning them into vectors, and only then can I feed the new data to the NN. I'll take a look at it after I leave work.

victorqribeiro avatar Nov 13 '19 20:11 victorqribeiro
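For readers following along, the steps described above (keep only title and score, tokenize, turn titles into vectors) could look roughly like the sketch below. It is not the project's actual preprocessing: the column names `title` and `score`, the helper names, and the bag-of-words encoding are all assumptions chosen for illustration. Rows are streamed one at a time, so the full dataset never needs to be parsed in one pass.

```python
import csv
import io
import re
from collections import Counter


def tokenize(title):
    """Lowercase a title and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", title.lower())


def load_title_score(fileobj, title_col="title", score_col="score"):
    """Stream CSV rows and keep only (title, score) pairs.

    The column names are hypothetical; adjust them to the real header.
    """
    pairs = []
    for row in csv.DictReader(fileobj):
        pairs.append((row[title_col], int(row[score_col])))
    return pairs


def build_vocab(titles, max_size=1000):
    """Map the most frequent tokens to vector positions."""
    counts = Counter(tok for title in titles for tok in tokenize(title))
    return {tok: i for i, (tok, _) in enumerate(counts.most_common(max_size))}


def vectorize(title, vocab):
    """Encode one title as a bag-of-words count vector."""
    vec = [0] * len(vocab)
    for tok in tokenize(title):
        idx = vocab.get(tok)
        if idx is not None:  # tokens outside the vocabulary are skipped
            vec[idx] += 1
    return vec


# Tiny in-memory sample standing in for the real CSV file.
sample = io.StringIO(
    "title,score,author\n"
    "Show HN: My tool,10,a\n"
    "Ask HN: My question,3,b\n"
)
pairs = load_title_score(sample)
vocab = build_vocab([title for title, _ in pairs])
vectors = [vectorize(title, vocab) for title, _ in pairs]
scores = [score for _, score in pairs]
```

Capping the vocabulary with `max_size` keeps the vectors small, which matters when the machine is already struggling with the data volume.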

Alright, I'll get that ready for you soon.

santiagobasulto avatar Nov 13 '19 20:11 santiagobasulto

Just created a small version containing only Title, Post Type and Points: https://drive.google.com/file/d/1sZx3zidIwezFx4gNEWZIJ7KpN4V-eBEE/view?usp=sharing

Post Type is encoded: 0 for regular stories, 1 for Ask HN, 2 for Polls and 3 for Show HN.


santiagobasulto avatar Nov 13 '19 20:11 santiagobasulto
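A minimal sketch of reading the reduced file and decoding the Post Type column, using the encoding given above (0 regular story, 1 Ask HN, 2 Poll, 3 Show HN). The exact header names `Title`, `Post Type` and `Points`, the label strings, and the `read_posts` helper are assumptions; check them against the real file before relying on this.

```python
import csv
import io

# Encoding as described in the thread; the label strings are illustrative.
POST_TYPES = {0: "story", 1: "ask_hn", 2: "poll", 3: "show_hn"}


def read_posts(fileobj):
    """Yield one dict per row with the post type decoded to a label."""
    for row in csv.DictReader(fileobj):
        yield {
            "title": row["Title"],              # assumed header name
            "post_type": POST_TYPES[int(row["Post Type"])],
            "points": int(row["Points"]),       # assumed header name
        }


# Tiny in-memory sample standing in for the downloaded file.
sample = io.StringIO(
    "Title,Post Type,Points\n"
    "A regular story,0,12\n"
    "Ask HN: Something?,1,5\n"
    "Poll: Tabs or spaces,2,7\n"
    "Show HN: A thing,3,40\n"
)
posts = list(read_posts(sample))
```

Because `read_posts` is a generator, it can also be consumed row by row instead of being collected into a list.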