Justin Yang comments

Results: 9 comments of Justin Yang

Dataset for Epoch2

Apologies for the delayed response. You can download the dataset from the Kaggle Competition, [Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data).

Could you provide 'data/processed/reply/' to try out? Thanks

Hi, you can get the dataset [here](https://drive.google.com/file/d/11JlbmYmuu00TfmfdAfyoGK_E3VMp8Vd_/view).

Does this project require a specific Python version to run?

This project is based on Python 3.

Does this project require a specific Python version to run?

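> The JSON has no `title` key or value, so how are titles similar to the user's input found?

The output titles are placed in `SegTitles.txt`, and the default matching method is BM25.

> Why are the replies different on the two sides?

This is related to the [threshold setting](https://github.com/zake7749/PTT-Chat-Generator/blob/master/chat.py#L46). If no similar title is found in the corpus, the default reply is used directly.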
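To make the matching-plus-threshold idea concrete, here is a minimal sketch of Okapi BM25 scoring with a fallback reply. It is not the code in `chat.py`: the function names, the `k1`/`b` values, the `<` comparison against the threshold, and the toy data are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query (Okapi BM25).

    Assumes `docs` is a non-empty list of non-empty token lists.
    """
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def pick_reply(query, titles, replies, threshold, default_reply):
    """Return the reply paired with the best-matching title.

    Falls back to `default_reply` when the best score is below `threshold`
    (hypothetical rule; the real project may compare differently).
    """
    scores = bm25_scores(query, titles)
    best = max(range(len(titles)), key=scores.__getitem__)
    if scores[best] < threshold:
        return default_reply
    return replies[best]


# Toy corpus: already-segmented titles and one reply per title.
titles = [["weather", "today", "nice"], ["dinner", "eat", "what"]]
replies = ["go outside", "ramen"]

matched = pick_reply(["dinner", "eat"], titles, replies, 0.5, "hmm")
fallback = pick_reply(["stocks"], titles, replies, 0.5, "hmm")
```

With this toy data, the first query overlaps the second title and returns its reply, while the second query shares no terms with any title and falls through to the default. This would explain how the same input can produce different replies under different threshold values.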

How do I extract question-answer pairs?

Hello, I will need to look for the original file. If you need it urgently, I just made a quick modification based on [chat.py](https://github.com/zake7749/PTT-Chat-Generator/blob/master/chat.py#L46). I have not actually run it yet, but the flow should be correct:

```python
def dumpQAPairs(self, path):
    with open("data/Titles.txt", 'r', encoding='utf-8') as data:
        titles = [line.strip('\n') for line in data]
    with open(path, 'w', encoding='utf-8') as dump:
        index = 0
        for title in...
```

How do I extract question-answer pairs?

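`Re:` and `Fw:` posts are not included. Incidentally, articles whose titles begin with [these tags](https://github.com/zake7749/PTT-Chat-Generator/blob/master/data/stopwords/gossiping.tag) are not included either. However, the main reason the line counts differ is that the two projects use different raw data. Gossiping-Chinese-Corpus is extracted from Gossiping board articles from late 2015 to the end of June 2017, whereas Titles.txt in this project comes from Gossiping and C_Chat board articles from mid-2016 to the end of October 2016.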
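The prefix-based exclusion described above can be sketched as follows. This is only an illustration: `keep_title` and the sample titles are hypothetical, and the tag list from `gossiping.tag` is not reproduced here, so only `Re:` and `Fw:` are used as defaults.

```python
def keep_title(title, excluded_prefixes=("Re:", "Fw:")):
    """Return True if the title does not start with an excluded prefix."""
    # str.startswith accepts a tuple and matches any of its members.
    return not title.lstrip().startswith(tuple(excluded_prefixes))


# Made-up sample titles for illustration only.
titles = ["Re: [問卦] example", "[問卦] example", "Fw: [新聞] example"]
kept = [t for t in titles if keep_title(t)]
```

Extra tags loaded from a tag file could be passed in through `excluded_prefixes`.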

How do I extract question-answer pairs?

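Do you mean that after cleaning there should be 418202 lines rather than 234706? If you are crawling **now** with PTT-Crawler starting from 2015, this result is normal, because most articles from 2015 have already been purged by the system. See [this issue](https://github.com/zake7749/Gossiping-Chinese-Corpus/issues/1) for a detailed explanation.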

How do I extract question-answer pairs?

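I just did a quick calculation from the index: as of August 19 the Gossiping board has roughly 24700 pages with 20 titles per page, so if every article could be fetched there would be 494,000 in total. Setting aside articles that were lost during page turnover or that are malformed, I think fetching 466,686 articles is quite reasonable. The point is that it is not that *some* articles get deleted, but that *almost all* articles get deleted. The starting point for non-M posts is currently page [636](https://www.ptt.cc/bbs/Gossiping/index636.html), whose article is timestamped `Sat Feb 4 2017`, which means the time span you crawled actually starts from 2017/2...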
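The estimate above can be checked with a few lines of arithmetic. The figures are taken from the comment itself; the variable names are only for illustration.

```python
pages = 24_700           # approximate number of index pages
titles_per_page = 20     # titles listed on each index page
fetched = 466_686        # articles actually crawled

upper_bound = pages * titles_per_page  # 494000 if nothing were lost
missing = upper_bound - fetched        # 27314 articles not fetched
coverage = fetched / upper_bound       # fraction of the upper bound fetched

print(f"upper bound: {upper_bound}, missing: {missing}, coverage: {coverage:.2%}")
```

This comes to roughly 94% of the upper bound, which is consistent with a small share of articles being lost or malformed.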

How do I extract question-answer pairs?

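Once data has been deleted from the original site, it probably cannot be recovered. However, some websites specialize in backing up PTT articles, so you may be able to start from there.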