dl-models-for-qa

Hello, I want to know how I can generate the flashcards-idx file

Open ynuwm opened this issue 7 years ago • 7 comments

When I run the script es-load-flashcards.py, I get this error:

ConnectionError: ConnectionError(<urllib3.connection.HTTPConnection object at 0x000001974503DC50>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it.) caused by: NewConnectionError(<urllib3.connection.HTTPConnection object at 0x000001974503DC50>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it.)

Can you give me the flashcards-idx you generated? Or can you give me some suggestions about how I should set the host and port in the code:

es = elasticsearch.Elasticsearch(hosts=[{ "host": "localhost", "port": "9200" }])

ynuwm, Aug 15 '17 13:08

Hi @ynuwm, sorry, but I don't have the index data anymore. The README.md file has a link to the flashcards data that I used to generate it. You will need to start up an Elasticsearch server for this; see https://www.elastic.co/guide/en/elasticsearch/reference/current/_installation.html for instructions. The objective is to create input data that includes "story" data from the flashcards, i.e., text from the top 10 flashcards that match the question. The output of the add-story.py script is used as input to the qa-[b]lstm-story.py networks.
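A connection-refused error like the one reported usually means nothing is listening on the host and port the client points at. Here is a minimal sketch, using only the standard library, for checking that before running es-load-flashcards.py. The helper name `is_port_open` is hypothetical, and localhost:9200 is simply the host and port from the snippet in the question:

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError)
        # when nothing is listening or the host is unreachable.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # localhost:9200 is the host/port from the snippet in the question.
    if is_port_open("localhost", 9200):
        print("something is listening on localhost:9200")
    else:
        print("nothing is listening on localhost:9200; start the server first")
```

This only confirms that some process accepts TCP connections on that port; it does not verify that the process is Elasticsearch.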

sujitpal, Aug 15 '17 14:08

Thank you @sujitpal. I have downloaded the studystack data from your link. What confused me is the code: I ran it but got an error, i.e., I don't know how to process the flashcards to generate flashcards-idx. Maybe it is because I haven't installed the software; I'll try the instructions at the link you suggested.

from __future__ import division, print_function
import elasticsearch
import nltk
import os

DATA_DIR = "../data/comp_data"
STORY_FILE = "studystack_qa_cleaner_no_qm.txt"
STORY_INDEX = "flashcards-idx"

es = elasticsearch.Elasticsearch(hosts=[{
    "host": "localhost",
    "port": "9200"
}])

if es.indices.exists(STORY_INDEX):
    print("deleting index: %s" % (STORY_INDEX))
    resp = es.indices.delete(index=STORY_INDEX)
    print(resp)

body = {
    "settings": {
        "number_of_shards": 5,
        "number_of_replicas": 0
    }
}
print("creating index: %s" % (STORY_INDEX))
resp = es.indices.create(index=STORY_INDEX, body=body)
print(resp)

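# read the file as bytes; each line is decoded explicitly below
fstory = open(os.path.join(DATA_DIR, STORY_FILE), "rb")
lno = 0
for line in fstory:
    line = line.strip()
    # drop non-ASCII characters, then convert back to text so that
    # split("\t") works on both Python 2 and Python 3
    line = line.decode("utf8").encode("ascii", "ignore").decode("ascii")
    fcid, sent, ans = line.split("\t")
    story = " ".join(nltk.word_tokenize(" ".join([sent, ans])))
    doc = { "story": story }
    # count the line before indexing so ids run 1..N and the final total is exact
    lno += 1
    resp = es.index(index=STORY_INDEX, doc_type="stories", id=lno, body=doc)
    print(resp["created"])
    if lno % 1000 == 0:
        print("# stories read: %d" % (lno))
print("# stories read and indexed: %d" % (lno))
fstory.close()
es.indices.refresh(index=STORY_INDEX)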

query = """ { "query": { "match_all": {} } }"""
resp = es.search(index=STORY_INDEX, doc_type="stories", body=query)
print("# of records in index: %d" % (resp["hits"]["total"]))

ynuwm, Aug 15 '17 14:08

Hi @sujitpal, I have successfully created flashcards-idx, but when I run add-story.py I get the same story, i.e., different questions get the same story. Can you give me some advice?

ynuwm, Aug 16 '17 03:08

That is expected, as long as it's not the same set of flashcards that gets attached to every question. The idea is that you are treating the flashcards as background knowledge. So when a human answers a question, he can draw from multiple sources of prior knowledge. Similarly, he can draw on the same bit of prior knowledge to answer different questions. The add-story.py script just does a search with the question as the query to find matching stories, so it is possible that a story could be associated with multiple questions.

However, if you are seeing the exact same stories associated with every question, then that might be a bug. Reopening the issue, please let me know if that's the case and I will investigate.
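The search step described here can be sketched roughly as follows. This is not the actual add-story.py code: `build_story_query` is a hypothetical helper that only builds the request body, assuming a standard Elasticsearch `match` query on the `story` field that es-load-flashcards.py populates, capped at the top 10 hits.

```python
import json


def build_story_query(question, top_n=10):
    """Build a search body that matches the question text against the "story" field."""
    return {
        "query": {"match": {"story": question}},
        "size": top_n,  # keep only the top_n best-matching flashcards
    }


if __name__ == "__main__":
    body = build_story_query("What force keeps the planets in orbit?")
    print(json.dumps(body, indent=2))
```

The body would be passed to `es.search(index="flashcards-idx", body=body)`, the same call the loading script uses for its `match_all` check. Since the query text changes with each question, two questions sharing a few hits is normal, while identical hits for every question would be suspicious.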

sujitpal, Aug 16 '17 15:08

Maybe I know where the problem is. This morning I tried it on my desktop computer and got different stories. This is the output:

stories
Out[36]:
[b'KEEPS THE PLANETS IN ORBIT AROUND THE SUN GRAVITATIONAL PULL',
 b'the force that keeps planets in orbit around the sun and governs the rest of the motion in the solar system gravity',
 b'What is the force that keeps the planets in orbit ? Gravity of the sun',
 b'Explain how inertia and gravity work together to keep the planets in orbit . Inertia keeps planets moving around the sun as gravity keeps it close to the sun .',
 b'Which of the following correctly describe patterns of motion in the solar system ? Planets closer to the Sun move around their orbits at higher speed than planets farther from the Sun . All the planets ( not counting Pluto ) have nearly circular orbits . All the planets ( not counting Pluto ) orbit the Sun in nearly the same plane .',
 b'the Sun and planets and the other objects that orbit the Sun solar system',
 b'contains the sun , the planets , moon , and small objects that orbit the sun solar system',
 b'the Sun and the planets and other objects that orbit the Sun solar system',
 b'The sun , planets and all other objects that orbit around the sun Solar System',
 b'the Sun , planets , moons , and other objects that orbit the Sun solar system']

The problem may be in this line of code:

story = " ".join(find_stories_for_question(question))

The function returns the stories as bytes objects, while your code treats them as strings. In general your code is right, but this detail needs to be revised. Thank you for everything; I have learned a lot from your code.


ynuwm, Aug 17 '17 02:08

Thanks for debugging! Might be worth doing [x.decode("utf8") for x in stories] to convert to string. I don't have access to the data anymore, so can't test, but since you have data, maybe check it out and send me a PR I can apply to the code?
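A minimal sketch of that conversion, untested against the real data. The helper `join_stories` is hypothetical and not part of the repository; it decodes any bytes items before joining, since on Python 3 `" ".join(...)` raises a TypeError when given bytes items:

```python
def join_stories(stories, encoding="utf8"):
    """Decode any bytes items to str, then join them into one story string."""
    decoded = [s.decode(encoding) if isinstance(s, bytes) else s for s in stories]
    return " ".join(decoded)


if __name__ == "__main__":
    # two shortened, made-up items in the same shape as the reported output
    stories = [b"the Sun and planets solar system", b"What is the force ? Gravity"]
    print(join_stories(stories))
```

Items that are already strings are passed through unchanged, so the same call should work whether the search results come back as bytes or as text.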

sujitpal, Aug 17 '17 17:08

Thank you, my solution is the same as yours: [x.decode("utf8") for x in stories]

ynuwm, Aug 18 '17 08:08