neural-editor
Why is training always killed without any error information?
Hi,
My training is always killed without any error message, as shown below.
uncomitted changes being stored as patches
New TrainingRun created at: /data/edit_runs/7
Optimized batches: reduced cost from 45709568 (naive) to 20758016 (0.545871533942% reduction).
Optimal (batch_size=1) would be 20741962.
Passed batching test
Streaming training examples: 6%|5 | 399/7032 [48:47<12:31:31, 6.80s/it]Killed
I am encountering this issue as well. Running with the edit_logp
config, the process is consistently killed at the same point with the following output:
[localhost] local: wc -l /data/yelp_dataset_large_split/train.tsv
Reading data file.: 20%|#############4
Reading data file.: 26%|#################3
Killed
The same issue is occurring with other configs as well.
I have the same issue. Training is consistently killed.
[localhost] local: wc -l /data/onebillion_split/train.tsv
Reading data file.: 17%|##############1 | 582582/3506331 [02:43<19:10:00, 42.37it/s]
Reading data file.: 17%|##############5 | 594704/3506331 [02:44<39:10, 1238.52it/s]
Killed
Looks like this is a memory issue. I ran it on my cluster and it ran fine.
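For anyone who wants to confirm that on their own machine: a bare "Killed" with no traceback is typically the kernel OOM killer. Below is a minimal diagnostic sketch, not part of neural-editor, that assumes Linux and that memory is indeed the cause. With a PID argument it polls that process's resident memory via /proc; with no arguments (after a killed run) it greps the kernel log for OOM entries. dmesg may need elevated permissions on some systems.

```python
# Minimal diagnostic sketch (not part of neural-editor), assuming Linux.
# With a PID argument it polls that process's resident memory via /proc;
# with no arguments it greps the kernel log for OOM-killer entries.
import subprocess
import sys
import time

def oom_messages():
    # The OOM killer logs lines such as "Out of memory: Kill process ...".
    # dmesg may require elevated permissions on some systems.
    log = subprocess.check_output(["dmesg"]).decode("utf-8", errors="replace")
    return [l for l in log.splitlines()
            if "Out of memory" in l or "oom-killer" in l]

def rss_mb(pid):
    # VmRSS in /proc/<pid>/status is reported in kilobytes.
    with open("/proc/{}/status".format(pid)) as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) / 1024.0
    return 0.0

if __name__ == "__main__":
    if len(sys.argv) > 1:
        pid = int(sys.argv[1])  # PID of the running training process
        while True:
            try:
                print("RSS: {:.0f} MB".format(rss_mb(pid)))
            except IOError:
                print("Process {} has exited.".format(pid))
                break
            time.sleep(30)
    else:
        # Run this after the process was killed.
        for msg in oom_messages():
            print(msg)
```

If the kernel log shows an OOM entry for the training process, the fix is more RAM (or a machine/cluster node with more memory), not a code change.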
@yamsgithub Hello, did you set up this project by running "run_docker.py"? Because of network issues I could not run it successfully, so I installed all the packages one by one, and now I am encountering a git-related issue like this:
Traceback (most recent call last):
File "textmorph/edit_model/main.py", line 34, in
It seems like a path problem. However, the issue still exists after I create the master folder in refs/heads/.
@Vonzpf Yes. I am following the instructions in the README and didn't have any issues. However, without a GPU the training has been running for 3 days now and is only about 36% complete (so a full run would take over a week on CPU), so I would recommend using GPUs; hopefully that is faster. This is on the one-billion-word dataset.
@yamsgithub Did you load any other modules besides PyTorch and Python when you ran the code on the cluster?
@luciay I just used the Docker image, which sets up all the dependencies. I didn't have to install anything else on my machine except Docker.
@luciay If you are running on a cluster, I would recommend creating a virtual environment and letting the Docker setup install all the packages in that environment.
@yamsgithub Thank you! I have luckily solved that problem. This project needs git to record the code's state. I had initialized the repo in my folder "/neural-editor/", but I forgot to add and commit the code, so I just needed to run "git add ." and "git commit" in "/neural-editor/" to solve the problem.
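For anyone else who hits the same traceback, a quick way to check whether the repository is in the state the run-tracking code expects (an initialized repo with at least one commit, so that HEAD resolves) is something like the sketch below. This is just an illustration, not project code; the repo path is the example folder from this thread.

```python
# Minimal check (not part of neural-editor): verify the repository has a
# commit that HEAD resolves to, which is what recording the code's state needs.
import subprocess

REPO = "/neural-editor/"  # example path from this thread; adjust as needed

def head_commit(repo):
    try:
        out = subprocess.check_output(["git", "-C", repo, "rev-parse", "HEAD"])
        return out.decode().strip()
    except subprocess.CalledProcessError:
        return None

if __name__ == "__main__":
    commit = head_commit(REPO)
    if commit is None:
        print("No commit found; run 'git add .' and 'git commit' in {}".format(REPO))
    else:
        print("HEAD is at commit {}".format(commit))
```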
@yamsgithub I spoke with @luciay and she shared her batch script, which runs on the Prince cluster with Singularity instead of Docker, on CPU. I then made some modifications so it runs with a GPU on the Prince cluster. You can see my fork here -> https://github.com/JackLangerman/neural-editor
Hope this helps people!