neptune-client
Cannot resume in offline mode due to lack of `sys/id` field
import neptune.new as neptune
run = neptune.init(mode='offline')
run.sync()                                    # raises NeptuneOfflineModeFetchException (traceback below)
run.wait()
rid = run['sys/id'].fetch()
run = neptune.init(mode='offline', run=rid)   # intended: resume the offline run by its sys/id
rid = run['sys/id'].fetch()
ends up with:
offline/1b7c5e70-695d-4d1c-8587-a5ca2e3d222c
Traceback (most recent call last):
File "err4.py", line 5, in <module>
run.sync()
File "/home/wojciech/miniconda3/envs/nori/lib/python3.8/site-packages/neptune/new/run.py", line 453, in sync
attributes = self._backend.get_attributes(self._uuid)
File "/home/wojciech/miniconda3/envs/nori/lib/python3.8/site-packages/neptune/new/internal/backends/offline_neptune_backend.py", line 42, in get_attributes
raise NeptuneOfflineModeFetchException
neptune.new.exceptions.NeptuneOfflineModeFetchException:
----NeptuneOfflineModeFetchException---------------------------------------------------
It seems you are trying to fetch data from the server, while working in an offline mode.
You need to work in non-offline connection mode to fetch data from the server.
The thing is, I'm not trying to fetch data from the server, but from the run itself, wherever it stores its data.
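For contrast, the same resume flow does work in a connection mode that can reach the server. A minimal sketch, assuming the project and API token are configured via environment variables; the id value is illustrative:
import neptune.new as neptune

# in a mode that talks to the server (async is the default), sys/id can be
# fetched and passed back later to resume the same run
run = neptune.init()                  # mode="async" by default
run.wait()                            # make sure sys/id has been assigned
rid = run['sys/id'].fetch()           # e.g. "PROJ-123"
run.stop()

run = neptune.init(run=rid)           # resumes the existing run
assert run['sys/id'].fetch() == rid
run.stop()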
(I've removed my previous comment)
@wjaskowski initially we didn't plan to enable resuming runs in offline mode. If I may ask, why do you need to resume an offline run? Are you working with a multiprocessing / multi-script setup, or is there a time break between the execution of the script and its resume?
The truth is that I just wanted to use resuming in debug mode, which initially did not work for me, so I tried offline mode, which also failed.
Switching from spreadsheets to Neptune.ai and How it Pushed...
Hi @Diagrama3
How can I help you?
@Blaizzy I would also like to be able to resume an init_project in debug mode for testing purposes. Can this be achieved?
Hi @ljstrnadiii,
Thanks for reaching out.
Yes, it can.
Example:
import neptune.new as neptune
project = neptune.init_project(mode="debug")
Docs: https://docs.neptune.ai/api/neptune/#init_project
@Blaizzy, I tried to stop and then call init_project again in a separate process, but the key was not present.
@ljstrnadiii by key you mean api_token, right?
If so, you can read more about setting your api_token here: https://docs.neptune.ai/setup/setting_api_token/
Hey there! Just checking in to see if you still need help with this or if you need help with anything else. Feel free to drop me a message. 😊
@Blaizzy thanks for checking in. What I want to do is use debug mode in two separate processes:
# in one process
import neptune.new as neptune
project = neptune.init_project(mode="debug")
project['key1'] = 1
project.stop()
# then in another process (a test script)
import neptune.new as neptune
project = neptune.init_project(mode="debug")
assert project['key1'].fetch() == 1   # expected: read back the value logged by the first process
project.stop()
but this is not possible from what I understand (even though it seems some files get written to tmp somewhere).
In debug mode, no data is stored or sent anywhere. Docs: https://docs.neptune.ai/api/connection_modes/
For the use case you want to test, currently, you have to log metadata to Neptune servers in async or sync mode.
But I can definitely see your point and I'll submit your comment as a feature request to the product team.
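A minimal sketch of that cross-process test in sync mode, assuming a placeholder project name and credentials taken from environment variables:
import neptune.new as neptune

# process 1: log to the Neptune servers in sync mode
project = neptune.init_project(name="my-workspace/tests", mode="sync")  # placeholder project name
project["key1"] = 1
project.stop()

# process 2 (a test script): reconnect and read the value back
project = neptune.init_project(name="my-workspace/tests", mode="sync")
assert project["key1"].fetch() == 1
project.stop()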
Hey @ljstrnadiii!
Just checking in to see if you still need help with this or if you need help with anything else. Feel free to drop me a message. 😊
@Blaizzy that is what I thought. We test in debug mode and use a Neptune run in debug mode as a fixture where we can, and that works well, but for some e2e tests we can only pass a reference to a Neptune run or project location. We have created a tests project in Neptune for our e2e tests to keep things somewhat isolated.
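For reference, a minimal sketch of the debug-mode fixture pattern described above, assuming pytest; the fixture and test names are illustrative:
import neptune.new as neptune
import pytest

@pytest.fixture
def neptune_run():
    # debug mode: nothing is stored or sent anywhere, which keeps unit tests isolated
    run = neptune.init_run(mode="debug")
    yield run
    run.stop()

def test_logging(neptune_run):
    # code under test can log freely against the debug run
    neptune_run["metrics/accuracy"] = 0.9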
Thanks for the clarification!
It's my pleasure :)
You are most welcome @ljstrnadiii!
Your solution is quite interesting, and I would love to learn more about it if you don't mind. I think it could provide us with valuable insight that we can incorporate into the product.
Let me know what you think
The ability to resume offline runs would be very useful. Many people use commercial GPU servers to train their models, and these servers often have a maximum running time for a single job; for example, Kaggle's limit is 12 hours, so we have to split the training into several parts. Offline mode is preferred because logging is faster, and when the work is done, the offline training data is uploaded to the Neptune server.
For my code: run = neptune.init_run(mode="offline", custom_run_id='test-offline', ...)
Neptune generates several offline outputs in the .neptune directory. I use the command: neptune sync --path .neptune --project aaa/bbb --offline-only
It executes OK, but only the last run is displayed on the website. It seems the last run overwrites the prior one.
Hi @bg4xsd
Thanks for reaching out and sharing your use case!
I have also passed it as feedback to the product team.
Regarding your code, I notice that you are using the custom_run_id argument in offline mode. Currently, offline runs have no sys/id; consequently, custom_run_id doesn't work.
Each time you run that script and then use the neptune sync CLI command, it will create a separate run.
But I can see your point; thanks to your feedback and others, we can now start thinking of a potential solution to this use case.
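A minimal sketch of that workflow, reusing the placeholder project aaa/bbb from the command above: each execution of the script writes a new offline folder under .neptune, and each synced folder appears as its own run.
# train.py -- executed once per training session
import neptune.new as neptune

run = neptune.init_run(mode="offline")   # custom_run_id would have no effect here
run["train/loss"].log(0.42)
run.stop()

# afterwards, upload the stored data from the command line:
#   neptune sync --path .neptune --project aaa/bbb --offline-only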
Hi @Blaizzy ,
Thanks for your quick response.
For students at university, GPU servers in the lab are always in short supply because training a neural network is time-consuming, and the training process often gets terminated by other students, so I think a resume-offline-run function would be useful and popular :-).
Furthermore, TensorBoard's graphs and tables are low resolution and can't be used directly in a thesis. Neptune's diagrams look great and its export function is very easy to use.
For many years I had to draw, compare, and adjust graphs manually; now I am going to move from TensorBoard to Neptune this year.
Have a nice day.
Most welcome and thank you for your kind words!
I'm happy you enjoy using Neptune as much as we love making it for you :)
I will let you know here once the feature is released.
Other than that, is there anything else I could help you with?
Hi @Blaizzy
Hope to hear from you soon. For now, no more questions.
Anyway, thank you again.
Perfect, have a great week! :)
Hi @Blaizzy! Is this feature still on the radar? We train on cloud instances that get interrupted fairly frequently. This prevents us from using offline mode, as we cannot resume the same run in offline mode.
Hi @wouterzwerink
This feature is on the radar. However, at the moment, we don't have an ETA for it.
Could you share the tracebacks for the times your training gets interrupted?
Hi @wouterzwerink ,
Do you still need help with this?
Offline resume is useful for offline logging. Using online mode slows down long training runs. With cloud GPU services such as Kaggle and Google Colab, the training procedure is interrupted every 10~12 hours, so an offline resume function would be meaningful.
@bg4xsd
I understand.
Could you share the tracebacks for the times your training gets interrupted?
@Blaizzy I seem to have missed your question, sorry! The training interruptions are not due to Neptune at all! They come from using spot instances. We train with fault tolerance, so training continues after the interruption. However, to keep Neptune fault tolerant, we have to use async mode instead of offline mode. So I don't need help with this, but thanks for asking! Looking forward to this feature once it is complete.
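For completeness, a minimal sketch of the fault-tolerant setup described above: in async mode a stable custom_run_id lets each restart after a spot interruption attach to the same run (the id value is illustrative), which is exactly what offline mode cannot do today.
import neptune.new as neptune

# every restart of the training job passes the same custom_run_id,
# so logging continues into the same run after a spot interruption
run = neptune.init_run(
    mode="async",
    custom_run_id="spot-job-2023-01-15",   # illustrative stable id, e.g. derived from the job name
)
run["params/lr"] = 1e-3
run.stop()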
@wouterzwerink great to hear!
If anything pops up feel free to let me know. I'll be happy to help :)
I am interested in this feature. It'd be very useful for multi-script programs.