Cannot resume in offline mode due to lack of `sys/id` field

Open wjaskowski opened this issue 4 years ago • 31 comments

import neptune.new as neptune
run = neptune.init(mode='offline')
run.sync()
run.wait()
rid = run['sys/id'].fetch()
run = neptune.init(mode='offline', run=rid)
rid = run['sys/id'].fetch()

ends up with:

offline/1b7c5e70-695d-4d1c-8587-a5ca2e3d222c
Traceback (most recent call last):
  File "err4.py", line 5, in <module>
    run.sync()
  File "/home/wojciech/miniconda3/envs/nori/lib/python3.8/site-packages/neptune/new/run.py", line 453, in sync
    attributes = self._backend.get_attributes(self._uuid)
  File "/home/wojciech/miniconda3/envs/nori/lib/python3.8/site-packages/neptune/new/internal/backends/offline_neptune_backend.py", line 42, in get_attributes
    raise NeptuneOfflineModeFetchException
neptune.new.exceptions.NeptuneOfflineModeFetchException: 

----NeptuneOfflineModeFetchException---------------------------------------------------

It seems you are trying to fetch data from the server, while working in an offline mode.
You need to work in non-offline connection mode to fetch data from the server.

The thing is that I am not trying to fetch data from the server, but from the run, wherever it stores its data.

wjaskowski avatar May 28 '21 11:05 wjaskowski

(I've removed my previous comment)

@wjaskowski initially we didn't plan to enable resuming runs in offline mode. If I may ask, why do you need to resume an offline run? Are you working with a multiprocessing / multi-script setup, or is there a time gap between the execution of the script and its resumption?

Herudaio avatar Jun 07 '21 12:06 Herudaio

The truth is that I just wanted to use resuming in debug mode, which initially did not work for me, so I tried offline mode, which also failed.

wjaskowski avatar Jun 07 '21 12:06 wjaskowski

Switching from spreadsheets to Neptune.ai and How it Pushed...

Diagrama3 avatar Oct 04 '22 06:10 Diagrama3

Hi @Diagrama3

How can I help you?

Blaizzy avatar Oct 04 '22 09:10 Blaizzy

@Blaizzy I would also like to be able to resume an init_project in debug mode for testing purposes. Can this be achieved?

ljstrnadiii avatar Dec 12 '22 16:12 ljstrnadiii

Hi @ljstrnadiii,

Thanks for reaching out.

Yes, it is.

Example:

import neptune.new as neptune
project = neptune.init_project(mode="debug")

Docs: https://docs.neptune.ai/api/neptune/#init_project

Blaizzy avatar Dec 12 '22 20:12 Blaizzy

@Blaizzy , I tried to stop and init_project again in a separate process, but the key was not present.

ljstrnadiii avatar Dec 12 '22 21:12 ljstrnadiii

@ljstrnadiii by key you mean api_token, right?

If so, you can read more about setting your api_token here: https://docs.neptune.ai/setup/setting_api_token/

Blaizzy avatar Dec 13 '22 11:12 Blaizzy

Hey there! Just checking in to see if you still need help with this or if you need help with anything else. Feel free to drop me a message. 😊

Blaizzy avatar Dec 17 '22 09:12 Blaizzy

@Blaizzy thanks for checking in. What I want to do is use debug mode in two separate processes:

# in one process
import neptune.new as neptune
project = neptune.init_project(mode="debug")
project['key1'] = 1
project.stop()

# then in another process (a test script)
import neptune.new as neptune
project = neptune.init_project(mode="debug")
assert project['key1'] == 1
project.stop()

but this is not possible from what I understand (even though it seems some files get written to tmp somewhere).

ljstrnadiii avatar Dec 18 '22 15:12 ljstrnadiii

In debug mode, no data is stored or sent anywhere. Docs: https://docs.neptune.ai/api/connection_modes/

For the use case you want to test, currently, you have to log metadata to Neptune servers in async or sync mode.

But I can definitely see your point and I'll submit your comment as a feature request to the product team.

Blaizzy avatar Dec 19 '22 12:12 Blaizzy
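
For anyone arriving here with the same two-process test setup: since debug mode keeps nothing between processes, one option for purely local tests is a small file-backed stand-in. The sketch below is not part of Neptune. `FileBackedStore` and its JSON file are hypothetical names introduced for illustration, and the class only imitates item assignment, lookup, and `stop()`; the second process is simulated by creating a fresh object.

```python
import json
import tempfile
from pathlib import Path


class FileBackedStore:
    """Hypothetical test double: a tiny key/value store persisted as JSON.

    Not a Neptune class; it only mimics item assignment and lookup.
    """

    def __init__(self, path):
        self._path = Path(path)
        # Load whatever an earlier process left behind, if anything.
        if self._path.exists():
            self._data = json.loads(self._path.read_text())
        else:
            self._data = {}

    def __setitem__(self, key, value):
        self._data[key] = value

    def __getitem__(self, key):
        return self._data[key]

    def stop(self):
        # Flush to disk so that a later process can read the values.
        self._path.write_text(json.dumps(self._data))


# "First process": write a value and stop.
store_path = Path(tempfile.mkdtemp()) / "project.json"
writer = FileBackedStore(store_path)
writer["key1"] = 1
writer.stop()

# "Second process" (simulated here by a fresh object): read the value back.
reader = FileBackedStore(store_path)
assert reader["key1"] == 1
reader.stop()
```

This only covers tests that need values to survive a process boundary; anything that depends on real Neptune behavior still needs a run or project logged in async or sync mode, as described above.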

Hey @ljstrnadiii!

Just checking in to see if you still need help with this or if you need help with anything else. Feel free to drop me a message. 😊

Blaizzy avatar Dec 21 '22 14:12 Blaizzy

@Blaizzy that is what I thought. We test in debug mode and use a neptune run in debug mode as a fixture where we can and that works well, but for some e2e tests, we can only pass a reference to a neptune run or project location. We have created a tests project in neptune for our e2e tests to keep things isolated a bit.

Thanks for the clarification!

ljstrnadiii avatar Dec 22 '22 13:12 ljstrnadiii
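
The split described above (debug mode for most tests, a dedicated tests project for e2e tests) could be captured in a small helper along these lines. This is a minimal sketch: the helper name `neptune_init_kwargs` and the project name `my-workspace/tests` are placeholders made up for illustration, and it assumes the returned dictionary is passed to `neptune.init_run`, which accepts `project` and `mode` keyword arguments.

```python
def neptune_init_kwargs(e2e, tests_project="my-workspace/tests"):
    """Return keyword arguments for neptune.init_run.

    Hypothetical helper: unit tests get debug mode (nothing is stored or
    sent), e2e tests log to an isolated tests project in async mode.
    """
    if e2e:
        return {"project": tests_project, "mode": "async"}
    return {"mode": "debug"}


# Intended use (not executed here, requires the neptune package):
#   run = neptune.init_run(**neptune_init_kwargs(e2e=False))
print(neptune_init_kwargs(e2e=False))
print(neptune_init_kwargs(e2e=True))
```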

It's my pleasure :)

You are most welcome @ljstrnadiii!

Your solution is quite interesting, and I would love to learn more about it if you don't mind. I think it could provide us with valuable insight that we can incorporate into the product.

Let me know what you think

Blaizzy avatar Dec 27 '22 13:12 Blaizzy

The ability to resume offline runs would be very useful. Many people train their models on commercial GPU servers, which often impose a maximum running time per session; for example, Kaggle's limit is 12 hours, so we have to split the training into several parts. Training is faster in offline mode, so offline mode is preferred. When the work is done, the offline data is uploaded to the Neptune server.

In my code: run = neptune.init_run(mode="offline", custom_run_id='test-offline', ....)

Neptune writes several offline runs to the .neptune directory. I then use the command: neptune sync --path .neptune --project aaa/bbb --offline-only

It executes without errors, but only the last run is displayed on the website. It seems the last run overwrites the previous ones.

bg4xsd avatar Jan 22 '23 07:01 bg4xsd

Hi @bg4xsd

Thanks for reaching out and sharing your use case!

I have also passed it as feedback to the product team.

Regarding your code, I notice that you are using the custom_run_id argument in offline mode. Currently, offline runs have no sys/id; consequently, custom_run_id doesn't work.

Each time you run that script and then use the neptune sync CLI command, it will create a separate run.

But I can see your point; thanks to your feedback and others, we can now start thinking of a potential solution to this use case.

Blaizzy avatar Jan 23 '23 09:01 Blaizzy

Hi @Blaizzy, thanks for your quick response. For university students, GPU servers in the lab are always in short supply: training a neural network is time-consuming, and the training process is often terminated by other students, so I think resuming offline runs would be useful and popular :-). Also, TensorBoard's graphs and tables are ugly and low resolution, so they cannot be used directly in a thesis. Neptune's beautiful diagrams are welcome, and its export function is very easy to use.
For many years I had to draw, compare, and adjust graphs manually, and I am going to move from TensorBoard to Neptune this year. Keep it up and have a nice day.

bg4xsd avatar Jan 23 '23 10:01 bg4xsd

Most welcome and thank you for your kind words!

I'm happy you enjoy using Neptune as much as we love making it for you :)

Blaizzy avatar Jan 23 '23 12:01 Blaizzy

I will let you know here once the feature is released.

Other than that, is there anything else I could help you with?

Blaizzy avatar Jan 23 '23 12:01 Blaizzy

Hi @Blaizzy

Hope to hear from you soon. For now, no more questions.

Anyway, thank you again.

bg4xsd avatar Jan 23 '23 12:01 bg4xsd

Perfect, have a great week! :)

Blaizzy avatar Jan 23 '23 13:01 Blaizzy

Hi @Blaizzy ! Is this feature still on the radar? We train on cloud instances that somewhat frequently get interrupted. This prevents us from using offline mode, as we can not resume the same run in offline mode.

wouterzwerink avatar Jun 02 '23 08:06 wouterzwerink

Hi @wouterzwerink

This feature is on the radar. However, at the moment, we don't have an ETA for it.

Could you share the tracebacks for the times your training gets interrupted?

Blaizzy avatar Jun 02 '23 13:06 Blaizzy

Hi @wouterzwerink ,

Do you still need help with this?

Blaizzy avatar Jun 05 '23 13:06 Blaizzy

Offline resume is useful for offline logging. Online mode slows down long training runs. On cloud GPU services such as Kaggle and Google Colab, training is interrupted every 10~12 hours, so an offline resume function would be valuable.

bg4xsd avatar Jun 06 '23 01:06 bg4xsd

@bg4xsd

I understand.

Could you share the tracebacks for the times your training gets interrupted?

Blaizzy avatar Jun 06 '23 15:06 Blaizzy

@Blaizzy I seem to have missed your question, sorry! The training interruptions are not due to neptune at all! The interruptions are from using spot instances. We train with fault tolerance, so the training continues after the interruption. However, to keep neptune fault tolerant, we have to use async mode instead of offline mode. So I don't need help with this, but thanks for asking! Looking forward to this feature once it is complete

wouterzwerink avatar Jul 10 '23 20:07 wouterzwerink
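
The async-mode approach mentioned above, where an interrupted job reattaches to the same run, might look roughly like the sketch below. It rests on several assumptions: the run ID is kept in a local file (`id_file`, a hypothetical convention), the run is reopened with the `with_id` argument of `neptune.init_run` as found in Neptune 1.x (the 2021 snippet at the top of this thread used `run=` instead), and `init_run` is injected so the example runs without Neptune installed. `FakeRun` is a made-up stand-in, not a Neptune class.

```python
import tempfile
from pathlib import Path


def resume_or_start(id_file, init_run):
    """Resume the run whose ID is stored in id_file, or start a new one.

    init_run is injected for testability; in real use it would be
    neptune.init_run (assumption: it accepts with_id and mode).
    """
    id_file = Path(id_file)
    if id_file.exists():
        # Later segment: reopen the same run by its server-side ID.
        return init_run(with_id=id_file.read_text().strip(), mode="async")
    # First segment: create the run and remember its ID for next time.
    run = init_run(mode="async")
    id_file.write_text(run["sys/id"].fetch())
    return run


class _FakeField:
    def __init__(self, value):
        self._value = value

    def fetch(self):
        return self._value


class FakeRun:
    """Made-up stand-in for a Neptune run, used only to exercise the sketch."""

    _counter = 0

    def __init__(self, with_id=None, mode="async"):
        if with_id is None:
            # A brand-new run gets a fresh ID.
            FakeRun._counter += 1
            with_id = f"PROJ-{FakeRun._counter}"
        self.run_id = with_id

    def __getitem__(self, key):
        if key != "sys/id":
            raise KeyError(key)
        return _FakeField(self.run_id)


id_path = Path(tempfile.mkdtemp()) / "neptune_run_id.txt"
first = resume_or_start(id_path, FakeRun)   # creates a run, stores its ID
second = resume_or_start(id_path, FakeRun)  # reattaches to the stored ID
assert first.run_id == second.run_id
```

The ID file has to live somewhere that survives the interruption (for example, next to the training checkpoints), otherwise every restart creates a new run.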

@wouterzwerink great to hear!

If anything pops up feel free to let me know. I'll be happy to help :)

Blaizzy avatar Jul 11 '23 09:07 Blaizzy

I am interested in this feature. It'd be very useful for multi-script programs.

pprobst avatar Feb 15 '24 17:02 pprobst