[BUG] swanlab sync error

Open hm1229 opened this issue 6 months ago • 14 comments

🐛 Bug description [Please make everyone to understand it]

When I try to sync an offline log, it sometimes raises this error:

    swanlab/data/porter/datastore.py", line 126, in scan
        assert pad == pad_check, "invalid padding"
               ^^^^^^^^^^^^^^^^
    AssertionError: invalid padding

The error occurs intermittently; most of the time the sync works.

🧑‍💻 Step to reproduce

swanlab sync ~/swanlog/xxx/

👾 Expected result

Write down the results you expect

🚑 Any additional [like screenshots]

  • SwanLab Version: swanboard 0.1.8b1 swankit 0.2.4 swanlab 0.6.8
  • Platform: Ubuntu 20.04

hm1229, Sep 19 '25 07:09

Did you perform a sync operation during the experiment run?

SAKURA-CAT, Sep 19 '25 08:09

Yes, but that has always worked. Is it what causes the error?

hm1229, Sep 20 '25 04:09

> Yes, but that has always worked. Is it what causes the error?

Yes. We do allow this operation, so the bug is on our side, and I'll fix it soon.

SAKURA-CAT, Sep 21 '25 12:09

Thanks a lot. There's another behavior that hurts the experience: when I sync the same experiment twice, two experiment IDs are created on the web. It would be better to have just one 😊

hm1229, Sep 21 '25 12:09

> Thanks a lot. There's another behavior that hurts the experience: when I sync the same experiment twice, two experiment IDs are created on the web. It would be better to have just one 😊

Indeed, this stems from a flaw in our initial design, which we plan to address in the future. Could you open a new issue so we can track it?

SAKURA-CAT, Sep 21 '25 12:09

Sure. I'm also wondering whether there's a quick way to sync the broken experiment, because a training session takes a long time.

hm1229, Sep 22 '25 00:09

> Sure. I'm also wondering whether there's a quick way to sync the broken experiment, because a training session takes a long time.

If you use sync with --id, you can resume the training session.

docs: https://docs.swanlab.cn/api/cli-swanlab-sync.html#swanlab-sync

SAKURA-CAT, Sep 23 '25 03:09

I tried swanlab sync ./swanlog/run-xxx --id, but it still raises:

    python3.11/site-packages/swanlab/data/porter/datastore.py", line 126, in scan
        assert pad == pad_check, "invalid padding"
               ^^^^^^^^^^^^^^^^
    AssertionError: invalid padding

Should I upgrade swanlab, or do something else?

hm1229, Sep 23 '25 04:09

> but it still raises:

I tried to resolve it in #199. Here's a whl package for you:

swanlab-0.6.11b0-py3-none-any.whl.zip

Extract it, then install it with the following command:

    pip install swanlab-0.6.11b0-py3-none-any.whl

With luck, the error will no longer appear.

SAKURA-CAT, Sep 23 '25 06:09

Does it work for a log that is already broken, or only for new logs?

For the earlier log it still doesn't work:

    File "/lib/python3.11/site-packages/swanlab/sync/__init__.py", line 65, in sync
      proj, exp = porter.parse()
                  ^^^^^^^^^^^^^^
    File "lib/python3.11/site-packages/swanlab/data/porter/__init__.py", line 54, in wrapper
      return wrapped(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^
    File "lib/python3.11/site-packages/swanlab/data/porter/__init__.py", line 403, in parse
      for record in self._f:
    File "lib/python3.11/site-packages/swanlab/data/porter/datastore.py", line 157, in __next__
      record = self.scan()
               ^^^^^^^^^^^
    File "lib/python3.11/site-packages/swanlab/data/porter/datastore.py", line 126, in scan
      assert pad == pad_check, "invalid padding"
             ^^^^^^^^^^^^^^^^
    AssertionError: invalid padding

    $ pip list | grep swanlab
    swanlab 0.6.11b0

hm1229, Sep 23 '25 06:09

Sorry, I sent the wrong version. It's actually this one:

swanlab-0.6.11b1-py3-none-any.whl.zip

It works for broken logs. Give it a try!

SAKURA-CAT, Sep 23 '25 08:09

Well, the sync now seems to complete, but I trained for 500 steps and the log ends at step 210. Why did the rest of the log disappear? I use verl for training, and I don't think this is the RL framework's fault.

hm1229, Sep 23 '25 09:09

> Well, the sync now seems to complete, but I trained for 500 steps and the log ends at step 210. Why did the rest of the log disappear? I use verl for training, and I don't think this is the RL framework's fault.

The issue is probably related to the line assert pad == pad_check, "invalid padding", although in principle that check should never fail. Could you package the problematic log files and send them to my email at [email protected]? That may let me debug and identify the issue.

SAKURA-CAT, Sep 23 '25 09:09

Logging a related issue

> Sorry, I sent the wrong version. It's actually this one:
>
> swanlab-0.6.11b1-py3-none-any.whl.zip
>
> It works for broken logs. Give it a try!

  • Problem: After the training process was killed unexpectedly (e.g., due to a server crash), the log file was not closed properly, leaving incomplete data. This causes the swanlab sync command to fail when it tries to upload the experiment.

  • The fix: The upload worked after I installed and used a specific test version: swanlab-0.6.11b1-py3-none-any.whl.zip

  • SwanLab Version: swanboard 0.1.8b1 swankit 0.2.4 swanlab 0.6.8

  • Platform: Ubuntu 20.04
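
To illustrate the failure mode described above, here is a minimal Python sketch of a reader that tolerates a log cut off mid-write. It is not SwanLab's actual on-disk format or its fix: the record layout (a length plus CRC32 header followed by the payload) and the names HEADER, write_record, read_records, and demo are all hypothetical, chosen only to show the general idea of stopping at the first damaged record instead of raising.

```python
import io
import struct
import zlib

# Hypothetical record header (an assumption, not SwanLab's format):
# 4-byte payload length followed by 4-byte CRC32 of the payload.
HEADER = struct.Struct("<II")


def write_record(stream, payload: bytes) -> None:
    """Append one record: the 8-byte header, then the payload."""
    stream.write(HEADER.pack(len(payload), zlib.crc32(payload)))
    stream.write(payload)


def read_records(stream):
    """Yield payloads until the stream ends or a damaged record is found.

    Rather than raising on the first inconsistency, the reader stops,
    so every record written before the damage is still returned.
    """
    while True:
        header = stream.read(HEADER.size)
        if len(header) < HEADER.size:
            return  # clean end of file, or a header cut off mid-write
        length, checksum = HEADER.unpack(header)
        payload = stream.read(length)
        if len(payload) < length or zlib.crc32(payload) != checksum:
            return  # truncated or corrupted record: stop reading here
        yield payload


def demo():
    """Write five records, drop the last 3 bytes, and read back what survives."""
    buf = io.BytesIO()
    for step in range(5):
        write_record(buf, f"step={step}".encode())
    torn = buf.getvalue()[:-3]  # simulate a process killed mid-write
    return [p.decode() for p in read_records(io.BytesIO(torn))]


if __name__ == "__main__":
    print(demo())
```

In this sketch, a reader that stops at the first bad record returns everything before the damage and nothing after it. If the real reader behaves similarly, that could be one explanation for a sync that completes but ends earlier than the training run did, as reported earlier in this thread.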

Special thanks to @SAKURA-CAT for the prompt response and support in identifying this solution.

Wentap123, Oct 20 '25 05:10