fix the file lock issue

Open VPeterV opened this issue 2 weeks ago • 2 comments

The current FileLock usage in https://github.com/volcengine/verl/blob/c46f403479db5d7afca6388800503a3bfe393bf5/verl/utils/checkpoint/checkpoint_manager.py#L75 can raise errors when the given path is too long. One fix is to name the lock file after a hash of the path instead of the path itself, which keeps the name short and avoids the conflict.

For instance: FileExistsError: [Errno 17] File exists, or BlockingIOError: [Errno 11] Resource temporarily unavailable.

Changing local_mkdir as follows avoids the issue:

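    @staticmethod
    def local_mkdir(path):
        if not os.path.isabs(path):
            working_dir = os.getcwd()
            path = os.path.join(working_dir, path)

        # Use a hash of the path as the lock file name to avoid overly long file names.
        # hashlib (needs `import hashlib`) replaces the built-in hash(), whose value for
        # strings is salted per process by default, so workers would not share a lock.
        path_digest = hashlib.sha256(path.encode("utf-8")).hexdigest()[:8]
        lock_filename = f"ckpt_{path_digest}.lock"
        lock_path = os.path.join(tempfile.gettempdir(), lock_filename)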
        
        try:
            with FileLock(lock_path, timeout=60):  # Add timeout
                # make a new dir
                os.makedirs(path, exist_ok=True)
        except Exception as e:
            print(f"Warning: Failed to acquire lock for {path}: {e}")
            # Even if the lock is not acquired, try to create the directory
            os.makedirs(path, exist_ok=True)

        return path
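
The lock-name derivation can be isolated and checked on its own. Below is a minimal sketch, not the project's implementation: the helper name `hashed_lock_path` and the `prefix` parameter are hypothetical. It uses `hashlib` rather than the built-in `hash()`, because string hashes are randomized per interpreter process by default (unless `PYTHONHASHSEED` is fixed), so separately launched workers could otherwise compute different lock names for the same checkpoint path.

```python
import hashlib
import os
import tempfile


def hashed_lock_path(path: str, prefix: str = "ckpt_") -> str:
    """Return a short, fixed-length lock file path for `path` (hypothetical helper)."""
    # Normalize to an absolute path so relative and absolute spellings agree.
    abs_path = os.path.abspath(path)
    # sha256 gives the same digest in every process, unlike hash() on str.
    digest = hashlib.sha256(abs_path.encode("utf-8")).hexdigest()[:8]
    return os.path.join(tempfile.gettempdir(), f"{prefix}{digest}.lock")


# A very long checkpoint directory still maps to a short lock file name.
long_dir = os.path.join("checkpoints", "x" * 300)
lock_path = hashed_lock_path(long_dir)
print(os.path.basename(lock_path))
```

The lock file name is always 18 characters long ("ckpt_", 8 hex digits, ".lock"), regardless of how long the checkpoint path is. Truncating the digest to 8 hex digits keeps the same name length as the snippet above, at the cost of a small chance that two different paths share a lock, which would only serialize their directory creation.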

VPeterV, Feb 12 '25 09:02