verl
fix the file lock issue
The previous FileLock in https://github.com/volcengine/verl/blob/c46f403479db5d7afca6388800503a3bfe393bf5/verl/utils/checkpoint/checkpoint_manager.py#L75 can raise errors when the given path is too long, for instance FileExistsError: [Errno 17] File exists or BlockingIOError: [Errno 11] Resource temporarily unavailable.
To fix this, the lock file name is now built from a hash of the path instead of the path itself, which keeps the name short and avoids the conflict.
With this part modified as follows, the issue can be avoided:
@staticmethod
def local_mkdir(path):
    if not os.path.isabs(path):
        working_dir = os.getcwd()
        path = os.path.join(working_dir, path)

    # Using hash value of path as lock file name to avoid long file name
    lock_filename = f"ckpt_{hash(path) & 0xFFFFFFFF:08x}.lock"
    lock_path = os.path.join(tempfile.gettempdir(), lock_filename)

    try:
        with FileLock(lock_path, timeout=60):  # Add timeout
            # make a new dir
            os.makedirs(path, exist_ok=True)
    except Exception as e:
        print(f"Warning: Failed to acquire lock for {path}: {e}")
        # Even if the lock is not acquired, try to create the directory
        os.makedirs(path, exist_ok=True)

    return path
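
One point that may be worth checking: Python's built-in hash() on a str is salted per interpreter process unless PYTHONHASHSEED is fixed, so workers that are started independently can compute different lock file names for the same path and then would not actually share a lock. A minimal sketch of a process-independent alternative, assuming a hashlib digest is acceptable here (stable_lock_path is a hypothetical helper name, not part of verl):

```python
import hashlib
import os
import tempfile


def stable_lock_path(path: str) -> str:
    """Hypothetical helper: map a directory path to a short lock file path."""
    # Normalize to an absolute path so relative and absolute spellings agree.
    abs_path = os.path.abspath(path)
    # hashlib digests do not depend on PYTHONHASHSEED, so every process
    # derives the same name for the same path.
    digest = hashlib.sha256(abs_path.encode("utf-8")).hexdigest()[:8]
    # Same naming scheme as the snippet above: ckpt_<8 hex chars>.lock
    return os.path.join(tempfile.gettempdir(), f"ckpt_{digest}.lock")
```

The returned path could be passed to FileLock in place of lock_path; the file name stays at a fixed length no matter how long the checkpoint path is.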