openfl
Problem with Openfl Gramine "error: PAL failed at ../../Pal/src/db_main.c:pal_main:513 (exitcode = 4, reason=No 'loader.entrypoint' is specified in the manifest)"
Describe the bug
I am attempting to run the FL example as given in the manual and am getting this error on the aggregator.
Gramine is starting. Parsing TOML manifest file, this may take some time...
Detected a huge manifest, preallocating 64MB of internal memory.
-----------------------------------------------------------------------------------------------------------------------
Gramine detected the following insecure configurations:
- loader.insecure__use_cmdline_argv = true (forwarding command-line args from untrusted host to the app)
- sgx.allowed_files = [ ... ] (some files are passed through from untrusted host without verification)
Gramine will continue application execution, but this configuration must not be used in production!
-----------------------------------------------------------------------------------------------------------------------
error: PAL failed at ../../Pal/src/db_main.c:pal_main:513 (exitcode = 4, reason=No 'loader.entrypoint' is specified in the manifest)
Also, step 7 in the manual is presented in a somewhat vague manner for a first-time user. I used the setup given here as a workspace and template, but doing so produced the above error when I tried to start the federation on the aggregator machine.
Hey! Thanks for reporting this. This issue was addressed, and I believe it shouldn't be present in the next OpenFL release.
For now, you can try installing openfl from the develop branch:
git clone https://github.com/intel/openfl.git && cd openfl && pip install -e .
But you will need to rebuild your Docker images; you can remove the old ones, or just pass --rebuild to the graminize command.
I agree with your note regarding step 7 in the manual, but that would also be the wrong place to explain the certification process. To make your life easier, try using the automatically generated certificates from this script, as the manual suggests.
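For context, the manifest key the original PAL error complains about is the one that points Gramine at its LibOS. In recent Gramine releases a manifest template typically contains lines like the following (a sketch based on Gramine's documented manifest syntax, not OpenFL's exact template; the fx path is a hypothetical example):

```toml
# The key the PAL error says is missing -- points Gramine at its LibOS:
loader.entrypoint = "file:{{ gramine.libos }}"

# The application started inside the enclave (path is a hypothetical example):
libos.entrypoint = "/usr/local/bin/fx"
```

If the template used to build the enclave predates the Gramine version installed in the image, the key can be missing or spelled differently, which matches the "No 'loader.entrypoint' is specified" failure above.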
I tried the latest build (openfl 1.4) but got a new error. To run openfl-gramine I had to install openfl 1.4 in the Docker image as well, which required changing the file "*venv/lib/python3.8/site-packages/openfl-gramine/Dockerfile.gramine", but after that I got a new error:
Traceback (most recent call last):
File "/usr/local/bin/fx", line 8, in <module>
sys.exit(entry())
File "/usr/local/lib/python3.8/site-packages/openfl/interface/cli.py", line 207, in entry
command_group = import_module(module, package)
File "/usr/local/lib/python3.8/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
File "<frozen importlib._bootstrap>", line 991, in _find_and_load
File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 843, in exec_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "/usr/local/lib/python3.8/site-packages/openfl/interface/director.py", line 17, in <module>
from openfl.component.director import Director
File "/usr/local/lib/python3.8/site-packages/openfl/component/director/__init__.py", line 6, in <module>
from .director import Director
File "/usr/local/lib/python3.8/site-packages/openfl/component/director/director.py", line 15, in <module>
from .experiment import Experiment
File "/usr/local/lib/python3.8/site-packages/openfl/component/director/experiment.py", line 14, in <module>
from openfl.federated import Plan
File "/usr/local/lib/python3.8/site-packages/openfl/federated/__init__.py", line 8, in <module>
from .task import TaskRunner # NOQA
File "/usr/local/lib/python3.8/site-packages/openfl/federated/task/__init__.py", line 14, in <module>
import tensorflow # NOQA
File "/usr/local/lib/python3.8/site-packages/tensorflow/__init__.py", line 41, in <module>
from tensorflow.python.tools import module_util as _module_util
File "/usr/local/lib/python3.8/site-packages/tensorflow/python/__init__.py", line 108, in <module>
from tensorflow.python.platform import test
File "/usr/local/lib/python3.8/site-packages/tensorflow/python/platform/test.py", line 24, in <module>
from tensorflow.python.framework import test_util as _test_util
File "/usr/local/lib/python3.8/site-packages/tensorflow/python/framework/test_util.py", line 37, in <module>
from absl.testing import parameterized
File "/usr/local/lib/python3.8/site-packages/absl/testing/parameterized.py", line 215, in <module>
from absl.testing import absltest
File "/usr/local/lib/python3.8/site-packages/absl/testing/absltest.py", line 225, in <module>
get_default_test_tmpdir(),
File "/usr/local/lib/python3.8/site-packages/absl/testing/absltest.py", line 163, in get_default_test_tmpdir
tmpdir = os.path.join(tempfile.gettempdir(), 'absl_testing')
File "/usr/local/lib/python3.8/tempfile.py", line 286, in gettempdir
tempdir = _get_default_tempdir()
File "/usr/local/lib/python3.8/tempfile.py", line 218, in _get_default_tempdir
raise FileNotFoundError(_errno.ENOENT,
FileNotFoundError: [Errno 2] No usable temporary directory found in ['/tmp', '/var/tmp', '/usr/tmp', '/workspace']
I have tried it with multiple templates and the error persists there as well.
Hi @gagandeep987123, could you specify the command which led to the error above? Just looking at the error, it seems either you don't have root permissions or there is no space left on the device (df -kh .).
Hi @mansishr, I am using a modified version of the test_graminize.sh script (please change .pdf to .sh): test_graminize.pdf
gsingh@sgx03:~$ df -kh
Filesystem Size Used Avail Use% Mounted on
udev 189G 0 189G 0% /dev
tmpfs 38G 2.7M 38G 1% /run
/dev/mapper/ubuntu--vg-ubuntu--lv 117G 92G 19G 84% /
tmpfs 189G 0 189G 0% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 189G 0 189G 0% /sys/fs/cgroup
/dev/nvme0n1p2 974M 372M 535M 42% /boot
/dev/nvme0n1p1 511M 5.3M 506M 2% /boot/efi
/dev/sda1 880G 28K 835G 1% /mnt/storage
/dev/loop0 56M 56M 0 100% /snap/core18/2560
/dev/loop2 62M 62M 0 100% /snap/core20/1611
/dev/loop1 56M 56M 0 100% /snap/core18/2566
/dev/loop3 64M 64M 0 100% /snap/core20/1623
/dev/loop4 68M 68M 0 100% /snap/lxd/22526
/dev/loop5 68M 68M 0 100% /snap/lxd/22753
/dev/loop6 47M 47M 0 100% /snap/snapd/16292
/dev/loop7 48M 48M 0 100% /snap/snapd/16778
tmpfs 38G 0 38G 0% /run/user/1171
Also, I made other changes to "*venv/lib/python3.8/site-packages/openfl-gramine/Dockerfile.gramine":
ARG BASE_IMAGE=python:3.8
FROM ${BASE_IMAGE}
SHELL ["/bin/bash", "-o", "pipefail", "-c"]
RUN pwd
WORKDIR /openfl
COPY openfl .
RUN pwd
RUN --mount=type=cache,target=/root/.cache/ \
pip install --upgrade pip && \
pip install .
WORKDIR /
RUN pwd
# install gramine
RUN curl -fsSLo /usr/share/keyrings/gramine-keyring.gpg https://packages.gramineproject.io/gramine-keyring.gpg && \
echo 'deb [arch=amd64 signed-by=/usr/share/keyrings/gramine-keyring.gpg] https://packages.gramineproject.io/ stable main' | \
tee /etc/apt/sources.list.d/gramine.list
RUN --mount=type=cache,id=apt-dev,target=/var/cache/apt \
apt-get update && \
apt-get install -y --no-install-recommends \
gramine libprotobuf-c-dev \
&& rm -rf /var/lib/apt/lists/*
# there is an issue for libprotobuf-c in gramine repo, install from apt for now
# graminelibos is under this dir
ENV PYTHONPATH=/usr/local/lib/python3.8/site-packages/:/usr/lib/python3/dist-packages/:
# install linux headers
# WORKDIR /tmp/
# RUN wget -c https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.11/amd64/linux-headers-5.11.0-051100_5.11.0-051100.202102142330_all.deb
# RUN dpkg -i *.deb
# RUN mv /usr/src/linux-headers-5.11.0-051100/ /usr/src/linux-headers-5.11.0-051100rc5-generic/
# WORKDIR /
# ENV LC_ALL=C.UTF-8
# ENV LANG=C.UTF-8
I am using the latest release from GitHub for openfl installation.
Also attaching the output from running the script. (please remove .pdf from output.pdf) output.pdf
No usable temporary directory found in ['/tmp', '/var/tmp', '/usr/tmp', '/workspace']
I also got this. It looks like TensorFlow tries to create a temporary directory somewhere inside the enclave, which is not a good idea in the first place.
What makes it strange is that the Gramine manifest allows the use of /tmp
just for this purpose, yet we still see this error.
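What tempfile does under the hood helps explain the error: Python probes each candidate directory by actually creating and writing a small file in it, so /tmp must be genuinely writable from inside the enclave, not merely listed in the manifest. A simplified sketch of CPython's probe logic (mimicking tempfile._get_default_tempdir; the function name here is ours):

```python
import errno
import os

def first_usable_tempdir(candidates):
    """Return the first candidate directory in which a file can actually be
    created and written, mimicking (in simplified form) the probe done by
    CPython's tempfile._get_default_tempdir."""
    for d in candidates:
        probe = os.path.join(d, "gramine_probe.tmp")
        try:
            # Inside an enclave this fails if the directory is not truly
            # writable, regardless of what the manifest claims.
            with open(probe, "wb") as f:
                f.write(b"x")
            os.unlink(probe)
            return d
        except OSError:
            continue
    raise FileNotFoundError(
        errno.ENOENT,
        "No usable temporary directory found in %r" % (candidates,))
```

Since every candidate in ['/tmp', '/var/tmp', '/usr/tmp', '/workspace'] failed this probe, none of them was writable from within the enclave at that point.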
In order for TF to create a directory at runtime inside an enclave, we would need to mount that directory from the host area to the enclave, something like this. For the syntax that we currently support with OpenFL, the lines below can be added to the manifest here:
fs.mount.etc.type = "chroot"
fs.mount.etc.path = "/tmp"
fs.mount.etc.uri = "file:/tmp"
It is strange we need to mount the temp folder and not just allow using it inside an enclave. I am positive it worked before 😅😅
So I made the following changes:
- In test_graminize.sh I added a line mkdir tmpfs in ${FED_DIRECTORY} and then added an option to docker run: --volume=${FED_DIRECTORY}/tmpfs:/tmp
- Added the lines to the openfl.manifest.template file.
But I am still getting the same error. Is it working for you, @igor-davidyuk, or am I missing something?
It should be done in a different way. We need to add that mounting line to the Gramine manifest template. Then we should keep in mind that the enclave is built within a Docker image, so we need the updated manifest inside the Docker image. Thus we should not install openfl from pip in the Docker base image, but copy the local repository and install from source.
I will try to do this in a separate branch. Yet I am still not sure whether it is safe to mount /tmp into the enclave; I will ask the Gramine team.
Try this branch, worked for me! https://github.com/igor-davidyuk/openfl/tree/manifest-gramine-update
Thanks for the working branch, @igor-davidyuk. Ideally, it is unsafe to even allow the /tmp directory, but since we already list it as an allowed file in the manifest for this example, it should be okay to mount it as well. We should definitely consult the Gramine team.
Yes, it is not stopping at that step 😄 but a bit further along. After using the new branch, the aggregator starts, but the example itself gives an error.
[09:20:29] INFO Using TaskRunner subclassing API collaborator.py:253
[09:20:29] INFO Using TaskRunner subclassing API collaborator.py:253
[09:20:37] METRIC Round 0, collaborator one is sending metric for task aggregated_model_validation: acc 0.140000 collaborator.py:415
[09:20:37] INFO Collaborator one is sending task results for aggregated_model_validation, round 0 aggregator.py:515
ERROR Exception calling application: [Errno 2] No such file or directory _server.py:445
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/grpc/_server.py", line 435, in _call_behavior
response_or_iterator = behavior(argument, context)
File "/usr/local/lib/python3.8/site-packages/openfl/transport/grpc/aggregator_server.py", line 222, in SendLocalTaskResults
self.aggregator.send_local_task_results(
File "/usr/local/lib/python3.8/site-packages/openfl/component/aggregator/aggregator.py", line 552, in
send_local_task_results
self.log_metric(tensor_key.tags[-1], task_name,
File "/usr/local/lib/python3.8/site-packages/openfl/utilities/logs.py", line 24, in write_metric
get_writer()
File "/usr/local/lib/python3.8/site-packages/openfl/utilities/logs.py", line 19, in get_writer
writer = SummaryWriter('./logs/tensorboard', flush_secs=5)
File "/usr/local/lib/python3.8/site-packages/tensorboardX/writer.py", line 301, in __init__
self._get_file_writer()
File "/usr/local/lib/python3.8/site-packages/tensorboardX/writer.py", line 349, in _get_file_writer
self.file_writer = FileWriter(logdir=self.logdir,
File "/usr/local/lib/python3.8/site-packages/tensorboardX/writer.py", line 105, in __init__
self.event_writer = EventFileWriter(
File "/usr/local/lib/python3.8/site-packages/tensorboardX/event_file_writer.py", line 105, in __init__
self._event_queue = multiprocessing.Queue(max_queue_size)
File "/usr/local/lib/python3.8/multiprocessing/context.py", line 103, in Queue
return Queue(maxsize, ctx=self.get_context())
File "/usr/local/lib/python3.8/multiprocessing/queues.py", line 42, in __init__
self._rlock = ctx.Lock()
File "/usr/local/lib/python3.8/multiprocessing/context.py", line 68, in Lock
return Lock(ctx=self.get_context())
File "/usr/local/lib/python3.8/multiprocessing/synchronize.py", line 162, in __init__
SemLock.__init__(self, SEMAPHORE, 1, 1, ctx=ctx)
File "/usr/local/lib/python3.8/multiprocessing/synchronize.py", line 57, in __init__
sl = self._semlock = _multiprocessing.SemLock(
FileNotFoundError: [Errno 2] No such file or directory
INFO Response code: StatusCode.UNKNOWN aggregator_client.py:59
INFO Attempting to resend data request to aggregator at localhost:58034 aggregator_client.py:98
The above error repeats in a loop as the collaborator keeps resending its results.
I am currently using TEMPLATE=${3:- 'torch_cnn_histology_gramine_ready'} in test_graminize.sh.
Also, when I use "torch_unet_kvasir_gramine_ready", I am getting below error:
SystemError: ZIP File hash doesn't match expected file hash.
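For reference, that SystemError comes from a download-integrity check: the digest of the downloaded workspace ZIP is compared against a pinned value, and the upstream file apparently changed. A generic sketch of this kind of check (the algorithm, function name, and error path are illustrative assumptions, not OpenFL's exact code):

```python
import hashlib

def verify_zip_hash(path, expected_hexdigest, algorithm="sha384"):
    """Stream a file and compare its digest against a pinned value,
    raising on mismatch -- the shape of check behind the error above."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    if h.hexdigest() != expected_hexdigest:
        raise SystemError("ZIP File hash doesn't match expected file hash.")
```

Commenting out such a check trades away integrity verification of the download, so it is only acceptable for local experimentation.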
Hi @gagandeep987123, can you try out the example with "torch_unet_kvasir_gramine_ready" again? We are aware of the issue of the hash not being valid (it came up pretty recently). We'll resolve this soon, but in the meantime, please comment out this line and proceed.
Regarding the multiprocessing issue that you see: it is a known issue that Python's multiprocessing package requires POSIX semaphores, which Gramine does not implement. So we'll need to disable multiprocessing (num_workers=0) to make it work.
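To see why staying single-process sidesteps the traceback: multiprocessing.Queue allocates a POSIX semaphore (via _multiprocessing.SemLock) at construction time, which is exactly the call failing inside Gramine, whereas the purely thread-based queue.Queue allocates none. A minimal illustration (the function names are ours, not OpenFL's):

```python
import multiprocessing
import queue

def make_ipc_queue():
    # Allocates a POSIX semaphore (SemLock) under the hood -- the exact
    # constructor that raises FileNotFoundError inside a Gramine enclave.
    return multiprocessing.Queue()

def make_thread_queue():
    # Pure in-process queue: no semaphores involved, so code paths that
    # stay single-process keep working under Gramine.
    return queue.Queue()

q = make_thread_queue()
q.put("metric")
```

On a normal host both constructors succeed; under Gramine only the thread-based one does, which is why pushing the code down a single-process path avoids the error.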
Hi @mansishr, I am getting the same multiprocessing error while running "torch_unet_kvasir_gramine_ready".
Hi @gagandeep987123, sorry for the late response. Multiprocessing is triggered through the use of TensorBoard's summary writer. Please disable write_logs in the plan settings:
aggregator:
  defaults: plan/defaults/aggregator.yaml
  template: openfl.component.Aggregator
  settings:
    init_state_path: save/torch_cnn_histology_init.pbuf
    best_state_path: save/torch_cnn_histology_best.pbuf
    last_state_path: save/torch_cnn_histology_last.pbuf
    rounds_to_train: 20
    write_logs: false
@mansishr Is it working for you? I am still getting the error. I changed the file as you suggested via a change in Dockerfile.gramine.
Hi @gagandeep987123, could you attach the files where you made the changes? Also, you will need to run through all the steps again and rebuild the image after any change to the plan.
Hi @gagandeep987123, has the issue been resolved?
It is working. Thanks for the help