
Kernel freeze at tf.keras.Sequential.fit()

Open rafalpotempa opened this issue 4 years ago • 8 comments

What I did?

Link to Colab: https://colab.research.google.com/drive/1g6BFapSuG0-WCQzxlrDsPKCcmaGemB9f?usp=sharing

Please use an email connected to your GitHub account when requesting access - I'll accept it. The notebook is related to my graduation project and I don't want the work to go fully public yet.

I created a custom layer whose quantum circuit is built in quantum_circuit() to represent an 8x8 image: 4 readout qubits, each wrapped in two H gates and connected by ZZ**(param) gates to 16 data qubits (an 8x8 extension of what can be found in the MNIST classification example).

The image is divided into four 4x4 pieces, each connected to a single readout qubit.
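For reference, here is a minimal sketch of what such a circuit could look like - the function name, symbol names, and exact block-to-readout wiring below are my guesses from the description, not the notebook's code:

import cirq
import sympy

def quantum_circuit(compressed_image_size=8):
    """Four readout qubits, each coupled by ZZ**(param) gates to one block of the image."""
    block = compressed_image_size // 2                    # 4x4 blocks for an 8x8 image
    readouts = [cirq.GridQubit(-1, i) for i in range(4)]  # readout qubits off the data grid
    circuit = cirq.Circuit()
    circuit.append(cirq.H(r) for r in readouts)           # opening H on each readout
    for i, readout in enumerate(readouts):
        row0, col0 = block * (i // 2), block * (i % 2)    # top-left corner of this block
        for r in range(block):
            for c in range(block):
                data_qubit = cirq.GridQubit(row0 + r, col0 + c)
                theta = sympy.Symbol(f'zz_{i}_{r}_{c}')   # one trainable parameter per pair
                circuit.append(cirq.ZZ(data_qubit, readout) ** theta)
    circuit.append(cirq.H(r) for r in readouts)           # closing H on each readout
    return circuit, [cirq.Z(r) for r in readouts]         # model_circuit, model_readout

With 4 readouts times 16 data qubits, this gives 64 symbols, which matches the 64 parameters reported for the PQC layer in the model summary below.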

The data is represented similarly to what can be found in the example (X gate if normalized_color > 0.5).
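That encoding can be sketched as follows, following the tutorial's threshold encoding (encode_image is a hypothetical helper name):

import cirq

def encode_image(image, threshold=0.5):
    """One normalized image -> circuit with X on every data qubit whose pixel exceeds the threshold."""
    size = image.shape[0]
    qubits = cirq.GridQubit.rect(size, size)
    circuit = cirq.Circuit()
    for qubit, value in zip(qubits, image.flatten()):
        if value > threshold:
            circuit.append(cirq.X(qubit))
    return circuit

# The encoded circuits are then serialized for the model's tf.string input, e.g.:
# x_train_tfcirc = tfq.convert_to_tensor([encode_image(img) for img in x_train])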

I attached a softmax layer directly to the quantum one for classification using a tf.keras.Sequential model, since I want to extend it further - up to all 10 digits.

qnn_model = tf.keras.Sequential([
    tf.keras.Input(shape=(), dtype=tf.string, name='q_input'),
    tfq.layers.PQC(model_circuit, model_readout, name='quantum'),
    tf.keras.layers.Dense(2, activation=tf.keras.activations.softmax, name='softmax'),
])
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
quantum (PQC)                (None, 4)                 64        
_________________________________________________________________
softmax (Dense)              (None, 2)                 10        
=================================================================
Total params: 74
Trainable params: 74
Non-trainable params: 0
_________________________________________________________________

I compiled the model and tried to fit it.
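Roughly along these lines - the loss, optimizer, batch size, and the x_train_tfcirc / y_train names are my assumptions, not the notebook's exact settings:

qnn_model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),
    optimizer=tf.keras.optimizers.Adam(),
    metrics=['accuracy'])

history = qnn_model.fit(
    x_train_tfcirc, y_train,   # serialized circuits and integer labels
    batch_size=32,
    epochs=10,
    validation_data=(x_test_tfcirc, y_test))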

What was expected to happen?

The model should start iterating over the given number of epochs.

What happened?

Epoch 1/10 is displayed, but nothing else happens.

  • The Colab kernel restarts, yielding the log that can be found in the Attachments section.
  • In a local WSL2 environment I encountered what I would call 'a kernel freeze': the cell appeared to be running, but nothing was happening - no CPU or RAM usage. The operation could not be interrupted; only a kernel restart worked.

Environment

tensorflow          2.3.1
tensorflow-quantum  0.4.0

for both:

  • Google Colab
  • Windows Subsystem for Linux 2 (Ubuntu 20.04.1 LTS; Windows 10 Pro, build 20270)

No GPU involved.

What I found out?

When I run the notebook with compressed_image_size = 4, everything works as intended. I've checked my quantum_circuit() and it also seems to work as intended for the 8x8 version - it generates a circuit with the desired architecture.

When I tried to track down the error, I found that:

data_adapter.py: enumerate_epochs() yields the correct epoch, but the tf.data.Iterator data_iterator raises AttributeErrors like

AttributeError: 'OwnedIterator' object has no attribute '_self_unconditional_checkpoint_dependencies'

in

  • _checkpoint_dependencies
  • _deferred_dependencies

as well as

AttributeError: 'OwnedIterator' object has no attribute '_self_name_based_restores'

in

  • _name_based_restores

and:

AttributeError("'OwnedIterator' object has no attribute '_self_unconditional_checkpoint_dependencies'")
AttributeError("'OwnedIterator' object has no attribute '_self_unconditional_dependency_names'")
AttributeError("'OwnedIterator' object has no attribute '_self_update_uid'")

I'm not sure if this is relevant.

Attachments

colab-jupyter.log

Dec 15, 2020, 10:41:32 AM | WARNING | WARNING:root:kernel b6193863-8d44-476f-b8cc-eadbe7129967 restarted
Dec 15, 2020, 10:41:32 AM | INFO | KernelRestarter: restarting kernel (1/5), keep random ports
Dec 15, 2020, 10:40:56 AM | WARNING | 2020-12-15 09:40:56.133076: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
Dec 15, 2020, 10:40:56 AM | WARNING | 2020-12-15 09:40:56.133022: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1b91640 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
Dec 15, 2020, 10:40:56 AM | WARNING | 2020-12-15 09:40:56.131837: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2199995000 Hz
Dec 15, 2020, 10:40:56 AM | WARNING | To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Dec 15, 2020, 10:40:56 AM | WARNING | 2020-12-15 09:40:56.125112: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: AVX2 FMA
Dec 15, 2020, 10:40:56 AM | WARNING | 2020-12-15 09:40:56.124271: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (0071d832075f): /proc/driver/nvidia/version does not exist
Dec 15, 2020, 10:40:56 AM | WARNING | 2020-12-15 09:40:56.123595: E tensorflow/stream_executor/cuda/cuda_driver.cc:314] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Dec 15, 2020, 10:40:56 AM | WARNING | 2020-12-15 09:40:56.109400: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
Dec 15, 2020, 10:40:53 AM | WARNING | 2020-12-15 09:40:53.250994: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Dec 15, 2020, 10:37:53 AM | WARNING | WARNING:root:kernel b6193863-8d44-476f-b8cc-eadbe7129967 restarted
Dec 15, 2020, 10:37:53 AM | INFO | KernelRestarter: restarting kernel (1/5), keep random ports
Dec 15, 2020, 10:36:24 AM | WARNING | 2020-12-15 09:36:24.601416: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
Dec 15, 2020, 10:36:24 AM | WARNING | 2020-12-15 09:36:24.601370: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x20c3640 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
Dec 15, 2020, 10:36:24 AM | WARNING | 2020-12-15 09:36:24.600345: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2199995000 Hz
Dec 15, 2020, 10:36:24 AM | WARNING | To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Dec 15, 2020, 10:36:24 AM | WARNING | 2020-12-15 09:36:24.593357: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: AVX2 FMA
Dec 15, 2020, 10:36:24 AM | WARNING | 2020-12-15 09:36:24.592695: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (0071d832075f): /proc/driver/nvidia/version does not exist
Dec 15, 2020, 10:36:24 AM | WARNING | 2020-12-15 09:36:24.592632: E tensorflow/stream_executor/cuda/cuda_driver.cc:314] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Dec 15, 2020, 10:36:24 AM | WARNING | 2020-12-15 09:36:24.531111: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
Dec 15, 2020, 10:36:20 AM | WARNING | 2020-12-15 09:36:20.926549: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Dec 15, 2020, 10:36:01 AM | INFO | Adapting to protocol v5.1 for kernel b6193863-8d44-476f-b8cc-eadbe7129967
Dec 15, 2020, 10:33:42 AM | INFO | Adapting to protocol v5.1 for kernel b6193863-8d44-476f-b8cc-eadbe7129967
Dec 15, 2020, 10:33:41 AM | INFO | Kernel started: b6193863-8d44-476f-b8cc-eadbe7129967
Dec 15, 2020, 10:33:13 AM | INFO | Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
Dec 15, 2020, 10:33:13 AM | INFO | http://172.28.0.2:9000/
Dec 15, 2020, 10:33:13 AM | INFO | The Jupyter Notebook is running at:
Dec 15, 2020, 10:33:13 AM | INFO | 0 active kernels
Dec 15, 2020, 10:33:13 AM | INFO | Serving notebooks from local directory: /
Dec 15, 2020, 10:33:13 AM | INFO | google.colab serverextension initialized.
Dec 15, 2020, 10:33:13 AM | INFO | Writing notebook server cookie secret to /root/.local/share/jupyter/runtime/notebook_cookie_secret
Dec 15, 2020, 10:33:13 AM | WARNING | Config option `delete_to_trash` not recognized by `ColabFileContentsManager`.

rafalpotempa avatar Dec 15 '20 11:12 rafalpotempa

There is really a lot going on in the code. Do you have any ideas where I could place my breakpoints and focus? Is there any easier way to trace the source of this bug?

rafalpotempa avatar Dec 15 '20 11:12 rafalpotempa

I've just sent a request (using my @google.com email). Will be able to look more closely into things once you can share the notebook with me. I will be sure to not share any details of the code here and just focus on the bug itself.

MichaelBroughton avatar Dec 16 '20 22:12 MichaelBroughton

Thanks for the interest!

I've just given you editor permissions. If you have any questions or concerns feel free to ask.

rafalpotempa avatar Dec 16 '20 22:12 rafalpotempa

No problem. So at first glance I think you've solved your own problem in your comment on the side there.

The compressed_image_size is too big with a value of 8. A quick review of quantum circuit simulation:

Simulating n qubits requires storing 2^n complex amplitudes in memory. So looking at your code:

compressed_image_size=8 => compressed_image_shape = (8,8)

Then in the line: qubits = cirq.GridQubit.rect(*compressed_image_shape) => len(qubits) == 64

Mathing that out really quick gives us a state vector with 2^64 complex amplitudes; at 64 bits per amplitude, that means you requested about 147 exabytes of RAM. A bit too much :). In general, simulations cap out around 30 qubits; with some serious hardware you might be able to push things up to 35-40.
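For concreteness, the arithmetic is easy to check in a few lines (assuming one complex64 amplitude per basis state, i.e. 8 bytes each):

n_qubits = 8 * 8                                  # compressed_image_size = 8 -> 64 data qubits
bytes_per_amplitude = 8                           # one complex64 amplitude = 64 bits
state_vector_bytes = 2 ** n_qubits * bytes_per_amplitude
print(state_vector_bytes / 1e18, 'exabytes')      # ~147.6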

My guess is that the malloc call didn't fail gracefully at that size, which is a bug we should probably look into. Does this help clear things up?

MichaelBroughton avatar Dec 16 '20 22:12 MichaelBroughton

Yeah. This totally explains the behavior. This was the first thing that came to my mind, but I couldn't find any errors related to hardware, so I assumed everything was correct.

Nevertheless, some error message would be really helpful here - it shouldn't fail silently :)

Thanks!

rafalpotempa avatar Dec 17 '20 07:12 rafalpotempa

I wanted to contribute and add the error handling, but I got lost in the codebase... 🤯

Anyway... I finished and published the thesis. It even got highlighted by IEEE, and there is a follow-up paper presented at CORES'21 going public soon.

https://www.researchgate.net/publication/353074126_Simulation_of_quantum_neural_network_with_evaluation_of_its_performance

I hope you enjoy it. In case of questions or anything else, please contact me via my GitHub email :)

rafalpotempa avatar Jul 22 '21 17:07 rafalpotempa

That's awesome! Always happy to see more publications making use of TFQ!

MichaelBroughton avatar Jul 22 '21 18:07 MichaelBroughton

Any updates on this issue @rafalpotempa or can it be closed?

lockwo avatar Aug 24 '22 18:08 lockwo