
Pickling state / pickling whole class issue

Open devop01 opened this issue 3 years ago • 10 comments

As far as I understand, there are problems with saving the state of a trained TM. Saving it is important for research and development, since it allows checkpoints to be created easily.

I tried to pickle MultiClassTsetlinMachine (from pyTsetlinMachineParallel.tm) as well as the pyCUDA version, and both failed.

Then I tried to pickle just the TM state (from the MNIST example), e.g.:

tm = MultiClassTsetlinMachine(2000, 50, 10.0)
f = open..
pickle.dump(tm.get_state(),f)

#new session
f = open...
state = pickle.load(f)

tm2 = MultiClassTsetlinMachine(2000, 50, 10.0)
tm2.set_state(state)

Result:

Traceback (most recent call last):
  File "<pyshell#15>", line 1, in <module>
    tm.set_state(state)
  File "/.../TsetlinMachine/pyTsetlinMachineParallel/tm.py", line 314, in set_state
    for i in range(self.number_of_classes):
AttributeError: 'MultiClassTsetlinMachine' object has no attribute 'number_of_classes'

If I set the number of classes manually:

tm2.number_of_classes = 10
tm2.set_state(state)

the program hangs, and after some time I get:

tm2.set_state(state)

=============================== RESTART: Shell ===============================

devop01 avatar Nov 20 '20 17:11 devop01

Hello, the question sounds very interesting and is also very relevant for me.

I would also like to run the fit separately from the predict step, partly because training takes some time. Based on the previous post, I have adapted an example to show my approach. I would be grateful for any hints on how to do it correctly.

from pyTsetlinMachine.tm import MultiClassTsetlinMachine
from pyTsetlinMachine.tools import Binarizer
import numpy as np

from sklearn import datasets
from sklearn.model_selection import train_test_split

import pickle
import pickletools 

breast_cancer = datasets.load_breast_cancer()
X = breast_cancer.data
Y = breast_cancer.target

b = Binarizer(max_bits_per_feature = 10)
b.fit(X)
X_transformed = b.transform(X)

tm = MultiClassTsetlinMachine(800, 40, 5.0)

print("\nMean accuracy over 10 runs:\n")
tm_results = np.empty(0)
for i in range(5):
    X_train, X_test, Y_train, Y_test = train_test_split(X_transformed, Y, test_size=0.2)

    tm.fit(X_train, Y_train, epochs=25)
    with open('dat_' + str(i) + '.pickle', 'wb') as handle:
        p = pickle.dump(tm.get_state(), handle, protocol=pickle.HIGHEST_PROTOCOL)
        #pickletools.dis(p)


    tm_results = np.append(tm_results, np.array(100*(tm.predict(X_test) == Y_test).mean()))
    print("#%d Average Accuracy: %.2f%% +/- %.2f" % (i+1, tm_results.mean(), 1.96*tm_results.std()/np.sqrt(i+1)))


for i in range(5):
    with open('dat_' + str(i) + '.pickle','rb') as pickle_file:
        state = pickle.load(pickle_file)

    tm2 = MultiClassTsetlinMachine(800, 40, 5.0)
    tm2.set_state(state)

    tm2_results = np.append(tm_results, np.array(100*(tm2.predict(X_test) == Y_test).mean()))
    print("#%d Average Accuracy: %.2f%% +/- %.2f" % (i+1, tm2_results.mean(), 1.96*tm2_results.std()/np.sqrt(i+1)))

My error message:

Mean accuracy over 10 runs:

 File "/mnt/c/ProjectsGit/tm-segment/BreastCancerDemo_pkl.py", line 42, in <module>
    tm2.set_state(state)
  File "/home/unix/miniconda3/lib/python3.8/site-packages/pyTsetlinMachine/tm.py", line 351, in set_state
    for i in range(self.number_of_classes):
AttributeError: 'MultiClassTsetlinMachine' object has no attribute 'number_of_classes'

andife avatar Feb 07 '21 10:02 andife

Just to link things together, that last post is now asked over at StackOverflow: https://stackoverflow.com/q/66099424/10495893

bmreiniger avatar Feb 08 '21 17:02 bmreiniger

Hi @bmreiniger and @andife! Thanks for bringing up this issue.

The get_state and set_state methods only operate on the state of the Tsetlin Automata held in the C part of the code. It is the fit function on the Python side that sets the other crucial parameters and creates the actual Tsetlin Machine structure in C, inside the branch guarded by self.mc_tm == None:

if self.mc_tm == None:
    self.number_of_classes = int(np.max(Y) + 1)

    if self.append_negated:
        self.number_of_features = X.shape[1]*2
    else:
        self.number_of_features = X.shape[1]

    self.number_of_patches = 1
    self.number_of_ta_chunks = int((self.number_of_features-1)/32 + 1)
    self.mc_tm = _lib.CreateMultiClassTsetlinMachine(self.number_of_classes, self.number_of_clauses, self.number_of_features, 1, self.number_of_ta_chunks, self.number_of_state_bits, self.T, self.s, self.s_range, self.boost_true_positive_feedback, self.weighted_clauses)

As you can see from the above code, the number of features and the number of classes are obtained automatically from the input data X and Y. So, one possible trick is to call fit on the training data first, with epochs=0. Then fit will set the number of classes and features and create the TM in C, without running any epochs over the data. After calling fit, you can call set_state to initialize the Tsetlin Automata from the saved state.
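
To illustrate why the order of calls matters, here is a minimal sketch built around a hypothetical stand-in class. `ToyMachine` is not part of pyTsetlinMachine and its internals are simplified assumptions; it only mimics the pattern where the structure is created lazily inside `fit`, so `set_state` fails on a fresh object and succeeds after a zero-epoch `fit`.

```python
import pickle


class ToyMachine:
    """Hypothetical stand-in that mimics lazy construction inside fit()."""

    def __init__(self, number_of_clauses):
        self.number_of_clauses = number_of_clauses
        self.structure = None  # plays the role of mc_tm; created by fit()

    def fit(self, X, Y, epochs=1):
        if self.structure is None:
            # Dimensions are inferred from the data on the first call.
            self.number_of_classes = int(max(Y) + 1)
            self.number_of_features = len(X[0])
            self.structure = [
                [0] * self.number_of_features
                for _ in range(self.number_of_classes)
            ]
        for _ in range(epochs):
            # Placeholder "learning": count active features per class.
            for x, y in zip(X, Y):
                for k, bit in enumerate(x):
                    self.structure[y][k] += bit

    def get_state(self):
        return [list(row) for row in self.structure]

    def set_state(self, state):
        # Needs number_of_classes, which only exists after fit().
        for i in range(self.number_of_classes):
            self.structure[i] = list(state[i])


X = [[1, 0, 1], [0, 1, 1]]
Y = [0, 1]

trained = ToyMachine(10)
trained.fit(X, Y, epochs=3)
saved = pickle.dumps(trained.get_state())

fresh = ToyMachine(10)
try:
    fresh.set_state(pickle.loads(saved))
    failed = False
except AttributeError:
    failed = True  # same kind of failure as in the traceback reported earlier

fresh.fit(X, Y, epochs=0)  # builds the structure without any training
fresh.set_state(pickle.loads(saved))
restored_matches = fresh.get_state() == trained.get_state()
print(failed, restored_matches)
```

The zero-epoch call only exists to give the object its dimensions; the learned values come entirely from the saved state.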

I am planning to add methods for this. Let me know if it works out!

olegranmo avatar Feb 08 '21 17:02 olegranmo

Hello @olegranmo, @bmreiniger;

Thanks for the solution for this issue. It works! (I used the code below for testing)

In general, I'm wondering whether I really need to load the original dataset, or whether a synthetic one with the same number of features and classes would be enough. This would also correspond to the approach of https://github.com/cair/pyTsetlinMachine/pull/4/files.

from pyTsetlinMachine.tm import MultiClassTsetlinMachine
from pyTsetlinMachine.tools import Binarizer
import numpy as np

from sklearn import datasets
from sklearn.model_selection import train_test_split

import pickle
import pickletools 

breast_cancer = datasets.load_breast_cancer()
X = breast_cancer.data
Y = breast_cancer.target

b = Binarizer(max_bits_per_feature = 10)
b.fit(X)
X_transformed = b.transform(X)

tm = MultiClassTsetlinMachine(800, 40, 5.0)

print("\nMean accuracy over 10 runs:\n")
tm_results = np.empty(0)
tm2_results = np.empty(0)
for i in range(5):
    X_train, X_test, Y_train, Y_test = train_test_split(X_transformed, Y, test_size=0.2,random_state=i)

    tm.fit(X_train, Y_train, epochs=25)
    with open('dat_' + str(i) + '.pickle', 'wb') as handle:
        p = pickle.dump(tm.get_state(), handle, protocol=pickle.HIGHEST_PROTOCOL)
        #pickletools.dis(p)


    tm_results = np.append(tm_results, np.array(100*(tm.predict(X_test) == Y_test).mean()))
    print("#%d Average Accuracy: %.2f%% +/- %.2f" % (i+1, tm_results.mean(), 1.96*tm_results.std()/np.sqrt(i+1)))

del tm

for i in range(5):
    with open('dat_' + str(i) + '.pickle','rb') as pickle_file:
        state = pickle.load(pickle_file)

    X2_train, X2_test, Y2_train, Y2_test = train_test_split(X_transformed, Y, test_size=0.2,random_state=i)
    #X2_train = X2_train + 100
    tm2 = MultiClassTsetlinMachine(800, 40, 5.0)
    tm2.fit(X_train, Y_train, epochs=0) # X_train does not change, only for setting the correct dimensions

    tm2.set_state(state)

    tm2_results = np.append(tm2_results, np.array(100*(tm2.predict(X2_test) == Y2_test).mean()))
    print("sec2 #%d Average Accuracy: %.2f%% +/- %.2f" % (i+1, tm2_results.mean(), 1.96*tm2_results.std()/np.sqrt(i+1)))

andife avatar Feb 09 '21 17:02 andife

Great that it works, @andife! Yes, I guess you could for instance just use one example X with the correct number of features and the largest y-value (number of classes - 1).
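
A minimal sketch of that idea, assuming `fit` only needs the shape of `X` and the largest label in `Y` to set up its dimensions. The helper name `make_shape_stub` is hypothetical, and the feature count passed in must match the binarized data the saved model was trained on.

```python
import numpy as np


def make_shape_stub(number_of_features, number_of_classes):
    """Build a one-row dataset that carries only the dimensions (hypothetical helper)."""
    X_stub = np.zeros((1, number_of_features), dtype=int)
    # The largest label determines the class count via int(np.max(Y) + 1).
    Y_stub = np.array([number_of_classes - 1], dtype=int)
    return X_stub, Y_stub


X_stub, Y_stub = make_shape_stub(number_of_features=300, number_of_classes=2)
print(X_stub.shape, int(np.max(Y_stub) + 1))
```

The stub would then stand in for the real training data in the zero-epoch call, i.e. `tm2.fit(X_stub, Y_stub, epochs=0)` followed by `tm2.set_state(state)`.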

olegranmo avatar Feb 09 '21 17:02 olegranmo

@olegranmo It seems this approach does not work for the CUDA version. With the following script (Parallel version):

from pyTsetlinMachineParallel.tm import MultiClassTsetlinMachine
import numpy as np
from time import time

from keras.datasets import mnist
import pickle    

(X_train, Y_train), (X_test, Y_test) = mnist.load_data()

X_train = np.where(X_train.reshape((X_train.shape[0], 28*28)) > 75, 1, 0) 
X_test = np.where(X_test.reshape((X_test.shape[0], 28*28)) > 75, 1, 0) 

tm = MultiClassTsetlinMachine(2000, 50, 10.0)

print("\nAccuracy over 10 epochs:\n")
for i in range(10):
    start_training = time()
    tm.fit(X_train, Y_train, epochs=1, incremental=True)
    stop_training = time()

    start_testing = time()
    result = 100*(tm.predict(X_test) == Y_test).mean()
    stop_testing = time()

    print("#%d Accuracy: %.2f%% Training: %.2fs Testing: %.2fs" % (i+1, result, stop_training-start_training, stop_testing-start_testing))

print("Saving TM state...")
    
with open('tm001.pkl', 'wb') as handle:
    pickle.dump(tm.get_state(), handle, protocol=pickle.HIGHEST_PROTOCOL)    

print("Loading TM state...")
tm2 = MultiClassTsetlinMachine(2000, 50, 10.0)
tm2.fit(X_train, Y_train, epochs=0)

with open('tm001.pkl','rb') as pickle_file:
    state = pickle.load(pickle_file)
tm2.set_state(state)

print("Continue training")
for i in range(5):
    start_training = time()
    tm2.fit(X_train, Y_train, epochs=1, incremental=True)
    stop_training = time()

    start_testing = time()
    result = 100*(tm2.predict(X_test) == Y_test).mean()
    stop_testing = time()

    print("#%d Accuracy: %.2f%% Training: %.2fs Testing: %.2fs" % (i+1, result, stop_training-start_training, stop_testing-start_testing))

The results of the reloaded TM are the same as at the end of training:

Accuracy over 10 epochs:

#1 Accuracy: 94.27% Training: 35.89s Testing: 21.07s
#2 Accuracy: 95.53% Training: 26.94s Testing: 21.09s
#3 Accuracy: 95.97% Training: 25.63s Testing: 21.27s
#4 Accuracy: 96.56% Training: 24.98s Testing: 21.72s
#5 Accuracy: 96.72% Training: 24.10s Testing: 21.48s
#6 Accuracy: 96.77% Training: 23.22s Testing: 21.47s
#7 Accuracy: 96.88% Training: 23.08s Testing: 21.86s
#8 Accuracy: 96.87% Training: 22.60s Testing: 21.47s
#9 Accuracy: 97.10% Training: 22.38s Testing: 21.59s
#10 Accuracy: 97.16% Training: 22.14s Testing: 21.63s
Saving TM state...
Loading TM state...
Continue training
#1 Accuracy: 97.18% Training: 21.44s Testing: 21.19s
#2 Accuracy: 97.25% Training: 21.31s Testing: 21.22s
#3 Accuracy: 97.22% Training: 20.92s Testing: 20.78s
#4 Accuracy: 97.10% Training: 21.36s Testing: 21.17s
#5 Accuracy: 97.49% Training: 20.65s Testing: 21.22s

But for the CUDA version:

from PyTsetlinMachineCUDA.tm import MultiClassTsetlinMachine
import numpy as np
from time import time

from keras.datasets import mnist
import pickle

(X_train, Y_train), (X_test, Y_test) = mnist.load_data()

X_train = np.where(X_train.reshape((X_train.shape[0], 28*28)) > 75, 1, 0) 
X_test = np.where(X_test.reshape((X_test.shape[0], 28*28)) > 75, 1, 0) 

tm = MultiClassTsetlinMachine(2000, 50*16, 10.0, max_weight=16)

print("\nAccuracy over 10 epochs:\n")
for i in range(10):
    start_training = time()
    tm.fit(X_train, Y_train, epochs=1, incremental=True)
    stop_training = time()

    start_testing = time()
    result = 100*(tm.predict(X_test) == Y_test).mean()
    stop_testing = time()

    print("#%d Accuracy: %.2f%% Training: %.2fs Testing: %.2fs" % (i+1, result, stop_training-start_training, stop_testing-start_testing))

print("Saving TM state...")
    
with open('tm002.pkl', 'wb') as handle:
    pickle.dump(tm.get_state(), handle, protocol=pickle.HIGHEST_PROTOCOL)    

print("Loading TM state...")
tm2 = MultiClassTsetlinMachine(2000, 50*16, 10.0, max_weight=16)
tm2.fit(X_train, Y_train, epochs=0)

with open('tm002.pkl','rb') as pickle_file:
    state = pickle.load(pickle_file)
tm2.set_state(state)

print("Continue training")
for i in range(5):
    start_training = time()
    tm2.fit(X_train, Y_train, epochs=1, incremental=True)
    stop_training = time()

    start_testing = time()
    result = 100*(tm2.predict(X_test) == Y_test).mean()
    stop_testing = time()

    print("#%d Accuracy: %.2f%% Training: %.2fs Testing: %.2fs" % (i+1, result, stop_training-start_training, stop_testing-start_testing))

There is no effect after the reload; it is as if the script were starting from zero:

Accuracy over 10 epochs:

#1 Accuracy: 92.80% Training: 9.59s Testing: 1.08s
#2 Accuracy: 94.21% Training: 8.10s Testing: 1.15s
#3 Accuracy: 95.56% Training: 7.85s Testing: 1.07s
#4 Accuracy: 96.01% Training: 7.44s Testing: 1.07s
#5 Accuracy: 96.38% Training: 7.38s Testing: 1.07s
#6 Accuracy: 96.61% Training: 7.63s Testing: 1.12s
#7 Accuracy: 96.69% Training: 7.79s Testing: 1.13s
#8 Accuracy: 96.94% Training: 7.74s Testing: 1.14s
#9 Accuracy: 97.06% Training: 7.50s Testing: 1.14s
#10 Accuracy: 97.09% Training: 7.80s Testing: 1.06s
Saving TM state...
Loading TM state...
Continue training
#1 Accuracy: 92.97% Training: 9.77s Testing: 1.02s
#2 Accuracy: 94.23% Training: 8.06s Testing: 1.15s
#3 Accuracy: 95.66% Training: 8.17s Testing: 1.14s
#4 Accuracy: 96.08% Training: 7.42s Testing: 1.06s
#5 Accuracy: 96.56% Training: 7.88s Testing: 1.06s

devop01 avatar May 10 '21 10:05 devop01

Hi @devop01 - thanks for reporting! I have started adding pickle support, just completed for PyTsetlinMachine.

olegranmo avatar May 29 '21 21:05 olegranmo

Hi again @devop01, just added pickle support for PyTsetlinMachineCUDA!

olegranmo avatar Jun 19 '21 11:06 olegranmo

Hi @olegranmo

Thank you for the new version :) I reinstalled it, but unfortunately the results of training after loading a pickle didn't improve. I'm not sure whether this is an issue with my CUDA setup or whether something else is missing. Comparing the TM before and after pickling could help the investigation, so I think overriding the eq operator could be helpful.

Accuracy over 25 epochs:

#1 Accuracy: 92.89% Training: 9.78s Testing: 1.06s
#2 Accuracy: 94.30% Training: 7.76s Testing: 1.11s
#3 Accuracy: 95.64% Training: 7.80s Testing: 1.10s
#4 Accuracy: 96.06% Training: 7.67s Testing: 1.10s
#5 Accuracy: 96.37% Training: 7.62s Testing: 1.11s
#6 Accuracy: 96.67% Training: 7.89s Testing: 1.15s
#7 Accuracy: 96.77% Training: 7.79s Testing: 1.10s
#8 Accuracy: 96.95% Training: 7.57s Testing: 1.11s
#9 Accuracy: 97.06% Training: 7.54s Testing: 1.09s
#10 Accuracy: 97.23% Training: 7.50s Testing: 1.09s
#11 Accuracy: 97.23% Training: 7.47s Testing: 1.09s
#12 Accuracy: 97.32% Training: 7.46s Testing: 1.14s
#13 Accuracy: 97.36% Training: 7.70s Testing: 1.14s
#14 Accuracy: 97.48% Training: 7.62s Testing: 1.09s
#15 Accuracy: 97.54% Training: 7.40s Testing: 1.08s
#16 Accuracy: 97.60% Training: 7.40s Testing: 1.09s
#17 Accuracy: 97.54% Training: 7.39s Testing: 1.09s
#18 Accuracy: 97.53% Training: 7.38s Testing: 1.09s
#19 Accuracy: 97.60% Training: 7.38s Testing: 1.09s
#20 Accuracy: 97.50% Training: 7.37s Testing: 1.09s
#21 Accuracy: 97.63% Training: 7.37s Testing: 1.09s
#22 Accuracy: 97.75% Training: 7.37s Testing: 1.09s
#23 Accuracy: 97.70% Training: 7.36s Testing: 1.09s
#24 Accuracy: 97.67% Training: 7.36s Testing: 1.09s
#25 Accuracy: 97.70% Training: 7.37s Testing: 1.09s
Saving TM state...
Loading TM state...
Comparing tm1 and tm2
False
Continue training
#1 Accuracy: 92.95% Training: 9.86s Testing: 1.05s
#2 Accuracy: 94.16% Training: 7.77s Testing: 1.11s
#3 Accuracy: 95.64% Training: 7.83s Testing: 1.10s
#4 Accuracy: 96.21% Training: 7.70s Testing: 1.10s
#5 Accuracy: 96.43% Training: 7.63s Testing: 1.10s

Below is the full code I used:

from PyTsetlinMachineCUDA.tm import MultiClassTsetlinMachine
import numpy as np
from time import time

from keras.datasets import mnist
import pickle

(X_train, Y_train), (X_test, Y_test) = mnist.load_data()

X_train = np.where(X_train.reshape((X_train.shape[0], 28*28)) > 75, 1, 0) 
X_test = np.where(X_test.reshape((X_test.shape[0], 28*28)) > 75, 1, 0) 

tm = MultiClassTsetlinMachine(2000, 50*16, 10.0, max_weight=16)

print("\nAccuracy over 25 epochs:\n")
for i in range(25):
    start_training = time()
    tm.fit(X_train, Y_train, epochs=1, incremental=True)
    stop_training = time()

    start_testing = time()
    result = 100*(tm.predict(X_test) == Y_test).mean()
    stop_testing = time()

    print("#%d Accuracy: %.2f%% Training: %.2fs Testing: %.2fs" % (i+1, result, stop_training-start_training, stop_testing-start_testing))

print("Saving TM state...")
    
with open('tm005.pkl', 'wb') as handle:
    pickle.dump(tm, handle, protocol=pickle.HIGHEST_PROTOCOL)

print("Loading TM state...")

with open('tm005.pkl','rb') as pickle_file:
    tm2 = pickle.load(pickle_file)

print("Comparing tm1 and tm2")
print(tm == tm2)

print("Continue training")
for i in range(5):
    start_training = time()
    tm2.fit(X_train, Y_train, epochs=1, incremental=True)
    stop_training = time()

    start_testing = time()
    result = 100*(tm2.predict(X_test) == Y_test).mean()
    stop_testing = time()

    print("#%d Accuracy: %.2f%% Training: %.2fs Testing: %.2fs" % (i+1, result, stop_training-start_training, stop_testing-start_testing))
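
One note on the `tm == tm2` check above: a Python class that does not define `__eq__` compares by identity, so an unpickled copy compares unequal to the original even when the contents match. Whether the library's class defines `__eq__` is not confirmed here, so the `False` in the output may say nothing about the saved state. The sketch below uses hypothetical stand-in classes (not library code) to show the difference and one way to compare by state instead.

```python
import pickle


class PlainModel:
    """Hypothetical stand-in without __eq__: '==' falls back to identity."""

    def __init__(self, state):
        self.state = state


class ComparableModel(PlainModel):
    """Same data, but equality is defined by comparing the stored state."""

    def __eq__(self, other):
        if not isinstance(other, ComparableModel):
            return NotImplemented
        return self.state == other.state


plain = PlainModel([1, 2, 3])
plain_copy = pickle.loads(pickle.dumps(plain))

comparable = ComparableModel([1, 2, 3])
comparable_copy = pickle.loads(pickle.dumps(comparable))

plain_equal = plain == plain_copy                  # identity comparison
comparable_equal = comparable == comparable_copy   # state comparison
states_equal = plain.state == plain_copy.state     # compare the data directly
print(plain_equal, comparable_equal, states_equal)
```

For a quick check without changing any class, comparing the results of `get_state()` on both objects element by element would serve the same purpose, assuming the returned arrays can be compared with something like `numpy.array_equal`.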

devop01 avatar Jun 22 '21 18:06 devop01

Hi @devop01, this happens because the local voting tallies used for asynchronous parallel learning are not stored as part of the state. Everything is reinitialized when you start training again. Will fix this in the next update!
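
As a rough illustration of that failure mode, the hypothetical `ToyLearner` below keeps two pieces of data, but its default `get_state` returns only one of them. This is a simplified stand-in and not the library's actual data layout: restoring from such a partial state leaves the omitted part at its freshly initialized value.

```python
import pickle


class ToyLearner:
    """Hypothetical stand-in with a primary state and an auxiliary tally."""

    def __init__(self, size):
        self.automata = [0] * size
        self.tallies = [0] * size  # auxiliary data used while training

    def train_step(self):
        self.automata = [a + 1 for a in self.automata]
        self.tallies = [t + a for t, a in zip(self.tallies, self.automata)]

    def get_state(self, include_tallies=False):
        state = {"automata": list(self.automata)}
        if include_tallies:
            state["tallies"] = list(self.tallies)
        return state

    def set_state(self, state):
        self.automata = list(state["automata"])
        # Anything missing from the saved state keeps its initial value.
        if "tallies" in state:
            self.tallies = list(state["tallies"])


learner = ToyLearner(3)
for _ in range(4):
    learner.train_step()

partial = pickle.loads(pickle.dumps(learner.get_state()))
complete = pickle.loads(pickle.dumps(learner.get_state(include_tallies=True)))

from_partial = ToyLearner(3)
from_partial.set_state(partial)

from_complete = ToyLearner(3)
from_complete.set_state(complete)

print(from_partial.tallies == learner.tallies)
print(from_complete.tallies == learner.tallies)
```

In this toy, the partially restored object has the right primary state but starts its auxiliary tallies from scratch, which is the kind of mismatch that can make continued training behave like a restart.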

olegranmo avatar Jun 27 '21 09:06 olegranmo