pomegranate
Training an HMM from samples (supervised training)
Hi! First of all, thank you for developing this library! I want to train an HMM from given samples of observations and corresponding labels. The labels are the hidden states that the HMM will have.
I have a dataset which looks like this (just a sample here):
timestamp  sensor1  sensor2  sensor3  sensor4  action
1          0.05     0.04     0.10     0.39     A1
2          0.25     0.14     0.11     0.34     A2
3          0.15     0.34     0.13     0.36     A3
...
So, as seen above, I have 4 sensor values for each timestamp, and in the annotated dataset I also have the action (A1-A4). That basically means that for my supervised problem each observation is a 4-dimensional feature vector annotated with an action label. I saw that in pomegranate we can create the model from given samples. I tried running a supervised training procedure, but for some reason I am getting an error (ValueError: zero-dimensional arrays cannot be concatenated).
# import some libraries
import numpy as np
import pomegranate as pg
# each observation consists of the values of four different sensors (sensor1 - sensor4)
# we have three different hidden states: A1, A2, A3
# an observation sequence is given below - each list element is a vector whose dimensions correspond to sensors 1 - 4 respectively
obs_seq = np.array([[0.4, 0.32, 0.56, 0.7],[0.4, 0.82, 0.96, 0.47],[0.43, 0.12, 0.56, 0.27],[0.4, 0.9, 0.46, 0.1],[0.2, 0.32, 0.36, 0.1],[0.14, 0.267, 0.68, 0.57], [0.34, 0.762, 0.76, 0.73], [0.4, 0.22, 0.56, 0.47], [0.43, 0.12, 0.56, 0.27], [0.24, 0.19, 0.84, 0.1], [0.22, 0.32, 0.61, 0.7], [0.94, 0.234, 0.83, 0.77],
[0.34, 0.52, 0.89, 0.4],[0.9, 0.72, 0.56, 0.17],[0.43, 0.12, 0.56, 0.27], [0.64, 0.69, 0.48, 0.1],[0.25, 0.362, 0.16, 0.6],[0.34, 0.214, 0.18, 0.67],
[0.64, 0.72, 0.77, 0.1],[0.3, 0.62, 0.76, 0.37],[0.43, 0.12, 0.56, 0.27],[0.74, 0.52, 0.96, 0.1],[0.22, 0.342, 0.46, 0.5],[0.54, 0.63, 0.67, 0.27],
[0.14, 0.38, 0.26, 0.5],[0.5, 0.52, 0.12, 0.657],[0.43, 0.12, 0.56, 0.27],[0.33, 0.26, 0.93, 0.1],[0.432, 0.32, 0.66, 0.3],[0.74, 0.07, 0.43, 0.47],
[0.24, 0.22, 0.36, 0.6],[0.67, 0.32, 0.16, 0.26],[0.43, 0.12, 0.56, 0.27],[0.67, 0.22, 0.90, 0.1],[0.22, 0.314, 0.42, 0.2],[0.84, 0.17, 0.13, 0.67]])
obs_states = ["A1", "A3", "A1", "A1", "A1", "A3",
"A3", "A2", "A2", "A1", "A3", "A1",
"A3", "A3", "A1", "A1", "A1", "A3",
"A2", "A2", "A1", "A1", "A3", "A1",
"A2", "A3", "A1", "A1", "A1", "A3",
"A2", "A2", "A1", "A1", "A3", "A2",
]
states_names = ["A1", "A2", "A3"]
# building the hidden Markov model from the samples
model = pg.HiddenMarkovModel.from_samples(pg.NormalDistribution,
                                          n_components=3,
                                          state_names=states_names,
                                          X=obs_seq,
                                          labels=obs_states,
                                          algorithm='labeled')
obs_seq represents one long sequence where each vector has 4 dimensions, representing the values measured by the sensors. In the obs_states variable I have the corresponding label for each of the 4-dimensional vectors in obs_seq.
I also created a Google Colab notebook so that you can run the example yourself.
https://colab.research.google.com/drive/10ZwBef9SsF5I5i3SnPr4dyXwr8z8jdm1?usp=sharing
Thank you for your help!
I am working on the same kind of problem and I run into this error as well. Did you find a solution?
Sorry you encountered issues. Multivariate data needs to have three dimensions: either a fixed-dimension array with shape (n_samples, n_observations, n_dimensions), or a list of 2D arrays where each array has shape (n_observations, n_dimensions). Even if you only have a single example, you need to either put the data in a list with a single element or in a numpy array whose first dimension is 1. The same goes for the labels.
Here is the code I got to run:
obs_seq = np.array([[[0.4, 0.32, 0.56, 0.7],[0.4, 0.82, 0.96, 0.47],[0.43, 0.12, 0.56, 0.27],[0.4, 0.9, 0.46, 0.1],[0.2, 0.32, 0.36, 0.1],[0.14, 0.267, 0.68, 0.57], [0.34, 0.762, 0.76, 0.73], [0.4, 0.22, 0.56, 0.47], [0.43, 0.12, 0.56, 0.27], [0.24, 0.19, 0.84, 0.1], [0.22, 0.32, 0.61, 0.7], [0.94, 0.234, 0.83, 0.77],
[0.34, 0.52, 0.89, 0.4],[0.9, 0.72, 0.56, 0.17],[0.43, 0.12, 0.56, 0.27], [0.64, 0.69, 0.48, 0.1],[0.25, 0.362, 0.16, 0.6],[0.34, 0.214, 0.18, 0.67],
[0.64, 0.72, 0.77, 0.1],[0.3, 0.62, 0.76, 0.37],[0.43, 0.12, 0.56, 0.27],[0.74, 0.52, 0.96, 0.1],[0.22, 0.342, 0.46, 0.5],[0.54, 0.63, 0.67, 0.27],
[0.14, 0.38, 0.26, 0.5],[0.5, 0.52, 0.12, 0.657],[0.43, 0.12, 0.56, 0.27],[0.33, 0.26, 0.93, 0.1],[0.432, 0.32, 0.66, 0.3],[0.74, 0.07, 0.43, 0.47],
[0.24, 0.22, 0.36, 0.6],[0.67, 0.32, 0.16, 0.26],[0.43, 0.12, 0.56, 0.27],[0.67, 0.22, 0.90, 0.1],[0.22, 0.314, 0.42, 0.2],[0.84, 0.17, 0.13, 0.67]]])
obs_states = np.array([["A1", "A3", "A1", "A1", "A1", "A3",
"A3", "A2", "A2", "A1", "A3", "A1",
"A3", "A3", "A1", "A1", "A1", "A3",
"A2", "A2", "A1", "A1", "A3", "A1",
"A2", "A3", "A1", "A1", "A1", "A3",
"A2", "A2", "A1", "A1", "A3", "A2",
]])
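For reference, a minimal sketch of the full call with these reshaped arrays, assuming the from_samples arguments are otherwise the same as in the original post:

import numpy as np
import pomegranate as pg

# obs_seq now has shape (1, 36, 4): one sequence of 36 four-dimensional observations.
# obs_states has shape (1, 36): one label per observation in that single sequence.
model = pg.HiddenMarkovModel.from_samples(pg.NormalDistribution,
                                          n_components=3,
                                          state_names=["A1", "A2", "A3"],
                                          X=obs_seq,
                                          labels=obs_states,
                                          algorithm='labeled')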
What's the difference between n_samples, n_observations, and n_dimensions?
HMMs can be trained on one sequence or on multiple sequences. n_samples is the number of sequences, n_observations is the number of elements in each sequence, and n_dimensions is the number of dimensions those elements have.
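To illustrate those shapes with hypothetical values: two sequences of different lengths cannot be stacked into one 3D array, so they are passed as a list of 2D arrays, with one list of labels per sequence.

import numpy as np

# n_samples = 2 sequences, n_dimensions = 4 sensor values per observation.
# The sequences have different lengths (n_observations = 3 and 2), so they are
# passed as a list of 2D arrays rather than one (n_samples, n_observations, n_dimensions) array.
seq_person_1 = np.array([[0.4, 0.32, 0.56, 0.7],
                         [0.4, 0.82, 0.96, 0.47],
                         [0.43, 0.12, 0.56, 0.27]])   # shape (3, 4)
seq_person_2 = np.array([[0.2, 0.32, 0.36, 0.1],
                         [0.14, 0.267, 0.68, 0.57]])  # shape (2, 4)

X = [seq_person_1, seq_person_2]     # list of sequences
labels = [["A1", "A3", "A1"],        # one label per observation of sequence 1
          ["A2", "A2"]]              # one label per observation of sequence 2
# X and labels could then be passed to from_samples as in the example above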
@jmschrei, thank you also for mentioning training on multiple sequences (I have a scenario where the same experiment is performed by different people, and thus the sensor measurements and the sequence of actions might differ).
I have two other questions now. In my example each observation consists of 4 sensor values. Can I somehow set the distribution and the value range for each of the values (in my case all the values of the 4-dimensional vector are in the range 0 - 1) so that they can be considered by the HMM? My second question is: how do I do prediction on a new sensor observation sequence? The way I tested it was by calling model.predict([[0.4, 0.32, 0.56, 0.7], [0.4, 0.82, 0.96, 0.47], [0.43, 0.12, 0.56, 0.27]]), and I think this is the right way, since the output was [1, 0, 0].
(By the way, how can I see the actual labels, and what is their ordering? My labels are "A1", "A2", "A3", not 0, 1, 2.)
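Not an official answer, but one way to map the integer indices returned by predict back to state names in the pre-1.0 API is via model.states, assuming its order matches the indices returned by predict (worth verifying, since the ordering is fixed when the model is baked and need not match the order of state_names):

# hypothetical test sequence of 4-dimensional observations
test_seq = [[0.4, 0.32, 0.56, 0.7],
            [0.4, 0.82, 0.96, 0.47],
            [0.43, 0.12, 0.56, 0.27]]

idx = model.predict(test_seq)                  # e.g. [1, 0, 0] as reported above
names = [model.states[i].name for i in idx]    # look up the name of each predicted state
print(names)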
Thank you for opening an issue. pomegranate has recently been rewritten from the ground up to use PyTorch instead of Cython (v1.0.0), and so all issues are being closed as they are likely out of date. Please re-open or start a new issue if a related issue is still present in the new codebase.