HMM for Multidimensional Categorical Data

Multidimensional Scotland (taken by me)

Are you dealing with sequential data? Do you like classical machine learning? Have you heard about the Hidden Markov Model (HMM) but never had a chance to try it on your data? Are you having trouble finding a decent HMM package in Python? If you answered 'yes' to at least one of the questions above, I invite you to read this article.

I will first explain a bit about the HMM and then present a great Python package, with code examples.

An HMM is a model that lets you find the most probable sequence of hidden states, given the data you have (if that is not clear yet, follow the example). The model is widely used in many domains: sound processing, language modelling, genetics and more.

In the following example I will show how an HMM was used to predict user behaviour: segments of walking or driving activity. The prediction is based on data from cellphone sensors, such as a step counter, GPS locations, WiFi connections, etc.

To use an HMM we first need to formalise a statistical model of the world, and everything we know about the world will be summarised in two small matrices: a Transition Matrix and an Emission Matrix. Here is a simplified example of how to build these matrices and use them to predict user behaviour from cellphone sensors:

Let's assume a user can be in one of three states: Driving, Walking or In Place (Place). And let's assume we have labeled data, where users reported minute by minute which state they were in during the day. For example:

A transition probability between state A and state B is the probability that a person, being in state A, will change to state B. We can find the transition probabilities from our labeled data. For example, the transition probability between place and walk will be the number of place minutes followed by walk divided by the total number of place minutes in the data.
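The counting described above can be sketched in a few lines of plain Python (the minute-by-minute labels here are made up for illustration):

```python
from collections import Counter

# Hypothetical minute-by-minute labels reported by one user.
labels = ["place"] * 10 + ["walk"] * 3 + ["drive"] * 6 + ["place"] * 5

states = ["place", "walk", "drive"]

# Count transitions between consecutive minutes.
pair_counts = Counter(zip(labels, labels[1:]))
from_counts = Counter(labels[:-1])

# P(B | A) = count(A minutes followed by B) / count(A minutes with a successor)
transition = {
    a: {b: pair_counts[(a, b)] / from_counts[a] for b in states}
    for a in states
}

print(transition["place"]["walk"])  # fraction of place minutes followed by walk
```

Each row of the resulting matrix sums to 1, since from any state the user must move to some state in the next minute.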

We can describe the transition probabilities in one of two ways — a diagram or a Transition Matrix:

We will start the chain with one of the states; let's choose place. Then we 'roll a place die', meaning we have a 90% chance of getting place again, a 7.5% chance of getting drive and a 2.5% chance of getting walk. Given those probabilities, we will probably get place again, and we keep rolling the 'place die' until we reach another state, say drive. Then we put the 'place die' aside and use the 'drive die': 80% to continue driving, 15% to get place and 5% to get walk.

At the end of the day, after rolling the different dice 1,440 times (the number of minutes in a day), we will get a chain of events:
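The 'dice rolling' is just sampling from the transition-matrix row of the current state, once per minute. A minimal sketch (the walk row is invented here, since only the place and drive probabilities are given above):

```python
import random

random.seed(0)  # make the sampled chain reproducible

# Transition matrix: place and drive rows from the text, walk row invented.
transition = {
    "place": {"place": 0.90, "drive": 0.075, "walk": 0.025},
    "drive": {"drive": 0.80, "place": 0.15, "walk": 0.05},
    "walk":  {"walk": 0.70, "place": 0.20, "drive": 0.10},
}

chain = ["place"]                      # start the chain in place
for _ in range(1439):                  # 1,440 minutes in a day total
    row = transition[chain[-1]]        # the 'die' of the current state
    nxt = random.choices(list(row), weights=list(row.values()))[0]
    chain.append(nxt)
```

Because place is "sticky" (90% self-transition), the chain naturally produces long runs of place interrupted by shorter drive and walk segments.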

This chain is not yet linked to any information we have from our features (remember we have data from cellular sensors) but it already stores very basic insights about the nature of our data, like the typical duration of each state and the typical sequence of states.

In the world of our problem, the states are hidden and we need to find them based on the observed features, which are the cellphone sensors output (movement, step counter, connection to wifi, etc.).

To include the features in our model we will build the Emission Matrix.

Here is an example of an emission probability matrix for binary features (features that can only take True/False values):

Building this matrix is also very straightforward. For example, the probability of seeing a wifi connection in the place state is the number of minutes with a wifi connection while the user was in the place state, divided by the total number of place-state minutes.
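As a sketch, here is that computation for a single binary wifi feature, with made-up per-minute records:

```python
# Hypothetical per-minute records: (state, wifi_connected)
records = [
    ("place", True), ("place", True), ("place", False),
    ("walk", False), ("walk", True),
    ("drive", False), ("drive", False),
]

states = ["place", "walk", "drive"]

# P(wifi=True | state) = wifi minutes in that state / all minutes in that state
emission = {}
for s in states:
    minutes = [wifi for state, wifi in records if state == s]
    emission[s] = {True: sum(minutes) / len(minutes)}
    emission[s][False] = 1 - emission[s][True]

print(emission["place"][True])  # 2 of the 3 place minutes had wifi
```

For more binary features (movement, charging, etc.), the same loop is repeated per feature, giving one emission row per state.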

Now that we have both the transition and emission matrices, we can move on to the practical Python part!

Our data here is a dummy dataset that I created, based on real data. There are 98 minutes, and for every minute we have a step count. I created 4 bins of step counts, [0, 50, 100, 150] steps per minute. The goal in this example is to classify the minutes into 2 states: walk and still.

Setting the distribution for the steps channel for each state:

Here, I also added start probabilities, which are the probabilities to start the chain from each state.
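The package takes care of the decoding itself; as a rough illustration of what happens under the hood, here is a bare-bones Viterbi decoder for the two-state walk/still problem. All the probabilities below are illustrative, not fitted to the article's dataset:

```python
import math

states = ["still", "walk"]
start = {"still": 0.5, "walk": 0.5}          # start probabilities
trans = {"still": {"still": 0.9, "walk": 0.1},
         "walk":  {"still": 0.1, "walk": 0.9}}
# Emission: probability of each step-count bin ([0, 50, 100, 150]) per state.
emit = {"still": {0: 0.85, 50: 0.10, 100: 0.04, 150: 0.01},
        "walk":  {0: 0.05, 50: 0.20, 100: 0.50, 150: 0.25}}

def viterbi(obs):
    # Dynamic programming in log space over the two states.
    v = [{s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}]
    back = []
    for o in obs[1:]:
        col, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: v[-1][p] + math.log(trans[p][s]))
            ptr[s] = prev
            col[s] = v[-1][prev] + math.log(trans[prev][s]) + math.log(emit[s][o])
        v.append(col)
        back.append(ptr)
    # Trace back the most probable state sequence.
    best = max(states, key=lambda s: v[-1][s])
    path = [best]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

print(viterbi([0, 0, 100, 150, 100, 0]))
# → ['still', 'still', 'walk', 'walk', 'walk', 'still']
```

Note how the single trailing 0-step minute is decoded as still only because the sticky transition probabilities are outweighed by the emission evidence; with a noisier emission matrix the model would smooth it into the walk segment.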

Now we can add some complexity and add one more channel to our data (and add as many as we wish later on, using the same code).

In addition to the step counter, we will add the distance (in meters) between the user's location in the previous minute and in the current one. Let's break the distances into 4 bins: [0, 20, 50, 75].

Our data will now look like this:

Here we create separate distributions for each feature and each state and then combine the features using IndependentComponentsDistribution.

Exactly the same…

The prediction takes into account data from both features: steps and distance.
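Combining channels with IndependentComponentsDistribution amounts to assuming the features are conditionally independent given the state, so the joint emission probability of a minute is simply the product of the per-feature probabilities. A toy sketch with invented numbers:

```python
# Per-state bin probabilities for each channel (illustrative numbers only).
emit_steps = {"still": {0: 0.85, 50: 0.15}, "walk": {0: 0.10, 50: 0.90}}
emit_dist  = {"still": {0: 0.80, 20: 0.20}, "walk": {0: 0.25, 20: 0.75}}

def emission(state, steps_bin, dist_bin):
    # Channels are assumed conditionally independent given the state,
    # so the joint emission probability is a simple product.
    return emit_steps[state][steps_bin] * emit_dist[state][dist_bin]

print(emission("walk", 50, 20))  # 0.9 * 0.75 = 0.675
```

Adding a third channel just multiplies in one more factor, which is why the same decoding code scales to as many features as we wish.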

Note that we don’t have to bin our data — the package allows us to use continuous distributions like Gaussian distribution. We can even combine discrete distributions for some features and continuous for other features.

I hope you've found this post clear and useful, and I'd love to hear from you if you have any questions!

