I know what you did last session: clustering users with ML
A (musical) journey into our first Data Science project
In the process of starting my career as a Data Scientist at Stoodi — an ed-tech company based in São Paulo — I understood something really important as I was listening to that Prince song.
Data Science doesn’t have to be cool to rule my world.
Even though we love to use the latest model architectures and frameworks, Data Science projects don’t need to use the latest technology to deliver value.
Some of us are lucky enough to work at companies where Data Science is already established as a core business competence and where huge teams work on multiple problems. The reality is that most of us work with small teams and scarce resources, but Homeric expectations of what can be done.
In this kind of context, projects that deliver fast results and inform decisions can be of great value: the low-hanging fruit.
In our company, one clear piece of low-hanging fruit was understanding our users’ behaviour right after sign-up. This article covers our process, our main learnings and how you can go about doing it yourself.
As that James Brown song put it:
This is a platform’s world.
This is a platform’s, platform’s world.
The internet is ruled by platforms. Stoodi is one of those. We help students all over the country prepare for ENEM (the Brazilian cousin of the SAT) with videos, exercises, essay corrections and a personalized study plan.
Even though we design our systems to help users see value in our products and use features in a certain way, there’s a gap between theory and reality. The truth is, there are endless ways a user can engage with our product.
Given that we can’t analyze users one by one, we can harness the power of clustering algorithms to find patterns and therefore understand their behaviour in a more concise way.
By now, you’re probably convinced that understanding your users’ behaviour is a great idea and you’re eager to try it for yourself. The next logical step in the process is choosing which data to analyze.
To help you understand that, I can only quote one of the best pop songs in history:
Tell me what you want. What you really, really, want.
As in any Data Science project, it all comes down to the data you have available. If your data is consistent and correct — and that’s a big “if” — you can start analyzing what should and shouldn’t be included in your model.
When you start a project around your users’ behaviour, it’s paramount that you have a clear objective. Given that there is an infinite set of actions your users can take, which ones do you want to understand at a deeper level? What are you trying to find out?
In our case, we didn’t understand how our users were engaging right after they signed up, or whether they were doing the basic activities our platform supports. Even though we could have analyzed tracking events, their order and duration, we chose to simplify the analysis to move faster. Understanding what they were actually doing could help us create better user flows and design a process in which we could show them value early on.
One of the things we understood was that our project should aim for simplicity. We decided that the activities we included needed to be open to all users, subscribers or not.
Another thing we prioritized was using data that could produce results understandable by everyone in the company. In this case: the number of exercises submitted, videos started, videos finished and study plans configured during the analyzed period — not some crazy ratios or super-complex made-up metrics.
In our search for simplicity, we decided it would be interesting to use PCA — a dimensionality reduction technique — to check whether there were unnecessary features and to help us get rid of redundancy. In this step, we discovered that 2 components could explain more than 80% of our data’s variance, so we narrowed it down to those 2.
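For reference, here is a minimal sketch of that PCA check, assuming a pandas DataFrame `df` with one row per user (the column names below are illustrative, not our actual schema):

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: one row per user, one column per activity count
features = ["exercises_submitted", "videos_started",
            "videos_finished", "study_plans_configured"]
X = StandardScaler().fit_transform(df[features])  # scale counts before PCA

pca = PCA().fit(X)
# Cumulative share of variance explained by the first k components
print(pca.explained_variance_ratio_.cumsum())

# If the first 2 components explain more than 80% of the variance, keep only those
X_2d = PCA(n_components=2).fit_transform(X)
```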
After all the data was ready, I was feeling just like Carly Rae Jepsen:
Hey, I just met you and this is crazy, but here’s my data
Run an algorithm on it maybe?
When analyzing our options, we opted for two different approaches.
The first one, called Density-based spatial clustering of applications with noise — or DBSCAN — sounded great. Why?
DBSCAN needs only two parameters:
- How close points should be to each other to be considered part of the same group (Epsilon)
- How many points are necessary to create a group (minPoints)
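Continuing the sketch above, this is roughly how those two parameters show up in scikit-learn’s DBSCAN (the values here are placeholders, not the ones we settled on):

```python
from sklearn.cluster import DBSCAN

# eps: how close points must be to count as neighbours (Epsilon)
# min_samples: how many points it takes to form a dense group (minPoints)
db = DBSCAN(eps=0.5, min_samples=10).fit(X_2d)

labels = db.labels_  # one cluster index per point; -1 means noise/outlier
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"{n_clusters} clusters, {(labels == -1).sum()} outliers")
```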
There are some tips for defining the values of these parameters, but in our case, we decided to test a range for all of them, hoping to find beautiful clusters. If only we knew Coldplay’s advice…
Nobody said clustering was easy.
No one ever said it would be this hard.
This is what our clusters looked like:
Looking at this, one can understand the main pitfalls and advantages of DBSCAN.
DBSCAN is great for clustering data with outliers. The problem comes when the reality of your data makes it look like “doing something” is the outlier behaviour.
There was a huge number of users who hadn’t done any of the basic activities we were considering, so to the algorithm, the main separation among points was basically “active”, “inactive” and outlier behaviour.
Another thing to consider is that DBSCAN has a hard time clustering data with large density differences. Think about it: there’s only one way of doing nothing, but there are multiple ways of engaging in activities, which creates different densities of points in the n-dimensional space analyzed.
After our initial disappointment with DBSCAN, we decided to go the classical way: K-Means. For it to run, you need to define how many groups you want to find. Given that we had no idea, we decided to test values ranging from 2 to 20 clusters and see what happened.
In K-Means, every point is assigned to a cluster. Given that the center of each cluster is calculated as the mean of all points in it, the existence of outliers creates obvious problems.
The solution we found was to filter out users with unusual behaviour — which in our case meant values reached only by the top 1% most active users — to avoid getting clusters that don’t reflect reality.
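A sketch of that filter, continuing the earlier example: the 99th-percentile cutoff comes from our project, but everything around it is illustrative.

```python
import numpy as np

X_raw = df[features].to_numpy()

# 99th percentile of each activity, column by column
upper = np.quantile(X_raw, 0.99, axis=0)

# Keep only users at or below the threshold on every feature
mask = (X_raw <= upper).all(axis=1)
X_filtered = X_raw[mask]
print(f"kept {mask.mean():.1%} of users")
```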
After running the algorithm, the results seemed promising. But just like Whitney Houston once did, I was asking myself:
How will I know? Don’t trust your feelings.
Clustering can be deceiving.
In an unsupervised algorithm, there are no labels with the right answer. Whether your results are good or not depends on what “good” means for your problem.
For that purpose, there are two main metrics to understand the relationship among the clusters found.
Inertia gives you a sense of how far apart the points within a cluster are. You probably want points in the same cluster to be close, right? In that sense, there’s a balance to strike between choosing too few clusters, whose points end up too far apart, and too many clusters, which become too specific and difficult to understand.
That’s why we usually aim for the elbow of the inertia curve. In our case, that meant choosing between 4 and 6 clusters.
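A sketch of that sweep with scikit-learn’s KMeans, whose `inertia_` attribute is the within-cluster sum of squared distances:

```python
from sklearn.cluster import KMeans

inertias = {}
for k in range(2, 21):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_filtered)
    inertias[k] = km.inertia_  # within-cluster sum of squared distances

# Plot inertia against k and pick the "elbow" where the curve flattens
for k, value in inertias.items():
    print(k, round(value, 1))
```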
Another interesting metric is the Silhouette Index. This metric helps us understand how far the points in one cluster are from the other clusters.
In this case, bigger values are better, indicating cohesive, well-separated clusters. For our data, the biggest value came with 5 clusters — which was in line with the best results for inertia.
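The Silhouette Index comes ready-made in scikit-learn; a sketch of checking it over the same range:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 21):
    labels = KMeans(n_clusters=k, n_init=10,
                    random_state=42).fit_predict(X_filtered)
    # Mean silhouette over all points: closer to 1 means
    # cohesive, well-separated clusters
    print(k, silhouette_score(X_filtered, labels))
```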
Another step we took was analyzing the results for different cohorts, to check whether they looked similar. After that consistency check, the results did, in fact, look good!
Once the metrics from the model made some sense, I may have pulled out the P!nk a little too early:
So, so what?
I’m a data star, I got my data moves
And I don’t need you
After all that, I was done — or so I thought. This is what success looked like after running the algorithm.
Even though these values give us a sense of what was going on, it’s impossible to determine whether these results generate actual knowledge without context. Only someone with domain expertise can tell if the results make any kind of sense.
After getting a table like this, summarizing the results with box plots and checking the data distributions, we understood something of actual value.
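A sketch of that summary step, continuing the earlier examples and using the 5 clusters suggested by the metrics above:

```python
from sklearn.cluster import KMeans

km = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X_filtered)

# Attach labels to the filtered users and summarize each cluster
clustered = df.loc[mask, features].assign(cluster=km.labels_)

# Median activity per cluster: a quick read on "who does what"
print(clustered.groupby("cluster").median())

# Box plots per feature make the behavioural differences visible
clustered.boxplot(column=features, by="cluster", figsize=(12, 6))
```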
There seemed to be five types of users in the first day after sign-up on our platform.
Imagine me, a Data Scientist new to the field, working on his first project and getting awesome results. That day, I was feeling just like Britney Spears:
I wanna scream
and shout
and let it all out
More often than not, the results of Data Science projects stay confined to notebooks in GitHub repositories that only a few people have access to. When we skip this important step of translating project results into accessible knowledge, the entire business loses.
So we built a presentation around personas based on our discoveries: we gave them names, colors and a family to help everyone understand what was happening and who our users are, and, of course, to think together about how each of us, in our own context, could apply these discoveries to make a better product.
After this presentation, we saw the impact of even the vocabulary we had constructed when talking about users at a strategic level.
Seeing this kind of impact is always a huge win for Data Science teams, and I strongly recommend that your team go the extra mile to help people understand what you are doing. It helps everyone grasp the importance of such initiatives and builds leeway for future projects.
I hope you have learned something from our experience. Feel free to contact me with feedback, questions and suggestions.
Thank U, (and see you in the) Next!
Obviously, this post had to come with a playlist. Tune in to listen to all the amazing songs in this journey of clustering our users!
This is the summary version of my PAPIs LATAM 2019 presentation. When the video is available, I’ll post it here. Thank you!