Over the past few weeks we have been talking with prospective users and received both encouraging and valuable feedback - thank you all! In addition to learning about the features our users would appreciate (some of which we are already implementing - more on this in the next product update!), we have received a lot of interest in how we model churn. In this post I will address the topic from a practical point of view, i.e. no previous knowledge of data science is required. Let's get to it!
A note on the terminology in this post: from here on, a customer refers to our customer, i.e. a Kirnu user, while a user refers to an end user of our customer's SaaS application.
The single most important aspect in modelling is to define what it is that we are trying to figure out. In our case we are interested in churn, which has a binary state: either a user has churned or not. Observing the current state is easy: simply consult your local account manager or CRM to see how an account is doing.
Often, however, we are more interested in the future: will the customer remain a paying customer after the end of the current contract renewal period? In answering this, many organizations rely on the accumulated expertise and assessment of the account or customer success manager. This is completely fine, as sales and customer success personnel have in-depth knowledge of their clients, but once you need a more robust, data-driven approach to estimating future churn, Kirnu is your best friend.
Having figured out what we are trying to achieve (estimating the probability of churn after the current contract period), we need to find the data from which our model will learn. In the context of machine learning, the training data comprises the features and the label for each observation¹. For our training data, one observation is produced for each user every day. So, for example, if our customer (a SaaS company) has 500 users, Kirnu will generate 15,000 observations over a period of 30 days.
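To make the observation-per-user-per-day idea concrete, here is a minimal sketch in Python. The `Observation` structure and its field names are illustrative only, not Kirnu's actual schema:

```python
from dataclasses import dataclass

@dataclass
class Observation:
    """One training example: a feature vector plus a churn label."""
    user_id: str
    features: list[float]   # extracted from the user's event data
    churned: bool           # the label, known only after the fact

def observation_count(num_users: int, num_days: int) -> int:
    """One observation is generated per user per day."""
    return num_users * num_days

# A customer with 500 users accumulates 15,000 observations in 30 days.
print(observation_count(500, 30))  # → 15000
```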
High-level overview of our model.
While collecting the (event) data is a relatively straightforward technical task, preprocessing the data and extracting / engineering features is where the magic happens. Our preprocessing algorithms take the raw data and extract meaning out of it to prepare the features (inputs) for our model. As data scientists, it is our job to figure out what this meaning should be and to instruct the preprocessing algorithm to mold the raw data into a representation of it.
For example, if we considered a user's activity level an important factor in determining churn probability, we could use the number of events for that user as a proxy and feed it to our model as an input. With this approach, however, we would lose a lot of potentially important information, such as the types of events / actions the user has performed. Preprocessing the data in such a manner would treat a user who has clicked the same button 1,000 times over and over again as comparable to a user who generated 1,000 events by using the platform's actual features as intended. To improve the quality of information available to the model, we could also feed it the distribution of event types a user has performed, and so on. With this kind of incremental approach, we build more meaning into the data the model learns from.
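As an illustration of this incremental approach (the function and event names are hypothetical, not our actual preprocessing code), the sketch below derives both a raw event count and an event-type distribution from a user's event stream:

```python
from collections import Counter

def activity_features(events: list[str]) -> dict[str, float]:
    """Turn a user's raw event stream into model inputs.

    A bare event count treats 1,000 identical clicks the same as
    1,000 varied actions, so we also include the distribution of
    event types to preserve that information.
    """
    counts = Counter(events)
    total = len(events)
    features = {"event_count": float(total)}
    for event_type, n in counts.items():
        features[f"share_{event_type}"] = n / total
    return features

button_masher = ["click_button"] * 6
engaged_user = ["create_report", "invite_user", "export_csv",
                "create_report", "share_report", "invite_user"]

# Same event count, very different distributions.
print(activity_features(button_masher))
print(activity_features(engaged_user))
```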
In preprocessing the data, we have to strike a delicate balance between various aspects, including complexity and granularity. We are striving to find a metric, or a combination of metrics, that separates users' churn probabilities from one another while maintaining predictability. At one extreme, we could feed the algorithm a user's latest 1,000 events in sequential order. This would be a sure way to differentiate users from one another: it is very unlikely that two users would have the exact same 1,000 events. However, such a micro-level approach would most likely make inference challenging. At the other extreme is the previously mentioned example of counting a user's events within a given time period: many users could have the same number of events, or at least be very close, making them hard to distinguish from one another. And like the first extreme, such an approach would likely suffer from weak statistical inference.
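A toy illustration of the two extremes, on purely hypothetical data: a single event count cannot separate two very different users, while their full event sequences are trivially distinct but far too granular to generalize from.

```python
# Two users with identical event counts but very different behaviour:
user_a = ["login", "click_button"] * 500             # 1,000 repetitive events
user_b = [f"feature_{i % 40}" for i in range(1000)]  # 1,000 varied events

# Coarse extreme: a single count cannot tell them apart.
print(len(user_a) == len(user_b))  # → True

# Fine extreme: the full sequences are trivially distinguishable,
# but 1,000-step sequences are hard for a model to learn from.
print(user_a == user_b)            # → False
```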
We are currently experimenting with two very different alternatives for extracting features from (preprocessing) the event data. Although we are very excited about both approaches and would be thrilled to discuss them in more detail, we have decided not to do so publicly, as they constitute a trade secret. We do shed a little more light on them with our customers as we experiment, as they are often able to provide valuable insights. What we are certain of, however, is that both alternatives provide very robust approaches to churn modelling.
Through our API, our customers are able to provide:
As mentioned previously, one observation is generated for each user every day. Given our definition, churn is only observable after the fact (ex post). Therefore, observations for a given user can only be generated once their churn state - and with it the label for the observation - is known.
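The ex-post labelling rule can be sketched as follows. This is a simplified illustration: the field names and the `renewed` flag are assumptions for the example, not our production schema.

```python
import datetime

def label_observation(snapshot_date, contract_end, renewed):
    """Label a daily snapshot once the outcome is known (ex post).

    `renewed` stays None until the contract period has resolved;
    only then can the observation enter the training data.
    """
    if renewed is None:
        raise ValueError("Churn state not yet observable - cannot label.")
    return {
        "date": snapshot_date,
        "days_to_renewal": (contract_end - snapshot_date).days,
        "churned": not renewed,  # the label: churned means not renewed
    }

obs = label_observation(datetime.date(2021, 3, 1),
                        datetime.date(2021, 4, 1),
                        renewed=False)
print(obs["churned"])  # → True
```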
Now that we have the data set up, let’s briefly discuss our model architecture.
Our model is based on a neural network architecture. The benefit of a neural network over other modelling techniques, such as a (linear) regression model, is that neural networks are better able to capture non-linear relationships between features. When modelling a complex phenomenon such as churn, where the relationships between input features are not known in advance, this ability to capture hidden, non-linear relationships is essential. Sure, we could manually add non-linear terms to a regression model, but the number of possible non-linear relations is endless, and doing so would introduce other challenges to our model.
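To illustrate why the non-linear activations matter, here is a generic one-hidden-layer forward pass. This is a sketch only - neither the architecture nor the weights reflect our actual model:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def forward(features, hidden_weights, output_weights):
    """Forward pass of a one-hidden-layer network.

    The non-linear activation between the layers is what lets the
    model capture feature interactions that a plain linear
    regression would miss.
    """
    hidden = [sigmoid(sum(w * x for w, x in zip(row, features)))
              for row in hidden_weights]
    logit = sum(w * h for w, h in zip(output_weights, hidden))
    return sigmoid(logit)  # a churn probability in (0, 1)

# Toy weights for two input features and two hidden units.
p = forward([0.8, 0.1], [[2.0, -1.0], [-1.5, 3.0]], [1.2, -0.7])
print(0.0 < p < 1.0)  # → True
```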
Illustration of our neural network.
The main practical drawback of using a neural network is the loss of interpretability: we are not able to describe the relationship between individual features and the model's output. This is a direct consequence of the complexity of neural networks, the same characteristic that makes them so powerful. Recent advances in data science have introduced some interpretability to certain types of neural networks. For example, a recent paper introduced a technique called concept whitening to build meaning into the convolutional neural networks used in image recognition. This, however, is not suitable for our purpose of churn detection.
A careful reader might also have noted that while we say Kirnu estimates the probability of churn, the labels in our training data are binary (i.e. churn or no churn), without any probabilities. The probability is a product of the neural network as well: by applying a logistic activation in the last layer of our network, the model is instructed to output a probability instead of a binary (churn or no churn) value.
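The logistic activation itself is simple: it squashes the network's raw, unbounded output into the (0, 1) range, which is what lets us read the output as a probability.

```python
import math

def logistic(z: float) -> float:
    """Logistic (sigmoid) activation: maps any real-valued network
    output into (0, 1), so it can be read as a probability even
    though the training labels are binary."""
    return 1.0 / (1.0 + math.exp(-z))

for raw_output in (-4.0, 0.0, 4.0):
    print(round(logistic(raw_output), 3))
# → 0.018, 0.5, 0.982: strongly negative outputs map near 0 (no
# churn), strongly positive ones near 1 (churn), zero to 0.5.
```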
Now that the training data and the model have been introduced, it is time for our model to learn. The model learns by being exposed to as many observations as possible. Having enough data is crucial, and the accuracy of the model increases as more observations accumulate. How much is enough is impossible to pinpoint, and varies case by case. As Kirnu generates an observation for each user each day, the more users our customer has, the faster the critical threshold for sufficient data is reached. In the meantime, our customers will benefit from our other features, such as health metrics and trigger-based notifications.
In action, we feed a user's data to the model without the label (as it is not known at the time) to get an output, i.e. a prediction. Kirnu generates new churn probability estimates daily, to match the frequency of the input data. Kirnu is a prime example of the circular economy: once the data has been used to make a prediction, it is recycled into the training data. Once the model is retrained on the new training data, the cycle repeats.
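The predict-then-recycle cycle can be sketched as follows. Everything here is illustrative: the `AverageModel` stub stands in for the real network, and the loop is heavily simplified.

```python
class AverageModel:
    """Stub model standing in for the real network: predicts the
    base churn rate observed in its training data."""
    def __init__(self):
        self.rate = 0.5  # prior before any training data exists
    def fit(self, data):
        if data:
            self.rate = sum(1 for _, churned in data if churned) / len(data)
    def predict(self, features):
        return self.rate

def daily_cycle(model, training_data, todays_snapshots, resolved_outcomes):
    """One turn of the predict-then-recycle loop.

    Predict on today's unlabelled snapshots, fold snapshots whose
    outcome has since resolved back into the training data, retrain.
    """
    predictions = {u: model.predict(f) for u, f in todays_snapshots.items()}
    for user, churned in resolved_outcomes.items():
        if user in todays_snapshots:
            training_data.append((todays_snapshots[user], churned))
    model.fit(training_data)
    return predictions

model, data = AverageModel(), []
preds = daily_cycle(model, data,
                    {"alice": [3.0], "bob": [0.2]},
                    {"alice": False, "bob": True})
print(preds)       # predictions made before retraining
print(model.rate)  # → 0.5 (one of the two resolved users churned)
```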
This concludes our introduction to modelling churn. I hope this overview has taught you something new or given new insights! While this introduction summarizes our algorithm as of April 2021, we are constantly making minor improvements to increase its robustness. Who knows what the future holds?
While Kirnu is primarily powered by the event data it collects on its own, we are currently working towards making other event data collected by our customers compatible with Kirnu. This would speed up the learning process for our algorithm.
If you are interested in how Kirnu could help your SaaS business, please get in touch by contacting me at email@example.com. We are currently welcoming early adopters to our private beta, which we expect to be ready by the end of May 2021.
¹ In the context of regression models, features are referred to as independent or explanatory variables, and labels as dependent or response variables.