24.05.22
Making useful clusters using SHAP values
Cluster analysis is an approach to finding meaningful subgroups in a population. It is usually an 'unsupervised' method - machine learning jargon for a method that does not require labels in order to work.
The usefulness of cluster analysis can be limited because the results are often ambiguous and depend on manually chosen parameter settings. Once the clusters are produced, there is often no obvious way to assess their quality.
Aidan Cooper has written an article about how to use SHapley Additive exPlanations (SHAP) values to do what he calls supervised clustering.
The approach starts by building a prediction model on an appropriate variable of interest. In his example, he uses COVID-19 infection status as the target variable and patient symptoms as the inputs.
Once you have a predictive model, you can generate SHAP values for each variable and use those, rather than the raw inputs, to build clusters. SHAP values measure how much each variable contributed to the model's prediction for a given individual. For example, if an individual has a persistent cough and the model predicted that they had COVID, the SHAP value for the 'cough' variable would be large and positive for that individual, provided cough is a strong predictor in the model.
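To make this concrete, here is a minimal sketch in Python using only the standard library. It is not the code from the article. The toy symptom data, the coefficients of the "fitted" model and the choice of k-means are all my own assumptions for illustration. It relies on one known shortcut: for a linear model with independent inputs, the SHAP value of a variable is its coefficient times the difference between the individual's value and the average value.

```python
import random

random.seed(0)

# Hypothetical toy data: each row is one patient, 1 = symptom present.
FEATURES = ["cough", "fever", "loss_of_smell"]
X = [[random.randint(0, 1) for _ in FEATURES] for _ in range(200)]

# Assumed coefficients of an already-fitted logistic regression
# (log-odds scale). In practice these would come from training.
WEIGHTS = [0.8, 1.2, 2.5]
BIAS = -2.0


def log_odds(row):
    """Model output on the log-odds scale for one patient."""
    return BIAS + sum(w * x for w, x in zip(WEIGHTS, row))


# SHAP values for a linear model, assuming independent features:
# phi_i = w_i * (x_i - mean(x_i)).
means = [sum(col) / len(X) for col in zip(*X)]
shap_values = [
    [w * (x - m) for w, x, m in zip(WEIGHTS, row, means)] for row in X
]


def squared_distance(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def kmeans(points, k, iters=20):
    """Plain k-means; returns one cluster label per point."""
    centroids = random.sample(points, k)
    labels = []
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(pt, centroids[c]))
            for pt in points
        ]
        # Move each centroid to the mean of its members (skip empty clusters).
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Cluster patients on their SHAP values rather than on the raw symptoms.
labels = kmeans(shap_values, k=3)
for c in sorted(set(labels)):
    print(f"cluster {c}: {labels.count(c)} patients")
```

Because each SHAP value is scaled by the model coefficient, a variable that matters more to the prediction (here the assumed 'loss_of_smell' weight) pulls patients further apart in the clustering space than a weaker one. For non-linear models you would normally compute the SHAP values with a dedicated library, and any clustering algorithm could replace the k-means used here.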
The advantage of this approach is that you get well differentiated clusters where each variable is weighted according to its relationship with the target variable.
I really like this approach; in fact, I independently used this approach recently for a client and am happy to see somebody else is using it.
One limitation should be noted: the clusters are heavily dependent on the target variable, so if you don't have an obvious target, you can't use this method of supervised clustering. But I think most of the time you will have a clear target, so this shouldn't be an issue.