Cluster Analysis

Overview

When dealing with complex data it can be useful to explore whether the data can be grouped, or clustered, into sets of similarly behaving samples.

Version: p:IGI+ 2.0+ (Sept 2021)


Usage: Data --> New Clustering..., or right-click in a page and select Create new Clustering...


How to use in practice

Clustering can be a useful technique for grouping together similar samples, for instance when looking for oil families. In p:IGI+ we have currently implemented standard k-means clustering, which allows the user to explore a range of clustering options. A more complete explanation of the use of clustering is provided in the software training courses.

Selecting the 'training data'

Clustering requires a set of data on which to 'learn' the clusters. We will call this the training set. We create a training set for any model using a page, and a sample set. The page defines the properties (columns) that will form the inputs - non-numeric inputs will be ignored. The sample set defines the subset of the data that will be used to 'train' the model.

Using a page, and applying the sample set to that page allows you to view the data that will be used for training before you attempt to learn the model. This is good practice and allows you to understand for example the degree and possible pattern of missing data. Once you have created the page and applied that sample set, you can right-click on the page and select Create new Clustering... from the menu options. This will prompt you to name the new model.

We suggest using as meaningful a name as possible, possibly including the word cluster. Choosing Create brings up the clustering artefact, with the training set information populated:

At this point you have several decisions to make regarding how to pre-process the data. If using raw GC data it will be important to normalise the data per sample, to account for potentially very different instrument sensitivities / injection volumes if using unknown, height or area data. This should not be necessary, and may not be desirable, if you are using concentration or ratio data. Normalisation is applied per sample and ensures that the measurements sum to one (accounting for the number of missing values) - this puts the emphasis on the 'shape' of the n-alkane profile.
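The per-sample normalisation described above can be sketched as follows. This is an illustrative example only, not the p:IGI+ implementation; the helper name and the use of NaN for missing values are assumptions.

```python
import numpy as np

def normalise_per_sample(X):
    """Scale each row so its non-missing values sum to 1.

    This emphasises the 'shape' of e.g. an n-alkane profile rather
    than absolute magnitudes (hypothetical helper, not the p:IGI+ API).
    """
    X = np.asarray(X, dtype=float)
    row_sums = np.nansum(X, axis=1, keepdims=True)  # ignore missing (NaN) values
    return X / row_sums

# two samples with very different absolute responses; the second has a gap
profiles = np.array([[2.0, 6.0, 2.0],
                     [10.0, np.nan, 30.0]])
norm = normalise_per_sample(profiles)
# after normalisation the non-missing values in each row sum to 1
```

Note that after this step two samples with the same relative composition but different injection volumes become identical, which is exactly the intent.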

You also need to decide what to do about missing values. The default (and safest) option is to remove any samples with missing values, but if you have a significant number of missing values, this might result in a rather small training set, which is also undesirable. The other options are:

Having decided on the above the final choice is whether to standardise the resulting data. The data will be centred in any case (this means that each property will have the data set mean subtracted to produce an overall zero mean - this will not affect the clusters but improves numerical stability). You should only standardise data if you are concerned different properties are measured across different scales but you want them to have roughly the same weight in the clustering. In general we would recommend avoiding standardisation of normalised data.
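The distinction between centring (always applied) and standardisation (optional) can be illustrated with a small sketch. The data values here are made up for illustration; the key point is that centring leaves the clustering unchanged while standardisation rescales each property.

```python
import numpy as np

# two properties measured on very different scales (illustrative values)
data = np.array([[1.0, 100.0],
                 [2.0, 300.0],
                 [3.0, 200.0]])

# centring: subtract the per-property mean; pairwise distances between
# samples (and hence the clusters) are unchanged
centred = data - data.mean(axis=0)

# standardisation: additionally divide by the per-property standard
# deviation, giving each property roughly equal weight
standardised = centred / data.std(axis=0)
```

After standardisation both properties have unit variance, so the second (larger-scale) property no longer dominates distance calculations.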

Once you have selected the data pre-processing, select Build training set. This will apply the selected options and create the training set.

Training the model

Once you have selected the training set options you can choose the training options. For k-means clustering the only choice is the number of clusters to look for. Ideally the geological or physical context will suggest a number, but often in practice you might want to explore several numbers of clusters and select the one that best explains the data given your contextual knowledge. Press the Train button to create a model.

The k-means model is randomly initialised so training a model again can result in a different clustering - it is possible to train and apply the model several times to explore the variability of the clustering found. This is an inherent modelling ambiguity and not an artefact of the implementation - clustering is an unsupervised technique and without labels there will always be uncertainty about the number of clusters, and their structure.
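The random-initialisation behaviour described above can be demonstrated with a minimal sketch of Lloyd's algorithm. This is illustrative only and is not the p:IGI+ implementation; with well-separated groups the partition found is stable, but the numbering of the clusters can still differ between restarts.

```python
import numpy as np

def kmeans(X, k, seed, n_iter=100):
    """Minimal Lloyd's algorithm with random initial centres
    (an illustrative sketch, not the p:IGI+ implementation)."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign every sample to its nearest centre
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each centre to the mean of its assigned samples
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centres[j] for j in range(k)])
        if np.allclose(new, centres):
            break
        centres = new
    return labels

# two well-separated groups of synthetic samples
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.5, (20, 2)),
               rng.normal(10.0, 0.5, (20, 2))])
labels_a = kmeans(X, k=2, seed=0)
labels_b = kmeans(X, k=2, seed=1)
```

Comparing `labels_a` and `labels_b` across several seeds is the kind of train-and-apply repetition suggested above for exploring the variability of the clustering.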

Once the model is trained, the model use section is also enabled:

This section provides information on the training choices made (when and by whom), which can be helpful if you save the Clustering artefact as a template and provide it to a colleague. There is also a cluster error metric, which at present is not especially informative, but ideally would be small.

Below the model summary is the section which defines the data to which you wish to Apply the model, and how you wish to handle missing values when clustering your data (that is assigning each sample to a cluster). By default we anticipate you would wish to apply the model to the training set sample set, but you can apply a model trained on a specific subset to a different subset should you choose.

When applying the model it is possible to select a different treatment of missing values. In general we advise using the same choices for training and 'prediction'. However, there are situations where it makes sense to, for example, remove samples with missing values when training (to minimise the influence of the strategy used to replace them), but then apply the trained model to all samples with missing values replaced. Once the model is fixed, the treatment of missing values has less influence on the overall modelling outcome.
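The apply step with missing values replaced can be sketched as below. Filling gaps with the training-set property means is one plausible 'replace' strategy used here for illustration; the exact options p:IGI+ offers may differ, and the function name is an assumption.

```python
import numpy as np

def assign_clusters(X, centres, train_means):
    """Assign each sample to its nearest learned cluster centre,
    first filling missing values with the training-set property means
    (one plausible 'replace' strategy; not necessarily the p:IGI+ one)."""
    X = np.where(np.isnan(X), train_means, X)
    d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
    return d.argmin(axis=1)

centres = np.array([[0.0, 0.0], [10.0, 10.0]])   # centres from a trained model
train_means = np.array([5.0, 5.0])               # per-property training means
samples = np.array([[1.0, np.nan],               # gap replaced before assigning
                    [9.0, 11.0]])
assigned = assign_clusters(samples, centres, train_means)
```

Because the centres are fixed at this point, the replacement strategy only affects which existing cluster an incomplete sample lands in, not the clusters themselves.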

The cluster labels (which take the form 'Cluster 1', 'Cluster 2', etc) will be written to a text property of your choice. If you want to use an existing property you can, but you can also readily create a text project property (Data -> New project property... and select Text Property).
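The mapping from model output to the text written to the property amounts to something like the following, assuming the model yields zero-based integer assignments (an assumption for illustration).

```python
# hypothetical zero-based assignments from the model
assignments = [0, 2, 1, 0]
# text labels of the form written to the chosen text property
labels = [f"Cluster {i + 1}" for i in assignments]
```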

Using Clustering results

Once the cluster model is applied, the cluster labels will be calculated and written to the selected property. On the first application of the cluster model a colour palette is automatically created with entries for each cluster label. This palette can be applied to graphs, maps and other artefacts to explore the meaning of the different clusters, including for example showing the clustering calculated on 11 properties on a scatterplot of the first and second principal components calculated on the same inputs (see below). The cluster assignments are calculated on demand only - this is true for all models - and will delete all other values in the target property to ensure the results are relevant to only one model.
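The principal-components view mentioned above can be sketched in a few lines: project the same (centred) clustering inputs onto their first two principal components, then colour the points by cluster label. The data here are synthetic and the helper is illustrative, not part of p:IGI+.

```python
import numpy as np

def first_two_pcs(X):
    """Scores on the first two principal components of centred data
    (an illustrative sketch of the scatterplot view described above)."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)  # rows of Vt are PCs
    return Xc @ Vt[:2].T

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 11))      # e.g. 11 clustering input properties
scores = first_two_pcs(X)          # (30, 2) coordinates for the scatterplot
```

Plotting `scores` with one colour per cluster label is the kind of check that helps judge whether the chosen number of clusters explains the data.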

At present (version 2.0) the cluster assignments are styled as given (not calculated) values, as the values are not dynamically updated if input data changes (this is a change of behaviour from version 1.28). In future releases we might add a predicted value type.

Using the cluster artefact

Clustering is a fully fledged artefact, meaning you can have several cluster artefacts open alongside graphs, maps etc. This lets you explore the effect of, for example, different training options on the clustering, and any other artefacts using the results will update for you. Don't forget to Train the model and then (if happy with the validation / diagnostics) Apply it.

If you have closed the clustering artefact you will need to Build training set before you can re-train the model. This will also be true if you want to re-train a model sent to you as an artefact. However you will be able to apply a trained model provided to you as an artefact to assign your data to clusters determined by the model creator.

You can have multiple cluster artefacts in your project (you can Clone the models) so it is possible to compare different models trained on different subsets of data, or with different data pre-processing choices. 

© 2022 Integrated Geochemical Interpretation Ltd. All rights reserved.
