# Principal Component Analysis

#### Overview

*When dealing with high-dimensional data it can be useful to explore whether a smaller number of variables (principal components) can be used to summarise the information contained in the much larger number of properties.*

**Version: p:IGI+ 2.0+ (Sept 2021)**

Usage: **Data --> New PCA...** , right click in page, **Create new PCA...**

#### How to use in practice

Principal Components Analysis is the standard linear method for reducing the dimension (number of properties) of multivariate data. Classically, in oil and gas geochemistry it is applied to Gas Chromatography data, where the abundances of a large number of molecules are often measured. It can be challenging to visualise and understand such high-dimensional data, so PCA is one option for simplifying it.

Mathematically, PCA calculates a decomposition of the data covariance matrix to find a series of orthogonal (at right angles to each other) axes which represent the most variable (maximum variance) directions in the data space. The hope is that the axes of significant variance correspond to signal, and the lower variance directions mainly capture noise, although this is not guaranteed to be true. A more complete explanation of the use of PCA is provided in the software training courses.
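p:IGI+ performs this calculation for you, but as a rough illustration of the idea, the decomposition can be sketched in Python with numpy. All names and the toy data below are illustrative, and not part of p:IGI+:

```python
# Minimal sketch of PCA via eigendecomposition of the covariance matrix.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))          # 50 samples, 4 properties

Xc = X - X.mean(axis=0)               # centre each property
cov = np.cov(Xc, rowvar=False)        # 4x4 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)

# eigh returns ascending eigenvalues; reverse to order PCs by decreasing variance
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Xc @ eigvecs                 # component scores for each sample
```

The columns of `eigvecs` are the orthogonal axes (loadings), and the variance of each column of `scores` equals the corresponding eigenvalue.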

##### Selecting the 'training data'

PCA requires a set of data on which to 'learn' the projection. We will call this the training set. We create a training set for any model using a page, and a sample set. The page defines the properties (columns) that will form the inputs - non-numeric inputs will be ignored. The sample set defines the subset of the data that will be used to 'train' the model.

Using a page, and applying the sample set to that page, allows you to view the data that will be used for training before you attempt to learn the model. This is good practice and allows you to understand, for example, the degree and possible pattern of missing data. Once you have created the page and applied the sample set, you can right-click on the page and select **Create new PCA...** from the menu options. This will prompt you to name the new model:

We suggest using as meaningful a name as possible. Selecting to Create the model brings up the PCA artefact, with the training set information populated:

At this point you have several decisions to make regarding how to pre-process the data. If using raw GC data it will be important to normalise the data per sample, to account for potentially very different instrument sensitivities / injection volumes when using unknown, height or area data. This should not be necessary, and may not be desirable, if you are using concentration or ratio data. Normalisation is applied per sample and ensures that the measurements sum to one (accounting for the number of missing values) - this puts the emphasis on the 'shape' of the n-alkane profile.
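The software applies this normalisation for you; conceptually, the per-sample step is along these lines (a sketch only, using numpy NaNs for missing values - the exact p:IGI+ behaviour may differ in detail):

```python
# Normalise each sample (row) so its measured values sum to one,
# ignoring missing values (NaN). Toy data is illustrative.
import numpy as np

X = np.array([[10.0, 30.0, np.nan, 60.0],
              [ 2.0,  2.0,  4.0,   2.0]])

row_sums = np.nansum(X, axis=1, keepdims=True)  # sum over present values only
X_norm = X / row_sums                           # missing values stay missing
```

After this step every sample's present measurements sum to one, so samples become comparable by profile shape rather than absolute abundance.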

You also need to decide what to do about missing values. The default (and safest) option is to remove any samples with missing values, but if you have a significant number of missing values, this might result in a rather small training set, which is also undesirable. The other options are:

- replace missing values with the mean value for that property in data that is present
- replace missing values with the minimum value for the property in the data that is present (assuming you believe the values are missing because the peak was very small)
- replace missing values with a user-given value (again, it is likely you would select a very small value, reflecting a value missing due to a very small / unseen peak).
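The three replacement strategies above can be sketched as follows. This is an illustrative implementation only; the function name and interface are not part of p:IGI+:

```python
# Sketch of the missing-value replacement options: per-property mean,
# per-property minimum, or a user-given constant.
import numpy as np

def impute(X, strategy="mean", value=0.0):
    X = X.astype(float).copy()
    for j in range(X.shape[1]):                 # one property (column) at a time
        mask = np.isnan(X[:, j])
        if not mask.any():
            continue
        present = X[~mask, j]                   # values actually measured
        if strategy == "mean":
            X[mask, j] = present.mean()
        elif strategy == "min":
            X[mask, j] = present.min()
        else:                                   # user-given value
            X[mask, j] = value
    return X
```

Note that mean and minimum are computed per property from the data that is present, as described above.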

Having decided on the above the final choice is whether to standardise the resulting data. The data will be centred in any case (this means that each property will have the data set mean subtracted to produce an overall zero mean, which is important to generate interpretable principal components). You should only standardise data if you are concerned different properties are measured across different scales but you want them to have roughly the same weight in PCA. In general we would recommend avoiding standardisation of normalised data.
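The distinction between centring (always applied) and optional standardisation can be sketched as below. Again this is an illustration of the concept, not the p:IGI+ implementation:

```python
# Centring is always applied; standardisation (dividing each property by
# its standard deviation) is an optional extra step.
import numpy as np

def preprocess(X, standardise=False):
    Xc = X - X.mean(axis=0)          # centre: zero mean per property
    if standardise:
        Xc = Xc / X.std(axis=0)      # unit variance per property
    return Xc
```

Standardisation gives every property equal weight regardless of its measurement scale, which is why it is usually unnecessary (and can be unhelpful) once the data has already been normalised per sample.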

Once you have selected the data pre-processing, select **Build training set**. This will apply the selected options and create the training set.

###### Training the model

Once you have selected the training set options, there are really no significant PCA options to choose - it is a simple linear method. Press the Train button.

This brings up the model validation / diagnostics section, which shows the scree plot (a plot of the proportion of variance explained by each PC), alongside a table showing the actual numbers. You can also view the component loading plot from this section - the plot shows the importance of the input properties (columns) in terms of their contribution to the selected principal components, and can help you understand which properties influence which components most.
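The quantity behind the scree plot is the proportion of total variance explained by each PC, i.e. each eigenvalue divided by the sum of all eigenvalues. As a rough sketch (illustrative data and names, not from p:IGI+):

```python
# Proportion of variance explained per PC, as plotted in a scree plot.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 5))          # 40 samples, 5 properties
Xc = X - X.mean(axis=0)

# Eigenvalues of the covariance matrix, largest first
eigvals = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]
explained = eigvals / eigvals.sum()   # one proportion per PC, summing to 1
```

A sharp 'elbow' in these proportions is the usual guide to how many PCs carry signal.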

Once the model is trained, the model use section is also enabled:

This section provides information on the training choices made (when and by whom), which can be helpful if you save the PCA artefact as a template and provide it to a colleague. Below the model summary is the section which defines the data to which you wish to **Apply the model**, and how you wish to handle missing values when projecting your data onto the PCs (that is, calculating the component scores for each sample). By default we anticipate you would wish to apply the model to the training sample set, but you can apply a model trained on a specific subset to a different subset should you choose.

When applying the model it is possible to select a different treatment of missing values. In general we advise using the same choices for training and 'prediction'. However, there are situations where it makes sense to, for example, remove samples with missing values when training (to minimise the influence of the replacement strategy), and then, once the model has been trained on samples with complete information, apply it to all samples with missing values replaced. Once the model is fixed, the treatment of missing values has less influence on the overall modelling outcome.

You must also select the number of principal components to create and populate ('predict'). The default is two, but the number should be guided by the scree plot and your understanding of the data. Since the PCA model is a full artefact you can later change your mind and predict more, or fewer, PCs. The component scores (PCs), which will be named Model_PCn.PCA, are created and calculated automatically when the **Apply** button is pressed (there will be a slight delay here as the project properties are created and the values calculated).
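Conceptually, applying the trained model means centring each sample using the training means and projecting it onto the first k loading vectors to obtain the component scores. A sketch, with all names and data illustrative rather than taken from p:IGI+:

```python
# Project new samples onto a trained PCA model to get component scores.
import numpy as np

rng = np.random.default_rng(2)
train = rng.normal(size=(30, 4))        # training set: 30 samples, 4 properties
mu = train.mean(axis=0)                 # training means, reused at apply time
_, _, Vt = np.linalg.svd(train - mu, full_matrices=False)

k = 2                                   # number of PCs to populate ('predict')
new = rng.normal(size=(5, 4))           # samples the model is applied to
scores = (new - mu) @ Vt[:k].T          # one score per sample per selected PC
```

Because the loadings and training means are fixed once the model is trained, the same projection can be applied to any compatible sample set.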

###### Using PCA results

Once the PCA model is applied, the projection onto the selected PCs will be calculated. These are project properties which are dimensionless (Euc unit by default), and can easily be found in property selectors by typing either the model name or simply, e.g., 'PC1' into the property box of the Property selector. These properties can be used across the system in pages, graphs, palettes, maps, sample sets, statistics and as inputs to other models, for example cluster models. The PCs (scores) are calculated on demand only - this is true for all models - and calculating them will delete all other values in the target properties to ensure the results come from only one model.

At present (version 2.0) the scores are styled as given (not calculated) values, as the values are not dynamically updated if the input data changes (this is a change of behaviour from version 1.28). In future releases we might add a predicted value type.

###### Using the PCA artefact

PCA is a fully fledged artefact. This means you can have the PCA artefact open alongside graphs, maps etc., and explore the effect of, for example, different training options on the PCs; any other artefacts using these will update for you. Don't forget to **Train** the model and then (if happy with the validation / diagnostics) **Apply** it.

If you have closed the PCA artefact you will need to **Build training set** before you can re-train the model. This will also be true if you want to re-train a model sent to you as an artefact. However you will be able to apply a trained model provided to you as an artefact to project your data onto the principal components determined by the model creator.

You can have multiple PCA artefacts in your project (you can **Clone** the models) so it is possible to compare different models trained on different subsets of data, or with different data pre-processing choices.

If you rename your PCA model, the project properties associated with it will also be renamed (although if applied to e.g. a graph, the axes labels will not automatically update).

When a PCA model is deleted, the project properties created to store its PCs (component scores) are not automatically deleted - this can be done using the **Data --> Delete project property...** menu if necessary.

© 2022 Integrated Geochemical Interpretation Ltd. All rights reserved.