Find good data set.

Overview

This tool helps identify candidate properties and samples that together form a dense dataset, suitable for training and use in machine learning models.

Version: p:IGI+ 2.5+ (Oct 2024)

Usage: Model --> Find good data set...

How to use in practice

This p:IGI+ feature is designed to help users find a suitable set of properties and samples within a project (a data matrix) that is useful for interpretation or as input to the machine learning tools.

Once opened, the tool presents several steps, indicated by a snaking sequence of light blue arrows.

Upper Left Box - Filtering

Here it is recommended to a) select a starting page, especially if you have a subset of properties in mind that you wish to gather together, and b) limit the samples by using a sample set. Additional filtering options are available through "advanced property filtering", e.g., if you only wish to consider ratios.

- Leaving the initial filter option blank can be a good way to explore what you have in your project, but in large projects, it can take a long time to build the full data map, so working with pages and sample sets is beneficial.

Select Proceed with selected data to continue.

Lower Left Box - Analysis & Indicator Selection

This box lists all analyses and indicators that have data, along with the number of samples (s) and properties (p) and the percentage of empty/missing values for each.

Note that as combinations of analysis groups and indicators are selected, the "number of samples x the number of properties" shown in each blue arrow updates dynamically.
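As a rough illustration of what these counts mean, the sketch below computes the number of samples, the number of properties and the percentage of missing cells for a small matrix. It is not the tool's own code: the function name occupancy_summary and the use of None for an empty cell are assumptions made for this example.

```python
def occupancy_summary(matrix):
    """Return (n_samples, n_properties, percent_missing) for a row-major matrix.

    Each row is a sample and each column a property. Missing cells are
    represented by None (an assumption of this sketch).
    """
    n_samples = len(matrix)
    n_properties = len(matrix[0]) if matrix else 0
    total = n_samples * n_properties
    missing = sum(1 for row in matrix for value in row if value is None)
    percent_missing = 100.0 * missing / total if total else 0.0
    return n_samples, n_properties, percent_missing


# Hypothetical data: 4 samples x 3 properties with 4 empty cells.
demo = [
    [1.2, None, 0.4],
    [0.9, 2.1, None],
    [None, None, 0.7],
    [1.1, 1.9, 0.5],
]
s, p, pct = occupancy_summary(demo)
print(f"{s}s x {p}p, {pct:.1f}% missing")
```

Selecting a different combination of analysis groups and indicators corresponds to calling the function on a different subset of columns, which is why the displayed counts change as selections are made.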

Upper Middle Box - Target Property Selection (Optional - leave no target)

If you want to compile a data matrix of samples and properties around a particular area of the property model, select a target property to focus the selection of properties. Several options allow the data matrix to be oriented towards an area of the property model:

Mutual Information

Mutual information is a statistical measure used for feature selection, dependency analysis, and understanding relationships in your data ahead of training a machine learning model. It is a measure of dependence between two variables (in this case, between property [X] and the selected target value [Y]), quantifying how much information [X] provides about [Y].

The mutual information feature uses Kullback–Leibler divergence (KL) to quantify how different the actual joint relationship is from a world where X and Y are completely independent.

Mathematically, MI is defined as:

I(X;Y) = KL( p(X,Y) ∥ p(X)p(Y) )

This compares two probability distributions:

p(X,Y) → the joint distribution: how X and Y actually vary together
p(X)p(Y) → the product of independent distributions: how X and Y would behave if they were unrelated

The KL divergence measures how much two probability distributions differ. It is zero when the distributions are identical, and positive when they are not.
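A small numerical illustration may help here. The sketch below evaluates the KL divergence for two discrete distributions given as lists of probabilities; the function name kl_divergence and the example numbers are chosen for illustration only, and the sketch assumes the second distribution is non-zero wherever the first one is.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions of equal length.

    Terms with p_i == 0 contribute nothing. The sketch assumes q_i > 0
    wherever p_i > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


identical = kl_divergence([0.5, 0.5], [0.5, 0.5])
different = kl_divergence([0.9, 0.1], [0.5, 0.5])
print(identical, different)
```

The first call compares a distribution with itself and returns zero; the second compares a skewed distribution with a uniform one and returns a positive value.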

If X tells you nothing about Y, then the joint distribution equals the product of the independent distributions and the mutual information is zero. The more these two distributions differ, the more X tells us about Y. Mutual information has no fixed upper bound, so the values are scaled so that the maximum is one. In general, even values of 0.1 or less can indicate that knowing X is useful.
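The sketch below shows one way such scores could be computed for discrete (or pre-binned) values. It is an assumption-laden illustration, not the p:IGI+ implementation: the estimator is a simple count-based one, missing values are assumed to be None and are skipped pairwise, and the scaling shown (dividing by the largest score among the candidate properties) is only one plausible reading of "scaled so that the maximum is one". All names are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Count-based estimate of I(X;Y) in nats for paired discrete values.

    Pairs where either value is None (missing) are skipped.
    """
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # Ratio of the joint probability to the product of the marginals.
        mi += p_xy * math.log(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi


def scale_to_max(scores):
    """Divide every score by the largest one (assumed scaling scheme)."""
    top = max(scores.values(), default=0.0)
    if top <= 0.0:
        return {name: 0.0 for name in scores}
    return {name: value / top for name, value in scores.items()}


# Hypothetical target and candidate properties.
target = [0, 0, 1, 1, 0, 0, 1, 1]
properties = {
    "a": [0, 0, 1, 1, 0, 0, 1, 1],  # identical to the target
    "b": [0, 1, 0, 1, 0, 1, 0, 1],  # independent of the target
    "c": [0, 0, 1, 1, 0, 1, 1, 0],  # partly related to the target
}
raw = {name: mutual_information(values, target) for name, values in properties.items()}
scaled = scale_to_max(raw)
for name in sorted(scaled):
    print(name, round(scaled[name], 3))
```

In this example property "a" scales to one, property "b" scores zero because it is independent of the target, and property "c" falls in between at roughly 0.19.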

Right Box - Dataset Preferences

Finally, options are provided to optimise the collated data matrix.

Rows vs Columns (properties)

- In general, for statistical modelling tasks, we would prefer to have more samples (rows) to train our model. If the slider is placed in the middle, the tool attempts to balance the two.

The other two sliders define the minimum number of values required for a sample (row) or property (column) to be accepted into the final matrix (data set). Default starter values are provided.

Once ready, select Find well-occupied dataset. You can adjust the preference sliders and re-run Find well-occupied dataset several times to maximise the number of samples and properties while minimising empty cells.
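To make the effect of the two minimum-value sliders concrete, the sketch below repeatedly drops samples (rows) and properties (columns) that hold too few values until every remaining row and column meets its threshold. This is a simple greedy illustration and not the algorithm p:IGI+ uses; it ignores the rows vs columns preference, and the name prune_sparse, its parameters and the use of None for empty cells are assumptions.

```python
def prune_sparse(matrix, min_per_row, min_per_col):
    """Return (kept_rows, kept_cols) after iterative pruning.

    A row is dropped when it has fewer than min_per_row values among the
    kept columns; a column is dropped when it has fewer than min_per_col
    values among the kept rows. Missing cells are None (assumption).
    """
    rows = set(range(len(matrix)))
    cols = set(range(len(matrix[0]))) if matrix else set()
    changed = True
    while changed:
        changed = False
        for r in list(rows):
            if sum(matrix[r][c] is not None for c in cols) < min_per_row:
                rows.discard(r)
                changed = True
        for c in list(cols):
            if sum(matrix[r][c] is not None for r in rows) < min_per_col:
                cols.discard(c)
                changed = True
    return sorted(rows), sorted(cols)


# Hypothetical 4 samples x 4 properties matrix.
demo_matrix = [
    [1.0, 2.0, None, 4.0],
    [1.5, None, None, 3.5],
    [None, None, 9.0, None],
    [2.0, 2.5, None, 4.2],
]
kept_rows, kept_cols = prune_sparse(demo_matrix, min_per_row=2, min_per_col=2)
print(kept_rows, kept_cols)
```

With both thresholds set to 2, the third sample and the third property are removed, leaving a 3 x 3 matrix with a single empty cell. Raising either threshold gives a denser but smaller matrix, which is the trade-off the sliders expose.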

Data Set Creation

To create a data set (a combination of a page artefact and a static sample set), provide a data set name and click Create page and static sample set.