GC file transformation
Summary of how it works
The GC Transformer works in three main steps:
- Classify the file structure
- Read the context data at the top of the sheet
- Read the pages of measurement data
- Classification - Based on an analysis of several dozen historical print-formatted GC reports, we recorded where the analysis is specified, which columns hold the context headers and values, and other layout variables. From this we can identify whether a file is presented in a familiar structure and, if so, where to find the context data (depth, well name, etc.).
- Context data - We capture information such as company, location, well name and depth from the block near the top of the file. This block is typically laid out as two header columns and two value columns.
- Page data - The bulk of the data is then captured in repeating cycles, each consisting of context data followed by a table of measurements.
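As an illustration of the context step, the sketch below reads a block laid out as two header columns and two value columns. It is not the transformer's actual code: the function name `read_context_block`, the default column positions and the sample labels are all assumptions made for the example.

```python
def read_context_block(rows, header_cols=(0, 2), value_cols=(1, 3)):
    """Collect header/value pairs from the context block at the top of a sheet.

    rows        -- list of rows, each a list of cell values (None for empty)
    header_cols -- assumed positions of the two header columns
    value_cols  -- assumed positions of the two value columns
    """
    context = {}
    for row in rows:
        # A completely empty row is taken to mark the end of the block.
        if all(cell in (None, "") for cell in row):
            break
        for h, v in zip(header_cols, value_cols):
            header = row[h] if h < len(row) else None
            value = row[v] if v < len(row) else None
            if header not in (None, ""):
                # Normalise "Well Name:" to "Well Name".
                context[str(header).strip().rstrip(":")] = value
    return context


# Hypothetical sample rows, for illustration only.
sample_rows = [
    ["Company:", "ACME", "Well Name:", "W-1"],
    ["Location:", "North Sea", "Depth:", 1234.5],
    [None, None, None, None],
    ["Peak Label", "Area", "Height", None],
]
context = read_context_block(sample_rows)
```

In practice the column positions would come from the classification step rather than fixed defaults.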
Each page starts with the repeated block of context data, followed by the headers and then the data. Whilst iterating through the rows we pick up indicators that a page is starting or finishing: after the context data block we expect to see empty rows or the headers, and after the headers we expect to see page data.
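The row-by-row page detection described above can be thought of as a small state machine. The following is a minimal sketch under stated assumptions, not the service's implementation: `CONTEXT_KEYS`, `HEADER_MARKER` and `split_pages` are hypothetical names, and real files will use a wider range of labels.

```python
# Assumed labels; real reports vary and these are for illustration only.
CONTEXT_KEYS = {"Company", "Well Name", "Location", "Depth"}
HEADER_MARKER = "Peak Label"


def is_blank(row):
    """True when every cell in the row is empty."""
    return all(cell in (None, "") for cell in row)


def split_pages(rows):
    """Group rows into pages of context rows, one header row and data rows."""
    pages = []
    page = None
    state = "seek"  # seek -> context -> data -> context -> ...
    for row in rows:
        if is_blank(row):
            continue  # empty rows separate blocks but carry no data
        first = str(row[0]).strip().rstrip(":")
        if first in CONTEXT_KEYS:
            if state != "context":
                # A context row after data (or at the start) opens a new page.
                page = {"context": [], "headers": None, "data": []}
                pages.append(page)
                state = "context"
            page["context"].append(row)
        elif first == HEADER_MARKER and state == "context":
            page["headers"] = row
            state = "data"
        elif state == "data":
            page["data"].append(row)
    return pages
```

Rows that arrive in an unexpected state (for example, data before any headers) are simply skipped in this sketch; a fuller version might record them for review.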
We use the Peak Label (and Ion channel, if given) to identify the properties, and the Area, Height and similar headings to identify the indicators and units of measure used.
Because the sheets are formatted for presentation rather than ease of processing, the header columns do not always correspond with the value columns, even if they align visually. For example, it is common for a value to span a larger range of merged columns than its header but to be right-aligned so that the two match up visually. We use consistency checks to ensure that the number of value columns matches the number of header columns (correcting for multiple values within merged cells where needed).
Rows that do not meet this convention, such as subheadings, are disregarded.
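One way to picture the consistency check is to ignore cell positions altogether and compare counts of non-empty cells. The sketch below is an assumption-laden illustration rather than the transformer's own logic: `align_row` and its helpers are hypothetical, and it assumes that a merged cell holding several values separates them with whitespace and that peak labels contain no spaces.

```python
def non_empty(row):
    """Return the cells of a row that actually hold something."""
    return [cell for cell in row if cell not in (None, "")]


def expand_values(cells):
    """Split text cells holding several whitespace-separated values."""
    expanded = []
    for cell in cells:
        if isinstance(cell, str):
            expanded.extend(cell.split())
        else:
            expanded.append(cell)
    return expanded


def align_row(header_row, value_row):
    """Pair headers with values by order, or return None on a count mismatch."""
    headers = [str(h).strip() for h in non_empty(header_row)]
    values = expand_values(non_empty(value_row))
    if len(values) != len(headers):
        return None  # e.g. a subheading row: disregard it
    return dict(zip(headers, values))


headers = ["Peak Label", None, "Area", None, "Height"]
shifted = align_row(headers, ["nC17", None, None, 1520.4, "88.1"])
merged = align_row(headers, ["nC18", "1400.2 80.5", None, None, None])
subheading = align_row(headers, ["n-Alkanes", None, None, None, None])
```

Here `shifted` and `merged` both yield three header/value pairs even though the cells sit in different columns from their headers, while `subheading` is `None` and would be skipped.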
Value Selection
In some cases columns are included for both standard and response-factor-corrected values of the same indicator. Where both are given, the standard values are preferred.
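The preference for standard values can be sketched as follows. The record layout (`indicator`, `rf_corrected`, `value`) and the function name `select_values` are hypothetical. The sketch also assumes that a corrected value is kept when no standard value exists for that indicator, which this document does not state explicitly.

```python
def select_values(measurements):
    """Keep one measurement per indicator, preferring standard values.

    measurements -- list of dicts with the assumed keys
                    "indicator", "rf_corrected" (bool) and "value"
    """
    chosen = {}
    for m in measurements:
        current = chosen.get(m["indicator"])
        # Take the first value seen, and let a standard value replace
        # a response-factor-corrected one for the same indicator.
        if current is None or (current["rf_corrected"] and not m["rf_corrected"]):
            chosen[m["indicator"]] = m
    return chosen


# Hypothetical measurements for a single peak.
peak = [
    {"indicator": "Area", "rf_corrected": True, "value": 10.0},
    {"indicator": "Area", "rf_corrected": False, "value": 12.0},
    {"indicator": "Height", "rf_corrected": True, "value": 3.0},
]
selected = select_values(peak)
```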
We also do not currently capture the retention time or the full compound name (unless the peak label is unspecified).
Suggested output checks
The first thing to check is that all of the sheets you expected were transformed. If a file contains multiple sheets and some but not all have a recognised structure, we only include the sheets that could be transformed in the output.
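If you keep a list of the sheet names in the source workbook, this check can be scripted. The sketch below assumes you can obtain the names of the sheets that appear in the output; how you do so depends on the output format, and the `missing_sheets` helper and sheet names are hypothetical.

```python
def missing_sheets(expected, transformed):
    """Return the expected sheet names that are absent from the output,
    in their original order."""
    transformed_set = set(transformed)
    return [name for name in expected if name not in transformed_set]


# Hypothetical sheet names, for illustration only.
expected = ["Sat GC", "Aro GC", "Summary"]
transformed = ["Sat GC", "Aro GC"]
not_transformed = missing_sheets(expected, transformed)
```

Any names left in `not_transformed` are sheets worth inspecting by hand, since they may not have a recognised structure.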
Another useful check is that you are happy with the indicators and units of measure that have been assigned. We do our best to capture appropriate values for these, but we have seen a range of different abbreviations, so they may not be mapped appropriately if unfamiliar headings are given. It is enough to check the first set of values after the context data, because the same indicators and units of measure are repeated for each property.
Unsupported files / sheets
You may come across files or individual sheets that are not yet supported by the service. Whilst we cannot guarantee that every GC file can be converted, due to the wide variety of structures seen, we would like to support as wide a range as we can (especially commonly occurring file structures). If you find any files that are not yet supported, please consider submitting them to IGI (anonymised if you wish) for investigation: dataservice@igiltd.com.
Known Issues
- Any context data that does not appear in the section at the top of the file is not currently captured. For example, in some files a "File Name" field is present in the repeating context data at the start of each printable GC page but is not included in the context data at the top of the worksheet.
© 2024 Integrated Geochemical Interpretation Ltd. All rights reserved.