Cpact home page CPACT Logo  
Robust Calibration of Spectral Data, using Small Data Sets
Project Objectives

  • To develop strategies that produce robust calibration models from small data sets for monitoring chemical processes.
  • To develop sample selection techniques to reduce the number of samples required to produce a robust and effective calibration model.
  • To develop variable selection techniques to improve the predictive ability of a calibration model.

Achievements

One contribution of this project has been towards commonly accepted methods for reporting results, so that different modelling techniques applied to the same data can be compared even when the analysis is carried out by different researchers in different institutions. This work has been fed into the SMT Project "European Network for the Intercomparison of Chemometric Software and Methods" (UK Co-ordinator Dr Walmsley, University of Hull).

All calibration models are sample dependent, which is why experimental design prior to modelling is so important. In most industrial applications, however, an experimental design approach is not viable (i.e. there is little or no opportunity to vary the plant in such a way as to provide data sufficient to meet the stringent needs of the DoE), and as a result most plant data is quite low in variability. As spectroscopy is a commonly applied chemical analysis method, spectroscopic data has been used throughout this project, focussing on NIR, Raman, UV/Vis and NMR. This raises issues in treating spectroscopic data in the same way as one would model plant data. These areas have been investigated in a number of key projects.

Initial work reported in [1] set the basis for determining the most suitable approach to method comparisons, which included defining a set of standard result reporting methodologies. The case study used a simple NIR data set that had been sourced from industry. The work concluded that (i) variable selection techniques applied to spectroscopic data provided significantly better results than using the entire spectra; (ii) the key to implementing PLS modelling is determining objectively the appropriate number of latent variables to use in the model; and (iii) statistical methods such as Ridge Regression (RR) produce comparable results to PLS but are computationally intensive, although combining variable selection with RR methods might produce an effective modelling alternative to PLS.

There are currently many approaches to Variable Selection (VS); a novel method that exploits the experimental design nature of the data was developed in [2]. The key to this work is that the response to the mixture is a linear sum of the contributions of the absorbing species present. If one uses an orthogonal experimental design, then it is possible to hold the concentrations (and thus contributions) of all components except the target constant. Using this ‘Linear Interval’ it is possible to categorise the variables in the spectra into ‘good’ and ‘poor’ predictors for the target. The results obtained were superior to those of VS methods from the literature, but this method is only applicable where orthogonal designs are used.
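
The linear-sum argument can be written out as a short worked equation. This is a sketch in notation assumed here rather than taken from [2]: A_λ is the absorbance at wavelength λ, c_i the concentration of species i, ε_{λ,i} its absorption coefficient (with the path length absorbed into the coefficient), and t the target component.

```latex
A_\lambda = \sum_{i=1}^{n} \varepsilon_{\lambda,i}\, c_i
\qquad\Longrightarrow\qquad
A_\lambda = \varepsilon_{\lambda,t}\, c_t + k_\lambda ,
\quad
k_\lambda = \sum_{i \neq t} \varepsilon_{\lambda,i}\, c_i = \text{constant}
```

When the non-target concentrations are held constant, the absorbance at each wavelength is a straight line in c_t, so departures from linearity at a given wavelength point to noise or interference at that variable.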

The other significant strand of work was to investigate optimal methods for sample selection, and this has been reported in [3,4]. One of the key issues in statistical modelling is updating (and maintaining) the model when changes occur in the initial conditions. In the laboratory one can rely upon appropriate experimental design to produce the best possible model, but with an industrial process the development of the initial calibration model is much more difficult. As a result the calibration tends to have low variability and low robustness, which can cause it to age significantly over time. One approach to this problem of ageing is to include more samples over time to reflect the changing conditions (which was the approach in use at this time). However, there is scope for examining this historical data and developing suitable methods to extract the samples which contain the greatest variation in the data (this is analogous to DoE methods, in which the levels of the samples and their combinations are designed to provide the greatest variation within the design space). The approach used in [3] uses the statistical method of ‘Variance of Prediction’, in which the model computes whether or not a sample is contained within the variance already described by some initial samples. If it is, the sample is not used in the calibration model; if it is not, the sample is added to the calibration model. Using the historical data (and maintaining the time series) it was possible to develop a more robust calibration utilising only 30% of the samples. This approach was considered to be invaluable at process start-up: samples can be collected off-line and their spectra taken, and the model can then determine whether or not each sample contains any new information (if so, a reference analysis can be performed). This approach would offer a significant cost saving.
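
The selection rule described above can be sketched as follows. This is a minimal illustration, not the algorithm of [3]: it uses the leverage of a new sample as a stand-in for its prediction variance (in ordinary least squares the variance of a new prediction is proportional to one plus the leverage), it assumes each row is a low-dimensional, mean-centred score vector (for example PLS scores) rather than a raw spectrum, and the function names, the `limit` value and the toy data are all hypothetical.

```python
import numpy as np

def leverage(x, X_cal):
    """Leverage x^T (X^T X)^-1 x of sample x relative to calibration rows X_cal.

    Assumes X_cal has full column rank (more rows than columns, low-dimensional scores).
    """
    gram = X_cal.T @ X_cal
    return float(x @ np.linalg.solve(gram, x))

def select_samples(X, n_initial, limit):
    """Walk through time-ordered rows of X and keep those with leverage above `limit`.

    The first `n_initial` rows seed the calibration set; `limit` is a hypothetical
    threshold that would need to be chosen for the application.
    """
    X = np.asarray(X, dtype=float)
    selected = list(range(n_initial))
    for i in range(n_initial, X.shape[0]):
        if leverage(X[i], X[selected]) > limit:
            selected.append(i)  # new information: request a reference analysis
    return selected

# Toy time series of two-dimensional scores (made-up values for illustration).
scores = np.array([
    [1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0],  # initial calibration samples
    [0.5, 0.5],   # inside the region already described
    [3.0, 0.0],   # well outside along the first direction
    [2.0, 0.0],   # now covered, because the previous sample was added
    [0.0, 2.0],   # outside along the second direction
])
selected = select_samples(scores, n_initial=4, limit=1.0)
print(selected)
```

Only the samples that extend the region already described are added, so reference analyses are requested for a fraction of the time series.
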
An alternative approach was also developed in [4], in which PLS and PCA were used to determine the usefulness of the samples in a time series based upon the PCA Q and T² statistics (the Q statistic can be described as the distance from the PCA scores plane and T² as the distance from the origin of the PCA scores). Taking some appropriate initial samples, it is possible to determine statistically the significance of subsequent Q and T² values. This approach required only 20% of the initial data to produce an appropriate model.

Most industries now use MSPC methods for both monitoring and control of their plant. However, this type of modelling has often focussed upon process variables rather than chemical ones. One of the reasons is that a spectrum, for example, consists of many variables, but the contribution of any portion of the data to the overall process model is complex. There is a growing need to model spectral data in the same way as process data (i.e. not requiring any offline reference methods) and to display this data in an appropriate manner (as a process control chart, for example). One approach to this problem [5] is to use the prediction variance, which is a single number, to represent the changes in a new spectrum compared to existing ‘typical’ spectra. This approach requires no reference data, just the spectra of the samples and, as commonly required in MSPC models, a defined set of samples that are known to be typical or representative of the product required. In this way the spectra can be treated statistically in a standard control chart, or used in an MSPC model. Including spectral data in this way, so that all the variance about the sample is captured without the need for either a large number of spectral wavelengths or an independent calibration, is a significant step forward in merging engineering MSPC approaches with chemical specific data.

Working without calibration has emerged as a common theme over the course of the project, and it was felt that some work should be carried out in this area [5,6]. Approaches used include kinetic monitoring using curve resolution methods to enable accurate end-point determination (which relies only on knowing the reaction order and pure spectral profiles of the components rather than on a calibration). The advantage of using this method is that the end-point can be determined much faster than with traditional methods, and with fewer samples. Other applications have included EWFA, which is a useful technique when data has an intrinsic order [6]. EWFA tracks the singular values of a moving window in a data set, and these singular values are plotted as a function of time, thus making it easy to see sudden changes in time. Research has shown that a narrower window will provide better resolution but a lower signal-to-noise ratio.

Other output from this project was in the area of workshops and tutorials, as requested by the membership. Two useful tutorials were in the areas of Wavelet Transformation and Experimental Design, both of which covered the background details and then had a very strong application to real process analytical problems [7,8]. This work, with the associated workshops, has enabled the membership to apply algorithms developed in this project to their own process analytical problems.

The project team has also influenced work on other projects, providing chemometric expertise and support for both Projects 1 and 3, and work on developing data analysis tools for novel process analysers has been reported [9].

Deliverables

Development and Maintenance of a Calibration Model for Process Analysis
Document Ref: 00/P2/1
Issued: 1 June 2000

 Abstract
One of the key issues in statistical modelling is the updating and maintenance of the model when changes in the initial conditions occur. In laboratory situations one can usually rely upon good experimental design methods to ensure that the best possible model is developed. This is not usually the case in process situations, where the variability of the variables is low compared with situations where the data is collected by designed experiments. In these situations, development of the calibration model is much more difficult, requiring the ability to construct a model using the fewest possible process samples, as off-line process measurements can be both time consuming and costly. The work presented here shows a method for the sequential updating of a calibration model, based upon sample selection. The data used as an example consists of NIR spectra and GC measurements of the feed, heads and tails of a process reactor taken over a period of about 1 year. The model has been developed using PLS latent variable reduction of NIR spectra. The advantages of the proposed algorithm are that it is not dependent upon the nature of the process and that it is robust in the presence of changes in process conditions. The results here have shown that the proposed procedure requires less than 30% of the measurements needed to produce a model that is comparable to one developed using all measurements.

Variable Selection for PLS Calibration of NIR Data from Orthogonally Designed Experiments
Document Ref: 01/P2/1
Issued: 28 February 2001

Summary
Introduction
For many types of modern instrumentation, the number of measured variables is often very large. In many cases, not all the measured variables contain useful information for development of a calibration model, e.g. they may contain only background noise. Many researchers have suggested that significant improvements in the prediction accuracy of a model may be achieved through careful selection of the variables used to form the calibration model. Moreover, variable selection can result in a reduction in multicollinearity. Collinearity, which means that some of the variables are linear combinations of the other variables, is the main cause of instability in the calculation of the regression coefficients. Variable selection procedures comprise decisions as to whether to include or exclude a particular predictor variable on the basis of one or more criteria determined by the researcher. The selection of an appropriate criterion and of a strategy for applying that criterion are the main issues in variable selection. In this report, a new approach to variable selection, Variable Selection over Linear Intervals (VSLI), is described and applied to NIR data for a mixture of three chemicals.

The VSLI Algorithm
According to the Beer-Lambert law, the observed absorbance at any given wavelength (λ) is a linear sum of the products of the individual concentrations and absorption coefficients (ε_λ) for each one of the absorbing species present. Therefore, if the concentrations of the components in a mixture vary linearly, the near infrared (NIR) absorbance values should also show linear variation. Moreover, if the concentrations of all components in the mixture except the target component are kept constant, which can be achieved by an orthogonal design such as a factorial or partial factorial design, the variation in the absorbance at any given wavelength corresponds to the variation in the concentration of the target component. Based on this idea, VSLI initially examines the individual variables and divides the set of predictor variables into ‘Good’ and ‘Poor’ variables, with the ‘Good’ variables being ranked according to their properties. The separation and ranking of the variables is based upon the satisfaction of a linearity criterion that is defined over the design of the calibration samples in the experimental design. A forward selection strategy is then employed to determine the optimum number of ‘Good’ variables to define the best calibration model.
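
The first stage, splitting the variables into ‘Good’ and ‘Poor’ and ranking the ‘Good’ ones, might look like the sketch below. The report does not spell out the linearity criterion here, so the sketch assumes one: the squared correlation between absorbance and target concentration over a set of samples in which the other components are constant. The function names, the 0.99 threshold and the toy numbers are hypothetical, and the forward selection stage is not shown.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)

def split_variables(conc, spectra, threshold=0.99):
    """Split wavelength indices into ranked 'Good' and unranked 'Poor' lists.

    `conc` holds the target concentrations over one linear interval of the design;
    `spectra` holds the matching absorbance rows. A variable is 'Good' when its
    absorbance is linear in `conc` (assumed criterion: R^2 >= threshold).
    """
    n_vars = len(spectra[0])
    scored = []
    for j in range(n_vars):
        column = [row[j] for row in spectra]
        scored.append((r_squared(conc, column), j))
    good = [j for _, j in sorted((s for s in scored if s[0] >= threshold), reverse=True)]
    poor = [j for r2, j in scored if r2 < threshold]
    return good, poor

# Four design levels of the target; three wavelengths (made-up absorbances).
conc = [10.0, 20.0, 30.0, 40.0]
spectra = [
    [0.3, 0.21, 0.151],
    [0.5, 0.19, 0.249],
    [0.7, 0.20, 0.352],
    [0.9, 0.21, 0.449],
]
good, poor = split_variables(conc, spectra)
print(good, poor)
```

In this toy data the first wavelength responds exactly linearly, the third almost linearly, and the second carries only noise, so the second is the one classed as ‘Poor’.
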

Results
The VSLI algorithm was tested on a NIR spectroscopic data set, which was supplied by Zeneca Agrochemicals. It consists of a mixture of 3 components identified as components A, B, and C. The sum of the concentrations (% w/w) of the three components was always 100%. A 4-level 2-factor orthogonal design was used to define the calibration data, with all sample concentrations determined gravimetrically using a four-place analytical balance. A set of 9 ‘test’ samples, arranged randomly throughout the experimental design, was used to test the prediction ability of the resulting calibration models. Significant improvements in the prediction accuracy of PLS models were observed for the different components in the chemical mixture when the selected wavelengths in the NIR spectra were used rather than the full spectrum (Fig. 1). The results produced by the proposed method were also superior to those obtained using a spectral variance based variable selection method reported in the literature.

Maintenance of a NIR Calibration Model by a Combined Principal Component Analysis/Partial Least Squares Approach
Document Ref: 01/P2/2
Issued: 28 February 2001

Summary
Introduction
Many industries require the analysis of similar samples on a routine basis, for example, quality assurance of raw materials or the analysis of final products. In general, due to the repetitive nature of these analyses, the use of a fast, accurate analytical methodology is highly desirable. Near infrared (NIR) spectroscopy, coupled with multivariate calibration methods, has considerable potential in such circumstances. In many cases, however, there is considerable difficulty in obtaining a representative calibration model and in ensuring its long-term validation. Moreover, it is often necessary to maintain the initially developed model using some of the future samples. The correct selection of samples to update, or maintain, the existing calibration model is a key step in the long-term implementation of multivariate calibration procedures. In this report, the development of a new algorithm to select samples when the size and boundary of the future population are unknown is described. These samples are then used to update a calibration model in order to maintain its predictive ability.

Problem description and objectives:
BP provided the data set used in this study. It involves the routine determination of the concentration of two components, Component A and Component B, in a raw material. The data set consisted of the near infrared (NIR) spectra of 102 samples collected over a one-year period, along with the corresponding reference concentrations of the two analytes determined by gas chromatography (GC). The data set was collected to determine the possibility of replacing the GC analysis with a faster NIR spectroscopic procedure.

The Sample Selection and Model Maintenance Algorithm
The algorithm is based on both principal component analysis (PCA) and partial least squares (PLS) multivariate procedures. PLS-1 is used to build the calibration models required to predict the concentrations of different components in a new sample. The algorithm defines how "similar" the new sample is to the samples currently defining the calibration data set. This step is performed by residual analysis, following PCA, which takes place using the Q and T² statistics. If the new sample is considered to have a spectrum "similar" to previously available spectra, then the model is assumed able to predict the analyte concentration. On the other hand, if the new sample is considered "dissimilar", then there is new information in this sample, which is unknown to the calibration model, and the new sample is added to the calibration set in order to improve the model.
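
The similarity check can be sketched with NumPy as below. This is a minimal illustration under stated assumptions, not the algorithm of the report: Q is computed as the squared residual after projection onto the PCA model and T² as the variance-scaled squared scores, but the decision rule (a multiple of the largest calibration value) is a hypothetical stand-in for proper statistical limits, and the function names and synthetic data are invented for the example. The PLS-1 prediction step is not shown.

```python
import numpy as np

def fit_pca(X, n_components):
    """Return mean, loadings (n_vars x k) and score variances of a PCA model."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    _, s, vt = np.linalg.svd(X - mean, full_matrices=False)
    loadings = vt[:n_components].T
    score_var = s[:n_components] ** 2 / (X.shape[0] - 1)
    return mean, loadings, score_var

def q_and_t2(x, mean, loadings, score_var):
    """Q: squared distance from the scores plane. T2: scaled distance from the origin."""
    centred = np.asarray(x, dtype=float) - mean
    scores = centred @ loadings
    residual = centred - scores @ loadings.T
    q = float(residual @ residual)
    t2 = float(np.sum(scores ** 2 / score_var))
    return q, t2

def is_dissimilar(x, X_cal, n_components, factor=3.0):
    """Hypothetical rule: flag x if Q or T2 exceeds `factor` times the calibration maximum."""
    mean, loadings, score_var = fit_pca(X_cal, n_components)
    cal = np.array([q_and_t2(row, mean, loadings, score_var) for row in X_cal])
    q, t2 = q_and_t2(x, mean, loadings, score_var)
    return bool(q > factor * cal[:, 0].max() or t2 > factor * cal[:, 1].max())

# Synthetic calibration set: one spectral profile at varying concentration plus noise.
rng = np.random.default_rng(1)
profile = np.sin(np.linspace(0.0, np.pi, 30))
conc = rng.uniform(1.0, 2.0, 20)
X_cal = np.outer(conc, profile) + rng.normal(0.0, 0.01, (20, 30))

similar = 1.5 * profile                                    # same chemistry
bump = np.exp(-((np.arange(30) - 5) / 1.5) ** 2)           # an unseen spectral feature
different = 1.5 * profile + 0.5 * bump

print(is_dissimilar(similar, X_cal, n_components=1))
print(is_dissimilar(different, X_cal, n_components=1))
```

A sample flagged as dissimilar would be sent for reference analysis and appended to `X_cal` before the models are rebuilt; a similar sample would simply be predicted.
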

Results
The algorithm produced an accurate calibration model for each target component starting with the first 4 samples and only required a further 17 reference measurements to maintain the model for the whole sampling sequence, conducted over a one-year period (Figure 1).

 

 

 
