- To develop strategies
that produce robust calibration models from small data sets for
monitoring chemical processes.
- To develop sample
selection techniques to reduce the number of samples required
to produce a robust and effective calibration model.
- To develop variable
selection techniques to improve the predictive ability of a calibration
model.
Achievements
One area in which this
project has contributed is the development of commonly accepted
methods for reporting results, so that different modelling techniques
applied to the same data can be compared even when the analysis
is carried out by different researchers in different institutions.
This work has been fed into the SMT Project "European
Network for the Intercomparison of Chemometric Software and Methods"
(UK Co-ordinator Dr Walmsley, University of Hull).
All calibration models
are sample dependent, which is why experimental design prior to
modelling is so important. In most industrial applications, however,
an experimental design approach is not viable (i.e. there is little
or no opportunity to vary the plant in such a way as to provide
data sufficient to meet the stringent needs of the DoE), and as
a result most plant data is quite low in variability.
As spectroscopy is a commonly applied chemical analysis method,
spectroscopic data has been used throughout this project, focussing
on NIR, Raman, UV/Vis and NMR. This raises issues when treating
spectroscopic data in the same way as one would model plant data.
These areas have been investigated in a number of key projects.
Initial work reported
in [1] set the basis for determining the most suitable approach to
method comparison, which included defining a set of standard result
reporting methodologies. The case study used a simple
NIR data set that had been sourced from industry. The work concluded
that (i) variable selection techniques applied to spectroscopic
data provided significantly better results than using the entire
spectrum, (ii) the key to implementation of PLS modelling is determining
objectively the appropriate number of latent variables to use in
the model and (iii) statistical methods such as Ridge Regression (RR)
produce results comparable to PLS but are computationally expensive;
however, combining variable selection with RR
methods might produce an effective modelling alternative to PLS.
There are currently many
approaches to variable selection (VS); however, a novel method that
exploits the experimental design of the data was developed in [2].
The key to this work is that the response to the mixture is a linear
sum of the contributions of the absorbing species present. If one
uses an orthogonal experimental design, then it is possible to hold
all concentrations (and thus contributions) for all components except
the target constant. Using this ‘Linear Interval’ it is possible
to categorise the variables in the spectra into ‘good’ and ‘poor’
predictors for the target. The results obtained were superior to
those of VS methods from the literature, but this method is only
applicable where orthogonal designs are used.
The other significant
strand of work was to investigate optimal methods for sample selection,
and this has been reported in [3,4]. One of the key issues in statistical
modelling is updating (and maintaining) the model when changes
occur in the initial conditions. In the laboratory one can rely
upon appropriate experimental design to produce the best possible
model, but with an industrial process the development of the initial
calibration model is much more difficult. As a result the calibration
tends to have low variability and low robustness, which can cause
it to age significantly over time. One approach to this problem
of ageing is to include more samples over time to reflect the changing
conditions (which was the approach in use at this time). However,
there is scope for examining this historical data and developing
suitable methods to extract the samples which contain the greatest
variation in the data (this is analogous to DoE methods, in which
the levels of the samples and their combinations are designed to
provide the greatest variation within the design space). The approach
in [3] uses the statistical method of ‘Variance of Prediction’,
in which the model computes whether a sample is contained within
the variance already described by some initial samples. If it is,
the sample is not used in the calibration model; if it is not,
the sample is added to the calibration model. Using the
historical data (and maintaining the time series) it was possible
to develop a more robust calibration utilising only 30% of the samples.
This approach was considered to be invaluable at process start-up:
samples can be collected off-line and their spectra taken, and the model
can then determine whether or not each sample contains any new information
(if so, a reference analysis can be performed). This approach
would provide a significant cost saving. An alternative approach was
also developed in [4], in which PLS and PCA were used to determine
the usefulness of the samples in the time series based upon the PCA
Q and T² statistics (the Q statistic
can be described as the distance from the PCA scores plane and
T² as the distance from the origin of the PCA
scores). Taking some appropriate initial samples, it is possible
to statistically determine the significance of subsequent Q and
T² values. This approach required only 20% of
the initial data to produce an appropriate model.
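The report does not give the formula used for the variance of prediction in [3]. As an illustration only, the sketch below uses the textbook leverage of a sample, which for a linear model is proportional to the sample-dependent part of the prediction variance. The function names, the threshold of 0.5 and the random two-column `features` array (standing in for a few PCA or PLS scores) are assumptions made for this example and are not taken from the project.

```python
import numpy as np

def leverage(candidate, selected_rows):
    """Leverage of `candidate` relative to the rows already in the calibration set."""
    mean = selected_rows.mean(axis=0)
    centered = selected_rows - mean
    # pinv tolerates a rank-deficient calibration set early in the sequence.
    inv_cov = np.linalg.pinv(centered.T @ centered)
    d = candidate - mean
    return 1.0 / len(selected_rows) + float(d @ inv_cov @ d)

def select_sequentially(features, n_initial, threshold):
    """Walk the series in time order; keep samples whose leverage exceeds `threshold`."""
    selected = list(range(n_initial))
    for i in range(n_initial, len(features)):
        if leverage(features[i], features[selected]) > threshold:
            # The sample lies outside the variation already described,
            # so it would be sent for reference analysis and added.
            selected.append(i)
    return selected

# Synthetic stand-in data: 60 samples in time order, 2 descriptive variables.
rng = np.random.default_rng(0)
features = rng.standard_normal((60, 2))
features[40] = [10.0, 10.0]   # an obviously unusual sample

selected = select_sequentially(features, n_initial=5, threshold=0.5)
print(len(selected), "of", len(features), "samples kept")
```

Because the calibration set grows only when a sample carries new variation, the leverage of typical samples falls as the series proceeds and fewer reference analyses are requested.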
Most industries now use
MSPC methods for both plant monitoring and control.
However, this type of modelling has often focussed upon process
variables rather than chemical ones. One of the reasons is that
a spectrum, for example, consists of many variables, and the contribution
of any portion of the data to the overall process model is complex.
There is a growing need to model spectral data in the same way as
process data (i.e. not requiring any offline reference methods)
and to display this data in an appropriate manner (as a process control
chart, for example). One approach to this problem [5] is to use the
prediction variance, which is a single number, to represent the
changes in a new spectrum compared to existing ‘typical’ spectra.
This approach requires no reference data, just the spectra of the
samples and, as commonly required in MSPC models, a defined set
of samples that are known to be typical or representative of the
product required. In this way the spectra can be treated statistically
in a standard control chart, or used in an MSPC model. Including
spectral data in this way, so that all the variance about the
sample is captured without the need for either a large number of
spectral wavelengths or an independent calibration, is a significant
step forward in merging engineering MSPC approaches with chemical
specific data.
Working without calibration
has emerged as a common theme over the course of the project, and
it was felt that some work should be carried out in this area [5,6].
Approaches used include kinetic monitoring using curve resolution
methods to enable accurate end-point determination (which relies only
on knowing the reaction order and pure spectral profiles
of the components rather than on a calibration). The advantage of
this method is that the end-point can be determined much faster than
with traditional methods, and with fewer samples. Other
applications have included EWFA, which is a useful technique when
data has an intrinsic order [6]. EWFA tracks the singular values
of a moving window in a data set, and these singular values are plotted
as a function of time, making it easy to see sudden changes
over time. Research has shown that a narrower window provides
better resolution but a lower signal-to-noise ratio.
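The moving-window calculation can be sketched in a few lines. The example below is a minimal illustration on synthetic data, not the implementation reported in [6]: the function name `ewfa_singular_values`, the Gaussian band shapes, the noise level and the window width of 5 are all assumptions chosen for the example.

```python
import numpy as np

def ewfa_singular_values(data, window, n_values=2):
    """Leading singular values of each window of consecutive (time-ordered) rows."""
    data = np.asarray(data, dtype=float)
    rows = []
    for start in range(data.shape[0] - window + 1):
        block = data[start:start + window]
        s = np.linalg.svd(block, compute_uv=False)  # sorted, largest first
        rows.append(s[:n_values])
    return np.array(rows)

# Synthetic series: species A is present throughout, species B appears
# from sample 20 onwards and then increases steadily.
rng = np.random.default_rng(0)
channels = np.linspace(0.0, 1.0, 50)
profile_a = np.exp(-((channels - 0.3) / 0.1) ** 2)
profile_b = np.exp(-((channels - 0.7) / 0.1) ** 2)
conc_a = np.ones(40)
conc_b = np.concatenate([np.zeros(20), np.linspace(0.1, 1.0, 20)])
spectra = np.outer(conc_a, profile_a) + np.outer(conc_b, profile_b)
spectra += 0.001 * rng.standard_normal(spectra.shape)

sv = ewfa_singular_values(spectra, window=5, n_values=2)
# sv[i, 1] stays at noise level while only A is present and rises
# once the window contains samples in which B varies.
```

Plotting each column of `sv` against the window start index gives the usual EWFA trace. Widening `window` averages over more samples, which reduces noise in the singular values but blurs the point at which the second species appears.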
Other output from
this project was in the area of workshops and tutorials, as requested
by the membership. Two useful tutorials covered Wavelet
Transformation and Experimental Design; both presented the
background details and then a very strong application to real
process analytical problems [7,8]. This work, together with the associated
workshops, has enabled the membership to apply algorithms developed
in this project to their own process analytical problems.
The project team has
also influenced work on other projects, providing chemometric expertise
and support for both Projects 1 and 3, and work on developing data
analysis tools for novel process analysers has been reported [9].
Deliverables
Development and Maintenance
of a Calibration Model for Process Analysis
Document Ref: 00/P2/1
Issued: 1 June 2000
Abstract
One
of the key issues in statistical modelling is updating and maintaining
the model when changes in the initial conditions
occur. In laboratory situations one can usually rely upon good experimental
design methods to ensure that the best possible model is developed.
This is not usually the case in process situations, where the variability
of the variables is low compared with situations where the data is
collected by designed experiments. In these situations, development
of the calibration model is much more difficult, requiring the ability
to construct a model using the fewest possible process
samples, as off-line process measurements can be both time consuming
and costly. The work presented here shows a method for the sequential
updating of a calibration model, based upon sample selection. The
data used as an example consists of NIR spectra and GC measurements
of the feed, heads and tails of a process reactor taken over a
period of about 1 year. The model has been developed using PLS latent
variable reduction of NIR spectra. The advantages of the proposed
algorithm are that it is not dependent upon the nature of the process
and that it is robust in the presence of changes in process conditions.
The results here have shown that the proposed procedure requires
less than 30% of the measurements needed to produce a model that
is comparable to one developed using all measurements.
Variable Selection for PLS Calibration of
NIR Data from Orthogonally Designed Experiments
Document Ref: 01/P2/1
Issued: 28 February 2001
Summary
Introduction
For many types of modern instrumentation,
the number of measured variables is often very large. In many cases,
not all the measured variables contain useful information for development
of a calibration model, e.g. they may contain only background noise.
Many researchers have suggested
that significant improvements in the prediction accuracy of a model
may be achieved through careful selection of the variables used
to form the calibration model. Moreover, variable selection can
result in a reduction in multicollinearity. Collinearity, which means
that some of the variables are linear combinations of the other variables,
is the main cause of instability in the calculation of the regression
coefficients. Variable selection procedures comprise decisions as
to whether to include or exclude a particular predictor variable
on the basis of one or more criteria determined by the researcher.
The selection of an appropriate criterion and of a strategy for applying
that criterion are the main issues in variable selection. In this
report, a new approach to variable selection, Variable Selection
over Linear Intervals (VSLI), is described and applied to NIR data
for a mixture of three chemicals.
The VSLI Algorithm
According to the Beer-Lambert
law the observed absorbance at any given wavelength (λ) is
a linear sum of the products of the individual concentrations and
absorption coefficients (ε_λ)
for each one of the absorbing species present. Therefore, if the
concentrations of the components in a mixture vary linearly, the
near infrared (NIR) absorbance values should also show linear variation.
Moreover, if the concentrations of all components in the mixture
except the target component are kept constant, which can be achieved
by an orthogonal design, such as factorial or partial factorial
design, the variation in the absorbance at any given wavelength
corresponds to the variation in the concentration of the target
component. Based on this idea, VSLI initially examines the individual
variables and divides the set of the predictor variables into ‘Good’
and ‘Poor’ variables with the ‘Good’ variables being ranked according
to their properties. The separation and ranking of the variables
is based upon the satisfaction of a linearity criterion that is
defined over the design of the calibration samples in the experimental
design. A forward selection strategy is then employed to determine
the optimum number of ‘Good’ variables to define the best calibration
model.
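The report does not state the exact linearity criterion used by VSLI. The sketch below therefore uses the squared correlation (R²) between absorbance and target concentration over one linear interval as a stand-in, and it covers only the initial split and ranking, not the forward selection step. The function name `rank_by_linearity`, the 0.99 threshold and the small data table are assumptions made for illustration.

```python
import numpy as np

def rank_by_linearity(absorbance, target_conc, r2_threshold=0.99):
    """Split wavelengths into 'good' and 'poor' by R^2 against the target concentration.

    absorbance: (n_samples, n_wavelengths) spectra from one linear interval, i.e.
                samples in which only the target concentration changes.
    """
    absorbance = np.asarray(absorbance, dtype=float)
    conc = np.asarray(target_conc, dtype=float)
    c_centered = conc - conc.mean()
    a_centered = absorbance - absorbance.mean(axis=0)
    cov = c_centered @ a_centered              # one value per wavelength
    ss_c = c_centered @ c_centered
    ss_a = (a_centered ** 2).sum(axis=0)
    # Flat channels have zero variance; give them R^2 = 0 instead of dividing by zero.
    safe_ss_a = np.where(ss_a > 0, ss_a, 1.0)
    r2 = np.where(ss_a > 0, cov ** 2 / (ss_c * safe_ss_a), 0.0)
    good = np.flatnonzero(r2 >= r2_threshold)
    good = good[np.argsort(r2[good])[::-1]]    # most linear first
    poor = np.flatnonzero(r2 < r2_threshold)
    return good, poor, r2

# Four concentration levels of the target, four hypothetical wavelengths:
# 0: exactly linear, 1: nearly linear, 2: curved response, 3: flat baseline.
conc = np.array([10.0, 20.0, 30.0, 40.0])
spectra = np.array([
    [0.2, 0.101,  0.1, 0.5],
    [0.4, 0.199, -0.1, 0.5],
    [0.6, 0.299, -0.1, 0.5],
    [0.8, 0.401,  0.1, 0.5],
])
good, poor, r2 = rank_by_linearity(spectra, conc)
print("good (ranked):", good.tolist(), "poor:", poor.tolist())
```

The ranked `good` list is what a forward selection loop would then consume, adding one wavelength at a time to a PLS model and keeping the subset with the lowest prediction error.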
Results
The VSLI algorithm was tested
on an NIR spectroscopic data set supplied by Zeneca Agrochemicals.
It consists of mixtures of 3 components, identified as components
A, B, and C. The sum of the concentrations (% w/w) of the three
components was always 100%. A 4-level 2-factor orthogonal design
was used to define the calibration data, with all sample concentrations
determined gravimetrically using a four-place analytical balance.
A set of 9 ‘test’ samples, arranged randomly throughout the experimental
design, was used to test the prediction ability of the resulting
calibration models. Significant improvements in the prediction accuracy
of PLS models were observed for the different components in the
chemical mixture when the selected wavelengths in the NIR spectra
were used rather than the full spectrum (Fig. 1). The results produced
by the proposed method were also superior to those obtained using
a spectral variance based variable selection method reported
in the literature.
Maintenance of a NIR Calibration Model by
a Combined Principal Component Analysis/Partial Least Squares Approach
Document Ref: 01/P2/2
Issued: 28 February 2001
Summary
Introduction
Many industries require the
analysis of similar samples on a routine basis, for example, quality
assurance of raw materials or the analysis of final products. In
general, due to the repetitive nature of these analyses, the use
of a fast, accurate analytical methodology is highly desirable.
Near infrared (NIR) spectroscopy, coupled with multivariate calibration
methods, has considerable potential in such circumstances. In many
cases, however, there is considerable difficulty in obtaining a
representative calibration model and ensuring its long-term validity.
Moreover, it is often necessary to maintain the initially developed
model using some of the future samples. The correct selection of
samples to update, or maintain, the existing calibration model is
a key step in the long-term implementation of multivariate calibration
procedures. In this report, the development of a new algorithm to
select samples when the size and boundary of the future population are
unknown is described. These samples are then used to update a calibration
model in order to maintain its predictive ability.
Problem description and objectives:
BP provided the
data set used in this study. It involves the routine determination
of the concentration of two components, Component A and Component
B, in a raw material. The data set consisted of the near infrared
(NIR) spectra of 102 samples collected over a one-year period, along
with the corresponding reference concentrations of the two analytes
determined by gas chromatography (GC). The data set was collected
to determine the possibility of replacing the GC analysis with a
faster NIR spectroscopic procedure.
The Sample Selection and Model Maintenance Algorithm
The algorithm is based on both principal component analysis (PCA)
and partial least squares (PLS) multivariate procedures. PLS-1 is
used to build the calibration models required to predict the concentrations
of different components in a new sample. The algorithm determines how
"similar" the new sample is to the samples currently defining
the calibration data set. This step is performed by residual analysis,
following PCA, using the Q and T²
statistics. If the new sample is considered to have a spectrum "similar"
to previously available spectra, then the model is assumed able
to predict the analyte concentration. On the other hand, if the
new sample is considered "dissimilar", then there is new
information in this sample, which is unknown to the calibration
model, and the new sample is added to the calibration set in order
to improve the model.
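The similarity check can be illustrated with a short sketch of the Q and T² calculation. The summary does not say how the significance limits were set, so the example uses an ad hoc limit of three times the largest value seen in the calibration set. That rule, the function names and the synthetic spectra are assumptions for this example, and the PLS-1 prediction step is omitted.

```python
import numpy as np

def fit_pca(calibration, n_components):
    """PCA by SVD of the mean-centred calibration spectra."""
    mean = calibration.mean(axis=0)
    _, s, vt = np.linalg.svd(calibration - mean, full_matrices=False)
    loadings = vt[:n_components].T                        # (n_channels, k)
    score_var = s[:n_components] ** 2 / (calibration.shape[0] - 1)
    return mean, loadings, score_var

def q_and_t2(spectrum, mean, loadings, score_var):
    """Q: squared residual off the scores plane. T2: scaled distance within it."""
    centered = spectrum - mean
    scores = centered @ loadings
    residual = centered - scores @ loadings.T
    return float(residual @ residual), float(np.sum(scores ** 2 / score_var))

def peak(x, centre, width):
    return np.exp(-((x - centre) / width) ** 2)

# Synthetic calibration set: two species with varying concentrations plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 30)
pure = np.vstack([peak(x, 0.3, 0.1), peak(x, 0.7, 0.1)])
calibration = rng.uniform(0.2, 1.0, size=(20, 2)) @ pure
calibration += 0.001 * rng.standard_normal(calibration.shape)

mean, loadings, score_var = fit_pca(calibration, n_components=2)
cal_stats = np.array([q_and_t2(row, mean, loadings, score_var) for row in calibration])
q_limit, t2_limit = 3.0 * cal_stats.max(axis=0)           # assumed, ad hoc limits

similar = 0.6 * pure[0] + 0.6 * pure[1]
dissimilar = similar + 0.5 * peak(x, 0.5, 0.05)           # contains an unseen species
flags = {}
for name, spectrum in [("similar", similar), ("dissimilar", dissimilar)]:
    q, t2 = q_and_t2(spectrum, mean, loadings, score_var)
    flags[name] = bool(q > q_limit or t2 > t2_limit)       # True: add to calibration set
print(flags)
```

A sample flagged `True` would be sent for reference analysis and added to the calibration set, after which the PCA and PLS models would be rebuilt before the next sample is examined.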
Results
The algorithm produced an
accurate calibration model for each target component starting with
the first 4 samples and only required a further 17 reference measurements
to maintain the model for the whole sampling sequence, conducted
over a one-year period (Figure 1).