88.

Teófilo R. F., Martins J. P. A., Ferreira M. M. C., "Ordered Predictors Selection: an intuitive method to find the most relevant variables in multivariate regression". Águas de Lindóia, SP, Brazil, 10-15/09/2006: 10th International Conference on Chemometrics in Analytical Chemistry (CAC-2006, CAC-X), Book of Abstracts (2006) P066. Poster 066.


10th International Conference on Chemometrics in Analytical Chemistry P066

Ordered Predictors Selection: an Intuitive method to find the
most relevant variables in multivariate regression

Reinaldo F. Teófilo, João Paulo Ataide Martins*, Márcia M. C. Ferreira

Laboratório de Quimiometria Teórica e Aplicada - Instituto de Química - Universidade Estadual de
Campinas

Keywords: predictor selection, multivariate calibratioin, chemometrics
_____________________________________________________________________________________
 

      Multivariate  regression  techniques  are  widely  used  to   model   chemical,   physical,   sensory  data,
besides quantitative structure-activity and -property relationships. i.e.  QSAR and QSPR, respectively. The
quality of a multivariate calibration model depends,  among others,  on the quality of the data  (objects  and
also variables).  Several strategies are available to  evaluate the predictive ability of a regression model by
measuring  the  error on objects  that  were  not  used when building the regression model.   In  the original
Partial Least Squares  (PLS)  and   Principal Component Regression  (PCR)  methods,  all  variables  were
used  and  hence  they  were denominated full-spectrum methods1.   However,  it is well known  that  better
results could  be  obtained when  only  the  most  important variables were selected and applied.   Variable
selection is crucial, especially in QSAR/QSPR  studies.
      This work introduces a  simple and intuitive method for feature selection.   In this method,  the variables
are sorted in a decreasing order with respect  to  its importance  to the model.   The  ordered variables  are
then evaluated using increments over a window previously defined.  The root mean square errors of  cross-
validation  (RMSECV)  and  the  correlation coefficient of cross-validation  (rcv)  values  are sorted  to  each
analyzed window. The best variables are indicated by lower RMSECV  and higher  rcv.   As a consequence,
the  algorithm was named  Ordered Predictor Selection  (OPS).  Several vectors or  their combinations can
be used to order  the  variables,  as the correlation vector,  regression vector,  loadings vector combination,
modeling power vector,  the difference between  the spectra relative to  the higher and lower concentration
values, and so on.   The choice will depend on  the data set under study.  The advantages  of  this method
are:  (i) applicability to highly correlated data set  (spectroscopy, voltammetry, process control), and to data
that present lower correlation among the variables  (QSAR,  QSPR,  mass spectrometry,  nuclear magnetic
resonance);   (ii)  objective on selecting variables that present the relevant chemical information,  since  the
vectors chosen  for  variable sorting  are selected  too;   (iii)  requirement  of  few input parameters  for  the
selection  method,  being  necessary  only  the  independent variable  matrix  and  the  dependent  variable
vector.
      The algorithm was written  in  Matlab  code  and its performance was evaluated on three data sets,  i.e.
QSPR  data (Set A),  Mid-infrared data  (Set B)  and  Voltammetric data  (Set C).   The  table  presents  the
results for the full data st and  OPS  data.   The regression method used was the PLS.   It is concluded that
the selection of informative variables  improved  the  predictive ability,   besides  making  the  models  more
interpretative and parsimonious,  since a lower number of variables was selected.
_____________________________________________________________________________________

                                                        Full data                                                        OPS data
                             _____________________________________________________________________
              Factors    nVars     rmsecv      rcv      rmsep     rp       nVars      rmsecv      rcv     rmsep       rp
Set A               2              677            0.199          0.42        0.183      0.79            5                 0.151         0.73        0.083        0.99
Set B               4            3351           34.68           0.59      22.770     0.88           30                25.03         0.81          6.7          0.99
Set C               4             353             0.027          0.95        0.016      0.97           30                0.009         0.99      0.00035     0.99
_____________________________________________________________________________________
 

Acknowledgment: To CNPq and FAPESP for their financial support.
 

_____________________________________________________________________________________

References

1 Martens H.; Naest T. Multivariate Calibration (2nd edn), vol. 1. Wiley: Chichester, UK, 1989, 35-70.