QSRR Analysis in Characterization of Some Benzimidazole Derivatives

In this paper, a quantitative structure-retention relationship (QSRR) study was applied in order to correlate the obtained retention parameter R_M^0 with two groups of molecular descriptors for eleven investigated benzimidazole derivatives. Principal component analysis (PCA), followed by hierarchical cluster analysis (HCA), linear regression (LR) and multiple linear regression (MLR), was applied in order to identify the most important molecular descriptors. Mathematical models were established, and the best models were further validated by the leave-one-out (LOO) technique as well as by calculation of statistical parameters. Statistically significant models were obtained.


Introduction
Benzimidazoles, as biologically active compounds, are a frequently studied group of molecules [5,6]. Because of their wide range of activities, the chromatographic behavior and physicochemical characteristics of a number of benzimidazole derivatives have been studied by thin-layer chromatography (TLC) [7,8,10-12]. Mathematical models are a convenient way to understand chromatographic processes. Quantitative structure-retention relationship (QSRR) analysis is a useful technique for determining relationships between the chromatographic properties of the investigated molecules and their molecular descriptors. Established QSRR models can be applied to identify the most useful structural descriptors, to predict the retention of newly synthesized molecules and to identify unknown analytes [13]. In QSRR analysis, the correlation between retention data (R_M^0 values) and structural parameters (molecular descriptors) can be examined by linear regression (LR), multiple linear regression (MLR), principal component regression (PCR), partial least squares regression (PLS) and artificial neural networks (ANN).
In this study, R_M^0 values were correlated with two groups of descriptors: molecular and in silico ADME (absorption, distribution, metabolism and excretion) descriptors. LR and MLR were used to establish the equations, and principal component analysis (PCA) and hierarchical cluster analysis (HCA) were carried out for data overview.
The goals of this study were to evaluate the retention data by multivariate statistical methods and to find the possible relationship between retention characteristics and molecular and in silico ADME descriptors of the investigated benzimidazole derivatives.

Material and Methods
The steps in the QSRR analysis were: molecular structure optimization using computer software, computation of molecular descriptors, selection of molecular descriptors, generation of structure-retention models using the LR and MLR methods, and statistical validation.

1. Studied Compounds
The chemical structures of the investigated benzimidazole derivatives are presented in Table 1. The compounds are divided into three groups according to their chemical structure: molecules 1-4, 5-8 and 9-11. The compounds were synthesized by a procedure described elsewhere [14]. The experimental procedure of the RP TLC separation on C18 silica gel plates and the obtained retention data (R_M^0) of the analyzed compounds were reported previously [7].
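For orientation, the retention parameter can be summarized by the conventional TLC relations below. This is a sketch of the standard definitions, given here as an assumption; the procedure actually used is the one reported in [7]. The symbols R_F (retardation factor), φ (volume fraction of organic modifier in the mobile phase) and S (slope) do not appear in the text above and are introduced only for illustration.

```latex
R_M = \log\left(\frac{1}{R_F} - 1\right), \qquad R_M = R_M^0 + S\,\varphi
```

Under these assumptions, R_M^0 is the intercept of the linear relationship, that is, the R_M value extrapolated to φ = 0.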

2. Molecular Modeling and Molecular Descriptors
Two groups of descriptors, molecular and in silico ADME descriptors, were derived from the chemical structure. Modeling of the studied compounds was performed with ChemBioDraw Ultra 12.0 for 2D structures and ChemBio3D for 3D molecular structures [15]. The derived 3D molecular structures were subjected to energy minimization using the molecular mechanics force field method (MM2). The minimization was performed until the root mean square (RMS) gradient reached a value smaller than 0.1 kcal/(Å·mol).
Three types of molecular descriptors were derived (Table 2): variables that describe the physicochemical properties of the whole molecule, such as molar refractivity (MR), molar volume (MV), hydration energy (HE) and surface area grid (SAG); total energy (TE), which is a quantum chemical property; and polarizability (P) and dipole moment (DM) as electronic features of the molecules.
In silico ADME descriptors were calculated on the basis of 2D structures, using the Molinspiration online program [16]. The calculated in silico ADME descriptors are (Table 2): G protein-coupled receptor ligand (GPCR), ion channel modulator (ICM), kinase inhibitor (KI), nuclear receptor ligand (NRL), protease inhibitor (PI) and enzyme inhibitor (EI).

3. Chemometric Methods
In QSRR analysis, the correlation between retention data and various empirical, semi-empirical and non-empirical structural parameters is usually examined by MLR [13]. The main aim in QSRR analysis is to reduce the number of variables and to detect structure in the relationships between variables, using various statistical methods of explorative analysis, classification and regression [17,18].
PCA is a useful statistical technique for reducing the amount of data when correlation is present, while retaining as much information as possible. It calculates new, latent variables as combinations of the original variables: the data are projected onto a few principal components (PCs) that are linear combinations of the original variables. Each PC is characterized by scores, which are the new coordinates of the projected objects, and loadings, which reflect the direction with respect to the original variables [19].
HCA is a method for dividing a group of objects into clusters so that similar objects are in the same cluster. This type of analysis searches for objects that are close together in the variable space. The cluster hierarchy is displayed as a tree diagram called a dendrogram, where the horizontal axis represents the distance or dissimilarity between the clusters.
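The PCA decomposition described above can be illustrated with a minimal sketch. It assumes autoscaled descriptors (zero mean, unit variance) and extracts the leading components of the correlation matrix by power iteration with deflation. The function names and the small data matrix are hypothetical, and this is not the Statistica procedure used in the study.

```python
import math

def autoscale(X):
    """Center each descriptor column and scale it to unit sample variance."""
    n = len(X)
    cols = list(zip(*X))
    means = [sum(c) / n for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / (n - 1))
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in X]

def matvec(C, v):
    """Multiply a square matrix C by a vector v."""
    return [sum(cij * vj for cij, vj in zip(row, v)) for row in C]

def pca(X, n_components=2, iters=1000):
    """Return (scores, loadings, explained variance fractions)."""
    Z = autoscale(X)
    n, p = len(Z), len(Z[0])
    # Correlation matrix of the autoscaled descriptors.
    C = [[sum(Z[i][a] * Z[i][b] for i in range(n)) / (n - 1)
          for b in range(p)] for a in range(p)]
    eigvals, loadings = [], []
    for _component in range(n_components):
        # Asymmetric start vector, so it is unlikely to be orthogonal
        # to the eigenvector being sought.
        v = [1.0 / (j + 1) for j in range(p)]
        for _step in range(iters):
            w = matvec(C, v)
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        lam = sum(a * b for a, b in zip(v, matvec(C, v)))
        eigvals.append(lam)
        loadings.append(v)
        # Deflation: remove the component just found.
        C = [[C[a][b] - lam * v[a] * v[b] for b in range(p)] for a in range(p)]
    # Scores are the coordinates of each object on the PCs.
    scores = [[sum(z * l for z, l in zip(row, load)) for load in loadings]
              for row in Z]
    # The trace of a correlation matrix equals p, the total variance.
    explained = [lam / p for lam in eigvals]
    return scores, loadings, explained

# Hypothetical data: 5 objects (molecules) x 3 descriptors.
descriptors = [[2.0, 1.0, 0.5], [3.0, 2.1, 0.4], [4.1, 2.9, 0.9],
               [5.0, 4.2, 0.3], [6.2, 4.8, 0.8]]
scores, loadings, explained = pca(descriptors)
print("explained variance fractions:", explained)
```

Plotting the two score columns against each other gives a score plot, and plotting the two loading vectors gives a loading plot. The sign of each loading vector is arbitrary, so a whole PC may appear mirrored between programs.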
LR is used for establishing the relationship between a dependent variable and a single independent variable. It models the relationship between the two variables by fitting a linear equation to the observed data. The general LR model can be written as:
$$y = ax + b$$
where y is the dependent variable (the quantitative property to predict), a is the slope, x is the independent variable (descriptor) and b is the intercept.
MLR is used to quantify the relationship between more than one independent variable and a dependent variable. A major problem in MLR modeling is how to avoid multicollinearity. The variance inflation factor (VIF) is used as a diagnostic tool to check the impact of multicollinearity in the MLR models. In the literature, a VIF greater than 10 is considered to indicate multicollinearity [13,20].
A very important aspect of a QSRR study is model validation. Standard statistical parameters for model validation were used: Pearson's correlation coefficient (r), Fisher's value (F) and the standard error of estimation (s), together with the cross-validation parameters: the cross-validation coefficient of determination (r^2_cv), adjusted coefficient of determination (r^2_adj), predicted residual sum of squares (PRESS), total sum of squares (TSS) and standard deviation based on the predicted residual sum of squares (S_PRESS) [21]. High values of r^2_cv and r^2_adj (> 0.5) indicate high predictive power of the equations [22].
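As an illustration of the quantities named above, the following sketch fits a one-descriptor model of the form y = ax + b by least squares and computes r, s and F, together with the VIF for a pair of descriptors. It assumes the textbook formulas for simple regression (s from n - 2 degrees of freedom, and VIF = 1/(1 - r^2) for a two-descriptor model). The function names and the numerical data are hypothetical and are not the values from Tables 2-4.

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def linear_fit(x, y):
    """Least-squares slope a and intercept b of y = a*x + b."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((u - mx) * (v - my) for u, v in zip(x, y))
    sxx = sum((u - mx) ** 2 for u in x)
    a = sxy / sxx
    return a, my - a * mx

def lr_statistics(x, y):
    """Slope, intercept, r, standard error of estimation s and F value.

    F is undefined (division by zero) for a perfect fit with r^2 = 1.
    """
    n = len(x)
    a, b = linear_fit(x, y)
    rss = sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(x, y))
    r = pearson_r(x, y)
    s = math.sqrt(rss / (n - 2))          # n - 2 degrees of freedom
    f = r ** 2 * (n - 2) / (1 - r ** 2)   # one descriptor in the model
    return {"a": a, "b": b, "r": r, "s": s, "F": f}

def vif_two_descriptors(x1, x2):
    """VIF of either descriptor in a two-descriptor MLR model."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)

# Hypothetical descriptor and retention values.
descriptor = [1.0, 2.0, 3.0, 4.0, 5.0]
retention = [2.1, 3.9, 6.2, 7.8, 10.1]
print(lr_statistics(descriptor, retention))
print(vif_two_descriptors(descriptor, [2.0, 1.0, 4.0, 3.0, 5.0]))
```

For models with more than two descriptors, the VIF of a descriptor is obtained from the coefficient of determination of its regression on all the other descriptors, which this two-descriptor shortcut does not cover.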

Results and Discussion

1. PCA
PCA was performed on both sets of descriptors in order to reveal similarities among the studied molecules. The analysis was carried out with the Statistica v. 10 program [23]. For the molecular descriptors, PCA resulted in a model that explains 89.78% of the total variance with two significant PCs. The first principal component accounted for 77.05% of the data variance and the second one for 12.73% (Figure 1a). As can be observed from the loading graph (Figure 2a), all descriptors except DM have a significant negative influence on PC1, while DM has a high positive influence. Along the PC2 axis, the TE descriptor has the most positive influence, while DM has the highest negative influence. No grouping of the molecules can be observed along the PC1 or PC2 axis in the score plot.
For the in silico ADME descriptors, the model explains 89.25% of the total variance, also with two significant PCs. The first principal component accounted for 64.26% of the data variance and the second one for 24.99% (Figure 1b). As can be observed from the loading graph (Figure 2b), all descriptors have a significant negative influence on PC1. Along the PC2 axis, the NRL descriptor has the most positive influence, while ICM has the highest negative influence. As for the molecular descriptors, no grouping of the molecules can be observed along the PC1 or PC2 axis. On the score plot for the in silico ADME descriptors, two outliers can be observed, molecules 2 and 5.

2. HCA
Clustering was based on Ward's linkage method and Euclidean distance. HCA was conducted using the NCSS 2007 and GESS 2006 software [24]. The dendrogram based on molecular descriptors (Figure 3a) shows two well-separated clusters. One cluster consists of the basic molecule of each group (1, 5 and 9), which have hydrogen in position R1. Their molar refractivity is significantly different from that of the other molecules, as confirmed by the calculated values. The second cluster contains compounds that have alkyl groups (ethyl, butyl and hexyl) in position R1. The grouping in the dendrogram agrees with the arrangement of the molecules on the PC1-PC2 score plot (Figure 1a).
The dendrogram based on in silico ADME descriptors resulted in two main clusters (Figure 3b). The first cluster consists of molecules 10, 9, 6 and 2, which have the highest enzyme inhibition ability, and the second cluster contains compounds with lower values of enzyme inhibition ability. The compounds are grouped in HCA in the same way as in PCA (Figure 1b).
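The clustering criterion named above can be sketched as follows. This is a minimal agglomerative procedure assuming Ward's criterion in its usual form: at each step the two clusters are merged whose fusion gives the smallest increase in the within-cluster sum of squared Euclidean distances. The function names and the example coordinates are hypothetical, the data are used as given (no scaling), and the sketch does not reproduce the NCSS implementation.

```python
def centroid(X, members):
    """Mean vector of the objects whose row indices are in members."""
    dim = len(X[0])
    return [sum(X[i][d] for i in members) / len(members) for d in range(dim)]

def ward_cost(X, a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(X, a), centroid(X, b)
    d2 = sum((u - v) ** 2 for u, v in zip(ca, cb))  # squared Euclidean distance
    return len(a) * len(b) / (len(a) + len(b)) * d2

def ward_clusters(X, k):
    """Merge objects until k clusters remain; return (clusters, merge costs)."""
    clusters = [[i] for i in range(len(X))]
    merge_costs = []
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                cost = ward_cost(X, clusters[a], clusters[b])
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        cost, a, b = best
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
        merge_costs.append(cost)
    return sorted(sorted(c) for c in clusters), merge_costs

# Hypothetical objects described by two variables.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
groups, costs = ward_clusters(points, k=2)
print(groups)
```

The sequence of merge costs is what a dendrogram displays as the fusion level of each pair of clusters. Other programs may report a rescaled value (for example, a square root of a multiple of this cost), so absolute heights are not directly comparable.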

3. LR and MLR
LR and MLR were conducted using the NCSS 2007 and GESS 2006 software [21]. For the MLR models, pairs of descriptors with a low intercorrelation coefficient were used. Each constructed LR and MLR model had to be statistically valid. In the present study, the MLR models were limited to two independent variables because of the small number of studied compounds (eleven). The established LR and MLR equations for both sets of descriptors, which are statistically significant and free of multicollinearity (VIF < 10), are presented in Tables 3 and 4. The statistical quality of the generated models was assessed by r, s and F. Equations 1-4 were cross-validated by the leave-one-out method (Table 5). High values of r^2_cv and r^2_adj (> 0.5), and PRESS values significantly lower than TSS for all four models, indicate that these models have very good predictive power [25]. In equations 1-4, all descriptors have a positive influence on the retention.
The usefulness of the established models can be confirmed by the plots of predicted versus experimentally observed R_M^0 values and the plots of residual values versus experimentally observed R_M^0 values (Figure 4). The latter plots show that the residuals are randomly distributed around the y = 0 axis. Based on the cross-validation parameters and the plots, it can be concluded that better models are obtained with molecular descriptors than with in silico ADME descriptors. Equations 1 and 2 give the best models; by the same criteria, models 3 and 4 are satisfactory.
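The leave-one-out procedure behind these cross-validation parameters can be sketched as follows for a model with any number of descriptors. The sketch assumes ordinary least squares solved through the normal equations, r^2_cv = 1 - PRESS/TSS, and S_PRESS = sqrt(PRESS/(n - k - 1)) with k descriptors. These formulas, the function names and the numerical data are assumptions made for illustration and are not taken from Tables 3-5.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def ols_fit(X, y):
    """Coefficients [intercept, b1, ..., bk] from the normal equations."""
    D = [[1.0] + list(row) for row in X]   # prepend the intercept column
    p = len(D[0])
    AtA = [[sum(d[a] * d[b] for d in D) for b in range(p)] for a in range(p)]
    Aty = [sum(d[a] * yi for d, yi in zip(D, y)) for a in range(p)]
    return solve(AtA, Aty)

def predict(coef, row):
    """Predicted response for one object."""
    return coef[0] + sum(c * v for c, v in zip(coef[1:], row))

def loo_validation(X, y):
    """PRESS, TSS, r2_cv and S_PRESS from leave-one-out cross-validation."""
    n, k = len(y), len(X[0])
    press = 0.0
    for i in range(n):
        # Refit the model without object i, then predict object i.
        coef = ols_fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - predict(coef, X[i])) ** 2
    mean_y = sum(y) / n
    tss = sum((v - mean_y) ** 2 for v in y)
    return {"PRESS": press, "TSS": tss, "r2_cv": 1.0 - press / tss,
            "S_PRESS": math.sqrt(press / (n - k - 1))}

# Hypothetical two-descriptor data set with a noisy response.
X_demo = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0], [6.0, 5.0]]
y_demo = [9.2, 7.9, 19.1, 17.8, 29.3, 27.9]
print(loo_validation(X_demo, y_demo))
```

A PRESS value much smaller than TSS gives r^2_cv close to 1, which is the comparison made in the text. Because each prediction comes from a model that never saw the left-out compound, r^2_cv is normally lower than the fitted r^2.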

Conclusion
The aim of this study was to evaluate the retention data obtained by RP TLC using multivariate statistical methods and to find the best models. PCA and HCA were carried out and mathematical models were developed. PCA did not show grouping among the studied molecules with either set of descriptors. The results of HCA showed two well-separated clusters in both cases. The usefulness of the established models was confirmed by standard and cross-validation statistical parameters. Comparison of the experimental and predicted values, and of the experimental and residual values, confirmed that the established MLR models can be successfully used for the prediction of R_M^0 values. In addition, on the basis of the presented results it can be concluded that the molecular and in silico ADME descriptors can be successfully used for predicting the retention parameters obtained by RP TLC. The predictive ability of the presented models allows the retention behavior of structurally similar compounds to be estimated and reduces the analysis time of the investigated compounds.

Figure 1. Score plots of molecular (a) and in silico ADME (b) descriptors.

Figure 4. Plots of predicted versus experimentally observed R_M^0 values and plots of residual values versus experimentally observed R_M^0 values.

Acknowledgements
These results are part of projects No. 31055, No. 172012 and No. 172014, supported by the Ministry of Education, Science and Technological Development of the Republic of Serbia, and of projects No. 114-451-1156/2014-02 and No. 114-451-3503/2011-2014, financially supported by the Provincial Secretariat for Science and Technological Development of Vojvodina.

Table 1. Chemical structures of the eleven studied benzimidazole derivatives.

Table 2. Values of the molecular and in silico ADME descriptors.

Table 3. Statistical parameters for the linear dependence between R_M^0 and the calculated descriptors.

Table 4. Statistical parameters for the multilinear dependence between R_M^0 and the calculated descriptors.