This article was originally published in the Internet Journal of Chemistry (1999), 2, Article No. 14.
Unfortunately, the Journal is no longer available online.
I want to thank Steve Bachrach, the former Journal editor, for permission to serve this article locally.
Definition of an Optimal Subset of Organic Substituents. Interactive Visual Comparison of Various Selection Algorithms.
Uwe Eichler (a), Peter Ertl (a), Alberto Gobbi (a) and Bernd Rohde (b)
(a) Novartis Crop Protection AG and (b) Novartis Pharma AG, CH-4002 Basel, Switzerland
selection of an optimal subset,
Publication date: April 28 1999 22:09:00 GMT
Selection of a subset of molecules that should represent the whole structural
and physicochemical space of the original pool is a problem frequently encountered in
various areas of chemistry, including QSAR analysis, combinatorial synthesis and
rational purchase of molecules for screening. This has led to the development of a large
number of selection techniques. This contribution compares results of
eight selection algorithms on a "classical" problem, namely selection of a
representative subset of organic substituents based on their physicochemical
properties. A Java applet is included, enabling interactive visual comparison of
different selection methods.
Methods for selecting a representative subset from a pool of available
molecules have long been used, mainly in QSAR analysis and in rational drug
design. Recently, however, these methods have found a broad new application area.
High throughput screening programs in all major pharmaceutical and agrochemical
companies have increased the number of compounds tested each year to a six- and
sometimes even seven-digit number per company. Although this number is still increasing,
the number of potential screening candidates is even larger. This is due to
combinatorial chemistry and automated parallel synthesis which both allow for the
fast synthesis of a huge number of compounds. These techniques are applied also by
commercial compound suppliers, so that an extremely high number of compounds is
available at a relatively moderate price on the market. Consequently, the number of
compounds that can be screened is still small in comparison to the number of
compounds which are available. This is especially true if compounds
that could potentially be synthesized (virtual libraries) are also taken into account.
Therefore an intelligent choice of compounds for screening can considerably increase
the value of the screening results.
Although a diverse selection will not increase the number of active compounds
found in the first screening round, it will prevent redundancy and cover a broader
range of the chemical space. Hits that result from a diverse selection are therefore
much better starting points for lead optimization.
The situation described above has led to the development of various new
strategies enabling selection of a representative subset from a large pool of available
molecules. In this article, various selection strategies are compared using the
selection of a representative subset of organic substituents as an example. This is a
"classical" problem from the early days of QSAR analysis and rational drug design.
Numerous approaches have been used to address it, starting from the classical works
of Craig, Topliss and Hansch et al. These and similar approaches are well
covered in the standard monographs. Recent contributions to this field are based on pattern
recognition, nonlinear mapping and genetic algorithms.
Generation of a Pool of Organic Substituents
To generate a basic pool of small organic substituents, molecules from the NCI
database (237 771 molecules) were "cut into pieces" and monovalent substituents
with up to eight non-hydrogen atoms were collected. To provide chemically
reasonable cleavage rules the cleaved bond had to be single and not in a ring. Also
either a heteroatom or a branched carbon atom was required on at least one side of the
cleaved bond. This procedure yielded 24 125 unique substituents. These were sorted
according to their abundance and the 200 most frequent were selected and used for
this study. All extraction and manipulation of substituents was done by using the in-house molecular development kit written in Java.
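The cleavage rule described above can be sketched as a simple predicate. This is only an illustration and not the in-house Java toolkit: the `Atom` and `Bond` classes are hypothetical, hydrogens are assumed to be implicit, and "branched carbon" is assumed to mean a carbon with at least three non-hydrogen neighbours.

```python
from dataclasses import dataclass


@dataclass
class Atom:
    element: str       # element symbol, e.g. "C", "N", "O"
    heavy_degree: int  # number of non-hydrogen neighbours


@dataclass
class Bond:
    a: Atom
    b: Atom
    order: int         # 1 = single, 2 = double, ...
    in_ring: bool


def qualifies(atom):
    """True for a heteroatom or a branched carbon (assumed: >= 3 heavy neighbours)."""
    if atom.element != "C":
        return True
    return atom.heavy_degree >= 3


def is_cleavable(bond):
    """Single, acyclic bond with a heteroatom or branched carbon on at least one side."""
    return (bond.order == 1
            and not bond.in_ring
            and (qualifies(bond.a) or qualifies(bond.b)))
```

Under these assumptions a plain chain C-C bond is kept intact, while a C-N single bond outside a ring is a valid cut point.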
200 Substituents Used in the Study
Properties of substituents were characterized by their hydrophobicity, electron-accepting power and by their size. Hydrophobic properties were represented by the
octanol-water partition coefficient (logP) and molar refractivity. These
parameters were calculated according to the methodology of Ghose and Crippen
based on the sum of atomic hydrophobicity contributions.
The electron-donating and -withdrawing power of substituents was characterized by theoretical
parameters compatible with the Hammett sigma constants. These are calculated from
simple quantum chemical data according to a methodology developed in-house.
Both methodologies have been tested on a set of more than 300 common organic substituents, and the agreement with experimental substituent parameters was very good.
Steric properties of substituents were
represented simply by their topological size (number of non-hydrogen atoms) and
maximal topological length. All parameters were scaled to have zero mean and unit
variance.
Principal component analysis was applied to reduce the dimensionality of the
problem to enable simple visual examination of physicochemical space covered by
the substituents. The first two principal components capture 86% of the
information contained in the whole dataset. The first component (x-axis on the
images below) represents hydrophobicity and size, and the second component (y-axis) represents electron-donating and -withdrawing properties of substituents.
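The scaling and projection steps can be illustrated with a small standard-library sketch. The descriptor table below is invented purely for illustration, and power iteration is used here only as a simple way to obtain the first principal component; it is not claimed to be the procedure used in the study.

```python
import math


def autoscale(rows):
    """Scale each column to zero mean and unit (population) variance.

    Assumes no column is constant, otherwise the division fails.
    """
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / n) for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def covariance(rows):
    """Covariance matrix of already centred data."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / n for j in range(d)] for i in range(d)]


def leading_component(cov, iterations=500):
    """Power iteration: largest eigenvalue and its unit eigenvector."""
    d = len(cov)
    v = [1.0 / (i + 1) for i in range(d)]  # arbitrary, non-symmetric start vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
    return eigenvalue, v


# Hypothetical table: rows are substituents, columns are properties.
table = [[1.0, 2.0, 5.0], [2.0, 4.0, 3.0], [3.0, 6.0, 8.0], [4.0, 8.0, 1.0]]
scaled = autoscale(table)
eigenvalue, axis = leading_component(covariance(scaled))
# After autoscaling the total variance equals the number of columns.
explained = eigenvalue / len(table[0])
print(f"first component explains {explained:.0%} of the variance")
```

The second component would be obtained the same way after removing the first one from the covariance matrix (deflation); a numerical library would normally be used instead.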
Comparison of Selection Algorithms
As mentioned in the introduction, requirements of high throughput screening and
combinatorial chemistry led to the development of many new selection techniques.
Eight of them are briefly described in the following section. The list of algorithms
presented here is by no means complete and is biased according to interests and needs
of the authors. In particular, algorithms based on clustering and Monte Carlo
techniques are not considered in this comparison.
Random Selection
The simplest selection method is random selection. The only
computational requirement of this approach is the need to check that no
molecule is selected more than once. Although results based on random
selection often look quite reasonable, and there are articles reporting better results for
random selection than for sophisticated selection techniques, the authors think that we
have better options than to be at the mercy of a random number generator.
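As a minimal sketch, sampling without replacement takes care of the uniqueness check automatically. The function name and the `seed` parameter are illustrative choices, not part of the applet.

```python
import random


def random_selection(n_items, k, seed=None):
    """Return k distinct indices drawn uniformly from range(n_items)."""
    rng = random.Random(seed)
    # random.sample draws without replacement, so no index can repeat.
    return rng.sample(range(n_items), k)


picked = random_selection(200, 10, seed=1)
```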
Dissimilarity Selection
Dissimilarity selection is one of the most commonly used methods for the
diversity-based selection of molecules.
The algorithm is fast and internally consistent, i.e. a smaller selection is always a subset of a larger one.
The selection process is organized in such a way that the next compound to be selected is always
as distant as possible from the
already selected molecules. In principle, the algorithm always takes the compound that lies in
the largest unselected void of the compound set. In high-dimensional spaces the
algorithm prefers the selection of outliers (compounds at the edge of the property space).
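One common way to realize this idea is the MaxMin rule: each step picks the point whose distance to its nearest already selected point is largest. The sketch below assumes this rule and an arbitrary, caller-supplied starting point; the text does not specify how the first compound is chosen.

```python
import math


def maxmin_selection(points, k, start=0):
    """Greedy dissimilarity selection of k point indices (MaxMin rule)."""
    selected = [start]
    # nearest[i] = distance from point i to its closest selected point
    nearest = [math.dist(p, points[start]) for p in points]
    while len(selected) < k:
        nxt = max(range(len(points)), key=lambda i: nearest[i])
        selected.append(nxt)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], math.dist(p, points[nxt]))
    return selected


points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
print(maxmin_selection(points, 3))  # the 2-point selection is a prefix of this one
```

Because the selection is built one point at a time from the same start, a smaller selection is always a prefix of a larger one, which is the internal consistency mentioned above.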
Sphere Exclusion Method
The sphere exclusion algorithm implemented here is based on the work of
Hudson et al. and Wooton et al. First the radius of an exclusion sphere is
defined. The selection starts with the molecule having the largest sum of distances
from all other molecules. Then all molecules that are closer to the selected
molecule(s) than the exclusion radius are deleted from the list. A new molecule is
selected which is most distant from any already selected molecule. The whole process
is repeated until no more molecules remain. The selection starts in the edge regions of
property space and then moves to the center. The number of selected molecules
depends on the exclusion radius, so when a predefined number of points is needed the
whole procedure must be iteratively repeated with changing exclusion radius. (In the
applet we have predefined the exclusion radii that provide 10, 20, 40 and 80 molecules to
enable comparison with other methods.)
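The steps above can be sketched as follows. This is one reading of the description, in which "most distant from any already selected molecule" is taken to mean the largest distance to the nearest selected molecule; the function name and the test points are illustrative.

```python
import math


def sphere_exclusion(points, radius):
    """Select point indices so that no two selected points lie within `radius`."""
    remaining = list(range(len(points)))
    # Start with the point having the largest sum of distances to all others.
    current = max(remaining, key=lambda i: sum(math.dist(points[i], q) for q in points))
    selected = []
    while True:
        selected.append(current)
        # Delete the selected point and everything inside its exclusion sphere.
        remaining = [i for i in remaining
                     if i != current and math.dist(points[i], points[current]) >= radius]
        if not remaining:
            return selected
        # Next: the remaining point farthest from its nearest selected point.
        current = max(remaining,
                      key=lambda i: min(math.dist(points[i], points[s]) for s in selected))


points = [(0.0, 0.0), (0.5, 0.0), (5.0, 0.0), (5.4, 0.0), (10.0, 0.0)]
print(sphere_exclusion(points, radius=1.0))
```

To obtain a predefined number of points, the function would be called repeatedly with different radii until the length of the result matches the target.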
Most Descriptive Compound Method
The aim of this algorithm is to select a subset of compounds which most
effectively represents the compounds in the original population. The information value
of a compound is evaluated as a sum of reciprocal values of ranks of distances to other compounds.
This is calculated in the following way: the Euclidean distances from molecule 1 to all other molecules are calculated, the distances are ranked, and the reciprocal values of the ranks are added to the information value of each molecule (e.g. 1/2 for the closest neighbor, 1/3 for the second closest, etc.).
After processing all molecules the compound with the largest sum (most
descriptive compound) is selected. The whole procedure is repeated until the required
number of compounds has been selected. This approach ensures that the denser
regions of property space are well represented.
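The scoring step can be sketched as below. Only a single round is shown, because the text does not spell out how the information values are updated between rounds. Following the example in the text, the reference molecule itself is assumed to hold rank 1, so its closest neighbour receives 1/2.

```python
import math


def information_values(points):
    """Sum of reciprocal distance ranks received by each point from all others."""
    scores = [0.0] * len(points)
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        # Closest neighbour gets rank 2 (adds 1/2), the next rank 3 (adds 1/3), ...
        for rank, j in enumerate(others, start=2):
            scores[j] += 1.0 / rank
    return scores


def most_descriptive(points):
    """Index of the point with the largest information value."""
    scores = information_values(points)
    return max(range(len(points)), key=lambda j: scores[j])


points = [(0.0,), (1.0,), (2.5,), (10.0,)]
print(most_descriptive(points))
```

In this toy example the isolated point at 10.0 is always ranked last by the others and therefore gets the lowest score, while a point inside the dense region wins.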
Selection by Partitioning
The partition technique is based on the partition of the property space into a set
of multi-dimensional cells, by dividing each axis into several
bins. A representative compound is then selected from each occupied cell. This is
done mostly according to the distance from the center of the cell, but other
criteria (e.g. cost or synthetic feasibility) are also possible. The partition selection method is
intuitive, very fast and easy to implement, but because the
number of cells grows explosively with dimensionality it is applicable only to low-dimensional representations of the property
space. Another disadvantage of the method is that the number of
molecules to be selected cannot be set in advance. It depends not only on the number of
bins, but also on the distribution of points in space (edge cells are usually empty).
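A minimal sketch of the partitioning idea, assuming equal-width bins spanning the observed range of each axis and the distance to the cell centre as the only selection criterion:

```python
import math


def partition_selection(points, bins):
    """Pick, from every occupied cell, the point closest to the cell centre."""
    dims = range(len(points[0]))
    lows = [min(p[d] for p in points) for d in dims]
    # A zero-width axis (all values equal) falls back to width 1.0.
    widths = [(max(p[d] for p in points) - lows[d]) / bins or 1.0 for d in dims]
    best = {}  # cell index tuple -> (distance to cell centre, point index)
    for i, p in enumerate(points):
        cell = tuple(min(int((p[d] - lows[d]) / widths[d]), bins - 1) for d in dims)
        centre = [lows[d] + (cell[d] + 0.5) * widths[d] for d in dims]
        candidate = (math.dist(p, centre), i)
        if cell not in best or candidate < best[cell]:
            best[cell] = candidate
    return sorted(i for _, i in best.values())


points = [(0.0, 0.0), (0.1, 0.1), (1.0, 1.0), (0.9, 0.2)]
print(partition_selection(points, bins=2))
```

With two bins per axis there are four cells, but only three of them are occupied here, so three points are returned. This illustrates why the size of the selection cannot be fixed in advance.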
Similarity Stress Method
The similarity stress method developed in-house defines a pairwise stress function that increases sharply with increasing similarity of two molecules.
The objective of the method is to select a subset of molecules in such a way that the total stress function becomes minimal.
The method starts by selecting a random subset of molecules.
Then it cycles through the set by adding a candidate, computing the stress increment incurred by each molecule and removing the worst of the set, i.e. the one with the largest stress increment.
This process is continued until a full cycle through the set of nonselected molecules does not yield any improvement.
In practice, five to ten optimization cycles are sufficient.
Two features of the method are worth noting: Firstly, because the stress function is a sum of pairwise components, the procedure can be implemented by incrementally updating and removing stress increments as a new molecule is added or removed, which greatly reduces the number of operations required.
Secondly, the stress function can also contain a factor that increases with the complexity of a molecule, its cost, or some other measure of its value.
This would drive the selection towards lead compounds that are cheaper and easier to synthesize.
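The optimization loop can be sketched as follows. The exact stress function is not given in the text, so an inverse squared distance is assumed here purely for illustration, and the stress values are recomputed from scratch instead of being updated incrementally, which is slower but easier to follow.

```python
import math
import random


def pair_stress(p, q, eps=1e-6):
    """Assumed form: grows sharply as two points approach each other."""
    return 1.0 / (math.dist(p, q) ** 2 + eps)


def total_stress(points, subset):
    """Sum of pairwise stress over all pairs in the subset (a list of indices)."""
    return sum(pair_stress(points[i], points[j])
               for a, i in enumerate(subset) for j in subset[a + 1:])


def stress_selection(points, k, seed=0, max_cycles=10):
    rng = random.Random(seed)
    selected = rng.sample(range(len(points)), k)  # random starting subset
    for _ in range(max_cycles):
        improved = False
        for cand in range(len(points)):
            if cand in selected:
                continue
            trial = selected + [cand]
            # Stress carried by each member of the enlarged set.
            load = {i: sum(pair_stress(points[i], points[j]) for j in trial if j != i)
                    for i in trial}
            worst = max(trial, key=load.get)
            # Swap only if this strictly lowers the total stress.
            if load[worst] > load[cand]:
                selected = [i for i in trial if i != worst]
                improved = True
        if not improved:
            break
    return selected


points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0),
          (5.1, 5.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
result = stress_selection(points, 4, seed=0)
```

A cost or complexity factor, as mentioned above, could be added by multiplying or offsetting `pair_stress` with a per-molecule weight.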
D-Optimal Design
Statistical designs like the D-optimal design have been developed to select an
optimal set of points for model building. The D-optimal design selects points so
that potential errors in descriptors have minimal impact on an assumed (usually linear)
model. Possible model errors are not accounted for.
Therefore, the selected points will always be on the outer surface of the space
occupied by the candidates. A close distance between any two points does not decrease
the D-optimal quality of the design. In chemistry, however, a linear model
is often not appropriate, and D-optimal design is then not the best choice.
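For a linear model with an intercept, the D-optimality criterion is the determinant of X^T X, where the rows of X are the selected points. The sketch below maximizes it by brute-force enumeration, which is feasible only for very small pools; practical implementations use exchange algorithms instead. Function names and the toy data are illustrative.

```python
from itertools import combinations


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n, det = len(m), 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        if abs(m[pivot][c]) < 1e-12:
            return 0.0
        if pivot != c:
            m[c], m[pivot] = m[pivot], m[c]
            det = -det
        det *= m[c][c]
        for r in range(c + 1, n):
            factor = m[r][c] / m[c][c]
            for k in range(c, n):
                m[r][k] -= factor * m[c][k]
    return det


def d_criterion(points, subset):
    """det(X^T X) for a linear model with intercept on the chosen points."""
    rows = [[1.0] + list(points[i]) for i in subset]
    p = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(p)] for a in range(p)]
    return determinant(xtx)


def d_optimal(points, k):
    """Exhaustive search for the k-subset with the largest D-criterion."""
    return max(combinations(range(len(points)), k),
               key=lambda s: d_criterion(points, s))


points = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,)]
print(d_optimal(points, 2), d_optimal(points, 4))
```

On this one-dimensional toy pool the four-point design consists of the two points at each end and skips the centre, which illustrates both the preference for the outer surface and the fact that close neighbours are not penalized.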
Local D-Optimal Design
The local D-optimal design was developed in-house to overcome the above
problems of D-optimal design. In the LD-optimal design, each selected point is evaluated
by the D-optimal design criterion relative to its m nearest neighbors where m is equal
to the dimensionality of the descriptor space plus one. The sum of these D-optimal
design scores is used as the quality criterion for the LD-optimal design.
The algorithm therefore
selects points which are far apart from each other and represent space better than points selected by the D-optimal design.
Interactive Comparison of Selection Algorithms
To enable a visual comparison of the selection algorithms, as well as an analysis of
the substituents selected by different methods, a Selector applet was created. The applet displays the distribution of
substituents in physicochemical space. The x-axis represents substituent
hydrophobicity and size, and the y-axis the electron-donating and -withdrawing
properties. After clicking on a point in the graph, the structure of the respective
substituent is shown. A simple menu incorporated in the HTML page enables
interaction with the applet, i.e. choice of the selection method and the number of
points to select. Two selection modes are possible. The fast mode displays results
immediately, while the slow mode shows the progress of the selection process by
displaying selected points and related substituents one by one at intervals of about 1
second (this option is available only for sequential algorithms). By pressing the button it is possible to display
the selected substituents on a separate page.
Try Various Selection Algorithms with this Interactive Selector Applet
Click on a point to see the corresponding substituent.
The results of various algorithms for the selection of an optimal subset
of molecules were compared. The goal of the article was not to identify the
"best" selection method. This would not even be possible, because the choice of the proper
methodology always depends on the particular problem. The goal of this work was to provide
readers with the possibility to "play" with various selection techniques and thereby
become more familiar with this important part of cheminformatics. We also wanted to
illustrate the advantages of an interactive electronic journal.
The results of the algorithms are visualized using the selection of a
representative subset of organic substituents as a test case. The possible applications of
such a subset are numerous. Applications in the area of combinatorial chemistry
have already been discussed. A little overshadowed by this booming field is the classical
use of organic substituents in QSAR studies and lead optimization. Another
interesting application field for a system enabling selection of substituents from a
chosen area in property space is the automatic design of molecules with desired
physicochemical properties, either as a mimic of a known lead, or based on a QSAR
model. Such a procedure using genetic algorithms is under development at Novartis.
1. Pleiss, M; Comprehensive Medicinal Chemistry 1990, 561
2. Agrafiotis, D; Encyclopedia of Computational Chemistry 1998
3. Craig, P; Journal of Medicinal Chemistry 1971, 14, 680
4. Topliss, J; Journal of Medicinal Chemistry 1972, 15, 1006
5. Hansch, C; Journal of Medicinal Chemistry 1973, 16, 1217
6. Hansch, C; Exploring QSAR, Fundamentals and Applications in Chemistry and Biology 1995
7. Hansch, C; Exploring QSAR, Hydrophobic, Electronic and Steric Constants 1995
8. Plummer, E; Reviews in Computational Chemistry 1990, 119
9. Van Der Waterbeemd, H; Journal of Computer-Aided Molecular Design 1989, 3, 111
10. Domine, D; Journal of Medicinal Chemistry 1994, 37, 973
11. Domine, D; Journal of Medicinal Chemistry 1994, 37, 981
12. Devillers, J; Computer-Assisted Lead Finding and Optimization 1997, 109
13. Domine, D; Quantitative Structure-Activity Relationships 1996, 15, 395
14. Anon; http://dtp.nci.nih.gov/docs/dtp_data.html
15. Viswanadhan, V; Journal of Chemical Information and Computer Sciences 1989, 29, 163
16. Ertl, P; Quantitative Structure-Activity Relationships 1997, 16, 377
17. Ertl, P; Proceedings of the 13th Conference on QSAR, to be published 1999
18. Lajiness, M; Perspectives in Drug Discovery and Design 1997, 7(8), 65
19. Hudson, B; Quantitative Structure-Activity Relationships 1996, 15, 285
20. Wooton, R; Journal of Medicinal Chemistry 1975, 18, 607
21. Mason, J; Perspectives in Drug Discovery and Design 1997, 7(8), 85
22. Mitchell, T; Technometrics 1974, 16(203), 211
23. Higgs, R; Journal of Chemical Information and Computer Sciences 1997, 37, 861