Treasure Trove of MathematicsSMILES CHEMICAL REACTION DATABASE LINKSWELCOME
The SMILES Chemical Reaction Database is a set of files containing structural information about pairs of
Sheet make_na server
reactant(s) and product(s) of two mil ion different chemical reactions. The simplified molecular-input line-entry
system (SMILES) of representing molecular structures is used to represent molecular connectivity and
stereochemical relationships as strings of characters, and indeed chemical reactions as wel . These SMILES string
representations inspired the creation of machine learning computer programs that learn the input/output relationship
Legal (c) Notice
that exists between reactant space
and product space
, using novel string transformation algorithms (implemented
within the book A New Kind of Chemistry (c) 2012
, scheduled to be released in the Fal of 2012 on Amazon.com,
using the Mathematica programming language).
Applications: Chemical Reaction Outcome Prediction, QSARs and Retrosynthetic Analysis.
Gene Therapy Net
As a demonstration of the use of SMILES strings to represent the connectivity and steric geometry of chemical
structures and reactions, and the utility of the machine learning technique, consider the fol owing two verified results
which were correctly predicted by a mathematical model derived from a dataset of 100,000 reactions (of which
these two reactions were excluded) possessing reactant profiles (structural and stoichiometric) somewhat similar
(very similar cases were excluded for purposes of testing) to each of the novel test cases:
Genes and Disease
Treasure Trove of
The Gamma Function OUR
The Gamma Function
Of course, the machine learning technique is equal y applicable to retrosynthetic analysis - having a target product in mind,
one is able to predict the structure of successful starting materials for the prior synthetic step. Many tentative starting
materials, or leads, for a synthetic step can be obtained by computing different predictive models, themselves obtained by
basing each of the new models on different subsets of the database. Such subsets can be chosen on some selection
criteria, or randomly, but in this case each training subset must be entirely composed of reactions having unique sets of
reactants to avoid multivalued data.
Reaction prediction is a one-to-one (1:1) relationship whereas retrosynthetic analysis concerns a one-to-many (1: M)
relationship. In the case of retrosynthetic analysis, this situation is dealt with by decreasing the size of the training data set to
the point where the resulting model makes incorrect suggestions a good fraction of the time. Having not incorporated a
significant amount (and possibly type) of knowledge from the database, the model has room to get creative sort of speak.
Yet by subsequently running the results through a wel -trained reaction prediction model, we borrow back definitiveness, and
thereby confirm whether the suggested reactions are feasible or not.
Machine learning of chemical reactions can be distinguished from the more orthodox approaches in three very important
ways: First, the work is entirely non-reductionist, explaining chemical reactivity not as the result of the behaviors of the
constituent subatomic particles, but rather as the result of higher mathematical conservation laws.
To understand why conservation laws, which represent mathematical symmetries, are used consider any set of non-col inear
data points in the Cartesian plane. The number of possible curves which could pass through those data points is infinite. It is
highly presumptuous and almost certainly in error to naively assume that a smooth curve connecting the data points would
represent the intermediate points correctly given an arbitrary curvy data set. Data fitting, which in essence even includes
techniques such as neural networks, in and of itself simply cannot be used to generalize data generical y. The fact remains
that at least one condition must be applied to the curve which would distinguish the curve as the
solution. And this requires
prior knowledge of a model. Data fitting, in any form, is only properly used to tweak the parameters of a model, not to derive
a model. This is a very common oversight that plagues much research in the field of computational intel igence.
In this work, we instead search for what is mathematical y conserved to within a proportionality factor. The mathematical
conservation law H is isomorphic to the linear relationship y=bx, such that H(m(D ))=H(m(D )) where the D are empirical
data points, 1 is a proportionality factor and m(*) is the chemical metric
. Given that the space is discrete and finite, we may
legitimately conclude, under the conditions of a sufficiently simple function H, sufficiently large i, and a wel -chosen metric,
that a mathematical conservation law has been determined, and that the values of the novel points [H(m(d )),H(m(d ))]
between the empirical points [H(m(D )),H(m(D ))] also lie along the straight line connecting the empirical points. The map
can then be considered completed and the d can be numerical y solved for. The whole point of linearization is that there
are aleph-2 possible different curves, a bigger infinity than that of the set of real numbers, aleph-1. But the set of linear rays
bound to a particular point is aleph-1, depending only upon the real value of .
H is searched for through a process of evolution. Random functional forms are generated, put through rounds of crossover,
mutation, simplification and selection. Both task performance and functional simplicity are applied as selective pressures.
Simplicity is sought such that we find true conservation functions. An unreasonable effectiveness of the function at task
completion is the goal.
When we apply our mathematical model-building technology to the mathematical analogue of the SMILES Reaction
Database or any subset thereof, we are applying the very same logic to a subset of chemical space
- the discrete space of
al molecular structures.
The second distinguishing factor is that the high-level mathematical conservation laws we use to predict reactions are based directly
* Experimental reaction data - the reaction database stores two mil ion reaction strings.
* Unique string representations of chemical graphs -- SMILES.
* Unique, uniformly-sized, order-dependent and reversible mathematical representations of strings as the product of
matrix (non-commutative) multiplication using a character-to-matrix substitution.
* Data splicing - defined as data fusion through the discovery of mathematical conservation laws.
* Evolution of simplest possible function H is key.
* H is a scalar function, while m is a matrix function.
* The functional form of H is dependent upon the functional form of m, the value of and the Di,k.
* Chemical metric - a scalar-valued matrix function based on an advanced theory of prototypicality.
Since the strings are represented by matrices while m(*) is a scalar, we are essential y assigning multidimensional data
points to points on the real line. This does not
lead to the assignment of more than one multidimensional data point to a
single point on the real line. In fact the size of the infinity representing al of the points in the plane and the size of the infinity
representing al of the points on the real line are the same. Thus unique assignments of al n-dim data points to points on the
real line are possible, which is provable. Take a point on a two-dimensional plane (x,y). We can take the digits which we
would use to write down x and y and simply interleave them. This interleaving technique results in a real number for every
possible point, and no two points on the plane map to the same number. This same argument can be extended to any
number of dimensions, as long as we have a finite number of dimensions. The concept of dimension has no effect on the
size or cardinality of an infinite space; dimensions are cardinal y meaningless. Yet here we are dealing with a discrete
hypervolume, a countable infinity if the whole volume is considered, but in this case - a very large finite number. The total
number of possible smal organic molecules alone that populate 'chemical space' has been estimated to exceed 1060.
Reaction space is thus unfathomably large, yet finite.
The third distinguishing factor is that the machine learning technique is both more definitive, more efficient and more capable
than the traditional approaches when applied to chemical reaction questions. For example, traditional quantum reactive
scattering calculations are typical y limited to reactions involving less than six atoms to within any degree of accuracy.
Reactive scattering problems involving more than six atoms become effectively intractable due to the combinatoric increases
in the number of operations that must be performed on the mathematical objects inherited from quantum theory to get at a
String transformations have many valuable applications in mathematics and physics as wel (for example, the formal
technique known as term rewriting
is used in the field of computer algebra systems).ABOUT THE SMILES REACTION DATABASE
In 2007, rapid work at TTM began on the assemblage of a human-reviewed chemical reaction database, soon after the
development of the supporting image knowledge-extraction and spidering software was final y achieved. The SMILES
Reaction Database is now 186.8 MB in size, and it contains two mil ion reactant-product pairs extracted from thousands of
respected journals and patents, contained in six files. The reaction data entries in each file of the database occur on
consecutive lines of the file, which are delineated by newline characters.
OBTAINING THE SMILES REACTION DATABASE
Pricing Information Sheet
Purchase an immediate download:
1 2 3 4 5 6 7 8 9 10
(Select purchasing option)
You may download a maximum of three times, so please save your files to a removable disc and store it safely.
Questions or Comments? Email Us
SMILES Reaction Database is Copyrighted (c) 2012 by Treasure Trove of Mathematics. All rights are reserved worldwide.