Journal of Mathematical Biology manuscript No.
(will be inserted by the editor)
Calvin Ahlbrandt1 · Gary Benson2 · William Casey3
Minimal Entropy Probability Paths Between Genome Families
the date of receipt and acceptance should be inserted later
Composed: S A T U R D A Y 2 5 J A N U A R Y 2 0 0 3
Abstract. We develop a metric for probability distributions with applications to biological sequence analysis. Our distance
metric is obtained by minimizing a functional defined on the class of paths over probability measures on N categories. The
underlying mathematical theory is connected to a constrained problem in the calculus of variations. The solution presented is
a numerical solution, which approximates the true solution in a set of cases called rich paths where none of the components of
the path is zero. The functional to be minimized is motivated by entropy considerations, reflecting the idea that nature might
efficiently carry out mutations of genome sequences in such a way that the increase in entropy involved in transformation
is as small as possible. We characterize sequences by frequency profiles or probability vectors, in the case of DNA where
N is 4 and the components of the probability vector are the frequency of occurrence of each of the bases A, C, G and T.
Given two probability vectors a and b, we define a distance function based as the infimum of path integrals of the entropy
function H(p) over all admissible paths p(t), 0 ≤ t ≤ 1, with p(t) a probability vector such that p(0) = a and p(1) = b.
If the probability paths p(t) are parameterized as y(s) in terms of arc length s and the optimal path is smooth with arc
length L, then smooth and “rich” optimal probability paths may be numerically estimated by a hybrid method of iterating
Newton’s method on solutions of a two point boundary value problem, with unknown distance L between the abscissas,
for the Euler–Lagrange equations resulting from a multiplier rule for the constrained optimization problem together with
linear regression to improve the arc length estimate L. Matlab code for these numerical methods is provided which works
only for “rich” optimal probability vectors. These methods motivate a definition of an elementary distance function which is
easier and faster to calculate, works on non–rich vectors, does not involve variational theory and does not involve differential
equations, but is a better approximation of the minimal entropy path distance than the distance ||b − a||2. We compute
minimal entropy distance matrices for examples of DNA myostatin genes and amino-acid sequences across several species.
Output tree dendograms for our minimal entropy metric are compared with dendograms based on BLAST and BLAST
identity scores.
1. Introduction
Sequence alignment and analysis are among the most important and widely used techniques in computational
molecular–biology. The realm of problems for which sequence analysis plays a part is extensive, but includes homol-
ogy searching, gene finding, detection of open reading frames, protein structural/functional prediction, population
and species structure inference.
One of the most commonly used techniques for sequence alignment is the Smith–Waterman dynamic program-
ming algorithm for local alignment. Designed to detect highly conserved fragments in two sequences, it is highly
Calvin
Ahlbrandt:
Department
of
Mathematics,
University
of
Missouri,
Columbia,
MO
65211–0001,
calvin@math.missouri.edu
Gary Benson: Department of Biomathematical Sciences, Mount Sinai School of Medicine,1 Gustave L. Levy Place,New
York. NY 10029,benson@ecology.biomath.mssm.edu
William Casey: Courant Institute, New York University, 251 Mercer St, NYC, NY-10012, wcasey@cims.nyu.edu
Key words: ACGT sequences – entropy – probability vectors– probability paths – distance between genome families –
constrained variational problems – Euler-Lagrange multiplier rules.
Mathematics Subject Classification (1991): 92B05, 92D20
2
Ahlbrandt, Benson & Casey
informative but in general it has difficulty identifying two sequences which share a sufficiently long fragment with
more than 70% similarity ([3]). This problem has led to innovative ideas including ‘n-mers’ and randomized hashing
techniques ([9]). Local Sequence Alignment algorithms employing dynamical programming are similarity measures
but not distance metrics. Global Sequence Alignment can use either similarity or distance measures. At the cen-
ter of a similarity score is the substitution matrix, gap opening penalty, and gap extension penalty. A statistical
theory relating the structure of the substitution matrix to identification of similarity in sequences is developed for
un–gaped alignment ( gap penalty = ∞ ). (See [8,1].)
In addition to dynamic programming alignment algorithms several similarity scores and metrics, which are
independent of the order of characters, are in common use. These scores and distances may depend on frequency
profile, size of sequence or a combination of both.
The technique developed in this paper is a distance metric on frequency profiles. This allows us to compare
sequences based on their composition as a whole, rather than by sequence alignment.
Alignment between two profiles requires a function for comparing distributions. Such a scoring function F , given
two probability distributions d1 and d2, returns a measure of the similarity or distance between them. Functions F
are called divergence measures. Preferred functions will take into account the ‘conserved nature’ of the distributions.
A distribution is least conserved when each of the t possible alphabet characters occurs with frequency 1/t and
more conserved when one or two alphabet characters dominate. Two distributions should be judged closer if they
share the same conserved characters than otherwise. A simple example of a function which disregards conserved
nature is variational distance, the sum of differences in frequency for each letter in the distributions.
A variety of entropy related functions have been used as divergence measures. Examples are the relative entropy
and the symmetric relative entropy [10,12]. These measures are excellent for detecting conserved nature, but are
undefined (divide-by-zero) when one distribution has a character that the other lacks. Related functions include
the K, L and Jensen-Shannon divergence measures defined by Lin [12] and similar measures defined by Wong et
al. [19]. These are relative entropy measures which compute similarity to an average or weighted average distribution
and thereby avoid the divide-by-zero problem. But, these functions are extremely sensitive to the location of the
average distribution relative to those being tested and can produce a range of similarity measures which span
several orders of magnitude making them unsuitable for use in alignment algorithms.
A function commonly used in multiple alignment algorithms is the sum-of-pairs, based on the product of
matching and mismatching letter frequencies. Related measures have been used by Cavalli-Sforza [6] and Nei [13].
In the case of DNA, where nucleotide bases are generally scored as either a match or a mismatch, this scheme
inappropriately overemphasizes the mismatches and does not give consistent scores when distributions are identical
(for example if both distributions are (A,C,G,T)= (1.0, 0, 0, 0) the sum-of-pairs score is much better than if both
distributions are (A,C,G,T)= (0.5, 0, 0.5, 0)).
We define a divergence measure d(a, b) based on the shortest path between a pair of probability vectors a and b
with the path constrained to lie on the entropy surface. A simplified elementary measure dG(a, b) is also defined. It
can be computed in a case free manner whereas, thus far, we are only able to use our numerical package for solving
the Euler–Lagrange equation in order to compute d(a, b) whose optimal probability paths are smooth and have no
zero components in their probability vectors. For probability paths with monotone components, it is shown that
the distance function dG may be used to find a lower bound for the more intuitive distance function d. These new
distance functions do not base the comparison on an average distribution and give intuitively appealing distances
between distributions.
The paper is organized as follows: Section 2, “Entropy distance of a path”, defines the objective functional and
argument class as the set of paths in the space of measures on N categories. It is shown that the functional defines
a metric satisfying the triangle inequality. Section 3, “Elementary distance”, explores the relation of the defined
distance metric to the L2 norm. Section 4, “Computing the minimal entropy distance”, gives a formal argument of
how the function minimization problem may be reformulated as a dynamical system. The features of this section
are the formulation of a Lagrange–Multiplier rule, consistency and norm conditions, definition of the ROPI class
of matrices. This section defines how a numerical procedure may commence for rich probability vectors. Section 5,
“The deflated problem”, discusses how the L1 norm constraint may reduce the dimensionality of the problem in
implementation. A full implementation and numerical method is presented. Section 6, “Myostatin gene and protein
Minimal Entropy Probability Paths Between Genome Families
3
examples”, presents the results of the new divergence method on a family of myostatin genes. Section 7, “Example
N = 2”, verifies the optimality of a given function is the lowest dimensional case. Section 8, “Software description
and links”, gives access to numerical code which was used in the numerical examples.
2. Entropy distance of a path.
For p an N × 1 vector p = (pi) with N ≥ 2, then p is said to be a probability vector if pi ≥ 0 for i = 1, . . . , N and
N
p
i=1
i = 1. The components pi will be called the state variables. An entropy function H (p) is defined on the set
of probability vectors by
N
H(p) =
pi log2(1/pi),
(1)
i=1
where pi log2(1/pi) is defined to be 0 if pi = 0, since that is its limit as pi → 0 + . This makes H a nonnegative
continuous function of p. A probability vector p is pure if some pi = 1, i.e., p is an elementary vector. A probability
vector p will be called rich if no pi takes on the value of 0, i.e., p is a positive probability vector. Since we assumed
N ≥ 2, we know that H(p) is positive if and only if p is not pure, i.e., only pure vectors have zero entropy. By the
Lagrange multiplier rule, H(p) has maximum on the set of probability vectors p of log2(N) if and only if p is the
uniform probability distribution of pi := 1/N. Thus H(p) is a measure of disorder from a pure probability vector,
where the entropy is 0, to a maximum which occurs only for the uniform probability distribution.
Let a and b be probability vectors in RN . A probability path p(t) is an N × 1 probability vector valued function
defined for t ∈ [0, 1]. A probability path p(t) is said to be admissible if p(0) = a, p(1) = b, and p(t) is piecewise
smooth on [0, 1], i.e., p is continuously differentiable or there exist t1, . . . , tk with p continuous on [0, 1] and of
class C1 on each [ti−1, ti]. Note that some component of p could have different finite one–sided derivatives at one
or more ti, i.e., a component of p could have “corners”.
Let us label the class of admissible probability paths from a to b as Pa b. Define a functional E(p) on this
admissible class Pa b by
1
E(p) =
H(p(t)) ||p (t)||2 dt.
(2)
0
For motivation, see the calculus formulation of line integrals along curves given in [17, Theorem 1, p. 974]. Our
formulation entails a cost based not only on the path length but is weighted by the entropy along the path. A
similar measure was described in [4]. The variational principle which we are formulating is that the most efficient
transition between probability state vectors is the probability path which minimizes the line integral of the entropy
function.
We defined a distance function d on pairs of probability vectors a, b in RN as the infimum of E(p) on Pa b,
i.e.,
d(a, b) = infp∈P
E(p).
(3)
a b
In order to establish that d is a metric we define a scalar function h(u) for u in [0, 1] by
h(0) = 0,
and h(u) = u log2(1/u), 0 < u ≤ 1.
(4)
Then for u ∈ (0, 1] we have h(u) = −u log2(u) = u(− ln u)/(ln 2). Replace u by 1 − x and use the Maclaurin series
of f (x) := − ln(1 − x), valid for |x| < 1, which gives
∞
− ln u = − ln(1 − x) =
xn/n.
n=1
Thus we have the series representation
1
∞ u(1 − u)n
h(u) =
,
valid for 0 ≤ u ≤ 1.
(5)
ln 2
n
n=1
4
Ahlbrandt, Benson & Casey
Let PN be the set of N × 1 probability vectors.
Proposition 1 The distance function d : PN × PN → R is a metric.
Proof: If a and b are probability vectors in RN , then d(a, b) is nonnegative since H is a nonnegative function.
Also, d(a, a) = 0 since E(p) = 0 for p(t) := a on [0, 1].
Before we establish that a = b implies d(a, b) > 0, note that if p is admissible and joins a to b, then
q(t) := p(1 − t) has q(0) = b, q(1) = a, q is admissible, and E(p) = E(q). Take the infimum over all p for
d(a, b) = d(b, a), i.e., the distance is symmetric.
For the triangle inequality, let a, b, and c be probability vectors in RN . Consider admissible paths p ∈ Pa b
and q ∈ Pb c and define a path r ∈ Pa c by
r(t) = p(2t) if t ∈ [0, .5] and r(t) = q(2(t − .5)) if t ∈ [.5, 1].
(6)
Then
.5
1
d(a, b) ≤ E(r) =
H(p(2t))||2p (2t)||2 dt +
H(q(2(t − .5)))||2q (2(t − .5))||2 dt = E(p) + E(q)
0
.5
and we have d(a, b) ≤ E(p) + E(q) for all p ∈ Pa b and all q ∈ Pb c. Now take infimums over Pa b and Pb c to
establish the triangle inequality.
We now complete the proof that d is a metric by showing that d satisfies d(a, b) > 0 if a = b. Suppose a and
b are distinct probability vectors in RN . Then there exists an index k such that ak = bk. Assume without loss of
generality (by symmetry of d) that ak < bk. Let p be in Pa b. Then
1
1
E(p) =
H(p(t))||p (t)||2 dt ≥
h(pk(t))||p (t)||2 dt
0
0
1
1
bk
≥
h(pk(t))|pk (t)| dt ≥
h(pk(t))pk (t) dt =
h(u) du
(7)
0
0
ak
Since 0 ≤ ak < bk ≤ 1 and equation (5) implies the lower bound h(u) ≥ u(1 − u)/(ln 2), for 0 ≤ u ≤ 1, we know
that
1
bk
E(p) ≥
u(1 − u) du > 0.
ln 2 ak
Since E(p) is bounded away from 0 and that bound depends only upon ak and bk and is therefore independent of
the choice of path p between a and b, then the infimum over Pa b is bounded away from 0 and d(a, b) > 0.
3. An elementary distance
We use the above calculation to motivate the definition of an elementary distance function. Modify the above
argument as follows. Suppose that a and b are probability vectors in RN and p is an admissible probability path
joining a to b. Then the norm bound on RN of ||x||2 ≥ N−1/2||x||1 [16, p. 170] and h defined by equation (4) imply
1 N
E(p) =
h(pk(t))||p (t)||2 dt
0 k=1
1
1
N
N
≥ √
h(pk(t))
|p
N
k (t)|
dt
0
k=1
k=1
1
1
N
1
N
1
≥ √
h(pk(t))|p
√
h(pk(t))p
N
k (t)|
dt ≥
k (t) dt
0
N
k=1
k=1 0
1
N
bk
= √
h(u) du .
(8)
N k=1 ak
Minimal Entropy Probability Paths Between Genome Families
5
Let g(u) := (ln 2)h(u) = u ln(1/u) = −u ln u for 0 < u ≤ 1 and g(0):=0. Choose the antiderivative
G(u) = u2((1/2) − ln u)/2
for
0 < u ≤ 1 and
G(0) = 0.
(9)
Define a distance dG(a, b) by
√2 N
dG(a, b) :=
|G(b
ln 2
k) − G(ak)|.
(10)
k=1
√
The factor of
2 in the numerator was an arbitrary choice made to improve the comparison with d(a, b) when
N is 2. Note that as for d, the above dG has the property that if M additional zeros are appended after the last
components of a and b to produce vectors ˆ
a and ˆ
b in RN+M , then dG(ˆa, ˆb) = dG(a, b). We now compare dG(a, b)
to E(p) for p ∈ Pa b with monotone components.
Proposition 2 Suppose a and b are joined by an admissible p which has monotone components. Then
√
dG(a, b) ≤ ( 2N)E(p).
(11)
Proof: If p (t) is of constant sign, then
k
1
1
bk
g(pk(t))|pk (t)| dt = (±1)
g(pk(t))pk (t) dt = (±1)
g(u) du = |G(bk) − G(ak)|.
0
0
ak
Examples show that minimizing paths need not have monotone components if N > 2, although they do for N = 2.
Before discussing how we estimate d(a, b), we give an example.
Example 3 Suppose that N ≥ 2 and ej, for j = 1, . . . , N, is the elementary vector [δi,j] and cN is the centroid
of the simplex PN , i.e., cN = 1/N . . . 1/N T . If i = j, then an estimate for d(ei, ej) is 1.0201 and estimates for
d(cN , ej) are 0.51006 for N = 2; 0.9125 for N = 3; and 1.2020 for N = 4.
Remark 4 If a and b are near the centroid cN of PN , then d(a, b) is approximated by (log2 N)||b − a||2 since in
a neighborhood of cN , the entropy function has a maximum of log2 N and ||b − a||2 is an approximation of the arc
length of the minimizing curve.
4. Computing the minimal entropy distance
The variational problem as stated requires the paths p(t) to satisfy the N + 1 constraints
N
pi(t) = 1 and for each i = 1, . . . , N, pi(t) ≥ 0 on [0, 1].
(12)
i=1
In order to formulate the minimal entropy problem as a constrained variational problem with a single equality
constraint, introduce hidden variables qi which satisfy
q2i = pi,
i = 1, . . . , N.
(13)
Then pi(t) ≥ 0. Also,
N
p
q2
i=1 i = 1 if and only if
N
i=1 i = 1. Suppose that c and d are vectors in RN with c2
i = ai
and d2i = bi for i = 1, . . . , N. Define a function K(q) by
N
K(q) := H(p) =
q2i log2(1/q2i).
(14)
i=1
6
Ahlbrandt, Benson & Casey
Then the hidden variational problem is that of minimizing the functional
1/2
1
N
J(q) =
2K(q(t))
q2i(t)[qi (t)]2
dt
(15)
0
i=1
on the class Qc d of admissible piecewise smooth q defined on [0, 1] which satisfy
q(0) = c,
q(1) = d,
and ||q(t)||22 ≡ 1 on [0, 1].
(16)
If q ∈ Qc d and p is defined by pi = q2i for i = 1, . . . , N, then p ∈ Pa b. If, on the other hand, q is defined for
√
p ∈ Pa b by qi =
pi, for i = 1, . . . , N, we ask if q is in Qc d? Since the square root function is continuous on
[0, ∞), the qi are continuous on [0, 1]. However, the square root function of a real variable is differentiable on (0, ∞)
but does not have a right hand derivative at 0. Thus a qi might fail to have one–sided derivatives at zeros of qi.
For example, the real valued function f defined on [0, 1] by f (t) = |t − 1/2| is piecewise smooth and nonnegative
on [0, 1] but g defined by g(t) =
f (t) fails to have one sided derivatives at t = 1/2.
A probability path p(t) will be called rich if no pi(t) takes on the value of 0. We will usually suppose that the
vectors a and b are rich. The subset of Pa b consisting of rich probability paths will be denoted by Prich. The subset
a b
of Qc d consisting of admissible q(t) which have each qi(t) positive on [0, 1] will be denoted by Qpos. If p ∈ Prich
c d
a b
and q is defined by qi(t) =
pi(t) for each i = 1, . . . , N and each t ∈ [0, 1], then q ∈ Qpos. Conversely, if q ∈ Qpos
c d
c d
and p is defined by (13), then p ∈ Prich.
a b
Theorem 5 (Equivalent Minimization Problems) Suppose ˆ
q ∈ Qpos and ˆ
p ∈ Prich are related by ˆ
q
c d
a b
i(t) =
ˆ
pi(t) for i = 1, . . . , N. Then E(ˆ
p) = J(ˆ
q) and ˆ
p minimizes E on Prich if and only if ˆ
q minimizes J on Qpos.
a b
c d
Proof: Suppose p ∈ Prich and q ∈ Qpos are related on [0, 1] by q
a b
c d
i(t) =
pi(t) for each i = 1, . . . , N. If ˆ
p minimizes
E on Prich then
a b
J(ˆ
q) = E(ˆ
p) ≤ E(p) = J(q)
and ˆ
q minimizes J on Qpos. Conversely, if ˆ
q minimizes J on Qpos, then
c d
c d
E(ˆ
p) = J(ˆ
q) ≤ J(q) = E(p)
and ˆ
p minimizes E on Prich.
a b
Note that this theorem does not say that the variational problem of minimizing E on Pa b is equivalent to the
variational problem of minimizing J on Qc d. It could be that a minimizer p for E on Pa b, while piecewise smooth,
would not allow definition of a piecewise smooth q.
In order to use variational methods, we assume that the hidden problem has the property that J has a minimum
on Qc d which is assumed by a q in Qpos. Then the corresponding p is a candidate for minimizing E on P
c d
a b and p is
actually a rich probability path. When the numerics for approximating a function q in Qpos, of the boundary value
c d
problem arising in the Euler–Lagrange multiplier rule fail, then failure could be caused by flawed code, limitations
on the numerical methods, or because the assumption of a rich minimizer for E on Pa b is invalid, possibly because
smaller values of E can be attained with a probability path having corners.
We seek positive mimimizers for J because the hidden variational problem has a single differentiable side
condition and is amenable to the Lagrange multiplier method.
Suppose we succeed in finding a minimizer ˆ
q for J on Qc d which is in Qpos. Then ˆ
q is a minimizer for J on
c d
Qpos and the corresponding ˆ
p is a minimizer for E on the subclass Prich.
c d
a b
Once we have a minimizer q for J, it may be possible to numerically check by other methods (such as Dijkstra’s
algorithm) that the corresponding p approximates a minimum for E on Pa b.
In order to state the multiplier rule, introduce the following terminology and notation. For p a probability path,
let v(t) := p (t), and think of v as the “velocity vector”. Set V (v) := ||v||2. Then
N
1/2
N
1/2
V (p (t)) = ||p (t)||2 =
[pi (t)]2
=
[2qi(t)qi (t)]2
.
i=1
i=1
Minimal Entropy Probability Paths Between Genome Families
7
Let W (q, r) := {
N [2q
i=1
iri]2}1/2 and M (q, r) := K(q)W (q, r). In this notation we have
1
E(p) =
L(p(t), p (t)) dt
0
for L(p, v) = H(p)V (v) and
1
1
J(q) =
M (q(t), q (t)) dt =
K(q(t)) W (q(t), q (t)) dt.
0
0
We now give the multiplier rule for the problem of minimizing J on Qc d. Introduce an auxiliary function
F (q, r) := λ0M(q, r) + λ1g(q),
where
(17)
g(q) := ||q||22 =
N
q2
i=1 i .
From Ewing, [7, Thm 5.2, p. 115], we have the following multiplier rule.
Theorem 6 (Lagrange Multiplier Rule) Suppose that ˆ
q ∈ C1 is a minimizer for the hidden variational prob-
lem such that each component ˆ
qi(t) is never 0 on [0, 1]. Then there exist multipliers λ0, a real constant, and a
function λ1(t), not simultaneously 0 at any point of [0, 1], such that the auxiliary function F satisfies the Euler–
Lagrange equation
d F
dt r(ˆ
q(t), ˆ
q (t)) = Fq(ˆ
q(t), ˆ
q (t)) on [0, 1].
(18)
Here Fq and Fr denote the N × 1 column vectors [Fq ] and [F ], respectively.
i
ri
If λ0 = 0, then it follows that λ1(t)qi(t) ≡ 0 from which q(t) would have to be identically the zero vector contrary
to ||q(t)||2 = 1. Thus, the problem is “normal” and, without loss of generality, we may choose λ0 = 1. Denote λ1(t)
by λ(t). Now F (q, r) = M (q, r) + λg(q) = K(q)W (q, r) + λg(q) gives
Fq = K W + KW + λg
i
qi
qi
qi
(19)
Fr = KW
i
ri
and the Euler–Lagrange multiplier rule is
d {KW } = K W + KW + λg
i = 1, . . . , N,
(20)
dt
ri
qi
qi
qi
where the functions are evaluated at (ˆ
q(t), ˆ
q (t)). These equations become
d
K(q)
4q2q
i i
= 2λq
dt
{PN
i +
[2q
]2}1/2
j=1
j qj
1/2
K(q)
4qi(q )2
i
+ 2q
) − 2qi
N
[2q
,
{PN
i log2( 1
j q
[2q
]2}1/2
q2
ln 2
j=1
j ]2
j=1
j qj
i
for i = 1, . . . , N. Multiply the ith equation by qi and use pi = 2qiqi for
d
4q2
(p
1
2q2
q
i qi
i )2
i
i
K(q)
= 2λq2
+ 2q2
) −
||p ||.
dt
||p ||
i
+ K(q) ||p ||
i log2( q2
ln 2
i
Apply uv = (uv) − u v to the left hand side to obtain
d
4q3
(p
1
2q2
K(q)
i qi
= 2λq2
i )2 + 2q2
) −
i
||p || .
dt
||p ||
i
+ 2K(q) ||p ||
i log2( q2
ln 2
i
8
Ahlbrandt, Benson & Casey
Remove a factor of 2 for
d
q2
(p
1
q2
K(q) i (2qiqi )
= λq2
i )2 + q2
) −
i
||p ||
dt
||p ||
i
+ K(q) ||p ||
i log2( q2
ln 2
i
which may be expressed in p variables as
d
p
(p
1
p
H(p) ipi
= λp
i )2
+ p
) −
i
||p ||.
(21)
dt
||p ||
i + H (p)
||p ||
i log2( pi
ln 2
Change to a dummy index of j, sum from j = 1 to j = N, and use the fact that p is a probability vector to express
the multiplier in terms of p and p
1
N
d
pjp
λ =
− 2H(p) ||p || +
H(p)
j
.
(22)
ln 2
dt
||p ||
j=1
This is the first step. Now substitute for λ in equation (21) for the resulting Euler–Lagrange system of equations
in terms of state and velocity vectors
d
H(p) pipi
= −2p
N
d
H(p) pjpj
dt
||p ||
iH (p)||p || + pi
j=1 dt
||p ||
(23)
+ H(p) (p )2
i
+ p
) ||p ||
||p ||
i log2( 1
pi
for i = 1, . . . , N. These equations become tractable if one makes the assumption that ||p (t)|| is alway positive on
[0,1] and then (as in differential geometry) makes a change of independent variable from t to arc length s, i.e.,
t
s :=
||p (τ )|| dτ.
0
Let L = 1 ||p (τ )|| dτ, i.e., L is the arc length of the optimal probability path. Then let y(s) := p(t) and let an
0
overdot be differentiation with respect to s. Since ds = ||p (t)||, the chain rule applied to any differentiable function
dt
u(t) gives
du
du ds
=
= ˙u(s)||p (t)||
dt
ds dt
which implies
1
du = ˙u(s).
||p (t)|| dt
Thus pj/||p || = ˙yj. Divide system (23) by ||p || and replace p(t) by y(s) for the equivalent Euler–Lagrange equation
d {H(y)y
N
d {H(y)y
ds
i ˙
yi} = − 2yiH(y) + yi
j=1 ds
j ˙
yj}
(24)
+ H(y)( ˙yi)2 + yi log2( 1 ),
yi
for i = 1, . . . , N. Define z by
zj(s) := yj(s) ˙yj(s)H(y(s))
(25)
for j = 1, . . . , N. Equation (24) becomes
N
˙zi − yi
˙zj = −2yiH(y) + ( ˙yi)2H(y) + yi log2(1/yi), i = 1, . . . , N.
(26)
j=1
Minimal Entropy Probability Paths Between Genome Families
9
Then express ˙yi in terms of H(y), yi, and zi. It will be possible to solve for ˙yi as
z
˙y
i
i = Pi(y, z) :=
, i = 1, . . . , N,
(27)
yiH(y)
if we use the assumption that the qi never vanish. Then our probability path is rich, i.e., no pi(t) takes on the
value of 0 at any t ∈ [0, 1]. Rewrite equation (26) as
˙z
N
i − yi
˙z
j=1 j = Ri(y, z)
(28)
:= −2yiH(y) +
z2
i
+ y
y2H(y)
i log2(1/yi), i = 1, . . . , N.
i
Let
1 − y1 −y1 −y1 . . . −y1
−y2 1 − y2 −y2 . . .
−y2
A(y) :=
.
.
.
.
= I
.
.
N − ywT ,
(29)
.
.
.
.
.
. .
.
.
−yN −yN −yN . . . 1 − yN
where wT := 1 1 . . . 1 . Then equation (28) may be written as
A(y) ˙z = R(y, z)
(30)
An N –vector y will be called positive if each yi is positive [11, p. 352].
Theorem 7 (Consistency and Norm Conditions) If y, z is a solution of the system
˙y = P (y, z),
A(y) ˙z = R(y, z)
(31)
on [0, L] with y(s) a rich probability vector for every s ∈ [0, L], then y, z satisfies the “consistency condition”
N
zi/yi ≡ 0 on [0, L]
(32)
i=1
and the “norm condition”
1
N
z2i ≡ 1 on [0,L],
(33)
(H(y))2
y2
i=1
i
i.e., || ˙y(s)||2 ≡ 1 on [0, L]. Conversely, if y, z is a solution of (31) with y(s) a positive vector on [0, L] such that the
consistency condition is satisfied, then
N
y
i=1 i(s) is constant on [0, L]; if, in addition, y(0) is a probability vector,
then y(s) is a probability vector for each s ∈ [0, L] and || ˙y(s)||2 ≡ 1 on [0, L].
Proof. If y is a probability vector, then
N
˙y
1 ≡ 0. Sum equation (27) on i for the consistency condition.
i=1 i ≡ d
ds
Summing equation (28) on i gives
1
N
z2
1
N
z2
0 = −2H(y) +
i + H(y) = −H(y) +
i .
H(y)
y2
H(y)
y2
i=1
i
i=1
i
Thus 1 = (1/H(y))2
N
z2
( ˙y
i=1 i /y2
i =
N
i=1
i)2 = || ˙
y||2. Conversely, the consistency condition and summing of (27)
implies that
N
˙y
y
i=1 i is 0 and
N
i=1 i is constant.
In order to numerically approximate solutions of (31) by methods for autonomous vector systems, we would
like to replace that system with a system
˙y = P (y, z)
˙z = Q(y, z)
(34)
10
Ahlbrandt, Benson & Casey
Theorem 8 (Explicit E–L System) Assume that y is a rich probability path on [0, L] and y, z is a solution of
(31) on [0, L] with P, R, and A defined by (27), (28), and (29), respectively. Then y, z satisfies the consistency
condition (32) on [0, L], || ˙y|| ≡ 1, and there exists a continuous Q(y, z) such that y, z is a solution of the system
(34) on [0, L].
Before presenting a construction of Q, which suggests code for use of a numerical integration package, we need
some general comments about matrices of the form of A(y). Observe that A(y) is of a more general form than a
Householder matrix, i.e., an elementary reflector, H := I −2uuT where uT u = 1, which has the properties HT = H,
H2 = I and hence H−1 = H. [16, p. 45]. Observe that A(y) of equation (29) and Householder matrices H share
the property of being rank one perturbations of I .
In a wider sense, the concept of rank one perturbations of matrices, or of operators, has been employed previously
by Wilkinson [18] in finding a nearest singular matrix to a given nonsingular matrix and in the study of sensitivity
of eigenvalues of matrices; also Simon [15] gave a spectral analysis of rank one perturbations of a positive selfadjoint
operator on a separable Hilbert space.
The acronym ROPI will denote a special class of rank one perturbations of I defined as follows:
Definition 9 (ROPI Matrices) Let γ = 0 and consider nonzero N × 1 vectors of real numbers y and w. If
δ := wT y is such that γδ = −1, then the N × N matrix A, called a ROPI matrix, is defined by
A = IN + γywT .
(35)
Theorem 10 (Inverses of ROPI Matrices) If A = IN + γywT is a ROPI matrix, then A is nonsingular, A−1
is ROPI, and
γ
A−1 = IN −
ywT
(36)
1 + γδ
Proof. Let B = γywT . Then A = I + B and
B2 = γywT (γywT ) = γ2y(wT y)wT = γ2yδwT = γδ(γywT ) = γδB.
Replace B by A − I to obtain
(A − I)2 = γδ(A − I)
which implies
A[A − (2 + γδ)I] = −(1 + γδ)I.
The assumption γδ = −1 implies that A has an inverse which is given by
1
1
A−1 = −
[A − (2 + γδ)I] = −
[I + γywT − (2 + γδ)I]
1 + γδ
1 + γδ
γ
and A−1 = I −
ywT as claimed.
1 + γδ
Unfortunately, in our application the matrix A(y) defined by (29) does not have an inverse if y is a probability
vector. In that case, every column of A(y) sums to 0 and the rows of A(y) are linearly dependent. Thus A(y) is
singular. Also, γ = −1 and δ = wT y = 1 imply γδ = −1, contrary to our definition of a ROPI matrix. However,
since the system of equations A(y) ˙z = R(y, z) is redundant, any one equation, say the kth equation, may be deleted
without loss of information if we carry along the consistency condition. The concept of ROPI matrices then may
be applied to the deflated problem.
Proof of Theorem 8. Suppose that y is a rich probability vector and k ∈ {1, . . . , N }. If z is an N × 1 matrix,
let z[k] be the (N − 1) × 1 matrix, i.e, column vector, obtained by deleting the kth entry of z. For the N × N
matrix A, let A[k] be the submatrix obtained from A by deleting the kth row and kth column from A.. Then
Add New Comment