Once that we have obtained the topology of the tree we can compute the length of the branches in a more accurate way. Recall that in an evolution phylogenetic tree (ept), the length of the branches are proportional to the PAM distance between the nodes. In other words, we want to build a tree such that the branch lengths conform the distance matrix as closely as possible.

Since we do not have information about the internal nodes, all our computations will be based on the external nodes (present day) information.

An ept will be one which reflects the distance constraints accurately. For example,

the distance from node 1 to node *Y* plus the distance
from node *Y* to node 3 should be as close as possible
to the PAM distance between the two sequences 1 and 3
In general, if is the closest
common ancestor
of the sequences *i* and *j*

The crucial observation necessary to compute the distances is that in a correct ept

is a random variable
with expected value 0 and variance 1.
Optimally all the would be zero, but this
is not possible to guarantee since we do not have an
exact measure of the distances but estimates.
Then it is natural to select the internode branch lengths
as the values which minimize the sum of the .
This will give us a least squares fit
of the distances.
This is a non-conventional least squares problem which
does not present any technical difficulties except for
attention to detail.
In Darwin the computation of an ept by the above method
is done by the function `MinSqTree`

.

> print(MinSqTree);

MinSqTree: Usage: MinSqTree( Dist:matrix(numeric), Var:matrix(numeric), La bels:list ) Returns a binary tree which approximates the given distances and variances. Labels is an optional array for labeling the nodes.

`MinSqTree`

computes the topology of the tree with
the function `PhyloTree`

and then it solves the least
squares fit for the internode lengths and constructs the
final tree.
> tree3 := MinSqTree( Dist, Var, ['1','2','3','4','5'] );

MinLen not set, assumed to be 0.1 tree fitting index 0.12 (poor is >1, good is <1) tree3 := Tree(Tree(Leaf(3,-6.2611,3),-1.6669,Tree(Leaf(2,-8.3550,2), -6.1652,Leaf(1,-10.3819,1))),0,Tree(Leaf(4,-6.7873,4),-4.4666,Leaf(5, -9.8780,5)))

> DrawTree(tree3);The output of the above tree is shown in figure . (m14.tree3.ps, Phylogenetic tree generated by MinSqTree)

There are three sources of difficulties in the least squares computation. The first difficulty is that one pair of distances is underdetermined. The distances between the root and left and right subtrees are not completely determined. Their sum is, but the precise location of the root between these first two descendants is not. This is due to the fact that every time that one of these distances appears, it appears added with the other. This can be quickly verified in our example where all the occurrences of and are of the form . This is a consequence of our symmetric model of evolution, and it also happens for the alignment of two individual sequences (see chapter 7).

We resolve this difficulty with an arbitrary constraint, that the average height of all the nodes in the right and left subtree be the same. This gives a reasonable shape to the tree, but it has to be remembered that the location of the root between its two first descendants is arbitrary.

The second difficulty arises when we have incomplete data
and the least squares fit results underdetermined.
This could happen because the all-against-all procedure
may fail to align some pairs.
When we do the all-against-all
alignment we should only accept the matches with a similarity
score above certain threshold
(this threshold is usually between 80 and 100).
As a result of being unable to align some sequences the
tree may be disconnected.
When the tree is disconnected, the
connected components
should be handled separately.
However it may happen that the tree is connected but
the number of successful alignments is not enough to resolve
all internode distances.
Notice that when we have *n* sequences, we have *n*-1
internal nodes which give 2*n*-2 internode unknown
distances.
One of the two distances at the root is not determined so
we have 2*n*-3 unknowns.
The all-against-all matching provides at most *n*(*n*-1)/2
data points.
Consequently the internode distance computation

is exactly determined for *n*=2 and *n*=3 and
overdetermined
(if we have all the data points) for .
On the other hand a tree could be connected with just
*n*-1 alignments which will not be enough to resolve
2*n*-3 distances.

We resolve this problem by adding a term to the least squares which minimizes the sum of the square of the lengths of the branches. This sum of squares is multiplied by a very small factor ( ), so when the problem is fully determined, this extra term will not affect the results in any significant way, but when the system of equations is underdetermined this term will force undetermined lengths to be as short as possible. This solution is entirely satisfactory in practice. It should also be noted that this sort of problem happens rarely, and only when the sequences being analyzed are marginally related.

The third difficulty is that the least squares solution
may make some of the internode lengths
negative.
A negative length branch does not make sense in evolutionary
terms.
Such a situation arises from: (a) a bias in the
estimation of the PAM distances, or (b) alignments which
do not correspond to homology and hence the distance
estimates are incorrect and (c) an incorrect topology
(phase I of the construction failing).
When an internode branch is assigned a negative value,
this means that the least squares are not
attainable at that point, but in the limit of that branch
being of length zero.
A branch of length 0 is not impossible, it corresponds
to an ancestral sequence which split in a very short
period into three descendant families (two new offsprings)
instead of two (one new offspring).
In other words a ternary
as opposed to binary internal node.
The Darwin function `MinSqTree`

changes negative branch
lengths to very short (0.2 PAM) positive lengths.
The selection of a very small value as opposed to 0 is
inconsequential in the model, but allows better displaying of the
tree.

Negative branch lengths do not arise frequently and only happen with data which is difficult to align correctly. It is not a serious problem in practice.

The sum of the residual squares ( ) after the fit gives a rough idea of the quality of the fit. A small number indicates a good fit, a large number indicates a poor fit. Since this value is normalized with respect to its variance, it has an expected value 1. Values lower than 1 indicate that the data was better behaved than expected. A value greater than 1 indicates that either we were unlucky or the alignments are suspicious. In any case a value greater than 1 decreases our confidence on the phylogenetic tree.

A note to the complexity
of these procedures is in order.
The `PhyloTree`

function requires a pair of matrices
(double precision numbers) and works in time
proportional to .
Thus we can use the entire memory available
and expect the tree construction to complete quickly.
For example with 16Mb of memory we could resolve phylogenetic
trees up to 1000 sequences.
The procedure `MinSqTree`

has to resolve a system of
linear equations of size .
This requires about double
precision numbers in total and
time.
Consequently `MinSqTree`

will be able to solve only
smaller problems.

Thu Oct 31 08:31:39 EST 1996