In: "1992 Lectures in Complex Systems", Stein, D. and Nadel, L., Eds. Addison-Wesley Pub. Cy. and Santa Fe Institute, 1993, pp. 519-530.
[Caution: There are many subscripts and several superscripts; thus, RCC should be read R_{CC}, i.e. the C-C bond length, etc. Greek letters are spelled out in the text, e.g. pai, TETA (cap); the joys of html..., sorry, GY]
Biologists occasionally state that a certain system or phenomenon "is very complex indeed". What exactly is meant by such a statement? Does it mean merely that no real understanding of the system concerned is available, or can the term "complex" be assigned a positive and precise meaning? We may consult a dictionary. Besides definitions like "Complex: An irrational attitude (Psych.)" or a "non real number (Math.)", neither of them very helpful, we also find "Composed of many interconnected parts" and "Of intricate design" (Random House Dictionary, 1987). These last two definitions sound intuitively appropriate, because many biosystems are indeed composed of many parts, or components, and certainly have an intricate design. The components can be atoms or molecules on the ultimate level, molecular assemblies on the intermediate level, and cells or organs on the higher levels of bioorganization. The intricate details of these bioassemblies are coded for, at least partly, by the genome of the organism concerned. The information coded has been accumulated during evolution, is faithfully transmitted from generation to generation and is precisely decoded (expressed) within each generation.
All this is well known. The questions to be addressed here are whether these complex aspects of biosystems can be given a precise definition suitable for quantitative evaluation, and what such an evaluation can contribute to the understanding of these systems. Several formal approaches to the quantitative definition of complexity are available (for a critical evaluation, see Bennett, 1990).
One approach is based on the tacit assumption that behind many complex phenomena hides a simple mathematical relation, like a set of differential equations or a rule of a cellular automaton. It is the solution of these which, within a range of real conditions, manifests the complex behavior or pattern observed. The task of the biologist, or biophysicist, is to detect the generating function giving rise to the complex pattern. This approach gives low weight to the fact that bioentities have many of their properties specified by the vast repertoire of instructions encoded in their genome. An example is the generation of biological form - morphogenesis. While a certain amount of symmetry-related features can be beautifully explained by simple growth mechanisms (Green and Phoetig, 1982), it is clear that even quite simple creatures like viruses cannot be generated unless dozens of genes produce their precisely coded products. For instance, in bacteriophage T4 at least 49 genes must be precisely expressed in order to generate the three structural components of its virion form (head, tail and tail fibers) and produce an infective virion (Yagil, in preparation).
A second approach does not insist on the presence of a simple generating function, and tries to evaluate complexity by estimating the information content of the states which a biosystem realizes in relation to the information content of all possible states of the system. The complexity of a system is expressed in terms of informational entropy, employing Shannon-Weaver related expressions (Wicken, 1987; Feistel and Ebeling, 1989). This approach is particularly suitable for the description of stochastically determined features or processes, but so far it has not proved able to include those aspects of life where a high input of genome-coded instructions is involved.
The third approach, formulated by Solomonoff (1964), Kolmogorov (1965) and Chaitin (1987), characterizes the complexity of a string of symbols by the minimal size of the program which will compute that string; in a real system this would mean the minimal size of the set of instructions required in order to obtain a complete description of that system. This approach neither presumes a mechanistic understanding nor requires complete knowledge of all possible states a system can assume. Complexity thus evaluated has been called algorithmic complexity, and has been shown to be able to account for the entropic properties of physical systems (Zurek, 1990).
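The minimal program size cannot be computed in general, but the length of a compressed encoding gives a rough upper bound on a description length. The sketch below is only an illustration of that idea: it uses Python's zlib compressor as a stand-in (an assumption of this example, not a method used in the text) and compares a highly regular string with a pseudo-random one of the same length. The helper name `compressed_size` is hypothetical.

```python
import random
import zlib

def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (a crude upper-bound proxy)."""
    return len(zlib.compress(data, 9))

# A highly regular string: one short unit repeated many times.
regular = b"AB" * 500

# A pseudo-random string of the same length (fixed seed for reproducibility).
rng = random.Random(0)
irregular = bytes(rng.randrange(256) for _ in range(1000))

# The regular string needs far fewer bytes to describe than the irregular one.
print(compressed_size(regular), compressed_size(irregular))
```

A general-purpose compressor only detects some kinds of regularity, so the numbers it gives are bounds for illustration, not complexity values in the sense used in the rest of the paper.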
This last approach is particularly suitable for the characterization of biosystems, because it can evaluate the long sets of coded information inherent in them. For a beautiful descriptive exposition, see Dawkins, The Blind Watchmaker, 1987. Judging from the genomic DNA sequenced so far, it does not seem that we shall discover simple functions which can generate most of the many million bits of information present in the genetic code. This implies that the complexity of a substantial part of biological information is more likely to be described by approaches which just enumerate features than by approaches which calculate probabilities or presume underlying simple processes. We have consequently adopted the Kolmogorov-Chaitin approach to the evaluation of the structural features of biocomplexity, applying it to molecular assemblies of interest in biosystems (Yagil, 1985).
In this paper the formalism designed to evaluate the structural complexity of bio- and other systems is described and applied in detail to some fairly simple organic molecules. The basic idea is to express the Structural Complexity of a system in terms of the size of the set of instructions leading to that structure. The formalism was developed for typed point systems, i.e. for sets of points which may have different compositions. Simple molecules as well as biomolecules and their assemblies can be regarded as typed point systems. In previous papers we analyzed two very simple molecules (methane and ethane, Yagil 1985) as well as several macromolecules and bioassemblies (Yagil, 1992), and showed a connection between the complexities arrived at and the coding requirements of the biomolecules treated.
In the following, the complexity of three simple organic molecules is evaluated, because this helps to clarify both the procedure employed and the assumptions inherent in the proposed treatment. To analyze a molecule, the composing atoms are numbered, and the molecule is put in a suitable coordinate system. Next, each coordinate is examined as to whether it has the same value in every molecule (is ordered), or whether it assumes different values at different times (is random). The complexity of the ordered coordinates is then evaluated by the following set of rules, essentially the same as in Yagil (1985):
The criterion by which these rules have been formulated is the extent to which they lead to consistent descriptions when different coordinate and numbering systems are used. A formal justification is not attempted at present. The crucial rule in determining structural complexity is No. 3, which reduces the number of specifications needed for each k-fold regularity from k statements to a single one. This can be written:
C = \sum_{k} [c(k)/k] - c'    (1)
The c(1) term of eq. 1 gives the contribution of uniquely specified coordinates, while all other terms represent coordinates of some repetitivity or regularity. These uniquely specified coordinates are not random, but ordered; random coordinates have been excluded on the ground that they are indeterminate, so that it cannot be determined whether they obey any regular relationships. Most natural DNA templates are uniquely specified rather than random, because DNA in every cell of the same organism will have the same base sequence. The distinction between ordered (whether regular or uniquely specified) and random coordinates c can be formally expressed:
c = c_{ran} + c_{ord} = c_{ran} + c_{reg} + c_{us} (2)
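The arithmetic of eq. 1 can be sketched in a few lines. In this illustration c(k) is taken to be the number of ordered coordinates taking part in k-fold regularities (k = 1 for uniquely specified ones), and c' is taken to be the count of placement specifications; both readings are assumptions based on the surrounding discussion, and the function name and the example counts are hypothetical.

```python
from fractions import Fraction

def structural_complexity(c_by_k, c_placement):
    """Evaluate eq. 1: C = sum over k of c(k)/k, minus c'.

    c_by_k maps a regularity k to c(k), the number of ordered coordinates
    taking part in k-fold regularities (k = 1: uniquely specified).
    c_placement is c', assumed here to count placement specifications.
    """
    total = sum(Fraction(c, k) for k, c in c_by_k.items())
    return total - c_placement

# Hypothetical counts, chosen only to illustrate the arithmetic.
example = {1: 2, 3: 6, 12: 12}
print(structural_complexity(example, c_placement=1))  # 2 + 2 + 1 - 1 = 4
```

Each k-fold regularity contributes one statement per k coordinates, which is how rule No. 3 reduces k statements to a single one.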
The simple hydrocarbon molecule of pentane will serve as an example. A pentane molecule is composed of n=17 atoms - 5 carbons and 12 hydrogens. Three noncyclic isomers exist: n-pentane, isopentane and neopentane; the structural formulas are shown in Fig. 1.
Each pentane molecule is fully specified when all its 68 coordinates (one coordinate for type and three in space for each atom) are specified. The values the coordinates of neopentane assume are shown in the specification table for neopentane, Table 1. This table contains 68 numerical entries; this figure of 68 is thus an upper limit for the complexity of a pentane molecule.
Table 1. Specification table for neopentane (p = Greek pai; Q = Greek TETA)
____________________________________________________________
 i    eps.   r      fi              teta           T
____________________________________________________________
 1    C      0      0               0              T0
 2    C      Rcc    0               Qcc/2          T0
 3    C      Rcc    p/4             -Qcc/2         T0
 4    C      Rcc    2p/4            p - Qcc/2      T0
 5    C      Rcc    3p/4            p + Qcc/2      T0
 6    H      RCH    Any1            QCH            T1
 7    H      RCH    Any1 + 2p/3     QCH            T1
 8    H      RCH    Any1 + 4p/3     QCH            T1
 9    H      RCH    Any2            QCH            T2
 10   H      RCH    Any2 + 2p/3     QCH            T2
 11   H      RCH    Any2 + 4p/3     QCH            T2
 12   H      RCH    Any3            QCH            T3
 13   H      RCH    Any3 + 2p/3     QCH            T3
 14   H      RCH    Any3 + 4p/3     QCH            T3
 15   H      RCH    Any4            QCH            T4
 16   H      RCH    Any4 + 2p/3     QCH            T4
 17   H      RCH    Any4 + 4p/3     QCH            T4
____________________________________________________________
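(x') = T(x) = R(x) + D =
\begin{pmatrix}
\sin\theta\cos\phi & -\cos\theta\cos\phi & \sin\phi \\
\sin\theta\sin\phi & -\cos\theta\sin\phi & -\cos\phi \\
\cos\theta & \sin\theta & 0
\end{pmatrix} x +
\begin{pmatrix} R_{CC} \\ 0 \\ 0 \end{pmatrix}    (2)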
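The transformation above can be checked numerically. The sketch below reads it as a 3x3 matrix R acting on x followed by a translation D = (R_CC, 0, 0), with q and f taken to stand for the angles teta and fi; that reading, the function names, and the numerical values (a tetrahedral angle and a typical C-C bond length in angstroms) are assumptions of this illustration. It verifies that the transformation leaves the distance between two points unchanged, as expected when a local frame is moved from one atom to another.

```python
import math

def rotation(theta, phi):
    """Matrix R of T(x) = R x + D, with q read as theta and f as phi."""
    st, ct = math.sin(theta), math.cos(theta)
    sp, cp = math.sin(phi), math.cos(phi)
    return [
        [st * cp, -ct * cp, sp],
        [st * sp, -ct * sp, -cp],
        [ct, st, 0.0],
    ]

def transform(x, theta, phi, r_cc):
    """Apply T(x) = R x + D with D = (r_cc, 0, 0)."""
    R = rotation(theta, phi)
    d = (r_cc, 0.0, 0.0)
    return [sum(R[i][j] * x[j] for j in range(3)) + d[i] for i in range(3)]

# Illustrative values only, not taken from the text.
theta = math.radians(109.47)
phi = 2 * math.pi / 3
r_cc = 1.54

a = (0.0, 0.0, 0.0)
b = (1.09, 0.0, 0.0)
a2 = transform(a, theta, phi, r_cc)
b2 = transform(b, theta, phi, r_cc)

# The rows of R are orthonormal, so distances are preserved by T.
print(math.dist(a, b), math.dist(a2, b2))
```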
The actual structural complexity of neopentane is however considerably lower than 68, for three reasons:
Points 1 and 2 imply that the maximal complexity C_{max} that a pentane analogue with no regular feature can attain is C_{max} = 4n - c' - c_{ran} = 68 - 6 - 4 = 58. An examination of Table 1 (and of analogous tables constructed with different coordinate or numbering systems) leads to the conclusion that polar coordinates give the most concise set of instructions for neopentane, comprising the following minimal set of statements:
Cpri (1) is a primary, methyl carbon; RCC and RCH are the carbon-carbon and carbon-hydrogen bond lengths, and Q_{CC} and Q_{CH} are the CCC and CCH bond angles. These 12 statements provide all the information needed to construct a neopentane molecule. Statements 4, 7 and 10 are, however, placement statements, which do not contribute to the complexity of the molecule. The remaining 9 statements are required and lead to a value of C = 9 for the structural complexity of neopentane. If we want to relate this value to the maximal complexity available for a 17-atom system with 4 random coordinates, we obtain a relative complexity C_{r} = C/C_{max} of 9/58, i.e. C_{r} = 0.155 for neopentane. The four indeterminate "Any" values are, in this case, included in statement 9, which also contains the constant 2p/3. Relative complexity values can help to relate complexities of differently sized systems.
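The counting for neopentane can be retraced step by step. The sketch below uses only numbers given in the text (17 atoms, 4 coordinates per atom, c' = 6, c_{ran} = 4, 12 statements of which 3 are placements); the function and variable names are hypothetical.

```python
def max_complexity(n_atoms, c_placement, c_random):
    """C_max = 4n - c' - c_ran: four coordinates per atom,
    minus the placement and the random coordinates."""
    return 4 * n_atoms - c_placement - c_random

n = 17  # atoms in a pentane molecule (5 C + 12 H)
c_max = max_complexity(n, c_placement=6, c_random=4)

statements = 12           # statements in the minimal set for neopentane
placement_statements = 3  # statements 4, 7 and 10
C = statements - placement_statements
C_r = C / c_max

print(4 * n, c_max, C, round(C_r, 3))  # 68 58 9 0.155
```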
Is neopentane less or more complex than n-pentane (CH3.CH2.CH2.CH2.CH3) or isopentane (CH3.CH2.CH.(CH3)2)? An effort to answer this question was the incentive to analyze the pentanes. To this end, the specification tables for n-pentane and isopentane were set up and examined. The following minimal sets of statements result (the student is encouraged to repeat the exercise):
For n-pentane (for numbering see Fig. 1):
Csec is a secondary, methylenic carbon. RCC' and RCC'' are the distances between primary-secondary and secondary-secondary carbons, respectively. Statements 4, 9 and 15 are placement statements; statements 11 and 12 refer to random coordinates only. This leaves 13 statements to describe the ordered part of the molecule. Consequently, the complexity of n-pentane is C = 13 and C_{r} = 13/58 = 0.224; n-pentane is thus more complex than neopentane on both absolute and relative scales.
For isopentane:
Statements 5, 12, 15 and 18 are placements, and 13 is random. On the other hand, statements 11, 17 and 23 have to be counted twice, because they involve two different transformations each (T2, which transforms from C3 to C1, is different from T3 and T4). This results in 21 necessary statements, i.e. the structural complexity of isopentane is C = 21 (C_{r} = 0.36). Isopentane is thus the most complex of the three noncyclic pentanes, as intuitively expected. It should be noted that both isopentane and n-pentane have a plane of symmetry at certain values of "Any_{k}". Symmetry relations have so far not been too helpful; further analysis might nevertheless be rewarding.
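The three results can be put side by side. The sketch below takes the complexities stated in the text (9, 13 and 21) and the common C_{max} = 58, computes the relative complexities to three decimals and ranks the isomers; the variable names are hypothetical.

```python
C_MAX = 58  # 4n - c' - c_ran = 68 - 6 - 4 for each noncyclic pentane

complexities = {"neopentane": 9, "n-pentane": 13, "isopentane": 21}

relative = {name: c / C_MAX for name, c in complexities.items()}
ranking = sorted(complexities, key=complexities.get)

for name in ranking:
    print(f"{name:12s} C = {complexities[name]:2d}  C_r = {relative[name]:.3f}")
```

Because all three isomers share the same C_{max}, the absolute and relative scales give the same ordering here; the relative scale matters when systems of different size are compared.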
The examples analyzed demonstrate that:
a. A value for the structural complexity of a typed point system can be assigned. This assignment is so far based on a somewhat arbitrary set of rules. Practice shows however that changing these rules leads to inconsistencies when the same system is analyzed in different ways. A more rigorous mathematical analysis is needed in order to determine whether the assignments are indeed unique and whether algorithms can be devised leading to these assignments uniquely.
b. An important step in the complexity analysis of any system is determining which coordinates are random and which are ordered (whether regular or uniquely specified). The test is in principle simple: Let us examine a certain number of systems in an ensemble, e.g. molecules in a specimen. If a certain coordinate assumes the same value in each molecule of the ensemble, then it belongs to the ordered repertoire. On the cellular level, one can compare, for instance, tubule cells in the kidney to red blood cells. Tubule cells are arranged in a radial fashion around the kidney tubules, so that they represent an ordered, fairly regular set, and their contribution to the complexity of the organ can be assessed. In contrast, an erythrocyte (red cell) can be found anywhere in the blood stream; its positional coordinates are random, and no complexity value can be assigned to the arrangement of the erythrocytes in the organism. This randomness test has to be applied to each coordinate before the complexity of any element can be assessed.
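The ensemble test described in point b can be sketched as follows. Each member of the ensemble is represented by a tuple of coordinate values, and a coordinate is labeled ordered when it has the same value in every member. The numerical tolerance, the function name and the example data are assumptions of this illustration.

```python
def classify_coordinates(ensemble, tol=1e-6):
    """Label each coordinate 'ordered' if it has the same value (within tol)
    in every member of the ensemble, otherwise 'random'."""
    n_coords = len(ensemble[0])
    labels = []
    for j in range(n_coords):
        values = [member[j] for member in ensemble]
        spread = max(values) - min(values)
        labels.append("ordered" if spread <= tol else "random")
    return labels

# Hypothetical ensemble of three molecules, each described by
# (bond length, bond angle, torsion angle).
ensemble = [
    (1.54, 109.5, 12.0),
    (1.54, 109.5, 147.3),
    (1.54, 109.5, 301.8),
]
print(classify_coordinates(ensemble))  # ['ordered', 'ordered', 'random']
```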
c. The formalism permits the assignment of a value not only to the complexity, but also to the degree of ordering of a system. This is done by subtracting the indeterminate coordinates from the coordinates available after placement and taking the ratio of the remainder to that total; for instance, for each of the three noncyclic pentanes the order OMEGA is OMEGA = (62 - 4)/62 = 58/62. In cyclopentane (n=15), OMEGA = 52/54, because the ring constraints leave only two indeterminate angles, the "pseudorotation" and one torsion angle (Sanger, 1983). The distinction between random and ordered coordinates is important, because not only biosystems, but most real world systems have many indeterminate coordinates; think of the position of a point on the rim of a car wheel, or the number of twigs on a tree. Most real systems are only partially ordered, just like the pentanes. Structural complexity is relevant and assignable only to the ordered part of a system. The distinction between ordered and random coordinates of a system is a fundamental feature of the treatment presented, separating it from all previous treatments of the subject.
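The two values of OMEGA quoted in point c follow from the same count. The sketch below assumes, as the figures 62 and 54 suggest, that 6 placement coordinates are subtracted from the 4n coordinates in both cases; the function name is hypothetical.

```python
def order(n_atoms, c_random, c_placement=6):
    """Return (ordered, available): the ordered coordinates and all
    coordinates left after placement; OMEGA is their ratio."""
    available = 4 * n_atoms - c_placement
    ordered = available - c_random
    return ordered, available

for name, n, c_ran in [("noncyclic pentane", 17, 4), ("cyclopentane", 15, 2)]:
    ordered, available = order(n, c_ran)
    print(f"{name}: OMEGA = {ordered}/{available} = {ordered / available:.3f}")
```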
d. Structural complexities are extrathermodynamic quantities, because the structural complexities are determined by the stable molecular bonds and not by the occupancy of internal energy levels (except for possible rotational levels associated with random coordinates). Complexity differences persist at zero degrees Kelvin, where all internal energies are in the ground state and where, according to the third law of thermodynamics, all crystalline (ordered) compounds have a physical entropy of zero. Further, while conventionally measured entropy is an extensive property of systems, structural complexity is an intensive property, the complexity of a single pentane molecule being equal to that of a mole.
e. Structural complexity is low for most inanimate systems, but will assume high values in designed systems - systems which are created with the help of instructions specifying their pattern and composition. In the primitive molecular systems tackled here, instructions are provided by specific catalysts which can direct a chemical reaction towards one isomer (i.e. select one pentane isomer in preference to others). The degree of complexity thus achieved is, as we have seen, not too high. Higher degrees of complexity can be achieved when, in addition to a catalyst (enzyme), template molecules participate, as in DNA or protein biosynthesis. The high degree of complexity attained in the bioworld would be unthinkable without the participation of replicable templates. Templates, in contrast to simple catalysts, can store and transmit large amounts of information, and their active presence accounts for the high complexity found in bioorganisms. Even higher degrees of complexity are achieved in artificial systems created by intelligent beings: The creation of a template, blueprint, or design (all synonyms for the present discussion), whether in the mind of the designer or on paper, is an essential step in making really complicated instruments or works of art. We can expect therefore that the concept of structural complexity will reach its full utility in the physical and chemical analysis of templated and designed systems.