On The Structural Complexity of Templated Systems

By: G. Yagil, The Weizmann Institute of Science, Rehovot, Israel.


In: "1992 Lectures in Complex Systems", Stein, D. and Nadel, L., Eds., Addison-Wesley Pub. Cy. and Santa Fe Institute, 1993, pp. 519-530.
[Caution: There are many subscripts and several superscripts; thus, RCC should be read: R with subscript CC, i.e. the C-C bond length, etc. Greek letters are spelled out in places in the text, e.g. pai (pi), TETA (capital theta); the joys of html..., sorry, GY]



A. INTRODUCTION

Biologists occasionally state that a certain system, or phenomenon "is very complex indeed". What exactly is meant by such a statement? Does it mean merely that no real understanding of the system concerned is available, or can the term "complex" be assigned a positive and precise meaning? We may consult a dictionary: Besides definitions like: "Complex: An irrational attitude (Psych.)" or a "non real number (Math.)", both not too helpful, we find also: "Composed of many interconnected parts" or: "Of intricate design" (Random House Dictionary, 1987). These last two definitions sound intuitively appropriate, because many biosystems are indeed composed of many parts, or components, and certainly have an intricate design. The components can be atoms or molecules on the ultimate level, molecular assemblies on the intermediate level, as well as cells or organs on the higher levels of bioorganization. The intricate details of these bioassemblies are coded for at least partly by the genome of the organism concerned. The information coded has been accumulated during evolution, is faithfully transmitted from generation to generation and is precisely decoded (expressed) within each generation.

All this is well known. The questions to be addressed here are whether these complex aspects of biosystems can be given a precise definition suitable for quantitative evaluation, and what such an evaluation can contribute to the understanding of these systems. Several formal approaches to the quantitative definition of complexity are available (for a critical evaluation, see Bennett, 1990).

One approach is based on the tacit assumption that behind many complex phenomena hides a simple mathematical relation, like a set of differential equations or a rule of a cellular automaton. It is the solution of these which, within a range of real conditions, manifests the complex behavior or pattern observed. The task of the biologist, or biophysicist, is to detect the generating function giving rise to the complex pattern. This approach gives low weight to the fact that bioentities have many of their properties specified by the vast repertoire of instructions encoded in their genome. An example is the generation of biological form, morphogenesis. While a certain amount of symmetry-related features can be beautifully explained by simple growth mechanisms (Green and Poethig, 1982), it is clear that even quite simple creatures like viruses cannot be generated unless dozens of genes produce their precisely coded product. For instance, in bacteriophage T4 at least 49 genes must be precisely expressed in order to generate the three structural components of its virion form (head, tail and tail fibers) and produce an infective virion (Yagil, in preparation).

A second approach does not insist on the presence of a simple generating function, and tries to evaluate complexity by estimating the information content of the states which a biosystem realizes in relation to the information content of all possible states of the system. The complexity of a system is expressed in terms of informational entropy, employing Shannon-Weaver related expressions (Wicken, 1987; Feistel and Ebeling, 1989). This approach is particularly suitable for the description of stochastically determined features or processes, but so far it has not proved able to include those aspects of life where a high input of genome-coded instructions is involved.
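As a minimal illustration of the entropy-based measure mentioned above, the following Python sketch estimates the Shannon entropy (in bits per symbol) of a symbol sequence from its observed symbol frequencies. The function name `shannon_entropy` and the example sequences are hypothetical and are not taken from the cited works.

```python
from collections import Counter
from math import log2

def shannon_entropy(sequence):
    """Shannon entropy in bits per symbol, estimated from symbol frequencies."""
    counts = Counter(sequence)
    total = len(sequence)
    # Sum of p * log2(1/p) over the observed symbols, with p = n / total.
    return sum((n / total) * log2(total / n) for n in counts.values())

uniform = shannon_entropy("ACGT" * 25)  # four equally frequent symbols
single = shannon_entropy("A" * 100)     # one symbol only
print(uniform, single)
```

Note that the strictly periodic sequence `"ACGT" * 25` receives the highest per-symbol value possible for four symbols, since a frequency-based estimate of this kind ignores the order of the symbols.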

The third approach, formulated by Solomonoff (1964), Kolmogorov (1965) and Chaitin (1987), characterizes the complexity of a string of symbols by the minimal size of the program which will compute that string; in a real system this would mean the minimal size of the set of instructions required in order to obtain a complete description of that system. This approach neither presumes a mechanistic understanding nor requires complete knowledge of all possible states a system can assume. Complexity thus evaluated has been called algorithmic complexity, and has been shown to be able to account for the entropic properties of physical systems (Zurek, 1990).
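Algorithmic complexity itself cannot be computed in general. A crude, computable stand-in that is sometimes used for illustration is the length of a compressed encoding, which gives only an upper bound on the size of a description. The sketch below uses Python's standard `zlib` module for this purpose; it is a swapped-in proxy, not the measure defined by the authors cited above, and the name `compressed_size` and both test strings are hypothetical.

```python
import random
import zlib

def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (a rough upper-bound proxy)."""
    return len(zlib.compress(data, 9))

# A highly repetitive string: a short rule ("repeat AB 500 times") describes it.
regular = b"AB" * 500

# Pseudo-random bytes from a fixed seed: the compressor finds no short description.
rng = random.Random(0)
irregular = bytes(rng.randrange(256) for _ in range(1000))

print(compressed_size(regular), compressed_size(irregular))
```

Both strings are 1000 bytes long, yet the repetitive one compresses to a small fraction of the size of the irregular one, in line with the idea that regular features shorten the required set of instructions.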

This last approach is particularly suitable for the characterization of biosystems, because it can evaluate the long sets of coded information inherent in biosystems. For a beautiful descriptive exposition, see Dawkins, The Blind Watchmaker, 1987. Examining the genomic DNA sequenced so far, it does not seem that we shall discover simple functions which can generate most of the many million bits of information present in the genetic code. This implies that the complexity of a substantial part of biological information is more likely to be described by approaches which just enumerate features than by approaches which calculate probabilities or presume underlying simple processes. We have consequently adopted the Kolmogorov-Chaitin approach to the evaluation of the structural features of biocomplexity, applying it to molecular assemblies of interest in biosystems (Yagil, 1985).

B. STRUCTURAL COMPLEXITY

In this paper the formalism designed to evaluate the structural complexity of bio- and other systems is described and applied in detail to some fairly simple organic molecules. The basic idea is to express the Structural Complexity of a system in terms of the size of the set of instructions leading to that structure. The formalism was developed for typed point systems, i.e. for sets of points which may have different compositions. Simple molecules as well as biomolecules and their assemblies can be regarded as typed point systems. In previous papers we analyzed two very simple molecules (methane and ethane, Yagil 1985) as well as several macromolecules and bioassemblies (Yagil, 1992), and showed a connection between the complexities arrived at and the coding requirements of the biomolecules treated.

In the following, the complexity of three simple organic molecules is evaluated, because this helps to clarify both the procedure employed and the assumptions inherent in the proposed treatment. To analyze a molecule, the composing atoms are numbered, and the molecule is put in a suitable coordinate system. Next, each coordinate is examined as to whether it has the same value in every molecule (is ordered), or whether it assumes different values at different times (is random). Complexity of the ordered coordinates is then evaluated by the following set of rules, essentially the same as in Yagil (1985):

  1. Structural complexity C is the size of the minimal set of specifications describing a system.
  2. A specification can be the assignment of a numerical value to one or more spatial coordinates of a point in the system, or the declaration of the type of the point. A type may be a chemical element, a nucleotide base, a cell type, or any other compositional element eps. (Greek epsilon).
  3. Coordinate values which can be correlated by a mathematical expression are counted either as a single specification when a single numerical value is included, or by as many new numerical constants as are present in the expression.
  4. An ordinal number is not counted as a specification.
  5. The declaration of the range of atom numbers over which an expression is valid is not counted as a specification.
  6. A simple numerical coefficient like pi or (-1)^i is not counted as a separate specification (this rule makes tetrahedral and planar coordination spheres, for example, equally complex).
  7. A transformation of the coordinate system adds to the complexity as many specifications as new constants are included; only a single (dummy) transformation can accompany a specification statement.

The criterion by which these rules have been formulated is the extent to which they lead to consistent descriptions using different coordinate and numbering systems. A formal justification is not attempted at present. The crucial rule in determining structural complexity is No. 3, which reduces the number of specifications needed for each k-fold regularity from k statements to a single one. This can be written:

$$ C = \sum_{k} \frac{c(k)}{k} - c' \qquad (1) $$

where c(k) is the number of coordinates sharing a k fold regularity and c' is the number of the coordinates necessary for placing the system in the external space (usually 5 or 6). Equation (1) represents the intuitive idea that the more regular, repetitive features a system has, the lower its complexity will be.
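A minimal Python sketch of eq. (1) may help to fix ideas. The function name `structural_complexity` and the counts used in the example are hypothetical and chosen for illustration only; exact fractions are used so that each c(k)/k term is computed without rounding.

```python
from fractions import Fraction

def structural_complexity(regularity_counts, placement):
    """Eq. (1): C = sum over k of c(k)/k, minus the c' placement coordinates.

    regularity_counts maps k (the fold of a regularity) to c(k), the number
    of coordinates sharing a k-fold regularity; placement is c'.
    """
    total = sum(Fraction(c_k, k) for k, c_k in regularity_counts.items())
    return total - placement

# Hypothetical counts (not taken from the text): 10 uniquely specified
# coordinates, 8 coordinates in 4-fold and 12 in 3-fold regularities, c' = 6.
example = structural_complexity({1: 10, 4: 8, 3: 12}, placement=6)

# The same 30 coordinates with no regularity at all, for comparison.
irregular = structural_complexity({1: 30}, placement=6)
print(example, irregular)
```

The comparison shows the behavior described above: with the same number of coordinates, the system with repetitive features gets the lower value of C.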

The c(1) term of eq. 1 gives the contribution of uniquely specified coordinates, while all other terms represent coordinates of some repetitivity or regularity. These uniquely specified coordinates are not random but ordered; random coordinates have been excluded on the ground that they are indeterminate, so that it cannot be determined whether they obey any regular relationships. Most natural DNA templates are uniquely specified rather than random, because DNA in every cell of the same organism has the same base sequence. The distinction between ordered (whether regular or uniquely specified) and random coordinates c can be formally expressed:

$$ c = c_{ran} + c_{ord} = c_{ran} + c_{reg} + c_{us} \qquad (2) $$

We shall soon see how this distinction between random and ordered elements can be implemented.

C. THE NEOPENTANE MOLECULE

The simple hydrocarbon molecule of pentane will serve as an example. A pentane molecule is composed of n=17 atoms - 5 carbons and 12 hydrogens. Three noncyclic isomers exist: n-pentane, isopentane and neopentane; the structural formulas are shown in Fig. 1.

Each pentane molecule is fully specified when all its 68 coordinates (one coordinate for type and three in space for each atom) are specified. The values the coordinates of neopentane assume are shown in the specification table for neopentane, Table 1. This table contains 68 numerical entries; this figure of 68 is thus an upper limit for the complexity of a pentane molecule.

Table 1. Specification Table of Neopentane, C(CH3)4; n = 17

(p = Greek pi; Q = Greek capital theta)

____________________________________________________________
 i    eps.   r      phi            theta        T
____________________________________________________________
 1    C      0      0              0            T0
 2    C      Rcc    0              Qcc/2        T0
 3    C      Rcc    p/4            -Qcc/2       T0
 4    C      Rcc    2p/4           p - Qcc/2    T0
 5    C      Rcc    3p/4           p + Qcc/2    T0
 6    H      RCH    Any1           QCH          T1
 7    H      RCH    Any1 + 2p/3    QCH          T1
 8    H      RCH    Any1 + 4p/3    QCH          T1
 9    H      RCH    Any2           QCH          T2
 10   H      RCH    Any2 + 2p/3    QCH          T2
 11   H      RCH    Any2 + 4p/3    QCH          T2
 12   H      RCH    Any3           QCH          T3
 13   H      RCH    Any3 + 2p/3    QCH          T3
 14   H      RCH    Any3 + 4p/3    QCH          T3
 15   H      RCH    Any4           QCH          T4
 16   H      RCH    Any4 + 2p/3    QCH          T4
 17   H      RCH    Any4 + 4p/3    QCH          T4
____________________________________________________________

The actual structural complexity of neopentane is however considerably lower than 68, for three reasons:

  1. The placement coordinates, c'=6 (4 zeros and ±Qcc/2), should not be counted, because they fix the position of the molecule in the external space, independently of its internal complexity.
  2. More significantly, four coordinates in each pentane molecule have no fixed value (at high enough temperatures) because of the free rotations around 4 of the bonds. This results in four phi angles having indeterminate values, designated "Any1" to "Any4" in Table 1. In other words, the values of these phi angles are different for each molecule in a molecular ensemble as well as at any particular time point. These coordinates cannot be considered as ordered features and, as said, do not contribute to the complexity of the system.
  3. Many entries in the table are redundant, because many of the specifications are either equal or interrelated. These specifications can be correlated by short statements like: ri = Rcc, i=2-5 (or r2-5 = Rcc) for the four methyl carbons (i = 2-5 is a range statement; Rcc is the carbon-carbon bond length). These four ri values thus form, by rule 3, a single c(4) contribution to eq. (1).
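The redundancy described in the third point can be made visible by grouping equal entries of one column of Table 1. The following Python sketch does this for the r column only; it is an illustration of the grouping idea under the assumption that identical table symbols denote identical values, not a reconstruction of the full statement set.

```python
from collections import Counter

# r column of Table 1 for atoms 1-17, with the table symbols kept as strings.
r_column = ["0"] + ["Rcc"] * 4 + ["RCH"] * 12

# Group equal entries: a value shared by k atoms forms one k-fold regularity.
groups = Counter(r_column)

# In this sketch each group is covered by one statement, however large k is.
statements = len(groups)
print(dict(groups), statements)
```

The 17 entries of the column collapse into three groups: the single zero (a placement coordinate), the four Rcc entries of the methyl carbons, and the twelve RCH entries of the hydrogens.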

Points 1 and 2 imply that the maximal complexity Cmax that a pentane analogue with no regular feature can attain is: Cmax = 4n - c' - cran = 68 - 6 - 4 = 58. An examination of Table 1 (and of analogous tables constructed with different coordinate or numbering systems) leads to the conclusion that polar coordinates give the most concise set of instructions for neopentane, comprising the following minimal set of statements:

Cpri (1) is a primary (methyl) carbon; RCC and RCH are the carbon-carbon and carbon-hydrogen bond lengths, and Qcc and QCH are the CCC and CCH bond angles. These 12 statements provide all the information needed to construct a neopentane molecule. Statements 4, 7 and 10 are, however, placement statements, which do not contribute to the complexity of the molecule. The remaining 9 statements are required and lead to a value of C = 9 for the structural complexity of neopentane. If we want to relate this value to the maximal complexity available for a 17 atom system with 4 random coordinates, we obtain a relative complexity Cr = C/Cmax of 9/58, i.e. Cr = 0.155 for neopentane. The four indeterminate "Any" values are, in this case, included in statement 9, which contains in addition the constant 2p/3. Relative complexity values can help to relate the complexities of differently sized systems.
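The arithmetic behind Cmax and Cr for neopentane can be checked with a few lines of Python. The function names `max_complexity` and `relative_complexity` are hypothetical; the numbers are those given in the text.

```python
def max_complexity(n_atoms, placement, random_coords, coords_per_atom=4):
    """Cmax = 4n - c' - c_ran (one type and three spatial coordinates per atom)."""
    return coords_per_atom * n_atoms - placement - random_coords

def relative_complexity(c, c_max):
    """Cr = C / Cmax."""
    return c / c_max

# Neopentane: n = 17 atoms, c' = 6 placement and 4 random coordinates, C = 9.
c_max = max_complexity(n_atoms=17, placement=6, random_coords=4)
c_rel = relative_complexity(9, c_max)
print(c_max, round(c_rel, 3))  # 58 0.155
```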

D. n-PENTANE AND isoPENTANE.

Is neopentane less or more complex than n-pentane (CH3.CH2.CH2.CH2.CH3) or isopentane (CH3.CH2.CH.(CH3)2)? An effort to answer this question was the incentive to analyze the pentanes. To this end, the specification tables for n- and isopentane were set up and examined. The following minimal sets of statements result (the student is encouraged to repeat the exercise):

For n-pentane (for numbering see Fig. 1):

Csec is a secondary (methylene) carbon; RCC' and RCC'' are the distances between primary-secondary and secondary-secondary carbons, respectively. Statements 4, 9 and 15 are placement statements; statements 11 and 12 refer to random coordinates only. This leaves 13 statements to describe the ordered part of the molecule. Consequently, the complexity of n-pentane is C = 13 and Cr = 13/58 = 0.224; n-pentane is thus more complex than neopentane on both absolute and relative scales.

For isopentane:

Statements 5, 12, 15 and 18 are placement statements, and statement 13 is random. On the other hand, statements 11, 17 and 23 have to be counted twice, because they involve two different transformations each (T2, which transforms from C3 to C1, is different from T3 and T4). This results in 21 necessary statements, i.e. the structural complexity of isopentane is C = 21 (Cr = 21/58 = 0.36). Isopentane is thus the most complex of the three noncyclic pentanes, as intuitively expected. It should be noted that both isopentane and n-pentane have a plane of symmetry at certain values of "Anyk". Symmetry relations have so far not been too helpful; further analysis might nevertheless be rewarding.
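The three results can be put side by side with a short Python sketch. It only restates the C values quoted in sections C and D and the common Cmax = 58; the variable names are hypothetical.

```python
# Structural complexities C quoted in the text; Cmax = 58 for each isomer.
C_MAX = 58
complexities = {"neopentane": 9, "n-pentane": 13, "isopentane": 21}

# Relative complexity Cr = C / Cmax, then rank from least to most complex.
relative = {name: c / C_MAX for name, c in complexities.items()}
ranking = sorted(relative, key=relative.get)

for name in ranking:
    print(f"{name}: C = {complexities[name]}, Cr = {relative[name]:.3f}")
```

Since all three isomers share the same Cmax, the ranking on the relative scale is the same as on the absolute scale.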

E. CONCLUSIONS

The examples analyzed demonstrate that:

a. A value for the structural complexity of a typed point system can be assigned. This assignment is so far based on a somewhat arbitrary set of rules. Practice shows however that changing these rules leads to inconsistencies when the same system is analyzed in different ways. A more rigorous mathematical analysis is needed in order to determine whether the assignments are indeed unique and whether algorithms can be devised leading to these assignments uniquely.

b. An important step in the complexity analysis of any system is determining which coordinates are random and which are ordered (whether regular or uniquely specified). The test is in principle simple: Let us examine a certain number of systems in an ensemble, e.g. molecules in a specimen. If a certain coordinate assumes the same value in each molecule of the ensemble, then it belongs to the ordered repertoire. On the cellular level, one can compare for instance tubuli cells in the kidney to red blood cells. Tubuli cells are arranged in a radial fashion around the kidney tubuli, so that they represent an ordered, fairly regular set, and their contribution to the complexity of the organ can be assessed. In contrast, an erythrocyte (red cell) can be found anywhere in the blood stream; its positional coordinates are random, and no complexity value can be assigned to the arrangement of the erythrocytes in the organism. This randomness test has to be applied to each coordinate before the complexity of any element can be assessed.

c. The formalism permits the assignment of a value not only to the complexity, but also to the degree of ordering of a system. This can be done by simply discounting those coordinates which are indeterminate in the system, i.e. by taking the ratio of the ordered coordinates to all internal coordinates (4n - c'); for instance, for each of the 3 noncyclic pentanes the order OMEGA is OMEGA = 58/62. In cyclopentane (n=15), OMEGA = 52/54, because the ring constraints leave only two indeterminate angles, the "pseudorotation" and one torsion angle (Saenger, 1983). The distinction between random and ordered coordinates is important, because not only biosystems, but most real world systems have many indeterminate coordinates; think of the position of a point on the rim of a car wheel, or the number of twigs on a tree. Most real systems are only partially ordered, just like the pentanes. Structural complexity is relevant and assignable only to the ordered part of a system. The distinction between ordered and random coordinates of a system is a fundamental feature of the treatment presented, separating it from all previous treatments of the subject.

d. Structural complexities are extrathermodynamic quantities, because the structural complexities are determined by the stable molecular bonds and not by the occupancy of internal energy levels (except for possible rotational levels associated with random coordinates). Complexity differences persist at zero degrees Kelvin, where all internal energies are in the ground state and where, according to the third law of thermodynamics, all crystalline (ordered) compounds have a physical entropy of zero. Further, while conventionally measured entropy is an extensive property of systems, structural complexity is an intensive property, the complexity of a single pentane molecule being equal to that of a mole.

e. Structural complexity is low for most inanimate systems, but will assume high values in designed systems, i.e. systems which are created with the help of instructions specifying their pattern and composition. In the primitive molecular systems tackled here, instructions are provided by specific catalysts which can direct a chemical reaction towards one isomer (i.e. select one pentane isomer in preference to others). The degree of complexity thus achieved is, as we have seen, not too high. Higher degrees of complexity can be achieved when, in addition to a catalyst (enzyme), template molecules participate, as in DNA or protein biosynthesis. The high degree of complexity attained in the bioworld would be unthinkable without participation of replicable templates. Templates, in contrast to simple catalysts, can store and transmit large amounts of information, and their active presence accounts for the high complexity found in bioorganisms. Even higher degrees of complexity are achieved in artificial systems created by intelligent beings: The creation of a template, blueprint, or design (all synonyms for the present discussion), whether in the mind of the designer or on paper, is an essential step in making really complicated instruments or works of art. We can expect therefore that the concept of structural complexity will reach its full utility in the physical and chemical analysis of templated and designed systems.
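The ensemble test of point b and the order OMEGA of point c can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names, the tolerance `tol` and the three-member "ensemble" of (bond length, bond angle, torsion angle) values are hypothetical illustrations, and the simple max-minus-min comparison ignores the periodicity of angular coordinates.

```python
def is_ordered(values, tol=1e-6):
    """A coordinate is ordered if it has the same value in every ensemble member."""
    return max(values) - min(values) <= tol

def classify(ensemble, tol=1e-6):
    """Label each coordinate of an ensemble given as equal-length tuples."""
    columns = zip(*ensemble)  # one column per coordinate
    return ["ordered" if is_ordered(col, tol) else "random" for col in columns]

def order(n_atoms, placement, random_coords, coords_per_atom=4):
    """OMEGA as a (numerator, denominator) pair: (4n - c' - c_ran, 4n - c')."""
    internal = coords_per_atom * n_atoms - placement
    return internal - random_coords, internal

# Hypothetical ensemble: the first two coordinates agree, the third varies.
ensemble = [(1.54, 109.5, 12.0), (1.54, 109.5, 201.3), (1.54, 109.5, 77.8)]
labels = classify(ensemble)

pentane = order(n_atoms=17, placement=6, random_coords=4)
cyclopentane = order(n_atoms=15, placement=6, random_coords=2)
print(labels, pentane, cyclopentane)
```

The two `order` calls reproduce the ratios 58/62 and 52/54 quoted in point c; the pair is returned unreduced so that it can be compared directly with the text.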


References:

  1. Bennett, C.H. (1990). Entropy and information: How to define complexity in physics and why. In: Complexity, Entropy and Physics of Information, Vol. VII, W.H. Zurek, Ed., Addison Wesley, pp. 137-148.
  2. Chaitin, G.J. (1987). Algorithmic Information Theory. Cambridge University Press.
  3. Dawkins, R. (1987). The Blind Watchmaker. W.W. Norton, New York.
  4. Feistel, R. and Ebeling, W. (1989). Evolution of Complex Systems. Kluwer Academic Publishers, Dordrecht.
  5. Flory, P.J. (1969). The Statistical Mechanics of Chain Molecules. Interscience, N.Y., p. 18 ff; Appendix B, p. 385.
  6. Green, P.B. and Poethig, R.S. (1982). Biophysics of the extension and initiation of plant organs. In: Developmental Order: Its Origin and Regulation, Subtelny, S. and Green, P.B., Eds., Alan R. Liss, New York, pp. 485-510.
  7. Kolmogorov, A.N. (1965). Three approaches to the quantitative definition of information. Problems of Information Transmission, 1: 4-7.
  8. Saenger, W. (1983). Principles of Nucleic Acid Structure. Springer, New York, p. 20.
  9. Solomonoff, R.J. (1964). A formal theory of inductive inference. Information and Control, 7: 1-22; 224-254.
  10. Wicken, J.S. (1987). Evolution, Thermodynamics and Information. Oxford Univ. Press, Oxford.
  11. Wood, W.B. (1980). Quarterly Revs. Biol., 55: 353.
  12. Yagil, G. (1985). On the structural complexity of simple biosystems. J. Theor. Biol., 112: 1-23.
  13. Yagil, G. (1993). Complexity analysis of a protein molecule. In: Proceedings, 1st Europ. Conference on Mathematical and Theoretical Biology, J. Demongeot, Ed., Wuertz Publ., pp. 305-313.
  14. Zurek, W.H. (1990). Algorithmic information content, Church-Turing hypothesis, physical entropy and Maxwell's demon. In: Complexity, Entropy and Physics of Information, Vol. VII, W.H. Zurek, Ed., Addison Wesley, pp. 73-89.