Complexity and Hierarchy - A Level Rule

 

Gad Yagil*

 

Dept. of Molecular Cell Biology

The Weizmann Institute, Rehovot, Israel 76100

* Corresponding Author

TEL 972 -89-342-775

FAX 972-89-344-125

email: lcyagil@wiccmail.weizmann.ac.il

 

Keywords: Algorithmic Complexity; Order; Regularity; Hierarchy; Insulin

I apologize for the multiple use of the letter C in this paper, dictated by conventions and previous publications. Bold C stands for Structural Complexity; plain capital C stands mainly for the chemical element Carbon, but occasionally for the amino acid cysteine (cys); capital C also had to be used in the phrase "C-level", not implying the carbon atom; lower case c is used to denote the coordinates of a sub-element.

 

 

Abstract

In this paper the connection between structural complexity and hierarchical organization is examined. The following quantitative rule, connecting complexities evaluated at different hierarchical levels, is offered:

$$\mathbf{C}_{A/C} = \mathbf{C}_{A/B} + \sum_j \mathbf{C}(j)_{B/C} - \mathbf{C}_e + \mathbf{C}_{or} - \mathbf{C}_x$$

$\mathbf{C}_{A/C}$ is the complexity of an A-level structure evaluated in terms of its C sub-level components. This level rule is used to evaluate the complexity of the insulin A chain at two different levels. The complexity of the insulin chain at the C (atomic) sub-level is derived from its complexity on the B (amino acid) sub-level and the complexities of the j component amino acids in terms of their C level elements. The result obtained is the same as that previously evaluated by a direct approach, not using the level rule. It is proposed that the above level rule is applicable to a wide range of actual and virtual hierarchical systems.

1. Introduction

Living systems have a structure that is both complex and hierarchical: complex, because they are composed of numerous interacting sub-particles; hierarchical, because the interacting sub-particles are organized in a range of levels, each with its specific sub-levels. The hierarchical structure of the living world was first recognized by Linné, and has been formally treated several times in recent years (1-3). The complex nature of biosystems can be studied at many different levels of hierarchical organization, ranging from the atomic-molecular to the organismic-social. The two concepts are therefore fundamentally related.

Previously we have offered a quantitative framework by which the complexity of a biostructure can be evaluated (4-6). An initial step in that evaluation is to decide in terms of which lower-level units of organization the complexity is to be evaluated. For instance, whether the complexity of a virus is to be evaluated in terms of its protein components, of the amino acids making up these proteins, or possibly of their atomic elements; whether the complexity of a complete organism is to be expressed in terms of its organs, of the tissue types comprising these organs, or of individual cells, and so on.

In this communication I formulate a simple quantitative rule ("The Level Rule") that connects structural complexity evaluated in terms of one particular hierarchical level with that of the next lower level of organization. By reapplication of the rule, the complexity of a particular structure in terms of components of any lower hierarchical level can be evaluated. The formulation demonstrates that complexity and hierarchical structures are tightly interrelated concepts.

2. Structural complexity

Several distinct approaches to the concept of complexity are currently followed. One approach is based on the property of a broad class of differential equations or cellular automata to produce highly intricate solutions. A second approach is based on classical information theory, in particular, the Shannon-Weaver relation. This approach is basically a probabilistic approach which equates complexity with improbability. The approach followed here is based on the algorithmic complexity concept first introduced by Kolmogorov and Chaitin (7). This approach can be applied without probabilistic considerations and is therefore more suitable for systems whose elements are of a discrete predetermined nature, the probability of which may not be unequivocally available. Computational complexity has not been mentioned because the aim here is to evaluate real objects, rather than symbolic algorithms.

The previously offered evaluation of structural complexity (4,6) is based on the simple expression:

$$\mathbf{C} = \min \left\{ \sum_k \frac{c(k)}{k} - c' \right\} \qquad (1)$$

Here $\mathbf{C}$ is the structural complexity, defined as the minimal (min) number of specifications, numerical or color, needed to describe a structure (system) in physical space. The c values are the coordinates of the elements composing the hierarchical level by which structural complexity is to be evaluated, e.g., the coordinates of the amino acids comprising a protein molecule. c(k) is the number of system coordinates sharing a k-fold regularity, where a regularity is a numerical specification repeated k times (e.g., an interatomic distance repeated $k = 6 \times 6 \times 10^{23}$ times in one mole of crystalline NaCl, or a rib repeated k = 12 times in a human body). c' represents the 5-6 coordinates ("degrees of freedom") needed to orient a structure in the external world. We shall consider here typed point systems, i.e., systems in which each point (atom, cell), in addition to its coordinates, is also of a certain type or "color". We shall consequently operate in a four-dimensional space in which each element has three space coordinates (cartesian, cylindrical or whatever leads most readily to the minimal set) and one color coordinate, for the nature of the element. Additional coordinates, e.g., for time and momentum, can be added in dynamical situations. A set of rules to ensure consistent evaluation has been formulated (4) and is shown in the appendix. A wide range of chemical-biological systems, ranging from simple molecules (6) to complete viruses (8), have been treated by the described procedure.
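The counting in Eq. (1) can be illustrated with a minimal Python sketch. The function name `structural_complexity`, the dictionary layout and the toy numbers are assumptions made here for illustration; they are not taken from the original formalism. The sketch evaluates one given description only and does not search for the minimal one.

```python
from fractions import Fraction


def structural_complexity(coords_by_regularity, placement=6):
    """Evaluate sum_k c(k)/k - c' for ONE given description.

    coords_by_regularity maps k (the fold of a regularity) to c(k),
    the number of ordered coordinates sharing that regularity.
    placement is c', the non-ordered placement coordinates.
    """
    total = sum(Fraction(c_k, k) for k, c_k in coords_by_regularity.items())
    return total - placement


# Hypothetical toy structure: 10 unrepeated coordinates (k = 1) plus
# 12 coordinates that repeat a single value 12 times (k = 12), c' = 6.
toy = {1: 10, 12: 12}
print(structural_complexity(toy))  # evaluates 10 + 1 - 6
```

Exact fractions are used so that a regularity class whose size is not a multiple of k would show up as a non-integer result instead of being rounded away.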

3. The Level Rule

We shall designate level A as the hierarchical level of the structure for which the complexity is to be evaluated (e.g. a protein) and level B as that of the immediately composing elements (e.g. amino acids). Each amino acid is in turn composed of a number of atoms, that constitute the level C elements. If we would like to evaluate the A-level protein in terms of the C-level atoms, we can apply the following "Level Rule":

$$\mathbf{C}_{A/C} = \mathbf{C}_{A/B} + \sum_j \mathbf{C}(j)_{B/C} - \mathbf{C}_e + \mathbf{C}_{or} - \mathbf{C}_x \qquad (2)$$

$\mathbf{C}_{A/C}$ denotes structural complexity at hierarchical level A evaluated in terms of its lower C-level elements; $\mathbf{C}_{A/B}$ is the structural complexity of the level A structure in terms of its B-level elements (amino acids) and $\mathbf{C}(j)_{B/C}$ is the complexity of the B-level element of the jth type in terms of its C-level components. The summation over j ranges over those B-level element types present in A. With the insulin A chain, the example treated in detail next, the summation ranges over the 12 different amino acids present in the A chain. Each j element needs to be specified only once, because individual atom specifications are the same whenever that element (amino acid) appears.

$\mathbf{C}_e$ represents the color complexities of the B-level elements. The color of each B element is substituted by the color of one C-level element and has to be subtracted so that it is not counted twice. The corresponding positional coordinates need not be subtracted because these can be arranged to be placement coordinates at the C level and thus are counted only once, as will be seen when the data in Table 1 are discussed. The orientations of the amino acids within the A-level structure are not included in the B-level specifications and need to be specified on the C level. These orientations contribute the $\mathbf{C}_{or}$ term in Eq. (2) and add up to 3n specifications (n is the number of B-elements participating in the A level). The $\mathbf{C}_x$ term corrects for inter-element regularities, i.e. for regularities that appear between the j different C-level elements.
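The bookkeeping of Eq. (2) can be written out as a short Python sketch. The function name `level_rule`, its parameter names and the two-type example are hypothetical and serve only to show which term enters with which sign.

```python
def level_rule(c_ab, c_bc_by_type, c_e, c_or, c_x):
    """C_A/C = C_A/B + sum_j C(j)_B/C - C_e + C_or - C_x  (Eq. 2)."""
    return c_ab + sum(c_bc_by_type.values()) - c_e + c_or - c_x


# Hypothetical two-type system; all numbers are invented for illustration.
example = level_rule(
    c_ab=10,                        # A described by its B-level elements
    c_bc_by_type={"X": 8, "Y": 5},  # each B-level TYPE is counted once
    c_e=4,                          # B-level colors already declared
    c_or=3,                         # orientation of B elements within A
    c_x=2,                          # regularities shared between types
)
print(example)  # evaluates 10 + 13 - 4 + 3 - 2
```

Keying the B/C complexities by type rather than by position makes the "specified only once" condition explicit: a type that occurs several times in A still contributes a single entry.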

4. Example: The Insulin A chain.

The application of the level rule is best clarified by a specific example: We shall treat here the native, folded Insulin A chain, composed of 21 amino acids, of 12 different colors:

NH3-Gly-Ile-Val-Glu-Gln-Cys-Cys-Thr-Ser-Ile-Cys-Ser-Leu-Tyr-Gln-Leu-Glu-Asp-Tyr-Cys-Asn-COO-
or in short: GIVEQCCTSICSLYQLEDYCN.

This is the B-level composition of the chain. If we were to consider this sequence as an abstract string then the complexity of Insulin A would be simply $\mathbf{C}_{A/B}$ = 21, because there is no obvious regularity in the amino acid sequence. The amino acid sequence of Insulin A is a typical non-regular, yet ordered, feature, using the terminology previously introduced (only a few proteins show clear regularities, such as the gly-pro-X repeat of collagen). We denote the sequence as ordered, rather than random, to stress that whenever a human insulin A chain is encountered it has the same amino acid sequence. It is important to note that we are NOT dealing here with a statistical ensemble, but are counting only elements (or rather coordinates) which have only a single value in their respective phase space.

When operating in real geometrical space, the positional coordinates of each sub-element should also be specified. We shall assume that the position of each amino acid is represented by its alpha carbon (Cα). The position of each successive amino acid residue is specified by the Cα-Cα distance and the Cα main chain bond angle, which have the same value for all amino acid residues in a chain (except for proline, not present in Insulin A). This adds just 2 numerical specifications to $\mathbf{C}_{A/B}$. In addition, there are two varying angles, namely the two Ramachandran angles Φ and Ψ (see any textbook describing protein structure, e.g., Darnell et al. (9)). These two angles determine the exact conformation of the folded chain and assume, in the folded chain, a fixed value, but one that is different for each amino acid residue (when not, e.g., in an alpha helix). These angles add 21*2 = 42 numerical specifications to the $\mathbf{C}_{A/B}$ term. (In the unfolded state Φ and Ψ have no fixed value and do not belong to the ordered repertoire of the molecule.) The final complexity of A in terms of its B elements (complexity "on the B level") is therefore: $\mathbf{C}_{A/B}$ = 21 + 2 + 42 - 6 = 59. The 6 is subtracted for c', the coordinates placing the "system" in external space, e.g., the coordinates of the terminal NH2 residue and the orientation of the vector to the ensuing G residue. These are non-ordered, i.e. can assume any value in phase space.
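The B-level tally above can be reproduced directly from the one-letter sequence. The following Python sketch uses variable names chosen here for illustration; the numbers are those quoted in the text.

```python
sequence = "GIVEQCCTSICSLYQLEDYCN"  # one-letter sequence as given in the text

n_residues = len(sequence)           # residues in the A chain
n_types = len(set(sequence))         # different amino acids ("colors")

color_specs = n_residues             # one color per residue, no regularity
backbone_specs = 2                   # shared C-alpha distance and bond angle
rama_specs = 2 * n_residues          # two Ramachandran angles per residue
placement = 6                        # c', non-ordered placement coordinates

c_ab = color_specs + backbone_specs + rama_specs - placement
print(n_residues, n_types, c_ab)
```

The count of distinct letters is not used in the B-level sum itself, but it confirms the 12 residue types over which the summation in Eq. (2) later runs.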

If we now want to express the structural complexity in terms of the atoms making up the insulin A chain ("on the C level"), we specify first the complexity of each amino acid in terms of its composing atoms. To obtain, for instance, the contribution of serine residues no. 9 or 12 to the complexity of Insulin, we set up the specification table for the serine residue, shown in Table 1 (see at the end). L-serine has n = 11 atoms, consequently Table 1 has 4*11 = 44 entries, one color coordinate for the composition and three numerical coordinates for the position of each atom-element. The numerical values are represented by symbols for clarity. The only obvious regularity in serine is that of the two beta hydrogens (entries no. 5 and 6) which occupy chemically equivalent positions and have therefore identical numerical specifications. This reduces the number of specifications to 40. Many entries contain both an R and a Θ value; however, since every R and Θ appears twice, there is only one independent numerical value for each (rules 4 and 9, Appendix).

Two of the φ values have the value of "Any", because of two free rotations, one of the HO atom around the CO-Cβ bond, and another of the OH group around the Cα-Cβ bond. The O and its H, specifically their φ coordinates, thus assume no fixed position in space and can assume any φ value. These two φ coordinates therefore do not belong to the ordered repertoire of serine and 2 has been subtracted from the final complexity value. The exclusion of non-ordered coordinates, i.e. coordinates that assume a different value at different time points, is a basic tenet of the proposed formalism, and a distinctive point by which it differs from similar formalisms (10).

In summary, the complexity of the L-serine residue on the C-level is $\mathbf{C}_{B/C}$ = 44 - 4 - 2 - 6 = 32 units. This value of 32 needs to appear only once in the Insulin A chain on the C level, because the entered values are the same for every serine residue, no. 9 and no. 12 in the sequence. The number 6 was subtracted for c', the six non-ordered placement coordinates. In a previous publication a value of 10 was quoted for the complexity of the serine residue (ref. 11, Table 2, last column). In that case the contributions of Cα, Cβ, Hα and the 4 peptide atoms (-NH.CO-) were counted as part of a separate peptide backbone unit. The complexity of the peptide unit was evaluated as 22, leading to the same value of 32 for the serine residue.
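The serine tally can be checked step by step. This Python sketch only restates the arithmetic of the text; the variable names are illustrative.

```python
n_atoms = 11                   # atoms in the L-serine residue (Table 1)
entries = 4 * n_atoms          # one color + three position coordinates each
beta_h_redundancy = 4          # second beta hydrogen repeats the first
non_ordered = 2                # the two freely rotating "Any" angles
placement = 6                  # c', placement coordinates (asterisks)

c_serine = entries - beta_h_redundancy - non_ordered - placement

backbone_unit = 22             # peptide backbone unit, as quoted from ref. 11
side_chain_part = c_serine - backbone_unit
print(c_serine, side_chain_part)
```

The second printed number is the part left for the side chain once the backbone unit is split off, which is how the earlier value of 10 and the present value of 32 are reconciled.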

Specification tables similar to Table 1 have been set up for each amino acid. They can be viewed at http://www.weizmann.ac.il/~lcyagil. Since 12 of the 20 amino acids are present in the insulin A chain, only 12 such tables have to be consulted and the resulting complexity values considered. The complexity values of these 12 amino acids are presented in Table 2. The sum of the numbers in Table 2 yields a value of 432 for the $\sum_j \mathbf{C}(j)_{B/C}$ term of Eq. (2). The $\mathbf{C}_e$ term is 21 and enters Eq. (2) with a negative sign, compensating for the 21 color values already declared on the B level. The corresponding positional coordinates were also declared at the B level, but were designated as placement coordinates in the tables (marked with an asterisk in Table 1) so that no compensation is necessary. As for $\mathbf{C}_{or}$, the orientations of the amino acid residues within the complete chain need to be specified. The orienting vector nevertheless has the same value for all residues along the peptide chain, so that just a single vector, $\mathbf{C}_{or}$ = 3, need be specified.

Finally, the $\mathbf{C}_x$ term, which corrects for regularities identified in the complete chain but not present within each individual residue, has to be subtracted. The regular elements are primarily the peptide backbone elements, 22 specifications for each (for serine, entries no. 1-3 and 8-11 in Table 1). These 22 specifications need to be specified just once, not 12 times, therefore we have to subtract (12-1)*22 = 242 specifications. In addition, the two beta hydrogens are similar for at least 9 amino acids (gly, val and ile excluded), a further reduction by (9-1)*4 = 32 specifications, giving a total inter-subelement correction of $\mathbf{C}_x$ = 274. We shall ignore for brevity a few further regularities that may be identified. Altogether, the level rule (2) yields for the structural complexity of the folded insulin A chain on the atomic level a value of: $\mathbf{C}_{A/C}$ = 59 + 432 - 21 + 3 - 274 = 199.
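The whole C-level evaluation can be retraced from the sequence and the Table 2 values. In this Python sketch the keying of Table 2 by one-letter code and all variable names are choices made here for illustration; the numerical inputs are those quoted in the text.

```python
sequence = "GIVEQCCTSICSLYQLEDYCN"

# Table 2 values, keyed by one-letter code (the keying is an assumption
# of this sketch; the values are read off Table 2 in order).
table2 = {"G": 22, "I": 44, "V": 33, "E": 36, "Q": 43, "C": 32,
          "T": 40, "S": 32, "L": 40, "Y": 37, "D": 33, "N": 40}

types_present = set(sequence)

c_ab = 59                                       # B-level complexity
sum_bc = sum(table2[t] for t in types_present)  # each type counted once
c_e = len(sequence)                             # 21 colors declared on B level
c_or = 3                                        # one shared orienting vector

backbone = 22                                   # backbone specs per residue type
with_beta_h = types_present - set("GVI")        # gly, val, ile excluded (text)
c_x = (len(types_present) - 1) * backbone + (len(with_beta_h) - 1) * 4

c_ac = c_ab + sum_bc - c_e + c_or - c_x
print(sum_bc, c_x, c_ac)
```

Deriving the two correction counts from the sets, rather than typing 12 and 9, makes it easy to see how the result would shift for a chain with a different residue composition.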

In a previous study the complexity of insulin A was calculated directly, without using the level rule, with a value of $\mathbf{C}$ = 221 (ref. 11, Table 3, column 5, summation corrected). The difference is due to: the Ramachandran angles, considered now as ordered (+42); the $\mathbf{C}_e$ term (-21), neglected there; the beta hydrogen redundancy included here (-32); and errors in the values for the peptide backbone (-7) and tyrosine (-4). Taking these into consideration, the two ways to calculate the complexity yield identical results.
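That the listed corrections account exactly for the gap between the two values can be verified in a few lines of Python. The dictionary labels are paraphrases introduced here; the numbers are those given in the text.

```python
previous_value = 221  # direct evaluation quoted from ref. 11

corrections = {
    "Ramachandran angles now counted as ordered": +42,
    "C_e term, neglected in the earlier study": -21,
    "beta hydrogen redundancy": -32,
    "peptide backbone value": -7,
    "tyrosine value": -4,
}

reconciled = previous_value + sum(corrections.values())
print(sum(corrections.values()), reconciled)
```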

5. Virtual systems

The level rule holds of course also for virtual or symbolic systems, i.e. systems that do not exist in real space. A typical example can be a written document which consists of paragraphs, sentences, words, etc. Let us consider this paragraph from sentence 5 on (level A). There are 7 more sentences (level B, CA/B = 7) which in turn have 15, 17, 23, 14, 20, 27 and 23 words respectively (level C). By Equation (2) we have trivially CA/C = 7 + 139 - 7 = 139. Several words, like "the" and "in", are present in most sentences and are also repeated within sentences. These repetitions need not be regarded as a regularity on the B/C level, since the repetitions are coincidental, as for cys in Insulin. Cx is thus zero, as is Cor, which are meaningful only in real space. The repetition of "the" makes a difference only when complexity is evaluated in terms of individual letters (hierarchical Level D). The present paragraph can also serve as an example of a system of almost maximal complexity, because most probably no program which can shorten its presentation can be devised. In summary, structural complexity analysis can be performed also on strings, which could be of interest in genomic analysis, linguistics and other disciplines.
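The paragraph example can be retraced in Python. The word counts are those quoted in the paragraph; counting words by splitting on whitespace is an assumption of this sketch, checked here against one of the sentences.

```python
word_counts = [15, 17, 23, 14, 20, 27, 23]  # as quoted in the paragraph

c_ab = len(word_counts)     # 7 sentences, no regularity among them
sum_bc = sum(word_counts)   # words, each sentence specified once
c_e = len(word_counts)      # sentence "colors" already declared on level B
c_or = 0                    # no orientation in a symbolic system
c_x = 0                     # repeated words treated as coincidental

c_ac = c_ab + sum_bc - c_e + c_or - c_x

# Whitespace tokenization reproduces the quoted count for one sentence.
sample = "Cx is thus zero, as is Cor, which are meaningful only in real space."
print(c_ac, len(sample.split()))
```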

6. Conclusion

I hope that the example of insulin, treated in detail, clarifies how the level rule can be applied to derive complexities at one hierarchical level from complexities at other, higher or lower hierarchical levels. This is straightforward as long as the structural relations between the different levels are quantitatively defined. The level rule accentuates the general principle that the structural complexity and the hierarchical structure of a system are intimately connected; whenever we ask for the complexity of a system we have to specify first in terms of which components we want to have the answer. In that sense, complexity is a fundamental characteristic of emerging systems. For instance, it is clear that the ability of insulin to recognize insulin receptors on target cells cannot be detected either in an assembly of its amino-acid (B-level) or of its atomic (C-level) components; this ability is a property of the folded A-level entity alone. The evaluation of the geometrical and functional relationships between hierarchical levels can provide a measure relating the emergent higher level property with its generating lower level components.

A main difficulty in assessing complexity is the lack of a unique algorithm to arrive at the minimal description of a structure or pattern. The complexities arrived at here have been obtained by a rule-based procedure, using the rules (appendix) that have proven to give consistent and minimal descriptions in a variety of systems. It should nevertheless not be impossible to devise unique procedures for assessing the complexity of specific classes of systems, as exemplified for molecular systems in the present studies.

As stated previously, the utility of the complexity concept is in its ability to predict instructional requirements for pattern generation. In particular, to predict coding requirements for patterns that have a code or blueprint behind them, whether by intelligent design, or just by that blind watchmaker (12). There is, for instance, a correspondence between the length of the biosynthetic pathway to adenine (13 steps, one of the longest in the metabolic table) and the high relative complexity of this molecule (ref. 4; Cr = 0.79). In another case, the complexity of a wing pattern of an African butterfly, Bicyclus anynana, studied intensively by Brakefield et al. (13), was evaluated (6). A numerical fit was found between the number of genes required for generating the observed pattern (13) and the structural complexity of the structure generated, as demonstrated in Table 3. Complexity analysis can thus serve as a theoretical tool for predicting coding requirements for pattern generation, an acute topic of current bioresearch.

 

 

 

References

1. H. A. Simon, The architecture of complexity. Proc. Am. Philos. Soc. 1962, 106: 467-482.

2. M. Polanyi, Life's irreducible structure. Science 1967, 160: 1308-1312.

3. J. Collier, Supervenience and reduction in biological hierarchies. Canad. J. Philos. 1986, 14 (suppl.): 209-234.

4. G. Yagil, On the structural complexity of simple biosystems. J. Theor. Biol. 1985, 112: 1-23.

5. G. Yagil, On the structural complexity of designed systems. In: 1992 Lectures in Complex Systems. L. Nadel and D. Stein, Eds. The Santa Fe Institute and Addison-Wesley, Reading MA, 1993, 519-530.

6. G. Yagil, Complexity and order in chemical and biological systems. Interjournal, 1998, http://interjournal.org, #135.

7. G. J. Chaitin, Algorithmic Information Theory. Cambridge University Press, 1990.

8. G. Yagil, Complexity analysis of a protein molecule. In: Mathematics in Biology and Medicine. V. Capasso and J. Demongeot, Eds. Wuertz Publ., Winnipeg, Canada, 1993, 303-313.

9. J. Darnell, H. Lodish and D. Baltimore, Molecular Cell Biology. Scientific American Books, NY, 1986, p. 61f.

10. M. Gell-Mann and S. Lloyd, Information measures, effective complexity and total information. Complexity 1996, 2: 44-53.

11. G. Yagil, Complexity analysis of a self-assembling versus a template directed system. Lectures in Artificial Intelligence, 1995, 929: 179-187.

12. R. C. Dawkins, The Blind Watchmaker. Norton, NY - London, 1986.

13. P. M. Brakefield, J. Gates, D. Keys, F. Kesbeke, P. J. Wijngaarden, A. Monteiro and S. B. Carroll, Development, plasticity and evolution of butterfly eyespot patterns. Nature 1996, 384: 236-242.

 

 

 

 


 

Appendix:

The following set of rules was adopted to compute structural complexity (4), the criterion being the extent they lead to consistent descriptions using different coordinate and numbering systems:

1. The unit of structural complexity C is the specification.

2. The assignment of a numerical value to a single spatial coordinate in a system, or the declaration of the color of a point, count as a single specification. A color may be a chemical element, a nucleotide base, a cell type, or any other compositional element, depending on the hierarchical levels chosen.

3. A mathematical or logical statement relating the specifications of several points (a regularity) is counted as a single specification, only if a single numerical value is specified; else it contributes as many specifications as the independent numerical and symbolical values present.

4. A numerical value appearing in more than one statement is counted only once, except when clearly repeated by coincidence.

5. An ordinal number is not counted as a separate specification.

6. A range statement is not counted as a separate specification.

7. A simple numerical coefficient like $(-1)^i$ (alternation) is not counted as a separate specification.

8. A transformation of the coordinates adds to the complexity as many statements as new constants are included.

9. A function (sin θ, log r) is not counted separately from its argument.

 

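Table 1. Complexity of the L-Serine Amino acid.

-NH.CH(CH2OH).CO- (the three carbons are labelled α, β and p, in that order)

| No. | e | r | φ | z | T ³ |
|---|---|---|---|---|---|
| 1 | Cα | 0* | 0* | 0* | - |
| 2 | Hα | $R_{\alpha H}\cos\Theta_{\beta\alpha H}$ ¹ | $\Phi_{C_p H_\alpha}$ | $R_{\alpha H}\sin\Theta_{\beta\alpha H}$ | - |
| 3 | Cβ | 0* | 0* | $R_{\alpha\beta}$ | - |
| 4 | Oβ | $R_{\beta O}\cos\Theta_{\alpha\beta O}$ | Any1 | $R_{\beta O}\sin\Theta_{\alpha\beta O}$ | - |
| 5 | Hβ1 | $R_{\beta H}\cos\Theta_{\alpha\beta H}$ | Any1 - $\varphi_{OH}$ | $R_{\beta H}\sin\Theta_{\alpha\beta H}$ | - |
| 6 | Hβ2 | $R_{\beta H}\cos\Theta_{\alpha\beta H}$ | Any1 + $\varphi_{OH}$ | $R_{\beta H}\sin\Theta_{\alpha\beta H}$ | - |
| 7 | HO | $R_{OH}\cos\Theta_{\beta OH}$ | Any2 | $R_{OH}\sin\Theta_{\beta OH}$ | T2 |
| 8 | Np ² | $R_{\alpha N_p}$ | $\Phi_{\alpha N_p}$ | 0 | T-1 |
| 9 | Hp | $R_{\alpha H_p}$ | $\Phi_{\alpha H_p}$ | 0 | T-1 |
| 10 | Cp | $-R_{\alpha C_p}\cos\Theta_{\beta\alpha p}$ | 0* | $R_{\alpha C_p}\sin\Theta_{\beta\alpha p}$ | - |
| 11 | Op | $R_{\alpha O_p}$ | $\Phi_{\alpha O_p}$ | 0 | Tp |

e is a color coordinate; r, φ, z are cylindrical position coordinates. A rectangular coordinate system with the origin on Cα is used, with the z axis along the Cα-Cβ bond and with the x axis on the Cβ-Cα-Cp plane.

* represents a placement coordinate (six in all).

¹ $R_{\alpha H}$ is the Cα-Hα distance. $\Theta_{\beta\alpha H}$ is the Cβ-Cα-Hα angle, etc.

² Subscript p denotes peptide backbone.

³ Where T values are listed, the coordinates are given in a coordinate system not based on Cα. The transformation T has to be applied to transform the listed values to the Cα based system. Thus the coordinates of atom HO are in a system based on atom Cβ and can be transformed to the system based on Cα by a matrix T2. Matrix T-1 transforms to coordinates based on the following Cα in the peptide chain; Tp, to the peptide plane. None of the transformations include new numerical constants and therefore they do not contribute to the complexity of serine. For further details see ref. (11).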

Table 2. Complexity values of amino acid residues

| Gly | Ile | Val | Glu | Gln | Cys | Thr | Ser | Leu | Tyr | Asp | Asn |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 22 | 44 | 33 | 36 | 43 | 32 | 40 | 32 | 40 | 37 | 33 | 40 |

Source: Yagil 1993, ref 11, Table 2; 22 units have been added to each entry for the backbone atoms. The free rotating side chain angles are taken as random coordinates and are excluded. Specifications of individual amino acids can be seen at http://www.weizmann.ac.il/~lcyagil.

 

Table 3. Complexity of B. anynana wing eyespot pattern.

| Pattern | No. | e | r | θ | Total |
|---|---|---|---|---|---|
| Center | 1 | - | R = ~300 | Θ = ~10° | |
| Inner circle | 2 | White | r1 = 15 | Any (0-360°) | |
| 1st ring | 3 | Black | r2 = 50 | Any (0-360°) | |
| 2nd ring | 4 | Orange | r3 = 71 | Any (0-360°) | |
| $\mathbf{C}$ | all | 3 | 4 | 1 | $\mathbf{C}$ = 8 |

No. of genes responsible, estimated from genetic analysis (Brakefield et al., ref. 13): male 6.4 - 9.9; female 7.5 - 10.8.

Comment: A single eyespot is analyzed. A planar coordinate system is used, with the origin at the eyespot center. The first row gives the coordinates of this center from the wing's origin.
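The bottom row of Table 3 can be recomputed by counting the ordered specifications column by column. The row encoding below (with `None` for an absent or non-ordered entry) is a convention of this Python sketch, not of the original table.

```python
# Table 3 rows: (pattern, color, radial specification, angular specification).
# None marks an absent color or an "Any" angle, which is not ordered.
rows = [
    ("center", None, "R", "Theta"),
    ("inner circle", "white", "r1", None),
    ("1st ring", "black", "r2", None),
    ("2nd ring", "orange", "r3", None),
]

colors = sum(1 for _, c, _, _ in rows if c is not None)
radial = sum(1 for _, _, r, _ in rows if r is not None)
angular = sum(1 for _, _, _, a in rows if a is not None)
total = colors + radial + angular

# Gene-number ranges quoted under the table.
male, female = (6.4, 9.9), (7.5, 10.8)
in_both = male[0] <= total <= male[1] and female[0] <= total <= female[1]
print(colors, radial, angular, total, in_both)
```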