Complexity and Hierarchy: A Level Rule
Gad Yagil*
Dept. of Molecular Cell Biology
The Weizmann Institute, Rehovot, Israel 76100
* Corresponding Author
TEL 972 89342775
FAX 97289344125
email: lcyagil@wiccmail.weizmann.ac.il
Keywords: Algorithmic Complexity; Order; Regularity; Hierarchy; Insulin
I apologize for the multiple use of the letter C in this paper, dictated by conventions and previous publications. Bold C stands for Structural Complexity; plain capital C stands mainly for the chemical element Carbon, but occasionally for the amino acid cysteine (cys); capital C also had to be used in the phrase "Clevel", not implying the carbon atom; lower case c is used to denote the coordinates of a subelement.
Abstract
In this paper the connection between structural complexity and hierarchical organization is examined. The following quantitative rule, connecting complexities evaluated at different hierarchical levels, is offered:
C_{A/C} = C_{A/B} + \sum_j C(j)_{B/C} - C_e + C_{or} - C_x
C_{A/C} is the complexity of an A-level structure evaluated in terms of its C-sublevel components. This level rule is used to evaluate the complexity of the insulin A chain at two different levels. The complexity of the insulin chain at the C (atomic) sublevel is derived from its complexity on the B (amino acid) sublevel and the complexities of the j component amino acids in terms of their C-level elements. The result obtained is the same as that previously evaluated by a direct approach, not using the level rule. It is proposed that the above level rule is applicable to a wide range of actual and virtual hierarchical systems.
1. Introduction
Living systems have a structure that is both complex and hierarchical: complex, because they are composed of numerous interacting subparticles; hierarchical, because the interacting subparticles are organized in a range of levels, each with its specific sublevels. The hierarchical structure of the living world was first recognized by Linné and has been formally treated several times in recent years (1-3). The complex nature of biosystems can be studied at many different levels of hierarchical organization, ranging from the atomic-molecular to the organismic-social. The two concepts are therefore fundamentally related.
Previously we have offered a quantitative framework by which the complexity of a biostructure can be evaluated (4-6). An initial step in that evaluation is to decide in terms of which lower-level units of organization the complexity is to be evaluated. For instance, whether the complexity of a virus is to be evaluated in terms of its protein components, of the amino acids making up these proteins, or possibly of their atomic elements; whether the complexity of a complete organism is to be expressed in terms of its organs, of the tissue types comprising these organs, or of individual cells, and so on.
In this communication I formulate a simple quantitative rule ("The Level Rule") that connects structural complexity evaluated in terms of one particular hierarchical level with that of the next lower level of organization. By reapplication of the rule, the complexity of a particular structure in terms of components of any lower hierarchical level can be evaluated. The formulation demonstrates that complexity and hierarchical structures are tightly interrelated concepts.
2. Structural complexity
Several distinct approaches to the concept of complexity are currently followed. One approach is based on the property of a broad class of differential equations or cellular automata to produce highly intricate solutions. A second approach is based on classical information theory, in particular, the Shannon-Weaver relation. This approach is basically a probabilistic approach which equates complexity with improbability. The approach followed here is based on the algorithmic complexity concept first introduced by Kolmogorov and Chaitin (7). This approach can be applied without probabilistic considerations and is therefore more suitable for systems whose elements are of a discrete predetermined nature, the probability of which may not be unequivocally available. Computational complexity has not been mentioned because the aim here is to evaluate real objects, rather than symbolic algorithms.
The previously offered evaluation of structural complexity (4,6) is based on the simple expression:
C = min { \sum_k [c(k)/k] - c' }    (1)
Here C is the structural complexity, defined as the minimal (min) number of specifications, numerical or color, needed to describe a structure (system) in physical space. The c values are the coordinates of the elements composing the hierarchical level by which structural complexity is to be evaluated, e.g., the coordinates of the amino acids comprising a protein molecule. c(k) is the number of system coordinates sharing a k-fold regularity, where a regularity is a numerical specification repeated k times (e.g., an interatomic distance repeated k = 6 x 6 x 10^{23} times in one mole of crystalline NaCl, or a rib repeated k = 12 times in a human body). c' represents the 5-6 coordinates ("degrees of freedom") needed to orient a structure in the external world. We shall consider here typed point systems, i.e., systems in which each point (atom, cell), in addition to its coordinates, is also of a certain type or "color". We shall consequently operate in a four-dimensional space in which each element has three space coordinates (Cartesian, cylindrical or whatever leads most readily to the minimal set) and one color coordinate for the nature of the element. Additional coordinates, e.g., for time and momentum, can be added in dynamical situations. A set of rules to ensure consistent evaluation has been formulated (4) and is shown in the appendix. A wide range of chemical-biological systems, ranging from simple molecules (6) to complete viruses (8), have been treated by the described procedure.
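The bookkeeping inside Eq. (1) can be illustrated with a minimal sketch. It evaluates the bracketed sum for one given description only; the minimization over all possible descriptions is not attempted. The function name, its arguments and the toy numbers are assumptions made for illustration and do not come from the cited work.

```python
from fractions import Fraction


def structural_complexity(coords_by_regularity, placement=6):
    """Evaluate sum_k c(k)/k - c' for ONE given description.

    coords_by_regularity maps k (the fold of a regularity) to c(k), the
    number of ordered coordinates sharing a k-fold regularity.
    placement is c', the non-ordered placement coordinates.
    The 'min' of Eq. (1) over alternative descriptions is not computed.
    """
    total = Fraction(0)
    for k, c_k in coords_by_regularity.items():
        if k < 1 or c_k < 0:
            raise ValueError("k must be >= 1 and c(k) must be >= 0")
        total += Fraction(c_k, k)
    return total - placement


# Hypothetical toy system: 10 coordinates without any regularity (k = 1)
# and 12 coordinates that all repeat a single value (k = 12); c' = 6.
toy = {1: 10, 12: 12}
toy_complexity = structural_complexity(toy, placement=6)
print(toy_complexity)  # 10 + 12/12 - 6 = 5
```

Exact fractions are used so that a regularity whose coordinate count is not a multiple of k is not silently rounded.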
3. The Level Rule
We shall designate level A as the hierarchical level of the structure for which the complexity is to be evaluated (e.g., a protein) and level B as that of its immediate constituent elements (e.g., amino acids). Each amino acid is in turn composed of a number of atoms, which constitute the level C elements. If we would like to evaluate the A-level protein in terms of the C-level atoms, we can apply the following "Level Rule":
C_{A/C} = C_{A/B} + \sum_j C(j)_{B/C} - C_e + C_{or} - C_x    (2)
C_{A/C} denotes structural complexity at hierarchical level A evaluated in terms of its lower C-level elements; C_{A/B} is the structural complexity of the level A structure in terms of its B-level elements (amino acids) and C(j)_{B/C} is the complexity of the B-level element of the jth type in terms of its C-level components. The summation over j ranges over the distinct types of B-level elements present in A. With the insulin A chain, the example treated in detail next, the summation ranges over the 12 different amino acids present in the A chain. Each j element needs to be specified only once, because individual atom specifications are the same whenever that element (amino acid) appears.
C_e represents the color complexities of the B-level elements. The color of each B element is substituted by the color of one C-level element and has to be subtracted so that it is not counted twice. The corresponding positional coordinates need not be subtracted because these can be arranged to be placement coordinates at the C level and thus are counted only once, as will be seen when the data in Table 1 are discussed. The orientations of the amino acids within the A-level structure are not included in the B-level specifications and need to be specified on the C level. These orientations contribute the C_{or} term in Eq. (2) and add up to 3n specifications (n is the number of B elements participating in the A level). The C_x term corrects for inter-element regularities, i.e., for regularities that appear between the j different C-level elements.
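As a sketch of how Eq. (2) might be applied, and reapplied to descend one further level, the following function simply adds and subtracts the five terms. All numbers and the type labels ("x", "y", "p", "q") are hypothetical and serve only to show the mechanics.

```python
def level_rule(c_upper, c_lower_by_type, c_e, c_or, c_x):
    """Eq. (2): C_{A/C} = C_{A/B} + sum_j C(j)_{B/C} - C_e + C_or - C_x.

    c_upper          complexity of A in terms of its B-level elements
    c_lower_by_type  dict: B-level element type j -> C(j)_{B/C},
                     one entry per distinct type
    c_e, c_or, c_x   color, orientation and inter-element corrections
    """
    return c_upper + sum(c_lower_by_type.values()) - c_e + c_or - c_x


# Step 1 (hypothetical numbers): level A in terms of level C.
c_a_over_c = level_rule(10, {"x": 5, "y": 7}, c_e=4, c_or=3, c_x=2)

# Step 2: reapply the rule, now descending from level C to a level D.
c_a_over_d = level_rule(c_a_over_c, {"p": 3, "q": 4}, c_e=5, c_or=0, c_x=1)

print(c_a_over_c, c_a_over_d)  # 19 20
```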
4. Example: The Insulin A chain.
The application of the level rule is best clarified by a specific example. We shall treat here the native, folded Insulin A chain, composed of 21 amino acids of 12 different colors:
NH3^{+}-Gly-Ile-Val-Glu-Gln-Cys-Cys-Thr-Ser-Ile-Cys-Ser-Leu-Tyr-Gln-Leu-Glu-Asp-Tyr-Cys-Asn-COO^{-}
or in short: GIVEQCCTSICSLYQLEDYCN.
This is the B-level composition of the chain. If we were to consider this sequence as an abstract string, then the complexity of Insulin A would be simply C_{A/B} = 21, because there is no obvious regularity in the amino acid sequence. The amino acid sequence of Insulin A is a typical non-regular, yet ordered, feature, using the terminology previously introduced (only a few proteins show clear regularities, such as the Gly-Pro-X repeat of collagen). We denote the sequence as ordered, rather than random, to stress that whenever a human insulin A chain is encountered it has the same amino acid sequence. It is important to note that we are NOT dealing here with a statistical ensemble, but are counting only elements (or rather coordinates) which have only a single value in their respective phase space.
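The counts quoted for the chain (21 residues, 12 colors, serine at positions 9 and 12) can be checked directly from the one-letter sequence given above. This is only a convenience check; the variable names are illustrative.

```python
from collections import Counter

# One-letter sequence exactly as given in the text.
INSULIN_A = "GIVEQCCTSICSLYQLEDYCN"

residue_counts = Counter(INSULIN_A)   # occurrences of each residue type
n_residues = len(INSULIN_A)           # B-level elements in the chain
n_types = len(residue_counts)         # distinct "colors" on the B level

# 1-based positions of serine (S) in the chain.
serine_positions = [i + 1 for i, aa in enumerate(INSULIN_A) if aa == "S"]

print(n_residues, n_types, serine_positions)  # 21 12 [9, 12]
```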
When operating in real geometrical space, the positional coordinates of each subelement should also be specified. We shall assume that the position of each amino acid is represented by its alpha carbon (Cα). The position of each successive amino acid residue is specified by the Cα-Cα distance and the Cα main chain bond angle, which have the same value for all amino acid residues in a chain (except for proline, not present in Insulin A). This adds just 2 numerical specifications to C_{A/B}. In addition, there are two varying angles, namely the two Ramachandran angles Φ and Ψ (see any textbook describing protein structure, e.g., Darnell et al. (9)). These two angles determine the exact conformation of the folded chain and assume, in the folded chain, a fixed value that differs from residue to residue (except within regular elements such as an alpha helix). These angles add 21*2 = 42 numerical specifications to the C_{A/B} term. (In the unfolded state Φ and Ψ have no fixed value and do not belong to the ordered repertoire of the molecule.) The final complexity of A in terms of its B elements (complexity "on the B level") is therefore C_{A/B} = 21 + 2 + 42 - 6 = 59. The 6 is subtracted for c', the coordinates placing the "system" in external space, e.g., the coordinates of the terminal NH2 residue and the orientation of the vector to the ensuing G residue. These are non-ordered, i.e., they can assume any value in phase space.
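The B-level count just described can be written out as a short sketch. The function name and parameter names are assumptions; the default values are those stated in the text (2 shared backbone specifications, 2 Ramachandran angles per residue, 6 placement coordinates).

```python
def chain_complexity_b_level(n_residues, shared_backbone=2,
                             angles_per_residue=2, placement=6):
    """B-level complexity of a folded chain without sequence regularity.

    colors            one color specification per residue
    shared_backbone   Ca-Ca distance and main chain bond angle, same for all
    angles            Ramachandran angles, fixed but different per residue
    placement         c', the non-ordered placement coordinates
    """
    colors = n_residues
    angles = angles_per_residue * n_residues
    return colors + shared_backbone + angles - placement


c_ab_insulin_a = chain_complexity_b_level(21)
print(c_ab_insulin_a)  # 21 + 2 + 42 - 6 = 59
```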
If we now want to express the structural complexity in terms of the atoms making up the insulin A chain ("on the C level"), we first specify the complexity of each amino acid in terms of its constituent atoms. To obtain, for instance, the contribution of serine residues no. 9 and 12 to the complexity of Insulin, we set up the specification table for the serine residue, shown in Table 1 (at the end of the paper). L-serine has n = 11 atoms; consequently Table 1 has 4*11 = 44 entries, one color coordinate for the composition and three numerical coordinates for the position of each atom (element). The numerical values are represented by symbols for clarity. The only obvious regularity in serine is that of the two beta hydrogens (entries no. 5 and 6), which occupy chemically equivalent positions and therefore have identical numerical specifications. This reduces the number of specifications by 4, to 40. Many entries contain both an R and a Θ value; however, since every R and Θ appears twice, there is only one independent numerical value for each (rules 4 and 9, Appendix).
Two of the φ values are listed as "Any", because of two free rotations: one of the H_O atom around the O-Cβ bond, and another of the OH group around the Cα-Cβ bond. The O and its H, specifically their φ coordinates, thus have no fixed position in space and can assume any φ value. These two φ coordinates therefore do not belong to the ordered repertoire of serine, and a value of 2 has been subtracted from the final complexity value. The exclusion of non-ordered coordinates, i.e., coordinates that assume different values at different time points, is a basic tenet of the proposed formalism, and a distinctive point by which it differs from similar formalisms (10).
In summary, the complexity of the L-serine residue on the C level is C_{B/C} = 44 - 4 - 2 - 6 = 32 units. This value of 32 needs to appear only once in the Insulin A chain on the C level, because the entered values are the same for every serine residue (no. 9 and no. 12 in the sequence). The number 6 was subtracted for c', the six non-ordered placement coordinates. In a previous publication a value of 10 was quoted for the complexity of the serine residue (ref. 11, Table 2, last column). In that case the contributions of Cα, Cβ, Hα and the 4 peptide atoms (NH.CO) were counted as part of a separate peptide backbone unit. The complexity of the peptide unit was evaluated as 22, leading to the same value of 32 for the serine residue.
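The serine count can be retraced with a small sketch, including the cross-check against the earlier side-chain value plus the peptide unit. The function and argument names are illustrative assumptions; the numbers are those given in the text.

```python
def residue_complexity_c_level(n_atoms, redundant, free_rotations,
                               placement=6):
    """C-level complexity of one residue from its specification table.

    Each atom has 4 table entries (1 color + 3 position coordinates).
    redundant       entries repeated by a regularity (counted only once)
    free_rotations  non-ordered coordinates listed as "Any"
    placement       c', the six non-ordered placement coordinates
    """
    entries = 4 * n_atoms
    return entries - redundant - free_rotations - placement


# L-serine: 11 atoms, 4 redundant entries (second beta hydrogen),
# 2 freely rotating phi coordinates.
c_serine = residue_complexity_c_level(n_atoms=11, redundant=4,
                                      free_rotations=2)

# Cross-check: side-chain value of ref. 11 plus the peptide backbone unit.
c_serine_from_ref11 = 10 + 22

print(c_serine, c_serine_from_ref11)  # 32 32
```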
Specification tables similar to Table 1 have been set up for each amino acid. They can be viewed at http://www.weizmann.ac.il/~lcyagil. Since 12 of the 20 amino acids are present in the insulin A chain, only 12 such tables have to be consulted and their complexity values considered. The complexity values of these 12 amino acids are presented in Table 2. The sum of the numbers in Table 2 yields a value of 432 for the \sum_j C(j)_{B/C} term of Eq. (2). The C_e term is 21, compensating for the 21 color values already declared on the B level. The corresponding positional coordinates were also declared at the B level, but were designated as placement coordinates in the tables (marked with an asterisk in Table 1), so that no compensation is necessary. As for C_{or}, the orientations of the amino acid residues within the complete chain need to be specified. The orienting vector nevertheless has the same value for all residues along the peptide chain, so that just a single vector, C_{or} = 3, need be specified.
Finally, the C_x term, which corrects for regularities identified in the complete chain but not present within each individual residue, has to be subtracted. The regular elements are primarily the peptide backbone elements, 22 specifications for each (for serine, entries no. 1-3 and 8-11 in Table 1). These 22 specifications need to be specified just once, not 12 times; we therefore subtract (12-1)*22 = 242 specifications. In addition, the two beta hydrogens are similar for at least 9 amino acids (Gly, Val and Ile excluded), a further reduction by (9-1)*4 = 32 specifications, giving a total inter-subelement correction of C_x = 274. We shall ignore for brevity a few further regularities that may be identified. Altogether, the level rule, Eq. (2), yields for the structural complexity of the folded insulin A chain on the atomic level a value of C_{A/C} = 59 + 432 - 21 + 3 - 274 = 199.
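The full evaluation for the insulin A chain can be retraced numerically from the Table 2 values and the terms derived above. This is a bookkeeping sketch only; the constant names are illustrative, and the count of 9 residue types sharing the beta-hydrogen pattern is taken from the text as stated.

```python
# C(j)_{B/C} for the 12 residue types of the insulin A chain (Table 2).
TABLE_2 = {
    "Gly": 22, "Ile": 44, "Val": 33, "Glu": 36, "Gln": 43, "Cys": 32,
    "Thr": 40, "Ser": 32, "Leu": 40, "Tyr": 37, "Asp": 33, "Asn": 40,
}

C_AB = 59      # B-level complexity of the folded chain
C_E = 21       # color values already declared on the B level
C_OR = 3       # one orienting vector shared by all residues
BACKBONE = 22  # peptide backbone specifications per residue type
BETA_H = 4     # specifications of the second beta hydrogen

n_types = len(TABLE_2)
n_types_with_two_beta_h = 9   # Gly, Val and Ile excluded (as in the text)

sum_bc = sum(TABLE_2.values())
c_x = (n_types - 1) * BACKBONE + (n_types_with_two_beta_h - 1) * BETA_H
c_ac = C_AB + sum_bc - C_E + C_OR - c_x

print(sum_bc, c_x, c_ac)  # 432 274 199
```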
In a previous study the complexity of insulin A was calculated directly, without using the level rule, with a value of C = 221 (ref. 11, Table 3, column 5, summation corrected). The difference is due to: the Ramachandran angles, now considered as ordered (+42); the C_e term (-21), neglected there; the beta hydrogen redundancy included here (-32); and errors in the values for the peptide backbone (-7) and tyrosine (-4). Taking these into consideration, 221 + 42 - 21 - 32 - 7 - 4 = 199, so the two ways of calculating the complexity yield identical results.
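The reconciliation between the earlier direct value and the level-rule value can be laid out as a short tally. The signs of the backbone and tyrosine corrections are an assumption here: they are taken as negative because only that choice makes the two totals agree.

```python
previous_direct_value = 221   # ref. 11, direct evaluation
level_rule_value = 199        # value obtained with Eq. (2)

# Adjustments listed in the text; signs of the last two are inferred.
adjustments = {
    "Ramachandran angles now ordered": +42,
    "C_e term neglected previously": -21,
    "beta-hydrogen redundancy": -32,
    "peptide backbone correction": -7,
    "tyrosine correction": -4,
}

reconciled = previous_direct_value + sum(adjustments.values())
print(reconciled, reconciled == level_rule_value)  # 199 True
```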
5. Virtual systems
The level rule holds of course also for virtual or symbolic systems, i.e., systems that do not exist in real space. A typical example can be a written document which consists of paragraphs, sentences, words, etc. Let us consider this paragraph from sentence 5 on (level A). There are 7 more sentences (level B, C_{A/B} = 7) which in turn have 15, 17, 23, 14, 20, 27 and 23 words respectively (level C). By Equation (2) we have trivially C_{A/C} = 7 + 139 - 7 = 139. Several words, like "the" and "in", are present in most sentences and are also repeated within sentences. These repetitions need not be regarded as a regularity on the B/C level, since the repetitions are coincidental, as for cys in Insulin. C_x is thus zero, as is C_{or}, which are meaningful only in real space. The repetition of "the" makes a difference only when complexity is evaluated in terms of individual letters (hierarchical Level D). The present paragraph can also serve as an example of a system of almost maximal complexity, because most probably no program which can shorten its presentation can be devised. In summary, structural complexity analysis can be performed also on strings, which could be of interest in genomic analysis, linguistics and other disciplines.
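The word-level count for this paragraph can be retraced as follows. The sentence word counts are those quoted in the text; the whitespace tokenizer is a simplifying assumption and is shown only on one sentence of the paragraph.

```python
def count_tokens(sentence):
    """Count whitespace-separated tokens (a deliberately crude word count)."""
    return len(sentence.split())


# Word counts of the 7 sentences, as quoted in the text.
words_per_sentence = [15, 17, 23, 14, 20, 27, 23]

c_ab = len(words_per_sentence)   # 7 sentences: level A in terms of level B
sum_bc = sum(words_per_sentence) # words: level B in terms of level C
c_e = c_ab                       # one "color" per sentence, replaced by words
c_or = 0                         # orientation has no meaning for a string
c_x = 0                          # repeated words treated as coincidental

c_ac = c_ab + sum_bc - c_e + c_or - c_x

# The fourth quoted count (14) corresponds to this sentence.
example = "C_x is thus zero, as is C_or, which are meaningful only in real space."

print(c_ac, count_tokens(example))  # 139 14
```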
6. Conclusion
I hope that the example of insulin, treated in detail, clarifies how the level rule can be applied to derive complexities at one hierarchical level from complexities at other, higher or lower, hierarchical levels. This is straightforward as long as the structural relations between the different levels are quantitatively defined. The level rule accentuates the general principle that the structural complexity and the hierarchical structure of a system are intimately connected; whenever we ask for the complexity of a system we first have to specify in terms of which components we want to have the answer. In that sense, complexity is a fundamental characteristic of emerging systems. For instance, it is clear that the ability of insulin to recognize insulin receptors on target cells can be detected neither in an assembly of its amino acid (B-level) components nor in an assembly of its atomic (C-level) components; this ability is a property of the folded A-level entity alone. The evaluation of the geometrical and functional relationships between hierarchical levels can provide a measure relating the emergent higher-level property with its generating lower-level components.
A main difficulty in assessing complexity is the lack of a unique algorithm to arrive at the minimal description of a structure or pattern. The complexities arrived at here have been obtained by a rule-based procedure, using the rules (appendix) that have proven to give consistent and minimal descriptions in a variety of systems. It should nevertheless not be impossible to devise unique procedures for assessing the complexity of specific classes of systems, as exemplified for molecular systems in the present studies.
As stated previously, the utility of the complexity concept is in its ability to predict instructional requirements for pattern generation; in particular, to predict coding requirements for patterns that have a code or blueprint behind them, whether by intelligent design or just by that blind watchmaker (12). There is, for instance, a correspondence between the length of the biosynthetic pathway to adenine (13 steps, one of the longest in the metabolic table) and the high relative complexity of this molecule (ref. 4; Cr = 0.79). In another case, the complexity of a wing pattern of an African butterfly, Bicyclus anynana, studied intensively by Brakefield et al. (13), was evaluated (6). A numerical fit was found between the number of genes required for generating the observed pattern (13) and the structural complexity of the structure generated, as demonstrated in Table 3. Complexity analysis can thus serve as a theoretical tool for predicting coding requirements for pattern generation, an acute topic of current bioresearch.
References
1. H. A. Simon, The architecture of complexity. Proc. Am. Philos. Soc. 1962, 106: 467-482.
2. M. Polanyi, Life's irreducible structure. Science 1967, 160: 1308-1312.
3. J. Collier, Supervenience and reduction in biological hierarchies. Canad. J. Philos. 1986, 14 (suppl.): 209-234.
4. G. Yagil, On the structural complexity of simple biosystems. J. Theor. Biol. 1985, 112: 1-23.
5. G. Yagil, On the structural complexity of designed systems. In: 1992 Lectures in Complex Systems. L. Nadel and D. Stein, Eds. The Santa Fe Institute and Addison-Wesley, Reading MA, 1993, 519-530.
6. G. Yagil, Complexity and order in chemical and biological systems. Interjournal, 1998, http://interjournal.org, #135.
7. G. J. Chaitin, Algorithmic Information Theory. Cambridge University Press, 1990.
8. G. Yagil, Complexity analysis of a protein molecule. In: Mathematics in Biology and Medicine. V. Capasso and J. Demongeot, Eds. Wuertz Publ., Winnipeg, Canada, 1993, 303-313.
9. J. Darnell, H. Lodish and D. Baltimore, Molecular Cell Biology. Scientific American Books, NY, 1986, p. 61f.
10. M. Gell-Mann and S. Lloyd, Information measures, effective complexity and total information. Complexity 1996, 2: 44-53.
11. G. Yagil, Complexity analysis of a self-assembling versus a template-directed system. Lectures in Artificial Intelligence, 1995, 929: 179-187.
12. R. C. Dawkins, The Blind Watchmaker. Norton, NY-London, 1986.
13. P. M. Brakefield, J. Gates, D. Keys, F. Kesbeke, P. J. Wijngaarden, A. Monteiro and S. B. Carroll, Development, plasticity and evolution of butterfly eyespot patterns. Nature 1996, 384: 236-242.
Appendix:
The following set of rules was adopted to compute structural complexity (4), the criterion being the extent to which they lead to consistent descriptions when different coordinate and numbering systems are used:
1. The unit of structural complexity C is the specification.
2. The assignment of a numerical value to a single spatial coordinate in a system, or the declaration of the color of a point, count as a single specification. A color may be a chemical element, a nucleotide base, a cell type, or any other compositional element, depending on the hierarchical levels chosen.
3. A mathematical or logical statement relating the specifications of several points (a regularity) is counted as a single specification, only if a single numerical value is specified; else it contributes as many specifications as the independent numerical and symbolical values present.
4. A numerical value appearing in more than one statement is counted only once, except when clearly repeated by coincidence.
5. An ordinal number is not counted as a separate specification.
6. A range statement is not counted as a separate specification.
7. A simple numerical coefficient like (-1)^{i} (alternation) is not counted as a separate specification.
8. A transformation of the coordinates adds to the complexity as many statements as new constants are included.
9. A function (sin θ, log r) is not counted separately from its argument.
Table 1. Complexity of the L-Serine Amino acid.
Structure: NH.CH(CH2OH).CO, where the CH carbon is Cα, the CH2 carbon is Cβ, and subscript p marks peptide backbone atoms.

| No. | e | r | φ | z | T^{3} |
|---|---|---|---|---|---|
| 1 | Cα | 0* | 0* | 0* | |
| 2 | Hα | RαH.cos ΘβαH ^{1} | ΦCpHα | RαH.sin ΘβαH | |
| 3 | Cβ | 0* | 0* | Rαβ | |
| 4 | Oβ | RβO.cos ΘαβO | Any1 | RβO.sin ΘαβO | |
| 5 | Hβ1 | RβH.cos ΘαβH | Any1 - φOH | RβH.sin ΘαβH | |
| 6 | Hβ2 | RβH.cos ΘαβH | Any1 + φOH | RβH.sin ΘαβH | |
| 7 | HO | ROH.cos ΘβOH | Any2 | ROH.sin ΘβOH | T2 |
| 8 | Np ^{2} | RαNp | ΦαNp | 0 | T1 |
| 9 | Hp | RαHp | ΦαHp | 0 | T1 |
| 10 | Cp | RαCp.cos Θβαp | 0* | RαCp.sin Θβαp | |
| 11 | Op | RαOp | ΦαOp | 0 | Tp |

e is a color coordinate; r, φ, z are cylindrical position coordinates. A rectangular coordinate system with the origin on Cα is used, with the z axis along the Cα-Cβ bond and with the x axis on the Cβ-Cα-Cp plane.
* represents a placement coordinate (six in all).
^{1} RαH is the Cα-Hα distance, ΘβαH is the Cβ-Cα-Hα angle, etc.
^{2} Subscript p denotes peptide backbone.
^{3} Where T values are listed, the coordinates are given in a coordinate system not based on Cα. The transformation T has to be applied to transform the listed values to the Cα-based system. Thus the coordinates of atom HO are in a system based on atom Cβ and can be transformed to the system based on Cα by a matrix T2. Matrix T1 transforms to coordinates based on the following Cα in the peptide chain; Tp transforms to the peptide plane. None of the transformations includes new numerical constants; they therefore do not contribute to the complexity of serine. For further details see ref. (11).
Table 2. Complexity values of amino acid residues

| Residue | Gly | Ile | Val | Glu | Gln | Cys | Thr | Ser | Leu | Tyr | Asp | Asn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Complexity | 22 | 44 | 33 | 36 | 43 | 32 | 40 | 32 | 40 | 37 | 33 | 40 |

Source: Yagil 1993, ref 11, Table 2; 22 units have been added to each entry for the backbone atoms. The free rotating side chain angles are taken as random coordinates and are excluded. Specifications of individual amino acids can be seen at http://www.weizmann.ac.il/~lcyagil.
Table 3. Complexity of B. anynana wing eyespot pattern.

| Pattern | No. | e | r | θ | Total |
|---|---|---|---|---|---|
| Center | 1 | | R = ~300 | Θ = ~10° | |
| Inner circle | 2 | White | r1 = 15 | Any (0-360°) | |
| 1st ring | 3 | Black | r2 = 50 | Any (0-360°) | |
| 2nd ring | 4 | Orange | r3 = 71 | Any (0-360°) | |
| C | all | 3 | 4 | 1 | C = 8 |

No. of genes responsible, estimated from genetic analysis (Brakefield et al., ref. 13): Male: 6.4-9.9; Female: 7.5-10.8
Comment: A single eyespot is analyzed. A planar coordinate system is used, with the origin at the eyespot center. The first row gives the coordinates of this center from the wing's origin.
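The tally in the last row of Table 3 can be reproduced by counting, per column, the entries that carry a fixed value. The data structure and the convention of skipping empty and "Any" entries are assumptions made for this sketch, chosen to mirror the exclusion of non-ordered coordinates used throughout.

```python
ANY = "Any"  # marks a non-ordered coordinate (any value allowed)

# Rows of Table 3; None marks an empty cell.
eyespot_rows = [
    {"part": "Center",       "e": None,     "r": 300, "theta": 10},
    {"part": "Inner circle", "e": "White",  "r": 15,  "theta": ANY},
    {"part": "1st ring",     "e": "Black",  "r": 50,  "theta": ANY},
    {"part": "2nd ring",     "e": "Orange", "r": 71,  "theta": ANY},
]


def count_specifications(rows, column):
    """Count entries in one column that hold a fixed (ordered) value."""
    return sum(1 for row in rows
               if row[column] is not None and row[column] != ANY)


per_column = {col: count_specifications(eyespot_rows, col)
              for col in ("e", "r", "theta")}
eyespot_complexity = sum(per_column.values())

print(per_column, eyespot_complexity)
# {'e': 3, 'r': 4, 'theta': 1} 8
```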