Complexity and Order in Chemical and Biological Systems

In: "Unifying themes in complex systems", Y. Bar-Yam Ed., Perseus books, 2000, PP. 645-654.

By: Gad Yagil
Dept. of Molecular Cell Biology, The Weizmann Institute of Science,
Rehovot, Israel 76100
lcyagil@wiccmail.weizmann.ac.il


Comments:

1. Greek letters are spelled out in the text:
pai(p); fi(f); epsilon(e); SIGMA(S); TETA(Q); teta(q); OMEGA(W);
2. In chemical compounds (e.g. H2O) numbers are not in subs.

Abstract

The complexity of structurally defined objects in the chemical and biological context is calculated using a previously defined formalism. Structural complexity is defined as the minimal number of numerical and compositional specifications needed to describe the ordered features of an object. The formalism is applied to a set of simple one- to four-atom molecules, including H2, HCl, H2O and H2O2. H2O2 is shown to be the prototype of partially ordered systems. Order is defined as the ratio of the ordered to the total coordinates of an object or sequence. Ordered coordinates are discernible from non-ordered ones by a proposed reproducibility criterion. Partial order is the dominant feature in most bio- and man-made systems, and the proposed formalism is capable of treating such systems. This is demonstrated in the last part, where the formalism is applied to a biostructure, the wing eyespot patterns of an African butterfly, B. anynana, and is shown to correctly predict the number of genes specifying that structure.

Introduction

Biosystems differ from simpler chemical and physical systems in that they contain a huge arsenal of coded instructions embodied in their genetic material. The complex behavior of biosystems is a direct result of this special feature, which makes it difficult to predict the responses of a bioentity to changes in external conditions based merely on the local state of its various non-instructional components. Understanding the way the genetic material is decoded and expressed in response to a change in external conditions is essential to the understanding of cellular and organismic behavior.

How can genetic instruction processing be incorporated into a quantitative description of biosystems? On the experimental level, the detailed interactions between the genetic material and surrounding components have to be studied and quantified (see e.g. Yagil, 1975; Schneider, 1991; Westerhoff, 1994). On the conceptual level, a formalism capable of incorporating the special properties of template-encoded instructions and of their mode of implementation must be built. To serve as templates, molecules have to be highly ordered, in the sense that each coding element (nucleotide base) has to be exactly at its specified position, with any change being potentially lethal. DNA and the bioproperties directed by it are, therefore, unlikely to be amenable to the techniques of nonlinear dynamics or of simple cellular automata. The concept of AC (Algorithmic Complexity), which is capable of dealing with highly ordered (or 'structured', Crutchfield, 1994) systems, offers a more suitable approach to the understanding of instructive systems. In this paper we extend a formalism previously proposed for assessing the structural complexity of molecular systems (Yagil, 1985, 1993, 1995). Simple molecules are treated here for the first time and serve to illustrate the method; an application to a particular biopattern is described in the last part, demonstrating the utility of the concept. A similar concept of complexity has recently been offered (Lloyd and Gell-Mann, 1996).

Order

The definition of biocomplexity is tightly connected with the concept of order. An ordered system can be defined as a system (string, object) in which each element occupies the same coordinates in every specimen of the system. The structure most often associated with order is that of a crystal. A crystal is, however, not a good paradigm for order, because it is also highly regular, which is to say that its elements (atoms) are not only located at specified locations but also that these locations manifest repetitious spatial relations between them. These regular, repetitious features can be specified by a short list of numerical statements. Regularity is, however, not a necessary condition for a structure to be ordered. Here are three examples of sequences that are devoid of any regularity but are still highly ordered, in the sense that displacing any of their elements could annihilate their meaning or function:

  1. The sequence of a DNA molecule.
  2. The First Verse of the Bible: "Bereshit Bara Elohim et Hashamaim ve'et Haarets" (In the beginning God created the heaven and the earth).
  3. The often mentioned second hundred digits of Pai (p).

What is common to the three strings, or sequences, is that all three have a governing principle or design behind them. The necessity to keep the prescribed order is clear to any person who understands the principle or design which governs each sequence and knows the code employed. The knowledge of the principle or design makes these ordered strings reproducible, which is to say that the strings can be reproduced either by invoking the principle behind them, or by reconstructing them according to the known design. In the case of Pai, the generating function will serve as the guiding principle; the story which the original writer of the Bible had in mind can be considered as the design of the First Verse and the complementary strand of the DNA (perfected by the blind watchmaker, Dawkins, 1986) can serve as the design or template for the DNA sequence.

DNA is a particularly good paradigm for ordered systems, because there are very few regularities in DNA. I hope that even a non-biologist realizes that whenever a DNA region is sampled from any tissue of a certain human, or even from different humans (ignoring about 1% "polymorphisms"), one and the same sequence will be obtained. This is in sharp contrast to an ideal gas, where the positions and momenta of each molecule have no fixed values; nor have the letters in the proverbial text printed by a monkey. Both systems have no detailed design behind them; their coordinates are irreproducible, and both are consequently non-ordered systems. I am avoiding the word "random", because it is currently used to characterize both ordered and non-ordered strings. It nevertheless seems somehow inappropriate to call the sequence of any DNA "a random string" in the everyday meaning of that word.

The reproducibility criterion

The property of reproducibility offers a criterion to determine whether a system (string, object) is ordered or not. This can be done by sampling an ensemble of the strings or objects and examining the positions/coordinates of each sampled member (specimen). If the value of every coordinate is the same in every (or most) sampled members, then the system members are ordered. For our three strings, the reproducibility test can be done by picking up another Bible, or by consulting a Bible scholar, in the case of the First Verse; by performing a DNA sequencing analysis on a relative; or by regenerating Pai by the well-known series. Only a sequence or object which passes this reproducibility criterion can be considered as ordered (partially ordered systems will be considered below). This was articulated in detail, because it is a basic tenet of our formalism that structural complexity can be attributed only to the ordered coordinates of a system. For non-ordered coordinates, only macro variables like entropy can be evaluated. In many disciplines, ranging from molecular biology to man-made systems, it is the ordered micro-structure which is responsible for the inner working of these systems and for their complex features.
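
The reproducibility test described above can be sketched in code. The following Python fragment is only an illustration and is not part of the original formalism: the function name `reproducible_positions`, the `min_fraction` parameter (standing in for "every (or most) sampled members") and the toy ensemble are assumptions made for this example.

```python
from collections import Counter
from typing import List, Sequence


def reproducible_positions(ensemble: Sequence[Sequence], min_fraction: float = 1.0) -> List[bool]:
    """Flag each position as ordered (True) or non-ordered (False).

    A position counts as ordered when its most common value occurs in at
    least `min_fraction` of the sampled specimens. All specimens are
    assumed to have the same length.
    """
    n_specimens = len(ensemble)
    n_positions = len(ensemble[0])
    flags = []
    for i in range(n_positions):
        values = Counter(specimen[i] for specimen in ensemble)
        top_count = values.most_common(1)[0][1]
        flags.append(top_count / n_specimens >= min_fraction)
    return flags


# Toy ensemble (hypothetical): the first three positions are reproducible,
# the last one differs in every specimen.
samples = ["ACGT", "ACGA", "ACGC"]
print(reproducible_positions(samples))  # [True, True, True, False]
```

A string is fully ordered in this sketch only when every flag is True; a mixture of True and False flags corresponds to the partially ordered systems considered below.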

Structural complexity of point systems

Chemical molecules provide a class of systems the structural complexity of which can be readily calculated. Molecules can be considered as typed point systems, i.e. systems or objects the elements of which consist of points with a different type or "color" for each point. The spatial position of each element (atom) within a molecule is customarily specified by numerical values in a suitable coordinate system. Three spatial coordinates plus one color coordinate span a 4-dimensional space. In dynamical situations, time can be added as a fifth coordinate.

The coordinates of each element can now be divided into two classes: ordered coordinates, i.e. coordinates which obey the reproducibility criterion defined above and assume the same numerical value in every member of an ensemble of these molecules, and non-ordered coordinates, which do not have a fixed value and therefore cannot contribute to the evaluated complexity. The ordered coordinates can be further divided into uniquely specified ones and regular ones. The uniquely specified coordinates (e.g. the bases in the DNA example, or the atoms in a simple chiral molecule) contribute a full unit each to the complexity. The regular coordinates can be compressed by a mathematical or logical expression and will consequently contribute less than a unit, in analogy to the shorter program of algorithmic complexity. Structural complexity may thus be written:

C = min { SIGMA_k [c(k)/k] - c' }

where c(k) is the number of ordered coordinates sharing a k-fold regularity (k = 1 for uniquely specified coordinates) and c' is the number of specifications necessary to place the system in an external framework (normally 6).
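
The expression above can be evaluated mechanically once a description has been chosen. The sketch below is a hedged illustration, not the author's procedure: the function names and the dictionary representation of c(k) are assumptions, and the minimum is taken only over the candidate descriptions supplied by the caller, since no systematic way of finding the minimal description is claimed. The numbers used are those the text gives for the hydrogen molecule (8 coordinates, 5 placement coordinates, one twofold color regularity).

```python
from fractions import Fraction
from typing import Dict, Iterable


def complexity_of_description(shared: Dict[int, int], placement: int) -> Fraction:
    """Evaluate SIGMA_k [c(k)/k] - c' for one description.

    `shared` maps k (the fold of a regularity; k = 1 means unique) to c(k),
    the number of ordered coordinates sharing a k-fold regularity.
    """
    total = sum((Fraction(c, k) for k, c in shared.items()), Fraction(0))
    return total - placement


def structural_complexity(descriptions: Iterable[Dict[int, int]], placement: int) -> Fraction:
    """Minimum over the candidate descriptions supplied by the caller."""
    return min(complexity_of_description(d, placement) for d in descriptions)


# H2 as counted in the text: 8 coordinates, of which c' = 5 are placement.
no_regularity = {1: 8}        # every coordinate treated as unique
shared_color = {1: 6, 2: 2}   # the two color coordinates share a twofold regularity
print(structural_complexity([no_regularity, shared_color], placement=5))  # 2
```

Exact fractions are used because a k-fold regularity contributes c(k)/k, which need not be an integer.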

The next step in the analysis of a molecule is to identify its regularities in a chosen coordinate system, each regularity contributing in inverse proportion to the number of elements sharing that regularity (the more regularities, the less complex the system). This procedure can be repeated in other coordinate systems. The complexity value to be assigned is of course that in the coordinate system yielding the minimal value of C. There is at present no systematic way to determine the minimal set, and there seems to be no general solution (Chaitin, 1990; Crutchfield, 1994). We have, however, formulated a set of rules, shown in the Appendix, which enables a consistent identification of the regularities present in typed point systems (Yagil, 1985; 1993). We can now proceed to illustrate the procedure on a set of simple molecules.

The Simple Molecules:

In Table 1 the complexity of several of the simplest molecules is presented. The simplest system is obviously a single atom, represented in the Table by helium. Once it is placed in the external world (c' = 3), only its color epsilon ("He") needs to be specified. Consequently both the maximal complexity Cmax and the actual complexity C have a value of unity.

For hydrogen, a single color specification, H, and the interatomic distance (RHH) are all that is needed to completely describe the molecule. The complexity of H2 is consequently C = 2. The maximal complexity Cmax = c - c' is nevertheless 3, because 2 colored points in a 4-dimensional space have 8 coordinates (2nd column), of which c' = 5 (3rd column) are placement coordinates. The complexity is less than 3 because of one twofold regularity (both atoms are H), yielding a relative complexity of Cr = C/Cmax = 2/3 (last column).

In the next molecule, hydrogen chloride, the single regularity of hydrogen is removed. To specify HCl, the two unique colors and one distance, RHCl, are needed. The complexity, C = 3, is therefore equal to the maximal complexity of a two-atom system, with Cr = 1.

For triatomic molecules both linear (c' = 5) and nonlinear (c' = 6) structures are possible. Several molecular regularities are manifested, some of which are represented in the Table. I hope the way complexity is derived is obvious by now. The maximally complex triatomic molecule (no regularities; Cr = 1) is represented by HCN for linear molecules and by HOCl for the nonlinear ones.
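
The arithmetic behind Cmax and Cr for the molecules discussed so far can be retraced with a short sketch. It assumes, as the text does, four coordinates per atom (three spatial plus one color); the function names are illustrative. The values of C for He, H2 and HCl are those quoted in the text, while the Cmax values for HCN and HOCl are derived here from c' = 5 and c' = 6 and are not quoted from Table 1.

```python
from fractions import Fraction


def max_complexity(n_atoms: int, placement: int) -> int:
    """Cmax = c - c', with c = 4 coordinates (3 spatial + 1 color) per atom."""
    return 4 * n_atoms - placement


def relative_complexity(c: int, n_atoms: int, placement: int) -> Fraction:
    """Cr = C / Cmax."""
    return Fraction(c, max_complexity(n_atoms, placement))


# (number of atoms, placement coordinates c', complexity C) as given in the text.
quoted = {"He": (1, 3, 1), "H2": (2, 5, 2), "HCl": (2, 5, 3)}
for name, (n_atoms, placement, c) in quoted.items():
    print(name, max_complexity(n_atoms, placement),
          relative_complexity(c, n_atoms, placement))

# Triatomic cases: only Cmax is computed. Since the text gives Cr = 1 for
# both, C would equal these values (derived here, not quoted from Table 1).
print("HCN", max_complexity(3, 5))    # linear, c' = 5
print("HOCl", max_complexity(3, 6))   # nonlinear, c' = 6
```

Under these assumptions the loop reproduces Cmax = 1, 3 and 3 and Cr = 1, 2/3 and 1 for He, H2 and HCl respectively.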

All the molecules treated so far adhere to the reproducibility criterion formulated above: whenever we observe an HOCl molecule, its inner coordinates will have the same numerical values as stated in Table 1 (ignoring excited states). HOCl is therefore a completely ordered structure. We shall now proceed to the next molecule, which already has a non-ordered feature.

In Table 2 the specifications of the four-atom hydrogen peroxide molecule are shown. In H2O2 one of the O-H bonds can rotate freely against the other O-H bond. In the cylindrical coordinate system chosen (the z axis passing through the two oxygen atoms) this means that the fi coordinate of H4 can assume any value between 0 and 360 degrees. This is expressed in Table 2 by assigning the value "Any" to the fi4 angle. In other words, only nine of the 4 x 4 - 6 = 10 non-placement spatial and color coordinates are ordered and contribute to the complexity, while the tenth one is non-ordered. H2O2 would therefore not fully obey the reproducibility criterion: nine out of its ten coordinates would pass the test, and only the tenth coordinate, which assumes a different value in each molecule in an ensemble of H2O2 molecules, would not.

Hydrogen peroxide can thus serve as a prototype of the widespread phenomenon of partly ordered systems. Most biosystems are partly ordered systems; freely rotating side chains of proteins or the flapping wings of a bird are typical examples. Returning to our prototype, the recognition that only nine out of ten coordinates are ordered permits a quantitative definition of order, namely: the order OMEGA of a system is the ratio of its ordered to its total non-placement coordinates:

OMEGA = c_ord / (c_tot - c');
(c_tot - c' = c_ord + c_nord; nord = non-ordered)
in H2O2:   OMEGA = 9/10 = 0.9
i.e. the hydrogen peroxide molecule is a 90% ordered system. The proposed formalism thus provides a simple way to discern between ordered and non-ordered parts of a system, at least in those cases where a separation is possible.
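
The definition of order can be checked numerically. The following is a minimal sketch; the function name `order` and its argument names are assumptions, and the numbers are the ones given above for H2O2 (16 coordinates in total, 6 placement coordinates, 9 ordered coordinates).

```python
from fractions import Fraction


def order(n_ordered: int, n_total: int, n_placement: int) -> Fraction:
    """OMEGA = ordered coordinates / (total coordinates - placement coordinates)."""
    non_placement = n_total - n_placement
    if non_placement <= 0:
        raise ValueError("there must be at least one non-placement coordinate")
    if not 0 <= n_ordered <= non_placement:
        raise ValueError("ordered coordinates must lie between 0 and total - placement")
    return Fraction(n_ordered, non_placement)


# H2O2: 4 atoms x 4 coordinates = 16, minus 6 placement coordinates = 10,
# of which 9 are ordered (the fi angle of H4 is free).
omega = order(n_ordered=9, n_total=16, n_placement=6)
print(omega, float(omega))  # 9/10 0.9
```

A completely ordered molecule such as HOCl would give OMEGA = 1 under the same counting.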

What can be the utility of the Structural Complexity Concept?

Biosystems are, as said, instructed systems. The complexity of a particular subsystem, like a metabolic pathway or morphogenetic pattern, should be related to the number of instructions required to specify that pathway or pattern, i.e. complexity analysis should give an indication of the number of genes involved. Thus, it was previously pointed out (Yagil, 1985) that the high complexity of adenine (C = 59) reflects the large number of genes involved in the purine biosynthesis pathway (13 genes, one of the longest pathways in the metabolic map). An example of a morphogenetic process in a higher organism will be given here:

Wing patterns of the butterfly Bicyclus anynana.

The genetics of wing pattern formation in this butterfly have been intensively studied by Brakefield and colleagues (1996). The wings of B. anynana exhibit "eyespot" patterns, consisting of three concentric rings, each of a different color. To specify the size of each ring, 3 radial specifications are required; 3 further specifications spell out the color of the different rings, and 2 more specifications determine the location of the center of the spot relative to a defined origin in the wing, altogether 8 specifications (Table 3). Genetic crossing experiments by Brakefield et al. revealed that the eyespot patterns are multigene traits; the number of genes involved could be estimated from the crossing experiments and was found to be 4.8 - 9.3 for males and 7.5 - 10.8 for females (Brakefield et al., 1996). There thus seems to be a good correspondence between the number predicted from complexity analysis and that obtained from the much more tedious genetic analysis. While direct experimental analysis is certainly required in any actual situation, complexity analysis can give directions as to anticipated results, and thus help in planning experiments and interpreting the results. The formalism proposed here thus has the potential to be helpful in understanding such complex patterns.
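
The comparison made in this paragraph amounts to a short piece of arithmetic, retraced below. This is only an illustrative sketch: the dictionary keys are labels chosen for this example, and the numbers are the specification counts and the gene-number ranges quoted in the text.

```python
# Specifications needed for one eyespot, as counted in the text.
specifications = {
    "ring radii": 3,
    "ring colors": 3,
    "center position": 2,
}
predicted = sum(specifications.values())
print(predicted)  # 8

# Ranges of gene numbers estimated from the crossing experiments.
gene_estimates = {"males": (4.8, 9.3), "females": (7.5, 10.8)}
for sex, (low, high) in gene_estimates.items():
    # True when the predicted count lies inside the estimated range.
    print(sex, low <= predicted <= high)
```

Both comparisons evaluate to True, which is the correspondence the text refers to; the sketch says nothing about which gene corresponds to which specification.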

In summary, the procedure and the examples described here, as well as previously treated molecules and structures (Yagil, 1985; 1993a,b; 1995), offer a practical way to assess complexities in a wide range of systems in which ordered features predominate. At present an algorithm that systematically provides the minimal complexity is not available, and is considered impossible in the general case (Chaitin, 1990). So far, consistency with other techniques, such as normal mode analysis, has ensured reliable results for the simple molecules treated here. Previous treatments of physical complexity (Crutchfield, 1994; Gell-Mann, 1994) proposed that complexity reaches a maximum somewhere between complete order and complete disorder. The present procedure sees no continuum between order and disorder, but rather divides coordinate space into ordered and non-ordered realms. Structural complexity is defined only in the ordered realm and increases there continuously with declining regularity up to maximal order (OMEGA = 1) in a completely unique system. It is therefore hoped that the formalism described here may help to assess complexity in a wide range of systems.

Acknowledgments:

The inputs and criticisms of my colleagues and friends: Shneior Lifson, David Mukamel, David Harel and Uri Feige are gratefully acknowledged.

References:

Appendix: Rules for determining structural complexity

To compute structural complexity, the following set of rules was adopted, the criterion being the extent to which they lead to consistent descriptions using different coordinate and numbering systems:

  1. The unit of structural complexity C is the specification.
  2. The assignment of a numerical value to a single spatial coordinate, or the declaration of the color of a point, counts as one specification. A color may be a chemical element, a nucleotide base, a cell type, or any other compositional element, depending on the hierarchical level chosen.
  3. A mathematical or logical statement relating the specifications of several points (a regularity) is counted as a single specification, only if a single numerical value is specified; else it contributes as many specifications as the independent numerical values that are present.
  4. A numerical value appearing in more than one statement is counted only once, except if clearly repeated by coincidence.
  5. An ordinal number is not counted as a specification.
  6. A range statement is not counted as a specification.
  7. A simple numerical coefficient like (-1)^i is not counted as a specification.
  8. A transformation of the coordinates adds to the complexity as many statements as new constants are included.
  9. A function (sin fi, log r) is not counted separately from its argument.
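
A few of the rules above (2 to 7) can be illustrated with a small counting sketch. It is a simplification under stated assumptions, not an implementation of the full rule set: the `Statement` type and the `count_specifications` function are hypothetical, each statement lists by name only the colors and numerical values it specifies, ordinals, ranges and simple coefficients are simply not listed, and a name that appears in several statements is assumed to denote the same value and is counted once.

```python
from typing import Iterable, NamedTuple, Tuple


class Statement(NamedTuple):
    text: str                   # human-readable form of the statement
    constants: Tuple[str, ...]  # names of the colors / numerical values it specifies


def count_specifications(statements: Iterable[Statement]) -> int:
    """Count distinct specified constants (rules 2 to 4).

    Ordinal numbers, ranges and simple coefficients (rules 5 to 7) are not
    listed among the constants and therefore do not contribute.
    """
    seen = set()
    for statement in statements:
        seen.update(statement.constants)
    return len(seen)


# H2: one statement covers the color of both atoms (a twofold regularity).
h2 = [
    Statement("color(i) = H for i = 1, 2", ("H",)),
    Statement("distance(1, 2) = RHH", ("RHH",)),
]
# HCl: two unique colors and one distance.
hcl = [
    Statement("color(1) = H", ("H",)),
    Statement("color(2) = Cl", ("Cl",)),
    Statement("distance(1, 2) = RHCl", ("RHCl",)),
]
print(count_specifications(h2), count_specifications(hcl))  # 2 3
```

The two counts agree with the values C = 2 and C = 3 given for H2 and HCl in the text; rules 8 and 9, and the search for the minimal description, are outside the scope of this sketch.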