Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, United

Kingdom.

Tel. 44 1223 494628

Fax. 44 1223 494468Email: ku.ca.ibenull@daehtsew

[toc wrapping=left]

## 1. Introduction to protein structural topology

Most of what we have to say in this review is based upon intuitive concepts of geometry and topology. From the point of view of a mathematician, topology concerns those features of geometrical objects which are invariant under continuous deformations. The surface of a sphere is topologically equivalent to that of any ellipsoid, though geometrically different, and not equivalent to a torus (the surface of a doughnut) since the latter has a hole. Many surfaces have different geometry but the same topology. A protein structure has geometry, expressed in the conformation of the protein backbone and sidechains. We will consider to what extent it is possible to define a topology of protein structures, and how different structures (geometries) may share the same topology.

Even when as few as thirty protein structures were known it was apparent that many of them shared some main features while differing in less important geometrical details (see for example Richardson, J.S. 1977), opening the possibility of a topological description of the structures. The structures could be viewed as comprising a protein backbone to which side chains were attached. Ignoring the side chains, the backbone was seen to be divisible into two different types of structural region. First there were the segments of backbone constituting *secondary structure elements* ( alpha helices and beta sheets ). These typically ran from one side of the protein globule to another, making up the proteins hydrophobic core and some of its surface. Second were the turn and irregular loop regions connecting the secondary structure elements and making up the N and C terminal tails. To the secondary structure elements could be attached a sequential order and relative spatial position. In addition the beta strands had well defined hydrogen bond relationships grouping them into sheet and barrel structures and conferring a well defined relative direction. It was observed that many protein structures shared secondary structure elements with the same sequential order, relative position, direction and hydrogen bond relationships, while differing in many other structural details such as side chain conformations, lengths and conformations of the connecting loop regions, and even the lengths of the secondary structure elements themselves. These observations pointed the way to a topological description of protein structures including the conserved details pertaining to the secondary structure elements and shared by many structures differing in less important geometrical details. It should be said that the use of this description does not imply any continuous deformations relating topologically equivalent structures like those relating the sphere and the ellipsoid, it does however appeal to the GSTE principle (Geometrical Similarity means Topological Equivalence, see Mezey 1993).

Over the years many authors have studied protein structural topology and a number of definitions, notations and representations have emerged. Some of these representations are purely diagrammatical, and are designed to help the human observer understand the topology and topological relationships of protein folds, while others are more formal and mathematical, allowing greater possibility for automated comparison, and hence database searches of various types, and more rigorous attempts at enumeration of topological possibilities and ultimately topology prediction. Work related to notation, representation, comparison and database searches is reviewed in section 2, while that related more directly to topology prediction is reviewed in section 3. The aim of these two sections is simply a good coverage of the work published in the area to date. Issues arising from the literature review are discussed in section 4, where we will attempt to address such questions as the value of the work to date with respect to protein structure research, the usefulness of topological representations, and what directions future work might take.

Readers should note that this review focuses *exclusively* on work in which an abstract topological representation has been defined and is distinct from the actual geometry of the structures. Typically this will be a sequence of secondary structure elements along with some representation of their relationships, which themselves will typically be binary, for instance whether two elements are neighbours, whether they are parallel or anti-parallel, or the chirality of their connection. Of course this representation is ultimately derived from the geometry of the structures, ie. the atomic coordinates, but our aim is to address the usefulness of this type of representation. When the concern is comparison of (sub-)structures, an alternative approach which operates at the level of comparison of atomic coordinates, or the slightly more abstract level of spatial vector representations of secondary structure elements, is also common in the literature. This type of work is not reviewed here, and the reader is referred to Gibrat (1996) and Orengo (1994) for good reviews. The relative merits of the two approaches to structure comparison are discussed in section 4.

## 2. The definition, representation and comparison of protein structural topology

Early work related to structural topology considered how it might be represented in hand drawn diagrammatic form (Levitt, M. and Chothia, C. 1976, Sternberg, M.J.E. and Thornton, J.M. 1977, Nagano 1977). These diagrams were two-dimensional schematic representations of protein folds in which particular symbols were used to represent helices and strands and their sequential connections. Although the diagrams did not imply any formal definition of protein topology and were not generated automatically, they were found to be a very valuable aid to understanding protein folds and their similarities and differences at the topological level. In particular the diagrams were shown to be good for representing topologies within beta structures and chiralities of certain sub-structures like the beta-alpha-beta unit. Automated generation of the diagrams was first attempted by Flores (1994). The use of protein topology diagrams to represent particular protein folds remains popular today, as exemplified by the work of Stirk and co-workers on jellyroll structures (Stirk, H.J., Woolfson, D.N., Hutchinson, E.G. and Thornton, J.M. 1992).

The first systematic survey of protein topology was made by Richardson (1977) and Richardson (1981). This survey was restricted to beta sheet structures, for which it is clear how a topology may be defined since the strands group to form sheets by hydrogen bonding giving a clear concept of spatial adjacency and relative direction. The absence of this type of well defined organization for alpha helices makes the definition of topology for these structures more difficult and is the reason why alpha type structures were not considered in this work. Diagrams representing the topology of the beta sheets were introduced along with a notation describing the topological connectivity of the sheets as a string of numbers,letters and symbols. This notation, known as Richardson notation, is still used today. In addition to the introductions of diagrams and notation the paper made a number of interesting observations about the topology of the structures studied. It was noted that when a connection between two parallel strands in a sheet crosses over between one side of the sheet and the other, the connection is usually right handed (this observation had already been made by other authors); that beta sheets in which the strands run parallel are usually protected from solvent by other secondary structures; that simple topological patterns are favoured and some occur very frequently ( in particular the so called Greek key motif ); and that protein sheet topologies are usually free from knots. It is interesting to note that these conclusions still hold true to a large extent for the far greater number of protein structures known today. A more recent survey of beta structure topologies was carried out by Artymuik and co-workers (1994) which used graph theoretical methods to compare beta sheet topologies.

Also discussed in the paper by Richardson is how particular folding pathways would favour particular types of topology and to what extent a topological similarity between structures might imply an evolutionary relationship. Both questions remain areas of active debate today. The former question was considered in greater detail by Ptitsyn and co-workers (1980) who also discuss definitions of protein structural topology. A more recent review is due to Chothia and Finkelstein (1990). Very recently, Efimov (1997), has re-considered how protein folding patterns (topologies) may be built up from basic super-secondary motifs, and what rules govern this process.

Some years later Chirgadze (1987a and 1987b) made a very thorough study of the topological motifs found in anti-parallel beta proteins and how more complicated motifs can be constructed as combinations of simple strands, meanders and Greek keys. An algebraic notation for these combinations was introduced. The examples in this paper concern mainly classification of Greek key type motifs from both topological and geometrical viewpoints. A related study was carried out some years later by Hutchinson and Thornton (1993) in which Greek key motifs were extracted from a database of protein structures using geometrical similarity as well as topological criteria. This work made use of the program HERA (Hutchinson, E.G. and Thornton, J.M. 1990) to calculate beta sheet topologies.

The first attempt to store a formal definition of protein topology in a computerized database was due to Rawlings and co-workers (1985). Again concentrating on beta structures these authors represented protein topologies as sets of rules written in the programming language PROLOG. For each protein in the database some rules were derived automatically from protein databank files but further rules were added manually. Higher level rules were used to define topological motifs ( hairpins, meanders, greek keys and jelly rolls ) and to search for these in the database of topologies. It is not clear whether or not this work was ever extended, but the database reported contained entries for only seven proteins.

Following this work, but some seven years later, Koch and co-workers (1992) attempted a more mathematical formalization of the description of protein structural topology. Again these workers focused exclusively on beta structures. They introduced the concept of a beta graph: a mathematical graph in which each vertex represents a single beta strand and two edge sets describe sequential and hydrogen bond connections respectively. The language of graph theory allows the representation of any topology, however complicated. In addition to the graph theoretical description the paper introduces a number of linear string descriptions of the beta graphs, one of which is identical to that of Richardson. These linear descriptions however do not deal with bifurcations in sheets in a very natural way and are not in general unique. However, each representation was derived automatically for every protein in the Brookhaven databank using an automated method of secondary structure assignment and searches of this database for (sub-)topologies (motifs) were reported. It is not made clear in the paper why the authors choose to do these searches as substring searches on the linear representations rather than sub-graph searches on the beta graphs. The latter may be slower, but the string descriptions are very clumsy when bifurcations are involved, and speed may not be such an issue given that there are many fast algorithms known for the problem of sub-graph isomorphism.

Ideas related to those of Koch and co-workers described above were pursued by Flower in a series of papers (Flower, D.R. 1994a, Flower, D.R. 1994b and Flower, D.R. 1995). The first of these papers considers automated detection of beta barrel structures as cycles in the beta graph. The second paper builds on the work of Koch and co-workers in constructing linear string representations of beta graphs. The new linear representation is based on the well known SMILES notation used for the connection graphs of small organic molecules; it proves able to deal in a satisfactory manner with complicated beta structures such as barrels and bifurcated or branched sheets. It seems that this work was never extended to include any database searching applications, despite the obvious analogy of protein sub-topology searches to the molecular sub-structure searches of databases of small organic molecules usually carried out using SMILES strings or connection graphs.

Except for that concerned with diagrammatical representation, the work considered so far has concerned only the beta strand type secondary structures. The reason for this suggested earlier was that beta strands are grouped into higher order structures by well defined hydrogen bonds which confer a well defined spatial adjacency and relative direction (parallel or anti-parallel) to the strands, leading to a well defined topology for the structures. The work of Grigoriev and co-workers (1994) takes the opposite approach of considering only all alpha helical structures. Again the approach uses the graph theoretical approach with graph nodes representing secondary structures. This time the edges represent contacts between helices, rather than hydrogen bonds which was the case for beta graphs. Several definitions of “contact” were tested, the main application being not comparison of topology for database searches but rather what insights into the folding process and domain organization of the proteins could be gained by considering the connectivity of the graphs under various definitions and strength thresholds for the helical contacts.

The approaches described in the previous three paragraphs have recently been unified by Koch and co-workers (1996 and 1997). These authors define an *alpha-beta* graph, or *protein* graph, which is a graph theoretical representation of protein topology incorporating both sheets and helices. Nodes in the graph represent secondary structure elements (labelled with type) and edges connecting the nodes are present when two secondary structure elements are in spatial contact. The edges are labelled according to the type of contact which may be parallel, anti-parallel or mixed. The application for which these graphs are used is search for maximal common sub-topologies between pairs and within sets of protein structures. Improving on earlier work on automated comparison which simply compared somewhat clumsy string representations, these comparison are done using the graph theoretical Bron and Kerbosch algorithm – a fast algorithm for finding the maximal common sub-graph of two graphs. This algorithm is further speeded up by limiting the search to connected sub-graphs. A number of interesting results for various protein families are presented in these papers, in particular TIM barrels, bacteriochlorophyll and porins, and some Rossmann type folds, showing at least that the method can detect weak structural similarity. A weakness of the method is that information about the sequential connectivity of the secondary structure elements is not included, meaning that structural similarities could be detected between two proteins where the chain traces a different path through the equivalent secondary structures. The author of this review would prefer to call such similarities *non-topological*. Koch and co-workers do indicate that this weakness could be overcome by including a second directed edge set in the protein graph to indicate sequential adjacency.

Recently, Tsukamoto and co-workers (1997) have reported a deductive database system PACADE which stores topological and geometrical information for 186 non-identical protein structures. This database is searchable for both topologically equivalent and geometrically similar sub-structures. However the rules necessary to describe relatively simple sub-structures such as Greek keys are complicated, and this might inhibit use of the database to “canned” queries unless the user has considerable expertise. The system is illustrated by the example of searches for Greek keys defined both topologically and geometrically with a range of geometrical similarity thresholds.

## 3. The prediction of protein topology

So far all there is in this section is a list of references.

## 4. Discussion

The work on the definition of protein structural topology reviewed in section 1 had a number of themes: the meaning of protein structural topology; how it may be represented diagrammatically and in more formal mathematical/computational language; how topologies may be compared automatically and manually; and how topology may be stored in a database to permit (sub-)topology searches. Clearly these themes are closely related.

The simplest representation of protein topology is diagrammatical, and this is really no more than a schematic diagram of a protein fold illustrating secondary structure elements and their topological relationships. However, it is no less useful for its simplicity, and it permits the human observer to understand and compare protein folds much more quickly and easily than would be possible using full 3D representations of structure.

It is desirable to be able to compare topologies automatically *in silico* as well as by manual inspection. This is the necessary condition to be able to perform (sub-)topology database searches which enable the retrieval of topologically related structures. Such comparison of purely diagrammatical representation is difficult since the diagrams are not unique – for each structure there is more than one possible diagram – and all that would be possible in a hypothetical database of topological diagrams would be a “fuzzy” search matching diagrams according to a similarity criterion known to reflect the way in which diagrams of identical structures may vary. This has, to the knowledge of the author, never been attempted.

This problem lead to the need for the more formal representations of protein topology which have appeared in the literature. These vary from representation as mathematical graphs, to string representations, and representations as rules and facts in deductive databases and languages such as PROLOG. These have all been used in searching applications to some extent, usually the application being the extraction of all examples of particular sub-topologies in the database. More often than not the Greek key sub-topology (super-secondary structure) has been taken as an example. While some methods do allow consideration of helical secondary structures, most do not, and even when a method could use helices it is rare to find real examples. The authors view is that while these papers have presented good technical work on algorithms they are a bit thin on applications and real scientific results.

The recent work of Koch and co-workers (1996 and 1997) stands as an exception to the complaints of the previous paragraph in that helices are included and a number of examples are presented. The work is aimed a *maximal* common sub-topology and thus at global structural comparison rather than sub-topology search, but modification to do the latter should be straightforward. It would be very nice to see this work used for an all against all comparison of the protein structural database which could be compared with similar comparisons which have used geometrical rather than topological similarity searches. How do the two approaches compare? Does a topological search really detect weaker structural similarities and are these significant? The author of this review would suspect that such a test might reveal weaknesses in the method. Many folds which are really quite unrelated topologically might end up equivalent in a method which only considers neighbour contact relationships, makes no distinction between contacts involving helices and the much more well defined hydrogen bond relationships between strands, and ignores sequential connections. All these potential problems could be addressed within a more sophisticated version of the same basic graph theoretical method. The work could equally be compared with work like that done by Willett and co-workers (Mitchell, E.M. *et al* 1989, Grindley, H.M. *et al* 1993 and Artymuik, P.J. *et al* 1994) which is similar in that graph theoretical methods are used, but differs in being carried out at the geometrical level using spatial vectors to represent secondary structure elements.

The previous paragraph raises the general issue of whether protein structures are better compared by searching for geometrical similarity using atomic coordinates directly or by first forming topological representations and searching for similarities in these. Generally one might expect that comparison of topological representations would reveal weaker yet significant similarities, and could for this reason be advantageous since they ought to find close similarities as well. However, there is the danger that topological representations are not sufficiently specific, as suggested in the previous paragraph, and that they find many similarities of little evolutionary or even structural significance. Could a topological representation like the one above pick out all beta trefoil folds from a database without hitting any other folds? Doubtful, though I think that some representations could. A further danger with topological representations is that their construction might be sensitive to small differences between very similar structures. It is very common that two folds be closely related, but differ in that a few hydrogen bonds are missing in one and present in the other – while a topological method dependent on hydrogen bonds would need to take special care to avoid missing the similarity, methods based on comparison of atomic coordinates would be less sensitive to this small structural difference.

PREDICTION – will be discussed when literature review section is complete.

## 6. References

### References related to definition and database searching of abstract representations of protein topology

[bibliplug category=”TOP-REVIEW1″ order_by=”last_name, year, title”]

### References related to (sub-)structure comparison by purely geometrical means

[bibliplug category=”TOP-REVIEW2″ order_by=”last_name, year, title”]

### References related to prediction of topology

[bibliplug category=”TOP-REVIEW3″ order_by=”last_name, year, title”]

### References related to analysis of 3d structures

[bibliplug category=”TOP-REVIEW4″ order_by=”last_name, year, title”]