Author: Paweł Golik
Institute of Genetics and Biotechnology, Faculty of Biology, University of Warsaw,
Institute of Biochemistry and Biophysics, Polish Academy of Sciences, Warsaw,
Prof. Paweł Golik is the director of the Institute of Genetics and Biotechnology, leader of the mitochondrial gene expression research group, and an avid popularizer of genetics.
If you’re not familiar with this classic short story from The Cyberiad, it is well worth your while to remedy that, as no précis can do justice to Stanisław Lem’s witty prose. The titular pirate is no ordinary bandit, in that he cares not for riches but for knowledge, total and factual. The Constructors build him a machine that will provide him with all that he desires – by looking at moving particles of gas, it reads the information encoded in their random perturbations. But unlimited access to knowledge soon proves to be the pirate’s downfall: he ends up buried under the endless printouts of information, unable to find anything of use or interest among the avalanche of utterly useless (if factually correct) data.
But what does this philosophical parable written half a century ago have to do with contemporary biology? More than you might think! Finding sense in the deluge of data and information currently becoming available is one of the major challenges faced by this scientific field.
Avalanche of data
The late 1970s saw the development of methods for sequencing the nucleotides that encode genetic information in DNA; their developers, Walter Gilbert and Frederick Sanger, were awarded the Nobel Prize in Chemistry in 1980. By the late 1980s, the first automated sequencers appeared, simplifying the complex and arduous process. Geneticists soon became bolder in discussing DNA sequences not in terms of individual genes, but rather entire genomes – organisms’ complete hereditary information – marking the beginning of the field of genomics. Its pace of development outstripped even the most optimistic expectations. By the mid-1990s, the first complete genomes of simple organisms such as bacteria and yeasts had been published, and the successful sequencing of the first draft of the human genome, comprising around 3 billion nucleotides, was announced in 2001. The early years of the 21st century marked the next stage of DNA sequencing, with the development of high-throughput or “next-generation” sequencing. Importantly, the development of state-of-the-art methods for studying DNA occurred in parallel with a rapid upsurge in computing power, essential for assembling complete genetic sequences from the short fragments read by sequencers. Over the course of just a decade (1999–2009), the cost of DNA sequencing dropped around 14,000-fold, while the speed of reading sequences increased fifty-thousand-fold!
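The figures quoted above can be turned into a more intuitive rate. The following is a minimal sketch, assuming the change was steady and exponential over the decade (a simplification not claimed in the text); the function name `months_per_doubling` is purely illustrative.

```python
import math


def months_per_doubling(fold_change, years):
    """Months needed for one doubling (or halving), assuming steady
    exponential change that adds up to `fold_change` over `years`."""
    doublings = math.log2(fold_change)  # number of doublings in the period
    return 12 * years / doublings


# Figures taken from the text: 1999-2009, i.e. roughly ten years.
cost_halving = months_per_doubling(14_000, 10)    # cost fell ~14,000-fold
speed_doubling = months_per_doubling(50_000, 10)  # speed rose ~50,000-fold

print(f"Sequencing cost halved roughly every {cost_halving:.1f} months")
print(f"Sequencing speed doubled roughly every {speed_doubling:.1f} months")
```

Under that assumption, both the cost and the speed of sequencing changed by a factor of two in well under a year.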
Today, contemporary sequencers reading billions of nucleotides in a single experiment can be found in many laboratories around the globe, including in Poland. This is accompanied by an extensive and fast-growing range of automated functional genetics techniques which study how genetic information is expressed in cells, which proteins are encoded, the effects of mutations in individual genes, and so on. Specialist journals regularly report the latest sequenced genomes and progress of other high-throughput analyses, while hard disks in bioinformatics laboratories are rapidly filling with terabytes of data. Casual observers would be forgiven for believing that biology is about to reveal its deepest secrets at any moment. In fact, this was how the announcement of the completion of the human genome project was widely reported: since researchers now knew our species’ DNA sequence, surely a full understanding of human biology would be at their fingertips.
However, it soon became obvious that reading DNA sequences is just the first step along a very long journey. So how do we avoid meeting a fate similar to Lem’s notorious pirate, overwhelmed by the vast volumes of data?
In order to appreciate the challenges faced by genetics and genomics, we must first contemplate what is understood by genetic information. It is sometimes compared to an engineering blueprint; however, the comparison is misleading, since DNA doesn’t contain simple instructions for the function and appearance of cells or whole organisms. Genetic information is a language: a set of instructions dictating which proteins and RNA molecules will be produced by cells, and when this production should take place. Protein and RNA molecules are responsible for chemical reactions, construction of chemical structures, gene regulation and signal transmission.
A better metaphor than a blueprint would be a cookbook recipe. The difference is very important: when we look at a schematic diagram, we can easily imagine the final construction it depicts, while reading a recipe gives no clues as to the flavor of the final dish – at least without extensive cookery experience. If someone has never seen or tasted cake, they won’t be able to imagine the results of combining flour, sugar, eggs and fat and heating the mixture to a certain temperature for a period of time, even if they are familiar with the appearance and taste of the individual ingredients. They will only learn about the outcome by tasting the cake prepared following the recipe. Biology is not dissimilar: even if we read the sequence of all genes in a genome and predict the sequence of the encoded proteins and RNA molecules, or find out when each one will be synthesized, it will be insufficient to gain an insight into the functioning of the entire cell. We must study genes, cells, and organisms as they function and learn about the complex systems formed by the individual elements. So far, researchers cannot be replaced by machines, and so generations of future biologists will still have fascinating challenges to face.
Complexities of reality
Sixty years after the structure of DNA was first elucidated by James Watson, Francis Crick, and Rosalind Franklin, we have a reasonable understanding of gene function and we are able to trace the route from genetic information encoded in DNA to the final products functioning within cells, even though we still encounter many surprises and new discoveries along the way. Since we know how genes operate, and genomics allows us to describe all the genes of a given organism, the path to full understanding of that organism’s biology seems simple. However, this is not the case, and data captured using state-of-the-art technologies reveals how little we really understand.
For decades, classical and molecular genetics focused on individual genes. We are all familiar with the simplified “one gene – one trait” paradigm, as described in Gregor Mendel’s pioneering works (in the pea plants he studied, variants of an individual gene determine whether the flowers are white or red); although the model has since been set aside as oversimplified, it long continued to shape our approach to genetic research. As soon as automated laboratory techniques allowed scientists to study DNA variation between individuals, the race began to uncover the genetic basis of various characteristics, with the popular press reporting on an “intelligence gene,” “alcoholism gene,” “schizophrenia gene,” and so on. Of course, the reality proved to be far more complex.
The majority of characteristics of complex organisms such as humans are the result of multiple factors, with the final phenotype (observable traits) determined by variants of numerous individual genes and environmental factors. When information on the differences between individual DNA sequences became widely available, biologists started searching for statistical correlations between a given variant and the likelihood of the appearance of a certain corresponding trait. It soon became clear that the variants identified in such genomic studies explained only a small part of the genetic component of multifactorial traits. This led to the creation of the term “missing heritability”: we know (from studies of twins and close relatives) that a given trait is indeed inherited, yet we cannot find the basis of that inheritance in the genomic sequence. Of the various possible explanations, the question of interactions between individual genes is particularly worthy of further exploration.
Network of interactions
The issue of gene interactions has been known to geneticists for quite some time. Studies of simple model organisms show that, at times, a mutation in one gene can suppress the effect of a mutation in a completely different gene; it can also have the opposite effect of enhancing it, in extreme cases leading to synthetic lethality. Geneticists working with yeasts are familiar with situations in which inactivation of an individual gene has no observable effect; it isn’t until two genes are damaged at the same time that a notable phenotypic effect is observed. In simple model organisms, laboratory automation allows us to analyze such genetic interactions on the genomic scale. Simple systems based on the “one gene – one trait” scheme, as described in Mendel’s experiments, are rare in complex organisms such as humans. With the exception of rare monogenic disorders, the majority of characteristics of complex organisms are the result of interactions between numerous genes, and researchers are only beginning to reveal how many genes may be involved.
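The logic of the double-knockout effect described above can be sketched with a toy model. This is an illustration only, assuming a hypothetical cell that survives as long as at least one of two redundant pathways is fully intact; the gene and pathway names are invented and do not correspond to any real organism.

```python
from itertools import combinations

# Hypothetical redundant pathways: each one works only if all of its
# genes are intact, and the cell needs at least one working pathway.
PATHWAYS = {
    "pathway_1": {"geneA", "geneC"},
    "pathway_2": {"geneB"},
}
GENES = sorted(set().union(*PATHWAYS.values()))


def is_viable(knocked_out):
    """Return True if at least one pathway has no knocked-out gene."""
    knocked_out = set(knocked_out)
    return any(genes.isdisjoint(knocked_out) for genes in PATHWAYS.values())


def synthetic_lethal_pairs():
    """Pairs of genes that are harmless to lose singly but lethal together."""
    pairs = []
    for a, b in combinations(GENES, 2):
        singles_ok = is_viable({a}) and is_viable({b})
        if singles_ok and not is_viable({a, b}):
            pairs.append((a, b))
    return pairs


for gene in GENES:
    print(f"knock out {gene}: viable = {is_viable({gene})}")
print("synthetic lethal pairs:", synthetic_lethal_pairs())
```

In this sketch every single knockout is viable, so studying the genes one at a time would reveal nothing; the phenotype appears only when genes from both pathways are lost together.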
Published in 2010, the results of a study conducted by an international consortium show that human height is affected by hundreds of genetic variants clustered in at least 180 regions of the genome. Studied individually, none of them brings us closer to understanding the genetic basis of the wide range of heights in humans; we will only find the answer by studying their interactions. This is a major challenge: classical genetics works well when studying the interactions of two or a few genes, but it remains powerless when faced with systems comprising hundreds of genes. We are only just beginning to realize the scale of the challenge we are facing.
The classical representation of individual genes is giving way to a highly complex network of interactions. The system’s complexity is due not so much to the high number of components as to the vast number of connections between them. Sequencing genomes is not unlike creating a catalogue of these elements; our route to understanding the mechanisms within the system must start with a thorough knowledge of the catalogue followed by studies of the connections within it. Advancing mass sequencing and genetic analysis techniques are generating vast volumes of data; however, without understanding systemic properties of cells and organisms, we are running the risk of getting lost under the avalanche of terabytes of data, which in and of itself remains useless.
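A quick count shows why the connections, rather than the components, dominate the problem. The sketch below assumes a catalogue of about six thousand genes (roughly the size of the yeast genome) and simply counts how many distinct pairs and triples of genes could, in principle, interact; the function name is illustrative.

```python
from math import comb


def possible_interactions(n_genes, max_order=3):
    """Number of distinct gene combinations of each size from 2 to max_order."""
    return {k: comb(n_genes, k) for k in range(2, max_order + 1)}


counts = possible_interactions(6000)
for order, n in counts.items():
    print(f"combinations of {order} genes: {n:,}")
```

A catalogue of six thousand entries already implies about 18 million possible pairs and about 36 billion possible triples, which is why exhaustive experimental testing quickly becomes impractical as more genes are considered at once.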
How can biology avoid the pitfall of accumulating and cataloguing huge volumes of information which in isolation do little to improve our understanding of the functioning of living organisms? Classical descriptive biology is inadequate for dealing with such a complex problem, and so is descriptive statistics. Biology must draw upon the achievements in physics and mathematics, both of which have been dealing with problems of extremely complex systems for far longer. We are seeing the emergence of a new, interdisciplinary field known as systems biology, striving to create a holistic approach to describe biological systems. This requires devising mathematical models that will allow us to describe complex networks of interactions between genes and their products, and correlating these models with experimental data. This poses a major challenge for theoretical and applied scientists alike. Both rely entirely on the use of powerful computers, since analysis of large collections of biological data without state-of-the-art bioinformatics tools would be an impossible task.
Systems biology is still in its early stages: researchers are creating the first mathematical models of metabolic pathways, signal transduction, and interactions between genes. In 2010, a map of interactions between the six thousand or so genes comprising the yeast genome was published; so far, only interactions between pairs of genes have been studied. Mathematical analyses of such networks are already producing fascinating results; some of their properties resemble maps of other systems such as connections between internet nodes, maps of airline connections, or networks of interpersonal relationships. There are hubs with dozens of connections, as well as peripheral nodes with just a few links to central locations. Modeling such networks allows us to analyze various properties of the system, such as resistance to external interference and capacity for adaptation. By locating a yet-to-be-described gene on such a map, we can use its surroundings to pose a hypothesis as to its potential function. We can also create models of how such systems respond to external stimuli, or how they evolve. However, we must also bear in mind what physicists have learned in studying complex systems. Complex non-linear systems can be so sensitive to tiny changes to their parameters that on a greater scale they behave in ways that are practically impossible to predict. An excellent example is the weather: we are all familiar with the difficulties in predicting how it will change even on a day-to-day basis. It is too early to say how biological systems compare, or to what extent DNA sequencing will help us predict human characteristics determined by interactions between hundreds of genes.
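The sensitivity of non-linear systems mentioned above can be demonstrated with a standard textbook example, the logistic map. This is only a sketch: the logistic map is not a model of any gene network, and the parameter value, starting point, and size of the perturbation are chosen here purely for illustration.

```python
def logistic_trajectory(r, x0, steps):
    """Iterate the logistic map x -> r * x * (1 - x) for `steps` steps."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


R = 3.9        # parameter value in the map's chaotic regime
DELTA = 1e-9   # tiny change applied to the parameter
X0 = 0.2       # identical starting point for both runs
STEPS = 100

base = logistic_trajectory(R, X0, STEPS)
perturbed = logistic_trajectory(R + DELTA, X0, STEPS)

# Absolute difference between the two runs at every step.
gaps = [abs(a - b) for a, b in zip(base, perturbed)]
early_gap = max(gaps[:6])   # largest gap over the first few steps
late_gap = max(gaps[60:])   # largest gap near the end of the run

print(f"largest gap in the first 5 steps: {early_gap:.2e}")
print(f"largest gap in steps 60-100:      {late_gap:.2e}")
```

The two runs start out indistinguishable, yet a change of one part in a billion to a single parameter eventually makes them bear no resemblance to each other, which is the behavior that makes long-range prediction so difficult.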
Future of systems biology
What we do know is that without theoretical backing or without working closely with the fields of IT, mathematics, physics, and complexity theory we run the risk of getting lost in the deluge of data generated each day at high-throughput laboratories. In the future of biological sciences, there remains a place for “classical” studies of the function of individual genes and proteins, for accumulating vast collections of data, and for interdisciplinary theoretical research into the fundamental workings of biological systems. Many great adventures in genomics are yet to come.
The human genome at ten. Nature, www.nature.com/humangenome (collection of articles on the 10th anniversary of sequencing).
Maher B. (2008). Personal genomes: The case of the missing heritability. Nature, 456, 18-21.
Lango Allen H., et al. (2010). Hundreds of variants clustered in genomic loci and biological pathways affect human height. Nature, 467, 832-838.
Costanzo M., et al. (2010). The genetic landscape of a cell. Science, 327, 425-431.
© Academia no. 2 (38) 2013