Life as we know it is specified by genomes. Every organism possesses a genome that contains the biological information needed to construct and maintain a living example of that organism. Most genomes, including those of all cellular life forms, are made of DNA (deoxyribonucleic acid), but a few viruses have RNA (ribonucleic acid) genomes. DNA and RNA are polymeric molecules made up of linear, unbranched chains of monomeric subunits called nucleotides. Each nucleotide has three parts: a sugar, a phosphate group, and a base (fig 1.1). In DNA, the sugar is 2’-deoxyribose and the bases are adenine (A), cytosine (C), guanine (G) and thymine (T). Nucleotides are linked to one another by phosphodiester bonds to form a DNA polymer, or polynucleotide, which might be several million nucleotides in length. DNA in living cells is double-stranded, two polynucleotides being wound around one another to form the double helix. The double helix is held together by hydrogen bonds between the base components of the nucleotides in the two strands. The base-pairing rules are that A base-pairs with T, and G base-pairs with C. The two polynucleotides in a double helix therefore have complementary sequences. In an RNA nucleotide, the sugar is ribose rather than 2’-deoxyribose, and thymine is replaced by the related base uracil (U). RNA polymers are rarely more than a few thousand nucleotides in length, and RNA in the cell is usually single-stranded, although base pairs might form between different parts of a single molecule.
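The base-pairing rules can be illustrated with a few lines of Python. This is only a minimal sketch: the function names are invented for the illustration, and it relies on one fact not spelled out above, namely that the two strands of the helix run in opposite (antiparallel) directions, so the partner strand is read in reverse.

```python
# Watson-Crick pairing rules from the text: A pairs with T, G pairs with C.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complementary_strand(strand: str) -> str:
    """Return the partner strand, written in its own 5'->3' direction.

    Assumption: the strands are antiparallel, so the complement is reversed.
    """
    return "".join(PAIRS[base] for base in reversed(strand.upper()))


def to_rna(dna: str) -> str:
    """Write a DNA sequence in the RNA alphabet (thymine replaced by uracil)."""
    return dna.upper().replace("T", "U")


if __name__ == "__main__":
    print(complementary_strand("ATGC"))  # partner of ATGC
    print(to_rna("ATGC"))                # same sequence in the RNA alphabet
```

Applying complementary_strand twice returns the original sequence, which is another way of saying that each strand fully specifies the other.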
The biological information contained in a genome is encoded in the nucleotide sequence of its DNA or RNA molecules and is divided into discrete units called genes. The information contained in a gene is read by proteins that attach to the genome at the appropriate positions and initiate a series of biochemical reactions referred to as gene expression. For organisms with DNA genomes, gene expression can, at the simplest level, be divided into two stages, transcription and translation. The first produces an RNA copy of the gene, and the second results in synthesis of a protein whose amino acid sequence is determined, via the genetic code, by the nucleotide sequence of the RNA transcript. This simplified description should not, however, draw attention away from the key points in the gene expression pathway at which information flow is regulated.
A complete copy of the genome must be made every time a cell divides. DNA replication has to be extremely accurate in order to avoid the introduction of mutations into the genome copies. Some mutations do, however, occur, either as errors in replication or through the effects of chemical and physical mutagens that directly alter the structure of DNA. DNA repair enzymes correct many of these errors; those that escape the repair processes become permanent features of the lineage descending from the original mutated genome. These events, along with genome rearrangements resulting from recombination, underlie molecular evolution, the driving force behind the evolution of living organisms.
Of all the genomes in existence, our own is quite naturally the one that interests us most. An overview of its structure is therefore necessary for understanding the goals of the Human Genome Project and the technologies developed to achieve them.
The human genome is made up of two distinct components. The first is the nuclear genome, which comprises approximately 3,000,000,000 bp (base pairs) of DNA. The nuclear genome is divided into 24 linear DNA molecules, the shortest 55 Mb in length and the longest 250 Mb, each contained in a different chromosome. These 24 chromosomes consist of 22 autosomes and the two sex chromosomes, X and Y. The second component is the mitochondrial genome, a circular DNA molecule of 16,569 bp, many copies of which are located in the energy-generating organelles called mitochondria. Each of the approximately 10^13 cells in the adult human body has its own copy or copies of the genome, the only exceptions being those few cell types, such as red blood cells, that lack nuclei in their fully differentiated state. The vast majority of cells are diploid and so have two copies of each autosome, plus two sex chromosomes, XX for females or XY for males: 46 chromosomes in all. These are called somatic cells, in contrast to sex cells or gametes, which are haploid and have just 23 chromosomes, comprising one of each autosome and one sex chromosome.
In order to obtain the human genome sequence, several strategies involving different techniques were used. These techniques involved much more than just methods for sequencing DNA molecules. Sequencing methods are obviously of paramount importance, but they have one major limitation: even with the most sophisticated technology it is rarely possible to obtain a sequence of more than about 750 bp in a single experiment. This means that the sequence of a long DNA molecule has to be constructed from a series of shorter sequences. One approach would be to break the molecule into fragments, determine the sequence of each one, and use a computer to search for overlaps and build up the master sequence. This shotgun method is the standard approach for sequencing small prokaryotic genomes, but the required data analysis becomes disproportionately more complex as the number of fragments increases, leading to a much higher risk of errors. The problem is further compounded where a repetitive sequence is broken into fragments, with many of the pieces containing the same, or very similar, sequence motifs. It would be very easy to reassemble these sequences so that a portion of the repetitive region was left out, or even so that two pieces of the same or different chromosomes were mistakenly connected together (fig 2.2).
The shotgun approach was therefore considered inappropriate for the human genome. Instead, it was agreed that a genome map must first be generated. This would provide a guide for the sequencing experiments by showing the positions of genes and other distinctive features. For these reasons, the first 6 years of the Human Genome Project were devoted almost exclusively to mapping the human genome, rather than sequencing it.
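The overlap search at the heart of the shotgun method, and the way repeats can mislead it, can be sketched in Python. This is a toy greedy assembler written for illustration only, not the software used by any genome project; the function names, the minimum overlap of 3 bases and the example reads are all assumptions.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments, min_len: int = 3):
    """Repeatedly merge the pair of fragments with the longest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best = (0, None, None)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # no remaining overlaps: leave the pieces as separate contigs
        merged = frags[i] + frags[j][n:]
        frags = [f for k, f in enumerate(frags) if k not in (i, j)] + [merged]
    return frags


if __name__ == "__main__":
    # Unique sequence: the three reads rebuild ATGCGTACGTTAGC.
    print(greedy_assemble(["ATGCGTA", "GTACGTT", "CGTTAGC"]))
    # Tandem repeat: the reads come from CAT + GGA*3 + TCC (15 bases), but the
    # longest overlap lies within the repeat, so one GGA copy is lost.
    print(greedy_assemble(["CATGGAGGA", "GGAGGATCC"]))
```

In the second example the two reads truly overlap by 3 bases, but the repeat lets them align over 6, so the assembled sequence is 12 bases long rather than 15. This is the kind of error described above, where a portion of a repetitive region is left out.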
The convention has been to divide genome mapping methods into two categories. The first, and the older of the two, is genetic mapping, which uses genetic techniques to construct maps showing the positions of genes and other sequence features on a genome. The second is physical mapping, which uses molecular biology techniques to examine DNA molecules directly in order to construct maps showing the positions of sequence features, including genes.
As with any type of map, a genetic map must show the positions of distinctive features. In a geographical map these markers are recognisable components of the landscape, such as rivers, roads, and buildings. What markers, however, could be used in a genetic landscape?
The first comprehensive human genetic map was published only in 1987, but within 7 years very high resolution maps had been achieved, mostly using microsatellite markers. Human genetic maps could not feasibly be constructed using classical methods of genetic mapping. Classical genetic maps for experimental organisms such as Drosophila and mouse are based on genes. They have been available for decades, and have been refined continuously. They are constructed by crossing different mutants in order to determine whether two gene loci are linked or not. For much of this period, human geneticists were envious spectators, given that a human genetic map was unattainable. Unlike the maps of experimental organisms, the human genetic map was never going to be based on genes, because the frequency of mating between two individuals suffering from different genetic disorders is extremely small.
The only way forward for a human genetic map was to base it on polymorphic markers which were not necessarily related to disease or genes. As long as the markers showed Mendelian segregation and were polymorphic enough that recombinants could be scored in a reasonable percentage of meioses, a human genetic map could be obtained. The problem here was that, until recently, suitable polymorphic markers were just not available. Classical human genetic markers consisted of protein polymorphisms, notably blood group and serum protein markers, which are both rare and not very informative (see box 11.1). By 1981, only very partial human linkage maps had been obtained, and then only in the case of a few chromosomes.
It was therefore the identification of DNA-based polymorphisms that transformed human genetic mapping. Unlike classical markers, DNA-based polymorphisms were not confined to the 3% of the DNA that was expressed (genes): they were also available in noncoding DNA. Since the latter is not so strongly conserved in evolution, changes in the DNA are comparatively frequent. The realisation that DNA polymorphisms could be abundant called for a radical revision of thinking, and the early 1980s saw serious discussion of the possibility of constructing a complete human genetic map for the first time (Botstein et al., 1980). Moreover, DNA markers have the advantage that they can all be typed by the same technique, with their chromosomal location being determined by FISH or radiation hybrid mapping (sections 10.1 and 10.2), allowing DNA-based genetic maps to be cross-referenced to physical maps. This avoids the frustrating situation that arose when the long-sought cystic fibrosis gene (CFTR) was first mapped. Linkage was established to a protein polymorphism of the enzyme paraoxonase, but the chromosomal location of the paraoxonase gene was not known.
The desirability of a complete linkage map of the human genome was clear. In addition to providing a framework for studying the nature of recombination in humans, it would permit rapid gene localization, assist gene cloning, and facilitate genetic diagnosis. Almost inevitably, the realisation that a comprehensive human genetic map was now attainable sparked serious efforts to construct one. The first generation of DNA markers were restriction fragment length polymorphisms (RFLPs). RFLPs were initially typed by preparing Southern blots from restriction digests of the test DNA, and hybridising with radiolabelled probes (fig 5.12). This technology required plenty of time, money and DNA, and made mapping a whole genome a heroic undertaking. Nowadays this is less of a problem because RFLPs can usually be typed by PCR. A sequence including the variable restriction site is amplified, and the product is incubated with the appropriate restriction enzyme and then run out on a gel to see if it has been cut (fig 6.6). A more fundamental limitation of RFLPs is their limited informativeness: they have only two alleles, because the site is either present or absent. In 1987, after a huge effort, the first comprehensive human genetic map was published, based on 403 polymorphic loci, including 393 RFLP markers (Donis-Keller et al., 1987). Although this achievement was important, there remained some serious drawbacks with the map: the average spacing between the markers (greater than 10 cM) was still considerable and, more significantly, RFLP markers were not very informative and were difficult to type (see box 11.1).
High-resolution human genetic maps have therefore largely been obtained through the use of microsatellite markers. Hypervariable minisatellite VNTR (variable number tandem repeat) polymorphisms were a great improvement on RFLPs, given that they have many alleles and high heterozygosity. However, the technical problems of Southern blotting and radioactive probes were still an obstacle to easy mapping, and VNTRs are not evenly spread across the genome: their applicability to genome-wide maps is limited because they are mostly restricted to chromosomal regions near the telomeres. Microsatellite markers (also described as short tandem repeat polymorphisms, or STRPs) have the advantage of being abundant, dispersed throughout the genome, highly informative and easy to type (see box 11.1). The advent of PCR finally made mapping relatively quick and easy. Minisatellites are too long to amplify well, and so microsatellites became the standard tools for PCR linkage analysis. These are mostly (CA)n repeats. Tri- and tetranucleotide repeats are gradually replacing dinucleotide repeats as the markers of choice because they give cleaner results: dinucleotide repeat sequences are particularly prone to replication slippage during PCR amplification, so each allele gives a little ladder of “stutter bands” on a gel, making it hard to read (fig 6.8). Much effort has been devoted to producing compatible sets of microsatellite markers that can be amplified together in a multiplex PCR reaction and give non-overlapping allele sizes, so that they can be run in the same gel lane. With fluorescent labelling in several colours, it is possible to score up to ten markers on a sample in a single lane of an automated gel. By focusing on this type of marker, researchers at the Genethon laboratory in France were quickly able to provide a second-generation linkage map of the human genome (Weissenbach et al., 1992).
Subsequently, maps have been produced with ever increasing numbers of genetic markers, especially microsatellite markers, and ever increasing resolution. Within a further two years, a genetic map with 1 cM resolution had been achieved (Murray et al., 1994). After this time the major effort switched to the construction of high-resolution physical maps.
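The contrast drawn above between two-allele RFLPs and multi-allele microsatellites can be made concrete with the standard expected-heterozygosity formula, H = 1 - sum(p_i^2), where p_i is the frequency of allele i. The sketch below assumes random mating, and the allele frequencies are invented for illustration rather than taken from any real marker.

```python
def expected_heterozygosity(allele_freqs):
    """H = 1 - sum(p_i^2): chance that a random individual carries two different alleles.

    Assumes random mating; allele_freqs must sum to 1.
    """
    if abs(sum(allele_freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(p * p for p in allele_freqs)


if __name__ == "__main__":
    # Best case for a two-allele marker such as an RFLP: both alleles equally common.
    print(round(expected_heterozygosity([0.5, 0.5]), 3))
    # Hypothetical microsatellite with ten equally common alleles.
    print(round(expected_heterozygosity([0.1] * 10), 3))
```

A two-allele marker can never exceed H = 0.5, so at best half of all individuals are heterozygous and therefore informative in a linkage study, whereas a marker with many alleles can approach 1.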
Like the genetic map, a physical map of the human genome will consist of 24 maps, one for each chromosome. The different genetic maps of the human genome that have so far been assembled all represent the same concept: sets of linked polymorphic markers (linkage groups) corresponding to different chromosomes. In contrast to this uniformity, a variety of different physical maps are possible (table 13.3 and figure 13.2). The first physical map of the human genome was obtained more than 40 years ago, when cytogenetic banding techniques were used not only to distinguish between different chromosomes, but also to discriminate between different subchromosomal regions (figure 2.17). Although the resolution is coarse (an average-sized chromosome band in a 550-band preparation contains approximately 6 Mb of DNA), it has been very useful as a framework for ordering the locations of DNA sequences by chromosome in situ hybridisation techniques. This is a simple procedure for mapping genes and other DNA sequences by hybridising a suitably labelled DNA probe against chromosomal DNA that has been denatured in situ. To do this, an air-dried microscope preparation of metaphase or prometaphase chromosomes is made, usually from peripheral blood lymphocytes or lymphoblastoid cell lines. Treatment with RNase and proteinase K results in partially purified chromosomal DNA, which is denatured by exposure to formamide. The denatured DNA is then available for in situ hybridisation with an added solution containing a labelled nucleic acid probe, overlaid with a cover slip. Depending on the particular technique used, chromosome banding can be carried out either before or after the hybridisation step. As a result, the signal obtained after the removal of excess probe can be correlated with the chromosome banding pattern in order to identify a map location for the DNA sequences recognised by the probe.
Chromosome in situ hybridisation has been revolutionized by the use of fluorescence in situ hybridisation (FISH) techniques (section 10.1.4).
Other maps have been obtained by mapping natural chromosome breakpoints (using translocation and deletion hybrids; section 10.1.2). This is a more refined technique of physical mapping than the use of whole-chromosome somatic cell hybrids, because the hybrids carry only part of a particular chromosome. Translocation hybrids and deletion hybrids are made by using donor human cells that have a chromosomal translocation or deletion. To be useful, the hybrids must lack the normal homolog of the chromosome of interest. Such hybrids can be used for chromosomal mapping of a human sequence-tagged site or biochemical marker (fig 10.2). They are especially useful for defining the sequences removed by microdeletions, by segregating the deletion-carrying chromosome away from its normal homolog. Alternatively, physical maps can be obtained by mapping artificial chromosome breakpoints using radiation hybrids (section 10.1.3). These are the most valuable hybrids for gene mapping (Walter et al., 1994). Donor cells are subjected to a lethal dose of radiation, which fragments their chromosomes. The average size of a fragment is a function of the dose of radiation. After irradiation the donor cells are fused with recipient cells of a different species. A selection system is used to pick out the recipient cells that have taken up some of the donor chromosome fragments (box 10.1). These cells are useful for mapping in so far as they have taken up a random set of other chromosome fragments from the donor, as well as the selected fragment. Stably incorporated donor fragments are either integrated into rodent chromosomes or assembled into novel human minichromosomes formed around fragments containing a functional centromere. Although this procedure was first proposed by Goss and Harris in 1975, it was not used seriously until 1990, when hybrids were constructed using irradiated monochromosomal hybrid cells as donors (Cox et al., 1990).
When a set of DNA markers from the human chromosome is assayed in a panel of such radiation hybrids, the patterns of cross-reactivity can be used to construct a map (fig 10.3), the principle being very similar to meiotic linkage analysis. However, the resolution achieved can be quite limited. Such maps have, however, been useful frameworks for mapping genes (transcription maps; section 13.3). Large-scale restriction maps have also been generated, such as the NotI restriction map of 21q (Ichikawa et al., 1993; fig 13.2). However, the most important maps are clone contig maps, because these are the immediate templates for DNA sequencing.
The construction of the ultimate physical map (the complete nucleotide sequence) requires considerable time and effort in the case of a very large DNA molecule such as that found in a chromosome. In order to provide a framework for this to be done efficiently, a series of cloned DNA fragments needs to be assembled which collectively provide full representation of the sequence of interest. To ensure that there is complete representation, and no gaps, the series of clones should contain overlapping inserts forming a comprehensive clone contig (fig 10.13a). In principle, contig assembly is facilitated by the way in which genomic DNA libraries are constructed: as part of the strategy for maximising the representation of the library, the genomic DNA is deliberately subjected to partial digestion with a restriction endonuclease (by reducing the time of incubation and by using low concentrations of the enzyme). As a result, individual genomic DNA clones usually contain DNA sequences that partially overlap with the insert DNA of at least some of the other clones in the library (see fig 10.13b). The cloning step means that the individual DNA fragments are sorted into different cells, and so the original positional information of the fragments (how they were related to each other on the original chromosomes) is lost.
However, such information can be retrieved by a variety of methods which can identify clones with overlapping inserts.
Chromosome walking means establishing clone contigs from fixed starting points. One widely used technique for identifying clones with overlapping inserts is to use a specific DNA probe from one clone to screen a DNA library. The positively hybridising clones should contain a DNA sequence that is closely related to the probe, and will include clones whose inserts partly overlap the sequence found in the probe. This has often involved the preparation of a so-called end probe from the starting DNA clone: a fragment located at one end of the insert DNA, and preferably present as single-copy DNA, is purified and labelled. Positively hybridising clones can then be purified, and new, distal end probes can be prepared for further rounds of hybridisation screening of the DNA library.
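The logic of a chromosome walk can be sketched in Python as a loop of "take an end probe, screen the library, move to the clone that extends furthest". This is a toy model under strong assumptions: clone inserts are plain strings, hybridisation is reduced to exact substring matching, the walk goes in one direction only, and letters of the alphabet stand in for unique (single-copy) sequence. The function name and probe length are invented for the example.

```python
def walk(library, start, probe_len=4):
    """Extend a contig in one direction from `start` by repeated end-probe screening."""
    contig = [start]
    seen = {start}
    current = start
    while True:
        probe = current[-probe_len:]  # end probe from the distal end of the insert
        # "Screen" the library: unused clones that contain the probe and extend past it.
        hits = [c for c in library
                if c not in seen and probe in c and not c.endswith(probe)]
        if not hits:
            return contig  # the walk stops when no clone extends the contig
        # Continue from the hit that reaches furthest beyond the probe.
        current = max(hits, key=lambda c: len(c) - c.index(probe) - probe_len)
        seen.add(current)
        contig.append(current)


if __name__ == "__main__":
    toy_library = ["ABCDEFGH", "EFGHIJKL", "GHIJKLMNOP", "MNOPQRST", "CDEFGH"]
    print(walk(toy_library, "ABCDEFGH"))
```

If the end probe were a repeated sequence rather than single-copy DNA, the screening step would also pick up clones from unrelated parts of the genome, which is why single-copy end probes are preferred.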
YAC clone contig maps, as will be explained shortly, represented the first-generation physical map of the human genome. A complete clone map of a chromosome would comprise all of the DNA without any gaps (contig originated as a shortened form of the word contiguous; section 10.3). Because of their large inserts, yeast artificial chromosome (YAC) clones have been particularly useful in generating first-generation physical maps of human chromosomes. Different methods of identifying overlaps between clones have been used, but STS markers (both polymorphic and non-polymorphic) which had previously been mapped to the chromosome of interest have been particularly useful. Significant contig maps for individual human chromosomes were first reported in 1992 for chromosome 21 and the Y chromosome and, subsequently, a first-generation clone contig map of the human genome was reported by workers at the CEPH laboratory in Paris (Cohen et al., 1993). An updated YAC contig map, covering 75% of the human genome and consisting of 225 contigs with an average size of 10 Mb, was subsequently published by the same group (Chumakov et al., 1992). While these physical maps were recognised to be far from complete, this was an outstanding achievement and provided a good framework for the scientific community to build upon in order to produce more detailed maps of all the chromosomes. Complementing this approach, good STS-based physical maps of the human genome have been developed, such as the one constructed at the Whitehead Institute in Massachusetts (Hudson et al., 1995). These have been achieved in part by mapping STSs against panels of whole-genome radiation hybrids. Radiation hybrids derived from monochromosomal hybrid donor cells, as previously discussed, have been superseded by whole-genome radiation hybrids, where the donor is an irradiated normal human diploid cell.
The first such panel consisted of 199 hybrids made by fusing an irradiated 46,XY human fibroblast cell line to TK- hamster cells (Walter et al., 1994). Gyapay et al. (1996) used 404 microsatellite markers of known location to show that this hybrid panel could generate accurate maps, and then used it to map 374 unmapped ESTs. A subset of 93 of the hybrids has been made widely available as the Genebridge 4 panel. The 93 hybrids average 32% retention of any particular human sequence, with an average fragment size of 25 Mb. Laboratories can map any unknown STS by scoring the 93 Genebridge hybrids and comparing the pattern with the patterns of previously mapped markers held on a central server (fig 10.4).
This turned out to be an extremely powerful and convenient tool for physically mapping any STS or EST. A second human-hamster panel, Stanford G3, was made using a higher dose of radiation, so that the average human fragment size is smaller. The 83 hybrids in the G3 panel average 16% retention of the human genome, with an average fragment size of 2.4 Mb. Thus G3 can be used for fine mapping (Deloukas et al., 1998).
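The idea of scoring an unknown STS across a hybrid panel and comparing its retention pattern with those of mapped markers can be sketched as follows. This is a deliberately simplified illustration: real radiation hybrid mapping uses statistical measures of breakage and linkage rather than the raw concordance used here, and the 8-hybrid panel, marker names and patterns are all invented.

```python
def concordance(a: str, b: str) -> float:
    """Fraction of hybrids in which two markers are both retained or both absent.

    Patterns are strings of '1' (marker present in that hybrid) and '0' (absent).
    """
    if len(a) != len(b):
        raise ValueError("patterns must cover the same hybrids")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def closest_marker(query: str, mapped: dict) -> str:
    """Name of the mapped marker whose retention pattern best matches the query."""
    return max(mapped, key=lambda name: concordance(query, mapped[name]))


if __name__ == "__main__":
    # Hypothetical retention patterns across a toy panel of 8 hybrids.
    mapped_markers = {
        "M1": "11001010",
        "M2": "11001000",
        "M3": "00110101",
    }
    unknown_sts = "11001001"
    for name, pattern in mapped_markers.items():
        print(name, concordance(unknown_sts, pattern))
    print("closest:", closest_marker(unknown_sts, mapped_markers))
```

Markers that lie close together are rarely separated by a radiation-induced break, so they tend to be retained or lost together; a panel made with a higher dose, and hence smaller fragments, separates closer markers and gives finer resolution.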
However, the utility of the YAC contig maps is limited because YAC inserts are often not faithful representations of the original starting DNA; many YAC clones are chimeric or have internal deletions (see section 10.3.5). As a result, second-generation clone contig maps have relied on bacterial artificial chromosomes (BACs) and P1 artificial chromosomes (PACs). Although the insert sizes of these clones (typically 70-250 kb) are much smaller than those of YACs, this disadvantage is more than outweighed by their stability, which makes them more faithful representations of the original DNA. Recently the large genome centres have focused on constructing large BAC contigs as a prelude to large-scale DNA sequencing.
An early priority in the Human Genome Project was the construction of gene (transcript) maps. From the outset of the project there was much debate over whether to go for an all-out assault (indiscriminate sequencing of all 3 billion bases), or whether to focus initially just on the coding DNA sequences. The average coding DNA of a human gene is about 1.7 kb, but human genes occur, on average, once every 40-50 kb of DNA. As a result, coding DNA accounts for a mere 3% or so of the human genome. To obtain coding DNA sequences, the easiest approach would be to make a range of human cDNA libraries, then sequence cDNA clones at random.
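The figure of roughly 3% follows from simple arithmetic on the numbers just quoted, as this short Python check shows. The function name is invented for the illustration, and the calculation assumes genes are spread evenly at the stated average spacing.

```python
def coding_fraction(coding_kb: float, gene_spacing_kb: float) -> float:
    """Fraction of the genome that is coding, given one gene per `gene_spacing_kb`."""
    return coding_kb / gene_spacing_kb


if __name__ == "__main__":
    for spacing in (40, 50):
        frac = coding_fraction(1.7, spacing)
        print(f"one gene per {spacing} kb -> {frac:.1%} coding")
```

With one gene every 40-50 kb the estimate comes out at about 3.4% to 4.3%, the same order as the quoted 3%.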
The case for giving priority to coding DNA sequencing rested on two arguments: firstly, that coding DNA contains the information content of the genome and so is by far the more interesting and medically relevant part; and secondly, that it is such a small percentage of the genome that it could be sequenced very quickly and cheaply when compared with efforts to sequence the entire genome. Supporters of whole-genome sequencing emphasized that finding all genes could be difficult (some genes may not be well represented in available cDNA libraries if they are very restricted in expression, or expressed transiently during early development). In addition, at least some of the noncoding DNA is functionally important, e.g. regulatory elements and sequences that are important for chromosome function.
Indeed, the first comprehensive gene map was based on short sequence tags from cDNA clones. The coding sequence priority prevailed, and the first reasonably comprehensive human gene maps were constructed in a process that essentially involved three steps. The first step was random sequencing of cDNA. To begin with, this meant sequencing short (around 300 bp) sequences at the 3’ ends of cDNA clones from a variety of human cDNA libraries. These short sequences became known as expressed sequence tags (ESTs) because they permitted a simple and rapid PCR assay for a specific expressed sequence (Adams et al., 1991). In this sense, therefore, an EST is simply the gene equivalent of an STS (a term used to describe any type of sequence, often noncoding DNA, which is specific for a particular locus). Because the 3’ UTR of almost all human genes exceeds 300 bp, the 3’ ESTs typically did not contain coding sequence.
The second step was to map ESTs to specific chromosomes. 3’ UTR sequences are not as frequently interrupted by introns as coding DNA. This means that it is usually easy to design PCR primers from an EST that will amplify the specific sequence in a genomic DNA sample. Because 3’ UTR sequences are not very well conserved during evolution, it is also possible to screen human-rodent somatic cell hybrids for the presence of a human EST (the orthologous rodent sequences are usually so diverged that they do not amplify). By using a panel of human monochromosomal somatic cell hybrids (section 10.1.1), an EST can be mapped to a specific human chromosome.
The third step, mapping ESTs to subchromosomal locations, has been achieved through a huge effort at a number of centres to establish integrated STS-based and EST-based maps, such as those produced by the Whitehead Institute. This has involved using PCR primers that are specific for an EST (or STS) to type YACs and other clones within clone contig maps that have been produced for the relevant chromosome, and/or to type a panel of whole-genome radiation hybrids. Two such panels have been used in particular (section 10.1.3): the Genebridge 4 panel (average fragment size 25 Mb) and, for higher resolution, the Stanford G3 panel (average fragment size 2.4 Mb).
Using the above approaches, the number of human genes placed on the physical map increased exponentially (fig 13.3). The latest human gene map, published in October 1998, was achieved by a radiation hybrid mapping consortium led by the Sanger Centre, UK, together with various other centres, notably the Stanford Human Genome Center, the Genethon laboratory in Paris, the Whitehead Institute and the Wellcome Trust Centre for Human Genetics at Oxford, UK. In all, map positions for over 30 000 human genes were reported (Deloukas et al., 1998), representing possibly 30-40% of the total human gene catalog. In many cases little or no coding sequence is available for the mapped genes, and considerable effort is being devoted to sequencing large inserts of human cDNA clones in various laboratories throughout the world. Different research programs are investigating gene expression in specific tissues or in specific states. For example, the Cancer Genome Anatomy Project (electronic reference 2), a program devised at the US National Cancer Institute, is devoted to studying expression of genes in various human tumor cells, including sequencing of large-insert cDNA clones from cDNA libraries made from human tumor cells and large-scale expression profiling using microarrays (section 20.2.2).
Accelerated sequencing efforts mean that the ultimate physical map, the complete nucleotide sequence of the human genome, should be delivered by the year 2003. At the outset of the Human Genome Project, DNA sequencing was expensive and not very efficient. It was anticipated, however, that technological developments would lead to considerable reductions in costs and more efficient sequencing. The sequencing of the human genome seemed at that time an immense challenge because there was so little experience in sequencing large genomes. All that has changed, and some very large genomes have already been sequenced (fig 13.4). There have been no significant changes in the sequencing technology; the dideoxy sequencing approach invented by Fred Sanger and his colleagues at Cambridge, UK, more than 20 years ago is still used. Instead, efficiency gains have been made through the use of automated fluorescence-based systems and capillary gel electrophoresis.
While the first few years of the Human Genome Project were devoted to producing high-resolution genetic and physical maps, large-scale human genome sequencing is now very much underway, and 10% of the human genome had been sequenced by May 1999 (fig 13.5, electronic ref 3). Funded largely by the Wellcome Trust, the greatest single contributor has been the Sanger Centre at Hinxton, UK (fig 13.6). By May 1999 the Sanger Centre had contributed over 100 Mb of finished human sequence (out of a global 300 Mb), and had also produced a further 65 Mb of unfinished sequence (sequences which have not yet been compiled into large contigs). In order to avoid wasteful duplication of effort, the HUGO-sponsored Human Genome Sequencing Index identifies priority chromosomes or subchromosomal regions targeted by individual sequencing centres (electronic ref 4). Currently, chromosome 22 is set to be the first human chromosome to be completed (fig 13.7).
Partly in response to competition from the private sector (box 13.2), the UK Wellcome Trust and the US National Human Genome Research Institute have collaborated to bring forward the timescale for completion of the Human Genome Project. The aim is to produce a working draft, comprising about 90% of the human genome, by the year 2000. The Sanger Centre is expected to produce 33% of the working draft, and the three major American genome sequencing centres (Washington University School of Medicine at St. Louis, Baylor College of Medicine and the Whitehead Institute) are expected to achieve 60% between them. Other centres, notably in France, Germany and Japan, are also committed to sequencing specific subchromosomal regions (electronic ref 4). After completion of the working draft, the full genome sequence is expected to be achieved around 2002-2003.
Human genetic maps based on microsatellite markers, although extremely valuable, have some limitations. In particular, although such markers are found all over the genome, their density is limited to about one per 30 kb. In addition, typing of microsatellite markers is not so amenable to automation on a very large scale. By contrast, single nucleotide polymorphisms (SNPs) are very frequent (about 1 per kb), and typing is easily automated because they have only two alleles (section 11.2.3). As a result, they have been envisaged to have potentially powerful applications in association studies to identify genes underlying polygenic disease (Collins et al., 1997; Schafer and Hawkins, 1998). The first steps towards establishing a third-generation, SNP-based genetic map have recently been described by Wang et al. (1998). Partly in response to initiatives from the private sector (see box 13.2), the US Human Genome Project and the UK Wellcome Trust have committed funds for the construction of a map containing 100 000 SNPs by 2003. However, the utility of SNPs remains unproven, and some emerging data have dampened the initial optimism (Pennisi, 1998).
Mapping the human genome is not the only scientific focus of the Human Genome Project; at its outset, the value of sequencing the genomes of model organisms was recognised. Such organisms include a variety of species, some of which have been particularly amenable to genetic analysis (see box 13.3). In part, the sequencing of smaller genomes was also considered as a pilot for large-scale sequencing of the human genome. By 1999 the genomes of about 100 organisms were being sequenced or had already been sequenced (electronic ref 5).
Once the sequence of the human genome is known, what difference will it make? Certainly there will be a huge boost to basic research as we grapple with the fundamental biological question of how our genome is interpreted to specify a person. In the so-called post-genome era, accurate genetic testing will become widely available, not just for genetic disorders, but also for genetic susceptibility to a variety of different conditions, including infectious diseases. But there may be a downside in terms of discrimination against individuals. Improved treatments can also be expected. The much-vaunted gene therapy approaches may prove technically difficult, but the new information will undoubtedly assist the development of novel therapies.
Comparative and whole-genome analyses permit large-scale studies of DNA organisation and evolution and of gene expression and function. The Human Genome Project had not reached its half-way point before serious consideration was given to what the research priorities should be in the post-genome era. Certainly, the sequencing of the whole genome will provide revolutionary approaches to biomedical research. For the first time, there are opportunities to compare whole genomes, and the newly developed field of bioinformatics is set to take off (Gershon et al., 1997; Smith, 1998). Genome-wide analyses of gene expression and function will become a major area of investigation.
Comparative genomics involves analysis of two or more genomes to identify the extent of similarity of various features, or large-scale screening of a genome to identify sequences present in another genome. The examples below are merely meant to be illustrative of some of the applications.