DIS 815                                                          Louiza Patsis

                                                                        December 19, 2005

The Entrez Data Warehouse and Bioinformatics

Abstract

Every day, the amount of biological information in the world increases, especially since the completion of the sequencing of the human genome in 2003. Ideally, this information can be put on the internet to ensure access by the world’s scientists. Biologists and computer information scientists are creating databases, software, algorithms and more to handle this information need. The information must be linked together and made accessible to scientists throughout the world. Problems thus include the huge amount of data, the linking of databases containing relevant information for a search, the redundancy of deoxyribonucleic acid (DNA) sequences in databases such as GenBank, and the need for fast yet accurate algorithms. This paper explores the relatively new field of bioinformatics by focusing on the data warehouse Entrez, zeroing in on Medline and then narrowing the focus further to use cases involving text, genes and proteins, in order to illustrate some problems in bioinformatics. The paper also reviews some new research designed to tackle these problems.

1.0  Paper Organization

            This paper is intended as a general introduction to bioinformatics, using Entrez and Medline as example databases, and as a general introduction to bioinformatics problems and some of the research done to help solve them. It is written for non-biologists and non-programmers. The paper is composed of the following sections:

2.0  Background – This section will provide a background of bioinformatics, the problems involved in bioinformatics, and the research statement.

2.1  Some Problems in Bioinformatics

3.0  Research Questions – This section will show the research questions to be answered by the paper.

4.0  Brief Background on Human Genetics – This section will provide a brief background on human genetics.

5.0  Entrez – This section will provide a background of the Entrez data warehouse.

6.0  Medline – This section will provide background information on Medline.

6.1  Abstract and Full Text Searching in Medline: UMLS and MeSH - This section will provide background information on Medline, such as UMLS, the Metathesaurus, the Semantic Network, the Information Source Map and the SPECIALIST Lexicon.

6.2  Linking to Gene Banks and Linking to Protein Banks – This section will show how data sources in Entrez are linked.

6.2.1  GenBank – This section will provide information on GenBank.

6.2.2  Molecular Modeling Database (MMDB) – This section will provide information on MMDB.

6.2.3  RefSeq – This section will provide information on the database RefSeq.

7.0  Brief Description of Some Other NCBI Databases – This section will briefly describe other NCBI databases.

8.0  Text Searching in Biology – This section will outline some common problems in text searching in biology.

9.0  Matrices and Algorithms for Linking and Retrieving Information – This section will provide a synopsis of the most common algorithms used to find genetic and proteomic sequences.

9.1  Matrices

9.2  Algorithms

10.0  New Methods for Linking Bioinformatics Information – This section will show new methods to link bioinformatics information.

10.1  Algorithms – This section will review some recent studies on algorithms.

10.2  Concept Space – This section will review some recent studies on concept space.

10.3  Semantic Networks – This section will review some recent studies on semantic networks.

10.4  Filters – This section will review some recent studies on filters.

10.5  Clustering – This section will review some recent studies on clustering of biological data.

10.6  Wrappers – This section will review some recent studies on wrappers.

11.0  Use Cases with Medline – This section will review use cases with Medline.

12.0  Use Cases with Entrez – This section will review use cases with Entrez.

13.0          Conclusions

Much of the recent research in bioinformatics reviewed here appears in the March 2005 issue of the Journal of the American Society for Information Science and Technology and in the IEEE August 2004 Proceedings. Most papers cited will be from these two sources. The data warehouse Entrez was chosen because, along with the European Molecular Biology Laboratory (EMBL) and the DNA Data Bank of Japan (DDBJ), it is one of the biggest bioinformatics warehouses in the world, and it is of particular significance to American scientists. Medline is the largest source of biological text information in the world. Members of the general, non-scientific public also search Medline for articles. Natural and controlled vocabulary must be linked for effective information retrieval. For scientists, Medline information must be linked to gene and protein information so that they can get the most out of their searches.

2.0  Background

               Scientists want new genetic and molecular biology information on the internet to ensure access by the world’s scientists. If a scientist in India, for example, has found a new function for a protein in cell signaling that is involved in cancer or in the incorporation of a virus into a cell, a researcher in New Jersey may want to know. The second researcher may know of an enzyme that would affect this protein and its function, and thereby affect the health of millions of people. A scientist working in a pharmaceutical company would also like to know. The amount of genetic information keeps increasing, especially since the advent of DNA sequencing technology and the sequencing of the whole human genome, which makes the total body of biological information even greater. Much of this scientific information is transformed into information that a lay audience can understand for web sites such as that of the Centers for Disease Control and Prevention (CDC). Many journals do not allow scientists to publish unless new sequences referred to in the journal article are deposited into GenBank.
 
                From the beginning of 1998, the journal Computer Applications in the Biosciences (CABIOS) changed its name to Bioinformatics to reflect its current scope. The journal focuses on the priority fields of computational molecular biology and genome bioinformatics. The Journal of the American Society for Information Science and Technology published an issue on bioinformatics in March 2005. Several research studies were presented at the Institute of Electrical and Electronics Engineers (IEEE) meeting in Stanford, California in August 2004.

Bioinformatics, which is often used interchangeably with computational biology, is the use of techniques from applied informatics, mathematics, statistics and computer science to solve biological problems. More narrowly, computational biology usually refers to the development of algorithms for bioinformatics, and it will not be the focus of this paper. In the biology-mathematics-computer science triad, bioinformatics draws on biological systems, mathematics and computer science, while computational biology focuses on biology and mathematics.

Systems biologists seek to integrate different levels of information to understand how biological systems function. Systems biology includes the knowledge of possible genes, chromosomes, gene transcription and translation factors, and proteins (which include enzymes) in an organism, as well as techniques such as Western blotting, microarrays, and mass spectrometry to measure the changes in these molecules. Proteomics and genetics overlap. Proteomics is the study of proteins, including their three-dimensional structure. It can be more complicated than genetics. Systems biologists study the parts of a system, whether it is a cell, an organ or the whole body, hoping to come up with an understandable model of the whole system. One organism will have different protein expression in different parts of its body, in different stages of its life cycle and in different environmental conditions. Systems biologists do not focus on the elementary systems as molecular biologists do. Systems biologists think that much of this information is complex and redundant, or is not fully known. They rely on what they know to focus on organ and body systems. Systems biologists need bioinformatics to make sense of a vast amount of data.

Major research efforts in bioinformatics include sequence alignment to find synteny, gene finding, genome assembly, protein structure alignment, and protein-molecule interactions. A common thread in bioinformatics projects is the use of mathematical tools to extract useful information from noisy data produced by high-throughput biological techniques.

Biological systems include those of all animals. Human beings share a large fraction of their genes with organisms as different as yeast (Saccharomyces cerevisiae), worms and chickens. Computer systems that show gene data must be able to map gene data of different organisms to human data, human gene data from different genes to each other, human gene data of healthy and diseased genes to each other, and all of this to protein information and documented research. And this is just the story in a nutshell!

Biological systems start at the cell level, and lead to the tissue level, organ level, organ system level and to the body. Each of these levels is a system. A gene can be a system. A biological pathway can be a system. The most rapidly evolving area of biological systems now is genetics.

One major task in the area of bioinformatics is finding a gene or protein sequence related to a newly found gene or protein sequence. Evolution is conservative; sequence residues may change, but the chemical and physical properties needed to maintain similar biochemical and physiological processes are conserved. If the similarity is great enough, as determined by scoring after algorithm analysis (discussed more in Section 9.0), it is assumed that the sequences are homologous. This makes it more likely that what the new sequence encodes has features and functions similar to what the already-known, similar sequence(s) encode. This assumption is only as good as the algorithm and further laboratory experiments make it.

 Here is a typical scenario in bioinformatics: A pharmacologist wants to develop a drug against a certain bacterium. He needs to find a gene, out of the more than 1,000 genes of the bacterium, that would be a good target for a drug (Bartlett and Toms, 469). Does he perform thousands of laboratory experiments with many different molecules of known drugs, and variations of these drugs, to see what effect each of the gene-drug combinations will have? This is probably not feasible. One thing that the pharmacologist can do to speed up the process is to run a computer search on the sequences of the bacterial genes to see if they exist in other organisms (including different bacteria), then see what proteins the genes code for in those organisms, whether any of these proteins is vital to the bacterium’s survival, and when and how the bacterium uses them. Then the researcher can choose which genes are the best to target with a new or existing drug.

            Another scenario is a scientist who knows the three-dimensional structure of a protein and wants to find its genetic sequence. With linked databases such as those in Entrez, this is possible. From the genetic sequence, he is more likely to figure out what mutations will lead to what changes in the structure and function of the protein, and where the protein’s gene is on a chromosome, which affects how it is passed on during reproduction. A problem here would be that the algorithms and other computer techniques would have to show accurately what gene encodes that protein, taking into account non-deleterious and deleterious mutations, introns, and the fact that a gene can encode a pre-protein, which is a protein that is inactive until it changes to become active.

A third scenario, out of many more, is a scientist who finds what he knows is a mutated version of a gene and wants to find out if the mutation is deleterious. He can search databases for that mutated sequence to find proteins encoded by it and see if they are normal.

Stakeholders

            The stakeholders for all of this information are ultimately all the people of the world who can benefit from drugs or other products developed through the exchange of, and building upon, this scientific information. Besides drugs, these products can include pesticides that are not harmful to human beings, food additives that can help to prevent cancer, and fuel additives that reduce harmful emissions. The direct stakeholders who deal with complex genetic and molecular biology data are the scientists themselves. They include government, academic and commercial scientists. Other direct stakeholders are government officials, such as those at the NIH, and executives, such as those in pharmaceutical corporations.

2.1 Some Problems in Bioinformatics ***

A major problem in bioinformatics is still the huge amount of data: how to cross-reference it and how to make it available in an accurate and timely manner. As discussed further in Section 4.0, there is no standard way to name a gene. GenBank issues accession numbers to which all known names of a gene sequence are linked. However, human beings make mistakes, so this system is not mistake-proof. Matters are complicated further because different parts of a gene or protein may be sequenced. For instance, an enzyme has at least two regions. One region consists of a sequence of amino acids that must be cleaved from the enzyme to convert it to its active form. An enzyme that still has this region is called a pro-enzyme. An enzyme (or gene or protein sequence) that is found may be partial. This partial sequence may be similar to sequences found in other parts of the human genome or in genomes of other animals, but experiments are needed to verify what larger sequence it is part of. A pro-enzyme and its enzyme may or may not share the same accession number.

A problem in bioinformatics is how to develop algorithms that are high in both sensitivity and specificity. An algorithm high in sensitivity will retrieve a large number of sequences that may be similar to the one that you have, so that few true relatives are missed. An algorithm high in specificity will retrieve only sequences that are more likely to be truly related to the one you have, so that few unrelated sequences are returned. Often, two algorithms are used, and then laboratory results are used to see if the sequences match. This will be explained more in Section 9.0. A very in-depth explanation is beyond the scope of this paper.
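The trade-off between sensitivity and specificity can be made concrete with a small calculation. The sketch below is only an illustration: the function names and the counts in the example are invented here and do not come from any study cited in this paper. It assumes that each retrieved or missed sequence has already been labeled as truly related or not, for instance by laboratory work.

```python
def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Fraction of the truly related sequences that the search retrieved."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives: int, false_positives: int) -> float:
    """Fraction of the unrelated sequences that the search correctly left out."""
    return true_negatives / (true_negatives + false_positives)


# Hypothetical search: 100 related sequences exist and 90 were retrieved;
# 1,000 unrelated sequences exist and 50 of them were retrieved by mistake.
print(sensitivity(true_positives=90, false_negatives=10))
print(specificity(true_negatives=950, false_positives=50))
```

A search tuned to miss nothing tends to raise the first number at the cost of the second, which is one reason two algorithms are often used together.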

Another problem in bioinformatics is having the databases in a data warehouse linked. “Linked” means that if a user looks for protein sequence information in RefSeq, links to relevant journal articles in Medline and PubMed Central will appear on the search result page, as well as links to relevant DNA sequence, taxonomy and structure databases. In that way, a researcher can get a comprehensive view of, and access to, all of the information related to his sequence, which can prove crucial.

Two major problems in bioinformatics can thus be summarized in this way: With the increasing amount of bioinformatics data and the complexity in pairing up similar sequences, finding algorithms that are sensitive and specific, and linking all databases together in a comprehensive way is increasingly difficult.

 

 

*** In bioinformatics, a substitution matrix estimates the rate at which each possible residue in a sequence changes to each other residue over time. Substitution matrices are usually seen in the context of amino acid or DNA sequence alignment, where the similarity between sequences depends on the mutation rates as represented in the matrix.
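A toy example may help show how a substitution matrix is used to score an alignment. The sketch below is an assumption-laden illustration, not one of the matrices used by real tools: the scores (+2 for a match, -1 for a purine-purine or pyrimidine-pyrimidine change, -2 for any other change) are made up for this example, and the two sequences are assumed to be already aligned, without gaps.

```python
PURINES = {"A", "G"}  # the other two DNA bases, C and T, are pyrimidines


def substitution_score(a: str, b: str) -> int:
    """Look up the score for aligning base a with base b (made-up values)."""
    if a == b:
        return 2
    same_chemical_class = (a in PURINES) == (b in PURINES)
    return -1 if same_chemical_class else -2


def alignment_score(seq1: str, seq2: str) -> int:
    """Sum the substitution scores over an ungapped alignment."""
    if len(seq1) != len(seq2):
        raise ValueError("this sketch only scores sequences of equal length")
    return sum(substitution_score(a, b)
               for a, b in zip(seq1.upper(), seq2.upper()))


print(alignment_score("GATTACA", "GATTACA"))  # identical sequences
print(alignment_score("GATTACA", "GACTATA"))  # two changes within a class
```

The higher the total, the more similar the two sequences are under the chosen matrix; a different matrix can rank the same pair of sequences differently.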

 

 


3.0 Research Objectives and Questions

How is some new research in algorithms expanding what can be used to find and link biological information?

How thoroughly is information from different databases linked in Entrez?

4.0 Brief Background of Human Genetics

            The Human Genome Project was completed in 2003. It was a thirteen-year project coordinated by the United States Department of Energy and the National Institutes of Health. Contributions from the Wellcome Trust of the United Kingdom, Japan, France, Germany and China also made a big difference. Approximately 25,000 genes were identified. The sequences of three billion base pairs were determined. The information was stored in databases such as GenBank in the United States and the DNA Data Bank of Japan. The information is publicly available. Technologies are licensed to private companies, and federal grants are awarded for innovative research. Analysis of the data will continue for many years.

            What is a gene? Before that question is explored, one has to know what DNA is. DNA, or deoxyribonucleic acid, is the hereditary material in human beings and almost all other organisms (some viruses have only RNA). Almost every cell in a human body has the same DNA. It is located mostly in the nucleus of the cell. DNA is composed of the sugar deoxyribose, phosphates and four bases: adenine (A), guanine (G), cytosine (C), and thymine (T). In RNA, uracil (U) replaces T. Human DNA consists of about 3 billion bases. More than 99 percent of those bases are the same in all people! This gives one a sense of the power that less than 1% of genetic difference can have: no two people are alike!

DNA has a helical structure because the bases of its two strands pair with each other. A “hooks up to” T (or to U in RNA), and C “hooks up to” G. Three consecutive bases make up a codon. The structure of the double helix is like a twisted ladder, with the base pairs forming the ladder’s rungs and the sugar and phosphate molecules forming the vertical sidepieces of the ladder.
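The pairing rules and the grouping of bases into codons can be sketched in a few lines. The function names below are chosen only for this illustration, and the example strands are arbitrary.

```python
# Pairing rules for DNA: A pairs with T, and C pairs with G.
DNA_PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the bases that pair with each base of the strand, in order."""
    return "".join(DNA_PAIRS[base] for base in strand.upper())


def reverse_complement(strand: str) -> str:
    """Return the partner strand read in its own direction (reversed)."""
    return complement(strand)[::-1]


def split_into_codons(strand: str) -> list:
    """Cut a strand into groups of three bases, dropping any leftover bases."""
    usable = len(strand) - len(strand) % 3
    return [strand[i:i + 3] for i in range(0, usable, 3)]


print(complement("ATGC"))
print(reverse_complement("ATGC"))
print(split_into_codons("ATGGCCTA"))
```

Because each base has exactly one partner, either strand of the helix is enough to reconstruct the other.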

A certain sequence of codons has the information to form a protein. The order, or sequence, of the bases determines the information for building and maintaining an organism. A mutation can alter the protein or other molecule formed. A mutation can happen spontaneously, or can be caused by an external factor such as radiation. During a mutation, a codon changes because one of its bases changes. This may have a beneficial effect, as in evolution. Or it may have no effect, since a change in some codons often does not affect the protein or other molecule formed. Or it may have a deleterious effect: the wrong protein is formed, or the protein differs in structure even slightly. Enzymes such as DNA polymerase I help to make sure that mutations are corrected. Taking into account the number of times genetic information is copied, this is a big task. At the beginning of a gene’s coding sequence is a start codon and at the end is a stop codon; these mark where translation begins and ends. Molecules called transcription factors attach to regulatory regions near the gene, along with other enzymes, to start or stop transcription. A mutation that alters the start or stop codon, a regulatory region, the gene for a transcription factor, or the gene for any protein that affects the synthesis of the transcription factor (you get the idea) affects whether a gene is “on”, and can keep a gene from shutting off, as in cancer. In the simplest picture, one gene leads to one protein.

How Does DNA “Turn Into” Molecules?

 This is where RNA, or ribonucleic acid, comes in. The explanation here will be very elementary. There are various kinds of RNA. RNA is single-stranded. For genetic data to be copied, enzymes first cause the DNA helix to split. Messenger RNA, or mRNA, is formed as a copy of a single strand of DNA. This is called transcription. The mRNA carries the genetic data with it. Each codon of the mRNA specifies one amino acid, the building block of proteins. During translation, transfer RNA, or tRNA, carries one amino acid at a time to the growing polypeptide chain at the ribosome, the cell’s site of protein synthesis.
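The two steps can be imitated with a short program. This is a deliberately simplified sketch: it treats the input as the coding strand of the DNA (so transcription only swaps T for U), it assumes the reading starts at the first base, and its codon table holds only a handful of entries from the standard genetic code rather than all 64.

```python
# A small excerpt of the standard genetic code (mRNA codon -> amino acid).
CODON_TABLE = {
    "AUG": "Met",
    "UUU": "Phe", "UUC": "Phe",
    "GGU": "Gly", "GGC": "Gly", "GGA": "Gly", "GGG": "Gly",
    "CCU": "Pro", "CCC": "Pro", "CCA": "Pro", "CCG": "Pro",
    "UAA": "Stop", "UAG": "Stop", "UGA": "Stop",
}


def transcribe(coding_strand: str) -> str:
    """Write the mRNA that carries the same message: U replaces T."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> list:
    """Read the mRNA three bases at a time until a stop codon is reached."""
    peptide = []
    for i in range(0, len(mrna) - 2, 3):
        # Codons missing from this partial table are marked "?".
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "?")
        if amino_acid == "Stop":
            break
        peptide.append(amino_acid)
    return peptide


mrna = transcribe("ATGTTTGGCTAA")
print(mrna)
print(translate(mrna))
```

In the cell, the lookup that the dictionary performs here is carried out by tRNA molecules at the ribosome.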

            DNA and RNA carry an enormous amount of information. This includes mutations that may or may not be deleterious. The same mutation on a mouse gene may or may not correspond to the same disease on the corresponding human gene. In addition, large portions of DNA called introns carry information that is not used to form proteins, and scientists do not fully know what their use is; they may influence the structure of DNA. The portions of DNA used to make proteins are called exons. Now that scientists have the human genome, they must determine what the open reading frames are, and then what protein each one codes for, including variations of genes with mutations that do not alter the correct protein coding.
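A very rough idea of what finding open reading frames involves is sketched below. The assumptions are strong and are stated plainly: only one strand is scanned, introns are ignored, and a frame is reported whenever a start codon (ATG) is followed in the same reading frame by a stop codon. Real gene-finding software is far more elaborate, and the function name and the minimum length are choices made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_open_reading_frames(dna: str, min_codons: int = 2) -> list:
    """Return stretches that run from ATG to the next in-frame stop codon.

    min_codons is the minimum number of codons before the stop codon.
    """
    dna = dna.upper()
    frames = []
    for offset in range(3):          # the three possible reading frames
        start = None
        for i in range(offset, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i            # remember where this frame opens
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    frames.append(dna[start:i + 3])
                start = None         # look for the next start codon
    return frames


print(find_open_reading_frames("CCATGAAATTTTAGGG"))
```

Even this toy version shows why the task is hard at genome scale: every position can be read in three frames on each strand, and most candidate frames are not real genes.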

            Scientists are also interested to know where on a chromosome a gene is located, since during reproduction, chromosomes from two gametes cross over and exchange information. Knowing the location of genes can lead to knowing if a person is predisposed to a disease.

            The communication of genetic information has three components: structure, function and communication (MacMullen and Denn 2005, 449). Information flows from DNA to RNA to protein through transcription and translation. The structure and function of proteins are controlled by this information flow. The new proteins include hormones and enzymes that further control the structure and function of cells. MacMullen and Denn write that this genetic flow of information is similar to information flow across populations: there are issues of replication, duplication, transcription, translation, error detection and correction, and the effects of noise and errors (introns, mutations). These three components are interdependent. For instance, all DNA is present in all cells, but only some of it is expressed in some cells, and only some of that at certain times. As if this were not complex enough, there are the interactions of multiple genes and proteins. And as if that were not enough, there is no standard system for naming new genes! If someone finds in the huge genetic sequence of human beings that a gene encodes a protein, they have no way to name it that fits into a system that everyone around the world knows. They also have to check whether anyone else has found that gene before they can patent the function. GenBank issues accession numbers that are linked to all the known names given to a gene. In that way, a gene can be found using any of its names.
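The idea of one accession number tied to every known name of a gene can be pictured as a simple lookup table. The sketch below is not how GenBank is implemented; the class, the accession number "ZZ000001" and the three gene names are all invented placeholders for illustration.

```python
from typing import Dict, List, Optional


class GeneNameIndex:
    """Toy index linking every known name of a gene to one accession number."""

    def __init__(self) -> None:
        self._accession_by_name: Dict[str, str] = {}
        self._names_by_accession: Dict[str, List[str]] = {}

    def register(self, accession: str, *names: str) -> None:
        """Attach one or more names to an accession number."""
        known = self._names_by_accession.setdefault(accession, [])
        for name in names:
            # Names are matched without regard to case or outer spaces.
            self._accession_by_name[name.strip().lower()] = accession
            if name not in known:
                known.append(name)

    def lookup(self, name: str) -> Optional[str]:
        """Return the accession number for a name, or None if unknown."""
        return self._accession_by_name.get(name.strip().lower())

    def names_for(self, accession: str) -> List[str]:
        """Return every name registered under an accession number."""
        return list(self._names_by_accession.get(accession, []))


index = GeneNameIndex()
# Invented placeholder record, not a real GenBank entry.
index.register("ZZ000001", "growth factor X", "GFX", "protein 17")
print(index.lookup("gfx"))
print(index.names_for("ZZ000001"))
```

Whichever name a scientist happens to know, the lookup leads to the same record, and from that record to all the other names.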

            Denn writes that, once technology for science came about, the complexity of organisms was revealed. No longer did biological science consist only of the Hooke and Leeuwenhoek microscopes or of classifying organisms. Through technology, scientists can find out what they need to know about genes and proteins. Here are some things they want to know:

Sequence alignment

Algorithms and software are available for sequence alignment of nucleic or amino acids. Informatics is involved in generating suggested alignments automatically, assessing the probability that the correspondences occurred at random or are a real match, and displaying the information to the scientist. Scientists identify genes and proteins by comparing their nucleic acid or amino acid sequences with known sequences. For instance, if a scientist identifies the gene atccgatggcatcgta, they will get a clue to possible surrounding sequences, or to proteins encoded by this gene, by seeing that a part of it is similar to a known bacterial gene, gatggcatcgta. With proteins, an example is a scientist who identifies the amino acid sequence proline, leucine, isoleucine, phenylalanine, glycine, glutamine, leucine, proline, glycine and is not sure of the protein’s structure or function. He can find possibilities for the structure or function by seeing that this sequence is similar to a known sequence: proline, leucine, isoleucine, glycine, glutamine, leucine, proline, glycine.
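The DNA example above can be checked with a short program. The sketch below finds the longest stretch that two sequences share exactly; this is a much cruder comparison than the alignment algorithms discussed in Section 9.0 (it allows no mismatches or gaps), and the function name is chosen only for this illustration.

```python
def longest_shared_stretch(a: str, b: str) -> str:
    """Return the longest run of characters that appears in both a and b."""
    best_length = 0
    best_end = 0                      # end position of the best run within a
    previous = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the run that ended at the previous pair of positions.
                current[j] = previous[j - 1] + 1
                if current[j] > best_length:
                    best_length = current[j]
                    best_end = i
        previous = current
    return a[best_end - best_length:best_end]


new_gene = "atccgatggcatcgta"
known_bacterial_gene = "gatggcatcgta"
print(longest_shared_stretch(new_gene, known_bacterial_gene))
print(new_gene.find(known_bacterial_gene))   # where the known gene begins
```

Here the whole known bacterial gene turns up inside the new sequence, starting at its fifth base, which is the kind of clue the paragraph describes.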

There are problems in creating matrices and phylogenetic trees that will aid in accurate algorithms to assess whether sequences are homologous. Describing the matrices and phylogenetic trees is beyond the scope of this paper. Often, two algorithms are used to see if gene or protein sequences are related, and then laboratory experiments are used to back up the computer findings.

Structure Prediction

Molecular biologists want to know the structure of proteins. A protein’s structure changes when it is in an active state. Also, protein structures often determine whether a protein can enter a cell and what its pathway in the cytoplasm is. Threading algorithms are involved. This protein information and structure needs to be matched to genes, mutated or not, and to textual information. Here is an example of the importance of protein structure in cell biology: Large and small proteins vary in structure. Their bonds curve in different directions, revealing or hiding certain atoms depending on whether they are activated or not. Often, activation occurs once a protein is phosphorylated. For instance, the receptor for the hormone insulin on cells is made up of α and β subunits. When an insulin molecule combines with its receptor, the β subunits are phosphorylated, changing structure and setting off a series of phosphorylations of intracellular proteins. These phosphorylations in turn change the shape of these proteins, altering their activity and causing a biological response. In this case, the end response is that the cell takes up glucose from the blood.

Function Prediction

Once a gene is identified, its function needs to be predicted. If a sequence is found, the scientist needs to map it to a sequence whose function is known. If there is gene homology, the genes share a number of base pairs, implying that the sequences are from species with a common ancestor or have similar functions. With all of the sequences out there, with the fact that different codons can code for the same amino acid, and with mutations taken into consideration, this can be a difficult task.
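The point about different codons coding for the same amino acid can be illustrated with two short, made-up DNA sequences that differ in a quarter of their bases yet spell out the same amino acids. The codon table below is a small excerpt of the standard genetic code, and the sketch assumes both sequences are read from their first base and contain only codons listed in the table.

```python
# Excerpt of the standard genetic code (DNA codon -> amino acid).
CODONS = {
    "ATG": "Met",
    "CTT": "Leu", "CTC": "Leu", "CTA": "Leu", "CTG": "Leu",
    "GGT": "Gly", "GGC": "Gly", "GGA": "Gly", "GGG": "Gly",
    "CCT": "Pro", "CCC": "Pro", "CCA": "Pro", "CCG": "Pro",
}


def translate(dna: str) -> list:
    """Translate a DNA string codon by codon using the excerpt above."""
    return [CODONS[dna[i:i + 3]] for i in range(0, len(dna), 3)]


def base_identity(seq1: str, seq2: str) -> float:
    """Fraction of positions at which two equal-length sequences agree."""
    matches = sum(a == b for a, b in zip(seq1, seq2))
    return matches / len(seq1)


gene_a = "ATGCTTGGTCCT"   # made-up example sequence
gene_b = "ATGCTCGGACCG"   # differs from gene_a in three bases

print(base_identity(gene_a, gene_b))
print(translate(gene_a))
print(translate(gene_a) == translate(gene_b))
```

Comparing the two sequences base by base would understate how closely related their products are, which is one reason protein-level comparison is often used alongside DNA-level comparison.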

Comparative Genomics, Proteomics, and Metabolomics

This refers to comparing genes, proteins, and metabolic molecules of specific cell reactions across species.

For all of the above, most scientists assume that similarities across genomes are meaningful from an evolutionary point of view and represent the conservation of useful genetic features or divergence of genetic function in response to the needs of particular organisms adapting to environmental conditions.

5.0 Entrez

The Entrez Global Query Cross-Database Search System is a huge, powerful search engine that allows access to databases at the National Center for Biotechnology Information (NCBI) website. NCBI is part of the National Library of Medicine (NLM), which is a part of the National Institutes of Health (NIH). NCBI was founded in 1988 as a national resource for organizing and delivering molecular biology information (Rapp and Wheeler 2005). At first, Entrez included three databases: Medline for text, nucleotide sequences and protein sequences. Software tools are available by World Wide Web (WWW) browsing or by FTP. Entrez includes text databases such as Medline, sequence databases, and structural databases. More recently, it started to include the molecular modeling database (3D protein structures), the unique human gene sequence collection, a gene map of the human genome and a taxonomy browser, and NCBI coordinates with the National Cancer Institute to provide the Cancer Genome Anatomy Project. For a visual map of how databases are linked in Entrez, see Appendix II.

For 15 years, NCBI has maintained the GenBank gene and protein sequence database and the Basic Local Alignment Search Tool (BLAST) for comparing sequences. Now it offers more than 30 publicly available database sources and search tools (Rapp and Wheeler 2005). It has laid the foundation for text-based access to diverse databases by offering a rich set of links among records within and across databases (Rapp and Wheeler 2005). NCBI databases cover gene sequences, quantitative information on levels of gene expression, and a taxonomy resource to classify organisms on the basis of sequence data. See Appendix III for a list of Entrez databases. Entrez has cross-referencing within each database and across databases. It offers links to outside databases and offers methods for outside data providers to link to Entrez. At http://www.ncbi.nlm.nih.gov/About/tools/index.html a user can find descriptions of all Entrez databases. Through the main NCBI search webpage http://www.ncbi.nlm.nih.gov or through Medline, one can use the “all databases” button to access information about a molecule or other keyword in all databases.

Most of the Entrez databases are used by knowledgeable scientists and physicians. A tutorial is needed to be able to use the system. For any given protein, there may be tens of thousands of entries in the databases linked in Entrez. For instance, a search under “All Databases” for DNA polymerase I, just one of the enzymes involved in DNA replication, returns tens of thousands of results (see Appendix IV). Upon selecting the 3D Domains results, the user finds 3,899 entries, each with links to textual and graphic data.

 

6.0 Medline

            MEDLARS online, or Medline, is the world’s most heavily used medical database, complete with its own controlled vocabulary called Medical Subject Headings (MeSH). In 1971 Medline became one of the first online databases for information retrieval. Medline is a bibliographic database published by NLM. It contains information on medical documents and records, providing abstracts and citations for an enormous number of medical journal articles, and the full text of some in its PubMed Central division. Medline contains more than 10 million records from more than 4,200 journals that publish information about the causes, prevention and treatment of disease and injury (Katcher 1999). It is accessed almost 20,000 times a day (Kingsland III 1993). Each record is read by a skilled indexer, who assigns it about a dozen subject headings drawn from about 20,000 MeSH subject headings. PubMed Central, launched in February 2000 with content from the Proceedings of the National Academy of Sciences and from Molecular Biology of the Cell, makes peer-reviewed full text journal articles available on Medline. For a list of journals allowing full text material to be available on PubMed Central, see http://www.pubmedcentral.gov/front-page/fp.fcgi.

6.1 Abstract and Full Text Searching in Medline: UMLS and MeSH

            In the area that is most important in Medline searching, the assigning of the MeSH controlled vocabulary terms, the database is remarkably accurate (Coletti and Bleich). The Unified Medical Language System™ (UMLS) was developed in the 1980s by biomedical informatics specialists. The NLM and Lexical Technologies in Alameda, California, built the UMLS Knowledge Sources to improve the ability of computer programs to understand the biomedical meaning of user inquiries and to retrieve and integrate relevant medical information from the Internet (Ye et al. 2002). People who work on UMLS build intellectual “middleware”: electronic knowledge sources and related lexical programs that help systems designers build applications that can interpret user queries and find relevant information. The NLM continually expands UMLS products to update them and to improve their utility.

            UMLS was built to overcome two important barriers to the development of information systems: the disparity of terminologies used in different information sources by different users, and the huge amount of information for each subject (Humphreys 1998). The UMLS project has produced four knowledge sources designed for system developers: the Metathesaurus, the Semantic Network, the Information Sources Map (ISM) and the SPECIALIST Lexicon.

            The Metathesaurus is a concept-based vocabulary that links MeSH to text words and to other thesauri. It comprises over 1 million biomedical concepts and 5 million concept names, from over 100 controlled vocabularies and classification systems used in patient records, bibliographic databases, administrative health data and full text databases such as Medline. MeSH is one of its controlled vocabularies. The Metathesaurus provides context and inter-context relationships between various coding systems and vocabularies, giving a common basis for information exchange among the variety of clinical databases and systems. Identical or almost identical concepts are linked together, along with the hierarchical context from the different vocabularies and the relationships between the concepts. The Metathesaurus itself is produced by automated processing of machine-readable versions of the source vocabularies, followed by human editing and review. The Metathesaurus allows a computer program to interpret information, interact with users to restructure queries, identify relevant databases, and link abbreviations, lexical variants, and systems to appropriate words and concepts. It provides definitions and information on synonymy, isa-relations and co-occurrence of terms, and contains a semantic type for every concept.

            The ISM has a long-term goal to create a software environment in which a user can pose a biomedical query. It is composed of two parts. The first part describes biomedical information sources and is like an online catalog of network-accessible information. It describes the publicly available databases of NLM, as well as biomedically pertinent databases outside of Entrez. The second part contains prototypes of web-based applications that manage and use the ISM database. One prototype is the Apprentice, a first-stage prototype that allows an information provider to register a new information source with the ISM database. Another prototype is the Sourcerer, a third-stage prototype that accepts an English natural language biomedical query and returns a list of sources with hypertext links. It helps the user reduce the original query to concepts known to the Metathesaurus and Semantic Network.

            The Semantic Network has 135 semantic types such as organisms, and 543 relationships, and is designed to categorize concepts in the UMLS Metathesaurus and provide relationships among the concepts. First a Metathesaurus concept is established, and then it is connected to the most specific semantic type from the Semantic Network. Links between semantic types such as “isa” are hierarchical. Others are non-hierarchical: physically related to, spatially related to, temporally related to, functionally related to and conceptually related to. The SPECIALIST Lexicon provides access to lexical records for single words or multi-word terms. The records also carry morphological and orthographic information, such as plural forms, verb tenses, and comparative and superlative forms of adjectives. Both common words and scientific words are part of the lexicon.

6.2 Linking to Gene Banks and Linking to Protein Banks

            Molecular biologists and other scientists who work with genetic data need answers to questions such as the following:

1.      Does the DNA sequence that I have sequenced come from the sequence of a certain animal? Perhaps the proteins translated from the gene have been “discovered”.

2.      Does the DNA sequence that I have sequenced contain a deleterious mutation?

3.       Which bacterial species have a protein that is related in lineage to a certain protein whose amino acid sequence I just delineated?

4.      Genes from what other part of the human genome or from an animal genome encode proteins with the same structure as the protein structure I have just determined? What are the functions and biological pathways (metabolomics) of those proteins?

Sequences are strings of letters (representing nucleic acids or amino acids) that can be read by a human being, but cannot be understood until compared with other sequence data. This comparison is almost impossible for a human being to do by hand, especially given the number of known sequences against which a new sequence must be compared.

Many databases exist that contain vast amounts of genetic and scientific information. As was discussed above,  these databases must be linked. For the purposes of this study, the databases GenBank and Medline will be reviewed, while a few others will be described briefly.

6.2.1 GenBank

NCBI produces the GenBank database, which is an annotated collection of all publicly available nucleotide sequences and their protein translations. To produce this database, NCBI collaborates with the European Molecular Biology Laboratory (EMBL) Data Library of the European Bioinformatics Institute (EBI) and the DNA Data Bank of Japan (DDBJ). GenBank and its collaborators receive sequences from individual laboratories and huge sequencing centers all over the world. GenBank contains sequences from more than 100,000 distinct organisms. The genomes of 130 microbes and 10 higher organisms, such as human beings, have been completely sequenced and are available on GenBank (Rapp and Wheeler 549). GenBank contained over 29 billion nucleotide bases in 2003, and doubles once every 10 months. GenBank staff assign an accession number to each submitted sequence and perform quality assurance before they accept it. Duplicate data are accepted, since they can be useful for verification and quality control, and scientists often contribute unique information through the biological annotation that accompanies a sequence submission. However, data redundancy and the scattering of pieces of biological annotation across different records can confound efforts to analyze and understand data and apply it to further research. Other databases, also part of NCBI and accessible by Medline, remove the redundancies: UniGene, UniSTS, and RefSeq are some of them.

6.2.2. Molecular Modeling Database (MMDB)

This is a subset of the world repository Protein Data Bank (PDB) and SWISSPROT. MMDB takes structural, but not theoretical model, information from PDB. It has three-dimensional structures of proteins obtained from X-ray crystallography and NMR spectroscopy, as well as future descriptions of biomolecules.

6.2.3 RefSeq

            RefSeq is a nonredundant source of reference sequences that can serve as sequence standards for genome annotation. It is a reference for gene characterization, mutation analysis, expression studies and polymorphism discovery. It is also a database of computationally derived transcript and protein sequences for human and more than 2,000 other organisms. The most reliable of NCBI’s human gene models are produced from RefSeq transcript sequences aligned to the human genomic sequence and used as a basis of gene annotation for the human genome. These transcript-based gene assignments can be supplemented by assignments based on the predictions of gene-finding programs. For viruses, finding nonredundant information is tricky due to the high number of strains, isolates and mutants.

 

7.0 Brief Description of Some Other NCBI Databases

PopSets – This database contains sets of aligned sequences resulting from population, phylogenetic, or mutation studies. It is valuable in that its information can be used to analyze population variation or the evolutionary relatedness of organisms. It is derived from nucleotide and protein sequences of GenBank.

Genome – contains assembled genomic data of over 900 species with complete or incomplete sequences. Complete genomes of organisms can be viewed as a whole with  biological annotation of genetic and other biological features.

Unigene – This database contains clusters of transcript sequences from GenBank. It focuses on sequences of gene transcripts and was the first of several sequence resources developed by NCBI. It removes redundancy in GenBank by automatically partitioning mRNA and EST sequences. ESTs are small pieces of DNA that represent genes with a known sequence. They are used as tags to identify a gene out of a portion of chromosomal DNA by matching base pairs.

UniSTS – This database takes sequence tagged sites (STS) out of the redundancy in GenBank. STS are short genomic sequences that occur in particular places in the genome and serve as markers.

Taxonomy – This database is a sequence-based taxonomy for the classification of organisms. It uses NCBI’s taxonomy database. It contains over 133,000 species. About 1,400 organisms are added per month (Rapp and Wheeler 543).

Sequence Variations of Single Nucleotide Polymorphisms (SNPs) – SNPs occur frequently in a genome and are excellent biomarkers. This database is a central repository for SNPs. More than nine million academic and commercial records are contained in this database.

Online Mendelian Inheritance in Man (OMIM) – This database is a catalog of human genes and disorders produced at the Johns Hopkins University. It is mostly text-based. This database provides information on gene phenotypes, diseases, inheritance patterns, gene locations on chromosomes, gene polymorphisms and relevant published literature. This is a highly curated database that is peer-reviewed often.

8.0 Text Searching in Biology

Bartlett and Toms (2005) conducted a study of the information-seeking behavior of researchers in bioinformatics. Scientists from different institutions in the US and Canada were observed, asked to think aloud, and interviewed about finding sequence information across several databases. The authors found that the researchers tended to use an iterative “berry-picking” technique in which they combined information from several databases. Often, one step was repeated using multiple tools. Since each tool is based on a unique set of algorithms, it is possible to get different results even when conducting similar analyses. The more consistent the results from different tools, the higher the level of confidence.

One interview question was what could be done to improve the procedures described by participants. The most common response concerned inconsistency among bioinformatics sources. Many tools had different structures, interfaces, sets of parameters and outputs. The authors also found that, after an indication of gene function was found by computer, the researchers verified the results by laboratory research. The computer work gave them only an identification of the gene, a higher probability of a function for the gene, or a narrowed-down list of genes to research for a drug. Time and money were saved. The authors came up with a sixteen-step protocol that can be followed to find the function of a gene:

1. Coding Assembly – obtains the full length of a sequence, whether by searching for it, or by putting smaller pieces of sequence together 
2. ORF Prediction – Predicts the open reading frame
3. Translation – Converts nucleic acid sequence to amino acid sequence
4. EST Expression – Determines where and when the gene is expressed
5. Genome Location – Identifies the location of the gene sequence in the genome
6. Decision Point 1 – Choice between homology path or domain/motif path
7. Homology Searching – Looks for similar sequences
8. Multiple alignment – compares two or more sequences to identify and align areas of sequence identity
9. Phylogenetic Analysis - Determines the evolutionary relationship between genes
10. Decision Point 2 – Choice between one-step and multi-step approach to domain/motif analysis
11. Domain/motif analysis – Looks for sequence patterns characteristic of known structures or functions
12. Transmembrane Region Analysis – Identifies regions of sequence likely to form transmembrane regions
13. Cellular Localization – Predicts the location in the cell of the protein produced by the gene
14. Secondary Structure Analysis – identifies secondary structure elements  that form the basis of the three-dimensional structure of the protein
15. Threader – Finds similarity based on chemical characteristics of each sequence component (amino acid), rather than on sequence alone
16. Functional Analysis – Identifies characteristics (chemical, structural, and functional) of the putative protein

The protocol is not entirely linear. Steps 1-5 are linear. Step 6 is the major decision point, at which the scientist decides whether to use the homology path or the domain/motif path. In the homology path, the scientist focuses on analyzing a complete protein sequence, and then on identifying and comparing related proteins. The scientist relies on the proteins from a given family having similar functions. This path is comprised of steps 7 to 9. In the domain/motif path, the analysis is based on dissecting the sequence and finding significant patterns within it. Steps involve comparing the sequence against a database of characterized sequences and looking for a pattern characteristic of a certain function. This path is comprised of steps 11 to 16.

The next major decision point is at step 10. Here the scientist decides between a multistep (steps 11 to 15) or single-step (step 16) approach. Each pathway searches for sequence patterns characteristic of specific functions, but for step 16 the scientist would use bioinformatics tools not focused on a particular function. The scientist would identify sequence patterns associated with a variety of structures or functions. Step 16 is quicker but not as rigorous.

This research shows how much guesswork is involved in sequencing and finding protein function, and how many different tools often need to be used and their results cross-referenced.
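Steps 2 and 3 of the protocol (ORF prediction and translation) are the easiest to illustrate in code. The following is a minimal Python sketch, not the tools the study participants used: it assumes the standard genetic code, scans only the three forward reading frames, and treats an ORF as the stretch from the first ATG to the next in-frame stop codon. The function names `longest_orf` and `translate` are hypothetical.

```python
# Standard genetic code, built from the conventional T, C, A, G codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def longest_orf(dna):
    """Step 2 (simplified): longest ATG...stop ORF in the three forward frames."""
    dna = dna.upper()
    best = ""
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first start codon seen in this frame
            elif start is not None and CODON_TABLE.get(codon) == "*":
                orf = dna[start:pos + 3]  # include the stop codon
                if len(orf) > len(best):
                    best = orf
                start = None
    return best


def translate(orf):
    """Step 3: convert a nucleic acid sequence to an amino acid sequence."""
    protein = []
    for pos in range(0, len(orf) - 2, 3):
        amino_acid = CODON_TABLE.get(orf[pos:pos + 3], "X")  # X = unknown codon
        if amino_acid == "*":  # stop codon ends the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


orf = longest_orf("CCATGGCTTGGTAAGG")
print(orf, translate(orf))
```

Real ORF finders also examine the reverse strand and apply length thresholds; this sketch omits both to keep the idea visible.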

9.0 Matrices and Algorithms for Linking and Retrieving Information

            Finding similarities in gene or protein sequences leads scientists to conclude that those sequences had a common ancestor. From this, structural and functional characteristics of a protein can be inferred. Given the large amounts of information, mutations, the fact that different codons can code for the same amino acid, and more, this can be a tricky process.

            As mentioned in Section 2.1, some proteins have different forms, such as pro-enzyme forms that need to be cleaved before a protein is active. A scientist may find that his partial amino acid sequence is similar to another known one, but one of them may be for a pro-enzyme. That is hard to determine. For use cases, see Section 12.0.

9.1 Matrices

Sophisticated algorithms are needed to determine similar sequence regions among genes and proteins, and to make this information accessible and retrievable from databases. These algorithms use various matrices (depending on the algorithm) that line up gene or protein sequences in a certain way, together with mathematically computed methods of scoring the importance of sequence similarities in certain parts of the sequence. Not all nucleic acid (for genes) or amino acid (for proteins) changes are equally likely or harmful.

Assumptions that take place in sequence alignment are:

·         The sequences sought have a common ancestor;

·         The actual path of evolution requires the fewest evolutionary events;

·         Not all substitutions are equally likely, so they should be weighted; and

·         Insertions and deletions are less likely than substitutions and weights should account for this.

Different matrices and scoring methods are appropriate for different animals or degrees of evolutionary divergence. The algorithms, matrices and scoring methods help determine whether two nucleic acids or two amino acids are aligned because of a common ancestor gene or protein, or by chance. The matrices for amino acids are more complicated since there are 20 amino acids, as opposed to five nucleic acid bases. A simple example of a score is this: a mutation between the two-ring nucleic acids A and G, or between the one-ring nucleic acids C and T, is called a transition and is given a smaller penalty than a transversion, which is a change between a two-ring and a one-ring nucleic acid. This assumes that a transition is less likely to result in a significant change in function than a transversion.
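The transition and transversion example can be written as a small scoring function. This Python sketch is only an illustration: the score values in `SCORES` are invented for the example and are not taken from any published matrix, and the function names are hypothetical.

```python
PURINES = {"A", "G"}      # two-ring bases
PYRIMIDINES = {"C", "T"}  # one-ring bases

# Illustrative values only (an assumption): a transition is penalized
# less than a transversion.
SCORES = {"match": 1, "transition": -1, "transversion": -2}


def classify_substitution(x, y):
    """Label a pair of aligned bases as match, transition or transversion."""
    if x == y:
        return "match"
    same_ring_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return "transition" if same_ring_class else "transversion"


def score_ungapped(seq1, seq2):
    """Sum the per-position scores of two equal-length aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(SCORES[classify_substitution(a, b)] for a, b in zip(seq1, seq2))


print(classify_substitution("A", "G"))  # transition
print(classify_substitution("A", "C"))  # transversion
print(score_ungapped("ACGT", "GCTT"))   # -1 + 1 - 2 + 1 = -1
```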

Two examples of amino acid substitution matrices are the PAM250 matrix, developed by Margaret Dayhoff, and the Blocks Substitution Matrix (BLOSUM). PAM matrices are labeled by how many accepted changes have occurred for every 100 amino acids. For long sequences, BLOSUM has been found to be more effective. It compares a number of divergent sequences. BLOSUM matrices are labeled by how much of the sequence remains unmutated between the sequences compared. A high PAM number therefore corresponds to a low BLOSUM number. Each matrix is sensitive to different things. For instance, a BLOSUM matrix is more tolerant of mismatches between the amino acids cysteine and tryptophan than the PAM250 matrix. Different matrices score the importance of mismatches based on different structures. For instance, the PAM250 matrices use a tree-like evolutionary model of sequence changes, while the BLOSUM matrices use highly conserved blocks of sequences against which new sequences are compared. PAM250 matrices are based on mutations observed throughout a global alignment, including both highly conserved and highly mutable regions. The BLOSUM matrices are based only on highly conserved regions, or blocks of sequences. Describing this in detail is beyond the scope of this paper, but it gives an idea of the complexity involved and of the difficulty of knowing that one is completely correct in finding matching sequences.

9.2 Algorithms

Each algorithm places different heuristic restrictions on the simple model of sequence evolution. The Needleman-Wunsch algorithm (NW) is an example of dynamic programming, and is guaranteed to find the alignment with the maximum score. Dynamic programming is a method for reducing the runtime of algorithms that exhibit the properties of overlapping subproblems and optimal substructure. Needleman-Wunsch, proposed in 1970 by Saul Needleman and Christian Wunsch, was the first application of dynamic programming to biological sequence comparison. The algorithm finds full-length, or global, similarities, with the first positions of the two sequences aligned.
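The dynamic programming idea can be sketched in a few lines of Python. This is a minimal illustration of global alignment in the Needleman-Wunsch style, assuming a simple match/mismatch score and a linear gap penalty; the parameter values are arbitrary choices for the example, and production tools use substitution matrices and more elaborate gap models.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment: return (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Each cell reuses the three neighbouring subproblems (overlapping subproblems).
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = len(a), len(b)
    top, bottom = [], []
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


print(needleman_wunsch("ACGT", "AGT"))
```

The table has one cell per pair of prefixes, so time and memory both grow with the product of the two sequence lengths.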

The Smith-Waterman algorithm (SW), developed by Temple Smith and Michael Waterman in 1981, is the most rigorous in that it places no heuristic restrictions on the evolutionary model or on finding sequence similarities. The pattern of changes between a query sequence and a homologue in the database can be incompatible with the heuristic restrictions of some algorithms; Smith-Waterman does not pose this concern. It is the most sensitive and least specific algorithm. Like Needleman-Wunsch, it is a dynamic programming algorithm, and it has the desirable property that it is guaranteed to find the optimal local alignment with respect to the scoring method being used. Local comparison is often needed when a scientist has only a part of a sequence. NW, by contrast, aligns full sequences from the first to the last nucleic acid or amino acid. This is called global alignment, and it can be appropriate if a scientist has a full gene or protein sequence, although often he or she may not know whether that is the case. Because Smith-Waterman is so rigorous, it is demanding of time and memory. A sequence comparison can take about a week to complete, and the computation and memory requirements are often a problem.
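The difference between local and global alignment shows up in only two places in the code: scores are never allowed to fall below zero, and the traceback starts at the highest-scoring cell rather than at the corner. The following Python sketch illustrates local alignment in the Smith-Waterman style under assumed match, mismatch and linear gap scores; the values are arbitrary and chosen only for the example.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment: return (best_score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,                          # start a fresh local alignment here
                score[i - 1][j - 1] + sub,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


# The shared core is found even though the flanking regions differ.
print(smith_waterman("TTTGATTACATTT", "CCCGATTACACCC"))
```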

Two tools that have become popular to overcome these problems are BLAST and FASTA. They are heuristic approximations to the rigorous algorithms, and they are faster than Smith-Waterman because they place greater restrictions, in terms of weights or scores, on the sequence alignments that they report. BLAST has a free server, which makes it popular. The sensitivity of Smith-Waterman is higher. In other words, it will find more homologs of a sequence. But BLAST and FASTA are usually more selective or specific, i.e. they will find the exact homolog of the exact sequence for which you are searching. They are better for pathology cases. Smith-Waterman needs special-purpose hardware or a supercomputer to run in a practical amount of time.

BLAST is an algorithm used to compare DNA, RNA and protein sequences. Given a library or database of sequences, a BLAST search enables a researcher to look for sequences that resemble a given sequence of interest. BLAST is one of the most widely used bioinformatics algorithms, probably because it addresses a fundamental problem and because the algorithm emphasizes speed over sensitivity. This emphasis on speed is vital to making the algorithm practical on the huge genome databases currently available, although subsequent algorithms can be even faster. BLAST can do sequence comparisons against the GenBank DNA database in less than 15 seconds. BLAST includes the first and fourth terms of the Smith-Waterman equation and uses a simplification of the Smith-Waterman algorithm called the maximal segment pairs algorithm. This does not allow for gaps – insertions or deletions – in the DNA sequence. This allows for faster computing, but it is less accurate.
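To make the idea of a maximal segment pair concrete, the Python sketch below finds the highest-scoring ungapped segment shared by two sequences. It is an exhaustive search over every diagonal, written only to show what is being looked for; it is not BLAST's actual procedure, which uses a faster heuristic to avoid examining every diagonal. The scores and the function name are assumptions made for the example.

```python
def maximal_segment_pair(a, b, match=1, mismatch=-1):
    """Return (score, start_in_a, start_in_b, length) of the best ungapped segment."""
    best = (0, 0, 0, 0)
    # Each offset is one diagonal: position i in a is paired with i - offset in b.
    for offset in range(-(len(b) - 1), len(a)):
        i, j = max(offset, 0), max(-offset, 0)
        running, run_start, step = 0, 0, 0
        while i + step < len(a) and j + step < len(b):
            if running <= 0:
                # A non-positive prefix can never help, so restart the segment here.
                running, run_start = 0, step
            running += match if a[i + step] == b[j + step] else mismatch
            if running > best[0]:
                best = (running, i + run_start, j + run_start, step - run_start + 1)
            step += 1
    return best


score, start_a, start_b, length = maximal_segment_pair("TTTGATTACATTT", "CCGATTACACC")
print(score, "TTTGATTACATTT"[start_a:start_a + length])
```

Because no gaps are allowed, a single insertion or deletion splits a true homology into two shorter segments, which is the loss of accuracy described above.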

FASTA is a sequence alignment package first described by David J. Lipman and William R. Pearson in 1985. It was originally designed for protein sequence similarity searching. Now it can run DNA-DNA, DNA-protein and protein-protein searches. BLAST is more sensitive than FASTA for proteins, while FASTA is more sensitive than BLAST for nucleic acid sequences (DNA and RNA).

10.0 New Methods for Linking Human Biological Information in Gene and Protein Sequences

10.1 Algorithms

For Text

            Liu et al. (2004) extracted keywords from Medline abstracts that described the most prominent functions of certain genes and used the resulting weights as feature vectors for gene clustering. They used two keyword weighting schemes: normalized z-score and term frequency-inverse document frequency (TFIDF). They write that algorithms cluster genes based on similarities in expression profiles. The authors believe that clustering genes based on functional keyword associations could lead to discovering novel relationships among gene sets. The quality of keyword lists is important for clustering methods. They compared the z-score and TFIDF method. The latter outperforms the former in quality of keywords as judged by precision and recall analysis. This study shows how text in Medline can ultimately be used to find gene similarities and function and how ever-improving, sophisticated computer techniques are involved.
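            The TFIDF weighting scheme can be sketched briefly. The Python example below is a generic illustration, not the exact formulation of Liu et al.: it assumes term frequency is a keyword's share of all keywords collected for a gene, and inverse document frequency is the natural logarithm of the number of genes divided by the number of genes that mention the keyword. The gene names and keywords are invented.

```python
import math
from collections import Counter


def tfidf_weights(gene_keywords):
    """Map each gene to {keyword: TFIDF weight}, treating each gene as one document."""
    n_genes = len(gene_keywords)
    doc_freq = Counter()
    for keywords in gene_keywords.values():
        doc_freq.update(set(keywords))  # count each keyword once per gene
    weights = {}
    for gene, keywords in gene_keywords.items():
        counts = Counter(keywords)
        total = len(keywords)
        weights[gene] = {
            term: (count / total) * math.log(n_genes / doc_freq[term])
            for term, count in counts.items()
        }
    return weights


def cosine_similarity(u, v):
    """Similarity of two sparse keyword vectors, usable as input to clustering."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical keyword lists extracted from abstracts about three genes.
genes = {
    "geneA": ["kinase", "apoptosis", "kinase"],
    "geneB": ["kinase", "transport"],
    "geneC": ["apoptosis", "tumor"],
}
w = tfidf_weights(genes)
print(w["geneB"])
print(cosine_similarity(w["geneA"], w["geneB"]))
```

A keyword that appears for every gene receives a weight of zero, which is why the scheme favors keywords that distinguish one gene from another.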

For Sequences

            Rodriguez, Carazo and Trelles (2005) saw how important discovering homologies and evolutionary relationships between sequences is in bioinformatics. They noted the most popular algorithms used: Needleman and Wunsch, Pearson and Lipman, BLAST and FASTA. They noted that, even after using these algorithms, no clear homologies may be present and no relationship can be inferred solely on the basis of similarity. They developed a strategy for applying Knowledge Discovery from Databases (KDD), also known as data mining. The goal of KDD is the abstraction and extraction of any type of pattern, perturbation, relationship or association from the analyzed data. The use of association rules is key. The validity of a rule is assessed by: support (the proportion of examples in the data that are covered by the rule); confidence (the probability that a case will satisfy the rule’s consequent if it satisfies the antecedent); and improvement (an indication of how much stronger the confidence is than normal) (Rodriguez, Carazo and Trelles, 494).
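            The three rule measures can be computed directly from a list of transactions. The Python sketch below is an illustration under stated assumptions rather than the authors' implementation: support is taken as the fraction of transactions containing both sides of the rule, and improvement is taken as confidence divided by the overall frequency of the consequent (often called lift). The signal and keyword names are invented.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, improvement) for the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(transactions)
    n_antecedent = sum(1 for t in transactions if antecedent <= set(t))
    n_consequent = sum(1 for t in transactions if consequent <= set(t))
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= set(t))

    support = n_both / n
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    baseline = n_consequent / n  # how often the consequent occurs anyway
    improvement = confidence / baseline if baseline else 0.0
    return support, confidence, improvement


# Hypothetical data: each transaction lists the signals and keywords of one sequence.
transactions = [
    {"signal_A", "kinase"},
    {"signal_A", "kinase"},
    {"signal_A"},
    {"kinase"},
    {"signal_B"},
]
print(rule_metrics(transactions, {"signal_A"}, {"kinase"}))
```

An improvement above 1 means the antecedent makes the consequent more likely than it is in the data as a whole.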

            The authors wanted to develop a data mining algorithm to see if low-magnitude signals (LMS) show a relationship between two proteins. These are short similar fragments that proteins share, from which it is not possible to safely conclude a relation between two proteins. The authors set out to develop an algorithm that would allow a relationship to be expressed as an association rule with a high level of accuracy. The authors cited previously developed algorithms for mining association rules, and wrote that those algorithms were inefficient when searching for low-support or rare data links and associations. The key points of the new algorithm are a progressive reduction of active transactions, a transaction-driven approach to search space generation, and the reduction of active elements in the transactions to speed up the whole process.

            To evaluate the biological success of the algorithm, they used a massive test to find relationships between LMS and keywords. The knowledge base was then used to predict the function of the nonclassifiable sequences or to discover new relationships between sequences that shared signals (Rodriguez, Carazo and Trelles, 501). For the study, LMS were taken from protein sequences of five different bacteria species. They used a clustering technique to find 1,893 groups for 5,795 different sequences. A group was not found for 3,042 sequences. Of these “orphan” sequences, their algorithm found association rules for 853. They concluded that their algorithm is suitable. They will continue to work on improving the algorithm so that it can detect more similarities. This research is one example of using algorithms to find hidden knowledge in biological datasets.

            Chen, Lu and Ram (2004) developed a new compressed pattern matching algorithm for DNA sequences. It searched long DNA patterns more than 10 times faster than a software package called Agrep, known as the fastest pattern matching tool. They based their algorithm on the Boyer-Moore algorithm, developed by R. S. Boyer and J. S. Moore. This shows how scientific knowledge builds on itself, and how there is often room for improvement.
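            The central idea of the Boyer-Moore family is to skip ahead by more than one position when a mismatch shows that no match can start nearby. The Python sketch below implements the Horspool simplification of Boyer-Moore on plain, uncompressed text. It is not the compressed-matching algorithm of Chen, Lu and Ram, and the function name is hypothetical.

```python
def horspool_search(text, pattern):
    """Return the start index of every occurrence of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # For each character except the last, the distance from its rightmost
    # occurrence in the pattern to the end of the pattern.
    shift = {ch: m - 1 - idx for idx, ch in enumerate(pattern[:-1])}
    hits = []
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            hits.append(pos)
        # Slide so the text character under the pattern's last position lines up
        # with its rightmost occurrence; unseen characters skip the whole pattern.
        pos += shift.get(text[pos + m - 1], m)
    return hits


print(horspool_search("GATTACAGATTACA", "ATTA"))  # [1, 8]
```

With a four-letter DNA alphabet the skips are short, which is one reason specialized methods for DNA, such as the compressed approach described above, are worth developing.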

            Molla et al. (2004) devised an algorithm to interpret SNPs on microarrays. Microarrays allow thousands of genes to be probed in parallel at one time (Liu and Wang, 2004). Their technique needs only a low-resolution scanner like the ones used for microarray experiments. They used their technique to find the differences (SNPs) between two strains of the SARS virus, one of which had been sequenced. Their algorithm performed well. Advantages of their algorithm over previous ones used for this type of work were: simplicity, with less calibration needed; most calibration is done on a single chip; no human is needed to label examples; and no high-resolution scanner is required.

            Ko et al., also in 2004, developed an algorithm to detect the DNA sequence from a known protein, in effect “going backwards” from what is usually done, which is detecting the protein that comes from a DNA sequence or its homolog. Introns and mutations complicate this task.          

10.2 Concept Space

            Chen et al. (1997) used a variation of automatic thesaurus generation techniques (the concept space approach) to create a Caenorhabditis elegans (nematode worm) thesaurus of 7.6578 worm-specific terms and a Drosophila melanogaster fly thesaurus of 15,626 terms. Their goal was to contribute to solving the problem of nomenclature and semantic differences between biological domains. They used cluster analysis and artificial intelligence algorithms. They incorporated document and object list collections, object filtering and automatic indexing, co-occurrence analysis and association retrieval. They produced a fly-worm thesaurus that they incorporated into a major worm thesaurus, and had worm and fly scientists search using its terms for 36 queries, with and then without the new thesaurus. Scientists were observed and interviewed. Thirty percent of terms overlapped. They found that the scientists used context-driven term association and other domains for hints. With the addition of the authors’ thesaurus, more relevant documents were retrieved. Recall also improved. Scientists reported that the thesaurus helped them remember terms that they had forgotten, and to articulate their queries better. This study shows how information among scientific domains is interdependent, and how there are many problems, even in vocabulary for text searches, that must be overcome to take bioinformatics searching to the next level.

            Toldo and Rippman (2005) used a two-dimensional Concept Map™ to display a knowledge graph that allows causal connections among DNA sequences to be found. The databases that they worked with are Medline, GenBank, OMIM, the Kyoto Encyclopedia of Genes and Genomes, Unigene, UMLS, MeSH and LocusLink. The authors worked with expressed sequence tags (ESTs). If a DNA sequence obtained in a laboratory experiment contains a known EST, chances are that that sequence will code for the same protein. However, biochemistry, cell biology and molecular biology experiments need to be conducted to verify this. The “target” that they wanted to find by using their concept map was a protein against which it is worth developing a drug. They wanted the method to be accessible to a nonspecialist in bioinformatics and to return a list of relevant hits with links to “humanly understandable” annotation. The annotations should contain the reasons why the system considered the sequences relevant to the query. This system improved the efficiency with which the targets were identified.

            They processed over a million EST sequences using flat files. Since the task involved dynamic data sources, they used a relational platform as opposed to object-oriented database management. They used BLAST and other methods to compare the human ESTs against DNA sequences of known organisms. They devised an information-gathering and knowledge-condensation operation and interface to help scientists find homolog sequences and relevant Medline articles. They tested their system with 44 scientists over a period of ten months. Each user ran 18 full-text queries and inspected 28 annotations. Using questionnaires, the authors found that users, including a scientist with moderate biomedical knowledge, were happy with the system. They asked seven independent biologists to judge the results. They found that queries of one term delivered low-precision results, while queries of multiple terms delivered higher-precision results. The authors also measured the efficiency of the system by comparing the number of patent applications before and after implementation of the system. The applications increased in number with the system.

10.3 Semantic Networks

            Leroy and Chen (2005) developed Genescene, a toolkit to provide an overview of published literature content. They combined a linguistic parser with concept space, a co-occurrence based semantic net. They extracted complementary biomedical relations between noun phrases from Medline abstracts. The parser extracts precise and semantically rich relations from abstracts, while Concept Space extracts relations that are true for the collection of abstracts. The user study focused on p53 literature. This gene is a very important gene that, when mutated, contributes to the formation of cancer. They processed all Medline abstracts that discussed p53 in Genescene. After two researchers reviewed Genescene, the terms and parser relations were found to be precise and relevant.

            The Genescene XML parser is a rule-based top-down algorithm that provides precise and semantically rich relations. The concept space is a bottom-up technique that captures relations between semantic concepts from large collections of text. The authors believe that Genescene can provide scientists with a more complete picture of cellular processes by integrating relations from different abstracts, and displaying a visual map of these relations in which scientists can search by keyword. Among the abstracts, there were 270,000 parser relations and three million concept space relations.

            Two cancer researchers tested Genescene. The precision of terms extracted by the parser was 95 percent. The precision of terms for concept space was 94 percent. By combining the two, recall of terms almost doubled.

10.4 Filters

            Efficient filters are needed to sift through redundant sequence data and get to homologous data. New gene sequences and functions are constantly being discovered. New molecules and molecule functions are also constantly being discovered. In addition, new variations of old molecules are being discovered. One such example is non-coding RNA (ncRNA), small RNA molecules that can regulate transcription. This function renders them very important in molecular biology; they can actually control which proteins get produced, how much, and when. Bafna and Zhang (2004) developed FastR, a search tool with a new search filter that can search a typical bacterial database in minutes with high sensitivity and specificity on a standard personal computer, to find sequence and structural homologs among ncRNA.

            Sensitivity was defined as the fraction of all members of the ncRNA family that are admitted by the filter, and should be as close to one as possible. Specificity was defined as the expected number of base pairs per hit in a random database, and should be as large as possible. FastR was quick and had high sensitivity.
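            The two measures defined above can be written out as a short sketch. The function names and the toy numbers are assumptions made for illustration; they are not taken from FastR.

```python
def sensitivity(family_members, admitted):
    """Fraction of all family members that the filter admits (ideally close to 1)."""
    members = set(family_members)
    return len(members & set(admitted)) / len(members)


def base_pairs_per_hit(database_length, hits):
    """Base pairs scanned per hit in a random database (larger means a more specific filter)."""
    if hits == 0:
        return float("inf")
    return database_length / hits


# Hypothetical example: the filter admits three of four known family members
# and reports four hits in a random database of one million base pairs.
example_sensitivity = sensitivity(["a", "b", "c", "d"], ["a", "b", "c", "x"])
example_specificity = base_pairs_per_hit(1_000_000, 4)
```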

10.5 Clustering

            According to Liu and Wang (2004), clustering is the most popular approach used to analyze gene expression data. It has proved successful in discovering gene pathways, gene classification, and function prediction. Li, Zhang and Jiang (2004) devised a clustering algorithm that uses the Havrda-Charvat entropy model instead of the Shannon entropy model. This algorithm helps to identify groups of genes that have similar expression patterns under various conditions across different tissue samples. Usually, genes in the same cluster are involved in the same cellular pathway. Strong expression pattern correlations between these genes may indicate co-regulation of gene transcription. The authors tested their clustering algorithm on real and synthetic data, and found that it performed better than others such as the hierarchical clustering technique.
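            To show how the two entropy models differ, the following sketch computes both for a probability distribution. It assumes the standard textbook definitions (Shannon entropy in bits, and the Havrda-Charvat entropy of degree alpha with its usual normalization); how Li, Zhang and Jiang use the entropy inside their clustering algorithm is not reproduced here.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in bits: -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def havrda_charvat_entropy(probs, alpha):
    """Havrda-Charvat entropy of degree alpha (alpha > 0, alpha != 1):
    (sum(p ** alpha) - 1) / (2 ** (1 - alpha) - 1).
    """
    if alpha <= 0 or alpha == 1:
        raise ValueError("alpha must be positive and different from 1")
    total = sum(p ** alpha for p in probs if p > 0)
    return (total - 1) / (2 ** (1 - alpha) - 1)


# A fair two-way split has entropy 1 under both definitions.
fair_split = [0.5, 0.5]
shannon_value = shannon_entropy(fair_split)
havrda_charvat_value = havrda_charvat_entropy(fair_split, alpha=2)
```

Both functions return zero for a distribution concentrated on a single outcome; they differ in how strongly they weight unevenly distributed outcomes in between.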

            Yoshida, Higuchi and Imoto developed a hierarchical clustering technique to analyze microarray genes by tissue. Their method reduced the dimensionality of data, enabling users to extract genes and to detect genes expressed in combination.

            Liu and Wang (2004) developed an algorithm to detect clusters from DNA microarrays. They write that traditional clustering algorithms are incapable of discovering gene expression patterns that are visible in only a subset of experimental conditions. Often a subset of genes is co-regulated and co-expressed under a subset of conditions, but behaves independently under other conditions. They tested their algorithm on a breast tumor dataset of BRCA1, BRCA2 and sporadic samples. Their algorithm found significant clusters that exhibit patterns that consistently discriminate among the three types of tissue.

 

 

10.6 Wrappers

            Some scientists are working on wrappers. These are intermediate software layers used to access connected or integrated information sources. Programming a wrapper requires substantial programming skill and is time-consuming, and the resulting wrapper is hard to maintain. Hsu et al. (2005) provided a solution for rapidly building software agents to serve as Web wrappers for bioinformatics systems. They defined an XML-based language called Web Navigation Description Language (WNDL) to model a Web-browsing session. WNDL scripts showed how to locate, extract and combine data. The authors executed different WNDL scripts and described Information Extraction based on Pattern Discovery (IEPAD). IEPAD allowed their software agents to automatically discover the rules needed to extract the contents of a structurally formatted Web page. With this tool, a user can generate a complete Web wrapper agent.

            Mediators decompose data queries into subqueries. Wrappers are translators between mediators and data sources and are required for each data source. The authors wanted to emphasize the reconfigurability of Web wrappers so they can be rapidly developed and easily maintained without skillful programming. They used the Web Service Description Language (WSDL) to do this. The authors devised a tool to generate and execute a wrapper for PubMed.
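            The division of labor between a mediator and its wrappers can be sketched as follows. This is an illustration under stated assumptions, not the WNDL-based system of Hsu et al.: the class names, the in-memory "sources" and the keyword matching are all hypothetical.

```python
class DictWrapper:
    """Hypothetical wrapper: translates a keyword subquery into a lookup on one source."""

    def __init__(self, name, records):
        self.name = name
        self.records = records  # maps record id -> text

    def query(self, keyword):
        keyword = keyword.lower()
        return [(self.name, record_id)
                for record_id, text in sorted(self.records.items())
                if keyword in text.lower()]


class Mediator:
    """Hypothetical mediator: splits a query into subqueries and merges the answers."""

    def __init__(self, wrappers):
        self.wrappers = wrappers

    def query(self, keywords):
        results = []
        for wrapper in self.wrappers:
            # One subquery per keyword; keep records that match every keyword.
            hit_sets = [set(wrapper.query(keyword)) for keyword in keywords]
            if hit_sets:
                results.extend(sorted(set.intersection(*hit_sets)))
        return results


# Toy sources standing in for a literature database and a sequence database.
literature = DictWrapper("literature", {"a1": "Thrombin activation in cattle",
                                        "a2": "Troponin complex"})
sequences = DictWrapper("sequences", {"s1": "cattle thrombin mRNA",
                                      "s2": "human thrombin mRNA"})
mediator = Mediator([literature, sequences])
hits = mediator.query(["thrombin", "cattle"])
```

Because each source sits behind its own wrapper, adding a new source in this sketch means writing one more wrapper while the mediator stays unchanged.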

10.7 Proteins

            Han, Ma and Zhang (2004) produced a software package called SPIDER to identify proteins from sequence tags that contain de novo sequencing errors. After this is done, BLAST can be used to match sequences. A scientist identifies proteins using tandem mass spectrometry, in which the purified proteins are digested into short peptides with enzymes like trypsin. Tandem mass spectra are then taken, and these have to be interpreted by computer software. SPIDER found proteins with a sequencing error and matched them with their correct sequence.

            Can et al. (2004) developed a technique to generate the SCOP classification of protein structure with high accuracy. Proteins are classified by different heuristics: structure, sequence, and similar metrics. The growing number of proteins, their varying three-dimensional structures and the changes they go through in reactions make protein classification difficult to begin with, and the existence of different classification systems makes it harder still. The manually generated SCOP system is the most respected. In SCOP, proteins are in the same family if they have a sequence identity of ≥30% or have a similar function with a sequence identity of ≥15%. The authors employed a decision tree approach to combine classification decisions made by two sequence-based and three structure-based classifiers. The accuracy was higher than that of SCOP. There always seems to be room for improvement in bioinformatics.
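            The SCOP family rule stated above can be written as a small sketch. The function names are assumptions made for illustration, and the percent identity shown here (matching positions divided by the length of an already aligned pair) is only one of several conventions in use.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of aligned positions with identical residues ("-" marks a gap)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must already be aligned to the same length")
    if not aligned_a:
        return 0.0
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * matches / len(aligned_a)


def same_family(identity_percent, similar_function):
    """Rule from the text: identity >= 30%, or similar function with identity >= 15%."""
    return identity_percent >= 30 or (similar_function and identity_percent >= 15)


# Hypothetical aligned fragments, used only to exercise the functions.
example_identity = percent_identity("ACDE", "ACDF")
```

A rule like this is easy to state but depends on knowing whether two proteins have a similar function, which is one reason manual classification is slow.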

11.0 Use Case with Medline

            The following use cases examine whether database linkage within Entrez is thorough or whether more work can be done.

Use Case I – Text

            In this section, a keyword search in natural and controlled vocabulary will be conducted in Medline to retrieve relevant articles, and then a protein search will be conducted to show how all Entrez databases can be searched for relevant information.

Medline is primarily used to obtain science journal articles. Keywords can be in natural or controlled vocabulary. The Metathesaurus makes sure that journal articles are retrieved if either type of keyword is entered. However, the two types of keyword sometimes retrieve a different number of journal articles. The term “heart attack” or its controlled vocabulary term “myocardial infarction” can be entered in the Medline search box. If “heart attack” is looked up, a total of 105,902 journal articles are retrieved. A user can look up related articles and links. If “myocardial infarction” is looked up, a total of 127,373 journal articles are retrieved. Again, articles related to each article, or links, can be found. For terms like these, the links lead mostly to books. Looking up these terms using the “all databases” button will not produce much non-text material.

Use Case II

            The enzyme “oxyreductase” was looked up. This was a misspelling. Five journal entries came up. The system suggested that “oxoreductase” be looked up. When it was, 65 entries came up. The system then recommended “oxidoreductase”. This was the intended enzyme from the beginning, but it was not suggested by the system until “oxoreductase” was searched. This shows that the system, through MeSH, could be more accurate in catching spelling errors.
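            One simple way a search system could propose the intended term directly is to compare the query with a controlled vocabulary by string similarity. The sketch below uses Python's standard difflib module; it is not how Entrez or MeSH generates suggestions, and the four-term vocabulary is a hypothetical stand-in.

```python
import difflib

# Hypothetical stand-in for a controlled vocabulary.
VOCABULARY = ["oxidoreductase", "oxygenase", "thrombin", "troponin"]


def suggest(term, vocabulary=VOCABULARY):
    """Return the vocabulary entry most similar to term, or None if nothing is close."""
    matches = difflib.get_close_matches(term.lower(), vocabulary, n=1, cutoff=0.6)
    return matches[0] if matches else None


suggestion = suggest("oxyreductase")
```

With this toy vocabulary, the misspelling is mapped to the intended term in one step rather than two.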

            “Oxidoreductase” yielded 378,841 journal articles. Upon hitting “Links” to the right of the first three journal articles, no gene or protein sequence databases were shown. This indicates that there can be a problem with links in Entrez. A user who wants protein information from all databases but does not know to hit “All databases” will miss much of the data. Upon looking up “oxidoreductase” under “All databases”, many more results were obtained. For instance, it is mentioned 165,264 times in the Protein database.

12.0 Use Cases on Entrez

Use Case I

By going to http://www.ncbi.nlm.nih.gov, a user can choose which NCBI databases to search. A user can, for instance, look up “AIDS”, which stands for “acquired immunodeficiency syndrome”, in the Genome database. Since this is not a sequence or protein, only 28 results are found. However, with each of these results, a user can find links to other databases, such as “protein”. From these results, more links are found. This indicates the scope of links in Entrez and the amount of bioinformatics information available.

Use Case II

            One complication is the risk of confusing the sequence of an active protein with the sequence of its inactive precursor form. An example of this is thrombin, which is first produced as preprothrombin. It is changed by enzymes to prothrombin and then to thrombin for activation. The preprothrombin and thrombin sequences for cattle are very similar. One can look up “preprothrombin cattle” and “thrombin cattle” under “All databases” (see Appendices IV and V) and then click on the “Nucleotide” database to find the amino acid sequences. The accession numbers are different, yet the amino acid sequences are the same, although most of each sequence is not shown in these examples. A scientist who finds a sequence like those of thrombin and preprothrombin will have to search further for a database that shows the full amino acid sequence, or will have to do laboratory and X-ray crystallography experiments, to see which protein he has. If a scientist finds a sequence that is the same as or similar to part of thrombin and/or preprothrombin, it may be difficult to tell which sequence it is. In addition, thrombin and preprothrombin sequences of different species may be similar and may yield proteins of slightly different functions and structures. A scientist would have to search through many sequences to determine what protein he has. This example shows that different accession numbers can be given to similar sequences that code for different yet similar proteins. This adds to the complexity of sequence matching. If a user conducts a search for thrombin and preprothrombin, for instance, they will see that sequences from humans and other organisms are shown in no particular order. More work can be done to make Entrez more organized.
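            The ambiguity described in this use case can be shown with a toy example. The sequences below are invented strings, not real thrombin data, and the function name is an assumption. The point is only that a fragment lying inside the mature protein also lies inside its precursor, so a partial match cannot tell the two apart.

```python
def sequences_containing(fragment, references):
    """Return the names of all reference sequences that contain the fragment."""
    return sorted(name for name, seq in references.items() if fragment in seq)


# Invented sequences: the precursor carries extra residues before the mature part.
toy_references = {
    "precursor": "MKLLSTARTGGKPRMATURESEQ",
    "mature": "MATURESEQ",
}

ambiguous = sequences_containing("TURES", toy_references)
unambiguous = sequences_containing("STARTGG", toy_references)
```

Only a fragment that reaches into the part unique to the precursor identifies it unambiguously.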

Use Case III

            Another protein that comes in different forms is troponin. Its forms include troponin T, C and I. They are all different and form a complex together that assists in muscle movement. Their structure changes as calcium binds to the complex and muscles move. When a scientist searches for a protein sequence, he is looking for either the nucleic acid sequence or the amino acid sequence. In this use case, different forms of troponin were looked up to find out how easy it is to find their amino acid sequences in human beings, or Homo sapiens. Looking up “troponin” in “All Databases” yields many results. (See Appendix VI.) Clicking on Nucleotide (the same as nucleic acid) shows the first problem. The first few sequences are for tropomyosin, another muscle protein!

The user now gets more specific and searches under “All Databases” for troponin T, troponin C and troponin I separately. Each search yields a different number of results. The user looks at the Nucleotide database for each of these protein forms to find and compare their amino acid sequences. Tropomyosin and troponin proteins from other species are shown before the human-specific troponin sequence. To complicate matters further, troponin molecules not only come in three different forms, but also differ depending on whether they are from cardiac, involuntary (such as intestinal wall) or voluntary (such as arm) muscle. Voluntary muscle, in turn, comes in two varieties, slow and fast twitch, depending on how oxygen is utilized. For human troponin I, a search of the first 30 results found the Accession Numbers NMOO3281 and BCOI260. The latter is much shorter than the former, but nucleic acid sequence positions 61 – 1021 are shown in both. And they differ! They should be very similar, even accounting for non-deleterious mutations. A scientist with a partial sequence may match it to one of these troponin I entries but not be sure what he has. Entrez does not seem to make it simpler.

Upon looking at the first 30 results of the Nucleotide database, two supposedly similar results were found for human troponin C type 2 fast twitch muscle. The sequence positions 61 – 661 shown, however, do not match at all. For troponin T, there are three variants, 61-63, for type 2 cardiac muscle. Entrez offers no way to sort all of this information in a simpler fashion so that a researcher can find an exact sequence match.

In addition, the links next to results in different databases do not fully work. For instance, upon looking up troponin T in Nucleotide, the Links button next to the first entry, which happened to be for tropomyosin, was clicked. A choice of databases for links came up, and PubMed was chosen. The articles listed did not include an article that had been found through a Yahoo search for tropomyosin. That 2003 journal article, by Thomas Palm, Norma J. Greenfield and Sarah E. Hitchcock-DeGregori in Biophysical Journal, volume 84, pages 3181 to 3189, is called Tropomyosin Ends Determine the Stability and Functionality of Overlap and Troponin T Complexes. This article is in the PubMed database. It just did not come up by going to the Links button and hitting “PubMed”.

Use Case IV

            Upon looking up “estrogen” under “Nucleotide” in Entrez, the first several links are on estrogen receptors, not estrogen. This is a simple problem, but one that can cost a scientist a lot of time in finding estrogen.

13.0 Conclusions

            The amount of biological, especially genetic and proteomic, information will continue to expand at a rapid pace. This paper did not cover the sequence databases of Europe and Japan, other sequence search tools such as the Simple Modular Architecture Research Tool (SMART), or several of the Entrez databases. More sophisticated techniques are needed to link this information together, and to make sure information is not redundant.

            As was seen in the use cases, not all information in the Entrez data warehouse was linked upon entering a term to search. Crucial information to a researcher may not be found unless ways to link all of the information are created and put into use. In addition, nothing seems to be done by the system to clear up or simplify problems, such as different sequences for the same or similar proteins, or the ordering of information by species.

            Algorithms that are sensitive and quick must be developed for finding homologous DNA and protein sequences, for matching DNA to proteins, and for matching DNA and proteins to protein structures. Redundancy of new genetic data must continue to be removed in databases such as RefSeq. There are various ways to line up sequences in matrices and none are foolproof. Different matrices and algorithms allow for “gaps” of dissimilar sequences to pass before two sequences are deemed not a match. BLAST allows for fewer such dissimilar stretches than the Smith-Waterman algorithm, and so is quicker and retrieves fewer results. It is more specific, but may also miss some similar sequences, especially taking into account non-deleterious mutations and the fact that different codons can code for the same amino acid, as discussed earlier in the paper. More work needs to be done on creating matrices and algorithms that will lead to quick and specific results, and on ways in general to make sense of the ever-increasing amount of bioinformatics data.
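            For readers who want to see what the Smith-Waterman comparison involves, the following is a minimal sketch of its scoring step. The scoring values (match 2, mismatch -1, gap -2) are arbitrary assumptions, real tools use substitution matrices and more elaborate gap penalties, and BLAST's faster heuristic search is not shown.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    cols = len(b) + 1
    prev = [0] * cols  # scores for the previous row of the matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * cols
        for j in range(1, cols):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero, which is what makes the alignment local.
            curr[j] = max(0, diagonal, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


identical_score = smith_waterman_score("ACGT", "ACGT")
embedded_score = smith_waterman_score("TTACGTTT", "GGACGTGG")
unrelated_score = smith_waterman_score("AAAA", "TTTT")
```

Every cell of the matrix is filled in, so the running time grows with the product of the two sequence lengths. This is why an exhaustive comparison is slower than a heuristic search over a large database.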


 

Appendix I

 

What We Still Do Not Know

 

 

From http://www.ornl.gov/sci/techresources/Human_Genome/faq/seqfacts.shtml#post

Accessed on December 6, 2005


Appendix III
 
DNA and Protein Sequences
 
         Nucleotide: DNA sequence database (primarily GenBank)
         Protein: Protein sequence database
         Popset: Sequence alignments from population and phylogenetic studies
         Genome: Complete genome assemblies
         Unigene: Gene-oriented clusters of transcript sequences
         UniSTS: Markers and mapping data
 
Classification
 
         Taxonomy: Organisms with sequence data in GenBank
 
Sequence Variation
 
         SNP: Single nucleotide polymorphism 
 
Gene Expression
         
         GEO: Gene expression hybridization array repository
         GEO Datasets: Curated gene expression and microarray datasets
 
Phenotype and Function
 
         OMIM: Online Mendelian Inheritance in Man