The U.S. Department of Energy Biological and Environmental Research program funds this site.

Genome Database Guide
Resources for Learning about Genes and Proteins

Descriptions of Resources

Molecular Biology Basics

Learning about Genes and Their Products

Nucleotide Sequence Databases

Protein Sequence Databases

Sequence Similarity Searching

Gene-Mutation Resources

Protein Structure Resources

Helpful Terms, Tutorials, and Examples

Glossary of Bioinformatics Terms - A quick guide to some common terms used in bioinformatics resources

Introductory Bioinformatics Tutorials - Step-by-step instructions for first-time users

Gene Profiles - Case studies as examples of the kinds of information you can find using these resources



Molecular Biology Basics

This guide provides descriptions of, and links to, bioinformatics resources that can help you learn more about genes and proteins associated with genetic disorders or traits. Most were designed for students and professionals in the life sciences, so a certain level of familiarity with genetics and molecular biology is assumed. For readers looking for an introduction to the science behind the Human Genome Project, we have included links to basic genetics and molecular biology resources that can be reviewed before using the bioinformatics resources.

The Science Behind the Human Genome Project: Web pages that define some basic genetics concepts and explain how the Human Genome Project was implemented.

Genome Glossary: DOE Human Genome Program glossary of genetics terms. Can be searched or browsed alphabetically; links to other life science glossaries.

Each of the following resources is viewable on the Web and printable as an Acrobat (PDF) file.

Primer on Molecular Genetics: Department of Energy (DOE) publication. Defines basic genetic terms and overviews human genome mapping and sequencing, model organism research, informatics, and the impact of the Human Genome Project.

To Know Ourselves: DOE publication that explains the agency's involvement with the Human Genome Project. Includes sections on human genome physical mapping and sequencing; technological developments in laboratory instrumentation; management of genomic data; and the project's ethical, legal and social implications.


Return to Top

Molecular Biology Basics Gene Resources Nucleotide Sequences Protein Sequences
Sequence Similarity Searching Gene-Mutations Protein Structures


Learning about Genes and Their Products

These resources give a general overview of a gene, along with some of the following information: official symbol, locus, associated disorders or traits, mode of inheritance, name and function of gene product, and links to additional gene-specific resources.


Online Mendelian Inheritance in Man (OMIM)
Entry Last Reviewed: August 5, 2002

Overview: Created by Victor A. McKusick, MD, and his colleagues at Johns Hopkins School of Medicine, OMIM is a large, searchable, up-to-date database of human genes, genetic traits, and disorders. The OMIM database contains text, pictures, and reference material plus copious links to NCBI's Entrez database of MEDLINE articles and sequence information. OMIM is intended for use by genetics researchers, advanced life-science students, and healthcare professionals concerned with genetic disorders. Three different interfaces are provided for exploring particular genes and genetic conditions: Search, Gene Map, and Morbid Map.

OMIM Search - Interface where users can enter query terms and use limits to refine search strategies. Users can also submit queries using the search box on the OMIM home page.

OMIM Gene Map - Single file that lists entries by chromosome and cytogenetic location starting with the p telomere of chromosome 1, continuing through the q telomere of chromosome 22, and ending with the p telomere of X through the q telomere of Y. This interface is useful for seeing the sequential order of genes as they occur on each chromosome.

OMIM Morbid Map - Single file that lists genetic disorders alphabetically and presents the cytogenetic map locations of associated genes.

Search Tips: As part of NCBI's Entrez system, OMIM offers an advanced search option that includes a variety of features and methods for refining search strategies. With OMIM's Search interface, users can select different fields (such as MIM number, gene name, chromosome, title word, and creation date) with the Limits option, browse index terms and preview query results using Preview/Index, combine searches using History, and store selected records from different searches on Clipboard. Boolean Operators AND, OR, and NOT must be in upper case. Phrase searching using double quotes and truncation using the asterisk (*) as a wild card also are supported. To learn more about OMIM Search, see our OMIM Search Tutorial or review the Help and FAQs pages on the OMIM site.

Both Gene Map and Morbid Map can be searched by gene symbol, chromosomal location (X and Y must be capitalized), and disorder keyword. For Gene Map it is best to search by gene symbol (e.g., CFTR for the cystic fibrosis gene) or gene locus (e.g., 7q31.2). When searching by disorder keyword in Morbid Map, use plain text terms. For example, instead of searching for alzheimer's disease, try searching for alzheimer disease. Characters such as quotation marks, apostrophes, parentheses, or dashes are not supported. Gene Map and Morbid Map searches will take users to the search term's first instance in the file and display 20 entries. To locate the next instance, hit the Find Next button.
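The plain-text rule for Morbid Map keywords can be sketched in a few lines of Python. This is only an illustration of the advice above, not code provided by OMIM: the helper name plain_text_keyword is hypothetical, and replacing unsupported characters with spaces (rather than deleting them) is an assumption.

```python
import re

# Characters the guide lists as unsupported in Morbid Map keyword searches:
# quotation marks, apostrophes, parentheses, and dashes.
UNSUPPORTED = re.compile(r"""["'()\-]""")

def plain_text_keyword(term: str) -> str:
    """Return a plain-text version of a disorder keyword (hypothetical helper)."""
    # Drop a possessive 's first so "alzheimer's" becomes "alzheimer".
    term = re.sub(r"'s\b", "", term)
    # Assumption: remaining unsupported characters are replaced by spaces.
    term = UNSUPPORTED.sub(" ", term)
    # Collapse any runs of whitespace left behind by the substitutions.
    return " ".join(term.split())

print(plain_text_keyword("alzheimer's disease"))  # alzheimer disease
```

Cleaning a term this way before pasting it into the Morbid Map search box avoids the unsupported characters described above.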

Information Provided: Each OMIM record returned by a search contains MIM number, official gene symbol, gene product name, gene locus (if known), general description of disorder or gene-product function, and text summarizing information from journal articles on a particular gene, disorder, or trait. Links to citations in Medline, related OMIM entries and entries in other NCBI databases also are included.

OMIM Gene Map entries are organized by cytogenetic location and show gene symbols, gene title, MIM number, names of associated disorders (if known), and other information. Each entry includes links to NCBI Map Viewer and associated OMIM records.

OMIM Morbid Map entries, arranged alphabetically by disorder name, include gene symbols and disorder abbreviations, MIM number(s) linking to the OMIM record(s) for associated gene(s), and cytogenetic location.


NCBI LocusLink
Entry Last Reviewed: August 5, 2002

Overview: LocusLink is an NCBI database that serves as a single query interface to gene-specific information from a wide variety of bioinformatics sources. LocusLink includes descriptive information about genetic loci in human, fruit fly, mouse, rat, zebrafish, and human immunodeficiency virus type 1 genomes.

Search Tips: Users can query LocusLink by typing keywords (such as disease or protein name, gene symbol, accession numbers, or other database ID numbers) into the search box at the top of the main page. Query options include truncation using the asterisk (*) as a wild card, field restriction, and Boolean operators that are not case sensitive. Grouping phrases by parentheses or quotation marks is not supported. The Boolean AND is implied when multiple search terms are entered.

LocusLink has a system of controlled terms that can be used to retrieve only those records with a particular feature. One of the controlled terms is disease_known, which will return only loci associated with a known disorder. For example, if you wanted to find only those members of the ABC (ATP-binding cassette) protein subfamily that are associated with disease, you would enter ABC and disease_known. See the Query Tips in the Help file for a complete listing and more detailed descriptions of controlled terms.

Letters just below the search box link to lists of LocusLink records arranged alphabetically by official gene symbol.

For query tips and descriptions of fields included in each LocusLink report, see the Help file. LocusLink also provides a FAQ page.

Information Provided: Each LocusLink report may include the following types of information: links to gene-specific entries in other databases, official gene nomenclature, LocusID (identification number assigned to the gene by LocusLink), overview of protein function with links to scientific literature describing function, alternate symbols and aliases, phenotypes or expressed characteristics associated with the gene, other database ID numbers, homologous genes from other genomes, links to cytogenetic maps, links to sequence records, and links to other related information sources.


GeneCards
Entry Last Reviewed: August 5, 2002

Overview: Developed at the Weizmann Institute of Science in Israel, GeneCards is a database of human genes, their products, and their involvement in hereditary disorders. GeneCards automatically extracts gene-specific information from a variety of Web-based bioinformatics resources and integrates the data into each entry. The database was designed for scientists who want to use one interface to access multiple databases for information about human genes that have been assigned approved symbols. An important part of the GeneCards mission is to promote the use of standard nomenclature, such as official gene symbols approved by the HUGO Gene Nomenclature Committee.

Search Tips: Users can search GeneCards by keyword or gene symbol/alias using the search box on the home page. Keywords can be single or multiple terms, GenBank accession number, chromosome number, or gene locus. Truncation using an asterisk (*) as a wild card at the beginning or end of a search term is supported. The Boolean operators AND and OR can be used to connect terms. The Boolean operator NOT is not supported. Examples are provided for keyword searching. Users may also browse a complete listing of genes or a subset of disease genes featured in GeneCards. To learn more about searching GeneCards, check out the Quick Start, Guided Tour and Guidance System links.

Information Provided: A typical entry for a gene may include the following information: official gene name and symbol, synonyms or alternative names, ID numbers assigned to the gene in other databases, chromosomal location, chromosome map showing where the gene is found, domains and protein families associated with the gene's protein product, links to sequence records, expression patterns in human tissues, links to similar genes in other organisms, SNPs and variants, disorders and mutations, links to citations in Medline, and links to other related resources. Each GeneCard also links to sources used to create the entry. GeneCards encourages feedback from its users and provides a form for submitting comments and suggestions.


Genome Database (GDB)
Entry Last Reviewed: August 5, 2002

Overview: Developed at Johns Hopkins University and maintained by the Bioinformatics Supercomputing Centre at the Hospital for Sick Children in Toronto, Ontario, Canada, this curated database serves as a repository for genomic mapping data obtained from the Human Genome Initiative. GDB provides descriptions of human genomic segments or regions of the human genome, human genome maps, and genetic variations or polymorphisms. New users may want to read About GDB to learn about the different features of this database. For each database entry or object, scientists are invited to submit data, add annotations to existing data, and recommend links to related entries in other databases.

Search Tips: To search for information about a particular gene when you know its official symbol, choose Genomic Segments and then Name/GDB ID in the simple search box on the main page. Users also may search by keyword or by DNA sequence ID such as the GenBank Identifier (gi). In keyword searching, multiple terms entered in the search box are treated as a phrase unless they are separated by a Boolean operator (AND, OR, or NOT). Truncation using the asterisk (*) as a wild card and single-character substitution using the question mark (?) are supported. In addition to the simple search, GDB offers other query options such as customized forms that include Gene Search by Name or Symbol, sequence-based search forms such as GDB BLAST, and generic search forms that let users search only for genes or specific types of maps. Browsing options available from the Search Options page are Genetic Diseases by Chromosome, Lists of Genes by Chromosome, and Lists of Genes by Symbol. To learn more about using GDB, see the Help document or Example Searches.
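The difference between truncation (*) and single-character substitution (?) can be seen in a short Python sketch. It uses the standard library's fnmatch module, whose wild cards happen to follow the same two conventions; this only illustrates the matching behavior and is not how GDB itself is implemented. The list of gene symbols and the helper name match_symbols are chosen for illustration.

```python
from fnmatch import fnmatchcase

# A small sample of gene symbols, used only to demonstrate the wild cards.
gene_symbols = ["CFTR", "BRCA1", "BRCA2", "BRAF", "ABCA1", "ABCA4"]

def match_symbols(pattern, symbols):
    """Return the symbols that match a pattern containing * or ? wild cards."""
    # fnmatchcase does case-sensitive matching: * matches any run of
    # characters (including none) and ? matches exactly one character.
    return [s for s in symbols if fnmatchcase(s, pattern)]

print(match_symbols("BRCA?", gene_symbols))  # ['BRCA1', 'BRCA2']
print(match_symbols("BR*", gene_symbols))    # ['BRCA1', 'BRCA2', 'BRAF']
print(match_symbols("ABC*", gene_symbols))   # ['ABCA1', 'ABCA4']
```

Note that fnmatch also gives square brackets a special meaning, which the GDB search tips above do not mention.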

Information Provided: Submitting a search returns a list of matching objects. Clicking on Accession ID pulls up the GDB record. For each gene, the following may be provided: gene symbols, names, and aliases; cytogenetic localization; nucleic acid sequence links; protein sequence links; polymorphisms; mutations; phenotype links to related OMIM records; homology links; external links to other related Web sites or database entries; and citations that link to GDB journal articles. At the top of each record, a button labeled View Maps of Region pulls up a list of maps. Viewing GDB maps requires a Java-capable browser (Netscape or Internet Explorer 3.0 and higher) that will run GDB's Java applet Mapview. See the Mapview Help document for more information.




Nucleotide Sequence Databases

Entrez Nucleotides  
Entry Last Reviewed: August 5, 2002

Overview: Part of the National Center for Biotechnology Information (NCBI) Entrez system, this database contains sequence data from several sources such as GenBank, RefSeq, European Molecular Biology Laboratory (EMBL), DNA DataBank of Japan (DDBJ), and Protein Data Bank (PDB). (Refer to overviews below for general descriptions of NCBI databases GenBank and RefSeq.) GenBank is part of the International Nucleotide Sequence Database Collaboration, which also includes DDBJ and EMBL. Sequence data is exchanged daily among these three organizations. Although record formats and search systems may differ, information contained in each record (accession number, sequence data, annotations) will be the same for all three databases.

Search Tips: As with other Entrez databases, users can refine search strategies using fields available in Limits and Preview/Index, browse Index terms of a particular field, combine searches using History, and store selected records from different searches on a Clipboard. Some search-refining techniques available from the Limits page are to exclude certain types of sequences (e.g., ESTs) and limit the search by date or particular database (e.g., search only RefSeq). Boolean Operators AND, OR, and NOT must be in upper case. Phrase searching using double quotes and truncation using the asterisk (*) as a wild card also are supported. For more information about searching this and other NCBI Entrez databases, see Entrez Help Document. For step-by-step instructions on finding and interpreting sequence records, see the Accessing Sequence Records tutorial.
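As a rough illustration of the query conventions just described (upper-case Boolean operators, double quotes around phrases, and the asterisk for truncation), the following Python sketch assembles a query string. The helper names entrez_term and combine are hypothetical, and the code only builds text that could be pasted into the search box; it does not contact NCBI.

```python
def entrez_term(term: str) -> str:
    """Quote multi-word phrases; leave single words (including truncated ones) as-is."""
    term = term.strip()
    return f'"{term}"' if " " in term else term

def combine(operator: str, *terms: str) -> str:
    """Join terms with a Boolean operator, forcing the upper case Entrez expects."""
    op = operator.upper()
    if op not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported Boolean operator: {operator}")
    return f" {op} ".join(entrez_term(t) for t in terms)

query = combine("and", "cystic fibrosis", "transmembran*", "human")
print(query)  # "cystic fibrosis" AND transmembran* AND human
```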

Information Provided: Each record returned in a search will include the nucleotide sequence and annotations such as accession numbers, keywords, source organism, and citations to reference articles. Sequence records also may contain the translated amino acid sequence. For more detailed descriptions of types of information in each sequence record, check the Sample GenBank Record provided by NCBI.

GenBank

Overview: GenBank is an NCBI database that serves as an archive for all publicly available DNA sequences from more than 75,000 organisms. Submitting scientists retain complete editorial control over their sequences, so they decide on gene symbols (which may not be the official ones) and additional information to include. Scientists contact NCBI if they wish to make any modifications to their sequence records. As an archival database, GenBank may contain redundant entries, even hundreds of records for the same gene. Besides redundancy, GenBank sequences may be contaminated with vector DNA. To address some problems associated with this archival database, NCBI developed the nonredundant RefSeq.

RefSeq

Overview: NCBI's database of reference sequences serves as a curated, nonredundant source of information about genomic DNA contigs (segments constructed by ordering cloned DNA fragments), mRNA transcripts, and proteins associated with known genes. Unlike GenBank records, RefSeq records are created and updated by NCBI staff. RefSeq records can be Predicted, Provisional, or Reviewed. Predicted and Provisional records are generated by an automatic process, but Reviewed records undergo a manual process that screens sequences for problems such as sequencing errors or vector contamination. Reviewed records also contain enhanced annotation such as additional gene-relevant publications, summary of nucleotide and protein features, and description of gene function in addition to annotation (official gene symbol or name, aliases, Locus ID number, MIM number, and map information) generated from automatic record processing. Each RefSeq entry has a distinctively formatted accession number: two characters followed by an underscore and a number, in which the two-character prefix describes the sequence type. For more information about RefSeq, see RefSeq FAQs or LocusLink & RefSeq Development.
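The accession format described above lends itself to a simple check. The Python sketch below splits an accession into its prefix and number and looks up the sequence type. The prefix meanings in the table come from general RefSeq conventions rather than from this guide, the table is deliberately incomplete, and the function name describe_refseq is hypothetical.

```python
import re

# Assumption: a few commonly seen RefSeq prefixes; this is not a complete list.
REFSEQ_PREFIXES = {
    "NC": "genomic (complete molecule or chromosome)",
    "NT": "genomic contig",
    "NM": "mRNA",
    "NP": "protein",
    "XM": "predicted (model) mRNA",
    "XP": "predicted (model) protein",
}

# Two upper-case letters, an underscore, digits, and an optional .version suffix.
ACCESSION = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")

def describe_refseq(accession: str):
    """Return prefix, type, number, and version, or None if the format does not match."""
    m = ACCESSION.match(accession.strip())
    if m is None:
        return None  # not in RefSeq format (for example, a GenBank-style accession)
    prefix, number, version = m.groups()
    return {
        "prefix": prefix,
        "type": REFSEQ_PREFIXES.get(prefix, "unknown prefix"),
        "number": number,
        "version": version,
    }

print(describe_refseq("NM_000492"))
print(describe_refseq("U12345"))  # None
```

The accessions in the usage lines are there only to show the two formats; the second is a made-up GenBank-style string.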




Protein Sequence Databases


Entrez Proteins  
Entry Last Reviewed: August 5, 2002

Overview: Part of the National Center for Biotechnology Information (NCBI) Entrez system, this database includes sequence data compiled from a variety of sources, including SWISS-PROT, Protein Information Resource (PIR), Protein Data Bank (PDB), and Protein Resource Foundation (PRF) in Japan. Some protein sequences were created from translations of coding regions in DNA sequences stored in GenBank and RefSeq.

Search Tips: As with other Entrez databases, users can refine search strategies using fields available in Limits, preview the number of search results for a query, browse Index terms of a particular field, combine searches using History, and store selected records from different searches on Clipboard. Some of the indexed fields that can be used to narrow a search include accession number, gene name, journal name, molecular weight, organism, properties, protein name, sequence length, and text word. Users also can specify that only one particular database be searched (e.g., retrieve protein sequences from SWISS-PROT only). Boolean Operators AND, OR, and NOT must be in upper case. Phrase searching using double quotes and truncation using the asterisk (*) as a wild card also are supported. For more information about searching this and other NCBI Entrez databases, see the Entrez Help Document. For step-by-step instructions on finding and interpreting sequence records, see the Accessing Sequence Records tutorial.

Information Provided: Search results displayed using the default view will include locus name (a unique name assigned to each record), sequence length, protein description (definition), accession number, database source, keywords, organism, citations to references, comments concerning protein function or associated traits or disorders, information about sequence regions of biological significance, and the amino acid sequence. For detailed descriptions about fields presented in each NCBI sequence record, see the GenBank sample record.


SWISS-PROT/TrEMBL
Entry Last Reviewed: August 5, 2002

Overview: The protein sequence databases SWISS-PROT and TrEMBL were developed by groups at the Swiss Institute of Bioinformatics (SIB) and the European Bioinformatics Institute (EBI). SWISS-PROT uses three key criteria: high level of annotation, minimal redundancy, and high level of integration with other databases. SWISS-PROT includes as much information as possible in its annotations, and external experts review current literature and provide comments and updates on different protein groups. SWISS-PROT's depth of annotation, however, requires considerable time and effort. To keep a current database of protein sequences, a subset called TrEMBL (Translation of EMBL) was developed. Translations of EMBL nucleotide sequences are computer annotated and stored in TrEMBL until sequences can be fully annotated and integrated into SWISS-PROT.

Search Tips: SWISS-PROT sequence records can be accessed through the NCBI Entrez Proteins database. If users choose to access the SWISS-PROT/TrEMBL Web site for sequence searching, they can query the database using a variety of methods: quick search on the main page (Boolean operators not supported), Sequence Retrieval System (SRS), full-text search (Boolean operators, phrase searching, and wild cards supported), and advanced search. Forms for searching by accession number or ID, description (entry name, gene name, species, organelle), author, or citation (SWISS-PROT only) also are provided. To learn more about searching SWISS-PROT, see the SWISS-PROT Documentation section, which includes a downloadable PDF version of the user manual.

Information Provided: SWISS-PROT entries are described as containing two types of data: core data (consisting of sequence, bibliographic references, and description of the protein's biological origin) and the annotation. Detailed annotations in each entry describe protein function, post-translational modification (e.g., addition of sugars or phosphate groups after mRNA translation), domain and binding sites, secondary structure, quaternary structure (e.g., homodimer, heterodimer), disorders associated with altered protein forms or amounts, variants, and similarities to other proteins.


Protein Information Resource - Protein Sequence Database (PIR-PSD)
Entry Last Reviewed: August 5, 2002

Overview: Established in 1984, Protein Information Resource (PIR) is a division of the National Biomedical Research Foundation associated with Georgetown University Medical Center. In collaboration with Munich Information Center for Protein Sequences (MIPS) in Germany and the Japan International Protein Information Database (JIPID), PIR has developed the PIR-International Protein Sequence Database (PSD). Its mission is to be "the most comprehensive and expertly annotated protein sequence database in the public domain" with the primary objective of achieving "properties of Comprehensiveness, Timeliness, Non-Redundancy, Quality Annotation, and Full Classification."

Search Tips: PIR sequence records can be accessed through the NCBI Entrez Proteins database. If users choose to go to the PIR-PSD Web site, the following search options are provided: search by unique identifier or accession number, basic text search, and advanced text search. For basic text searches, the Boolean operators AND, OR, and NOT are not supported, and a space between terms is interpreted as "and." Advanced searches allow users to refine a strategy with fields such as Title, Species, Author, Keyword, and Gene Name. In advanced search, search terms are case sensitive and must be at least three characters long. Boolean operators OR and NOT are supported. A space between words is interpreted as "and," so users searching for a phrase must put a character between multiple terms (e.g., enter homo-sapiens to search for "homo sapiens"). For more on searching PIR-PSD, see Help Searching PIR Databases, Sample Entry, Demo Search, and FAQs.

Information Provided: Each record includes protein name; classification and origin; literature references; protein features such as domains and motifs; primary sequence data; and links to related entries in other databases. Users have the option to create submission forms for similarity searching in PIR and NCBI databases. At the top of each record are links to annotation and sequence data within the record and a link to a composition table that summarizes total amino acid composition expressed as percentages. At the bottom of the record are direct links to Protein Data Bank (PDB) structures and sequence similarity alignments associated with the protein.
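A composition table of the kind mentioned above is straightforward to compute. The Python sketch below is a minimal illustration, not PIR's own code: it counts each single-letter amino acid code in a sequence and expresses the counts as percentages of the total. The function name composition and the sample peptide are made up for the example.

```python
from collections import Counter

def composition(sequence: str) -> dict:
    """Return each residue's share of the sequence as a percentage."""
    # Keep letters only, so spaces, digits, and line breaks are ignored.
    residues = [c for c in sequence.upper() if c.isalpha()]
    total = len(residues)
    if total == 0:
        return {}
    counts = Counter(residues)
    return {aa: 100.0 * n / total for aa, n in sorted(counts.items())}

# Arbitrary peptide used only to demonstrate the calculation.
for residue, percent in composition("MKTAYIAKQR").items():
    print(f"{residue}: {percent:.1f}%")
```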




Resources for Sequence Similarity Searching

Scientists frequently perform sequence-similarity searching to identify homologs for a particular sequence and to gain insight into a gene product's function and biological importance. Because more than one codon (triplet of nucleotides) can code for a particular amino acid, nucleotide sequences that differ considerably can still translate into the same amino acid sequence. This degenerate nature of the genetic code is the reason that similarity searching using amino acid sequences generally is more informative than using nucleotide sequences.
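The effect of the degenerate genetic code can be demonstrated with a short Python sketch. It builds the standard codon table and translates two DNA sequences that differ at most positions yet encode the same peptide. The two sequences are invented for illustration, and the function name translate is hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(dna: str) -> str:
    """Translate a DNA sequence in frame 1, ignoring any trailing partial codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

# Two made-up sequences that use different codons for the same amino acids.
seq_a = "ATGCTTAGCCGT"
seq_b = "ATGTTATCAAGA"

identical = sum(x == y for x, y in zip(seq_a, seq_b))
print(f"nucleotide positions identical: {identical} of {len(seq_a)}")
print(translate(seq_a), translate(seq_b))  # MLSR MLSR
```

A nucleotide-level comparison would score these two sequences as only partly similar, while a protein-level comparison sees them as identical.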

Users who are new to sequence-similarity searching should check out NCBI's Introduction to Similarity Page, Rules of Thumb, and BLAST Guide's Glossary.


NCBI BLAST  
Entry Last Reviewed: August 5, 2002

Overview: BLAST (Basic Local Alignment Search Tool) is a set of programs designed to perform similarity searches on all available sequence data. BLAST uses an algorithm developed by the National Center for Biotechnology Information (NCBI) that seeks out local alignment (alignment of some portion of two sequences) as opposed to global alignment (alignment of two sequences over their entire length). By searching for local alignments, BLAST can identify regions of similarity in two sequences. Some similarity searches offered by NCBI include comparing an amino acid sequence to a protein sequence database (blastp), comparing a nucleotide query sequence to a nucleotide sequence database (blastn), and comparing a nucleotide sequence translated in all reading frames to a protein sequence database (blastx).
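To make the local-versus-global distinction concrete, the Python sketch below scores two sequences both ways using classic dynamic programming (Smith-Waterman for local alignment, Needleman-Wunsch for global alignment). This is not the BLAST algorithm, which relies on faster heuristics; it only illustrates the two alignment concepts. The scoring values and the example sequences are arbitrary assumptions.

```python
def alignment_score(a: str, b: str, local: bool,
                    match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local (Smith-Waterman) or global (Needleman-Wunsch) score."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment charges for gaps at the ends of the sequences.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                # Local alignment may restart anywhere, so scores never drop below 0.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]

# Two made-up sequences sharing only a central region (ACGTACGT).
x = "TTTTACGTACGTCCCC"
y = "GGGGACGTACGTAAAA"
print("local :", alignment_score(x, y, local=True))   # 16
print("global:", alignment_score(x, y, local=False))  # 8
```

The local score reflects only the shared central region, while the global score is pulled down by the unrelated flanking bases. This is why a local method can find a conserved region inside otherwise dissimilar sequences.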

Search Tips: From the main BLAST page, users can choose among several NCBI services. For service descriptions, click on the question mark to the right of each section title or see the Description of BLAST Services. Clicking on the desired BLAST search option will lead to a search page with a box for entering the query sequence. Accepted input includes a sequence in FASTA format (a single-line description followed by sequence data), bare sequence (sequence data without the single-line description), and identifier. The identifier may be an accession number or GenBank ID (GI number), but must be entered as a single word without any spaces between characters. For more information about input, see NCBI's Search Format page. Each search or format option on the search page links to Help documentation with more detailed descriptions of each option. For more on how to use BLAST, see our Sequence Similarity Searching tutorial and NCBI's step-by-step BLAST GUIDE, Query Tutorial for new users, BLAST Tutorial, and BLAST Help.
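The two sequence input styles described above, FASTA format and bare sequence, can be told apart with a small parser. The Python sketch below is a minimal illustration and not NCBI code. It assumes the usual FASTA convention that the description line begins with a greater-than sign (>), and the function name parse_fasta and the sample record are hypothetical.

```python
def parse_fasta(text: str):
    """Yield (description, sequence) pairs; description is None for a bare sequence."""
    description, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new description line closes any record collected so far.
            if description is not None or chunks:
                yield description, "".join(chunks)
            description, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if description is not None or chunks:
        yield description, "".join(chunks)

# Made-up record used only to show the layout.
sample = """>example_query hypothetical sequence for illustration
ATGGCCATTGTAATGGGCCGC
TGAAAGGGTGCCCGATAG
"""
for desc, seq in parse_fasta(sample):
    print(desc, len(seq))
```

Sequence data split across several lines is joined into one string, so the same function handles both wrapped and single-line input.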

Information Provided: After submitting a BLAST request, users are presented with a Formatting BLAST page that displays the query statement, domain information, request for ID number, and format options. After desired format options are selected, pressing the Format button will pull up the Results of BLAST page. Using pair-wise alignment (the default alignment view) in format options, the Results page will display an image map graphically depicting retrieved database sequences (subject sequences) aligned with query sequence (depicted as the numbered line at the top). Passing the mouse over each line below the query sequence will display a description of that sequence in the text box. Clicking on each line will jump down to the corresponding pairwise alignment between the query sequence and a particular subject sequence. Below the image map is a list of sequences producing significant alignments. Accession number or identifier for each alignment links to a sequence record. The score links to the corresponding pairwise alignment at the bottom of the Results page. The blue L seen in some results links to a related entry in LocusLink. See the Sequence Similarity Searching tutorial for more on interpreting BLAST results.


PIR FASTA Similarity Search 
Entry Last Reviewed: August 5, 2002

Overview: The FASTA Similarity Search tool is part of the Protein Information Resource (PIR) collection of protein databases and bioinformatics tools. This similarity-search tool uses the FASTA algorithm, which compares a query sequence to those in the Protein Sequence Database and other PIR databases.

Search Tips: Users can query the database by inserting the single-letter amino acid code into the query box or by entering the valid PIR-PSD entry code for a particular protein of interest. See the Demo Search for an example.

Information Provided: Query results are presented in a table that lists more-similar sequences at the top and less-similar sequences toward the bottom. Clicking on ID number for a result will pull up the database entry for that protein, and clicking on the colored bar on the right will link to pairwise alignment between the submitted sequence and the subject sequence retrieved from the database.




Gene Mutation Resources

Genes carry instructions for building proteins, molecules that do most of the body's work. Certain variations in a gene's nucleotide sequence can affect the resulting protein's function by altering amino acid sequence and protein structure. The inability of some variant proteins to function properly can cause genetic disorders or other distinctive phenotypes.


Online Mendelian Inheritance in Man (OMIM): Allelic Variants  
Entry Last Reviewed: August 5, 2002

Overview: OMIM records for many genes include an Allelic Variants section that summarizes published research concerning selected allelic variants or mutations, many of which cause disorders. Some criteria for selecting allelic variants for inclusion in OMIM are first mutation discovered, high population frequency, distinctive phenotype, and unusual disease-causing mechanism. Each variant is assigned a ten-digit number made up of the gene's six-digit OMIM number, followed by a period and four digits unique to the variant. For more information about this database, see the OMIM entry above in the Learning about Genes and Their Products section.
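The numbering scheme for allelic variants can be checked and split with a short Python sketch. The function name split_allelic_variant is hypothetical, and the number in the usage line is a placeholder chosen only to show the format.

```python
import re

# Six-digit MIM number, a period, then four digits unique to the variant.
VARIANT = re.compile(r"^(\d{6})\.(\d{4})$")

def split_allelic_variant(number: str):
    """Return (gene MIM number, variant suffix) for an allelic variant number."""
    m = VARIANT.match(number.strip())
    if m is None:
        raise ValueError(f"not in NNNNNN.NNNN format: {number!r}")
    return m.group(1), m.group(2)

gene_mim, variant = split_allelic_variant("123456.0001")
print(gene_mim, variant)  # 123456 0001
```

The first part of the result can be used to look up the gene's main OMIM record, as described in the Learning about Genes and Their Products section.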

HGVbase
Entry Last Reviewed: August 5, 2002

Overview: The Human Genome Variation database (HGVbase) is a database of annotated records for known sequence variations in the human genome. This database was designed as a tool to help scientists understand how common genome sequence variations, such as single nucleotide polymorphisms, result in complex phenotypes such as disease susceptibility and reactions to drugs. Each HGVbase record features data extracted from publicly available genome databases or published literature that has been subjected to manual review and enhanced with annotations. HGVbase shares data with NCBI's dbSNP, and currently incorporates about 40% of dbSNP's records into its database. HGVbase is funded by the Karolinska Institute Center for Genomics and Bioinformatics in Sweden, the European Bioinformatics Institute, and the European Molecular Biology Laboratory (EMBL).

Search Tips: HGVbase provides text search and sequence search options for its users. In addition to the quick search box available on the HGVbase home page, there are links to four different search tools: Text Search, Text+ Search, Sequence Search, and Regional Search. The Text and Text+ search forms allow users to search for records by text strings that can be targeted to particular fields of a record. The Regional Search lets users search for SNPs by chromosomal location.

Since some characterized genes may lack standardized names, HGVbase recommends sequence searching over text-based searching. To search by sequence, simply paste DNA or RNA sequence data (in any format) into the Sequence Search form and click "Run." For more information about searching HGVbase see the "How to search" page available from the navigation menu on the left or click the "Help" link in the upper right corner of each search form.

Information Provided: Some features included in each record are the variant type, accession numbers that link to sequences that contain the variant, portions of the sequence that flank the variant, alleles or possible nucleotides at the site of the polymorphism, associated gene names and symbols, the region of the gene where the variant is found (e.g., exon or intron), and citations to source literature. For more information about the various fields of each HGVbase record see the Data Structure Record.

Human Gene Mutation Database 
Entry Last Reviewed: August 5, 2002

Overview: The Human Gene Mutation Database (HGMD) is a collection of published gene lesions associated with human hereditary disorders. This database is maintained by the Institute for Medical Genetics at the University of Wales College of Medicine. HGMD collaborates with Celera Genomics and is supported by Genome Database (GDB) and several biotechnology companies. The home page links to a useful overview of mutation nomenclature.

Search Tips: HGMD provides a simple search interface for querying its database by disease, gene name, and gene symbol. All punctuation marks (e.g., slashes, plus signs, double quotes, commas, and dashes) are ignored. Truncation using an asterisk (*) is supported. For more information on using HGMD, see the Help file.

Information Provided: Each search will pull up a list of gene symbols corresponding to search terms. Clicking on a gene symbol will access a record summarizing mutations and phenotypes and the number of entries associated with each mutation type and phenotype. Clicking on a mutation type will show the accession number, location, and associated phenotype and link to a reference citation for each mutation. The record for each gene also links to a mutation map, the gene's cDNA sequence, and gene-specific records in other databases.


NCBI dbSNP  
Entry Last Reviewed: August 5, 2002

Overview: One of the most common types of DNA sequence variation is the single nucleotide polymorphism (SNP), in which a single nucleotide base (A, C, T, or G) is substituted for another. NCBI's Database of Single Nucleotide Polymorphism (dbSNP) serves as a public repository for sequence variations such as small-scale insertions or deletions, polymorphic repetitive elements, and microsatellite variation, in addition to SNPs. Data can come from any part of a genome in any species. Sequence variations are submitted to the database by members of the scientific community. This database is separate from GenBank but is cross-linked to records in other NCBI resources such as GenBank, LocusLink, and PubMed.

For more about SNPs and why they are important to biomedical research, see the SNP Fact Sheet and NCBI's SNPs: Variation on a Theme.

Search Tips: Users can search dbSNP directly or access the database through other NCBI resources. One way to access SNP data mapped to a particular gene is to use NCBI LocusLink. Once you have found a gene's LocusLink record, clicking on the purple V or VAR link (if available) will open a list of SNPs mapped to that locus. Records in NCBI's sequence databases also may link to SNP data.

To search dbSNP directly, use Entrez SNP or dbSNP's Easy Search Form. dbSNP also provides a BLAST search option that compares the query sequence with sequence data contained in each SNP record. The BLAST option will generate a list of SNPs that can be found within the query sequence. See the Entrez SNP main page for descriptions of the different fields that can be used for searching the database.

NCBI will soon feature a quick how-to guide called GETTING STARTED. This guide should help novice users learn how to use and design search strategies for dbSNP. To learn more about dbSNP, see the FAQs page.

Information Provided: From LocusLink, clicking on the purple V or VAR link opens the SNPs linked from LocusLink page. This page provides Gene Model information with links to associated contig, mRNA, and protein sequence records. Each SNP is included in the graphic gene model and color-coded based on where the SNP is located (intron, exon, or untranslated region) and whether the change is synonymous or nonsynonymous. For each SNP that occurs in an exon, the associated nucleotide, codon position, and amino acid residue are given.

Each SNP is assigned an identification number called a cluster id or rs number. The record for each cluster id is referred to as a cluster report and includes source organism, variation type (e.g., SNP (single nucleotide polymorphism) or DIP (deletion/insertion polymorphism)), the nucleotide sequence flanking the SNP in FASTA format, a LocusLink Analysis map depicting where the SNP is found within the gene, and links to other NCBI resources related to the particular SNP. Submitter records for each cluster provide one or more links to more detailed descriptions for each SNP submission.
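Because cluster reports give the flanking sequence in FASTA format, a minimal sketch of reading one FASTA record may be helpful. FASTA itself is simple: a header line beginning with ">" followed by one or more lines of sequence. The function name and the sample record below are made up for illustration and do not reproduce an actual dbSNP record or its header layout.

```python
def parse_fasta_record(text):
    """Return (header, sequence) for a single FASTA record."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("a FASTA record must begin with a '>' header line")
    header = lines[0][1:].strip()
    # Sequence data may be wrapped across several lines; join them.
    sequence = "".join(lines[1:]).upper()
    return header, sequence


# Hypothetical record, for illustration only.
example = """>example_record hypothetical flanking sequence
ACGTTGCA
GGCTAACG
"""

header, sequence = parse_fasta_record(example)
print(header)
print(sequence, len(sequence))
```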


Human Genome Variation Society: Variation Databases and Related Sites 
Entry Last Reviewed: August 5, 2002

Overview: This Web site is a collection of different types of mutation databases, such as locus-specific, disease-centered, national and ethnic, and nonhuman databases. Locus-specific databases are arranged alphabetically by gene symbol. Links to other related databases and educational resources also are provided.


Genome Web: Human Mutation Databases
Entry Last Reviewed: August 5, 2002

Overview: This resource, from the UK Human Genome Mapping Project Resource Centre, is a collection of links to general mutation and locus-specific databases. A brief description of each database is found below the list of links.




Protein Structure Resources

Databases described in this section can provide a better understanding of what a gene's protein product looks like. For some well-studied proteins, users also may find structures of mutant forms that can be compared with structures of nonmutated or wild-type proteins.

A good, basic introduction to protein structures, X-ray crystallography, and nuclear magnetic resonance spectroscopy (NMR) can be found in the National Institute of General Medical Sciences (NIGMS) 2001 publication The Structures of Life (67 pp.). A free copy can be ordered from the NIGMS Publication List or downloaded as a PDF file (requires Adobe Acrobat Reader).

For more information:

Nature of 3-D Structural Data: The Protein Data Bank's brief introduction to X-ray crystallography and NMR

Crystallography 101: Tutorial by Dr. Bernhard Rupp at Lawrence Livermore National Laboratory

The Basics of NMR: Online text book by Dr. Joseph P. Hornak, professor of Chemistry and Imaging Science at the Rochester Institute of Technology


Protein Data Bank
Entry Last Reviewed: August 5, 2002

Overview: The Protein Data Bank (PDB) is an international archive of 3D structural information for biological macromolecules. PDB is managed by the Research Collaboratory for Structural Bioinformatics (RCSB), a nonprofit consortium involving Rutgers, the State University of New Jersey; the National Institute of Standards and Technology (NIST); and the San Diego Supercomputer Center at the University of California, San Diego.

Search Tips: Users can query the archive by PDB ID or keyword using the search box on the main page. Other query options include SearchLite (keyword search form with examples), SearchFields (an advanced search option with customizable fields), and Status Search (used to find structures being processed by PDB). To learn more about searching PDB, take the Query Tutorial or examine the User Guides.

Information Provided: Each structure record includes a summary, structure viewing options, download and display options, links to records of structural neighbors, geometry, links to other protein information sources, and details about the structure's sequence. For step-by-step instructions on interacting with 3-D structures, see Examining a Protein's Structure.


Entrez Structure
Entry Last Reviewed: August 5, 2002

Overview: The National Center for Biotechnology Information (NCBI) database of three-dimensional molecular structures is called the Molecular Modeling Database (MMDB). The database is searchable via NCBI's Entrez retrieval system. Structure data are derived from X-ray crystallography and nuclear magnetic resonance (NMR) structure determinations deposited in the Protein Data Bank (PDB). This database is considerably smaller than Entrez's nucleotide and protein sequence databases. If a structure for a known sequence is not included, the structure of a protein homolog may be available for examination.

Search Tips: Users can search by keyword through the query interface or access structure records directly through links in PubMed citations and nucleotide and protein sequence records. Links to instructions for searching by keyword, protein sequence, and nucleotide sequence are on the main search page. As in other Entrez databases, users can refine searches using fields available in Limits, preview query results and browse index terms in Preview/Index, combine searches using History, and store selected records from different searches on Clipboard. Some indexed fields that can be used to narrow a search include accession number, substance name, author name, journal name, organism, properties, and text word. The Boolean operators AND, OR, and NOT must be in upper case. Phrase searching using double quotes and truncation using the asterisk (*) as a wild card also are supported. For more information about searching this and other NCBI Entrez databases, see the Entrez Help Document.
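As a rough illustration of these rules, the sketch below assembles a query string with uppercase Boolean operators, double-quoted phrases, and an asterisk for truncation. The helper names are hypothetical. The square-bracket field qualifier follows the general Entrez convention of writing a term followed by a field name in brackets; the exact field names accepted by the Structure database are an assumption here and should be checked against the Entrez Help Document.

```python
def quote_phrase(term):
    """Wrap a multi-word term in double quotes so it is searched as a phrase."""
    term = term.strip()
    already_quoted = term.startswith('"') and term.endswith('"')
    if " " in term and not already_quoted:
        return f'"{term}"'
    return term


def build_query(clauses, operator="AND"):
    """Join (term, field) pairs with one Boolean operator, written in upper case.

    A field of None leaves the term unqualified.
    """
    op = operator.upper()
    if op not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported Boolean operator: {operator!r}")
    parts = []
    for term, field in clauses:
        text = quote_phrase(term)
        if field:
            text += f"[{field}]"  # assumed field name, e.g. Organism
        parts.append(text)
    return f" {op} ".join(parts)


# The asterisk truncates "kinas" so it can match kinase, kinases, and so on.
query = build_query([
    ("tumor suppressor", None),
    ("kinas*", None),
    ("Homo sapiens", "Organism"),
])
print(query)
```

The resulting string can be pasted into the keyword search box; lowercase operators passed to the helper are converted to upper case before the query is assembled.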

Information Provided: Each structure record or summary includes MMDB and PDB identifiers, links to protein and nucleotide sequences and related MEDLINE documents, taxonomy assignments, structure authors, the date the structure was deposited into PDB, PDB classification and macromolecular content, links to sequence and structure neighbors, and structure-viewing options. Entries in MMDB are cross-linked to bibliographic information, sequence database entries, and NCBI taxonomy. To view a structure, users must download NCBI's free 3D structure viewer Cn3D, which runs on Windows, Macintosh, and UNIX platforms. To learn more about using this viewer, see NCBI's Cn3D Tutorial, Help, and FAQs.





For feedback and comments about this site, contact the site designer, Jennifer Bownas of HGMIS. To order a poster, click here.



The online presentation of this poster is a special feature of the U.S. Department of Energy (DOE) Human Genome Project Information Web site. The DOE Biological and Environmental Research program of the Office of Science funds this site.