Genome Database Guide
Resources for Learning about Genes and Proteins
Molecular Biology Basics
In this guide we
have provided descriptions and links to bioinformatics resources that
can help you learn more about genes and proteins associated with genetic
disorders or traits. Most were designed for students and professionals
in the life sciences, so a certain level of familiarity with genetics
and molecular biology is assumed. For those of you looking for an introduction
to the science behind the Human Genome Project, we have included links
to basic genetics and molecular biology resources that can be reviewed
before you attempt to use the bioinformatics resources.
The Science Behind the Human Genome Project: Web pages that define some
basic genetics concepts and explain how the Human Genome Project was
implemented.
Genome Glossary: DOE Human Genome Program glossary of genetics terms. Can
be searched or browsed alphabetically; links to other life science glossaries.
Each of the following resources is viewable on the Web and printable as an
Acrobat (PDF) file.
Primer on Molecular Genetics: Department of Energy (DOE) publication. Defines
basic genetic terms and overviews human genome mapping and sequencing,
model organism research, informatics, and the impact of the Human Genome
Project.
To Know Ourselves: DOE publication that explains the agency's involvement
with the Human Genome Project. Includes sections on human genome physical
mapping and sequencing; technological developments in laboratory instrumentation;
management of genomic data; and the project's ethical, legal, and social
implications.
Return
to Top
Molecular
Biology Basics Gene
Resources Nucleotide
Sequences Protein
Sequences
Sequence Similarity Searching
Gene-Mutations
Protein Structures
Learning about Genes and Their Products
These resources give
a general overview of a gene, along with some of the following information:
official symbol, locus, associated disorders or traits, mode of inheritance,
name and function of gene product, and links to additional gene-specific
resources.
Overview:
Created by Victor A. McKusick, MD, and his colleagues at Johns Hopkins
School of Medicine, Online Mendelian Inheritance in Man (OMIM) is a large,
searchable, up-to-date database of human genes, genetic traits, and
disorders. The OMIM database contains
text, pictures, and reference material plus copious links to NCBI's
Entrez database of MEDLINE articles and sequence information. OMIM is
intended for use by genetics researchers, advanced life-science students,
and healthcare professionals concerned with genetic disorders. Three
different interfaces are provided for exploring particular genes and
genetic conditions: Search, Gene Map, and Morbid Map.
OMIM Search - Interface where users can enter query terms and use limits
to refine search strategies. Users can also submit queries using the search
box on the OMIM home page.
OMIM Gene Map - Single file that lists entries by chromosome and cytogenetic
location starting with the p telomere of chromosome 1, continuing
through the q telomere of chromosome 22, and ending with the p telomere
of X through the q telomere of Y. This interface is useful for seeing
the sequential order of genes as they occur on each chromosome.
OMIM Morbid Map - Single file that lists genetic disorders alphabetically
and presents the cytogenetic map locations of associated genes.
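The Gene Map ordering described above (chromosomes 1 through 22, then X and Y, each read from the p telomere to the q telomere) can be sketched as a sort key. This is a minimal illustration in Python, not OMIM's own code. The function name `cytogenetic_sort_key` and the simplified location pattern (chromosome, arm, numeric band) are assumptions made for the example; ranges such as 7q31-q32 are not handled.

```python
import re

# Chromosome order used by the Gene Map file: 1..22, then X, then Y.
CHROMOSOME_ORDER = [str(n) for n in range(1, 23)] + ["X", "Y"]

# Simplified pattern (an assumption): chromosome, arm (p or q), band such as 31.2.
LOCATION = re.compile(r"^(\d{1,2}|X|Y)([pq])(\d+(?:\.\d+)?)$")


def cytogenetic_sort_key(location):
    """Return a tuple that sorts locations from 1pter through Yqter."""
    match = LOCATION.match(location)
    if match is None:
        raise ValueError(f"unrecognized location: {location!r}")
    chromosome, arm, band = match.groups()
    band_number = float(band)
    # Band numbers count outward from the centromere, so on the p arm
    # the highest band (nearest the telomere) comes first.
    if arm == "p":
        return (CHROMOSOME_ORDER.index(chromosome), 0, -band_number)
    return (CHROMOSOME_ORDER.index(chromosome), 1, band_number)


locations = ["Xq28", "7q31.2", "1p36.3", "7p15", "22q11.2", "1q21"]
ordered = sorted(locations, key=cytogenetic_sort_key)
print(ordered)
```

Sorting on a tuple keeps the chromosome, arm, and band comparisons separate, which mirrors how the file is read from one end of each chromosome to the other.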
Search Tips:
As part of NCBI's Entrez system, OMIM offers an advanced search option
that includes a variety of features and methods for refining search
strategies. With OMIM's Search interface, users can select different
fields (such as MIM number, gene name, chromosome, title word, and creation
date) with the Limits option, browse index terms and preview query results
using Preview/Index, combine searches using History, and store selected
records from different searches on Clipboard. Boolean Operators AND,
OR, and NOT must be in upper case. Phrase searching using double quotes
and truncation using the asterisk (*) as a wild card also are supported.
To learn more about OMIM Search, see our OMIM Search Tutorial or review
the Help and FAQs pages on the OMIM site.
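The query rules above (upper-case Boolean operators, double quotes for phrases, and an asterisk for truncation) can be sketched as a small helper that assembles a query string. This is only an illustration of the syntax. The function name `build_entrez_query` is hypothetical, and the sketch does not contact any NCBI service.

```python
def build_entrez_query(terms, operator="AND"):
    """Join search terms with an upper-case Boolean operator."""
    operator = operator.upper()
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {operator!r}")
    prepared = []
    for term in terms:
        term = term.strip()
        # Multi-word terms are wrapped in double quotes for phrase searching.
        if " " in term and not term.startswith('"'):
            term = f'"{term}"'
        prepared.append(term)
    return f" {operator} ".join(prepared)


# A phrase combined with a truncated term (the asterisk is passed through as is).
query = build_entrez_query(["cystic fibrosis", "transport*"], "and")
print(query)
```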
Both Gene Map and
Morbid Map can be searched by gene symbol, chromosomal location (X and
Y must be capitalized), and disorder keyword. For Gene Map it is best
to search by gene symbol (e.g., CFTR for the cystic fibrosis gene) or
gene locus (e.g., 7q31.2). When searching by disorder keyword in Morbid
Map, use plain text terms. For example, instead of searching for alzheimer's
disease, try searching for alzheimer disease. Characters
such as quotation marks, apostrophes, parentheses, or dashes are not
supported. Gene Map and Morbid Map searches will take users to the search
term's first instance in the file and display 20 entries. To locate
the next instance, hit the Find Next button.
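The plain-text rule for Morbid Map keywords can be sketched as a small clean-up step applied before searching. This is a hedged example: the helper name `plain_text_term` is hypothetical, and the character list simply mirrors the characters named above (quotation marks, apostrophes, parentheses, and dashes).

```python
import re

# Characters the text above lists as unsupported in Gene Map / Morbid Map searches.
UNSUPPORTED = "\"'()-"
_TO_SPACE = str.maketrans({ch: " " for ch in UNSUPPORTED})


def plain_text_term(term):
    """Reduce a disorder keyword to plain text."""
    # Drop possessive endings first so "alzheimer's" becomes "alzheimer".
    term = re.sub(r"'s\b", "", term)
    # Replace the remaining unsupported characters with spaces.
    term = term.translate(_TO_SPACE)
    # Collapse any runs of whitespace left behind.
    return " ".join(term.split())


print(plain_text_term("alzheimer's disease"))
```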
Information
Provided: Each OMIM record returned by a search contains MIM number,
official gene symbol, gene product name, gene locus (if known), general
description of disorder or gene-product function, and text summarizing
information from journal articles on a particular gene, disorder, or
trait. Links to citations in Medline, related OMIM entries and entries
in other NCBI databases also are included.
OMIM Gene Map entries
are organized by cytogenetic location and show gene symbols, gene title,
MIM number, names of associated disorders (if known), and other information.
Each entry includes links to NCBI Map Viewer and associated OMIM records.
OMIM Morbid Map
entries, arranged alphabetically by disorder name, include gene symbols
and disorder abbreviations, MIM number(s) linking to the OMIM record(s)
for associated gene(s), and cytogenetic location.
Overview:
LocusLink is an NCBI database that serves as a single query interface
to gene-specific information from a wide variety of bioinformatics sources.
LocusLink includes descriptive information about genetic loci in human,
fruit fly, mouse, rat, zebrafish, and human immunodeficiency virus type
1 genomes.
Search Tips:
Users can query LocusLink by typing keywords (such as disease or protein
name, gene symbol, accession numbers, or other database ID numbers)
into the search box at the top of the main page. Query options include
truncation using the asterisk (*) as a wild card, field restriction,
and Boolean operators that are not case sensitive. Grouping phrases
by parentheses or quotation marks is not supported. The Boolean AND
is implied when multiple search terms are entered.
LocusLink has a
system of controlled terms that can be used to retrieve only those records
with a particular feature. One of the controlled terms is disease_known,
which will return only loci associated with a known disorder. For example,
if you wanted to find only those members of the ABC (ATP-binding cassette)
protein subfamily that are associated with disease, you would enter
ABC and disease_known. See the Query
Tips in the Help file for a complete listing and more detailed descriptions
of controlled terms.
Letters just below
the search box link to lists of LocusLink records arranged alphabetically
by official gene symbol.
For query tips
and descriptions of fields included in each LocusLink report, see the
Help file.
LocusLink also provides a FAQ
page.
Information
Provided: Each LocusLink report may include the following types
of information: links to gene-specific entries in other databases, official
gene nomenclature, LocusID (identification number assigned to the gene
by LocusLink), overview of protein function with links to scientific
literature describing function, alternate symbols and aliases, phenotypes
or expressed characteristics associated with the gene, other database
ID numbers, homologous genes from other genomes, links to cytogenetic
maps, links to sequence records, and links to other related information
sources.
GeneCards
Entry Last Reviewed: August 5, 2002
Overview:
Developed at the Weizmann Institute
of Science in Israel, GeneCards is a database of human genes, their
products, and their involvement in hereditary disorders. GeneCards automatically
extracts gene-specific information from a variety of Web-based bioinformatics
resources and integrates the data into each entry. The database was
designed for scientists who want to use one interface to access multiple
databases for information about human genes that have been assigned
approved symbols. An important part of the GeneCards mission is to promote
the use of standard nomenclature, such as official gene symbols approved
by the HUGO Gene Nomenclature
Committee.
Search Tips:
Users can search GeneCards by keyword or gene symbol/alias using the
search box on the home page. Keywords can be single or multiple terms,
GenBank accession number, chromosome number, or gene locus. Truncation
using an asterisk (*) as a wild card at the beginning or end of a search
term is supported. The Boolean operators AND and OR can be used to connect
terms. The Boolean operator NOT is not supported. Examples
are provided for keyword searching. Users may also browse a complete
listing of genes or a subset of disease
genes featured in GeneCards. To learn more about searching GeneCards,
check out the Quick Start, Guided Tour, and Guidance System links.
Information
Provided: A typical entry for a gene may include the following information:
official gene name and symbol, synonyms or alternative names, ID numbers
assigned to the gene in other databases, chromosomal location, chromosome
map showing where the gene is found, domains and protein families associated
with the gene's protein product, links to sequence records, expression
patterns in human tissues, links to similar genes in other organisms,
SNPs and variants, disorders and mutations, links to citations in Medline,
and links to other related resources. Each GeneCard also links to sources
used to create the entry. GeneCards encourages feedback from its users
and provides a form
for submitting comments and suggestions.
Overview:
Developed at Johns Hopkins University and maintained by the Bioinformatics
Supercomputing Centre at the Hospital for Sick Children in Toronto,
Ontario, Canada, the Genome Database (GDB) is a curated repository for genomic
mapping data obtained from the Human Genome Initiative. GDB provides
descriptions of human genomic segments or regions of the human genome,
human genome maps, and genetic variations or polymorphisms. New users
may want to read About
GDB to learn about the different features of this database. For
each database entry or object, scientists are invited to submit data,
add annotations to existing data, and recommend links to related entries
in other databases.
Search Tips:
To search for information about a particular gene, in the simple search
box on the main page, choose Genomic Segments and then Name/GDB ID if
you know the official gene symbol. Users also may search by keyword
or DNA Sequence ID such as GenBank Identifier (gi). In keyword searching,
multiple terms entered in the search box will be treated as a phrase
unless they are separated by a Boolean operator (AND, OR, or NOT).
Truncation using the asterisk (*) as a wild card and single-character
substitution using the question mark (?) are both supported. In addition
to the simple search, GDB offers other query options such as customized
forms that include Gene
Search by Name or Symbol, sequence-based search forms such as GDB
BLAST, and generic search forms that let users search only for genes
or specific types of maps. Browsing options available from the Search
Options page are Genetic Diseases by Chromosome, Lists of Genes
by Chromosome, and Lists of Genes by Symbol. To learn more about using
GDB, see the Help
document or Example Searches.
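The difference between the two wild cards mentioned above can be tried locally with Python's standard `fnmatch` module, which happens to use the same two symbols. This is a sketch of the pattern semantics only. GDB's own matching rules (for example, case handling) may differ, and the short list of symbols is an arbitrary sample chosen for the example.

```python
from fnmatch import fnmatchcase

# Arbitrary sample of symbols used only to show how the patterns behave.
symbols = ["CFTR", "CFH", "CFHR1", "CAT", "CBR1"]


def match_symbols(pattern, candidates):
    """Return the candidates that match a wild-card pattern."""
    return [s for s in candidates if fnmatchcase(s, pattern)]


# Asterisk: truncation, matches any run of characters (including none).
truncated = match_symbols("CF*", symbols)
# Question mark: substitutes for exactly one character.
single = match_symbols("C?T", symbols)

print(truncated)
print(single)
```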
Information
Provided: Submitting a search returns a list of matching objects.
Clicking on Accession ID pulls up the GDB record. For each gene,
the following may be provided: gene symbols, names, and aliases; cytogenetic
localization; nucleic acid sequence links; protein sequence links; polymorphisms;
mutations; phenotype links to related OMIM records; homology links;
external links to other related Web sites or database entries; and citations
that link to GDB journal articles. At the top of each record, a button
labeled View Maps of Region pulls up a list of maps. Viewing
GDB maps requires a Java-capable browser (Netscape or Internet Explorer
3.0 and higher) that will run GDB's Java applet Mapview. See the Mapview
Help document for more information.
Nucleotide Sequence Databases
Overview:
Part of the National Center for Biotechnology Information (NCBI) Entrez
system, this database contains sequence data from several sources such
as GenBank, RefSeq, European Molecular Biology Laboratory (EMBL), DNA
DataBank of Japan (DDBJ), and Protein Data Bank (PDB). (Refer to overviews
below for general descriptions of NCBI databases GenBank and RefSeq.)
GenBank is part of the International
Nucleotide Sequence Database Collaboration, which also includes
DDBJ and EMBL.
Sequence data is exchanged daily among these three organizations. Although
record formats and search systems may differ, information contained
in each record (accession number, sequence data, annotations) will be
the same for all three databases.
Search Tips:
As with other Entrez databases, users can refine search strategies using
fields available in Limits and Preview/Index, browse Index terms of
a particular field, combine searches using History, and store selected
records from different searches on a Clipboard. Some search-refining
techniques available from the Limits page are to exclude certain types
of sequences (e.g., ESTs) and limit the search by date or particular
database (e.g., search only RefSeq). Boolean Operators AND, OR, and
NOT must be in upper case. Phrase searching using double quotes and
truncation using the asterisk (*) as a wild card also are supported.
For more information about searching this and other NCBI Entrez databases,
see Entrez
Help Document. For step-by-step instructions on finding and interpreting
sequence records, see the Accessing
Sequence Records tutorial.
Information
Provided: Each record returned in a search will include the nucleotide
sequence and annotations such as accession numbers, keywords, source
organism, and citations to reference articles. Sequence records also
may contain the translated amino acid sequence. For more detailed descriptions
of types of information in each sequence record, check the Sample
GenBank Record provided by NCBI.
GenBank
Overview:
GenBank is an NCBI database that serves as an archive for all publicly
available DNA sequences from more than 75,000 organisms. Submitting
scientists retain complete editorial control over their sequences, so
they decide on gene symbols (which may not be the official ones) and
additional information to include. Scientists contact NCBI if they wish
to make any modifications to their sequence records. As an archival
database, GenBank may contain redundant entries, even hundreds of records
for the same gene. Besides redundancy, GenBank sequences may be contaminated
with vector DNA. To address some problems associated with this archival
database, NCBI developed the nonredundant RefSeq.
RefSeq
Overview:
NCBI's database of reference sequences serves as a curated, nonredundant
source of information about genomic DNA contigs (segments constructed
by ordering cloned DNA fragments), mRNA transcripts, and proteins associated
with known genes. Unlike GenBank records, RefSeq records are created
and updated by NCBI staff. RefSeq records can be Predicted, Provisional,
or Reviewed. Predicted and Provisional records are generated by an automatic
process, but Reviewed records undergo a manual process that screens
sequences for problems such as sequencing errors or vector contamination.
Reviewed records also contain enhanced annotation such as additional
gene-relevant publications, summary of nucleotide and protein features,
and description of gene function in addition to annotation (official
gene symbol or name, aliases, Locus ID number, MIM number, and map information)
generated from automatic record processing. Each RefSeq entry has
a distinct accession number that begins with two characters followed by an
underscore; the two characters indicate the sequence type (e.g., NM_ for
mRNA and NP_ for protein). For more information about
RefSeq, see
RefSeq FAQs or LocusLink
& RefSeq Development.
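The accession format described above (a two-character prefix, then an underscore) lends itself to a simple lookup. The sketch below is illustrative: the prefix table reflects commonly documented RefSeq prefixes rather than anything stated in this guide, it may be incomplete, and the function name `refseq_sequence_type` is hypothetical.

```python
# Commonly documented RefSeq prefixes (an assumption, not taken from this guide).
REFSEQ_PREFIXES = {
    "NC": "complete genomic molecule",
    "NG": "genomic region",
    "NT": "genomic contig",
    "NM": "mRNA",
    "NR": "non-coding RNA",
    "NP": "protein",
    "XM": "predicted mRNA (model)",
    "XR": "predicted non-coding RNA (model)",
    "XP": "predicted protein (model)",
}


def refseq_sequence_type(accession):
    """Return the sequence type implied by a RefSeq-style prefix, or None."""
    prefix, separator, _ = accession.partition("_")
    # Accessions without the two-character prefix and underscore are not RefSeq-style.
    if separator != "_" or len(prefix) != 2:
        return None
    return REFSEQ_PREFIXES.get(prefix.upper())


print(refseq_sequence_type("NM_000492"))  # example RefSeq-style accession
print(refseq_sequence_type("U12345"))     # no underscore, so not RefSeq-style
```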
Protein Sequence Databases
Overview:
Part of the National Center for Biotechnology Information (NCBI) Entrez
system, this database includes sequence data compiled from a variety of
sources, including SWISS-PROT, Protein Information Resource (PIR), Protein
Data Bank (PDB), and Protein Resource Foundation (PRF) in Japan. Some
protein sequences were created from translations of coding regions in
DNA sequences stored in GenBank and RefSeq.
Search Tips:
As with other Entrez databases, users can refine search strategies using
fields available in Limits, preview the number of search results for
a query, browse Index terms of a particular field, combine searches
using History, and store selected records from different searches on
Clipboard. Some of the indexed fields that can be used to narrow a search
include accession number, gene name, journal name, molecular weight,
organism, properties, protein name, sequence length, and text word.
Users also can specify that only one particular database be searched
(e.g., retrieve protein sequences from SWISS-PROT only). Boolean Operators
AND, OR, and NOT must be in upper case. Phrase searching using double
quotes and truncation using the asterisk (*) as a wild card also are
supported. For more information about searching this and other NCBI
Entrez databases, see the Entrez
Help Document. For step-by-step instructions on finding and interpreting
sequence records, see the Accessing
Sequence Records tutorial.
Information
Provided: Search results displayed using the default view will include
locus name (a unique name assigned to each record), sequence length,
protein description (definition), accession number, database source,
keywords, organism, citations to references, comments concerning protein
function or associated traits or disorders, information about sequence
regions of biological significance, and the amino acid sequence. For
detailed descriptions about fields presented in each NCBI sequence record,
see the GenBank
sample record.
Overview:
The protein sequence databases SWISS-PROT and TrEMBL were developed
by groups at the Swiss Institute of
Bioinformatics (SIB) and the European
Bioinformatics Institute (EBI). SWISS-PROT uses three key criteria:
high level of annotation, minimal redundancy, and high level of integration
with other databases. SWISS-PROT includes as much information as possible
in its annotations, and external experts review current literature and
provide comments and updates on different protein groups. SWISS-PROT's
depth of annotation, however, requires considerable time and effort.
To keep the protein sequence collection current, a computer-annotated
supplement called TrEMBL (Translation of EMBL) was developed. Translations
of EMBL nucleotide
sequences are computer annotated and stored in TrEMBL until sequences
can be fully annotated and integrated into SWISS-PROT.
Search Tips:
SWISS-PROT sequence records can be accessed through the NCBI Entrez
Proteins database. If users choose to access the SWISS-PROT/TrEMBL Web
site for sequence searching, they can query the database using a variety
of methods: quick search on the main page (Boolean operators not supported),
Sequence Retrieval System (SRS),
full-text search
(Boolean operators, phrase searching, and wild cards supported), and
advanced search.
Forms for searching by accession number or ID, description (entry name,
gene name, species, organelle), author, or citation (SWISS-PROT only)
also are provided. To learn more about searching SWISS-PROT see the
SWISS-PROT Documentation
section which includes a downloadable PDF version of the user manual.
Information
Provided: SWISS-PROT entries are described as containing two types
of data: core data (consisting of sequence, bibliographic references,
and description of the protein's biological origin) and the annotation.
Detailed annotations in each entry describe protein function, post-translational
modification (e.g., addition of sugars or phosphate groups after mRNA
translation), domain and binding sites, secondary structure, quaternary
structure (e.g., homodimer, heterodimer), disorders associated with
altered protein forms or amounts, variants, and similarities to other
proteins.
Overview:
Established in 1984, Protein Information
Resource (PIR) is a division of the National Biomedical Research
Foundation associated with Georgetown University Medical Center. In
collaboration with Munich
Information Center for Protein Sequences (MIPS) in Germany and the
Japan International Protein Information Database (JIPID), PIR has developed
the PIR-International Protein Sequence Database (PSD). Its mission is
to be "the most comprehensive and expertly annotated protein sequence
database in the public domain" with the primary objective of achieving
"properties of Comprehensiveness, Timeliness, Non-Redundancy, Quality
Annotation, and Full Classification."
Search Tips:
PIR sequence records can be accessed through the NCBI Entrez Proteins
database. If users choose to go to the PIR-PSD Web site, the following
search options are provided: search by unique identifier or accession
number, basic text search, and advanced text search. For basic text
searches, the Boolean operators AND, OR, and NOT are not supported,
and a space between terms is interpreted as "and." Advanced searches
allow users to refine a strategy with fields such as Title, Species,
Author, Keyword, and Gene Name. In advanced search, search terms are
case sensitive and must be at least three characters long. Boolean operators
OR and NOT are supported. A space between words is interpreted as "and,"
so users searching for a phrase must put a character between multiple
terms (e.g., enter homo-sapiens to search for "homo sapiens"). For more
on searching PIR-PSD, see Help
Searching PIR Databases, Sample
Entry, Demo
Search, and FAQs.
Information
Provided: Each record includes protein name; classification and
origin; literature references; protein features such as domains and
motifs; primary sequence data; and links to related entries in other
databases. Users have the option to create submission forms for similarity
searching in PIR and NCBI databases. At the top of each record are links
to annotation and sequence data within the record and a link to a composition
table that summarizes total amino acid composition expressed as percentages.
At the bottom of the record are direct links to Protein Data Bank (PDB)
structures and sequence similarity alignments associated with the protein.
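The composition table mentioned above reports each amino acid as a percentage of the whole sequence. A minimal sketch of that calculation follows. It is not PIR's implementation, and the function name `amino_acid_composition` and the short test sequence are made up for the example.

```python
from collections import Counter


def amino_acid_composition(sequence):
    """Return each residue's share of the sequence as a percentage."""
    # Keep letters only, so spaces or digits pasted with the sequence are ignored.
    residues = [r for r in sequence.upper() if r.isalpha()]
    if not residues:
        return {}
    counts = Counter(residues)
    total = len(residues)
    return {residue: 100.0 * n / total for residue, n in sorted(counts.items())}


composition = amino_acid_composition("MKKA")
print(composition)
```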
Resources for Sequence Similarity Searching
Scientists frequently perform sequence-similarity searching to identify
homologs of a particular sequence and to gain insight into a gene product's
function and biological importance. Because more than one codon (triplet of
nucleotides) can code for a particular amino acid, nucleotide sequences that
differ considerably can translate into the same amino acid sequence.
This degeneracy of the genetic code is the reason that similarity searching
using amino acid sequences generally is more informative than using nucleotide
sequences.
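The effect of codon degeneracy can be seen in a short sketch. The two DNA sequences below are invented for illustration. They differ at most positions yet translate to the same peptide under the standard genetic code. The helper names `translate` and `percent_identity` are hypothetical.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA sequence in its first reading frame ('*' marks a stop)."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def percent_identity(a, b):
    """Percentage of positions that match between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


# Invented example sequences that use different codons for the same residues.
seq_a = "ATGCTTTCTCGT"
seq_b = "ATGTTGAGCAGA"

print(translate(seq_a), translate(seq_b))
print(percent_identity(seq_a, seq_b))
print(percent_identity(translate(seq_a), translate(seq_b)))
```

Here the nucleotide sequences match at fewer than half of their positions while the translated sequences are identical, so a protein-level comparison recovers a relationship that a nucleotide-level comparison would understate.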
Users who are new
to sequence-similarity searching should check out NCBI's Introduction
to Similarity Page, Rules
of Thumb, and BLAST
Guide's Glossary.
Overview:
BLAST (Basic Local Alignment Search Tool) is a set of programs designed
to perform similarity searches on all available sequence data. BLAST
uses an algorithm developed by the National Center for Biotechnology
Information (NCBI) that seeks out local alignment (alignment of some
portion of two sequences) as opposed to global alignment (alignment
of two sequences over their entire length). By searching for local alignments,
BLAST can identify regions of similarity in two sequences. Some similarity
searches offered by NCBI include comparing an amino acid sequence to
a protein sequence database (blastp), comparing a nucleotide query sequence
to a nucleotide sequence database (blastn), and comparing a nucleotide
sequence translated in all reading frames to a protein sequence database
(blastx).
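The distinction between local and global alignment can be made concrete with two textbook dynamic-programming scores. This sketch is not BLAST's heuristic algorithm. It uses the Smith-Waterman (local) and Needleman-Wunsch (global) recurrences as stand-ins, with an arbitrary scoring scheme (match +2, mismatch -1, gap -2) and invented sequences chosen for the example.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best score over all pairs of subsequences (Smith-Waterman)."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            score = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(score)
            best = max(best, score)
        previous = current
    return best


def global_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Score of the best end-to-end alignment (Needleman-Wunsch)."""
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            current.append(max(diagonal, previous[j] + gap, current[j - 1] + gap))
        previous = current
    return previous[-1]


# Two invented sequences that share one region (GATTACA) but differ at the ends.
query = "AAAGATTACATTT"
subject = "CCGATTACAGG"

print(local_alignment_score(query, subject))
print(global_alignment_score(query, subject))
```

The local score reflects only the shared region, while the global score is pulled down by the unrelated ends, which is why a local method can pick out regions of similarity inside otherwise different sequences.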
Search Tips:
From the main BLAST page, users can choose among several NCBI services.
For service descriptions, click on the question mark to the right of
each section title or see the Description
of BLAST Services. Clicking on the desired BLAST search option will
lead to a search page with a box for entering the query sequence. Accepted
input includes a sequence in FASTA format (a single-line description
followed by sequence data), bare sequence (sequence data without the
single-line description), and identifier. The identifier may be an accession
number or GenInfo Identifier (GI number), but must be entered as a single word
without any spaces between characters. For more information about input,
see NCBI's
Search Format page. Each search or format option on the search page
links to Help documentation with more detailed descriptions of each
option. For more on how to use BLAST, see our Sequence
Similarity Searching tutorial and NCBI's step-by-step BLAST
GUIDE, Query
Tutorial for new users,
BLAST Tutorial, and BLAST
Help.
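The two sequence input styles described above, FASTA format and bare sequence, can be told apart by the description line, which in FASTA format begins with a ">" character. The sketch below is a minimal parser for a single sequence. The function name `parse_query` is hypothetical, and the sketch does not try to recognize identifiers such as accession numbers.

```python
def parse_query(text):
    """Split pasted input into (description, sequence).

    The description is None when the input is a bare sequence.
    """
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty input")
    description = None
    # In FASTA format the single-line description starts with '>'.
    if lines[0].startswith(">"):
        description = lines[0][1:].strip()
        lines = lines[1:]
    # Join the remaining lines and drop any internal whitespace.
    sequence = "".join("".join(line.split()) for line in lines)
    return description, sequence.upper()


fasta_input = ">demo sequence\nATGC\nGGTA\n"
bare_input = "atgc ggta"

print(parse_query(fasta_input))
print(parse_query(bare_input))
```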
Information
Provided: After submitting a BLAST request, users are presented
with a Formatting BLAST page that displays the query statement,
domain information, request for ID number, and format options. After
desired format options are selected, pressing the Format button
will pull up the Results of BLAST page. Using pair-wise alignment
(the default alignment view) in format options, the Results page
will display an image map graphically depicting retrieved database sequences
(subject sequences) aligned with query sequence (depicted as the numbered
line at the top). Passing the mouse over each line below the query sequence
will display a description of that sequence in the text box. Clicking
on each line will jump down to the corresponding pairwise alignment
between the query sequence and a particular subject sequence. Below
the image map is a list of sequences producing significant alignments.
Accession number or identifier for each alignment links to a sequence
record. The score links to the corresponding pairwise alignment at the
bottom of the Results page. The blue L seen in some results
links to a related entry in LocusLink. See the Sequence
Similarity Searching tutorial for more on interpreting BLAST results.
Overview:
The FASTA Similarity Search tool is part of the Protein
Information Resource (PIR) collection of protein databases and bioinformatics
tools. This similarity-search tool uses the FASTA algorithm, which compares
a query sequence to those in the Protein Sequence Database and other
PIR databases.
Search Tips:
Users can query the database by inserting the single-letter amino acid
code into the query box or by entering the valid PIR-PSD entry code
for a particular protein of interest. See the Demo
Search for an example.
Information
Provided: Query results are presented in a table that lists more-similar
sequences at the top and less-similar sequences toward the bottom. Clicking
on ID number for a result will pull up the database entry for that protein,
and clicking on the colored bar on the right will link to pairwise alignment
between the submitted sequence and the subject sequence retrieved from
the database.
Gene Mutation Resources
Genes carry instructions
for building proteins, molecules that do most of the body's work. Certain
variations in a gene's nucleotide sequence can affect the resulting protein's
function by altering amino acid sequence and protein structure. The inability
of some variant proteins to function properly can cause genetic disorders
or other distinctive phenotypes.
Overview:
OMIM records for many genes include an Allelic Variants section that
summarizes published research concerning selected allelic variants or
mutations, many of which cause disorders. Some criteria for selecting
allelic variants for inclusion in OMIM are first mutation discovered,
high population frequency, distinctive phenotype, and unusual disease-causing
mechanism. Each variant is assigned a ten-digit number made up of the
gene's six-digit OMIM number, followed by a period and four digits unique
to the variant. For more information about this database, see the OMIM
entry above in the Learning about Genes and Their
Products section.
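The allelic variant numbering described above (a six-digit OMIM number, a period, and four variant digits) can be checked and split with a short pattern. This is a sketch: the function name `parse_allelic_variant` is hypothetical, and the sample number is used only to show the format.

```python
import re

# Six-digit MIM number, a period, then four digits unique to the variant.
ALLELIC_VARIANT = re.compile(r"^(\d{6})\.(\d{4})$")


def parse_allelic_variant(number):
    """Split an allelic variant number into its two parts, or return None."""
    match = ALLELIC_VARIANT.match(number.strip())
    if match is None:
        return None
    mim_number, variant = match.groups()
    # Both parts stay as strings so that leading zeros are preserved.
    return {"mim_number": mim_number, "variant": variant}


print(parse_allelic_variant("602421.0001"))
print(parse_allelic_variant("602421"))
```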
HGVbase
Entry Last Reviewed: August 5, 2002
Overview:
The Human Genome Variation database (HGVbase) is a database of annotated
records for known sequence variations in the human genome. This database
was designed as a tool to help scientists understand how common genome
sequence variations, such as single nucleotide polymorphisms, result
in complex phenotypes such as disease susceptibility and reactions to
drugs. Each HGVbase record features data extracted from publicly available
genome databases or published literature that has been subjected to
manual review and enhanced with annotations. HGVbase shares data with
NCBI's dbSNP,
and currently incorporates about 40% of dbSNP's records into its database.
HGVbase is funded by the Karolinska Institute Center for Genomics and
Bioinformatics in Sweden, the European Bioinformatics Institute, and
the European Molecular Biology Laboratory (EMBL).
Search Tips:
HGVbase provides text search and sequence search options for its users.
In addition to the quick search box available on the HGVbase home page,
there are links to four different search tools: Text Search, Text+ Search,
Sequence Search, and Regional Search. The Text and Text+ search forms
allow users to search for records by text strings that can be targeted
to particular fields of a record. The Regional Search lets users search
for SNPs by chromosomal location.
Since some characterized
genes may lack standardized names, HGVbase recommends sequence searching
over text-based searching. To search by sequence, simply paste DNA or
RNA sequence data (in any format) into the Sequence Search form and
click "Run." For more information about searching HGVbase see the "How
to search" page available from the navigation menu on the left or click
the "Help" link in the upper right corner of each search form.
Information
Provided: Each record includes features such as the variant
type, accession numbers that link to sequences that contain the variant,
portions of the sequence that flank the variant, alleles or possible
nucleotides at the site of the polymorphism, associated gene names and
symbols, the region of the gene where the variant is found (e.g., exon or
intron), and citations to source literature. For more information
about the various fields of each HGVbase record see the Data
Structure Record.
Overview:
Human Gene Mutation Database (HGMD) is a collection of published gene
lesions associated with human hereditary disorders. This database is
maintained by the Institute for Medical Genetics at University of Wales
College of Medicine. HGMD collaborates with Celera Genomics and is supported
by Genome Database (GDB) and several biotechnology companies. The home
page links to a useful overview of mutation
nomenclature.
Search Tips:
HGMD provides a simple search
interface for querying its database by disease, gene name, and gene
symbol. All punctuation marks (e.g., slashes, plus signs, double quotes,
commas, and dashes) are ignored. Truncation using an asterisk (*) is
supported. For more information on using HGMD, see the Help
file.
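The two search rules described above (punctuation is ignored, and an asterisk truncates a term) can be illustrated with a minimal Python sketch. This is not HGMD's actual implementation. The function names are hypothetical, and the sketch assumes that ignored punctuation behaves like a space and that an asterisk at the end of a word means "starts with."

```python
import re


def normalize(text: str) -> str:
    """Lower-case the text and replace punctuation with spaces.

    Assumption: ignored punctuation acts like a space. The asterisk is
    kept because it is the truncation symbol.
    """
    cleaned = re.sub(r"[^\w\s*]", " ", text.lower())
    return " ".join(cleaned.split())


def matches(query: str, entry: str) -> bool:
    """Return True if every word of the query matches a word of the entry.

    A query word ending in '*' matches any entry word that starts with
    the part before the asterisk; other words must match exactly.
    """
    entry_words = normalize(entry).replace("*", "").split()
    for word in normalize(query).split():
        if word.endswith("*"):
            stem = word.rstrip("*")
            found = any(w.startswith(stem) for w in entry_words)
        else:
            found = word in entry_words
        if not found:
            return False
    return True


if __name__ == "__main__":
    print(matches("dystroph*", "Muscular dystrophy, Duchenne"))    # True
    print(matches("charcot/marie/tooth", "Charcot-Marie-Tooth"))   # True
    print(matches("dystroph", "Muscular dystrophy, Duchenne"))     # False
```

Under these assumptions, a truncated term such as dystroph* retrieves both "dystrophy" and "dystrophin," while the untruncated stem retrieves neither.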
Information
Provided: Each search will pull up a list of gene symbols corresponding
to search terms. Clicking on a gene symbol will access a record summarizing
mutations and phenotypes and the number of entries associated with each
mutation type and phenotype. Clicking on a mutation type will show the
accession number, location, and associated phenotype and link to a reference
citation for each mutation. The record for each gene also links to a
mutation map, the gene's cDNA sequence, and gene-specific records in
other databases.
Overview:
One of the most common types of DNA sequence variation is the single
nucleotide polymorphism (SNP), in which a single nucleotide base (A,
C, T, or G) is substituted for another. NCBI's Database of Single Nucleotide
Polymorphism (dbSNP) serves as a public repository for sequence variations
such as small-scale insertions or deletions, polymorphic repetitive
elements, and microsatellite variation, in addition to SNPs. Data can
come from any part of a genome in any species. Sequence variations are
submitted to the database by members of the scientific community. This
database is separate from GenBank but is cross-linked to records in
other NCBI resources such as GenBank, LocusLink, and PubMed.
For more about
SNPs and why they are important to biomedical research, see the SNP
Fact Sheet and NCBI's SNPs:
Variation on a Theme.
Search Tips:
Users can search dbSNP directly or access the database through other
NCBI resources. One way to access SNP data mapped to a particular gene
is to use NCBI LocusLink. Once you have found a gene's LocusLink record,
clicking on the purple V or VAR link (if available) will
open a list of SNPs mapped to that locus. Records in NCBI's sequence
databases also may link to SNP data.
To search dbSNP
directly, use Entrez
SNP or dbSNP's Easy
Search Form. dbSNP also provides a BLAST
search option that compares the query sequence with sequence data
contained in each SNP record. The BLAST option will generate a list
of SNPs that can be found within the query sequence. See the Entrez
SNP main page for descriptions of the different fields that can
be used for searching the database.
NCBI will soon
feature a quick how-to guide called GETTING STARTED. This guide should
help novice users learn how to use and design search strategies for
dbSNP. To learn more about dbSNP, see the FAQs
page.
Information
Provided: From LocusLink, after clicking on the purple V
or VAR link, the SNP's linked from LocusLink page will
open. This page provides Gene Model information with links to associated
contig, mRNA, and protein sequence records. Each SNP is included in
the graphic gene model and color-coded based on where the SNP is located
(intron, exon, or untranslated region) and whether the change is synonymous
or non-synonymous. For each SNP that occurs in an exon, the associated
nucleotide, codon position, and amino acid residue are given.
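The distinction between synonymous and non-synonymous changes can be worked through in a short Python sketch. It uses the standard genetic code and a hypothetical helper, classify_snp, that takes a codon, the codon position of the SNP (1 to 3), and the substituted base. It is an illustration of the concept only and is not how dbSNP computes its annotations.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
# '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snp(codon: str, position: int, new_base: str):
    """Classify a single-base substitution within one codon.

    position is the codon position (1, 2, or 3) of the changed base.
    Returns (kind, original_residue, new_residue).
    """
    codon = codon.upper()
    if len(codon) != 3 or position not in (1, 2, 3):
        raise ValueError("expected a 3-base codon and a position of 1, 2, or 3")
    mutated = codon[: position - 1] + new_base.upper() + codon[position:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    kind = "synonymous" if before == after else "non-synonymous"
    return kind, before, after


if __name__ == "__main__":
    print(classify_snp("GAA", 3, "G"))  # ('synonymous', 'E', 'E')
    print(classify_snp("GAG", 2, "T"))  # ('non-synonymous', 'E', 'V')
```

The second call changes GAG (glutamic acid) to GTG (valine), so the amino acid residue differs and the change is non-synonymous.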
Each SNP is assigned
an identification number called a cluster id or rs number. The record
for each cluster id is referred to as a cluster report and includes
source organism, variation type (e.g., SNP (single nucleotide polymorphism)
or DIP (deletion/insertion polymorphism)), the nucleotide sequence flanking
the SNP in FASTA format, a LocusLink Analysis map depicting where the
SNP is found within the gene, and links to other NCBI resources related
to the particular SNP. Submitter records for each cluster provide one
or more links to more detailed descriptions for each SNP submission.
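For readers who want to work with the FASTA-formatted flanking sequence outside the browser, the following Python sketch reads a single FASTA record and reports where a variant site falls. The record shown is made up for illustration. The sketch also assumes that the variant site is written as a two-base IUPAC ambiguity code (for example, R for A or G), which should be checked against the actual record before relying on it.

```python
# Two-base IUPAC ambiguity codes and the alleles they stand for.
IUPAC_TWO_BASE = {
    "R": "A/G", "Y": "C/T", "S": "C/G",
    "W": "A/T", "K": "G/T", "M": "A/C",
}


def parse_fasta(text: str):
    """Parse one FASTA record and return (header, sequence)."""
    header, chunks = None, []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                raise ValueError("expected a single FASTA record")
            header = line[1:].strip()
        else:
            chunks.append(line.upper())
    if header is None:
        raise ValueError("no FASTA header line found")
    return header, "".join(chunks)


def find_variant_sites(sequence: str):
    """Return (1-based position, alleles) for each ambiguity code found."""
    return [
        (index + 1, IUPAC_TWO_BASE[base])
        for index, base in enumerate(sequence)
        if base in IUPAC_TWO_BASE
    ]


if __name__ == "__main__":
    # Hypothetical record, not taken from dbSNP.
    record = """>example_snp hypothetical flanking sequence
ACGTTGCA
RGGCTTAA
"""
    header, sequence = parse_fasta(record)
    print(header)
    print(find_variant_sites(sequence))  # [(9, 'A/G')]
```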
Overview:
This Web site is a collection of different types of mutation databases,
including locus-specific, disease-centered, national and ethnic, and
nonhuman databases. Locus-specific databases are arranged alphabetically by gene
symbol. Links to other related databases and educational resources also
are provided.
Overview:
This resource, from the UK Human Genome Mapping Project Resource Centre,
is a collection of links to general mutation and locus-specific databases.
A brief description of each database is found below the list of links.
Protein
Structure Resources
Databases described
in this section can provide a better understanding of what a gene's protein
product looks like. For some well-studied proteins, users also may find
structures of mutant forms that can be compared with structures of nonmutated
or wild-type proteins.
A good, basic introduction
to protein structures, X-ray crystallography, and nuclear magnetic resonance
spectroscopy (NMR) can be found in the National Institute of General Medical
Sciences (NIGMS) 2001 publication The Structures of Life (67 pp.). A free
copy can be ordered from the NIGMS
Publication List or downloaded as a PDF
file (requires Adobe
Acrobat Reader).
For more information:
Nature
of 3-D Structural Data: The Protein Data Bank's brief introduction
to X-ray crystallography and NMR
Crystallography
101: Tutorial by Dr. Bernhard Rupp at Lawrence Livermore National
Laboratory
The
Basics of NMR: Online textbook by Dr. Joseph P. Hornak, professor
of Chemistry and Imaging Science at the Rochester Institute of Technology
Overview:
Protein Data Bank (PDB) is an international archive of 3D structural
information for biological macromolecules. PDB is managed by the Research
Collaboratory for Structural Bioinformatics (RCSB), a nonprofit consortium
involving Rutgers, the State University of New Jersey; National Institute
of Standards and Technology (NIST); and San Diego Supercomputer Center
at the University of California, San Diego.
Search Tips:
Users can query the archive by PDB ID or keyword using the search box
on the main page. Other query options include SearchLite
(keyword search form with examples), SearchFields
(an advanced search option with customizable fields), and Status
Search (used to find structures being processed by PDB). To learn
more about searching PDB, take the Query
Tutorial or examine the User
Guides.
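Because the main search box accepts either a PDB ID or a keyword, it can help to recognize which one you are typing. The Python sketch below assumes the classic four-character PDB ID format (one digit followed by three letters or digits), which is not stated in this guide. The function name is hypothetical.

```python
import re

# Assumption: a classic PDB ID is one digit followed by three
# alphanumeric characters, such as 4HHB.
PDB_ID_PATTERN = re.compile(r"^[0-9][A-Za-z0-9]{3}$")


def query_type(query: str) -> str:
    """Return 'pdb_id' if the query looks like a PDB ID, else 'keyword'."""
    return "pdb_id" if PDB_ID_PATTERN.match(query.strip()) else "keyword"


if __name__ == "__main__":
    print(query_type("4HHB"))        # pdb_id
    print(query_type("hemoglobin"))  # keyword
```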
Information
Provided: Each structure record includes a summary, structure viewing
options, download and display options, links to records of structural
neighbors, geometry, links to other protein information sources, and
details about the structure's sequence. For step-by-step instructions
on interacting with 3-D structures, see Examining
a Protein's Structure.
Overview:
The National Center for Biotechnology Information (NCBI) database of
three-dimensional molecular structure is called the Molecular Modeling
Database (MMDB). The database is searchable via NCBI's Entrez retrieval
system. Structure data is derived from X-ray crystallography and nuclear
magnetic resonance (NMR) structure determinations in the Protein Data
Bank (PDB). This database is considerably smaller than Entrez's nucleotide
and protein sequence databases. If a structure for a known sequence
is not included, the structure of a protein homolog may be available
for examination.
Search Tips:
Users can use the query interface to search by keyword, or access structure
records directly through links in PubMed citations and nucleotide and
protein sequence records. Links to instructions for searching by keyword,
protein sequence, and nucleotide sequence are on the main search page.
As in other Entrez databases, users can refine searches using fields
available in Limits, preview query results and browse index terms in
Preview/Index, combine searches using History, and store selected records
from different searches on Clipboard. Some indexed fields that can be
used to narrow a search include accession number, substance name, author
name, journal name, organism, properties, and text word. Boolean operators
AND, OR, and NOT must be in upper case. Phrase searching using double
quotes and truncation using the asterisk (*) as a wild card also are
supported. For more information about searching this and other NCBI
Entrez databases, see the Entrez
Help Document.
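The query rules above (upper-case Boolean operators, double quotes for phrases, and the asterisk for truncation) can be combined in a small Python sketch that assembles a search string to paste into the query box. The helper names are hypothetical, and the square-bracket field tag, such as [Organism], is an assumption about Entrez field syntax that should be confirmed in the Entrez Help Document.

```python
def term(text: str, field: str = None, truncate: bool = False) -> str:
    """Format one search term.

    Multiword text is wrapped in double quotes for phrase searching.
    truncate=True appends '*' as a wild card instead.
    field, if given, is appended as a bracketed tag (assumed syntax).
    """
    text = text.strip()
    if truncate:
        text = text + "*"
    elif " " in text:
        text = f'"{text}"'
    return f"{text}[{field}]" if field else text


def combine(operator: str, *terms: str) -> str:
    """Join terms with a Boolean operator, forcing it to upper case."""
    op = operator.upper()
    if op not in {"AND", "OR", "NOT"}:
        raise ValueError("operator must be AND, OR, or NOT")
    return "(" + f" {op} ".join(terms) + ")"


if __name__ == "__main__":
    query = combine(
        "and",
        term("hemoglobin"),
        term("Homo sapiens", field="Organism"),
    )
    print(query)  # (hemoglobin AND "Homo sapiens"[Organism])
    print(term("kinas", truncate=True))  # kinas*
```

Forcing the operator to upper case in one place avoids the common mistake of typing a lower-case "and," which would not be treated as a Boolean operator.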
Information
provided: Each structure record or summary includes MMDB and PDB
identifiers, links to protein and nucleotide sequences and related MEDLINE
documents, taxonomy assignments, structure authors, date the structure
was deposited into PDB, PDB classification and macromolecular content,
links to sequence and structure neighbors, and structure-viewing options.
Entries in MMDB are cross-linked to bibliographic information, sequence
database entries, and NCBI taxonomy. To view a structure, users must
download NCBI's free 3D structure viewer Cn3D,
which runs on Windows, Macintosh, and UNIX platforms. To learn
more about using this viewer, see NCBI's Cn3D
Tutorial, Help,
and FAQs.