| Spring 2002 |
Seminar in Biotechnology |
Douglas W. Smith |
| 2130 Bonner Hall |
BILD94 |
5254 Muir Biology Building |
|
Suresh Subramani, Instructor |
x42620; dsmith@ucsd.edu |
Introduction
to Bioinformatics
Search Google et al for "bild94 bioinformatics"
| DNASYSTEM
| Lecture |
Topics:
- I. What is Bioinformatics?
- II. Sequence Databases and
Their Use
- A. Primary Sequence Databases
- B. Uses of Sequence Databases
- C. Retrieve Information from
Sequence Databases
- D. Analysis using Sequences:
finding Homologues
- 1. BLAST: Basics, Filters,
Variations
- 2. FASTA
- E. Analysis using Sequences:
finding Genes in DNA
- 1. Grail
- 2. BCM Gene Finder - BCM
Search Launcher
- 3. GeneMark
- F. Analysis using Sequences:
finding Motifs
- 1. Motifs
- 2. Protein Family Classifications
- G. Multiple Sequence Analysis
- 1. Clustal W
- 2. Web Databases of Multiple
Sequence Alignments
- H. Phylogenetics
- 1. Basics and Methods
- 2. Confidence Levels
- III. Whole Genomes
- A. Implications
- B. TIGR
- IV. Organism and Other Databases
- A. Need for Organism Databases
- B. Paradigm: ACeDB
- C. Advantages of ACeDB for
Organism DBs
- D. ACeDB on Web: WebAce and
AceBrowser
- 1. ACeDB: Disadvantages
- 2. WebAce and AceBrowser
- E. Example of ACeDB: DictyDB
- 1. Graphics, Text Displays
- 2. WebAce and AceBrowser
- F. Other Web Organism Databases
- V. Problems ... Directions
to Go
- A. Problems
- B. Need: "smart"
Analysis Packages
- C. "Training" of
"smart" Analysis Packages
- VI. Additional Materials
- A. Books
- B. Recent Short Articles
I. What is Bioinformatics?
Bioinformatics:
- computerized annotation of genomic and biological
information and data (databases)
- transformation and manipulation of these
data (software tools)
- computational analysis of biological data
Overall Aim of Bioinformatics:
- provide biologically important predictions
from annotated data and transformation / manipulation of these
data
Databases:
- Primary and added-value databases
- Sequence vs organismic databases
- 'Federated' databases: global computer networks
... WWW
- Software tools access multiple databases,
often at different sites
Software tools (computer programs):
- Software tools: sequence analysis, database
construction and management, evolutionary relations, structural
analyses, pathways, microarray analysis, proteomic analysis
- Software tools integrated into databases
Key words from some of the
review articles listed at end:
- Finding genes, locating coding regions, predicting
function: automate
- Function, Evolution, Sequence, Structure
(FESS relationships)
- Metabolic genotype, phenotype, redundancy
- Genes to Pathways; Genes to biological knowledge
- Assigning gene sets to different species:
orthologs vs paralogs
- Finding conserved proteins common to all
life
- Expression profiles, relation to metabolic
pathways / genetic networks
- Gene synteny between species: gene adjacency
in genomes
The Need for Bioinformatics:
- Whole Genome Analyses and Sequences
- Experimental Analyses involving Thousands
of Genes simultaneously
- DNA Chips and Array Analyses
- Expression Arrays
- Comparative Analyses between Species and
Strains
- Proteomics: 'Proteome' of an Organism ...
2D gels, Mass Spec
- Medical applications: Genetic Disease ...
SNPs
- Pharmaceutical and Biotech Industry
- Forensic applications
- Agricultural applications
Bioinformatics is New, Hot,
and Growing:
Chronology of Review Articles:
446 at PubMed for 'bioinformatics AND review',
(101 for 'bioinformatics AND review AND 2002')
257 for 'bioinformatics AND review AND 2001'
144 for 'bioinformatics AND review AND 2000'
67 for 'bioinformatics AND review AND 1999'
52 for 'bioinformatics AND review AND 1998'
18 for 'bioinformatics AND review AND 1997'
13 for 'bioinformatics AND review AND 1996'
6 for 'bioinformatics AND review AND 1995'
7 for 'bioinformatics AND review AND 1994'
1 for 'bioinformatics AND review AND 1993'
2 for 'bioinformatics AND review AND 1992'
0 for 'bioinformatics AND review AND 1991'
II. Sequence Databases and
Their Use:
A. Primary Sequence Databases:
- Nucleic Acid Databases
- NCBI (Natl Center Biotech
Information) - GenBank
- EBI (European Bioinformatics Institute) -
EMBL
- DISC - DNA Information and Stock Center,
Japan
- Protein Databases
- NCBI - GenPept
- ExPASy - SwissProt and TrEMBL
- EBI (European Bioinformatics Institute)
- DISC - DNA Information and Stock Center,
Japan
B. Uses of Sequence Databases:
- Information Retrieval
- Analysis: "given a new
DNA sequence, what's in it?"
- Finding Homologues
- Finding Genes
- Finding Motifs - DNA Binding
Sites
C. Retrieve Info from Sequence
Databases:
- NCBI - Entrez
- ExPASy - SwissProt and TrEMBL
- http://www.expasy.ch/
- SwissProt - Bairoch well-annotated non-redundant
protein DB
- TrEMBL - Translation of EMBL DNA coding sequences
- EBI - SwissProt, TrEMBL, PIR
- http://www.ebi.ac.uk/
- SRS - Sequence Retrieval System
- Software Tools - FASTA, WU-Blast2, ClustalW
- EBI2 - second server at EBI
- DISC - DNA Information and Stock Center,
Japan - DDBJ
- PDB - Protein DataBank
D. Sequence Analysis: finding
Homologues
- Homologues - sequences descending
from common ancestor
- Comparison of Sequences using Distance Matrix
approach
- PAM (Point Accepted Mutation) matrices
- BLOSUM matrices
- Unitary matrix for nucleic acid comparisons
- DOT PLOTS - 2D graph of alignment
of two sequences
- Use same sequence to find
Direct Repeats, e.g. DNA tandem repeats
- Use DNA strand against complementary
strand to find Inverted Repeats - e.g. RNA stem-loop secondary
structure
- 'BLAST 2 Sequences' at NCBI one of the few Web sites to do this ... but
just gives the best alignment
- BCM Search Launcher does similarly
- BLAST - Basic Local Alignment
Search Tool
- FASTA - fast, heuristic database
search tool of Pearson and Lipman (local alignments)
- Others:
- WU-Blast - Washington Univ
gapped BLAST version
- SSEARCH - Smith-Waterman
algorithm used
- BLITZ - Smith-Waterman on
Parallel Processing computer
- MpSrch - similar to BLITZ
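The dot plot idea above can be sketched in a few lines. This is a minimal illustration, not the method used by any of the Web sites mentioned; the function names and the `window` parameter are assumptions made for the example. A dot is placed wherever a short window of one sequence matches the other exactly.

```python
def dot_plot(seq_a, seq_b, window=3):
    """Return the set of (i, j) cells where a window of seq_a matches seq_b exactly."""
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            if seq_a[i:i + window] == seq_b[j:j + window]:
                dots.add((i, j))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the plot as text: rows follow seq_a, columns follow seq_b."""
    rows = []
    for i in range(len(seq_a)):
        rows.append("".join("*" if (i, j) in dots else "." for j in range(len(seq_b))))
    return "\n".join(rows)


def reverse_complement(dna):
    """Reverse complement of a DNA string (A, C, G, T only)."""
    pairs = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(pairs[base] for base in reversed(dna))


# Self-comparison: runs parallel to the main diagonal mark direct repeats.
repeat = "ACGTTACGTT"
self_dots = dot_plot(repeat, repeat, window=4)
print(render(repeat, repeat, self_dots))
```

To look for inverted repeats, compare a strand against its complement: `dot_plot(seq, reverse_complement(seq))`.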
1. BLAST:
Basic BLAST Algorithm:
- Objective: find all Local regions of similarity
distinguishable from random
- Only Local alignments permitted - no Gaps
- Mathematically sound - Karlin/Altschul statistics
- 3 step algorithm:
- Compile list of high scoring words of length
w
- w = 4 for proteins, w = 12 for nucleic acids
- Scan for 'word hits' of score greater than
Threshold T
- Extend word hits in both directions to find
High Scoring Pairs (HSPs) with scores greater than S
- Word list compiled from words in Query sequence:
Hash Table
- S determined from EXPECT parameter: number
of hits expected to be found at random (default: EXPECT = 10)
- Distribution of hits found with given Score
is an Extreme Value Distribution (distribution of Maximum, rather
than Sum, of many indep random variables)
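The three steps above can be sketched as follows. This is a simplified illustration only: it uses exact word matches (as in nucleotide searches) rather than a list of high-scoring neighborhood words above Threshold T, a plain +1/-1 score instead of a PAM or BLOSUM matrix, and a hypothetical `x_drop` cutoff to stop extensions. All function and parameter names are assumptions for the example.

```python
from collections import defaultdict


def build_word_table(query, w):
    """Hash table: each word of length w in the query -> its start positions."""
    table = defaultdict(list)
    for i in range(len(query) - w + 1):
        table[query[i:i + w]].append(i)
    return table


def extend(query, subject, qi, si, w, match=1, mismatch=-1, x_drop=3):
    """Ungapped extension of a word hit in both directions.

    Extension in one direction stops once the running score falls more
    than x_drop below the best score seen so far in that direction.
    Returns (score, query_start, subject_start, length).
    """
    def score(a, b):
        return match if a == b else mismatch

    seed = w * match  # the seed word is an exact match in this sketch

    # Extend to the right of the word.
    best_r = run = len_r = k = 0
    while qi + w + k < len(query) and si + w + k < len(subject):
        run += score(query[qi + w + k], subject[si + w + k])
        k += 1
        if run > best_r:
            best_r, len_r = run, k
        elif best_r - run > x_drop:
            break

    # Extend to the left of the word.
    best_l = run = len_l = k = 0
    while qi - 1 - k >= 0 and si - 1 - k >= 0:
        run += score(query[qi - 1 - k], subject[si - 1 - k])
        k += 1
        if run > best_l:
            best_l, len_l = run, k
        elif best_l - run > x_drop:
            break

    return seed + best_l + best_r, qi - len_l, si - len_l, len_l + w + len_r


def find_hsps(query, subject, w=4, min_score=6):
    """Scan the subject for word hits, extend each, keep HSPs scoring >= min_score."""
    table = build_word_table(query, w)
    hsps = set()
    for si in range(len(subject) - w + 1):
        for qi in table.get(subject[si:si + w], []):
            hsp = extend(query, subject, qi, si, w)
            if hsp[0] >= min_score:
                hsps.add(hsp)
    return sorted(hsps, reverse=True)


print(find_hsps("TTACGTACGGA", "CCACGTACGCC"))
```

Several word hits on the same diagonal extend to the same HSP, which is why the results are collected in a set. In the real program the cutoff score S is derived from the EXPECT parameter rather than passed in directly.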
BLAST Programs:
- BLASTN - NA query against NA database
- BLASTP - Protein query against Protein database
- BLASTX - NA query (translated) against Protein
database
- TBLASTN - Protein query against NA (translated)
database
- TBLASTX - NA query (translated) against NA
(translated) database
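The "(translated)" searches compare conceptual translations in all six reading frames. A minimal sketch of six-frame translation with the standard genetic code is shown here; the function names are assumptions for the example, and codons containing characters other than A, C, G, T are written as 'X'.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Reverse complement; unknown characters become N."""
    pairs = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(pairs.get(base, "N") for base in reversed(dna))


def translate(dna):
    """Translate starting at the first base; a trailing partial codon is dropped."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(protein)


def six_frames(dna):
    """Return a dict of frame label ('+1'..'+3', '-1'..'-3') -> translation."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for label, protein in six_frames("ATGGCCTAA").items():
    print(label, protein)
```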
Filters:
- SEG filter - ignore Low Information Content
/ Low Complexity seqs
- DUST filter - like SEG but used for DNA
- XNU filter - used to ignore repeated sequences
(not used much)
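The idea behind low-complexity filtering can be illustrated with a sliding-window entropy measure. This is a crude sketch and not the SEG or DUST algorithm itself; the window length, the entropy cutoff, and the function names are illustrative assumptions. Masked residues are written as X, as in the filtered BLAST output.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (bits per residue) of the residue composition of a window."""
    counts = Counter(window)
    total = len(window)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mask_low_complexity(seq, window=12, min_entropy=2.2, mask_char="X"):
    """Mask every window whose composition entropy is below min_entropy."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for k in range(start, start + window):
                masked[k] = mask_char
    return "".join(masked)


protein = "MKVLAAGTWE" + "Q" * 12 + "HDRFNPYCIS"
print(mask_low_complexity(protein))
```

Because whole windows are masked, a few residues flanking the repeat are masked as well; real filters trim the masked region more carefully.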
Variations on basic BLAST algorithm:
- Gapped-BLAST or BLAST 2.0
- Extend Word Pairs via 'Two-hit Method': require
two on same 'diagonal' (no gaps) - then extend
- Generate Gapped Alignments: use best Word
Pair Extension, do NWS, accepting other Extensions that drop
Score no more than an amount XG
- PSI-BLAST (Position-Specific Iterated BLAST)
- Gapped-BLAST used
- Iterate BLAST searches using 'Position Specific
Score Matrix' generated in one iteration for the next iteration.
- This 'Position Specific Score Matrix' or
'Profile' is used instead of the Input Sequence and Distance
Matrix.
- PHI-BLAST (Pattern-Hit Initiated BLAST)
- Combines matching of 'Regular Expressions'
with BLAST Extensions to give local alignments about the match.
- Used for Protein Motif tasks
- Can be combined with PSI-BLAST
Output: example of BLASTP with filter, defaults here
- Java Applet Graphic - color coded hits, MouseOver
links
- One-liners - Genbank entry info, description,
Score (bits), E value
- Alignments - statistics, Xs for filtered
residues, links to Entrez
- Statistics - BLAST run statistics
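The Score (bits) and E value in the one-liners are related through the Karlin/Altschul statistics mentioned earlier: E = K m n exp(-lambda S), where S is the raw score, m and n are the query and database lengths, and lambda and K depend on the scoring system. The sketch below assumes illustrative values for lambda and K (the actual values are reported in the BLAST run statistics) and ignores edge-effect corrections to the lengths.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalized score in bits: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance hits with score >= S: E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


# Illustrative parameters only (assumptions, not taken from a real run).
LAMBDA, K = 0.267, 0.041
bits = bit_score(100, LAMBDA, K)
e_value = expect_value(100, query_len=250, db_len=1_000_000, lam=LAMBDA, k=K)
print(f"bits = {bits:.1f}, E = {e_value:.2e}")
```

The two quantities are consistent: E equals m * n * 2^(-S'), so the bit score lets E values be compared across scoring systems.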
2. FASTA:
Rapid, heuristic (not mathematically
sound) local alignment
FASTA 4-Step Algorithm:
- Find best alignments on diagonals ... ktup
= n: number of perfect matches required for Word ... join Words
on given diagonal
- ktup = 1-2 for proteins; ktup = 4-6 for nucleic
acids
- Rescue 10 best Regions, using Distance Matrix
scoring
- Best single Diagonal Score - init1 score
- Join Initial Regions ... use only hits where
init1 > Cutoff
- Gaps and Gap Penalty - Distance between Diagonals
< GapPen
- Best joined Score - initn score
- Optimize alignment via NWS Dynamic Programming
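The first step, finding the diagonals richest in ktup matches, can be sketched as below. This illustrates only the word lookup and diagonal bookkeeping; the rescoring with a matrix (init1), the joining of regions (initn), and the final optimization are not shown. Function names are assumptions for the example.

```python
from collections import Counter, defaultdict


def ktup_diagonals(query, subject, ktup=2):
    """Count ktup word matches per diagonal (diagonal = subject pos - query pos)."""
    positions = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        positions[query[i:i + ktup]].append(i)

    diagonals = Counter()
    for j in range(len(subject) - ktup + 1):
        for i in positions.get(subject[j:j + ktup], []):
            diagonals[j - i] += 1
    return diagonals


def best_diagonals(query, subject, ktup=2, top=10):
    """Return the `top` diagonals with the most word matches, best first."""
    return ktup_diagonals(query, subject, ktup).most_common(top)


print(best_diagonals("ACDEFGHIK", "XXACDEFGHIKYY"))
```

Matches that lie on the same diagonal can be joined without gaps, which is why the diagonal index is the natural key.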
Output:
example here
- Histogram - Extreme Value Distribution
- One-liners - initn, init1, opt, Z-score,
E value
- Alignments - with gaps
E. Sequence Analysis: finding
Genes in DNA
Methods:
- Gene Search by Signal
- Algorithm: Position Specific Weight Matrix
- Sliding Window
- Look for Signals - Promoter Sites, Splice
Sites, ...
- Gene Search by Content
- Open Reading Frames
- Use of Statistical Properties of Protein
Coding Regions
- Unequal use of amino acids
- Unequal numbers of codons per amino acid
- Codons available not equally used - Codon
Usage
- Methods:
- Positional Base Frequencies - nucleotide
composition is highly dependent on Reading Frame
- Codon Usage / Codon Preference
- Base Composition bias
- The major problem - small Exons ( < 50
nucs)
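The simplest Gene Search by Content is an Open Reading Frame scan. The sketch below reports ATG-to-stop ORFs in the three forward frames of one strand; it is a minimal illustration with assumed names and an assumed `min_codons` cutoff, and it ignores introns, so it is closer to what works for bacterial sequence than for genes with small exons.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=30):
    """Return (start, end, frame) for each ATG...stop ORF on the given strand.

    start is the index of the A of the ATG, end is the index just past the
    stop codon, and frame is 0, 1 or 2. Only the first ATG after the
    previous stop is used as the start of each ORF.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Length counted in codons from the ATG, excluding the stop.
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


example = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
print(find_orfs(example, min_codons=5))
```

To cover the other strand, run the same scan on the reverse complement of the sequence.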
Grail:
- Basic Algorithm: 7 'Sensor Algorithms' -
into Neural Net
- Current version of Grail per se: Grail
1.3
- Recent Developments:
- GrailEXP - Clusters of overlapping exon candidates
- 13 'Sensor Algorithms'
- Coding Region Candidates -> Gene Modeling
- InDel Error Detection and 'Correction'
- Splice Junction recognition
- Repetitive DNA recognition
- CpG Island recognition
- RNA polymerase II promoters (TATA, CAAT,
...)
- Developed from Grail 2.0 and Grail 3.0
- GAP III (Gene Assembly Program)
- Optimal and consistent gene model built from
exon models
- Other ORNL directions:
- Automatic - automated annotation of genomic
sequences
- Batch GRAIL - automated analysis during sequencing
project
- Interactive XGRAIL and genQuest
- Genome Channel
BCM
Gene Finder:
- Series of Gene Finding programs at Baylor
College of Medicine
- Associated with the BCM
Search Launcher set of programs
- Excellent variety of useful programs, with
Help facility
GeneMark: more info
here.
- Algorithm - inhomogeneous Markov chain models
- Now combined with Maximum Likelihood approach
to coding region
- General resource on Markov models: Durbin
et al book
- Used extensively in analyses of complete
bacterial genomes - data
Other Programs:
- Many, many available on the Web
- Links here
and here
and here
F. Sequence Analysis: finding
Motifs
Motifs:
- Motif - a recurrent thematic element
- Structural motifs - pieces of folded 3D structure
- Sequence motifs - conserved "blocks"
of sequences
DNA Motifs:
- Protein binding sites ... regulatory elements
- Relatively short
- Statistically difficult
- Cooperative binding often important
- Structural elements may be important - bends,
kinks
Protein Motifs:
- Secondary structure - alpha helices, beta
sheets
- Super secondary structure - 4 helix bundle,
TIM barrel, etc
Basic Methods:
- Consensus sequence - single, best sequence
- Regular Expression - multiple characters
per site
- Weight-Matrix - any character per site, with
score - Profile
- Hidden Markov Model
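A Weight-Matrix (Profile) can be sketched as a table of log-odds scores, one column per motif position, built from a set of aligned sites. The sites below are made-up examples, and the pseudocount, uniform background, and function names are assumptions for illustration.

```python
import math


def build_pwm(sites, alphabet="ACGT", pseudocount=1.0, background=None):
    """Log-odds weight matrix (one dict per position) from equal-length aligned sites."""
    if background is None:
        background = {ch: 1.0 / len(alphabet) for ch in alphabet}
    total = len(sites) + pseudocount * len(alphabet)
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        scores = {}
        for ch in alphabet:
            freq = (column.count(ch) + pseudocount) / total
            scores[ch] = math.log2(freq / background[ch])
        pwm.append(scores)
    return pwm


def consensus(pwm):
    """Single best character per position: the Consensus sequence."""
    return "".join(max(col, key=col.get) for col in pwm)


def scan(pwm, seq):
    """Slide the matrix along seq; return (score, start) for every window."""
    width = len(pwm)
    return [
        (sum(pwm[k][seq[start + k]] for k in range(width)), start)
        for start in range(len(seq) - width + 1)
    ]


# Hypothetical aligned binding sites, for illustration only.
sites = ["TATAAT", "TATAAT", "TACAAT", "TATGAT"]
pwm = build_pwm(sites)
print(consensus(pwm), max(scan(pwm, "GGCGTATAATGCC")))
```

The consensus keeps only the top character per column, while the matrix still gives partial credit to the C and G seen at positions 3 and 4 of the example sites.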
Protein Family Classifications
Prosite:
- Database of protein families and domains
- at ExPASy and elsewhere
- Regular expressions (Patterns) and Profiles
- Programs
- Search Prosite for Pattern or Profile
- ScanProsite - scan a sequence
against ProSite, or pattern against SwissProt
- ProfileScan - scan a sequence
against Profile Database
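Prosite Patterns are regular expressions written in their own notation, so scanning a sequence against a Pattern amounts to a regular expression search. The sketch below converts the common parts of the notation to a Python regular expression; the function names are assumptions, and the pattern used is written in the style of a C2H2 zinc finger pattern for illustration.

```python
import re


def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern to a Python regular expression.

    Handles: '-' separators, x (any residue), [ABC] (allowed), {ABC} (forbidden),
    (n) and (n,m) repeats, '<' (N-terminal anchor), '>' (C-terminal anchor).
    """
    regex = pattern.strip().rstrip(".")
    regex = regex.replace("-", "")
    regex = regex.replace("x", ".")
    # Forbidden-residue braces must be converted before repeat parentheses.
    regex = regex.replace("{", "[^").replace("}", "]")
    regex = regex.replace("(", "{").replace(")", "}")
    regex = regex.replace("<", "^").replace(">", "$")
    return regex


def scan_pattern(pattern, protein):
    """Return (start, matched_text) for each non-overlapping match."""
    compiled = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in compiled.finditer(protein)]


zinc_finger = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H."
print(prosite_to_regex(zinc_finger))
```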
BLOCKS:
- Multiply aligned ungapped segments of most
highly conserved protein regions
- BLOCKS Database
- Programs or Tools available - others also
available
DOMO:
- > 9000 multiple sequence alignments of
> 100,000 protein domains
- Example: APPLE domain - a C-rich domain
Pfam:
- Database of protein multiple sequence alignments
and Profile -HMMs (Hidden Markov Models) for common protein domains
- Programs
- Search Pfam
- Download Pfam databases via ftp
- Example: C2H2 family of
Zinc Fingers
Protein Structural Classifications
- Many efforts underway ...
- Links to these Motif Databases are present
in SwissProt entries
SCOP:
- 'Structural Classification Of Proteins'
- Heuristic classification based on crystallography
- Family: > 30 % identity, clear homologues
- Superfamily: low identity, probably homologues
- Fold - major secondary structures in same
arrangement
- Example: Cytochrome C
CATH:
- Systematic semi-automatic process, clearly
defined
- CATH: Class, Architecture, Topology, Homologous
superfamily
- Class - alpha, alpha/beta, etc
- Architecture - overall shape of the domain
- Topology - fold family
- Homologue - share a common ancestor
- Example: Elongation Factor
Ts, domain 2
G. Multiple Sequence Analysis
Basics
- Progressive Sequence Alignment
- Pairwise alignment of most similar, then
next most similar, etc
- Steps
- Do pairwise alignment for all sequences
- Get Matrix of approximate Distances between
each pair
- Create an approximate phylogenetic tree -
Guide Tree
- Use this to determine order of addition of
sequences to alignment
- Align: two sequences; seq to subalignment;
two subalignments
- Keep GAPS that appear early - 'Once a gap,
always a gap'
- Web sites
for Multiple Sequence Alignment
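The first two steps above (pairwise alignment of all sequences, then a matrix of approximate distances) can be sketched as follows. The scoring values, the way a score is turned into a distance, and the function names are illustrative assumptions; this is not the scheme used by any particular program.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def distance_matrix(seqs):
    """Approximate distance: 1 - score / (length of shorter sequence), floored at score 0."""
    n = len(seqs)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            score = global_alignment_score(seqs[i], seqs[j])
            shorter = min(len(seqs[i]), len(seqs[j]))
            dist[i][j] = dist[j][i] = 1.0 - max(score, 0) / shorter
    return dist


def closest_pair(dist):
    """Indices of the most similar pair: the first two sequences to align."""
    n = len(dist)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return min(pairs, key=lambda p: dist[p[0]][p[1]])


seqs = ["GATTACA", "GATACA", "CCCCCCC"]
d = distance_matrix(seqs)
print(closest_pair(d))
```

The distance matrix is then used to build the Guide Tree, which fixes the order in which sequences and subalignments are merged.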
Clustal W:
- Weighting - different weights given to unequally
sampled sequences
- Position Dependent Weights
- Position-Specific Gap Penalties (Opening
vs Extension)
- Sequence Weighting
- Weights for Adding New Sequences to existing
Alignment - extra weight to sequences most similar to alignment
- Clustal W Servers
Other Web Programs
Web Databases of Multiple Sequence
Alignments
- Fold Classification via Structure-Structure
Protein alignments (FSSP)
- Homology derived Secondary Structure Assignments
(HSSP)
- Database of Secondary Structure Assignments
(DSSP)
H. Phylogenetics
Basics:
- Trees - Rooted vs Unrooted
- Rooted Tree - position of Ancestor is known
- Unrooted Tree - no Ancestral Node
- Topology - Branching Pattern of the Tree
- Terminology
1, 2, 3, 4, 5: Taxa or External Nodes (or OTUs)
X, Y, Z: Internal Nodes
R1: Root
a, b, c, d, e: External Branches
f, g: Internal Branches
h: Internal Branch ONLY IF tree is Rooted;
else h is part of e
Outgroup: Taxon 5 ... used to "root trees"
Methods:
- Distance Matrix methods
- UPGMA - Unweighted Pair Group Method with Arithmetic Mean
- Fixed 'clock', averages used to get distances
- Fitch & Margoliash - 3 branches calculated
at a time
- Neighbor Joining - Pairs of taxa, finding
closest pair
- tree with smallest sum of Branch Lengths
- Other methods also available
- Parsimony methods
- Find tree with fewest inferred mutations
- Programs: PHYLIP package; PAUP
- Maximum Likelihood methods
- Use a mathematical model of process of evolution
- Model contains a parameter which is used
to Maximize the Likelihood that observed changes took place
- Example - Unrooted Tree of A. thaliana ST Phosphatases
w Homologues
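Of the Distance Matrix methods above, UPGMA is the easiest to sketch: repeatedly merge the closest pair of clusters and average their distances to all other clusters, weighted by cluster size. The tree representation (nested tuples of left, right, node height) and the function name are assumptions for the example, and the distances are made-up values.

```python
def upgma(names, matrix):
    """Build a rooted tree by UPGMA from a symmetric distance matrix.

    Returns nested tuples (left, right, height), where leaves are names and
    height is half the distance between the two merged clusters.
    """
    clusters = {i: (name, 1) for i, name in enumerate(names)}  # id -> (subtree, size)
    dist = {
        (i, j): matrix[i][j]
        for i in range(len(names))
        for j in range(i + 1, len(names))
    }
    next_id = len(names)
    while len(clusters) > 1:
        # Closest pair of current clusters.
        (a, b), d = min(dist.items(), key=lambda item: item[1])
        (tree_a, size_a), (tree_b, size_b) = clusters.pop(a), clusters.pop(b)
        # Size-weighted average distance from the new cluster to each remaining one.
        new_dist = {}
        for c in clusters:
            d_ac = dist[(min(a, c), max(a, c))]
            d_bc = dist[(min(b, c), max(b, c))]
            new_dist[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        dist = {pair: v for pair, v in dist.items() if a not in pair and b not in pair}
        dist.update(new_dist)
        clusters[next_id] = ((tree_a, tree_b, d / 2), size_a + size_b)
        next_id += 1
    return next(iter(clusters.values()))[0]


names = ["A", "B", "C", "D"]
matrix = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
print(upgma(names, matrix))
```

The node heights assume a fixed 'clock': every leaf below a node is placed at the same distance from it.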
Confidence - "How good
is the Tree?"
- Bootstrap - resampling (with replacement) of the
aligned sequence positions (alignment columns)
- How robust is the tree to such resampling?
always same tree?
- How much better is this "best"
tree than other trees?
- Use set of "User defined" Trees
... how good is each?
- PHYLIP programs
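A bootstrap replicate is commonly made by drawing alignment columns at random, with replacement, until a new alignment of the same length is obtained; the analysis is then repeated on each replicate. The sketch below asks a deliberately small question (how often is the same pair of taxa the closest pair?) instead of rebuilding a full tree; the function names, the use of simple p-distances, and the toy alignment are assumptions for illustration.

```python
import random
from itertools import combinations


def p_distance(a, b):
    """Fraction of aligned positions at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def closest_pair(names, alignment):
    """The pair of taxa with the smallest p-distance."""
    pairs = combinations(range(len(names)), 2)
    i, j = min(pairs, key=lambda p: p_distance(alignment[p[0]], alignment[p[1]]))
    return frozenset((names[i], names[j]))


def resample_columns(alignment, rng):
    """One bootstrap replicate: columns drawn with replacement."""
    n_cols = len(alignment[0])
    cols = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(seq[c] for c in cols) for seq in alignment]


def bootstrap_support(names, alignment, replicates=200, seed=1):
    """Fraction of replicates that recover the closest pair seen in the original data."""
    rng = random.Random(seed)
    observed = closest_pair(names, alignment)
    hits = sum(
        closest_pair(names, resample_columns(alignment, rng)) == observed
        for _ in range(replicates)
    )
    return observed, hits / replicates


taxa = ["A", "B", "C"]
toy_alignment = ["ACGTACGT", "ACGTACGT", "TGCATGCA"]
print(bootstrap_support(taxa, toy_alignment))
```

With a real tree-building method in place of `closest_pair`, the same loop gives the percentage of replicates supporting each internal branch.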
III. Whole Genomes
A. Implications
- TOTAL information on Heritable Properties
of an Organism
- What an Organism CAN do ... and CAN NOT do
...
- Major step toward Understanding an Organism
and toward making Biology a PREDICTIVE SCIENCE
- Current: identify Genes, predict Function
- Next:
- Deduce Life Style of the Organism
- Predict Metabolic and Genetic Pathways
- Predict Adaptive Responses, Developmental
Pathways
- These Implications implicitly lead to ORGANISM
DATABASES
B. TIGR - The Institute for Genomic Research
- First to Sequence whole Genome of Free-living
Organism
- Sequenced the first Three Eubacteria and
First Two Archaea
- TIGR Database (TDB)
- links to specific organisms
- TIGR Microbial Database:
- Example: Methanococcus
jannaschii - first archaeon
- Search facilities - Locus, Text, Sequence
- Download facilities
- Genome Browser for M. jannaschii
- TIGR Gene Indices (TGI)
- Analyses of ESTs from sequencing projects
- Human, Mouse, Rat, Fly, Rice, Arabidopsis,
Zebrafish, ...
- Comprehensive Microbial Resource (CMR)
- Access all bacterial genome sequences completed
to date
- Database approach to Genome Annotation (information)
- Other facilities: software, links, ftp site
IV. Organism and Other Databases
A. Need for Organism Databases
- Direct result of Genome Physical Mapping
efforts
- Need for Maps, Genes, Sequences, References
- Incomplete Genome Information plus other
Information
- NOW: Complete Genome Information
B. Paradigm: ACeDB
- ACeDB - A C. elegans Data Base
- Created by Durbin and Thierry-Mieg for Sulston
R.Mapping Program
- Over 40 organisms represented in ACeDB databases
- Highly variable Types of Information in each
- Examples: C. elegans, yeast, fly, grains,
Arabidopsis, human chroms
C. Advantages of ACeDB for
Organism DBs
- Schema: highly adaptable, easily changed
and modifiable AFTER the database has been built.
- Data Transfer: Text .ace files, easy to completely
rebuild a database, easy to transfer updates over the Web, maintain
at ftp sites.
- Good Graphics: maps, sequences, tables, grids,
pathways.
- Easy to use via Browsing Approach: Hypertext
- And also has: Sophisticated SQL Search capabilities
- Contains some Analysis Tools: GeneFinder,
Dotter, Blixem, Metabolic Pathways, Sequence Assembly.
- Can link to other Tools: RasMol, Pictures,
Video
D. ACeDB on Web: WebAce and
AceBrowser
1. ACeDB - Disadvantages
- Not truly relational - XREF
only for crosslinking
- Data input requires .ace
file construction, with correct XREFs
- Largely a Monolithic system:
implemented on local computers
- Issues of "scalability",
although works well with the worm C. elegans
- Designed for Unix ... MacAce
and WinAce less fully implemented
2. However: WebAce and AceBrowser
- Two current Web implementations:
- WebAce - from NAL and Durbin
at Cambridge
- AceBrowser - from L. Stein
at CSH
- Links to Genbank, SwissProt,
PubMed ... federated DBs
- Automated Link and Submission
to other Tools, eg BLAST
- JavaScript version is rapid
E. Example of ACeDB: DictyDB
1. MacAce: Graphics, Text
Displays
- Data Items in Released Versions of DictyDB
here
- MacAce Main Window and Dicty_cDB cDNA Clones
here
- Dicty_cDB cDNA Clone Groups here
- Dicty_cDB cDNA Seqs as SubSeqs; Homologue
info here
- Sequence and Genetic Map Displays - mitoDNA
here
- cox1/2 mito gene -Sequence and Text Displays
here
2. Web ACeDB: WebAce and AceBrowser
at Cornell
- WebAce2: Main Window here
- WebAce2: Simple Searches
here
- WebAce2: Text Displays and
Graphics Configuration here
- WebAce2: Genetic Map and
Locus Text Displays here
- WebAce2: Sequence Graphics
and Text Displays here
- AceBrowser: Main Window here
- AceBrowser: Simple and "In
Depth" Searches here
- AceBrowser: Reference Text
and Enhanced Annotation here
- AceBrowser: Genetic Map and
Locus Text Displays here
- AceBrowser: Sequence Graphics
and Text Displays here
- Large DNA Sequence - 700
Mb C.elegans Sequence here
- AceBrowser Data Enhancements - Metabolic
Pathways here
F. Other Web Organism Databases
- Many are available
- Saccharomyces Genome Database (SGD)
- Basic database is Web enhancement over ACeDB
SacchDB
- Excellent interface to yeast genome maps:
Genomic
View
- Many resources including analysis tools
- BLAST and FASTA facilities
- SacchDB extended to include
- Genome Deletion Project
- Yeast Evolution Project
- Sacch3D - protein 3D structure information
- Worm and Mammalian Homology to Yeast
- Yeast SAGE data
- The Arabidopsis Information Resource (TAIR): Arabidopsis thaliana
- Database based on Oracle relational database
system
- Much underlying information from ACeDB AatDB
- Analysis tools and Viewers, including BLAST
and FASTA
- Arabidopsis Genome Initiative (AGI)
- PlantsP: Plant Phosphorylation Proteins (kinases, phosphatases)
- underlying MySQL database
- display and usage is Web based
- many other resources, links, download, etc
- Berkeley Drosophila Genome Project (BDGP)
- Outgrowth of Encyclopedia of Drosophila (EofD)
- Excellent Map
Viewers - largely Java applets
- Includes FlyBase, ACeDB database of Drosophila
- Mouse Genome Informatics (MGI)
- Integrated access to mouse genetics and biology
- Mouse Genome Database (MGD)
- Mouse Gene Expression Database (GXD)
- Encyclopedia of the Mouse Genome
- links to
- Mouse Tumor Biology database
- Rat Data resource
- Human Genome Resources
at NCBI
- Information and links to Human Genome Project
- Human Genes
- OMIM - Online Mendelian Inheritance in Man
- McKusick catalog of human genes and disorders
- Over 10,000 entries
- LocusLink - single interface to all human locus info
- Human/Mouse
Homology Relationships
- Examples of Info on Candidate
Human Genes for Hypertension
V. Problems ... Directions
to Go
A. Problems:
- Sequence DBs and Others are Flat File Database
- one piece of information at a time
- Analysis Tools are largely Single Task oriented
- from Task to Task, User must make Decisions
- Automate Basic Analytical Tasks for new DNA
Sequences
- This is now done currently in some facilities
and in some expensive commercial packages
- Examples: Pangea, Incyte
B. Need: "smart"
Analysis Packages
- Need "smart" Analysis Packages
that can "learn" from DB info
- Such "smart" Analysis Packages
can "predict" next best options for User
- Analysis: DNA seq --> gene --> protein
--> motifs --> 3D structure
- Automate this as completely as possible
- "smart" Analysis Packages -->
deduce function
- Extend Analysis to Metabolic and Genetic
Pathways
C. "Training" of
"smart" Analysis Packages:
- Use Current Information: Sequence, Protein
Structure, sites, motifs
- Use Physical and Genetic Maps, Pathway information
- Use new Information: Expression Arrays, Polymorphisms,
Pathways
- Also need Additional Information:
- Kinetic information - order of formation
of complexes
- Concentration info - key control molecule
concentrations
- Integrate Sequence DBs, Analysis Tools, Organism
DBs
- Combine with "smart" Algorithms
that can "learn" from the DB info to provide User with
Options, new Information
- => High Predictive Value - ask "what
if" questions
- Basic Problem with Biology becoming a "Predictive
Science"
- Large number of Different Molecules, eg Proteins
- Large Variety per Cell
- Variety Changes with Type of Cell in Organism
- Often a Small number of Each Molecule
- Thus: Statistical Analysis is often not Appropriate
VI. Additional Materials:
A. Books:
43 hits to "books" and "bioinformatics"
at "amazon.com" in March, 2002 ...
1. "Bioinformatics: Sequence and Genome
Analysis." David W. Mount. Cold Spring Harbor Press, 2001.
Recent, authoritative, most emphasis on Sequence
Analysis, some on genomics and organismal databases.
2. "Bioinformatics : A Practical Guide
to Analysis of Genes and Proteins". Second Edition. Ed.,
Andreas Baxevanis and B.F.Francis Ouellette. John Wiley, 2000.
Recent text emphasizing how to use Web ...
database searches ... software tools available.
3. "Bioinformatics Basics: Applications
in Biological Science and Medicine." Hooman Rashidi and Lukas
Buehler. 1999.
Recent text providing introduction to many
topics, by two UCSD Biology personnel.
4. "Computational Molecular Biology: An
Algorithmic Approach." Pavel A. Pevzner. 1999.
Advanced computational approach to algorithms
used in bioinformatics and molecular biology, by UCSD CSE professor.
5. "Biological Sequence Analysis."
R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Cambridge University
Press, 1998.
Recent text emphasizing automata and hidden
Markov model (HMM) approaches
6. "Bioinformatics: The Machine Learning
Approach." Pierre Baldi and Soren Brunak. MIT Press, 1998.
Another recent text emphasizing automata and
HMMs
7. "Sequence Analysis Primer." Ed.,
Michael Gribskov and John Devereux. Oxford University Press, 1992.
Bit dated, but still good ... GCG program emphasis
... detailed sequence analysis example ... Gribskov now at SDSC.
New Bioinformatics books are now coming out
regularly
Check at Amazon.com under "Books" and "bioinformatics"
for latest ... (amazon.com lists 5 books about to come out ...)
Others:
"Intro to Computational Molecular Biology."
Joao Meidanis and Joao Carlos Setubal. PWS Publishing, Boston,
1997.
"Computational Methods in Molecular Biology."
Ed., S. Salzberg, D. Searls, and S. Kasif. Elsevier Science, 1999.
"Intro to Computational Biology: Maps,
Sequences, and Genomes." Michael S. Waterman. Chapman and
Hall, 1997. Math and algorithm intensive.
"Algorithms on Strings, Trees, and Sequences:
Computer Science and Computational Biology." Dan Gusfield.
Cambridge Univ Press, 1997. Definitive algorithm text ... very
math intensive.
"The Secrets of Life: A Mathematician's
Introduction to Molecular Biology." Ed., Eric S. Lander and
Michael S. Waterman. Natl Acad Sci Press, 1998. Very readable
...
"DNA Sequencing : From Experimental Methods
to Bioinformatics." Luke Alphey. Springer Verlag, 1997.
"Computer Methods for Macromolecular Sequence
Analysis." Ed., Russell F. Doolittle. Methods in Enzymology,
Vol 266. 1996.
"Computer Analysis of Sequence Data",
parts I and II. Ed, Annette M. Griffin and Hugh G. Griffin. Humana
Press, 1994.
"Biocomputing: Bioinformatics and Genome
Projects." Ed, Douglas W. Smith. Academic Press, 1993.
"Molecular Evolution: Computer Analysis
of Protein and Nucleic Acid Sequences." Ed., Russell F. Doolittle.
Methods in Enzymology, Vol 183. 1990.
"Of URFs and ORFs: A Primer on How to
Analyze Derived Amino Acid Sequences." Russell F. Doolittle.
University Science Books. 1986.
B. Selected Recent Review
Articles on Bioinformatics:
"Molecular Biologist's Guide to Proteomics."
Graves, PR, and Haystead, TA. 2002. Microbiol Mol Biol Rev 66:
39-63. [PubMed]
"A Genomic Regulatory Network for Development."
Davidson, EH, et al (26 authors). 2002. Science 295:1669-1678.
[PubMed]
"Systems Biology: A Brief Overview."
Kitano, H. 2002. Science 295: 1662-1664. [PubMed]
"Biological data becomes computer literate:
new advances in bioinformatics." Goodman, N. 2002. Curr Opin
Biotechnol 13: 68-71. [PubMed]
"Insights into Protein Function through
Large-Scale Computational Analysis of Sequence and Structure."
Weir, M., Swindells, M., and Overington, J. 2001. Trends Biotechnol
19(10Suppl): S1-S6. [PubMed]
"Exploring the Protein Interactome using
Comprehensive Two-Hybrid Projects." Ito, T., Chiba, T., and
Yoshida, M. 2001. Trends Biotechnol 19(10Suppl): S23-S27.
[PubMed]
"A Genomic View of Alternative Splicing."
Modrek, B., and Lee, C. 2001. Nat Genet 30: 13-19. [PubMed]
"Recent Advances in Computational Genomics."
Claverie, JM., Abergel, C., Audic, S., and Ogata, H. 2001. Pharmacogenomics
2: 361-372. [PubMed]
"The Impact of Microbial Genomics on Antimicrobial
Drug Development." Tang, CM., and Moxon, ER. 2001. Annu Rev
Genomics Hum Genet 2: 259-269. [PubMed]
"Bioinformatics Tools for Whole Genomes."
Searls, DB. 2000. Annu Rev Genomics Hum Genet 1: 251-279.
[PubMed]
"Of Mice and Genome Sequence." Hamilton,
BA, and Frankel, WN. 2001. Cell 107: 13-16. [PubMed]
"A Tour of Structural Genomics."
Brenner, SE. 2001. Nat Rev Genet 2: 801-809. [PubMed]
"Gene Expression Data Analysis."
Brazma, A., and Vilo, J. 2001. Microbes Infect 3: 823-829.
[PubMed]
"Analysing Gene Expression Data from DNA
Microarrays to Identify Candidate Genes." Wu, TD. 2001. J
Pathol 195: 53-65. [PubMed]
"What is Bioinformatics? A Proposed Definition
and Overview of the Field." Luscombe, NM., Greenbaum, D.,
and Gerstein, M. 2001. Methods Inf Med 40: 346-358. [PubMed]
"Sequencing the entire genomes of free-living
organisms: the foundation of pharmacology in the new millennium."
Broder, S, and Venter, JC. 2000. Annu Rev Pharmacol Toxicol 40:
97-132. [PubMed]
"Protein function in the post-genomic
era." Eisenberg, D, Marcotte, EM, Xenarios, I, and Yeates,
TO. 2000. Nature 405: 823-826. [PubMed]
"Who's your neighbor? New computational
approaches for functional genomics." Galperin, MY, and Koonin,
EV. 2000. Nature Biotech 18: 609-613. [PubMed]
"Structural genomics: beyond the human
genome project." Burley SK, et al. 1999. Nat Genet. 23:
151-157. [PubMed]
"Computational methods for the identification
of differential and coordinated gene expression." Claverie
JM. 1999. Hum Mol Genet. 8: 1821-1832. [PubMed]
"Multiple sequence alignment: algorithms
and applications." Gotoh O. 1999. Adv Biophys. 36:
159-206. [PubMed]
"Mapping regulatory networks in microbial
cells." VanBogelen RA, et al. 1999. Trends Microbiol. 7:
320-328. [PubMed]
"How will bioinformatics influence metabolic
engineering?" Edwards JS and Palsson B. 1998. Biotechnol
Bioeng. 58: 162-169. [PubMed]
Bernhard Palsson group in Bioengineering.
"Computational aspects
of expression data." Vingron M, et al. 1999. J Mol Med. 77:
3-7. [PubMed]
"Functional genomics:
going forwards from the databases." Rastan S and Beeley LJ.
1997. Curr Opin Genet Dev. 7: 777-783. [PubMed]
"Informatics--genome and
genetic databases." Ashburner M and Goodman N. 1997. Curr
Opin Genet Dev. 7:750-756. [PubMed]
"Bioinformatics: from genome data to biological
knowledge." Andrade MA and Sander C. 1997. Curr Opin Biotechnol.
8: 675-683. [PubMed]
"Functional Genomics - Bioinformatics
is Ready for the Challenge." T.F. Smith. 1998. Trends Genet.
14: 291-293. [PubMed]
"Bioinformatics in a post-genomics age."
Gershon et al. 1997. Nature 389: 417-422. [PubMed]
"Bioinformatics." Boguski MS. 1994.
Curr Opin Genet Dev. 4: 383-388. [PubMed]
... and many more ...