|
Home
Home Page for the UCSD SNP Discovery programs
PPG
Program
PPG program aims, organization,
projects, personnel, flow
chart
PPG
Program Data
Public Data and UCSD Account Data
PPG
SNP Data
Excel Workbooks, SNP ID Names, Public Domain and Proprietary Data
Human
Chromosomes
Human Genome info
Candidate
Loci
Mouse
Chromosomes
Mouse Genome info
Cand
Loci Homologues
SNPs
and Haplotypes
Single Nucleotide Polymorphism
and Haplotype info
Usage
Usage and help for Program software
Biological
Protocols
Protocols used in laboratory
research by the PPG program
|
Recent
Changes
31 Oct 2006: Complete Renewal and Update of SNP ID Names
SNP ID Names have now been assigned to all SNPs known to be of interest
to Hypertension personnel as of May, 2006, including those analysed by
Sequenom, UCSD resequencing, TCGA, and Weber. Names are provided based
on CAP site (transcription start), on AUG site (translation start), or
on Mature Peptide start site for nonCDS SNPs, or on Protein or Mature
Peptide protein start sites for CDS SNPs. Locus information provided includes
four coordinate systems (Locus start site; 5000 bp upstream of locus start
site; NT start site, i.e. NCBI SNP coordinates; and NC start site, with comparison
to UCSC chromosomal coordinates), structure coordinates, locus cartoon,
and summary of locus function. Complete data are provided for each transcriptional
Variant and protein Isoform having data at NCBI. Flanking sequence is
provided for all SNPs. For two loci of interest (CHGA and TH), complete
data are also provided in spreadsheet "SNPnames-dbSNPall" for
all SNPs present in dbSNP at NCBI as of Sept, 2006, including 5000 bp
into the promoter (upstream) and 2000 bp 3' (downstream) of the locus.
TH with its 3 transcriptional variants is used as an example for comparison
of SNP ID names between the 3 variants. More information is available
here.
7 Sept 2005: CHI-square Analysis of Select Subgroup Pairs
CHI-square analyses of Sequenom data for pairs of subgroups of the nnnn.Unrelated population are now available. CHI-sq values were calculated in two ways, with identical results. The pairs of subgroups are:
- Ethnicity: White vs Black
- Gender: Male vs Female
- Blood Pressure Status: Normotensive vs Hypertensive
- Normotensive and Hypertensive subgroups of the Black, White, Male, and Female subgroups
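As an illustration of the subgroup-pair comparison, the following is a minimal sketch of a Pearson CHI-square calculation on genotype counts for two subgroups. The page does not state which two calculation methods were used, so this is only one plausible way to do it; the function name and the example counts are hypothetical.

```python
def chi_square_2xk(counts_a, counts_b):
    """Pearson chi-square for a 2 x k table of genotype counts.

    counts_a, counts_b: genotype counts for two subgroups, in the same
    category order (for example [AA, AB, BB]).
    Returns (chi2, degrees_of_freedom).
    """
    if len(counts_a) != len(counts_b):
        raise ValueError("both subgroups need the same genotype categories")
    # Drop categories that are empty in both subgroups; they add nothing.
    cols = [(a, b) for a, b in zip(counts_a, counts_b) if a + b > 0]
    total_a = sum(a for a, _ in cols)
    total_b = sum(b for _, b in cols)
    if total_a == 0 or total_b == 0:
        raise ValueError("each subgroup needs at least one genotype call")
    grand = total_a + total_b
    chi2 = 0.0
    for a, b in cols:
        col_total = a + b
        for observed, row_total in ((a, total_a), (b, total_b)):
            expected = row_total * col_total / grand
            chi2 += (observed - expected) ** 2 / expected
    return chi2, len(cols) - 1


# Hypothetical genotype counts [AA, AB, BB] for one SNP in two subgroups
normotensive = [30, 15, 5]
hypertensive = [20, 20, 10]
chi2, dof = chi_square_2xk(normotensive, hypertensive)
print(f"CHI-square = {chi2:.3f} with {dof} degrees of freedom")
```

The same function applies to any of the pairs listed above (White vs Black, Male vs Female, and so on), given the genotype counts of each subgroup.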
13 July 2005: Biological Protocols of the PPG Program
Biological or laboratory protocols or "standard operating procedures"
(SOPs) have been documented and recorded here, to assist in maintaining
a consistent and stable operating environment.
3 June 2005: Subset Analyses of Sequenom Resequencing Data
We have begun analyses of subgroups of the resequencing data
received from Sequenom. Initial analyses are for the nnnn.Unrelated population,
and include data analysis of the following subgroups of the total data
(data received Nov, 2003, Jan, 2004, and Feb, 2004):
- All data
- Ethnicity: White, Black, Hispanic, Asian
- Gender: Male, Female
- Blood Pressure Status: Normotensive, Hypertensive
- Normotensive and Hypertensive subgroups of each of the 4 Ethnicity subgroups
- Normotensive and Hypertensive subgroups of each of the 2 Gender subgroups
Results for each subgroup are presented in Excel worksheets having the
same columnar format as developed for analysis of the UCSD
Data Workbooks described below. In addition to these worksheets of
complete information, worksheets containing major comparative data columns
from each subgroup are developed. These worksheets present 8 columns of
data for each subgroup in side-by-side style, permitting relatively easy
direct comparison between different subgroups; for example, comparison
of SNP data for black normotensives vs black hypertensives, or for white
hypertensives vs black hypertensives. These types of subgroup analyses
can be readily extended to other populations and to data obtained from
other sources. Data from different sources can also be combined, and new
subgroups can be analysed. Note that these subgroups are of zero order
(all data), first order (ethnicity, gender, BP status), and second order
(Ethnicity-BP, Gender-BP); others can be developed as desired. For more
information see Subgroup Data Workbooks.
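The zero-, first-, and second-order subgroups described above can be sketched as a grouping of individuals by zero, one, or two attributes. This is a minimal sketch, not the workbook method itself: the record layout, field names, and example data below are assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical per-individual records for one SNP; field names are illustrative.
records = [
    {"ethnicity": "White", "gender": "Male", "bp": "Normotensive", "genotype": "AA"},
    {"ethnicity": "White", "gender": "Female", "bp": "Hypertensive", "genotype": "AG"},
    {"ethnicity": "Black", "gender": "Male", "bp": "Hypertensive", "genotype": "AG"},
    {"ethnicity": "Black", "gender": "Female", "bp": "Normotensive", "genotype": "GG"},
    {"ethnicity": "Black", "gender": "Male", "bp": "Hypertensive", "genotype": "AA"},
    {"ethnicity": "White", "gender": "Male", "bp": "Normotensive", "genotype": None},  # noCall
]


def genotype_counts(records, keys=()):
    """Count genotypes within each subgroup defined by the given keys.

    keys=() gives the zero-order subgroup (all data), one key gives
    first-order subgroups, two keys give second-order subgroups.
    """
    counts = defaultdict(Counter)
    for rec in records:
        if rec["genotype"] is None:  # skip individuals without a genotype call
            continue
        label = tuple(rec[k] for k in keys) or ("All",)
        counts[label][rec["genotype"]] += 1
    return dict(counts)


zero_order = genotype_counts(records)
first_order = genotype_counts(records, ("ethnicity",))
second_order = genotype_counts(records, ("ethnicity", "bp"))

for label, counts in sorted(second_order.items()):
    print("-".join(label), dict(counts))
```

Adding a third key would give third-order subgroups in the same way, although subgroup sizes shrink quickly as the order increases.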
|
Recent
Changes
Changes in Web Site Data
DNASYSTEM
Web pages with Links to other Bioinformatic Sites
|
10 May 2005: Presentation to Nik Schork Group on "Primary Analysis of new SNP Genotype Data"
A PowerPoint presentation was given to the Nik Schork group in
May, 2005, on the rationale, analysis methodology, quality control (QC)
methodology, format of resulting SNP data analysis workbooks, and usage
of the analysis and data files available for downloading from this Web
site. The presentation emphasized the following topics: standard format
of raw genotype data, a format independent of sequencing source; format
of the analysis worksheets (one row per SNP, issues of unique SNP ID name);
QC analyses; additional data present in analysis worksheets; types of
SNP ID names; analysis of subsets or subgroups, eg specific gender or
ethnic group, from within a total population. More information is available
here.
4 May 2005: Analysis of TCGA Resequencing Data
To determine the ease of using TCGA as an outsourcing center
for generation of resequencing data, data were obtained for the CHRNA5
locus from the Tnnn.Twin population. These data were converted into a
facsimile of the format obtained from Sequenom and analysed. For more
information see TCGA Data Workbooks.
24 Apr 2005: Further Analyses of UCSD Resequencing Data
"Step 1" analysis of resequencing or genotyping data
from 7 loci (ADCY6, ADORA2B, DBH, RGS2, RGS4, SNX13, SNX14) is now available.
"Step 1" analysis is analysis of the SNP genotype data but not
necessarily with position-determined SNP ID Names assigned. In lieu of
such names, a SNP Number SNP ID name is used; this name is based on 1)
Locus, 2) Source or Sequencer name, and 3) chronological Number of the
SNP. Design of this name permits sorting on this name in Excel files.
Analysis includes raw data in the SNPdata spreadsheet and analysis in
the SNPseqSumAlphaQC spreadsheets. Data are presented for the nnnn, Tnnn,
and Hnnn populations, in separate spreadsheets. Data analysis includes
Quality Control (QC) determinations of CHI-square analysis of Hardy-Weinberg
Equilibrium as well as a second statistic suggested by Miguel Robinson.
The downloadable single Excel workbook includes reanalysis and QC analysis
of the data on the 10 loci made available in Dec, 2004. More information
is available under UCSD
Data Workbooks, including detailed description of the spreadsheets
and columns of information.
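The SNP Number SNP ID name described above combines locus, source, and a chronological number so that it sorts correctly in Excel. The exact name format is not given here, so the following is only a sketch under an assumed layout (`LOCUS_SOURCE_NNNN`, with a zero-padded number); the separator, padding width, and function name are hypothetical.

```python
def snp_number_name(locus, source, number, width=4):
    """Build a sortable placeholder SNP ID name.

    Assumed layout: LOCUS_SOURCE_NNNN. Zero-padding the number makes plain
    text sorting agree with chronological order (0002 sorts before 0012).
    """
    if number < 1:
        raise ValueError("SNP numbers are assumed to start at 1")
    return f"{locus}_{source}_{number:0{width}d}"


names = [snp_number_name("RGS2", "UCSD", n) for n in (12, 2, 1)]
names.append(snp_number_name("DBH", "UCSD", 3))
for name in sorted(names):
    print(name)
```

Without the padding, a text sort would place `RGS2_UCSD_12` before `RGS2_UCSD_2`, which is the problem a sortable name design has to avoid.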
24 Apr 2005: Analysis of Population Subsets: Gender, Ethnicity, Age, BP status, Fam Hist
As a test case, genotype data analysis has been done, including
QC analysis of HWE, for subsets of the RGS2 data generated by Kenton (13
SNPs on 79 nnnn.Unrelateds). The data subsets include: All data; Gender
(Male vs Female); Ethnicity (White, Black, Hispanic, Filipino, Asian);
Age (ages 19-29, 30-49, 50-69, 70-82); Blood Pressure (Normotensive vs
Hypertensive); Family History: have or do not have (control data: genotype
should be independent of whether the individual has a family history or not). Excel
Workbook format is the same as for UCSD resequencing data. More information
is available here.
11 Jan 2005: Update and Extension of Locus-specific Sequence Files
The Locus-specific Sequence Files have been updated and extended
to include old and new versions of 3 types of files, two of which are
GenBank-formatted, to provide nearly all of NCBI information about a given
locus, and one of which is FASTA-formatted, to provide sequence for analysis
purposes. The reference position differs in the two GenBank-formatted
files; one has position 5001 at the CAP site, and the other has position
5001 at the AUG site (if the AUG site is distant from the CAP site, this
latter position can be 10001, 50001, etc). Thus, it is now easy to determine,
for example, position of any dbSNP SNP relative to either CAP or AUG sites
for any Locus of interest. The "old" versions are retained because
NCBI continues to update the RefSeq genomic NT sequences, the basis for
these Sequence files, with each new Genome Assembly, which can be as often
as every 3 months. Thus, one can compare "old" with "new",
eg via BLAST2SEQ, to see where changes have occurred in a given locus.
Further, info is included about "unusual" loci, to alert the
user to multiple, eg alternatively spliced, mRNA species and to presence
of different protein isoforms; information on all of these mRNA and protein
species is inherently part of the NCBI GenBank annotation. More information
is provided under Locus-specific
Sequence Files.
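Because the reference site sits at a fixed position (5001, or 10001, 50001, etc.) in each GenBank-formatted file, a position relative to the CAP or AUG site follows by subtraction. The sketch below assumes that the reference base itself is numbered 0 and that upstream positions are negative; the numbering convention actually used for SNP ID Names may differ, and the function name is hypothetical.

```python
def relative_position(file_position, reference_position=5001):
    """Offset of a file position from the reference site (CAP or AUG).

    Assumed convention: the reference base is 0, upstream is negative,
    downstream is positive.
    """
    if file_position < 1:
        raise ValueError("file positions are 1-based")
    return file_position - reference_position


print(relative_position(4801))                               # -200 (200 bp upstream)
print(relative_position(5001))                               # 0 (the reference site)
print(relative_position(10250, reference_position=10001))    # 249 (downstream)
```

Running the same position against the CAP-referenced and AUG-referenced files gives both offsets for a given dbSNP SNP.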
15 Dec 2004: Initial Analysis of UCSD Resequencing Data
Two Excel workbooks are now available under UCSD
Data Workbooks containing raw data and initial analyses of ten loci
of the approximately 25 human loci for which resequencing has been done under this
program. The analysis is patterned after that for the Sequenom data, and
attempts to present the raw data and the analyses in formats that preserve
a consistent look and feel independent of resequencing source. Two files
are available, one for nnnn.Random population data (loci CHGA, CHGB,
TH, KCNMB1, NPY2R, PMX2B, PNMT, PYY, and SCG2) and one for TH Tnnn.Twin
population data. In preparing these data, additional SNPs have been added
to the SNP ID Name Excel workbook, and standards are being developed for
the handling of Loci which encode multiple isoforms. The TH locus, with
its three isoforms (a, b, c) is being used as an initial test case, as
documented under UCSD
SNP ID Names.
8 Dec 2004: Enhancements to SNPnames.xls file
Additional columns were added to Summary Information, for Locus
Name, SNP Posn relative to CAP site (mRNA start site), NCBI NT sequence
name and SNP posn relative to this NT sequence. The Locus and SNP posn
data permit sorting on Locus and Locus posn to yield progressive positions
of SNPs within a given locus. The NT data permit correlation of SNP and
posn of SNPs here with those documented at NCBI dbSNP.
Additional columns were added for Sequencing Source information, to permit
adding SNPs found by UCSD investigators and other sources (TCGA, UCLAS,
Harvard, etc) to those found by Sequenom. Finally, a column was added
to indicate how the SNP position was determined (by SNP sequencing person
or de novo from flanking sequence via NCBI BLAST2SEQ determinations).
3 Oct 2004: Enhancements to Sequenom SNP Analysis Workbooks
The following enhancements were made to the Sequenom SNP Population-specific
Excel workbooks: 1) all Sequenom data (from Nov, 2003, Jan, 2004, and
Feb, 2004) are present in one workbook; 2) three new columns for Locus
information were added adjacent to the SNP ID Name column; 3) groupings
of columns containing similar types of data were improved. The three new
columns contain 1) locus name; 2) SNP locus position relative to CAP site;
and 3) SNP position relative to locus "item" (exon, intron,
promoter, etc). These new columns enhance sorting operations and visualization
of SNP locus position directly.
15 Sept 2004: Addition of Quality Control Methodologies
Three methodologies for Quality Control were added to the data analysis
Excel workbooks, namely: 1) CHI-square and P-value analysis of HWE values;
2) deviation and RMS analysis of <Het> / SQRT (<Hom1> * <Hom2>)
from 2.00; and 3) error rates found in repeat data, i.e. data from repeated
SNP analysis of the same individuals. "Grades" were assigned to
each of these QC analyses, complementing the Ambiguity
analysis.
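The three QC quantities named above can be sketched as follows. Under Hardy-Weinberg Equilibrium the expected counts are p²N, 2pqN, and q²N, so <Het> / SQRT (<Hom1> * <Hom2>) equals exactly 2.00, which is why deviation from 2.00 serves as a check. This is a minimal sketch under assumptions: the function names are hypothetical, the P-value uses 1 degree of freedom, and the way "Grades" were derived from these values is not documented here and is not reproduced.

```python
import math


def hwe_chi_square(n_hom1, n_het, n_hom2):
    """Chi-square and P-value (1 degree of freedom) for deviation from HWE."""
    n = n_hom1 + n_het + n_hom2
    if n == 0:
        raise ValueError("no genotype calls")
    p = (2 * n_hom1 + n_het) / (2 * n)  # frequency of allele 1
    q = 1.0 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_hom1, n_het, n_hom2)
    # Classes with zero expected count (monomorphic SNPs) are skipped.
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of the chi-square distribution with 1 df
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


def het_ratio(n_hom1, n_het, n_hom2):
    """<Het> / sqrt(<Hom1> * <Hom2>); 2.00 under HWE, None if undefined."""
    if n_hom1 == 0 or n_hom2 == 0:
        return None
    return n_het / math.sqrt(n_hom1 * n_hom2)


def repeat_error_rate(first_calls, repeat_calls):
    """Fraction of discordant calls among individuals called in both runs."""
    pairs = [(a, b) for a, b in zip(first_calls, repeat_calls)
             if a is not None and b is not None]
    if not pairs:
        return None
    return sum(a != b for a, b in pairs) / len(pairs)


# Hypothetical genotype counts (Hom1, Het, Hom2) for one SNP
chi2, p_value = hwe_chi_square(40, 20, 40)
print(f"CHI-square = {chi2:.2f}, P = {p_value:.2e}")
print(f"Het ratio  = {het_ratio(40, 20, 40):.2f} (2.00 expected under HWE)")
```

A SNP with counts 25, 50, 25 gives a CHI-square of 0 and a ratio of exactly 2.00, while the heterozygote-deficient example above deviates strongly on both measures.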
23 May 2004: Clarification of "Duplicate" Data
Some data from SNPs initially assigned different SNP ID Names, and
hence thought to be duplicate data, were in fact data from different individuals
in the population. These issues are now corrected, and data reanalysed.
See Introduction spreadsheets of any Data
Analysis Excel workbook.
18 May 2004: More Summary Info on Additional Sequenom Info
Additional timeline date information was added, together with Sequenom
plate number info, to the data analysis sheets.
12 May 2004: Coalescence of Duplicate Entries plus More Seq Info
Duplicate SNP entries and redundancies in SNP ID Names were removed
and/or coalesced. New spreadsheets containing all returned data, with
Allele 1 defined alphabetically, were created for each population. Timeline
dates are now included in the data analysis files.
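Defining Allele 1 alphabetically gives every SNP a source-independent allele order, so that "TC" and "CT" calls count as the same genotype. The following is a minimal sketch, assuming genotype calls are stored as two-letter strings with `None` for a noCall; that representation and the function names are assumptions, not the format of the spreadsheets.

```python
def order_alleles(genotype):
    """Return a genotype call with its alleles in alphabetical order."""
    if genotype is None:  # noCall stays a noCall
        return None
    return "".join(sorted(genotype.upper()))


def observed_alleles(genotypes):
    """Alleles seen for a SNP, alphabetically; the first one is Allele 1."""
    return sorted({allele for g in genotypes if g for allele in g.upper()})


# Hypothetical returned calls for one SNP
calls = ["TC", "CC", "CT", None, "TT"]
normalized = [order_alleles(g) for g in calls]
print(normalized)                 # ['CT', 'CC', 'CT', None, 'TT']
print(observed_alleles(calls))    # ['C', 'T']
```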
6 May 2004: Standardized SNP ID Names plus dbSNP rs Names
Determination of "Standardized" SNP ID Names, based on BLAST
analyses of SNP flanking sequences, for all Sequenom analysed SNPs was
completed. These determinations provide considerable locus structure information
about each SNP. In addition, the data analysis Excel workbooks now contain
these SNP ID names, cognate dbSNP rsSNP name, and more complete individual
and genotype call information for each SNP in each population.
13 April 2004: SNR "Cluster Plots" for Seq Jan, Feb Data
Cluster plots of SNR1 vs SNR2 for each SNP in the Sequenom January
and February, 2004, returned data were created, as both linear and log-log
plots. These plots provide a ready visual display of the quality of the
genotype calls, as well as the noCalls, made by Sequenom. noCall information
was also added to the data analysis Excel files.
5 Mar 2004: Analysis of Sequenom SNP Data of Feb 2004
Analysis of SNP data returned by Sequenom in February, 2004, was completed
and made available. Analysis is similar to that for the Nov 2003 data,
except that Sequenom is now using a Signal to Noise Ratio (SNR) method
for genotype determination. As a result, the analysis is based solely
on the Sequenom genotype calls.
5 Feb 2004: Analysis of Sequenom SNP Data of Jan 2004
Analysis of SNP data returned by Sequenom in January, 2004, was completed
and made available. Analysis is similar to that for the Nov 2003 data,
except that Sequenom is now using a Signal to Noise Ratio (SNR) method
for genotype determination. As a result, the analysis is based solely
on the Sequenom genotype calls.
12 Dec 2003: Summary Analysis of Sequenom SNP Data for all Populations
A summary sample Excel spreadsheet was prepared containing data for
all twelve populations for three data items: 1) total assayed individuals
in each population; 2) data quality "grades" assigned; and 3)
alleles: minor/MAJOR for each population. These summary data exemplify
a few ways in which the data can be manipulated.
10 Dec 2003: Analysis of Sequenom SNP Data returned Nov 2003
Analysis of SNP data returned by Sequenom in November, 2003, was completed
and made available. Analysis includes genotype values for all individuals
in a given population for a given SNP, together with p,q and HWE values.
SNP genotype calls were ambiguous in some cases, with data values falling
within "ambiguous regions". Five such ambiguous regions were
defined, yielding genotype calls for comparison with the Sequenom genotype
calls. Based on these comparisons, qualitative "grades" were
assigned to assess quality of the data. Data are presented in MS Excel
workbooks.
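The p, q and HWE values mentioned above follow directly from the genotype counts of a SNP in a population. The sketch below shows one way to compute them, together with a minor/MAJOR allele label. The function names, the example counts, and the lowercase/uppercase rendering of the label are assumptions; the ambiguous-region analysis and grade assignment are not reproduced.

```python
def allele_frequencies(n_hom1, n_het, n_hom2):
    """Return (p, q): frequencies of allele 1 and allele 2."""
    n = n_hom1 + n_het + n_hom2
    if n == 0:
        raise ValueError("no genotype calls")
    p = (2 * n_hom1 + n_het) / (2 * n)
    return p, 1.0 - p


def hwe_expected_counts(n_hom1, n_het, n_hom2):
    """Expected (Hom1, Het, Hom2) counts under Hardy-Weinberg Equilibrium."""
    n = n_hom1 + n_het + n_hom2
    p, q = allele_frequencies(n_hom1, n_het, n_hom2)
    return p * p * n, 2 * p * q * n, q * q * n


def minor_major_label(allele1, allele2, p):
    """Label such as 'c/T': minor allele in lowercase, major in uppercase."""
    minor, major = (allele1, allele2) if p < 0.5 else (allele2, allele1)
    return f"{minor.lower()}/{major.upper()}"


# Hypothetical counts for one SNP: 10 CC, 20 CT, 70 TT
p, q = allele_frequencies(10, 20, 70)
expected = hwe_expected_counts(10, 20, 70)
print(f"p = {p:.2f}, q = {q:.2f}")
print("expected counts:", [round(e, 1) for e in expected])
print("alleles:", minor_major_label("C", "T", p))
```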
Latest modification: October, 2006
If you have comments or queries,
send email to Doug Smith
|