dChip: Information files


Note: the information files in this page are relatively old. Please see the respective sections for details about getting the latest information files.

Genome information file         RefGene and cytoband file      Common probe set file

 

Gene information file

 

Also see dChip manual. Please use the new method to make the latest gene information files.

It is recommended to use Internet Explorer as the browser to download these files. Right-click the following links and select “Save Target As” to download the files. If the file is in ZIP format, just unzip all files into the same directory. If you open the files in Excel and edit it, be sure to save the files in tab-delimited text format.

 

Files made by the new method:

12/14/02: HG_U95Av2, HG-U133A, MG-U74Av2

MOE430A (6/5/03), MOE430B (8/6/03), RAE230A (6/20/03), HG-U133A (7/8/03), RG-U34A (7/10/03)

 

Files made by the old method:

 

Hu6800 (a.k.a. HuGeneFL, 2/5/02)

Hu35K (2/5/02, also download GeneOntology.txt and ProteinDomain.txt to the same directory): all, subA, subB, subC, subD

HG_U95A, HG_U95B, HG_U95C, HG_U95D, HG_U95E

HG_U95Av2/HC_G110: 12/30/01; HG-U133A/ HG Focus Array: 3/12/02, HG-133B: 3/12/02

Mu11ksubA, Mu11ksubB, Mu19ksubA, Mu19ksubB, Mu19ksubC, Mu6500

MG_U74A, MG_U74B, MG_U74BV2, MG_U74C, MG_U74CV2 (10/4/01), MG_U74Av2: 1/3/02,   

RG_U34A/RN_U34, RG_U34B, RG_U34C

Drosophila chip

Arabidopsis 8K: 11/6/01 (based on TAIR file GO_annotations (4/22/01) and affychip.071001)

            9/12/02 (made by John Okyere at Nottingham Arabidopsis Stock Center; has more gene annotation terms)

ATH (9/2/02; based on Affy GIN file (thanks to John Okyere) and TAIR file ATH_GO.20020828.txt; need dChip V1.2+)

YG_S98 (1/12/02; based on SGD file orf_geneontology.tab (1/5/02) and ORF_table.txt (11/2/01); no GO graph tracing is performed and all the GO terms in orf_geneontology.tab are used)

Ecoli (3/13/02, based on UW E. coli Genome Project file m52orfs.txt)

 

Old construction method for gene information file:

 

The “gene information file” augments gene information in Affymetrix EAZI database (use “Export” function) or GIN file (“Probe Set”, “Identifier”, “Description” columns). They combine Affymetrix’s annotation and some gene functional classification from NCBI’s LocusLink database.

 

The following database files are downloaded from NCBI and GeneOnotology website (the sequential downloading dates in parenthesis):

Unigene: Hs.data (10/15/01, 12/19/01), Mm.data (9/7/01, 1/2/02), Rn.data (4/9/01)

GeneOntology: process.ontology (03/01, 11/6/01, 12/22/01), function.ontology (03/01, 11/6/01, 12/22/01), component.ontology (03/01, 10/18/01, 12/22/01)

LocusLink: LL_tmpl (08/01, 10/17/01), LL3_011228 (12/28/01), loc2acc (12/27/01).

 

The downloaded files are pre-processed (C++ code provided on request) using the following steps:

 

1. Parse() (a C++ function, same below) reads in the Unigene files and output “Accession to LocusLink” mapping, this mapping is combined with “loc2acc” file in Access.

2. GO.ReadFile() reads in the three GeneOntology files (“obsolete” terms are deleted from the files manually) and construct the graph of GO terms ; ProcessLocusLink() reads in Locuslink file and for “GO” lines extracts the GO term and traces up to the top levels of GO trees. Two-round process of LocusLink file obtains up to 1000 most abundant GeneOntology and Protein domains terms.

3. The output files loaded in Access and queries are used to produce the final gene information files.

 

Also the "GeneOntology.txt" and "Protein Domain.txt" are constructed to speed the reading process. The contents came from the LocusLink database. Column 1 is GO ID or Pham ID which is used as index in the gene information file, and column 3 is the number of occurrences of a ID/term in the LL3_011228 file.

 

The gene information file names are followed by the most recent updating date (otherwise it’s before Aug. 2001). Comparing the downloading dates of the database files gives which database files are used for a certain gene information file.

 

Genome information file


Also see dChip manual. Please use the new method to make the latest genome information files.

 

The following files made by the old method. Right-click the following links and select “Save Target As” to download the files, and be sure to save the files in text (tab-delimited) format if opened and modified in Excel. Files created on or after 7/27/02 require using dChip version 1.2+.

 

The genome alignment information in the NetAffx March 2003 release is based on UCSC Genome Bioinformatics database: November 2002 Human genome assembly (hg13), February 2002 Mouse genome assembly (mm2), and November 2002 Rat  genome assembly (rn1).

 

Hu6800: 12/12/01 (based on hg8 genome assembly), 7/27/02 (hg11); HG_U95Av2: 12/12/01 (hg8), 7/27/02 (hg11); HG-U133A: 7/27/02 (hg11)

MG_U74Av2: 7/27/02 (mm2)

YG-S98 (1/12/02; using SGD database file ORF_table.txt (11/2/01))

Ecoli (3/12/02, using EAZI database)

 

SNP arrays: HuSNP: 7/27/02 (hg11), 9/14/02 (hg12); 10K SNP array: 1/2/03 (based on Affymetrix annotation file)

 

Old construction method:

1. For human and mouse arrays, the following UCSC Genome Bioinformatics annotation data files are downloaded (select organism/“Annotation database”; links below may be outdated):

Human refGene.txt (RefSeq ID to chromosome position mapping, format; hg11 April 02, hg12 June 02), refLink.txt (LocusLink ID to RefSeq ID mapping, format)

Mouse: refGene.txt (format; mm2 Feb 02), refLink.txt (format)

Human SNP files (dbSNP rs# to chromosome position mapping): snpNih.txt (format), snpTsc.txt (format)

 

2. The genome information files are obtained by reading these data tables in Microsoft Access and making a query to link from probe set à Accession à LocusLink à(refLink.txt) à RefSeq ID à (refGene.txt) à chromosome positions. To generate the latest files, one may use the “Accession à LocusLink” information in the gene information files, and download the latest UCSC files and follow this procedure.

 

(Updated 10/5/05)