dChip: Hierarchical Clustering and Enrichment Analysis
After obtaining model-based expression values, we
can perform high-level analysis such as hierarchical clustering (Eisen et al. 1998). Unsupervised sample
clustering using genes obtained by Analysis/Filter
genes can be used to identify novel sample clusters and their associated
“signature genes”, to check data quality by seeing whether replicate samples or
samples under similar conditions cluster together (and, if not, what the
possible reasons might be), and to identify unexpected clustering (e.g. samples
generated on the same date or in the same lab clustering together). Select the menu
“Analysis/Hierarchical clustering”:

A “gene list file” is a tab-delimited text file with
probe set name in the first column of each line. It can be generated by
“Analysis/Filter genes”, “Analysis/Compare samples” or “Tools/Gene list file”. It may also be
a “Tree file” saved by the “Clustering/Save tree”
function so that an existing tree structure saved before can be used. dChip
will use genes in the file for clustering.
One may check “Tools/Options/Analysis/Mask redundant
probe sets” to exclude the redundant probe sets (having the same LocusLink ID)
from a gene list and only keep the first occurring probe set, since multiple
probe sets for the same gene tend to bias the result of sample clustering and functionally significant gene clusters. However, if
the replicate probe sets are both selected by some filtering or comparison
criteria, and cluster closely in the clustering, this is a good indication of
meaningfulness of the selected gene list. On the other hand, if a selected gene
list seems to contain genes not related to each other (e.g. few replicate
probe sets), we may doubt its validity; often an FDR estimate by permutation
yields a similar number of genes and thus supports this suspicion. The same
conclusion can be extended to probe sets for the genes in the same gene
families or same pathways.
The samples used for clustering are either all the
arrays, or the samples in the “Array list
file” if it is specified. When a “Filter genes” gene list is used for
clustering, it is often desired to use the same “Array list file” used in
filtering genes to do gene clustering and sample clustering. This is an
unsupervised sample clustering since the genes are selected by large variation
across samples and the sample group information is not used. When one specifies
a “Compare samples” gene list generated by using only a subset of samples, it
is often desired to only specify and order the relevant samples in “Array list
file” and view them without sample clustering. In this case the main interest
lies in viewing the genes obtained by comparison, and one can often get good
sample clustering since the genes are selected by using the sample group
information. It is also interesting to cluster the samples used for selecting
genes together with samples not used for selecting genes (e.g. samples from an
independent study); one can then predict the group membership of the latter
samples.
The default clustering algorithm of
genes is as follows: the distance between two genes is
defined as 1 - r where r is the Pearson correlation coefficient between the
standardized expression values (make mean 0 and standard deviation 1) of the
two genes across the samples used. Two genes with the closest distance are
first merged into a super-gene and connected by branches with length
representing their distance, and are then excluded for subsequent merging
events. The expression values of the newly formed super-gene are the average of
the standardized expression values of the two genes (centroid-linkage)
across samples. Then the next pair of genes (super-genes) with the smallest
distance is chosen to merge, and the process is repeated n – 1 times to merge all the n genes. A
similar procedure is used to cluster samples. These standardization and
clustering methods follow Golub et al. 1999 and Eisen et al. 1998. Centroid linkage can
produce branch inversion when the distance between two clusters is smaller than
the height of either cluster; in that case dChip truncates the distance to be the larger of
the two heights. This prevents the branch inversion in visualization, but the
further distance computation is still based on the averaged profile.
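The merging procedure above can be sketched in Python as follows. This is a minimal illustration, not dChip's actual code: the function names are hypothetical, the super-gene profile is taken as the unweighted mean of its two children (the text does not say whether cluster sizes are weighted), and profiles that become constant are not handled.

```python
import numpy as np

def standardize_rows(x):
    """Scale each row to mean 0 and standard deviation 1."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean(axis=1, keepdims=True)) / x.std(axis=1, keepdims=True)

def centroid_cluster(data):
    """Repeatedly merge the two closest profiles (distance 1 - r).

    Returns a list of (id_a, id_b, height); merged nodes get new ids
    n, n + 1, ... in merge order.
    """
    profiles = {i: row for i, row in enumerate(standardize_rows(data))}
    heights = {i: 0.0 for i in profiles}
    merges = []
    next_id = len(profiles)
    while len(profiles) > 1:
        ids = sorted(profiles)
        best = None
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                d = 1.0 - np.corrcoef(profiles[a], profiles[b])[0, 1]
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # Truncate the drawn height so a parent is never lower than a child.
        h = max(d, heights[a], heights[b])
        # Assumption: the super-gene is the plain average of the two profiles.
        profiles[next_id] = (profiles.pop(a) + profiles.pop(b)) / 2.0
        heights[next_id] = h
        merges.append((a, b, h))
        next_id += 1
    return merges

# Hypothetical 4 genes x 4 samples; genes 0 and 1 are perfectly correlated.
data = np.array([[1.0, 2.0, 3.0, 4.0],
                 [2.0, 4.0, 6.0, 8.0],
                 [4.0, 3.0, 2.5, 1.0],
                 [1.0, 3.0, 2.0, 5.0]])
merges = centroid_cluster(data)
```

Note that the truncation only affects the recorded height; later distances are still computed from the averaged profile, as described above.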
One may choose the alternative
“Distance metric” 1 - |r| (r is the correlation coefficient) as the distance
measure. This is useful if we want negatively correlated genes to be
clustered together. The “Average
linkage” method can be specified, where the distance between two gene
clusters (super-gene) is the average of all pair-wise distances between two
genes not belonging to the same gene cluster. Tao Shi has observed that dChip
produces the same clustering result as the R function hclust (using 1 –
correlation matrix of row-wise standardized expression values) when the average
linkage is used, but not when the centroid linkage is used.
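For readers who want to reproduce an average-linkage clustering outside dChip, the following sketch uses SciPy rather than R's hclust. It is only an analogous computation on made-up data; agreement with dChip's output has not been verified here.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
expr = rng.normal(size=(6, 8))  # hypothetical data: 6 genes x 8 samples

# Row-wise standardization (mean 0, standard deviation 1 for each gene).
z = (expr - expr.mean(axis=1, keepdims=True)) / expr.std(axis=1, keepdims=True)

# SciPy's "correlation" metric is 1 - Pearson r between rows.
dist = pdist(z, metric="correlation")
tree = linkage(dist, method="average")
order = leaves_list(tree)  # left-to-right gene order in the dendrogram

# The alternative metric 1 - |r| groups negatively correlated genes as well.
dist_abs = 1.0 - np.abs(1.0 - dist)
tree_abs = linkage(dist_abs, method="average")
```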
Click “Options” (or “Tools/Options/Clustering”) to
specify additional clustering options:

We can choose to cluster samples as well as genes.
Uncheck the “Cluster genes” button to cluster samples without clustering genes,
and this is useful if genes need to be put in a particular order when
clustering samples. The option “Only draw lines for standard separator” (moved
to the “Tools/Array list file” dialog for V1.2+) is discussed in the section “Array list file”.
Before clustering, the expression
values for a gene across all samples are standardized (linearly scaled) to have
mean 0 and standard deviation 1, and these standardized values are used to
calculate correlations between genes and samples and serve as the basis for
merging nodes. If the scale of the data is already adjusted, one may choose not
to standardize a gene’s expression value across samples by unchecking the
“Standardize rows” option. By default the samples are clustered using row-wise
standardized or un-standardized values. One can check “Standardize columns” to
standardize the raw expression data column-wise for sample clustering. Since
the raw expression values are comparable row-wise but not column-wise, the
column-wise standardization may not be meaningful when different genes have
different magnitude of expression values. A user is advised to try to cluster
samples with or without “Standardize columns” checked to judge which option
yields more reasonable sample clustering.
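The difference between the two standardization options can be illustrated with a small sketch. The data are made up, and the use of the population standard deviation is an assumption (the text does not say which estimator dChip uses).

```python
import numpy as np

def standardize(x, axis):
    """Linearly scale to mean 0 and standard deviation 1 along the given axis."""
    x = np.asarray(x, dtype=float)
    mean = x.mean(axis=axis, keepdims=True)
    sd = x.std(axis=axis, keepdims=True)  # assumption: population SD
    return (x - mean) / sd

# Hypothetical 3 genes x 3 samples with very different magnitudes per gene.
expr = np.array([[100.0, 200.0, 300.0],
                 [10.0, 30.0, 20.0],
                 [5000.0, 4000.0, 6000.0]])

rows_std = standardize(expr, axis=1)  # "Standardize rows": per gene
cols_std = standardize(expr, axis=0)  # "Standardize columns": per sample
```

In `cols_std` the highly expressed third gene is positive in every sample, so column-wise values mostly reflect differences in magnitude between genes, which is the caveat raised above.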
If “Tools/Analysis/Treat outlier
expression as missing values” is checked, expression values called as
“array-outlier” will be ignored when computing correlations, and their data
points are displayed as black (Blue/Red coloring) or white (Green/Red coloring)
boxes in the clustering picture.
If the number of genes is large (e.g. 10,000), dChip
may report “out of memory” or perform slowly, since storing all the pair-wise
distances requires too much memory and may cause virtual-memory swapping. The
solution is to uncheck the “Tools/Options/Clustering/Pre-calculate distances”
button to calculate the pair-wise distances between genes on the fly.
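A rough estimate shows why pre-calculated distances become expensive. The helper below is hypothetical, and the 4 bytes per stored distance is an assumption; the text does not state dChip's storage format.

```python
def pairwise_distance_bytes(n_genes, bytes_per_value=4):
    """Return (number of gene pairs, bytes needed to store one distance per pair)."""
    n_pairs = n_genes * (n_genes - 1) // 2
    return n_pairs, n_pairs * bytes_per_value

pairs, size = pairwise_distance_bytes(10_000)
print(pairs, "pairs, about", round(size / 1e6), "MB")
```

For 10,000 genes this gives 49,995,000 pairs, or roughly 200 MB under the 4-byte assumption, and the requirement grows quadratically with the number of genes.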
Click “OK” to start clustering, and select
“Analysis/Stop Analysis” or press “ESC” to stop the ongoing analysis. When
the analysis finishes, the clustering picture is displayed
immediately. Click the “Analysis” icon on the left to view the analysis output, such as the following:
{Hierarchical clustering
Treat 24 arrays as 24 experiments
Read in genes listed in file D:\array\out\iglehart filtered gene.xls...
Found 191 genes
Begin clustering...
Calcuate distance 190
Merge event 189
Calcuate distance 20
Merge event 16
Finding significant functional clusters...
Found 6 chaperone genes in a 49-cluster (all: 61/5009, PValue: 9.84e-013)
Found 10 structural protein genes in a 47-cluster (all: 361/5009, PValue: 9.58e-005)
Found 8 extracellular genes in a 29-cluster (all: 400/5009, PValue: 4.93e-005)
Finished in 00 hours 01 minutes}
Here 191 genes are selected for clustering. dChip
also automatically searches for functionally significant clusters in the
resulting clustering tree.
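This section does not state which test dChip uses for the functional cluster p-values. As an illustration only, an enrichment p-value of this kind is commonly computed with a hypergeometric tail probability; the sketch below assumes that test, and the function name and example numbers are hypothetical.

```python
from scipy.stats import hypergeom

def enrichment_pvalue(hits_in_cluster, cluster_size, annotated_on_array, genes_on_array):
    """P(at least hits_in_cluster annotated genes) when cluster_size genes are
    drawn without replacement from genes_on_array, of which annotated_on_array
    carry the annotation. Assumes a hypergeometric model."""
    return hypergeom.sf(hits_in_cluster - 1, genes_on_array,
                        annotated_on_array, cluster_size)

# Hypothetical example: 5 annotated genes in a 20-gene cluster,
# with 50 annotated genes among 1000 genes on the array.
p = enrichment_pvalue(5, 20, 50, 1000)
```

A smaller p-value indicates that the annotation is more over-represented in the cluster than expected by chance under this model.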
Clustering image
Click the “Clustering” icon on the left to display
the “Clustering View” (Data courtesy of Dan Tang):

In the clustering picture each row
represents a gene and each column represents a sample. The gene clustering tree
is on the left, and genes close to each other have high similarity in their
standardized expression values across the 24 samples. The sample clustering
tree is on the top. Click anywhere in the right pane to activate the “Cluster
View”. Arrow keys can enlarge or reduce the size of the clustering picture,
Control+Arrow keys can change the size of the clustering trees, and Shift+Arrow
keys can adjust other aspects such as the height of sample information blocks.
Use the option “Tools/Options/Clustering/Sample names always visible” to make
the sample names always visible on the top when scrolling vertically.
On the bottom of the clustering picture is the color
scale: the red color represents expression level above mean expression of a
gene across all samples, the white color represents mean expression and the
blue color represents expression lower than the mean. Since the expression
levels for each gene are standardized to have mean 0 and standard deviation 1,
the standardized expression values most likely fall within [-3, 3]. By default,
dChip uses pure white to represent 0, pure red to represent 3 or higher, and
pure blue to represent –3 or lower (Golub et al.
1999). This “displaying range of standardized values” (3) can be changed in
the “Tools/Options/Clustering” dialog. Select menu
“Tools/Options/Clustering/Use traditional red/black/green coloring scheme” to
use the coloring scheme adopted by the TreeView software (Eisen et al. 1998). The height of the color
scale can be adjusted by Shift+Up/Down arrows.
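The blue/white/red scale can be sketched as a mapping from a standardized value to an RGB color. The text fixes only the three anchor points (0, the displaying range, and its negative); the linear fade between them and the function name are assumptions.

```python
def value_to_rgb(z, display_range=3.0):
    """Map a standardized value to (R, G, B): white at 0, pure red at
    +display_range or above, pure blue at -display_range or below."""
    t = max(-1.0, min(1.0, z / display_range))  # clip to [-1, 1]
    fade = int(round(255 * (1.0 - abs(t))))     # assumption: linear fade
    if t >= 0:
        return (255, fade, fade)
    return (fade, fade, 255)

white = value_to_rgb(0.0)
red = value_to_rgb(3.0)
blue = value_to_rgb(-3.0)
```

Lowering `display_range` makes moderate deviations from the mean appear more saturated, which corresponds to changing the “displaying range of standardized values” option.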
Clicking inside the clustering picture highlights a
data point with a surrounding blinking square:

The array name, probe set name with absolute call,
the standardized value with original expression value and standard error for
this probe set are displayed in the status bar on the bottom. Zooming in with
arrow keys when a data point is highlighted will always place this data point
in the center of the viewable picture area. To deselect an active data point,
press “ESC” key or select the menu “Clustering/Selected Branches/Clear”. Gene
names or descriptions are displayed on the right side; use the button
“Tools/Options/Clustering/Gene descriptions from LocusLink when available” to
toggle between displaying only Affymetrix descriptions or a mixture of
LocusLink name (when available) and Affymetrix descriptions. If the gene descriptions
are truncated on the right, use “Control+Right Arrow” to widen the gene
clustering tree as well as the gene description area.
If an Internet browser is properly set up, the menu
“View/Online Database” will start the browser to access the database page for
the current probe set. Linking to online resources may not work on some
computers. Check "Tools/Options/Analysis/Show online link dialog"
to show a dialog containing the web address and automatically copy the
address to the clipboard; one can then manually paste it into the address bar
of the Internet browser.
One can also right-click a non-gene node in the
clustering tree to exchange the positions of its two branches, in order to interactively
adjust the ordering of genes and samples in the clustering picture. This
changes the visual perception of the clustering picture but not the clustering
result. The original order has been determined so that the tighter child
cluster (with smaller distance at its final merging event) is on the top.
There is also research work on determining “optimal” orders of leaf nodes in a
clustering by some criterion (e.g. gene’s peak expression time during a time
course), but in principle all the 2^(N-1) orderings are equivalent.
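The count of 2^(N-1) equivalent orderings follows because a binary tree with N leaves has N - 1 internal nodes, each of which can be flipped independently. A small sketch, using nested tuples as a hypothetical tree representation, enumerates them:

```python
def leaf_orders(tree):
    """Return every leaf order obtainable by swapping the two branches
    of any internal node. A leaf is any non-tuple value."""
    if not isinstance(tree, tuple):
        return [[tree]]
    left, right = tree
    orders = []
    for a in leaf_orders(left):
        for b in leaf_orders(right):
            orders.append(a + b)  # branches in original order
            orders.append(b + a)  # branches exchanged
    return orders

# Hypothetical tree with N = 4 genes, so 2 ** 3 = 8 orderings are expected.
tree = (("g1", "g2"), ("g3", "g4"))
orders = leaf_orders(tree)
```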
Select “View/Next View” or press “Enter” to go to
other views such as “PM/MM Data” or “CEL Image” to look at the probe level data
of the current expression data point, and press “Enter” again to come back to
the clustering picture. This is useful if one observes unusual data points in
the clustering picture (such as large negative expression values), and wants
to check the probe level data. For
example, sometimes we may see bright strips of genes with very high or low
expression values in particular samples. Usually we should be cautious about such
samples, since this may imply that normalization did not make such arrays
comparable to others. It is then good to click an extreme red/blue data point
in the “Clustering view” and then use “View/PM/MM data” to check the
probe-level data. This way we can confirm whether a high/low-valued data point
is real or due to noise/outlier.
Selected gene branch
Clicking any node in the gene or sample clustering
tree will highlight the corresponding clustering branch in blue. Some items in
the “Clustering” menu are activated only after this. Checking
“Tools/Options/Clustering/Averaged gene profile pattern” will display a profile
plot for the selected gene cluster:

The Y-axis has the same range as the color scale
([-3, 3] by default) on the bottom of the clustering picture. The value of the
profile curve for each sample is the average of the standardized expression
values of all selected genes in this sample (standardization is a linear
scaling for each gene so its expression values across all samples have mean 0
and standard deviation 1). The error bar extends 1 standard deviation (of the
selected genes’ standardized expression values in a sample) on both sides. A
shorter error bar indicates tighter clustering of genes at the corresponding
sample. In V1.3, if only a single gene is selected in the branch, the Y-axis
range of the profile plot is from 0 to the maximal raw expression data of this
gene across the samples. Thus the relative fold changes of this gene across
samples can be inferred from the plot.
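The profile curve and its error bars can be sketched as below. The function name and data are hypothetical, and the population standard deviation is assumed for both the standardization and the error bars.

```python
import numpy as np

def averaged_profile(expr, selected_rows):
    """Return (mean, sd) per sample of the standardized values of selected genes."""
    x = np.asarray(expr, dtype=float)
    # Gene-wise standardization across all samples.
    z = (x - x.mean(axis=1, keepdims=True)) / x.std(axis=1, keepdims=True)
    sel = z[selected_rows]
    return sel.mean(axis=0), sel.std(axis=0)  # assumption: population SD

# Hypothetical 3 genes x 3 samples; genes 0 and 1 have the same shape.
expr = [[1.0, 2.0, 3.0],
        [2.0, 4.0, 6.0],
        [3.0, 1.0, 2.0]]
profile, error_bar = averaged_profile(expr, [0, 1])
```

Because genes 0 and 1 standardize to identical profiles, the error bars are zero in every sample, the extreme case of a tight cluster.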
In the menu “Clustering”, we can clear the selected
genes or select all genes, as well as delete the selected gene clusters and
redo clustering using the rest of genes. Select “Clustering/Export branch
image” (check "BMP" format) to export or copy the clustering image of
the selected main gene cluster outlined by blue lines; however, the sample
clustering tree is not attached to this image.
Select the “Clustering/Export branch data” menu to
export the raw or gene-wise standardized expression data of the selected branch. In
the latter case the data of the averaged profile will also be exported. If no
sample branches are selected, expression values for all samples will be
exported; otherwise only the data of the selected sample branch will be
exported. The exported file can be used as the “gene list file” in the
“Analysis/Hierarchical Clustering” dialog to perform clustering using only this
subset of genes. In V1.3+, if “Clustering/Export branch data/Cut the tree at
the height of current branch and export all branches” is checked, one may
export gene expression data grouped in clusters. These clusters are obtained by
cutting the gene clustering tree at the height of the selected blue branch.
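Cutting a tree at a branch height to obtain gene groups can be illustrated with SciPy. This is an analogous computation on made-up data, not dChip's export routine; the choice of the cut height stands in for the selected blue branch.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
expr = rng.normal(size=(10, 6))  # hypothetical data: 10 genes x 6 samples

# Average-linkage tree on 1 - Pearson r distances between genes.
tree = linkage(expr, method="average", metric="correlation")

# Hypothetical selected branch: the third merge from the top of the tree.
cut_height = tree[-3, 2]

# Genes joined at or below cut_height end up in the same group.
labels = fcluster(tree, t=cut_height, criterion="distance")
```

Each distinct value in `labels` corresponds to one exported cluster; here the two merges above the cut remain undone, leaving three groups.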
Use Control+Click to select and color multiple gene
or sample clusters. The multiple colored clusters can be exported (for sample
branches only) or deleted (for gene branches only) by the functions in the menu
“Clustering”. In contrast, a plain click selects the main gene cluster (outlined by
blue lines), which is used by the functions described above and in resampling
clusters.
The "Clustering/Similar
Profile" function can search and export genes with high positive or negative
correlations across samples with the current highlighted gene or gene branch.
The resultant list can be used as the "gene list file" in the
"Analysis/Hierarchical Clustering" dialog to view these genes.
dChip also provides a resampling method to assess the reliability of
clusters by using the standard errors for expression values (Li and
Wong 2001b page 6, section “Standard errors help to assess clustering
results”). We resample each expression value from a Normal distribution with
mean equal to the estimated expression value and standard deviation equal to
the attached standard error. Click a tree branch to highlight a gene or
sample cluster in blue, then select the menu “Clustering/Resample once” to resample
all the data points and redo the clustering; select “Clustering/Go to original”
to go back to the original clustering.
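One resampling step as described above can be sketched with NumPy. The function name and the example values are hypothetical; re-clustering the resampled matrix and comparing trees is left out.

```python
import numpy as np

def resample_once(expr, std_err, rng):
    """Draw each value from Normal(mean=expression value, sd=standard error)."""
    expr = np.asarray(expr, dtype=float)
    std_err = np.asarray(std_err, dtype=float)
    return rng.normal(loc=expr, scale=std_err)

rng = np.random.default_rng(0)
expr = np.array([[100.0, 200.0], [50.0, 80.0]])   # hypothetical expression values
std_err = np.array([[10.0, 15.0], [5.0, 40.0]])   # hypothetical standard errors
resampled = resample_once(expr, std_err, rng)
```

Values with large standard errors move the most between resamplings, so clusters that depend on them are the ones most likely to change when the clustering is redone.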
Significant sample cluster
Similarly, during sample clustering, the sample
information specified in the “Sample
information file” is used to calculate the sample clusters enriched by
samples of a certain description (Data courtesy of Andrea Richardson):

The sample cluster p-values are calculated with
regard to the samples used in the “Array list file” (if no “array list file” is
specified then all the arrays in the group). By default clusters with p-value < 0.05 will
be reported and the threshold can be set at “Tools/Options/Clustering”. Note
that the p-values for gene clusters are calculated with regard to all the genes
on the array, since a gene list is filtered or obtained by some means; for
sample clusters, however, samples not in the “array list file” may not be of
interest to us.
Each sample description
column in the “sample information file” is currently limited to 10 discrete categories. The color of
sample category boxes can be set by “Control+Click”, and the height of sample
information blocks can be adjusted by Shift+Up/Down arrows.
[New] In the clustering or chromosome view, set “Options/Clustering/Number
of letters shown for sample information” to be greater than 1 to display 1 or
more letters above samples, overlaying on the color blocks representing sample
categories.
Save clustering tree
Select the menu “Clustering/Save
Tree” to save the structural information of the clustering tree. The file can
be used as “Analysis/Hierarchical Clustering/Gene list or tree file” later to
avoid repeating the clustering computation. In the tree file, each gene/supergene
has its parent and children’s ID, the weight (how many genes are included in
this branch), and the distance when merging its two children. The sample tree
information follows the gene tree information. Thus it is also possible that
clustering is performed by using other software such as S-PLUS or R, but the
result is exported into a file with the dChip tree file format (apply
“Clustering/Save Tree” once to get its format), and then the clustering is
viewed in dChip after reading in the data using “Open group” or “Get external
data”.
Export clustering image
The clustering image can be
exported by the “View/Export Image”
menu. However, sometimes “View/Export image” does not produce any output file,
or the exported image is altered or incomplete. This is an unfixed bug. One may
try one of the following to get around the problem:
· Save the file as a different format (JPG, BMP or EMF)
· Use the Arrow or Control+Arrow keys to make the image smaller and then run “View/Export image”.
· Use the "PrintScreen" key (at the upper-right corner of the keyboard) to copy the whole screen image and then paste into the Microsoft Paint software; if necessary do this several times to compose the whole image.
· Change to a different Windows platform or a PC with larger memory.
Another known problem on Windows
95/98 is that after zooming in or out the clustering image many times, the
clustering tree may disappear. One may restart dChip in this case, or upgrade
the Windows system to a newer version.
Remove irrelevant genes from clustering analysis
Often it is desirable to exclude
some genes from the clustering analysis. For example, MHC and immunoglobulin
family genes vary for reasons that are irrelevant to the experiments and
analysis of clonal B-cell populations (Bradley Messmer, pers. comm.).
One can make a gene list file (a
text file containing a probe set name on each line) of collagen genes, by using
“Tools/Gene list file/By keywords”
or from a literature search, then in the “Analysis/Filter genes” or
“Analysis/Compare samples” dialog specify it as the “Filter on gene list:
excluding …” file. The resultant filtering or comparison gene list will not
contain these genes.
To remove a list of probe sets
completely from the array, specify the gene list as the “Analysis/Open
group/other information/probe mask
file” so that the CDF.bin file no longer contains these genes.
Lastly, if an undesired gene
cluster is seen in the clustering picture, one can click to highlight this cluster and use
“Cluster/Delete selected gene” to exclude these genes so that they do not
affect the sample clustering.