% % NOTE -- ONLY EDIT THE .Rnw FILE!!! The .tex file is % likely to be overwritten. % \documentclass[12pt]{article} \usepackage{amsmath,pstricks} \usepackage[authoryear,round]{natbib} \usepackage{hyperref} \textwidth=6.2in \textheight=8.5in %\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in \newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\bi}{\begin{enumerate}} \newcommand{\ei}{\end{enumerate}} \textwidth=6.2in \bibliographystyle{plainnat} \begin{document} %\setkeys{Gin}{width=0.55\textwidth} \title{Preliminaries for intermediate course, \\ Boston, March 2008 \\ Revised Final version} \author{VJ Carey} \maketitle \tableofcontents \section{Discussion} \begin{enumerate} \item You will be using R 2.7.0 (devel version). A distribution will be supplied for all windows users that will be independent of any installations resident on the laptop. \item You are expected to have a reasonable facility with the R programming language. Please see www.r-project.org/Manuals. Read An Introduction to R and work through the sample session. \item R is a language for data analysis and visualization. It is also provides vehicles for developing and disseminating software tools. These vehicles are a basis for, and have been extended by, the Bioconductor project. \bi \item R is persistently \textit{extended} by \textit{packages}. Examples are provided at cran.r-project.org. Packages contain software and documentation, and often data for illustration or as a central component. Packages at CRAN and Bioconductor are subjected to standard quality control, often on a nightly basis, as they evolve and as R evolves. \item Documentation for R and Bioconductor exists at three levels. \bi \item At the highest level are monograph-like documents such as Modern Applied Statistics with S (Venables and Ripley) or Introductory Statistics with R (Dalgaard). The manuals at www.r-project.org are at a similar level, covering programming processes and analysis methods over a wide variety of packages. A monograph on Bioconductor is also available (Bioinformatics and Computational Biology Solutions with R and Bioconductor, eds. Gentleman, Carey et al.) Access: Bookstores. \item At the next level are vignettes. Typically a vignette is associated with an R package. A vignette will describe a workflow, typically using several functions in one package. Vignettes are 'computable documents'. We will show how to create such documents at the end of the course. Access: CRAN/bioconductor pages devoted to packages; R function openVignette(). \item At the lowest level are manual pages for package functions. These documents describe details of function calls. Executable examples are typically provided with each manual page. Access: help([topicname]), example([topicname]), help.search(), help.start(). \ei \item Acquiring packages will not be a major part of the course. All software will be supplied at the start of the course. But it should be noted that package acquisition is supported in the R language with function install.packages(). at http://www.bioconductor.org, a script called biocLite.R can be 'sourced', so that biocLite("[packagename]") will acquire a package for installation. A given package may depend upon other packages and biocLite will arrange to resolve all the dependencies and install all needed packages. \ei \item Bioconductor is a project devoted to the creation of data structures and algorithms supporting modern computational biology. The primary products are R packages, available through the Bioconductor portal www.bioconductor.org. \bi \item The Biobase package defines most of the primary data structures for genome-scale data. \item Various biological metadata packages, such as hgu133plus2, illuminaHumanv2, or org.Hs.eg.db, define mappings between identifiers, and packages GO.db and KEGG.db define metadata structures relating to Gene Ontology and the KEGG pathway catalog. \item Analytical packages such as limma, MLInterfaces, GOstats, Category, graph, RBGL, address various approaches to interpretation of genome scale data. \item Preprocessing packages such as oligo, arrayQualityMetrics, affyPLM, lumi, beadarray, beadarraySNP, help to transform raw assay outputs into interpretable measures and to assess assay quality. \item Experimental data packages such as MAQCsubset, yeastCC, bronchialIL13 illustrate how experiments can be organized for convenient analysis and evaluation. \ei \item This `preliminaries' document attempts to cover relevant mechanical practices needed to work effectively with microarray and allied data. The review section following illustrations covers essential repertory \bi \item in R: calling functions, creating numerical data, attaching software packages, running documentation examples, loading stored data, capturing data from files, creating or rewriting complex objects, use indexing on vectors and matrices to extract information, understanding the recycling rule, basic visualization, statistical procedure invocation and report generation, creation of simple functions, and use of applicative programming \item in Bioconductor: obtaining Bioconductor packages for procedures and data, understanding introspective capacities of Bioconductor containers for high-throughput datasets, investigating values of expression assays, understanding resolution of identifiers using annotation maps \ei The preliminaries are thus weighted towards building facilities in R, so that we can readily exploit high-level facilities in Bioconductor to answer the biological questions that interest us. You should feel comfortable giving examples of or writing brief paragraphs on \textit{almost every item} in the laundry list of repertory topics given above before coming to the course. \ei \clearpage \section{Illustrations} The following operations and their results should be completely straightforward after light study. You can set up a version of R that supports the dialogue below by running <>= source("http://www.bioconductor.org/biocLite.R") biocLite(c("Biobase", "graph", "Rgraphviz", "MAQCsubset", "hgu133plus2.db", "GO.db", "annotate")) @ \subsection{Typography} The typography above works as follows. What you type into R is rendered thusly: <>= x = rnorm(20) @ What R does is rendered thusly: <>= x @ When you see plus-signs in italics on the left, you are seeing continuation symbols that are not to be typed in -- they show that input to R is spanning several lines. \subsection{Dialogue} \subsubsection*{Class inspection} <>= # # attach the base package for Bioconductor # and check the structure of eSet class # library(Biobase) getClass("eSet") @ \subsubsection*{Network structures} <>= # # get packages that define network structures and # support their visualiation; plot a random graph # library(graph) example(randomGraph) @ \setkeys{Gin}{width=0.65\textwidth} <>= library(Rgraphviz) plot(g1) @ @ \subsubsection*{MAQC data inspection} <>= # # get data from MAQC # library(MAQCsubset) data(afxsubRMAES) experimentData(afxsubRMAES) # # look at some expression values # exprs(afxsubRMAES)[1:5,1:5] # # what 'phenoData' values are present # names(pData(afxsubRMAES)) @ \vspace*{3cm} [This space reserved for jelly stains.] \clearpage \subsubsection*{Boxplots} <>= # # show distributions of expression of a gene # stratified by percent ambion brain hybridized # boxplot(split(exprs(afxsubRMAES)[4000,], afxsubRMAES$pctBrain), xlab="percent brain", ylab="expression of gene 4000") @ @ \subsubsection*{Regression modeling} <>= # # compute the linear regression for this relationship and # summarize: mylm = lm( exprs(afxsubRMAES)[4000,] ~ afxsubRMAES$pctBrain) summary(mylm) # # obtain the p-value for the test of zero slope # summary(mylm)$coef[2,4] # @ \subsubsection*{Annotation} @ <>= # # attach a package for annotation management, get the # annotation data for hgu133 plus 2.0 chip, GO # library(annotate) library(hgu133plus2.db) library(GO.db) # # get the affy identifier for gene 4000 (in the # ordering of featureNames) # fn = featureNames(afxsubRMAES)[4000] # # now look it up in various dictionaries # lookUp(fn, "hgu133plus2", "ENTREZID") lookUp(fn, "hgu133plus2", "GENENAME") lookUp(fn, "hgu133plus2", "SYMBOL") gotags = lookUp(fn, "hgu133plus2", "GO") length(gotags) @ \subsubsection*{List manipulation} <>= gotags[[1]][1:2] # show only two hits # # find the interpretation of an associated GO tag # lookUp("GO:0004674", "GO", "TERM") # # HARD: get all the interpretations as 'terms' # an easier approach is to use aafTableAnn # sapply(lookUp(names(gotags[[1]]), "GO", "TERM"), Term) @ \clearpage \section{Review} \subsection{Essential R repertory} Fundamental operations reviewed above include \begin{enumerate} \item \textit{Most generally:} \textbf{calling functions.} Every action in R results from the evaluation of a function. Typically the lexical form of a function call is \texttt{f(x,y,z)}, where "\texttt{f}" names a function that has been defined in some package and \texttt{x}, \texttt{y}, \texttt{z} are inputs suitable for use with \texttt{f}. This mention of `suitability' is vague but inevitable at this point. We learn what is suitable for a given function by reading the manual pages and other documentation resources. Some operations in R do not look like function calls, but they are: \texttt{x[1,2]} can be written \texttt{"["(x,1,2)}. Distinguish carefully between vector inputs to functions and multiple parameters. \texttt{x[1,2]} and \texttt{x[c(1,2)]} are very different computations. \item \textbf{create numerical data via simulation; create (pseudo) random samples from given data.} Random number generation is accomplished using the r[dist] family of functions: \setkeys{Gin}{width=0.65\textwidth} <>= set.seed(1234) xn = rnorm(100, 3, 2) xp = rpois(100, 3) xg = rgamma(100, 2, 2) xb = rbinom(100, 4, .4) par(mfrow=c(2,2)) hist(xn) hist(xp) hist(xg) hist(xb) par(mfrow=c(1,1)) @ To draw a random sample from a vector, use sample: <>= mean(sample(xn, size=20)) mean(sample(xn, size=20)) @ Note that set.seed can be used to control reproducibility of 'random' events. \item \textbf{package attachment} -- \texttt{library([packagename])} Note that you can get information on manual pages available for a package by doing \texttt{help(package=[packagename])}. \item \textbf{run examples} -- \texttt{example([topic])} \item \textbf{load stored data} -- \texttt{data([objectname])} You can get information on data sets provided by a package with the command \texttt{data(package=[packagename])} \item \textbf{data capture from files}: functions \texttt{read.csv}, \texttt{read.delim}, \texttt{readLines}, \texttt{scan} are relevant in different contexts. Practice `round trips': create some data in excel, export as CSV, import to R, analyze/extend, export results using write.table, and import to excel. \item \textbf{create new (or rewrite old) objects in an R session} -- \texttt{y = f(x)} or \texttt{x = f(x)} \item \textbf{use indexing on vector-like or matrix-like data structures to refer to subsets} -- the forms \texttt{x[i]} and \texttt{x[i,j]} need to be fully appreciated; \texttt{i} and \texttt{j} may themselves be vectors or matrices \begin{enumerate} \item named vector elements <>= x = c(1,2,3,4) names(x) = c("a", "b", "A", "B") x[3] x["A"] x[c("A", "a")] @ \item named matrix margins <>= y = matrix(1:16, nr=4, nc=4) y # no names rownames(y) = colnames(y) = c("a", "b", "c", "d") y # names on both margins y["b",] # extract a row y[c("b","c"), c("b", "d")] # extract a submatrix @ \end{enumerate} \item \textbf{recycling rule for binary operations}: when a binary operation is applied to two vectors of different lengths, the shorter is replicated to the length of the longer (with truncation and warning if the length of the longer is not an exact multiple of that of the shorter) <>= sum(y) sum(y+1) sum(y+c(1,1,1)) # will warn @ Note also that we can compose operations within indexing contexts (above, creating a vector within []). \item \textbf{visualize data}: \texttt{plot(x, y)}, \texttt{boxplot(split(x,y))}, \texttt{pairs} are important for exploration; the lattice package is essential for visualizing multivariate data \item \textbf{compute simple statistics}: function \texttt{mean}, \texttt{median}, \texttt{sd}, \texttt{cor} are self-explanatory, but behavior with incomplete data (i.e., some elements take value NA) must be understood with care \item \textbf{execute statistical procedures and derive summaries:} \texttt{lm} and its formula interface illustrates a basic paradigm for working with linear and nonlinear models \item \textbf{create simple functions.} Example: suppose we are given a set of numeric strings of different lengths and we wish to pad them to a common length with "0" on the left. We break this up into two phases. First we write a program that solves the problem of creating the padding prefix. Given a prefix length, the following function creates the padding string: <>= makePref = function(prefLen) paste(rep("0", prefLen), collapse="") makePref(4) # works try( makePref(c(3,4,5))) # fails! @ We see that our program fails on a vector input. An exercise is to create a variation that succeeds on a vector input. However, for our purposes, this function is sufficient. We now build a function that solves the original problem, using makePref iteratively to get the padded strings. This function MUST SUCCEED on vector inputs. <>= padTo = function(x, length=10) { nx = length(x) # number of elements passed lens = nchar(x) # character length of each element add = length-lens # deficits pads = sapply(add, makePref) paste(pads, x, sep="") } strs = c("12334", "1322", "144445") padTo(strs,10) @ \item \textbf{use applicative programming.} In the example function above, we used \texttt{sapply} to quickly iterate over the elements of a vector (the vector \texttt{add}). Instead of using a 'for' loop, we can use a single statement to have a more concise and efficient calculation. Other applicative programming tools are \texttt{lapply}, \texttt{sapply}. \end{enumerate} \subsection{Essential Bioconductor repertory} \begin{enumerate} \item \textbf{obtain desired software and data packages}, and resolve relevant dependencies: \texttt{biocLite} \item \textbf{understand the introspective capacity of Bioconductor's high-throughput containers} \begin{enumerate} \item view (and populate) MIAME schemas <>= experimentData(afxsubRMAES) @ \item abstracts <>= substr(abstract(afxsubRMAES),1,70) @ \item sample-level variables <>= names(pData(afxsubRMAES)) @ \item for some `custom' assays, \texttt{featureData} details can be important for understanding assay reporters; for manufactured assays these are factored out into annotation packages \end{enumerate} \item \textbf{investigate values of expression assays} -- \texttt{exprs([esetname])[r,c]}where \texttt{r} and \texttt{c} delimit `genes' and samples of interest respectively. Note that Bioconductor structures support associative memory/polymorphic indexing in various ways: <>= afxsubRMAES["1557970_s_at", ] # explain exprs(afxsubRMAES)[4000,1:3] exprs(afxsubRMAES)["1557970_s_at",1:3] exprs(afxsubRMAES)["1557970_s_at",c("AFX_1_A1.CEL", "AFX_1_A2.CEL", "AFX_1_B1.CEL")] @ \item \textbf{understand resolution of identifiers using annotation maps} <>= library(annotate) library(hgu133plus2.db) lookUp("1557970_s_at", "hgu133plus2", "GENENAME") @ The annotate lookUp facility is very basic. It was designed when Bioconductor annotation maps were one-directional, with probe identifiers as keys and biological terms or tokens as values, as shown above. Newer facilities (with which original facilities currently coexist) are based on the relational database system SQLite. A simple application, commonly required, finds the probe identifier(s) for a certain gene symbol: <>= rmap = revmap(hgu133plus2SYMBOL) get("A2M", rmap) mget(c("A2M", "CRP"), rmap) @ A deeper view of what is going on here can be obtained via DBI (database interface) facilities. This requires an acquaintance with SQL. <>= co = hgu133plus2_dbconn() # hook to the database co dbListTables(co) lk133 = function(x) dbGetQuery(co, x) # shorthand helper lk133("select * from probes limit 200,5") lk133("select * from gene_info limit 200,5") @ From the AnnotationDbi vignette, we adapt an example. We wish to find all the probes on illuminaHumanv2 that have a GO MF function ascribed, with evidence code TAS (traceable author statement). {\small <>= mft = lk133("SELECT symbol FROM go_mf INNER JOIN gene_info USING(_id) WHERE go_mf.evidence = 'TAS'") dim(mft) # will be data.frame mft[1:3,] @ } \item \textbf{Perform analyses of high-throughput experiments}. This `essential' repertory component is not regarded as a preliminary for the course. This is the intended content of the course. \end{enumerate} \clearpage \section{Exercises} \begin{enumerate} \item \textbf{Vector element annotation.} Attach MAQCsubset package, and load the afxsubRMAES dataset: <>= library(MAQCsubset) data(afxsubRMAES) @ Now grab some expression values: <>= myv = exprs(afxsubRMAES)[1000:1004,1] myv @ Replace the probe names on this vector with the HUGO symbols. You should get: \begin{verbatim} > myv MGC15705 FAM98B C17orf57 C17orf57 FAM71B 3.772405 2.688321 4.047763 3.229765 4.777014 \end{verbatim} \item \textbf{Revising phenoData values}. Create a revised version of afxsubRMAES where the percentage of Ambion brain has the customary units of percentage measures ranging from 0-100 (the current representation has proportions, even though the name pctBrain suggests percentages). \item \textbf{Improving the self-documentation of an ExpressionSet}. MAQCsubset package has illumina data in an object called \texttt{ilmMAQCsubR}. Access it <>= data(ilmMAQCsubR) @ Create a vector, \texttt{myb}, defining the percentage of Ambion brain present in each sample. Attach it to the phenoData of the ilmMAQCsubR through the operation <>= ilmMAQCsubR$pctBrain = myb @ Now mention the resulting ExpressionSet to R and note the difference in the reports. An NA is shown in the report; modify the varMetadata component to repair this. Insert correct information for experimentData and annotation slots as well. The experimentData can be borrowed from the afxsubRMAES, and the annotation is illuminaHumanv1. \item \textbf{Writing a statistical function.} A two-sided $t$ distribution-based $1-\alpha$\% confidence interval for a mean takes the form $\bar{x} \pm k_{1-\alpha/2,n-1} s_x/\sqrt{n}$ where $\bar{x}$ is the sample mean, $s_x$ is the sample standard deviation, $n$ is the number of observations used for the mean, and $k_{1-\alpha/2,n-1}$ is the $1-\alpha/2$ quantile of the $t$ distribution with $n-1$ degrees of freedom. Write a function that takes a vector as input along with a parameter clev = $1-\alpha$ and returns a vector with the lower and upper bounds of the two-sided $1-\alpha$ level confidence interval for the mean. Solution: <>= myCI_t = function(x, clev) { mx = mean(x) n = length(x) sxc = sd(x)/sqrt(n) k = abs(qt((1-clev)/2,df=n-1)) c(-k*sxc+mx, k*sxc+mx) } set.seed(1234) tt = rnorm(100) myCI_t(tt, .95) @ Additional problems. \begin{enumerate} \item Verify that this function computes the correct CI by comparing with calls to t.test for one sample testing. \item create a new function, \verb+myCI_n+, in which you substitute \texttt{qnorm} for the \texttt{qt} call in \verb+myCI_t+. Compare widths of CIs generated under normal- and t-based radius lengths for samples of size 10, 50 and 100. \end{enumerate} \item \textbf{Damping outlier effects: winsorization.} The $k$-winsorization of an $n$-sample $x$ with order statistics $(x_{(1)}, \ldots, x_{(n)})$ is the sample with $2k$ modifications leading to order statistics $(x_{(k)}^{*k}, x_{(k+1)}, \ldots, x_{(n-k)}, x_{(n-k+1)}^{*k})$, where the operator $x^{*k}$ produces $k$ copies of datum $x$. Concretely, suppose the sorted data are 1,2,3,4,5,6,7, then the 2-winsorization is 2,2,3,4,5,6,6 -- we replace the $k$ extreme values at each end with copies of the $k$th order statistic at the low end and the $n-k+1$st at the high end. Write a function \verb+k_wins+, with parameters \verb+x, k+ that $k$-winsorizes a vector \verb+x+. <>= k_wins = function (x, k) { if (length(x) < 2 * k) stop("need k < x/2") n = length(x) sx = sort(x) x[x < sx[k]] = sx[k] x[x > sx[n - k + 1]] = sx[n - k + 1] x } @ \item \textbf{Simulating a test to evaluate its size.} Explain the following: <>= ps = sapply(1:5000, function(x) t.test(rnorm(20))$p.value) mean(ps < 0.05) psw = sapply(1:5000, function(x) t.test(k_wins(rnorm(20),2))$p.value) mean(psw < 0.05) @ \end{enumerate} \end{document}