% % NOTE -- ONLY EDIT THE .Rnw FILE!!! The .tex file is % likely to be overwritten. % \documentclass[12pt]{article} \usepackage{amsmath,pstricks} \usepackage[authoryear,round]{natbib} \usepackage{hyperref} \textwidth=6.2in \textheight=8.5in %\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in \newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \textwidth=6.2in \bibliographystyle{plainnat} \begin{document} %\setkeys{Gin}{width=0.55\textwidth} \title{Working with GEOquery -- retrievals from NCBI/GEO} \author{VJ Carey} \maketitle \tableofcontents \section{Introduction} GSE3982 is the name of an experiment data archive in GEO, a 'series' in their nomenclature. We can retrieve it using getGEO; the result, if you use this directly, is a list. I have provided the object gg.rda to which the getGEO output is serialized. <>= library(Biobase) load("gg.rda") names(gg) @ The elements of this list are ExpressionSet instances: <>= class(gg[[1]]) @ To simplify references, let's copy the object to a new name: <>= pac1 = gg[[1]] @ The sample annotation supplied with this GEO series has a variable called description: {\small <>= as.character(pac1$description) @ } That will be hard to analyze. We've broken down the description and made a space-delimited file: \begin{verbatim} id purity celltype sort stim stim.timeh source diff.by stim.by mature markers T.type V2 97 eos anti-CD16 FALSE NA PB NA NA NA NA NA V3 97 eos anti-CD16 TRUE 2 PB NA PMA NA NA NA V4 95 mast NA FALSE NA cord IL-10/IL-6/stemCellFact NA NA NA NA V5 95 mast NA TRUE 2 cord IL-10/IL-6/stemCellFact IgE NA NA NA V6 NA dendritic NA TRUE 48 mono IL-4/GM-CSF LPS TRUE NA NA V7 95 mast NA TRUE 2 cord IL-10/IL-6/stemCellFact IgE NA NA NA V8 NA dendritic NA FALSE NA PB.mono IL-4/GM-CSF NA FALSE NA NA V9 NA dendritic NA FALSE NA PB.mono IL-4/GM-CSF NA FALSE NA NA V10 NA dendritic NA TRUE 48 PB.mono IL-4/GM-CSF LPS FALSE NA NA V11 NA macrophage NA FALSE 264 PB.CD14+.mono GM-CSF NA TRUE NA NA V12 NA macrophage NA FALSE 264 PB.CD14+.mono GM-CSF NA TRUE NA NA V13 NA macrophage NA TRUE 264 PB.CD14+.mono GM-CSF LPS TRUE NA NA V14 NA macrophage NA TRUE 264 PB.CD14+.mono GM-CSF LPS TRUE NA NA V15 95 mast NA FALSE NA cord IL-10/IL-6/stemCellFact NA NA NA NA V16 97 neutrophil NA FALSE NA PB NA NA NA CD16+:CD62L.hi:CCR3- NA V17 97 neutrophil NA FALSE NA PB NA NA NA CD16+:CD62L.hi:CCR3- NA V18 96 B NA FALSE NA PB NA NA NA CD19+ NA V19 96 B NA FALSE NA PB NA NA NA CD19+ NA V20 96 neutrophil NA TRUE 1 PB NA LPS NA CD16+ NA V21 96 basophil NA FALSE NA PB NA NA NA CCR3+:low.side.scatter NA V22 95 T NA FALSE NA NA NA NA NA CD4+CD45RO+CCR7- effector:memory V23 95 T NA FALSE NA NA NA NA NA CD4+CD45RO+CCR7- effector:memory V24 96 NK NA FALSE NA PB NA NA NA CD16+CD56+ NA V25 96 NK NA FALSE NA PB NA NA NA CD16+CD56+ NA V26 96 basophil NA FALSE NA PB NA NA NA CCR3+:low.side.scatter NA V27 95 T NA FALSE NA NA NA NA NA CD4+CD45RO+CCR7+ central:memory V28 95 T NA FALSE NA NA NA NA NA CD4+CD45RO+CCR7+ central:memory V29 95 T NA FALSE NA NA neonatalBlood NA NA CD4+CD45RO+CCR7+ helper:typeI V30 95 T NA FALSE NA NA neonatalBlood NA NA CD4+CD45RO+CCR7+ helper:typeI V31 95 T NA FALSE NA NA neonatalBlood NA NA CD4+CD45RO+CCR7+ helper:typeII V32 95 T NA FALSE NA NA neonatalBlood NA NA CD4+CD45RO+CCR7+ helper:typeII V33 97 eos anti-CD16 FALSE NA PB NA NA NA NA NA \end{verbatim} To bind this information into the ExpressionSet, we need to do some work. First, we read the data into a data.frame. Be sure it is synchronized to the samples in the ExpressionSet. <>= df = read.delim("3982.ph.txt", sep=" ", h=TRUE) df[1:4,] names(df) @ Now we combine this with the existing phenoData element, and make a new AnnotatedDataFrame: <>= nd = cbind(pData(pac1), df) adf = new("AnnotatedDataFrame", nd) @ We should add 'varMetadata' to clarify the interpretations of variables. But this will be skipped for now. Let's bind the new phenoData: <>= phenoData(pac1) = adf pac1 @ We should also get experimentData (MIAME schema) using pmid2MIAME("16474395") @ \section{Analysis} \subsection{Question} Consider this figure from the Jeffrey paper, PMID 16474395: \includegraphics{pac1a} How would you (approximately) reproduce this display? \subsection{Answer} We assume you have created an ExpressionSet like pac1 shown above, using getGEO or some other means. Here is a text string with the 'product' symbols (or ad hoc translations) that are known to map to hgu133a probe sets: <>= rstr = paste("PTX3 IL5 CXCL5 AREG CXCL1 CXCL3 IL6 DUSP2 IL8 F3 CCL3 CCL4", "TNFAIP6 CCL5 CCL1 CCL20") jefp = strsplit(rstr, " ")[[1]] jefp @ We will map these to hgu133a probe set identifiers: <>= library(hgu133a.db) rmap = revmap(hgu133aSYMBOL) ps = mget(jefp, rmap, ifnotfound=NA) ps ups = unlist(ps) ups @ Let's reduce the pac1 ExpressionSet to these genes and generate a heatmap. To label the samples, we will paste two of the factors and use them as attributes of a log-scale matrix: <>= kp = log2(exprs(pac1))[ups, ] rownames(kp) = names(ups) ac = as.character colnames(kp) = paste(pac1$celltype, ifelse(pac1$celltype=="T", ac(pac1$T.type), ac(pac1$stim.by)), sep=".") <>= heatmap( kp, margin = c(12, 5)) @ To retain the row ordering presented by Jeffrey et al, we can suppress row clustering. We can also emulate their color scheme by reversing the default heat.colors palette: <>= heatmap( kp[nrow(kp):1, ], margin = c(9, 5), Rowv=NA, col=rev(heat.colors(10))) @ \section{Exercises} \begin{enumerate} \item Discuss the compatibility of the derived heatmap with Fig 1a of Jeffrey et al. \item Sketch how to approximately reproduce Figure 1b of the paper. \item The IL-8 related probesets are numerous, and some seem to have high values for B cells. How does the clustering change if we only use a selection of the IL-8 probe sets? \end{enumerate} \end{document}