Next: Appendix II-Frequently Asked Questions Up: Introduction to Splus at Previous: Splus frequently asked question

Appendix I--A Comparison of SAS and S

Terry M. Therneau, Ph.D.
Section of Biostatistics, Mayo Clinic

The following article was written several years ago. Some of the material describing limitations of S is out of date but the main points about the differences in data analysis capabilities and philosophies are still valid.

Just to make this discussion seem more legitimate, let me start by stating my own credentials. I am the author of five SAS procedures, two of which (COXREGR and SURVTEST) have been widely distributed as part of the SAS supplemental library. I have been a user of the SAS language as a data analysis tool for over 12 years. I have used S significantly for 6 years, for both data analysis and simulation work, and have authored seven S functions. (An S function is roughly equivalent to a SAS procedure, i.e., an extension of the package).

The Section of Biostatistics at Mayo Clinic is an extensive user of SAS; it has been by far our key analysis tool. Four of us who are associated with the Mayo Comprehensive Cancer Center also have desktop access to S using a SUN workstation. The four of us have found the combination of both packages to be far superior to either one alone. S is weakest in those areas where SAS is most capable, and vice versa.

SAS has a very simple data model, the rectangular table. Each row is an observation, each column a variable which can be either numeric or character. A table (or DATA Set) may not be modified once it is created, but a very powerful facility is present to create a new table from an existing table, from the join (MERGE) of one or more tables, or from a sequential input file. New variables are computed and/or old ones dropped during this creation step. A data set may be used as input to a PROCEDURE such as regression, frequency, or listing. SAS is very much oriented toward printed output. Most of the procedures can return none or only some of the results of their computations to the SAS job as new tables.

S is oriented in the opposite direction: each function is designed to return its results to S as a data object, and only a very few functions produce "nice" printed output. To accommodate this, the S data model is far richer than that of SAS. An S data object may be a character string (of variable length, as opposed to SAS's fixed lengths), a numeric value, a logical value, a vector or matrix of one of these, or a list. A list is a concatenation of objects, which may be of any type, including other lists. The coxreg function, for instance, returns a list with components beta (a px1 vector), var (a pxp matrix), loglik (a 2x1 vector), and scoretest (a scalar). S can read only simple input data files, has no join facility, and has only minor tools for formatting its output.
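The practical effect of returning results as a data object is that later steps can compute on them directly. The following is a rough sketch in Python rather than S, with a dict standing in for an S list. The component names follow the coxreg example above, but every number is invented for illustration, and the helper wald_z is hypothetical, not part of either package.

```python
import math

# Hypothetical fit result with p = 2 covariates, mirroring the components
# named in the text. All numbers are invented for illustration.
fit = {
    "beta": [0.50, -1.20],        # p x 1 coefficient vector
    "var": [[0.04, 0.00],         # p x p variance matrix
            [0.00, 0.09]],
    "loglik": [-120.3, -112.8],   # 2 x 1 vector
    "scoretest": 14.1,            # scalar
}


def wald_z(fit):
    """Return beta[i] / sqrt(var[i][i]) for each coefficient."""
    return [b / math.sqrt(fit["var"][i][i]) for i, b in enumerate(fit["beta"])]


# The returned object feeds straight into the next computation;
# nothing has to be parsed back out of a printed listing.
z = wald_z(fit)
print([round(v, 2) for v in z])
```

The point is only the shape of the workflow: one function returns a structured object, and another consumes pieces of it by name.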

In our own operation, the first step of nearly any data analysis project is to create the data set, possibly from multiple input files, create new variables and recodings, and generate multiple listings and frequency tables. SAS is clearly the tool for this job. Later on, we may want to do multiple 2x2 tables, and then display the collection of chi-square statistics achieved by these tables on a q-q plot versus the chi-square distribution (an excellent way to control for multiple comparisons). This is very simple to do in S, but cannot be done at all in SAS since the FREQ routine will only print the chi-square statistic, but will not return it. (Of course, one can always redirect the output and then parse it in again. This is a major pain, but in defense of SAS their input statements are good enough to actually do it.)
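The q-q display described above needs only the chi-square statistic from each table and the matching quantiles of the reference distribution. A minimal sketch of that computation follows, written in Python rather than S and producing the plotting coordinates instead of a plot. The function names, the table counts, and the choice of plotting positions (i + 0.5)/m are assumptions made for illustration. The statistic is the uncorrected Pearson chi-square, and the sketch does not guard against tables with a zero margin.

```python
from statistics import NormalDist


def chisq_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for the
    2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def chisq1_quantile(p):
    """Quantile of the chi-square distribution with 1 degree of freedom,
    using the fact that it is the square of a standard normal."""
    return NormalDist().inv_cdf((1 + p) / 2) ** 2


def qq_pairs(tables):
    """Pair expected chi-square(1) quantiles with the sorted observed statistics."""
    stats = sorted(chisq_2x2(*t) for t in tables)
    m = len(stats)
    expected = [chisq1_quantile((i + 0.5) / m) for i in range(m)]
    return list(zip(expected, stats))


# Invented counts for four 2x2 tables, purely for illustration.
tables = [(12, 8, 9, 11), (20, 5, 10, 15), (7, 13, 12, 8), (15, 15, 14, 16)]
for expected, observed in qq_pairs(tables):
    print(f"{expected:6.3f}  {observed:6.3f}")
```

Points that lie well above the line observed = expected are the tables whose statistics are larger than chance alone would suggest among this many comparisons.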

Another major difference between S and SAS lies in their graphical abilities. In SAS a graph is generated from a single data set by a single procedure call. In S a graph can be created in layers: the points, lines, and text routines are all separate functions which add to the current plot, and each may refer to a separate data set. As a consequence, SAS graphics functions have a plethora of options, and if you would like a slightly specialized graph it is highly likely that none of these options gives quite the right thing. The S functions have relatively few options.
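The layered model can be pictured as a plot that accumulates independent layers, each carrying its own data. The toy sketch below is in Python rather than S and draws nothing. The Plot class and its methods are hypothetical stand-ins named after the points, lines, and text routines mentioned above, and all data values are invented.

```python
class Plot:
    """Toy stand-in for a graphics device: each call adds one layer."""

    def __init__(self):
        self.layers = []

    def points(self, x, y):
        self.layers.append(("points", list(zip(x, y))))

    def lines(self, x, y):
        self.layers.append(("lines", list(zip(x, y))))

    def text(self, x, y, label):
        self.layers.append(("text", [(x, y, label)]))


# Observed data: 5 points (invented values).
obs_x = [1, 2, 3, 4, 5]
obs_y = [2.1, 3.9, 6.2, 8.1, 9.8]

# A fitted curve from some other analysis, on a finer grid of 9 points.
grid_x = [1 + 0.5 * i for i in range(9)]
grid_y = [2.0 * x for x in grid_x]

plot = Plot()
plot.points(obs_x, obs_y)            # layer 1: the raw data
plot.lines(grid_x, grid_y)           # layer 2: a curve from a different data set
plot.text(4.5, 3.0, "fit: y = 2x")   # layer 3: an annotation

for kind, data in plot.layers:
    print(kind, len(data))
```

Because each layer holds its own coordinates, the curve does not need the same number of points as the data it overlays, which a single-data-set, single-call design makes awkward.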

SAS supports many more output devices, and multiple fonts. S supports only those fonts native to the device (but with Postscript at least, this is not a problem).

A completely new graphic function (such as ours for annotated Kaplan-Meier curves) can usually be created in S in a few hours using the macro facility. It is very easy in S to add a fitted function to a data plot, even though that fit came from some other procedure (Poisson regression, say) and the number of points in the fitted curve is not the same as the number in the plotted data set. We have found S graphics uniformly easier to use than SAS graphics, extremely so when a publication quality result is desired. In fairness, this last statement may also reflect the KINDS of plots our group prefers, e.g., we're not big on pie charts, do lots of smoothed scatter-plots, etc.

Both S and SAS allow locally written extensions to the language, which is a very important aspect. How difficult is this process? In either case, of course, the greatest time effort is likely to be development of the computing algorithm itself, say for a new and unusual factor analysis rotation scheme. Assuming that a working subroutine is in hand, however, an interface to the package would take me 1-4 hours in new S and 2-4 weeks in SAS. (Multiply by some constant k for a complex function). Up to half the SAS time may be spent writing print statements; getting things to look right on the paper can be a lot more difficult than the "return" statement S requires. Creating a returned data set is the most difficult part of a SAS extension to program (at least the first time), which discourages most user-written procedures from returning anything but printed output. The time required to learn the interface techniques is also in the relative proportions given above.

One major difference to be mentioned is more chance than design, and that is that there is a surprisingly small overlap in the list of statistical functions provided. SAS is strong in classical linear models, such as ANOVA, canonical correlation, and factor analysis. S is strong in robust techniques, such as bounded influence regression and scatterplot smoothing. [Note: since this was written, the set of modeling techniques available in S has vastly expanded.] This reflects, I am sure, the personal tastes of the authors of the two packages. The consequence is that the two together give a very well stocked toolbox to the statistical practitioner.

In terms of similarities, both SAS and S are major software packages--not the toy statistical packages currently propagating on PCs. There is a substantial learning curve for either product, >6 months, though SAS has better developed teaching aids. Both have a good user interface. Both have a powerful macro language, with new S >> SAS.

As a final note for those who are familiar with the SAS Interactive Matrix Language product: IML is similar to S, but much smaller. IML includes a subset of the matrix operations that S does, and includes a very small subset of the S functions. But the "flavor" of an S program, particularly one doing data manipulations such as subsetting or recoding, is very similar to IML.

If forced to choose only one of the two, we would have to take SAS. Medical data is always "messy", and SAS's data manipulation facilities are absolutely essential. But it is our intention to have both.

Oh yes--cost. SAS costs about 3X S, and charges an annual renewal fee of 0.5 * initial cost. Cost per line of code is probably about the same. SAS is a big corporation, with an earned reputation for customer support, and a staggering amount of manuals and documentation (I sometimes wonder if their printing business earns more than the software...). S can be obtained from Bell cheap but with no support, or from resellers. We get S from Statistical Sciences, Inc.; they add some functionality but I personally think that support alone is worth every dime. I don't want to beta test software any more.

Standard disclaimers apply. The opinions are only my own.
Terry M. Therneau
therneau@mayo.edu


Elizabeth Brown
1998-11-09