The SAS System for Data Analysis


Introduction

The letters SAS are an acronym for Statistical Analysis System. SAS was originally developed to meet the statistical needs of users, but it rapidly grew into an all-purpose data management and analysis system. The basic SAS system provides tools for information storage and retrieval, data modification and programming, report writing, statistical analysis, and file handling. SAS is widely used by businesses for managing information systems as well as for statistical and econometric analyses. SAS was selected for Econ 3972 because we feel that it is the best of the statistical packages that have found widespread use in business, industry, and research laboratories.

Although SAS can be used to prepare graphs, charts, arrays, frequency distributions, and summary statistics as well as more advanced analyses such as analysis of variance, multivariate analysis, discriminant analysis, and clustering, we will primarily use the regression program to construct mathematical and statistical models. However, once you have learned the basic structure of SAS, how to enter data and how to use PROC's, it will be a relatively simple matter to learn how to use it in other applications.

1.1 A SAS JOB

A set of commands submitted to the computer to be processed is called a computer job. A SAS job, then, is simply a set of SAS commands which are submitted to a computer and which direct the computer to undertake the specified data analysis. A SAS job consists of four parts: (1) JCL statements; (2) SAS DATA step commands; (3) data lines; and (4) SAS PROC step commands. In order to use SAS it is necessary to understand each of these four components of a SAS job and their relation to each other.

1.1.1 JOB CONTROL LANGUAGE (JCL)

Job Control Language is used to identify oneself to the computer and to communicate to the computer the nature of the jobs that will be run on it. The LOGON procedures involve identifying yourself to the computer and are part of this job control language.

1.1.2 THE DATA STEP

The DATA step is the part of the SAS job in which the data to be analyzed is defined. It is first necessary to specify the general form taken by all data which is analyzed by SAS. The upper half of Figure 1 shows the general form for the data matrix. The observations are denoted by the rows down the left side. An observation is information obtained about the unit of analysis.
Figure 1
THE DATA MATRIX

A General Representation

Observations                            Variables
(Sampling Units)                      (Measurements)
                   1st Variable   2nd Variable   . . .   4th Variable

Observation 1          data           data       . . .       data
Observation 2          data           data       . . .       data
      .                  .              .                      .
      .                  .              .                      .
      .                  .              .                      .
Observation n          data           data       . . .       data
--------------------------------------------------------------------------------------------------------------------
A Specific Example

              SIZE      GDP    PRICE   POPULATION

Ghana           28    15390       48        11680
Kenya           20     5990       53        16400
Lesotho         22      320       23         1340
   .             .        .        .            .
   .             .        .        .            .
   .             .        .        .            .
   .             .        .        .            .
Paraguay        14     4450       20         3070

Let's say we are concerned with a business as it operates through time. The first observation, or row, might then be the behavior of the firm during last month. But what are we interested in about the firm? Let's say we want to explain the sales of the firm. Then one variable which we will want to observe for last month is the firm's sales. Data on sales, expressed in thousands of dollars, appear in the first column.

But sales are related to other variables or dimensions of the firm's operations. One of these might be the firm's expenditures on advertising. Call this variable 2; it is recorded in the second column in thousands of dollars. Another might be the firm's expenditures on salesmen, expressed in thousands of dollars. Call this variable 3. Obviously a large number of factors influence the firm's sales, and we could have as many variables as we can identify factors which we feel influence or are related to the sales of the firm. We add a fourth variable, expenditures on bribes, expressed in thousands of dollars.

Observation two then relates to the behavior of the firm two months back. Here again we collect information on sales, advertising, expenditures on salesmen, and expenditures on bribes and record these in the appropriate columns. If we had 30 observations, the last observation would be the sales, advertising, salesmen expenditures, and bribe expenditures 30 months in the past. Data arranged in this form are referred to as a data matrix. The bottom part of Figure 1 presents a specific data matrix of the same form, in which each observation is a country and the variables are SIZE, GDP, PRICE, and POPULATION.

Defining data to SAS entails specifying names for the variables in the data matrix, indicating their type (numerical or alphabetical) and identifying their positions on the data line. The DATA step also conveys the location of the data matrix to SAS.
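As a minimal sketch of what defining data involves, the job below reads the four rows shown in the bottom part of Figure 1. It assumes that the data lines are placed in the job itself (after DATALINES) rather than in a separate file, and the data set name FIG1 and the variable name COUNTRY are hypothetical names chosen only for illustration. The $ after COUNTRY marks it as a character (alphabetical) variable; the other variables are numeric.

```sas
/* Sketch only: FIG1 and COUNTRY are assumed names, and the data   */
/* are the four rows that are visible in Figure 1.                 */
DATA FIG1;
   INPUT COUNTRY $ SIZE GDP PRICE POP;   /* $ = character variable */
   DATALINES;
Ghana 28 15390 48 11680
Kenya 20 5990 53 16400
Lesotho 22 320 23 1340
Paraguay 14 4450 20 3070
;
PROC PRINT DATA=FIG1;                    /* list the data matrix   */
RUN;
```

The printed listing should have one row per observation and one column per variable, which is the data matrix layout of Figure 1.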

1.1.3 DATA LINES

This term refers to the actual data to be analyzed. In Figure 1 the data lines are the actual numbers appearing in the bottom part of the data matrix. Each line contains measurements for each of the variables for one observation. Each data line has the same format, i.e., the variables all occur in the same order from left to right and occupy the same position(s) on each line. Obviously, each line must conform to the specifications defined in the DATA step. Typically, each line contains only numeric information (numbers only) although character data (alphabetic letters) may also appear.

1.1.4 THE PROC STEP

The term PROC is a shortened form for the word procedure. It specifies what type of analysis is to be performed on the data. We will be primarily concerned with three PROC's, one called CORR, another called REG and another called PLOT. Any number of PROC steps each requesting a different specific analysis may be used in the same SAS job. This enables one to apply many different statistical techniques to the same set of data all in one job.

1.1.5 ORGANIZATION OF THE SAS JOB

The four parts of a SAS job discussed above must be used in the order presented. The JCL instructions must come first, followed by the DATA step, followed by the data lines, and finally the PROC step. In working with SAS in Econ 3972, the data lines will be placed in a file called INPUT DATA A. The DATA step will identify this file and cause SAS to go out and read the actual data at the appropriate time.

The instructions or statements used in the DATA step and the PROC steps are composed using the English-like SAS command language. Each SAS statement begins with a key word and ends with a semicolon. Failure to place a semicolon at the end of these statements is the most common error made by students.
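The ordering of a job and the keyword-and-semicolon rule can be sketched as follows. This is an illustration under stated assumptions, not the course job: the data lines are placed in the job itself instead of in INPUT DATA A, no JCL is shown, and only the SIZE and PRICE columns of the rows visible in Figure 1 are used.

```sas
/* DATA step: begins with the key word DATA, ends with a semicolon */
DATA TEMP;
   INPUT SIZE PRICE;        /* key word INPUT ... semicolon        */
   DATALINES;               /* the data lines follow               */
28 48
20 53
22 23
14 20
;
/* PROC step: comes after the DATA step and the data lines         */
PROC CORR DATA=TEMP;
RUN;
```

Every statement above starts with a key word (DATA, INPUT, DATALINES, PROC, RUN) and ends with a semicolon; the data lines themselves are not statements and carry no semicolons.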


1.1.6 ERRORS

There are two types of errors which may occur when you are using SAS. One is a data error; the other is an error in the SAS language, which is referred to as a syntax error. About the only type of data error for which SAS can check occurs when you specify numeric data and SAS encounters an alphabetic character in that data. In these cases, SAS accepts the data but issues the warning message INVALID DATA. If you see this message, check your printout and correct the error.
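A data error of this kind can be reproduced with a short sketch. The second data line below deliberately contains the letter O in place of a zero in the GDP value; the data set name CHECK is hypothetical, and the data lines are assumed to be placed in the job itself. In current SAS releases the offending value is stored as missing and a note about invalid data for GDP is written to the log, although the exact wording of the message may differ from one release to another.

```sas
/* Sketch only: CHECK is an assumed name. The second GDP value is  */
/* mistyped on purpose (letter O instead of the digit 0).          */
DATA CHECK;
   INPUT SIZE GDP;
   DATALINES;
28 15390
20 59O0
;
PROC PRINT DATA=CHECK;   /* GDP prints as missing (.) in row 2     */
RUN;
```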

Syntax errors occur when the statements submitted to SAS do not adhere to the SAS command language. This occurs when you use invalid key words or do not follow the rules set down for the command language. For example, suppose you misspell a SAS key word. When this happens SAS is unable to understand the command and will put out an error message indicating that it does not recognize the word or phrase. Probably the most common syntax error in using SAS is failing to end a statement with a semicolon.
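The missing-semicolon error can be illustrated with a small sketch. The faulty statements are shown inside a comment so that the job still runs; the data are the SIZE and PRICE columns of the rows visible in Figure 1, placed in the job itself as an assumption. The precise error text depends on the SAS release, so none is quoted here.

```sas
/* Faulty version, kept inside a comment so that it is not run:
      PROC REG DATA=TEMP
         MODEL SIZE = PRICE;
   Without the semicolon after DATA=TEMP, SAS reads MODEL as part
   of the PROC REG statement and cannot interpret it.              */

/* Corrected version: every statement ends with a semicolon.       */
DATA TEMP;
   INPUT SIZE PRICE;
   DATALINES;
28 48
20 53
22 23
14 20
;
PROC REG DATA=TEMP;
   MODEL SIZE = PRICE;
RUN;
QUIT;
```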


1.2 THE REGRESSION ANALYSIS JOB

The first step in regression analysis is to identify the system whose behavior you wish to predict or understand. Next it is necessary to measure the behavior which you wish to predict or explain. For example, if you wish to predict the sales of a firm, then measurements of sales are obtained from the firm's records. This is pretty straightforward. But if you wish to measure the innovative ability of the firm, the problem of measurement becomes more difficult. Measures of innovative ability which have been used by researchers in the past are dollar expenditures on research and development, the number of new products brought onto the market over some period of time, or the dollar sales of these new products over that period.

The variable you wish to explain or predict is called the dependent variable. Next it is necessary to develop a hypothesis about the variables or factors which are related to this dependent variable. If the dependent variable is government size, then one would expect the price of government services, GDP, and population to have an effect on, or be related to, size. Obviously the student can think of other factors, such as urbanization, which may also be related to the size of government.

It is then necessary to express these hypothesized relationships in the form of a mathematical model. If it is felt that the relationship between the variables is linear, and if we let y stand for size, x1 for price, x2 for GDP, x3 for population, and e for random variation in the system, then the mathematical model can be expressed as

y = a + b1x1 + b2x2 + b3x3 + e

The constants a, b1, b2, and b3 in this model (the alpha and the betas) are referred to as its parameters. To understand the relationship between the variables we estimate values for these parameters and evaluate their significance.

In discussing the commands and procedures used to run the regression using SAS, you will learn many of the basic procedures necessary to undertake other types of analysis which are available in SAS. The basic steps of presenting JCL, the data step, and the PROC's are the same steps used in running these other programs.

We begin by giving the student a SAS job which will run without modification and produce a regression output once the student has typed his or her data in a file called INPUT DATA A. This output should be brought to class by the student and will be referred to in class discussion. Students will learn how to program SAS by learning how to change the initial job to obtain relevant output for other hypothesized models. Figure 2 contains the SAS job which you will employ on your initial run. After having logged onto the computer (see Section 2.1), you must put a copy of this SAS job in your file called SASPRG SAS A. Rather than have each student in the course type this SAS job into his or her file, we have placed a copy of it in a file which can be accessed by all students in the course. In order to obtain a copy of this job you will use the COPY command (see Section 2.1).

In order that you can learn how to write your own SAS jobs we will now go through the job described in Figure 2 and discuss each of the commands and indicate alternative commands that may be used in subsequent analysis.

1.2.1 JCL INSTRUCTIONS

Most of the JCL relating to SAS jobs as it is used in Econ 3972 is hidden or transparent to the student, and you need not concern yourself with it. However, you will use JCL in logging onto the computer. This is also described in Section 2.1.

The first line, OPTION LS=74; specifies the maximum number of columns that will be used in printing the SAS output. The terminals you will be using can all display lines that are at least 74 characters wide. As a result, the SAS output (with the exception of the graphs) can be analyzed or reviewed at a terminal. Instructions for obtaining printed output are discussed later in the Guide.

Figure 2
A SAS JOB
                                                       LINE
OPTION LS=74;                                           01
CMS FILEDEF SASPNT DISK INPUT DATA A;                   02
DATA TEMP;                                              03
INFILE SASPNT;                                          04
INPUT GDP 16-22 SIZE 23-30 PRICE 31-35 POP 36-45;       05
PGDP=GDP/POP;                                           06
PROC CORR DATA=TEMP;                                    07
PROC REG DATA=TEMP;                                     08
MODEL SIZE=PRICE PGDP POP/R DW;                         09
OUTPUT OUT=SASDTLIB.RESULT P=PREDY R=RSDULS;            10
DATA DATA.RESULT;                                       11
SET SASDTLIB.RESULT;                                    12
RSSQ=RSDULS**2;                                         13
PROC REG DATA=DATA.RESULT;                              14
MODEL RSSQ=PREDY;                                       15
RUN;                                                    16
OPTION LS=130;                                          17
PROC PLOT;                                              18
PLOT RSDULS*SIZE='+'/VREF=0.0;                          19
PROC PLOT;                                              20
PLOT RSDULS*PRICE='+'/VREF=0.0;                         21
PROC PLOT;                                              22
PLOT RSDULS*GDP='+'/VREF=0.0;                           23
PROC PLOT;                                              24
PLOT RSDULS*POP='+'/VREF=0.0;                           25


The second line (CMS FILEDEF SASPNT DISK INPUT DATA A;) defines the data file which will be used by the SAS job. The IBM system on which you will be working runs under an operating system referred to as CMS. FILEDEF is simply a shortened term for file definition. The next term, SASPNT, is merely a name which can be used in the SAS program to point to the file INPUT DATA A, which resides on a disk. The file SASDT which stands for "SAS DATA" is part of the operating system of the computer. As such it cannot be directly used in a SAS job. The file definition statement then tells the computer that when it encounters the term SASPNT it should go out and read the file INPUT DATA A.


1.2.2 THE DATA STEP

When SAS encounters the key word DATA as the first word in a command, it knows that what follows will be a series of statements which define the data to be used in the analysis. Following the key word DATA is the word TEMP. This is short for temporary, and it is the name of a SAS file in which the data from INPUT DATA A is stored. It is referred to as a temporary file because each time the SAS job is run, whatever data is in the INPUT DATA A file is read into it. In this way the same SAS job, if appropriate, can be run on different data. Note the semicolon at the end of this command, as on all of the commands in this job.

When SAS encounters the key word INFILE in line 4, it knows that the word that follows defines the location of the file that should be read into TEMP. In this case it encounters the word SASPNT which, as we indicated earlier, points to the file INPUT DATA A.

When SAS encounters the key word INPUT in line 5, it knows that what follows defines the data matrix. Referring back to Figure 1, the variables used are SIZE (the government-GDP ratio), GDP, PRICE, and POPULATION. The data, as described in Chapter 2, Section 2.1.4.1, is typed into INPUT DATA A so that GDP is in columns 16-22, size is in columns 23-30, price is in columns 31-35, and population is in columns 36-45. Looking again at line 5, we see that immediately following the name which we have assigned to each of the variables in the data set are the columns in which that variable occurs. Line 6, PGDP=GDP/POP; then creates a new variable, per capita GDP, from two of the variables just read.
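The column specifications can be tried out with the following sketch. It keeps the INPUT statement and the PGDP statement of Figure 2, but as an assumption it places the data lines in the job itself instead of reading INPUT DATA A through INFILE. The four data lines use the values visible in Figure 1, and columns 1-15 are assumed to hold a country label that the INPUT statement does not read.

```sas
/* Sketch only: in-stream data lines stand in for INPUT DATA A.    */
/* Columns 1-15 hold an (assumed) country label that is not read.  */
DATA TEMP;
   INPUT GDP 16-22 SIZE 23-30 PRICE 31-35 POP 36-45;
   PGDP = GDP / POP;        /* per capita GDP, as in line 6        */
   DATALINES;
Ghana            15390      28   48     11680
Kenya             5990      20   53     16400
Lesotho            320      22   23      1340
Paraguay          4450      14   20      3070
;
PROC PRINT DATA=TEMP;
RUN;
```

Because the positions are fixed, a value typed outside its columns would be read into the wrong variable or not at all, which is why each data line must follow the same layout.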

1.2.3 THE PROC STEP

When SAS encounters the key word PROC in a SAS job, it immediately knows that the next term defines the specific statistical program that is to be used. In our case, we are first using the correlation program, which is referred to as CORR, to generate the correlation matrix for the variables in the data set. The DATA=TEMP defines what data is to be used by the correlation program.

The next line (8) also starts with the key word PROC. But this time we want the regression program called REG to be run. Again the data must be defined (DATA=TEMP). The next line specifies the regression model to be built and what is to be done with the output.

1.2.4 THE MODEL

The linear regression program, REG, assumes that the relationship between the variables is linear. The general mathematical model implicit in REG is

yi = a + b1 x1i + b2 x2i + ... + bm xmi + ei

where m is the number of independent variables and i = 1, ..., n indexes the n observations. In our earlier example, the hypothesized population model is

yi = a + b1 x1i + b2 x2i + b3 x3i + ei

where the following correspondence rules relate symbols in the mathematical system to those in the real world and to those in the SAS system.


Math                                          SAS
System     Real World Concept                 System

y          government size                    SIZE
x1         price of government services       PRICE
x2         gross domestic product             GDP
x3         population                         POP
e          random effect

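Return now to the MODEL statement (line 9). The hypothesized regression model is specified for SAS in that statement by:

SIZE = PRICE GDP POP

Note that the MODEL statement as it appears in Figure 2 enters PGDP, the per capita GDP computed in line 6, in place of GDP.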

The regression program using sample data calculates estimates of the parameters a, b1, b2, b3. For example, let us say that the following estimated values denoted by a, b1, b2, and b3 were obtained: a = 183.7, b1 = 26.4, b2 = 8.2, and b3 = .23. The estimated model is:

ŷ = 183.7 + 26.4x1 + 8.2x2 + .23x3

The program REG has a standard output which will be discussed later. It also offers special options such as the calculation of residuals and the Durbin-Watson statistic. The / after POP followed by R and DW instructs it to calculate the residual for each observation (yi - ŷi) and the Durbin-Watson statistic.
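The effect of the R and DW options can be seen in a small sketch. Two simplifications are assumed here: the data are only the four rows visible in Figure 1, placed in the job itself, and the model has a single independent variable (PRICE), because four observations are too few to estimate the three-variable model. It is meant to show the form of the statements, not to give meaningful estimates.

```sas
/* Sketch only: four rows from Figure 1 and one regressor are      */
/* assumptions made to keep the example small.                     */
DATA TEMP;
   INPUT SIZE GDP PRICE POP;
   DATALINES;
28 15390 48 11680
20 5990 53 16400
22 320 23 1340
14 4450 20 3070
;
PROC REG DATA=TEMP;
   MODEL SIZE = PRICE / R DW;  /* R: residuals, DW: Durbin-Watson  */
RUN;
QUIT;
```

Options always follow the slash at the end of the MODEL statement; leaving out / R DW gives only the standard output.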

1.2.5 OUTPUT

Statement number 10 specifies what is to be done with the resulting output from the regression program. In this case, OUT=SASDTLIB.RESULT simply says that the output is to be placed in a SAS file called SASDTLIB.RESULT. Following this are two options which specify the names to be given to the columns holding the predicted value of y and the residuals. P=PREDY assigns the name PREDY to the column with the predicted value of y. Similarly, R=RSDULS assigns the name RSDULS to the column with the residuals.

1.2.6 HETEROSCEDASTICITY TEST

Statements 11-15 transform the data in SASDTLIB.RESULT and instruct the computer to run the regression required to test for heteroscedasticity. Statements 11-13 are similar to "THE DATA STEP" discussed above (1.2.2), except that now the residuals (RSDULS) are squared in statement 13 to form the new variable, squared residuals (RSSQ). Statements 14 and 15 instruct the computer to regress the squared residuals on the predicted value of government size.
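The sequence of saving the residuals, squaring them, and running the second regression can be sketched as below. As assumptions, the data are the four rows visible in Figure 1 placed in the job itself, the first regression uses PRICE alone, and the data set names RESULT and RESULT2 are hypothetical temporary names used in place of SASDTLIB.RESULT and DATA.RESULT.

```sas
/* Sketch only: RESULT and RESULT2 are assumed names, and the      */
/* one-regressor model is a simplification for four observations.  */
DATA TEMP;
   INPUT SIZE GDP PRICE POP;
   DATALINES;
28 15390 48 11680
20 5990 53 16400
22 320 23 1340
14 4450 20 3070
;
PROC REG DATA=TEMP;
   MODEL SIZE = PRICE;
   OUTPUT OUT=RESULT P=PREDY R=RSDULS;  /* save fitted values      */
RUN;                                    /* and residuals           */
QUIT;
DATA RESULT2;
   SET RESULT;              /* start from the saved output         */
   RSSQ = RSDULS**2;        /* squared residuals                   */
RUN;
PROC REG DATA=RESULT2;
   MODEL RSSQ = PREDY;      /* regress them on the predicted value */
RUN;
QUIT;
```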

1.2.7 PLOT

In order to use the full page in plotting results, a new option for LS is specified as LS=130. The printed output then becomes 130 columns wide. The statement PROC PLOT specifies that the plotting routine will be used next. To plot variables, use the key word PLOT followed by the specific variables which are to be used. Since these graphs will be used primarily to look at problems of heteroscedasticity and to spot possible specification errors, residuals are plotted against the dependent and independent variables. Thus line 19, PLOT RSDULS*SIZE='+'/VREF=0.0; produces a graph with residuals on the vertical axis and government size on the horizontal axis, and with the symbol + used to denote the location of each observation in the plot. The /VREF=0.0 instructs the plot routine to print a straight line across the graph at the zero point on the residual axis.
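A residual plot of this kind can be sketched as follows. The same assumptions apply as in the earlier sketches: the four rows visible in Figure 1 are placed in the job itself, the regression uses PRICE alone, and RESULT is a hypothetical temporary data set name.

```sas
/* Sketch only: RESULT is an assumed name; data and model are      */
/* simplified to the four rows visible in Figure 1.                */
DATA TEMP;
   INPUT SIZE GDP PRICE POP;
   DATALINES;
28 15390 48 11680
20 5990 53 16400
22 320 23 1340
14 4450 20 3070
;
PROC REG DATA=TEMP;
   MODEL SIZE = PRICE;
   OUTPUT OUT=RESULT P=PREDY R=RSDULS;
RUN;
QUIT;
OPTIONS LS=130;                      /* widen the output page      */
PROC PLOT DATA=RESULT;
   PLOT RSDULS*SIZE='+' / VREF=0;    /* vertical*horizontal        */
RUN;
QUIT;
```

The variable named before the asterisk goes on the vertical axis and the one after it on the horizontal axis, so swapping them would turn the plot on its side.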

Since plots are expensive and you have limited computer funds, do not obtain plots on Modeling Problem 1 after your first successful run. To stop the computer from printing out plots, remove all the PROC PLOT and PLOT statements (use the DELETE prefix command to remove lines 18 through 25).