StatPac for Windows User's Guide
Advanced Multivariate Statistics
The Advanced Analyses module adds multivariate procedures that are not available in the basic StatPac for Windows package. These commands may be used in a procedure when the Advanced Analyses module has been installed.
The REGRESS command may be used to perform ordinary least squares regression and curve fitting. Ordinary least squares regression (also called simple regression) is used to examine the relationship between one independent and one dependent variable. After performing an analysis, the regression statistics can be used to predict the dependent variable when the independent variable is known.
People use regression on an intuitive level every day. In business, a well-dressed man is thought to be financially successful. A mother knows that more sugar in her children's diet results in higher energy levels. The ease of waking up in the morning often depends on how late you went to bed the night before. Quantitative regression adds precision by developing a mathematical formula that can be used for predictive purposes.
The syntax of the command to run a simple regression analysis is:
REGRESS <Dependent variable> <Independent variable>
- or -
REGRESS <Dependent variable list> WITH <Independent variable list>
For example, a medical researcher might want to use body weight (V1=WEIGHT) to predict the most appropriate dose for a new drug (V2=DOSE). The command to run the regression could be specified in several ways:
REGRESS DOSE WITH WEIGHT
REGRESS DOSE WEIGHT
RE V2 WITH V1 (Note: REGRESS may be abbreviated as RE)
RE V2 V1
Notice that the keyword WITH is an optional part of the syntax. However, if you specify a variable list for either the dependent or independent variable, the use of the WITH keyword is mandatory. When a variable list is specified, a separate regression will be performed for each combination of dependent and independent variables.
The purpose of running the regression is to find a formula that fits the relationship between the two variables. Then you can use that formula to predict values for the dependent variable when only the independent variable is known. The general formula for the linear regression equation is:
y = a + bx
where:
x is the independent variable
y is the dependent variable
a is the intercept
b is the slope
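The intercept and slope of the equation above can be computed directly from the data. The following is a minimal Python sketch of the textbook least squares formulas, not StatPac's own code; the function name `fit_line` and the weight and dose values are invented for illustration.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    # Intercept: the line always passes through the point of means.
    a = mean_y - b * mean_x
    return a, b


# Hypothetical body weights (x) and doses (y), invented for this example.
weights = [50.0, 60.0, 70.0, 80.0, 90.0]
doses = [26.0, 31.0, 36.0, 41.0, 46.0]

a, b = fit_line(weights, doses)
predicted_dose = a + b * 75.0  # predict y for a known x
```

Once `a` and `b` are known, a prediction for any new value of the independent variable is a single evaluation of y = a + bx.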
Curve Fitting
Frequently, the relationship between the independent and dependent variable is not linear. Classical examples include the traditional sales curve, learning curve and population growth curve. In each case, linear (straight-line) regression would present a distorted picture of the actual relationship. Several classical non-linear curves are built into StatPac and you can simply ask the program to find the best one.
Transformations are used to find an equation that will make the relationship between the variables linear. A linear least squares regression can then be performed on the transformed data. The process of finding the best transformation is known as "curve fitting". Basically, it is an attempt to find a transformation that can be made to the dependent and/or independent variable so that a least squares regression will fit the transformed data. This can be expressed by the equation:
(transformed y) = intercept + slope * (transformed x)
Notice the similarity to the least squares equation. The difference is that we are transforming the independent variable and predicting a transformed dependent variable. To solve for y, apply the inverse of the y transformation (the "untransformation") to both sides of the equation:
y = Untransformation of (intercept + slope * (transformed x))
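As an illustration of transforming, fitting, and then untransforming, the sketch below fits an S-shaped curve by regressing the log of y on the reciprocal of x and then applying the exponential to recover y. It assumes natural logarithms and uses invented data; the function names are hypothetical and this is not StatPac's implementation.

```python
import math


def fit_line(xs, ys):
    """Return (intercept, slope) of the least squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_s_curve(xs, ys):
    """Fit ln(y) = intercept + slope * (1/x)."""
    transformed_x = [1.0 / x for x in xs]      # reciprocal transformation
    transformed_y = [math.log(y) for y in ys]  # log transformation (natural log assumed)
    return fit_line(transformed_x, transformed_y)


def predict_s_curve(intercept, slope, x):
    """Untransform: exp() is the inverse of the log applied to y."""
    return math.exp(intercept + slope * (1.0 / x))


# Invented data that follow y = exp(3 - 2/x) exactly.
xs = [1.0, 2.0, 4.0, 5.0, 10.0]
ys = [math.exp(3.0 - 2.0 / x) for x in xs]

intercept, slope = fit_s_curve(xs, ys)
y_at_8 = predict_s_curve(intercept, slope, 8.0)
```

The regression itself is still linear; only the data passed into it and the final prediction step change.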
In addition to the built-in transformations, any non-linear relationship that can be expressed as a mathematical formula can be explored with the COMPUTE statement. It is possible to transform both the dependent and independent variables with the COMPUTE statement. The transformations that are built into StatPac are known as Box-Cox transformations. They are:
For example, to apply a square root transformation to the independent variable and no transformation to the dependent variable, the options statement would be:
OPTIONS TX=G TY=H
The following option statement would try to fit your data to a classical S-Curve. It says to apply a reciprocal transformation to the independent variable and a log transformation to the dependent variable:
OPTIONS TX=B TY=E
The program also contains an automatic feature to search for the best transformation. When TX or TY is set to automatic, the program will select the transformation that produces the highest r-squared. To get a complete table of all combinations of the transformations, set both TX and TY to automatic.
OPTIONS TX=A TY=A
The result is a table of all possible combinations of transformations and their r-squared statistics:
Example of a Transformation Table
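The automatic search can be pictured as a loop over every pairing of x and y transformations that keeps the r-squared of each. The sketch below is a simplified illustration with four transformations chosen for the example; the dictionary keys are descriptive labels, not StatPac option codes, and the data are invented.

```python
import math

# Illustrative transformations only; the labels are not StatPac option codes.
TRANSFORMS = {
    "none": lambda v: v,
    "reciprocal": lambda v: 1.0 / v,
    "log": math.log,
    "sqrt": math.sqrt,
}


def r_squared(xs, ys):
    """Square of the Pearson correlation between xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)


def transformation_table(xs, ys):
    """Map each (x transform, y transform) pair to its r-squared."""
    table = {}
    for x_name, fx in TRANSFORMS.items():
        for y_name, fy in TRANSFORMS.items():
            tx = [fx(x) for x in xs]
            ty = [fy(y) for y in ys]
            table[(x_name, y_name)] = r_squared(tx, ty)
    return table


# Invented data that follow y = 3 + 2 * sqrt(x); all values are positive,
# so every transformation in the dictionary is defined.
xs = [1.0, 4.0, 9.0, 16.0, 25.0]
ys = [5.0, 7.0, 9.0, 11.0, 13.0]

table = transformation_table(xs, ys)
best = max(table, key=table.get)  # pair with the highest r-squared
```

For these data the search selects the square root of x with y left untransformed, because that pairing makes the relationship exactly linear.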
The Box-Cox transformations can be expressed mathematically:
Notes:
z is the transformed data value
y is the original data value
k is a constant used in the transformation
Two cautions should be noted when using transformations.
1. When a reciprocal transformation is used, the sign of the correlation coefficient may no longer indicate the direction of the relationship in the untransformed data.
2. Some transformations may not be possible for some data. For example, it is not possible to take the log or square root of a negative number or the reciprocal of zero. When necessary, StatPac will automatically add a constant to the data to prevent this type of error.
One problem with least squares regression is its susceptibility to extreme or unusual data values. In many cases, even a single extreme data value can distort the regression results. A technique called robust regression is included in StatPac to overcome this problem. Robust regression mathematically adjusts extreme data values through an iterative process. The effect is to reduce the distortion in the regression line caused by the outlying data value(s). The robust process makes successive adjustments to extreme data values by examining the median residual and using a weighted least squares regression to adjust the outliers. If robust regression is specified for either the x or y transformation, no other built-in transformations will be used (even if a transformation is specified for the other variable).
Statistics
The regression statistics provide the information needed to adequately evaluate the "goodness-of-fit"; that is, how well the regression line explains the actual data. The statistics include correlation information, descriptive statistics, error measures and regression coefficients. This data can be used to predict future values for the dependent variable and to develop confidence intervals for the predictions.
Example of a Statistics Printout
Correlation is a measure of association between two variables. StatPac calculates the Pearson product-moment correlation coefficient. Its value may vary from minus one to plus one. A minus one indicates a perfect negative correlation, while a plus one indicates a perfect positive correlation. A correlation of zero means there is no linear relationship between the two variables. When a transformation has been specified, the correlation coefficient refers to the relationship of the transformed data.
The coefficient of determination (r-squared) is the square of the correlation coefficient. Its value may vary from zero to one. It has the advantage over the correlation coefficient in that it may be interpreted directly as the proportion of variance in the dependent variable that can be accounted for by the regression equation. For example, an r-squared value of .49 means that 49% of the variance in the dependent variable can be explained by the regression equation. The other 51% is unexplained error.
The standard error of estimate for regression measures the amount of variability in the points around the regression line. It is the standard deviation of the data points as they are distributed around the regression line. The standard error of the estimate can be used to specify the limits and confidence of any prediction made and is useful to obtain confidence intervals for y' given a fixed value of x.
Regression analysis enables us to predict one variable if the other is known. The regression line (known as the "least squares line") is a plot of the expected value of the dependent variable for all values of the independent variable. The difference between the observed and expected value is called the residual. It can be used to calculate various measures of error. The measures of error in StatPac are the mean percent error (MPE), mean absolute percent error (MAPE) and the mean squared error (MSE).
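The statistics described above can all be derived from the fitted line and its residuals. The sketch below uses common textbook definitions, which are assumptions here: the standard error of estimate divides by n - 2, MSE divides by n, and the percent errors are taken relative to the observed values. StatPac's exact formulas may differ, and the function name and data are invented.

```python
import math


def regression_statistics(xs, ys):
    """Return correlation, fit, and error measures for y regressed on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    b = sxy / sxx
    a = mean_y - b * mean_x
    r = sxy / math.sqrt(sxx * syy)  # Pearson product-moment correlation
    # Residual = observed value minus expected (predicted) value.
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    sse = sum(e * e for e in residuals)
    return {
        "r": r,
        "r_squared": r * r,
        # Assumed definition: n - 2 degrees of freedom.
        "std_error_estimate": math.sqrt(sse / (n - 2)),
        # Assumed definitions of the error measures (observed values must be nonzero).
        "mse": sse / n,
        "mpe": 100.0 * sum(e / y for e, y in zip(residuals, ys)) / n,
        "mape": 100.0 * sum(abs(e / y) for e, y in zip(residuals, ys)) / n,
    }


# Invented data for illustration.
stats = regression_statistics([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 5.0, 8.0])
```

Because MPE lets positive and negative errors cancel while MAPE does not, MAPE is never smaller in magnitude than MPE.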
Using the regression equation, the variable on the y axis may be predicted from the score on the x axis. The slope of the regression line (b) is defined as the rise divided by the run. The y intercept (a) is the point on the y axis where the regression line would intercept the y axis. The slope and y intercept are incorporated into the regression equation as:
y = a + bx
The significance of the slope of the regression line is determined from Student's t statistic. It is the probability that the observed correlation coefficient occurred by chance if the true correlation is zero. StatPac uses a two-tailed test to derive this probability from the t distribution. Probabilities of .05 or less are generally considered significant, implying that there is a relationship between the two variables. Although StatPac does not calculate the F statistic, it is simply the square of the t statistic for the slope.
Data Table
The data table provides a detailed method to examine the errors between the predicted and actual values of the dependent variable. StatPac allows printing of the table to more closely study the residuals. Typing the option DT=Y will cause the output to include a data table.
Example of a Data Table
Outlier Definition and Adjustment
Outliers (extreme data points) can have dramatic effects on the slope of the regression line. One method to deal with outliers is to use robust regression (TX=I or TY=I). There are two other common methods to deal with outliers. The first is to simply eliminate any records that contain an outlier and then rerun the regression without those data records. The other method is known as data trimming, where the highest and lowest extreme values are replaced with a value that limits the standardized residual to a predetermined value.
The OA option is used to set the outlier adjustment method. It may be set to OA=N (none), OA=D (delete), or OA=A (adjust). Both methods use a two-step process. First the regression is performed using the actual values for the dependent variable and standardized residuals are calculated for each predicted value. When a standardized residual exceeds a given z-value, the record is flagged. Then the regression is run again and the flagged records are either eliminated (OA=D), or the value of the dependent variable is adjusted to the value defined by the outlier definition z-value (OA=A).
For example, if the outlier definition is set to 1.96 standard deviations (OD=1.96), records whose standardized residuals fall in the upper or lower two and a half percent of the distribution would be flagged. Then the dependent variable for each flagged record would be modified to a value that would produce a standardized residual of plus or minus 1.96. Finally, the regression would be rerun using the modified dependent variable values for the flagged records. Flagged data records will be shown with an asterisk in the data table.
It is important to note that the outlier adjustment process is only performed once because each regression would produce a new set of standardized residuals that would exceed the outlier definition value (OD=z). That is, any set of data with sufficient sample size will yield a set of outliers, even if the data has already been adjusted.
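The two-step process can be sketched as follows. This is an illustrative reading of the description above, not StatPac's code: it assumes the standardized residual is the residual divided by the first-pass standard error of estimate (with n - 2 degrees of freedom), and that an adjusted value is the first-pass prediction plus or minus the outlier definition times that standard error. The function names and data are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Return (intercept, slope) of the least squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rerun_with_outlier_adjustment(xs, ys, od=1.96, method="A"):
    """Flag outliers once, then delete ("D") or adjust ("A") and refit once."""
    # Step 1: regress on the actual values and standardize the residuals.
    a, b = fit_line(xs, ys)
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    std_error = math.sqrt(sum(e * e for e in residuals) / (len(xs) - 2))
    if std_error == 0.0:  # perfect fit: nothing to flag
        return a, b, [False] * len(xs)
    flagged = [abs(e / std_error) > od for e in residuals]
    # Step 2: delete or adjust the flagged records, then rerun a single time.
    if method == "D":
        kept = [(x, y) for x, y, f in zip(xs, ys, flagged) if not f]
        new_xs = [x for x, _ in kept]
        new_ys = [y for _, y in kept]
    else:
        new_xs = list(xs)
        new_ys = [
            (a + b * x) + math.copysign(od * std_error, e) if f else y
            for x, y, e, f in zip(xs, ys, residuals, flagged)
        ]
    a2, b2 = fit_line(new_xs, new_ys)
    return a2, b2, flagged


# Invented data: ten points on the line y = x with one extreme value at x = 5.
xs = [float(i) for i in range(1, 11)]
ys = list(xs)
ys[4] = 25.0

a_del, b_del, flagged = rerun_with_outlier_adjustment(xs, ys, method="D")
a_adj, b_adj, _ = rerun_with_outlier_adjustment(xs, ys, method="A")
```

The refit happens exactly once; repeating the flagging step on the refitted line would flag a fresh set of records each time.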
Allowing the outlier adjustment process to repeat indefinitely would eventually result in an adjustment to nearly every data record.
When outlier adjustment is used, the program will also report adjusted means and standard deviations for the dependent variable. These are the statistics recalculated after deleting or adjusting the data.
Confidence Intervals & Confidence Level
Confidence intervals provide an estimate of variability around the regression line. Narrow confidence intervals indicate less variability around the regression line. The option CI=Y will include the confidence intervals in the data table.
Prediction intervals, rather than confidence intervals, should be used if you intend to use the regression information to predict new values for the dependent variable. Both the confidence intervals and the prediction intervals are centered on the regression line, but the prediction intervals are much wider. The option CI=P will print the prediction intervals in the data table.
The actual confidence or prediction interval is set with the CL option. The CL option specifies the percentage level of the interval. For example, if CI=P and CL=95, the 95% prediction intervals would be printed in the data table.
Residual Autocorrelation Function Table
Examining the autocorrelation of the residuals is often used in time-series analysis to evaluate how well the regression worked. It is a way of looking at the "goodness-of-fit" of the regression line. If the residuals contain a pattern, the regression did not do as well as we might have desired.
A residual autocorrelation table shows the correlation between residuals that occur at various time lags. For example, at time lag one, you are looking at the correlation between adjacent values; at time lag two, you are looking at the correlation between every other value, etc. To select the residual autocorrelation function table, type the option AC=Y.
Example of a Residual Autocorrelation Function Table
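Each entry in a residual autocorrelation table is the correlation of the residual series with a copy of itself shifted by the given lag. The sketch below uses the usual sample autocorrelation estimator, which is an assumption about the exact formula; the function name and residual values are invented.

```python
def autocorrelation(residuals, lag):
    """Sample autocorrelation of a residual series at the given time lag."""
    n = len(residuals)
    mean = sum(residuals) / n
    total = sum((e - mean) ** 2 for e in residuals)
    # Pair each residual with the one `lag` positions later in time.
    shifted = sum(
        (residuals[t] - mean) * (residuals[t + lag] - mean)
        for t in range(n - lag)
    )
    return shifted / total


# Invented residuals with an obvious alternating pattern.
residuals = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
table = {lag: autocorrelation(residuals, lag) for lag in (1, 2, 3)}
```

A strongly negative value at lag one followed by a strongly positive value at lag two, as in this example, indicates a pattern that the regression line did not capture.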
Expanding Standard Error
You may use the EX option to set the standard error limits of the residual autocorrelation function to a fixed value or to expand with increasing time lags. A study of sampling distributions on autocorrelated time series was made by
When EX=Y, the residual autocorrelation function error limits will widen with each successive time lag. If EX=N, the standard error limits will remain constant.
Force Constant to Zero
The option CZ=Y can be used to calculate a regression equation with the constant equal to zero. If this is done, the regression line is forced through the origin. Note that forcing the constant to zero disables calculation of the correlation coefficient, r-squared, and the standard error of estimate. For this reason, it is not possible to set the transformation parameter for either the independent or dependent variable to automatic, because there are no r-squared statistics to compare. Furthermore, confidence intervals, which are calculated from the standard error of estimate, cannot be computed. The option CZ=N results in a standard regression equation.
Save Results
Many times researchers want to save results for future study. By using the option SR=Y, the predicted values, residuals and confidence or prediction intervals can be saved so they can be merged with the original data file. At the completion of the analysis, you will be given the opportunity to merge the predictions and residuals.
Predict Interactively
When performing a regression, you may want to predict values of the dependent variable for specific values of the independent variable. This is known as interactive prediction. Select interactive prediction by entering the option PR=Y. After the completion of the tabular outputs, you will be prompted to enter a value for the independent variable. The program will predict the value for the dependent variable based on the regression equation. Confidence and/or prediction intervals will also be given.
Labeling and Spacing Options
Multiple regression is an extension of simple regression. It examines the relationship between a dependent variable and two or more explanatory variables (also called independent or predictor variables). Multiple regression is used to:
1. Predict the value of a dependent variable using some or all of the independent variables. The aim is generally to explain the dependent variable accurately with as few independent variables as possible.
2. Examine the influence and relative importance of each independent variable on the dependent variable. This involves looking at the magnitude and sign of the standardized regression coefficients as well as the significance of the individual regression coefficients.
The syntax of the command to run a stepwise regression is:
STEPWISE <Dependent variable> <Independent variable list>
For example, we might try to predict annual income (V1=INCOME) from age (V2=AGE), number of years of school (V3=SCHOOL), and IQ score (V4=IQ). The command to run the regression could be specified in several different ways:
STEPWISE INCOME, AGE, SCHOOL, IQ
STEPWISE V1,V2,V3,V4
ST INCOME V2-V4 (Note: STEPWISE may be abbreviated as ST)
STEPWISE V1-V4
In each example, the dependent variable was specified first, followed by the independent variable list. The variable list itself may contain up to 200 independent variables and can consist of variable names and/or variable numbers. Either a comma or a space can be used to separate the variables from each other. The multiple regression equation is similar to the simple regression equation. The only difference is that there are several predictor variables and each one has its own regression coefficient. The multiple regression equation is:
Y' = a + b1 x1 + b2 x2 + b3 x3 + ... + bn xn
where:
Y' is the predicted value
a is a constant
b1 is the estimated regression coefficient for variable 1
x1 is the score for variable 1
b2 is the estimated regression coefficient for variable 2
x2 is the score for variable 2
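The constant and coefficients of the multiple regression equation can be obtained by solving the least squares normal equations. The following sketch does this with plain Gauss-Jordan elimination; it is a generic illustration with no stepwise selection, the function names are hypothetical, and the income-style data are invented.

```python
def solve(matrix, rhs):
    """Solve a square linear system by Gauss-Jordan elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        # Swap in the row with the largest pivot to limit rounding error.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_multiple(rows, ys):
    """Return [a, b1, ..., bn] minimizing the sum of squared residuals."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 carries the constant
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)


def predict(coefficients, row):
    """Evaluate Y' = a + b1 x1 + ... + bn xn for one record."""
    return coefficients[0] + sum(b * x for b, x in zip(coefficients[1:], row))


# Invented records (x1, x2) whose y values follow 10 + 2*x1 + 3*x2 exactly.
rows = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 5.0)]
ys = [18.0, 17.0, 28.0, 27.0, 35.0]

coefficients = fit_multiple(rows, ys)
y_new = predict(coefficients, (6.0, 6.0))
```

With error-free data the fit recovers the generating constant and coefficients; with real data the same calls return the best-fitting values instead.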
Descriptive Statistics
The means and standard deviations for all the variables in the equation can be printed with the DS=Y option.
Example of a Descriptive Statistics Printout
Regression Statistics
The regression statistics can be selected with option RS=Y. They give us an overall picture of how successful the regression was.
The coefficient of multiple determination, frequently referred to as r-squared, can be interpreted directly as the proportion of variance in the dependent variable that can be accounted for by the combination of predictor variables. A coefficient of multiple determination of .85 means that 85 percent of the variance in the dependent variable can be explained by the combined effects of the independent variables; the remaining 15 percent would be unexplained.
The coefficient of multiple correlation is the square root of the coefficient of multiple determination. Its interpretation is similar to the simple correlation coefficient. It is basically a measure of association between the predicted value and the actual value.
The standard error of the multiple estimate provides an estimate of the standard deviation of the residuals. It is used in conjunction with the inverted matrix to calculate confidence intervals and statistical tests of significance.
When there are fewer than 100 records, StatPac will apply an adjustment to the above three statistics, and the adjusted value will be printed. The adjustment is for a small n and its value should be used.
The variability of the dependent variable is made up of variation produced by the joint effects of the independent variables and some unexplained variance. The overall F-test is performed to determine the probability that the true coefficient of multiple determination is zero. Typically, a probability of .05 or less leads us to reject the hypothesis that the regression equation does not improve our ability to predict the dependent variable.
Example of the Regression Statistics Printout
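The overall regression statistics follow from the observed values, the predicted values, and the number of predictors. The sketch below uses standard textbook formulas; in particular, the adjusted r-squared shown is the common degrees-of-freedom adjustment and is only assumed to resemble StatPac's small-sample adjustment. The function name and data are invented.

```python
import math


def regression_summary(ys, predictions, n_predictors):
    """Overall fit statistics; assumes predictions come from a least squares fit."""
    n = len(ys)
    mean_y = sum(ys) / n
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    ss_residual = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    r_squared = 1.0 - ss_residual / ss_total
    df_residual = n - n_predictors - 1
    return {
        "r_squared": r_squared,
        "multiple_r": math.sqrt(r_squared),
        # Common adjustment for small n (assumed, not confirmed as StatPac's).
        "adjusted_r_squared": 1.0 - (1.0 - r_squared) * (n - 1) / df_residual,
        "std_error": math.sqrt(ss_residual / df_residual),
        # Explained variance per predictor over unexplained variance per df.
        "f_ratio": (r_squared / n_predictors) / ((1.0 - r_squared) / df_residual),
    }


# Invented observed and predicted values for a one-predictor model.
observed = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
predicted = [1.5, 1.5, 3.5, 3.5, 5.5, 5.5]
summary = regression_summary(observed, predicted, n_predictors=1)
```

The adjusted value is always at or below the unadjusted r-squared, and the gap shrinks as the number of records grows relative to the number of predictors.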
Regression Coefficients
The regression coefficients can be printed with the RC=Y option. The output includes the constant, coefficient, beta weight, F-ratio, probability, and standard error for each independent variable.
Each coefficient provides an estimate of the effect of that variable (in the units of the raw score) for predicting the dependent variable. The beta weights, on the other hand, are the standardized regression coefficients and represent the relative importance of each independent variable in predicting the dependent variable.
The F-ratio allows us to calculate the probability that the influence of the predictor variable occurred by chance. The t-statistic for each independent variable is equal to the square root of its F-ratio. The standard error of the ith regression coefficient can be used to obtain confidence intervals about each regression coefficient in conjunction with its F-ratio.
Example of Regression Coefficients Printout
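The relationship between a raw coefficient and its beta weight, and between the F-ratio and the t-statistic, can be shown in a few lines. The sketch uses the standard conversion (coefficient times the ratio of standard deviations), which is assumed to match the printed beta weights; giving t the sign of its coefficient is a convention added here, and the function names and data are invented.

```python
import math
import statistics


def beta_weight(coefficient, x_values, y_values):
    """Standardized coefficient: raw coefficient rescaled to standard deviation units."""
    return coefficient * statistics.stdev(x_values) / statistics.stdev(y_values)


def t_from_f(f_ratio, coefficient):
    """t is the square root of F, carrying the sign of the coefficient."""
    return math.copysign(math.sqrt(f_ratio), coefficient)


# Invented data in which y is exactly twice x, so the raw coefficient is 2.
x_values = [1.0, 2.0, 3.0, 4.0]
y_values = [2.0, 4.0, 6.0, 8.0]

beta = beta_weight(2.0, x_values, y_values)
t_value = t_from_f(9.0, -0.5)
```

Because beta weights are unit-free, they can be compared across predictors measured on different scales, which raw coefficients cannot.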
Simple Correlation Matrix
After performing a regression analysis, it is a good idea to review the simple correlation matrix (SC=Y). If two variables are highly correlated, it is possible that the matrix is not well conditioned and it might be beneficial to run the regression again without one of the variables. If the coefficient of multiple determination does not show a significant change, you might want to leave the variable out of the equation.
Example of a Simple Correlation Printout
Partial Correlation Matrix
The partial correlation matrix (often called the variance-covariance matrix) is obtained from the inverse of the simple correlation matrix. It can be selected with the option PC=Y. This is useful in studying the correlation between two variables while holding all the other variables constant. A significant partial correlation between variables A and B would be interpreted as follows: When all other variables are held constant, there is a significant relationship between A and B. The partial correlation matrix will be printed for those variables remaining in the equation after the stepwise procedure.
Example of a Partial Correlation Matrix Printout
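For the special case of three variables, the partial correlation between A and B with C held constant has a simple closed form in terms of the three simple correlations. The sketch below shows that textbook formula only; it does not reproduce the matrix-inverse route used for larger sets of variables, and the function name and correlation values are invented.

```python
import math


def partial_correlation(r_ab, r_ac, r_bc):
    """Correlation between A and B with C held constant (three-variable case)."""
    # Remove the part of the A-B correlation that runs through C,
    # then rescale by the variance in A and B not shared with C.
    return (r_ab - r_ac * r_bc) / math.sqrt((1.0 - r_ac ** 2) * (1.0 - r_bc ** 2))


# If C is unrelated to both A and B, holding it constant changes nothing.
unchanged = partial_correlation(0.5, 0.0, 0.0)

# If A and B are related only through C, the partial correlation vanishes.
explained_away = partial_correlation(0.6, 0.8, 0.75)
```

The second case illustrates why a sizeable simple correlation can coexist with a negligible partial correlation.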
Inverted Correlation Matrix
The solution to a multiple regression problem is obtained through a technique known as matrix inversion. The inverted correlation matrix is the inversion of the simple correlation matrix. It may be selected with the option IC=Y.
In examining the inverted matrix, we are specifically interested in the values along the diagonal. They provide a measure of how successful the matrix inversion was. If all the values on the diagonal are close to one, the inversion was very successful and we say the matrix is "well conditioned". If, however, we have one or more diagonal values that are high (greater than ten), we have a problem with collinearity (high correlations between independent variables).
Example of an Inverted Matrix Printout
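The diagonal check described above can be sketched by inverting a small correlation matrix and listing the variables whose diagonal value exceeds ten. The inversion routine is a generic Gauss-Jordan implementation, not StatPac's; the function names and the correlation values are invented for illustration.

```python
def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [
        [float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def collinear_variables(correlations, names, threshold=10.0):
    """Names whose diagonal entry in the inverted matrix exceeds the threshold."""
    inverse = invert(correlations)
    return [name for i, name in enumerate(names) if inverse[i][i] > threshold]


# Invented correlations: AGE and SCHOOL are almost perfectly correlated.
names = ["AGE", "SCHOOL", "IQ"]
correlations = [
    [1.00, 0.98, 0.10],
    [0.98, 1.00, 0.10],
    [0.10, 0.10, 1.00],
]
problem_variables = collinear_variables(correlations, names)
```

In this example the two highly correlated predictors are reported together, which suggests rerunning the regression without one of them.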
Print Each Step
You can print the statistics for each step of the stepwise procedure using the option PS=Y. This may be important when you want to study how the inclusion or deletion of a variable affects the other variables.
Example of the Print Steps Output
Summary Table
A good way to get an overview of how the steps proceeded and what effect each step had upon the r-squared is to print a summary table. To print the summary table, use the option ST=Y.
Example of a Summary Table
Data Table
The data table provides a detailed method to examine the residuals. StatPac allows printing of the table to more closely study the residuals. Using the option DT=Y will cause the output to include a data table.
A "residual" is the difference between the observed value and the predicted value for the dependent variable (the error in the prediction). The standardized residuals which appear in the data table are the residuals divided by the standard error of the multiple estimate. Therefore, the standardized residuals are in standard deviation units. In large samples, we would expect 95 percent of the standardized residuals to lie between -1.96 and 1.96.
Example of a Data Table
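The standardized residuals in the data table are a direct rescaling of the ordinary residuals. A minimal sketch, with hypothetical function names and invented observed values, predicted values, and standard error:

```python
def standardized_residuals(observed, predicted, std_error):
    """Residuals (observed minus predicted) in standard deviation units."""
    return [(o - p) / std_error for o, p in zip(observed, predicted)]


def share_within(z_values, limit=1.96):
    """Fraction of standardized residuals between -limit and +limit."""
    return sum(1 for z in z_values if abs(z) <= limit) / len(z_values)


# Invented values for illustration.
observed = [10.0, 12.0, 15.0, 20.0]
predicted = [11.0, 12.0, 14.0, 16.0]
z_values = standardized_residuals(observed, predicted, std_error=2.0)
share = share_within(z_values)
```

With only four records the share inside the limits is far from the 95 percent expected in large samples, which is why that expectation is stated for large samples only.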
Outlier Definition and Adjustment
Outliers (extreme data points) can have a dramatic effect on the stability of a multiple regression model. There are two common methods to deal with outliers in multiple regression models. The first is to simply eliminate any records that contain an outlier and then rerun the regression without those data records. When using OA=D (setting the outlier adjustment to delete), records containing the highest and lowest extreme residuals are deleted from the analysis. The other method adjusts the dependent variable for records with the highest and lowest extreme residuals. That is, the dependent variable is modified to a value that limits the standardized residual to a predetermined value.
The OA option is used to set the outlier adjustment method. It may be set to OA=N (none), OA=D (delete), or OA=A (adjust). Both methods use a two-step process. First the regression is performed using the actual values for the dependent variable and standardized residuals are calculated for each predicted value. When a standardized residual exceeds a given z-value, the record is flagged. Then the regression is run again and the flagged records are either eliminated (OA=D), or the value of the dependent variable is adjusted to the value defined by the outlier definition z-value (OA=A).
For example, if the outlier definition is set to 1.96 standard deviations (OD=1.96), records whose standardized residuals fall in the upper or lower two and a half percent of the distribution would be flagged. Then the dependent variable for each flagged record would be modified to a value that would produce a standardized residual of plus or minus 1.96. Finally, the regression would be rerun using the modified dependent variable values for the flagged records. Flagged data records will be shown with an asterisk in the data table.
The stepwise procedure presents a problem for data trimming. The stepwise procedure often reduces the number of independent variables to a subset of the original list of independent variables.
Data trimming involves using the standardized residuals to adjust the value of the dependent variable for some records. Rerunning the stepwise procedure with different values for some of the dependent variables could result in a different set of independent variables being stepped into the model, especially when there are highly correlated independent variables. To avoid this problem, StatPac reruns the multiple regression using the same independent variables selected in the first stepwise procedure. These variables are forced into the model so that the analysis runs in the non-stepwise mode.
It is important to note that the outlier adjustment process is only performed once because each regression would produce a new set of standardized residuals that would exceed the outlier definition value (OD=z). That is, any set of data with sufficient sample size will yield a set of outliers, even if the data has already been adjusted. Allowing the outlier adjustment process to repeat indefinitely would eventually result in an adjustment to nearly every data record.
It is suggested that the user actually examine records that are flagged as extreme outliers before allowing the program to make any adjustments. Outlier adjustments assume that the data for all independent variables is acceptable. A mispunched data value for an independent variable could result in an extreme prediction that gets flagged as an outlier. Therefore, visual inspection is the best way to guarantee the successful handling of outliers.
Confidence Intervals & Confidence Level
Confidence intervals provide an estimate of variability around the regression line. Narrow confidence intervals indicate less variability around the regression line. The option CI=Y will include the confidence intervals in the data table.
Prediction intervals, instead of confidence intervals, should be used if you intend to use the regression information to predict new values for the dependent variable.
Both the confidence intervals and the prediction intervals are centered on the regression line, but the prediction intervals are much wider. The option CI=P will print the prediction intervals in the data table.
The actual confidence or prediction interval is set with the CL option. The CL option specifies the percentage level of the interval. For example, if CI=P and CL=95, the 95% prediction intervals would be printed in the data table.
Number of Variables to Force
The ability to force variables into an equation is important for several reasons:
1. A researcher often wishes to replicate the analysis of another study and, therefore, to force certain core variables into the equation, letting stepwise regression choose from the remaining set.
2. Some variables may be cheaper or easier to measure, and the user may want to see whether the remaining variables add anything to the equation.
3. It is common to force certain design variables into the equation.
4. When independent variables are highly correlated, one of them may be more accurate than the rest, and you may want to force this variable into the equation.
The FO option specifies the number of variables to force into the regression equation. To perform a standard (non-stepwise) multiple regression, set the FO option to the number of independent variables or higher. FO=200 will always force all independent variables into the equation. Thus, the FO option may be used to eliminate the stepwise part of the multiple regression procedure. If you force all variables into the equation, the multiple regression will contain only one step, where all variables are included in the equation.
The variables to be forced are taken in order from the list of independent variables. For instance, the option FO=3 forces the first three variables from the list of independent variables. Therefore, any variables you want to force should be specified at the beginning of the independent variable list.
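The way the FO value divides the independent variable list can be pictured as a simple split of that list. The sketch below only illustrates the ordering rule described above; the function name is hypothetical and it performs no regression.

```python
def split_forced(independents, fo):
    """Split the list into (forced variables, stepwise candidates)."""
    # An FO value at or above the list length forces every variable.
    count = min(fo, len(independents))
    return independents[:count], independents[count:]


variables = ["AGE", "SCHOOL", "IQ"]

forced, candidates = split_forced(variables, 2)
all_forced, none_left = split_forced(variables, 200)
```

With FO=2, AGE and SCHOOL are forced and only IQ is left for stepwise selection; with FO=200 nothing is left to select, so the analysis reduces to a single non-stepwise step.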
F to Enter & F to Remove
When faced with a large number of possible explanatory variables, two opposed criteria for selecting a regression equation are usually involved:
1. To make the equation useful for predictive purposes, we would like our model to include as many of the independent variables as possible so that reliable fitted values can be determined.
2. Because of the costs involved in obtaining information on a large number of independent variables, and subsequently monitoring them, we would like the equation to include as few of the independent variables as possible.
The compromise between these two extremes is generally called "selecting the best regression". This involves multiple executions of multiple regression in an attempt to add variables to improve prediction or remove variables to simplify the regression function. Stepwise regression provides a partial automation of this procedur |