StatPac for Windows User's Guide 

Advanced Multivariate Statistics
The Advanced Analyses module adds multivariate procedures that are not available in the basic StatPac for Windows package. These commands may be used in a procedure when the Advanced Analyses module has been installed.
The REGRESS command may be used to perform ordinary least squares regression and curve fitting. Ordinary least squares regression (also called simple regression) is used to examine the relationship between one independent and one dependent variable. After performing an analysis, the regression statistics can be used to predict the dependent variable when the independent variable is known.
People use regression on an intuitive level every day. In business, a well-dressed man is thought to be financially successful. A mother knows that more sugar in her children's diet results in higher energy levels. The ease of waking up in the morning often depends on how late you went to bed the night before. Quantitative regression adds precision by developing a mathematical formula that can be used for predictive purposes.
The syntax of the command to run a simple regression analysis is:
REGRESS <Dependent variable> <Independent variable>
or
REGRESS <Dependent variable list> WITH <Independent variable list>
For example, a medical researcher might want to use body weight (V1=WEIGHT) to predict the most appropriate dose for a new drug (V2=DOSE). The command to run the regression could be specified in several ways:
REGRESS DOSE WITH WEIGHT
REGRESS DOSE WEIGHT
RE V2 WITH V1     (Note: REGRESS may be abbreviated as RE)
RE V2 V1
Notice that the keyword WITH is an optional part of the syntax. However, if you specify a variable list for either the dependent or independent variable, the use of the WITH keyword is mandatory. When a variable list is specified, a separate regression will be performed for each combination of dependent and independent variables. The purpose of running the regression is to find a formula that fits the relationship between the two variables. Then you can use that formula to predict values for the dependent variable when only the independent variable is known. The general formula for the linear regression equation is:
y = a + bx
where:
x is the independent variable
y is the dependent variable
a is the intercept
b is the slope
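As an illustration only (this is not StatPac code), the intercept a and slope b of the equation above can be computed with the usual least squares formulas. The data values and the function name fit_line are hypothetical.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sum of squares about the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx              # slope
    a = mean_y - b * mean_x    # intercept
    return a, b

# Hypothetical data: body weight (independent) and dose (dependent)
weight = [50.0, 60.0, 70.0, 80.0, 90.0]
dose = [105.0, 125.0, 145.0, 165.0, 185.0]

a, b = fit_line(weight, dose)
predicted = a + b * 75.0   # predicted dose for a weight of 75
```

With these made-up values the points lie exactly on a line, so the fit recovers an intercept of 5 and a slope of 2.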
Curve Fitting
Frequently, the relationship between the independent and dependent variable is not linear. Classical examples include the traditional sales curve, learning curve and population growth curve. In each case, linear (straight-line) regression would present a distorted picture of the actual relationship. Several classical nonlinear curves are built into StatPac and you can simply ask the program to find the best one.
Transformations are used to find an equation that will make the relationship between the variables linear. A linear least squares regression can then be performed on the transformed data. The process of finding the best transformation is known as "curve fitting". Basically, it is an attempt to find a transformation that can be made to the dependent and/or independent variable so that a least squares regression will fit the transformed data. This can be expressed by the equation:
(transformed y) = intercept + slope * (transformed x)
Notice the similarity to the least squares equation. The difference is that the independent variable is transformed and the equation predicts a transformed dependent variable. To solve for y, apply the inverse of the y transformation (the "untransformation") to both sides of the equation:
y = Untransformation of (intercept + slope * (transformed x))
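The following sketch, which is not StatPac code, walks through the idea for one case: a reciprocal transformation of x and a log transformation of y. The line is fitted to the transformed data, and predictions are untransformed with the exponential function (the inverse of the log). The data and function names are hypothetical.

```python
import math

def fit_line(xs, ys):
    """Return (intercept, slope) of the least squares line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical data that follow log(y) = 2 - 3 * (1/x) exactly
x = [1.0, 2.0, 3.0, 4.0, 6.0]
y = [math.exp(2.0 - 3.0 / v) for v in x]

tx = [1.0 / v for v in x]        # transformed x (reciprocal)
ty = [math.log(v) for v in y]    # transformed y (log)
intercept, slope = fit_line(tx, ty)

def predict(x_new):
    """Untransform: y = exp(intercept + slope * (1/x))."""
    return math.exp(intercept + slope * (1.0 / x_new))
```

Here predict(5.0) returns a value on the original scale of y, even though the regression itself was run on the transformed data.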
In addition to the built-in transformations, any nonlinear relationship that can be expressed as a mathematical formula can be explored with the COMPUTE statement. It is possible to transform both the dependent and independent variables with the COMPUTE statement. The transformations that are built into StatPac are known as Box-Cox transformations. They are:
For example, to apply a square root transformation to the independent variable and no transformation to the dependent variable, the options statement would be:
OPTIONS TX=G TY=H
The following option statement would try to fit your data to a classical S-curve. It says to apply a reciprocal transformation to the independent variable and a log transformation to the dependent variable:
OPTIONS TX=B TY=E
The program also contains an automatic feature to search for the best transformation. When TX or TY is set to automatic, the program will select the transformation that produces the highest r-squared. To get a complete table of all combinations of the transformations, set both TX and TY to automatic.
OPTIONS TX=A TY=A
The result will produce a table of all possible combinations of transformations and the R-squared statistics:
Example of a Transformation Table
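The automatic search can be pictured as follows. This is only a sketch of the idea, not StatPac's implementation: it tries four illustrative transformations (chosen here as an assumption) on each variable, computes r-squared for every combination, and keeps the highest. It assumes all data values are positive and does not add a constant the way StatPac does.

```python
import math

# Illustrative set of transformations; the names are hypothetical labels
TRANSFORMS = {
    "none": lambda v: v,
    "reciprocal": lambda v: 1.0 / v,
    "log": math.log,
    "sqrt": math.sqrt,
}

def r_squared(xs, ys):
    """Square of the Pearson correlation between xs and ys."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)

def transformation_table(xs, ys):
    """Return {(x transform, y transform): r-squared} for all combinations."""
    table = {}
    for x_name, fx in TRANSFORMS.items():
        for y_name, fy in TRANSFORMS.items():
            tx = [fx(v) for v in xs]
            ty = [fy(v) for v in ys]
            table[(x_name, y_name)] = r_squared(tx, ty)
    return table

# Hypothetical S-curve data: log(y) = 2 - 3 * (1/x)
x = [1.0, 2.0, 3.0, 4.0, 6.0]
y = [math.exp(2.0 - 3.0 / v) for v in x]

table = transformation_table(x, y)
best = max(table, key=table.get)   # combination with the highest r-squared
```

For this made-up S-curve the best combination is the reciprocal of x with the log of y, as the S-curve description suggests.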
The Box-Cox transformations can be expressed mathematically:
Notes:
z is the transformed data value
y is the original data value
k is a constant used in the transformation
Two cautions should be noted when using transformations:
1. When a reciprocal transformation is used, the sign of the correlation coefficient may no longer indicate the direction of the relationship in the untransformed data.
2. Some transformations may not be possible for some data. For example, it is not possible to take the log or square root of a negative number or the reciprocal of zero. When necessary, StatPac will automatically add a constant to the data to prevent this type of error.
One problem with least squares regression is its susceptibility to extreme or unusual data values. In many cases, even a single extreme data value can distort the regression results. A technique called robust regression is included in StatPac to overcome this problem. Robust regression mathematically adjusts extreme data values through an iterative process. The effect is to reduce the distortion in the regression line caused by the outlying data value(s). The robust process makes successive adjustments to extreme data values by examining the median residual and using a weighted least squares regression to adjust the outliers. If robust regression is specified for either the x or y transformation, no other built-in transformations will be used (even if a transformation is specified for the other variable).
Statistics
The regression statistics provide the information needed to adequately evaluate the "goodness-of-fit"; that is, how well the regression line explains the actual data. The statistics include correlation information, descriptive statistics, error measures and regression coefficients. This information can be used to predict future values for the dependent variable and to develop confidence intervals for the predictions.
Example of a Statistics Printout
Correlation is a measure of association between two variables. StatPac calculates the Pearson product-moment correlation coefficient. Its value may vary from minus one to plus one. A minus one indicates a perfect negative correlation, while a plus one indicates a perfect positive correlation. A correlation of zero means there is no relationship between the two variables. When a transformation has been specified, the correlation coefficient refers to the relationship of the transformed data.
The coefficient of determination (r-squared) is the square of the correlation coefficient. Its value may vary from zero to one. It has the advantage over the correlation coefficient in that it may be interpreted directly as the proportion of variance in the dependent variable that can be accounted for by the regression equation. For example, an r-squared value of .49 means that 49% of the variance in the dependent variable can be explained by the regression equation. The other 51% is unexplained error.
The standard error of estimate for regression measures the amount of variability in the points around the regression line. It is the standard deviation of the data points as they are distributed around the regression line. The standard error of the estimate can be used to specify the limits and confidence of any prediction made and is useful to obtain confidence intervals for y' given a fixed value of x.
Regression analysis enables us to predict one variable if the other is known. The regression line (known as the "least squares line") is a plot of the expected value of the dependent variable for all values of the independent variable. The difference between the observed and expected value is called the residual. It can be used to calculate various measures of error. The measures of error in StatPac are the mean percent error (MPE), mean absolute percent error (MAPE) and the mean squared error (MSE).
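The statistics described above can be sketched as follows. This is an illustration under stated assumptions, not StatPac's code: the standard error of estimate uses n - 2 degrees of freedom, and MPE, MAPE and MSE use their common textbook definitions, which may differ in detail from what StatPac prints. The data are hypothetical.

```python
import math

def regression_stats(xs, ys):
    """Correlation, r-squared, standard error of estimate and error measures."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    r = sxy / math.sqrt(sxx * syy)
    # Residual = observed value minus the value expected from the line
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    sse = sum(e * e for e in residuals)
    return {
        "r": r,
        "r_squared": r * r,
        "std_error_estimate": math.sqrt(sse / (n - 2)),  # assumes n - 2 d.f.
        "mpe": 100.0 * sum(e / y for e, y in zip(residuals, ys)) / n,
        "mape": 100.0 * sum(abs(e / y) for e, y in zip(residuals, ys)) / n,
        "mse": sse / n,
    }

# Hypothetical data (dependent values must be nonzero for MPE and MAPE)
stats = regression_stats([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 4.0, 5.0, 4.0, 5.0])
```

For these values r-squared is 0.6, so 60% of the variance in the dependent variable is accounted for by the regression line.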
Using the regression equation, the variable on the y axis may be predicted from the score on the x axis. The slope of the regression line (b) is defined as the rise divided by the run. The y intercept (a) is the point on the y axis where the regression line would cross the y axis. The slope and y intercept are incorporated into the regression equation as:
y = a + bx
The significance of the slope of the regression line is determined from the Student's t statistic. It is the probability that the observed correlation coefficient occurred by chance if the true correlation is zero. StatPac uses a two-tailed test to derive this probability from the t distribution. Probabilities of .05 or less are generally considered significant, implying that there is a relationship between the two variables. Although StatPac does not calculate the F statistic, it is simply the square of the t statistic for the slope.
Data Table
The data table provides a detailed method to examine the errors between the predicted and actual values of the dependent variable. StatPac allows printing of the table to more closely study the residuals. Typing the option DT=Y will cause the output to include a data table.
Example of a Data Table
Outlier Definition and Adjustment
Outliers (extreme data points) can have dramatic effects on the slope of the regression line. One method to deal with outliers is to use robust regression (TX=I or TY=I). There are two other common methods to deal with outliers. The first is to simply eliminate any records that contain an outlier and then rerun the regression without those data records. The other method is known as data trimming, where the highest and lowest extreme values are replaced with a value that limits the standardized residual to a predetermined value.
The OA option is used to set the outlier adjustment method. It may be set to OA=N (none), OA=D (delete), or OA=A (adjust). Both methods use a two-step process. First the regression is performed using the actual values for the dependent variable and standardized residuals are calculated for each predicted value. When a standardized residual exceeds a given z-value, the record is flagged. Then the regression is run again and the flagged records are either eliminated (OA=D), or the value of the dependent variable is adjusted to the value defined by the outlier definition z-value (OA=A).
For example, if the outlier definition is set to 1.96 standard deviations (OD=1.96), the records whose standardized residuals fall in the upper and lower two and a half percent would be flagged. Then the dependent variables for the flagged records would be modified to a value that would produce a standardized residual of plus or minus 1.96. Finally, the regression would be rerun using the modified dependent variable values for the flagged records. Flagged data records will be shown with an asterisk in the data table.
It is important to note that the outlier adjustment process is only performed once because each regression would produce a new set of standardized residuals that would exceed the outlier definition value (OD=z). That is, any set of data with sufficient sample size will yield a set of outliers, even if the data has already been adjusted.
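The two-step process described above can be sketched as follows. This is a minimal illustration, not StatPac's implementation. It assumes that a standardized residual is the residual divided by the standard error of estimate (computed with n - 2 degrees of freedom); the function names and data are hypothetical, and the od and oa arguments simply mirror the OD and OA options.

```python
import math

def fit_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b

def adjust_outliers(xs, ys, od=1.96, oa="A"):
    """One pass of outlier handling: oa='D' deletes, oa='A' adjusts."""
    if oa not in ("D", "A"):
        raise ValueError("oa must be 'D' or 'A'")
    # Step 1: regression on the actual values, then standardized residuals
    a, b = fit_line(xs, ys)
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    see = math.sqrt(sse / (len(xs) - 2))   # assumed n - 2 degrees of freedom
    flagged, new_x, new_y = [], [], []
    for i, (x, y) in enumerate(zip(xs, ys)):
        predicted = a + b * x
        z = (y - predicted) / see
        if abs(z) > od:
            flagged.append(i)
            if oa == "D":
                continue                   # drop the record
            # Pull y in so its standardized residual becomes +/- od
            y = predicted + math.copysign(od * see, z)
        new_x.append(x)
        new_y.append(y)
    # Step 2: rerun the regression once on the modified data
    a2, b2 = fit_line(new_x, new_y)
    return a2, b2, flagged

# Hypothetical data: y = 2x except for one extreme value at x = 5
x = [float(v) for v in range(1, 11)]
y = [2.0 * v for v in x]
y[4] = 30.0

a_del, b_del, flagged = adjust_outliers(x, y, od=1.96, oa="D")
a_adj, b_adj, _ = adjust_outliers(x, y, od=1.96, oa="A")
```

In this example only the fifth record is flagged. Deleting it recovers the slope of 2, while adjusting it moves the slope only part of the way back.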
Allowing the outlier adjustment process to repeat indefinitely would eventually result in an adjustment to nearly every data record.
When outlier adjustment is used, the program will also report adjusted means and standard deviations for the dependent variable. This refers to the recalculated mean after deleting or adjusting the data.
Confidence Intervals & Confidence Level
Confidence intervals provide an estimate of variability around the regression line. Narrow confidence intervals indicate less variability around the regression line. The option CI=Y will include the confidence intervals in the data table.
Prediction intervals, rather than confidence intervals, should be used if you intend to use the regression information to predict new values for the dependent variable. Both the confidence intervals and the prediction intervals are centered on the regression line, but the prediction intervals are much wider. The option CI=P will print the prediction intervals in the data table.
The actual confidence or prediction interval is set with the CL option. The CL option specifies the percentage level of the interval. For example, if CI=P and CL=95, the 95% prediction intervals would be printed in the data table.
Residual Autocorrelation Function Table
Examining the autocorrelation of the residuals is often used in time-series analysis to evaluate how well the regression worked. It is a way of looking at the "goodness-of-fit" of the regression line. If the residuals contain a pattern, the regression did not do as well as we might have desired. A residual autocorrelation table is the correlation between values that occur at various time lags. For example, at time lag one, you are looking at the correlation between adjacent values; at time lag two, you are looking at the correlation between every other value, etc. To select the residual autocorrelation function table, type the option AC=Y.
Example of a Residual Autocorrelation Function Table
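As a rough illustration of what a residual autocorrelation table contains, the sketch below computes the correlation of a residual series with itself at each time lag. It uses a common sample autocorrelation estimator, which is an assumption here; StatPac's exact formula is not given in this manual. The residual values are hypothetical.

```python
def autocorrelation(values, max_lag):
    """Sample autocorrelation of `values` at lags 1..max_lag."""
    n = len(values)
    mean = sum(values) / n
    denom = sum((v - mean) ** 2 for v in values)
    result = []
    for lag in range(1, max_lag + 1):
        # Products of deviations for pairs of values `lag` steps apart
        num = sum((values[t] - mean) * (values[t + lag] - mean)
                  for t in range(n - lag))
        result.append(num / denom)
    return result

# Hypothetical residuals with an obvious alternating pattern
residuals = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
acf = autocorrelation(residuals, 2)
```

The alternating residuals give a strongly negative value at lag one and a strongly positive value at lag two, which is the kind of pattern that suggests the regression has left structure unexplained.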
Expanding Standard Error
You may use the EX option to set the standard error limits of the residual autocorrelation function to a fixed value or to expand with increasing time lags. A study of sampling distributions on autocorrelated time series was made by When EX=Y, the residual autocorrelation function error limits will widen with each successive time lag. If EX=N, the standard error limits will remain constant.
Force Constant to Zero
The option CZ=Y can be used to calculate a regression equation with the constant equal to zero. If this is done, the regression line is forced through the origin. Note that forcing the constant to zero disables calculation of the correlation coefficient, r-squared, and the standard error of estimate. For this reason, it is not possible to set the transformation parameter for either the independent or dependent variable to automatic, because there are no r-squared statistics to compare. Furthermore, confidence intervals, which are calculated from the standard error of estimate, cannot be computed. The option CZ=N results in a standard regression equation.
Save Results
Many times researchers want to save results for future study. By using the option SR=Y, the predicted values, residuals and confidence or prediction intervals can be saved so they can be merged with the original data file. At the completion of the analysis, you will be given the opportunity to merge the predictions and residuals.
Predict Interactively
When performing a regression, predicting values for the dependent variable for specific values of the independent variable may be desired. This is known as interactive prediction. Select interactive prediction by entering the option PR=Y. After the completion of the tabular outputs, the user will be prompted to enter a value for the independent variable. The program will predict the value for the dependent variable based on the regression equation. Confidence and/or prediction intervals will also be given.
Labeling and Spacing Options
Multiple regression is an extension of simple regression. It examines the relationship between a dependent variable and two or more explanatory variables (also called independent or predictor variables). Multiple regression is used to:
1. Predict the value of a dependent variable using some or all of the independent variables. The aim is generally to explain the dependent variable accurately with as few independent variables as possible.
2. Examine the influence and relative importance of each independent variable on the dependent variable. This involves looking at the magnitude and sign of the standardized regression coefficients as well as the significance of the individual regression coefficients.
The syntax of the command to run a stepwise regression is:
STEPWISE <Dependent variable> <Independent variable list>
For example, we might try to predict annual income (V1=INCOME) from age (V2=AGE), number of years of school (V3=SCHOOL), and IQ score (V4=IQ). The command to run the regression could be specified in several different ways:
STEPWISE INCOME, AGE, SCHOOL, IQ
STEPWISE V1,V2,V3,V4
ST INCOME V2-V4     (Note: STEPWISE may be abbreviated as ST)
STEPWISE V1-V4
In each example, the dependent variable was specified first, followed by the independent variable list. The variable list itself may contain up to 200 independent variables and can consist of variable names and/or variable numbers. Either a comma or a space can be used to separate the variables from each other. The multiple regression equation is similar to the simple regression equation. The only difference is that there are several predictor variables and each one has its own regression coefficient. The multiple regression equation is:
Y' = a + b1 x1 + b2 x2 + b3 x3 + ... + bn xn
where:
Y' is the predicted value
a is a constant
b1 is the estimated regression coefficient for variable 1
x1 is the score for variable 1
b2 is the estimated regression coefficient for variable 2
x2 is the score for variable 2
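To make the equation concrete, the short sketch below evaluates it for one record. The constant, coefficients and scores are invented for illustration and do not come from any StatPac output.

```python
def predict(constant, coefficients, scores):
    """Evaluate Y' = a + b1*x1 + b2*x2 + ... + bn*xn for one record."""
    if len(coefficients) != len(scores):
        raise ValueError("need one score per coefficient")
    return constant + sum(b * x for b, x in zip(coefficients, scores))

# Hypothetical equation for INCOME from AGE, SCHOOL and IQ
a = 1000.0
b = [200.0, 1500.0, 50.0]    # b1, b2, b3
x = [40.0, 16.0, 110.0]      # age 40, 16 years of school, IQ 110

income = predict(a, b, x)    # 1000 + 8000 + 24000 + 5500
```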
Descriptive Statistics
The mean and standard deviations for all the variables in the equation can be printed with the DS=Y option.
Example of a Descriptive Statistics Printout
Regression Statistics
The regression statistics can be selected with option RS=Y. They give us an overall picture of how successful the regression was.
The coefficient of multiple determination, frequently referred to as r-squared, can be interpreted directly as the proportion of variance in the dependent variable that can be accounted for by the combination of predictor variables. A coefficient of multiple determination of .85 means that 85 percent of the variance in the dependent variable can be explained by the combined effects of the independent variables; the remaining 15 percent would be unexplained.
The coefficient of multiple correlation is the square root of the coefficient of multiple determination. Its interpretation is similar to the simple correlation coefficient. It is basically a measure of association between the predicted value and the actual value.
The standard error of the multiple estimate provides an estimate of the standard deviation. It is used in conjunction with the inverted matrix to calculate confidence intervals and statistical tests of significance.
When there are fewer than 100 records, StatPac will apply an adjustment to the above three statistics, and the adjusted value will be printed. The adjustment is for a small n and its value should be used.
The variability of the dependent variable is made up of variation produced by the joint effects of the independent variables and some unexplained variance. The overall F-test is performed to determine the probability that the true coefficient of multiple determination is zero. Typically, a probability of .05 or less leads us to reject the hypothesis that the regression equation does not improve our ability to predict the dependent variable.
Example of the Regression Statistics Printout
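The relationship between r-squared and the overall F statistic can be sketched with the standard textbook formulas. These are offered as an assumption for illustration; the manual does not state the exact formulas StatPac uses, and in particular the small-sample adjustment shown here is the commonly used adjusted r-squared, which may not match StatPac's adjustment. The sample size and number of predictors are hypothetical.

```python
def overall_f(r_squared, n, k):
    """Overall F for a regression with k predictors and n records."""
    explained = r_squared / k
    unexplained = (1.0 - r_squared) / (n - k - 1)
    return explained / unexplained

def adjusted_r_squared(r_squared, n, k):
    """Common small-sample adjustment to r-squared (assumed formula)."""
    return 1.0 - (1.0 - r_squared) * (n - 1) / (n - k - 1)

# Hypothetical run: r-squared of .85 with 3 predictors and 30 records
f_value = overall_f(0.85, n=30, k=3)
adj = adjusted_r_squared(0.85, n=30, k=3)
```

The F value would then be compared with the F distribution on k and n - k - 1 degrees of freedom to obtain the probability.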
Regression Coefficients
The regression coefficients can be printed with the RC=Y option. The output includes the constant, coefficient, beta weight, F-ratio, probability, and standard error for each independent variable.
Each coefficient provides an estimate of the effect of that variable (in the units of the raw score) for predicting the dependent variable. The beta weights, on the other hand, are the standardized regression coefficients and represent the relative importance of each independent variable in predicting the dependent variable.
The F-ratio allows us to calculate the probability that the influence of the predictor variable occurred by chance. The t-statistic for each independent variable is equal to the square root of its F-ratio. The standard error of the ith regression coefficient can be used to obtain confidence intervals about each regression coefficient in conjunction with its F-ratio.
Example of Regression Coefficients Printout
Simple Correlation Matrix
After performing a regression analysis, it is a good idea to review the simple correlation matrix (SC=Y). If two variables are highly correlated, it is possible that the matrix is not well conditioned and it might be beneficial to run the regression again without one of the variables. If the coefficient of multiple determination does not show a significant change, you might want to leave the variable out of the equation.
Example of a Simple Correlation Printout
Partial Correlation Matrix
The partial correlation matrix (often called the variance-covariance matrix) is obtained from the inverse of the simple correlation matrix. It can be selected with the option PC=Y. This is useful in studying the correlation between two variables while holding all the other variables constant. A significant partial correlation between variables A and B would be interpreted as follows: When all other variables are held constant, there is a significant relationship between A and B. The partial correlation matrix will be printed for those variables remaining in the equation after the stepwise procedure.
Example of a Partial Correlation Matrix Printout
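For the special case of three variables, the partial correlation between A and B with C held constant can be computed directly from the three simple correlations. The standard first-order formula is used in the sketch below; it gives the same value as the matrix-inverse route for three variables. The correlation values are hypothetical.

```python
import math

def partial_correlation(r_ab, r_ac, r_bc):
    """Correlation between A and B with C held constant."""
    numerator = r_ab - r_ac * r_bc
    denominator = math.sqrt((1.0 - r_ac ** 2) * (1.0 - r_bc ** 2))
    return numerator / denominator

# Hypothetical simple correlations among A, B and C
r_ab_given_c = partial_correlation(r_ab=0.6, r_ac=0.5, r_bc=0.4)

# If C is unrelated to both A and B, holding it constant changes nothing
unchanged = partial_correlation(r_ab=0.6, r_ac=0.0, r_bc=0.0)
```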
Inverted Correlation Matrix
The solution to a multiple regression problem is obtained through a technique known as matrix inversion. The inverted correlation matrix is the inversion of the simple correlation matrix. It may be selected with the option IC=Y.
In examining the inverted matrix, we are specifically interested in the values along the diagonal. They provide a measure of how successful the matrix inversion was. If all the values on the diagonal are close to one, the inversion was very successful and we say the matrix is "well conditioned". If, however, we have one or more diagonal values that are high (greater than ten), we have a problem with collinearity (high correlations between independent variables).
Example of an Inverted Matrix Printout
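The effect of collinearity on the diagonal can be seen with a small sketch. It inverts two hypothetical correlation matrices with a plain Gauss-Jordan routine (written here for illustration; it is not StatPac's algorithm) and reads off the diagonal values.

```python
def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination."""
    n = len(matrix)
    # Augment the matrix with the identity matrix
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def diagonal(matrix):
    return [matrix[i][i] for i in range(len(matrix))]

# Hypothetical correlations between two independent variables
well_conditioned = diagonal(invert([[1.0, 0.10], [0.10, 1.0]]))
collinear = diagonal(invert([[1.0, 0.96], [0.96, 1.0]]))
has_problem = any(d > 10.0 for d in collinear)
```

With a correlation of .10 the diagonal values are very close to one. With a correlation of .96 they rise above ten, which is the warning sign described above.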
Print Each Step
You can print the statistics for each step of the stepwise procedure using the option PS=Y. This may be important when you want to study how the inclusion or deletion of a variable affects the other variables.
Example of the Print Steps Output
Summary Table
A good way to get an overview of how the steps proceeded and what effect each step had upon the r-squared is to print a summary table. To print the summary table, use the option ST=Y.
Example of a Summary Table
Data Table
The data table provides a detailed method to examine the residuals. StatPac allows printing of the table to more closely study the residuals. Using the option DT=Y will cause the output to include a data table.
A "residual" is the difference between the observed value and the predicted value for the dependent variable (the error in the prediction). The standardized residuals which appear in the data table are the residuals divided by the standard error of the multiple estimate. Therefore, the standardized residuals are in standard deviation units. In large samples, we would expect 95 percent of the standardized residuals to lie between -1.96 and +1.96.
Example of a Data Table
Outlier Definition and Adjustment
Outliers (extreme data points) can have a dramatic effect on the stability of a multiple regression model. There are two common methods to deal with outliers in multiple regression models. The first is to simply eliminate any records that contain an outlier and then rerun the regression without those data records. When using OA=D (setting the outlier adjustment to delete), records containing the highest and lowest extreme residuals are deleted from the analysis. The other method is where the dependent variable is adjusted for records with the highest and lowest extreme residuals. That is, the dependent variable is modified to a value that limits the standardized residual to a predetermined value.
The OA option is used to set the outlier adjustment method. It may be set to OA=N (none), OA=D (delete), or OA=A (adjust). Both methods use a two-step process. First the regression is performed using the actual values for the dependent variable and standardized residuals are calculated for each predicted value. When a standardized residual exceeds a given z-value, the record is flagged. Then the regression is run again and the flagged records are either eliminated (OA=D), or the value of the dependent variable is adjusted to the value defined by the outlier definition z-value (OA=A).
For example, if the outlier definition is set to 1.96 standard deviations (OD=1.96), the records whose standardized residuals fall in the upper and lower two and a half percent would be flagged. Then the dependent variables for the flagged records would be modified to a value that would produce a standardized residual of plus or minus 1.96. Finally, the regression would be rerun using the modified dependent variable values for the flagged records. Flagged data records will be shown with an asterisk in the data table.
The stepwise procedure presents a problem for data trimming. The stepwise procedure often reduces the number of independent variables to a subset of the original list of independent variables.
Data trimming involves using the standardized residuals to adjust the value of the dependent variable for some records. Rerunning the stepwise procedure with different values for some of the dependent variables could result in a different set of independent variables being stepped into the model, especially when there are highly correlated independent variables. To avoid this problem, StatPac reruns the multiple regression using the same independent variables selected in the first stepwise procedure. These variables are forced into the model so that the analysis runs in the non-stepwise mode.
It is important to note that the outlier adjustment process is only performed once because each regression would produce a new set of standardized residuals that would exceed the outlier definition value (OD=z). That is, any set of data with sufficient sample size will yield a set of outliers, even if the data has already been adjusted. Allowing the outlier adjustment process to repeat indefinitely would eventually result in an adjustment to nearly every data record.
It is suggested that the user actually examine records that are flagged as extreme outliers before allowing the program to make any adjustments. Outlier adjustments assume that the data for all independent variables is acceptable. A mispunched data value for an independent variable could result in an extreme prediction that gets flagged as an outlier. Therefore, visual inspection is the best way to guarantee the successful handling of outliers.
Confidence Intervals & Confidence Level
Confidence intervals provide an estimate of variability around the regression line. Narrow confidence intervals indicate less variability around the regression line. The option CI=Y will include the confidence intervals in the data table.
Prediction intervals, instead of confidence intervals, should be used if you intend to use the regression information to predict new values for the dependent variable.
Both the confidence intervals and the prediction intervals are centered on the regression line, but the prediction intervals are much wider. The option CI=P will print the prediction intervals in the data table.
The actual confidence or prediction interval is set with the CL option. The CL option specifies the percentage level of the interval. For example, if CI=P and CL=95, the 95% prediction intervals would be printed in the data table.
Number of Variables to Force
The ability to force variables into an equation is important for several reasons:
1. A researcher often wishes to replicate the analysis of another study and, therefore, to force certain core variables into the equation, letting stepwise regression choose from the remaining set.
2. Some variables may be cheaper or easier to measure, and the user may want to see whether the remaining variables add anything to the equation.
3. It is common to force certain design variables into the equation.
4. When independent variables are highly correlated, one of them may be more accurate than the rest, and you may want to force this variable into the equation.
The FO option specifies the number of variables to force into the regression equation. To perform a standard (non-stepwise) multiple regression, set the FO option to the number of independent variables or higher. FO=200 will always force all independent variables into the equation. Thus, the FO option may be used to eliminate the stepwise part of the multiple regression procedure. If you force all variables into the equation, the multiple regression will contain only one step, where all variables are included in the equation.
The variables to be forced are taken in order from the list of independent variables. For instance, the option FO=3 forces the first three variables from the list of independent variables. Therefore, any variables you want to force should be specified at the beginning of the independent variable list.
F to Enter & F to Remove
When faced with a large number of possible explanatory variables, two opposed criteria of selecting a regression equation are usually involved:
1. To make the equation useful for predictive purposes, we would like our model to include as many of the independent variables as possible so that reliable fitted values can be determined.
2. Because of the costs involved in obtaining information on a large number of independent variables, and subsequently monitoring them, we would like the equation to include as few of the independent variables as possible.
The compromise between these two extremes is generally called "selecting the best regression". This involves multiple executions of multiple regression in an attempt to add variables to improve prediction or remove variables to simplify the regression function. Stepwise regression provides a partial automation of this procedure.
An important property of the stepwise procedure is based on the fact that a variable may be indicated to be significant in an early stage, and, thus, be entered in the equation. After several other variables are added to the regression equation, however, the initial variable may be indicated to be insignificant. The combined effects of two or more independent variables capture the same variance as a variable entered early on in the stepwise process. This method is often referred to as forward inclusion with backward elimination.
The algorithm used by StatPac is as follows:
1. First, enter into the regression equation all variables which the user wishes to force into the equation.
2. Enter the predictor that produces the greatest decrease in the residual sum of squares from all remaining predictors whose entry is not inhibited by the F-to-enter.
3. Remove the predictor that makes the least increase in the residual sum of squares from all (non-forced) predictors whose removal is not inhibited by the F-to-remove inhibiting rule.
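The three steps above can be sketched in a few dozen lines. This is an illustrative reading of the algorithm, not StatPac's code. It assumes the usual partial F statistic (change in the residual sum of squares divided by the residual mean square), tries a removal before an entry on every pass, and assumes the data are not fitted perfectly (a zero residual sum of squares would cause a division by zero). All function names and data are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a*x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            m[r] = [v - factor * w for v, w in zip(m[r], m[col])]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def rss(columns, y, subset):
    """Residual sum of squares of y on a constant plus the chosen columns."""
    rows = [[1.0] + [columns[j][i] for j in subset] for i in range(len(y))]
    k = len(subset) + 1
    xtx = [[sum(r[p] * r[q] for r in rows) for q in range(k)] for p in range(k)]
    xty = [sum(r[p] * yi for r, yi in zip(rows, y)) for p in range(k)]
    coef = solve(xtx, xty)
    return sum((yi - sum(c * v for c, v in zip(coef, r))) ** 2
               for r, yi in zip(rows, y))

def stepwise(columns, y, forced=0, f_enter=4.0, f_remove=3.9, max_steps=100):
    """Return the indexes of the variables left in the equation."""
    n = len(y)
    selected = list(range(forced))            # step 1: forced variables
    for _ in range(max_steps):
        current = rss(columns, y, selected)
        removable = selected[forced:]         # forced variables always stay
        if removable:                         # step 3: try a removal first
            reduced = {j: rss(columns, y, [s for s in selected if s != j])
                       for j in removable}
            worst = min(reduced, key=reduced.get)
            f = (reduced[worst] - current) / (current / (n - len(selected) - 1))
            if f < f_remove:
                selected.remove(worst)
                continue
        candidates = [j for j in range(len(columns)) if j not in selected]
        df = n - len(selected) - 2
        if candidates and df > 0:             # step 2: try an entry
            enlarged = {j: rss(columns, y, selected + [j]) for j in candidates}
            best = min(enlarged, key=enlarged.get)
            f = (current - enlarged[best]) / (enlarged[best] / df)
            if f > f_enter:
                selected.append(best)
                continue
        break                                 # neither step is possible
    return selected

# Hypothetical data: y depends on the first variable only
x0 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
x1 = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0, 5.0, 3.0]
y = [2.1, 3.8, 6.15, 7.9, 10.05, 11.85, 14.2, 15.95, 18.1, 19.9]

chosen = stepwise([x0, x1], y)
all_forced = stepwise([x0, x1], y, forced=2)
```

With these values only the first variable is stepped into the equation. Forcing both variables skips the stepwise selection entirely.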
Note that step 2 is executed only when it is not possible to execute step 3. If neither can be executed, the stepping is complete.
The following should be considered when setting F-to-enter and F-to-remove values in the parameter table:
1. A variable is removed if the F-value associated with that variable is less than the F-to-remove value set in the parameter table. Similarly, a variable is added if the F-value associated with that variable would be greater than the F-to-enter value set in the parameter table if that variable were entered in the current equation.
2. Care should be taken to ensure that the F-to-remove be less than the F-to-enter; otherwise, a variable would be entered and then removed at alternate steps.
3. The default values for the F-to-enter and F-to-remove for many mainframe packages and StatPac are 4.0 and 3.9, respectively, which provide useful starting values.
4. Forcing all variables in an equation will give the usual (non-stepwise) multiple regression results.
5. Setting the F-to-remove value low yields the forward inclusion method.
6. For the first run on a data set, it is common to set the F-to-enter and F-to-remove values low to execute a large number of steps.
Residual Autocorrelation Function Table
Examining the autocorrelation of the residuals is often used in time-series analysis to evaluate how well the regression worked. It is a way of looking at the "goodness-of-fit" of the regression line. If the residuals contain a pattern, the regression did not do as well as we might have desired. A residual autocorrelation function table contains the correlation between values that occur at various time lags. For example, at time lag one, you are looking at the correlation between adjacent values; at time lag two, you are looking at the correlation between every other value, etc. To select the residual autocorrelation function table, type the option AC=Y.
Example of a Residual Autocorrelation Function Table
Expanding Standard Error
You may use the EX option to set the standard error limits of the residual autocorrelation function to a fixed value or to expand with increasing time lags. A study of sampling distributions on autocorrelated time series was made by [...]. When EX=Y, the residual autocorrelation function error limits will widen with each successive time lag. If EX=N, the standard error limits will remain constant.
Save Residuals
Researchers often want to save the residuals in a file for further study. If further analysis of the residuals shows a pattern, the regression may not have captured all the variance it might have, and we may want to model the residuals to further explain the variance. You can save the results in a file with the option SR=Y. The dependent variables, predicted values, residuals and confidence or prediction intervals will be saved, and at the completion of the procedure you will be offered the opportunity to merge the saved data into the original data file.
Force Constant to Zero
The option CZ=Y can be used to calculate a regression equation with the constant equal to zero. If this is done, the regression line is forced through the origin. Note that forcing the constant to zero disables calculation of the r-squared, coefficient of multiple correlation and the standard error of estimate. Furthermore, confidence intervals, which are calculated from the standard error of estimate, cannot be computed. The option CZ=N results in a standard regression equation.
Mean Substitution
Mean substitution is one method often used to reduce the problem of missing information. Multiple regression research is often difficult because if one independent variable is not known, it is necessary to exclude the whole record from the analysis. This can substantially reduce the number of records that are included in the analysis. Mean substitution overcomes this problem by replacing any missing independent variable with the mean of that variable. While this technique has the possibility of slightly distorting the results, it can make it possible to perform a regression with substantial missing data.
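Mean substitution itself is simple to illustrate. The Python sketch below replaces each missing independent value with the mean of the known values of that variable; the use of None as the missing-value marker and the sample records are assumptions made for the example.

```python
def mean_substitute(records):
    """Return a copy of records with None replaced by the column mean.

    records: list of rows, one value per independent variable;
             None marks a missing value (an assumption of this sketch).
    """
    n_vars = len(records[0])
    means = []
    for j in range(n_vars):
        known = [row[j] for row in records if row[j] is not None]
        if not known:
            raise ValueError("variable %d has no known values" % (j + 1))
        means.append(sum(known) / len(known))
    return [[means[j] if row[j] is None else row[j] for j in range(n_vars)]
            for row in records]


# Hypothetical data: three records, two independent variables.
data = [[1.0, 10.0],
        [None, 20.0],
        [3.0, None]]
filled = mean_substitute(data)
print(filled)  # [[1.0, 10.0], [2.0, 20.0], [3.0, 15.0]]
```

Without substitution only the first record would be usable; with it, all three records stay in the analysis.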
Steps Limit
The steps limit is simply the maximum number of steps that can occur. Each inclusion and deletion of a variable is counted as one step. The purpose of the steps limit is to limit computer time. The syntax for the steps limit option is SL=n, where n is the maximum number of steps allowed.
Predict Interactively
After performing a regression, you may want to predict values for the dependent variable. This is known as interactive prediction. Select interactive prediction by entering the option PR=Y. You will then be prompted to enter a value for each independent variable, and the computer will use the regression coefficients to predict the dependent variable.
Labeling and Spacing Options
Probit and logistic regression analyses examine the relationship between a dichotomous dependent variable (one that takes on only two values) and one or more explanatory variables (also called independent or predictor variables). When the dichotomous dependent variable is coded as 0 or 1, its predicted value from probit or logistic regression is the estimated probability of it being 1. Probit and logistic regressions are often used to answer yes/no type questions. For example, a banker wants to decide whether or not to make a loan, or a scientist wants to predict whether a rat lives or dies. Both questions are yes/no and could be coded as zero or one.
Deciding between probit and logistic regression is a matter of choice. In logistic regression, the estimated value of the dependent variable is based on the cumulative logistic distribution, and in probit regression it is based on the cumulative normal distribution. The cumulative logistic distribution is scarcely distinguishable from the cumulative normal distribution between response rates of .01 and .99, and therefore the choice of probit or logistic regression is usually made on the basis of which technique the user is most familiar with. The syntax of the commands and options to run probit or logistic regression is identical:
PROBIT <Dependent variable> <Independent variable list>
LOGIT <Dependent variable> <Independent variable list>
For example, a banker might want to predict successful loan repayment (V4=LOAN PAYBACK) from previous loan experience (V1=EXPERIENCE), number of credit cards (V2=CREDIT CARDS), and bank balance (V3=BALANCE). The command to run the regression could be specified in several ways.
PROBIT LOAN PAYBACK, EXPERIENCE, CREDIT CARDS, BALANCE
PROBIT V4, V1-V3 (Note: PROBIT may be abbreviated as PR)
LOGIT LOAN PAYBACK V1-V3
LO V4 V1 V2 V3 (Note: LOGIT may be abbreviated as LO)
In each example, the dependent variable is specified first, followed by the independent variable list. The variable list may contain up to 200 independent variables and can consist of variable names and/or numbers. Either a comma or a space can be used to separate the variables from each other.
When using probit or logistic regression, the dependent variable is always coded as 0 or 1. Regular multiple linear regression, with the dependent variable coded as 0 or 1, is inappropriate for the following reasons:
1. Estimated probabilities using multiple regression are not restricted to the interval (0,1). Using multiple linear regression, it is quite possible to get an estimated probability of greater than one or less than zero. It would be difficult to interpret this as an estimated probability. Unfortunately, it is quite common for 10% to 20% of the estimated probabilities to lie outside the unit interval when employing multiple regression with a (0,1) dependent variable.
2. Estimated probabilities using multiple regression are exceptionally sensitive to the observed distribution of the dependent variable (i.e., a very small or very large mean for the dependent variable).
3. Standard multiple regression assumes that the effect of the independent variables is constant over the entire range of the predicted dependent variable. Probit and logistic regression, on the other hand, assume that the effects of the independent variables vary (i.e., nonlinear multiple regression).
In summary, there are two assumptions we make when using probit or logistic regression analysis:
1. The dependent variable of a record is assumed to be most flexible when its estimated probability is near one-half (i.e., the effect of an independent variable is expected to be highest when the estimated probability is one-half). In cases where the outcome of the event seems certain (e.g., P<.1 or P>.9), the explanatory variables have a smaller impact on changing the probability than in cases where the outcome is less certain. If the probability of an event is .9, we are in a stage of diminishing returns to increasing its probability.
2. The effect of an independent variable depends on the estimated probability.
Descriptive Statistics
Descriptive statistics will be printed as part of the probit or logistic regression output if the option DS=Y is specified.
Example of Descriptive Statistics Printout
The example is an analysis of cancer remission data. The objective of the analysis is to assess the probability of complete cancer remission (REMISSION) from 3 patient characteristics (CELL, LITHIUM and TEMPERATURE). The output reveals that there were 9 cases with cancer remission (DV=1) and 18 cases without cancer remission (DV=0), for a total of 27 cases. From the descriptive statistics, we can see that REMISSION appears associated with high mean values for CELL and LITHIUM, and a low mean value for TEMPERATURE.
Simple Correlation Matrix
The simple correlation matrix can be requested with the option SC=Y. The simple correlation output can be used to examine the relationships between the independent variables.
Example of a Simple Correlation Matrix Printout
Regression Information
The regression coefficients and their standard errors are automatically included in each analysis. The output also includes the t statistic and its probability. The t statistic for an independent variable is its coefficient divided by its standard error.
Example of a Regression Information Printout
The chi-square statistic is used to measure the overall significance of the equation in explaining the dependent variable. This statistic is equivalent to the overall F-ratio in multiple regression and tests whether the set of independent variables as a group contributes significantly to the explanation of the dependent variable.
Estimates of the regression coefficients are obtained by maximizing the log of the likelihood function using the iterative Newton-Raphson method of scoring. Convergence is said to have occurred if the change in the log of the likelihood function on successive iterations is less than or equal to the tolerance level set in the parameter file. The tolerance level can be set between .001 and .000000001.
Change in Probability Table
The change in probability table may be selected with the option PT=Y. As indicated earlier, the independent variable has its maximum effect when the estimated probability is one-half. Using the change in probability table, we can study how the probability changes when there is a change in the value of an independent variable.
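One common way to express this kind of change is the slope of the fitted probability curve with respect to an independent variable, evaluated at a chosen probability. The Python sketch below uses the textbook slopes for the logistic and normal curves; it is an illustration under that assumption, not a description of how StatPac builds its table, and the coefficient value 2.0 is hypothetical.

```python
from statistics import NormalDist


def logit_change(coefficient, probability):
    """Slope of the logistic probability curve at the given probability."""
    return coefficient * probability * (1.0 - probability)


def probit_change(coefficient, probability):
    """Slope of the cumulative normal probability curve at the given probability."""
    standard_normal = NormalDist()
    z = standard_normal.inv_cdf(probability)  # score that yields this probability
    return coefficient * standard_normal.pdf(z)


# Hypothetical coefficient; compare the effect at P = .5 and P = .9.
b = 2.0
for p in (0.5, 0.9):
    print(p, round(logit_change(b, p), 4), round(probit_change(b, p), 4))
```

Under both curves the slope is largest at a probability of one-half and shrinks as the probability approaches zero or one, which is the diminishing effect described in this section.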
Example of a Change in Probability Table
The table reveals how a one unit increase in each independent variable will affect the probability of the dependent variable. In the sample printout, note that if the estimated probability of cancer remission for an individual is .5, a one unit increase in the independent variable V3 is expected to increase the predicted probability by .89812. For that same individual, an increase of .1 in V3 would be expected to increase the predicted probability of REMISSION by .89812 times .1 = .089812. If the estimated probability of cancer remission for an individual is .9, a .1 increase in V3 is expected to increase the predicted probability by only .039501 (.39501 times .1). The first column in the "Change in Probability" table is always the effect of the independent variable evaluated at the sample mean of the dependent variable (.3333 for this example).
Classification Table
The classification table may be selected with the option CT=Y. It gives the frequency distribution of the observed value of the dependent variable (0 or 1) versus its predicted value based on the set of independent variables. If the dependent variable is well explained by the set of independent variables, we expect:
1. The frequencies in the first row of the table (observed value of DV=0) to be clustered in the first few columns.
2. The frequencies in the last row of the table (observed value of DV=1) to be clustered in the last few columns.
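A table of this kind can be sketched as a cross-tabulation of the observed value against bands of predicted probability. In the Python sketch below, the choice of ten equal-width bands as columns is an assumption made for illustration (the actual column layout of the StatPac printout is not described here), and the data are hypothetical.

```python
def classification_table(observed, predicted, n_bands=10):
    """Count cases by observed value (row 0 or 1) and predicted-probability band."""
    table = [[0] * n_bands for _ in range(2)]
    for y, p in zip(observed, predicted):
        # A probability of exactly 1.0 is placed in the last band.
        band = min(int(p * n_bands), n_bands - 1)
        table[y][band] += 1
    return table


# Hypothetical observed values and predicted probabilities.
observed = [0, 0, 0, 1, 1]
predicted = [0.05, 0.15, 0.45, 0.85, 0.95]
table = classification_table(observed, predicted)
print("DV=0:", table[0])
print("DV=1:", table[1])
```

In this small example the DV=0 counts fall in the low-probability bands and the DV=1 counts fall in the high-probability bands, which is the pattern expected when the independent variables explain the dependent variable well.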
Example of a Classification Table Printout
Mean Substitution
Mean substitution is one method often used to reduce the problem of missing information. Regression analysis is often difficult because if one independent variable is not known, it is necessary to exclude the whole record from the analysis. This can substantially reduce the number of records that are included in the analysis. Mean substitution overcomes this problem by replacing any missing independent variable with the mean of that variable. While this technique has the possibility of slightly distorting the results, it can make it possible to perform a regression with substantial missing data.
Convergence Tolerance
The convergence tolerance is used to find the maximum log of the likelihood function. It may be set using the option TL=n, where n is the convergence tolerance. A good initial value to use is .0000001. The value of the convergence tolerance is very important. Too high a value does not result in the maximum of the likelihood function, while too small a value results in an iterative procedure which drifts about the maximum. If this is the case, the program will start halving, working towards the previous (higher) value of the log of the likelihood function. A message will be printed as follows:
Results are not based on last iteration but on a previous iteration which had a higher value for the log likelihood function.
Convergence assumed after x iterations
The above message will also be printed if the iterations don't converge. This is usually due to one (or more) of the following reasons:
1. Some of the explanatory variables are highly correlated. Examination of the correlation matrix in conjunction with collinearity diagnostics (such as those found in principal components and multicollinearity analyses) will usually indicate a variable which should have been omitted or transformed.
2. The response surface is very flat; this is usually due to the variables as a group being poor predictors of the dependent variable. There is then no significant maximum to find.
3. A variable may have very little variability and, therefore, be highly correlated with the constant term.
4. The iteration may go too far and skip the maximum point. This is usually due to setting the tolerance value too low.
Although the message is printed, it is not usually a problem since StatPac always saves the value of the regression coefficients at the maximum value of the log likelihood and attempts halving towards the maximum point.
Iteration Limit
The maximum number of iterations can be set from 1 to 100 as a safeguard against a flat surface where iterations might proceed indefinitely. With the convergence tolerance set at 0.0000001, the number of iterations required for convergence usually varies from 4 to 12. Setting the maximum number of iterations around 30 is adequate. It should be noted that the data is read from disk with each iteration. The amount of time to execute one iteration is about the same as the amount of time it takes to run a multiple regression with the same number of variables and cases.
Save Probabilities
Researchers often want to save the probabilities in a file for further study. When SR=Y, you will be offered the opportunity to merge the predicted probabilities into the original data file at the completion of the analysis.
Predict Interactively
After performing a regression, you may want to predict values for the dependent variable. Select interactive prediction by entering the option PR=Y. At the completion of the analysis, you will then be prompted to enter a value for each independent variable, and the computer will use the regression coefficients and the cumulative normal distribution (for probit) or the cumulative logistic distribution (for logit) to predict the probability that the value of the dependent variable is equal to one.
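The prediction step described above can be sketched as follows: the coefficients and entered values are combined into a linear score, which is then passed through the cumulative logistic or cumulative normal distribution. The constant, coefficients and values in this Python example are hypothetical, and the function names are chosen only for illustration.

```python
import math
from statistics import NormalDist


def linear_score(constant, coefficients, values):
    """Constant plus the sum of coefficient times entered value."""
    if len(coefficients) != len(values):
        raise ValueError("one value is needed for each independent variable")
    return constant + sum(b * x for b, x in zip(coefficients, values))


def logit_probability(score):
    """Cumulative logistic distribution evaluated at the score."""
    return 1.0 / (1.0 + math.exp(-score))


def probit_probability(score):
    """Cumulative standard normal distribution evaluated at the score."""
    return NormalDist().cdf(score)


# Hypothetical equation with two independent variables.
score = linear_score(-1.0, [0.5, 0.25], [1.0, 2.0])
print(score, logit_probability(score), probit_probability(score))  # 0.0 0.5 0.5
```

A score of zero corresponds to a predicted probability of one-half under either distribution; positive scores give probabilities above one-half and negative scores give probabilities below it.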
Labeling and Spacing Options
Principal components analysis (PCA) investigates relationships among variables without designating some as independent and others as dependent; instead, PCA examines relationships within a single set of variables. The technique of PCA is primarily used to reduce the number of dimensions. Usually, most of the variation in a large group of variables can be captured with only a few principal components. Up to 200 variables can be analyzed.
This technique attempts to explain the variance-covariance structure of variables by constructing a smaller set of orthogonal (independent) linear combinations (principal components) of the original variables. The first principal component (PC) is that weighted combination of response variables which accounts for the maximum amount of total variation in the original variables. The second PC is that weighted combination of response variables which, among all combinations orthogonal to the first, accounts for the maximum amount of remaining variation. The syntax to run a principal components analysis is:
PCA <Variable list>
For example, after conducting a lengthy survey, we might believe that several of the questions were actually measuring the same thing. Principal components could be used to isolate those questions that were measuring the same dimension. Take the following questions from a larger twenty-five question survey:
11. What is your annual income?
12. What percent of your salary do you pay in taxes?
13. How much discretionary income do you have?
14. What is the market value of your house?
15. How much disability insurance do you carry?
All of the above questions might be measuring a dimension related to income. If a principal components analysis extracted these variables into one component, we might try to shorten future surveys by asking fewer questions about income. The command to run the principal components analysis would be:
PCA V1-V25
PC V1-V25 (Note: PCA may be abbreviated as PC)
Notice that all the questions in the survey were specified as part of the variable list. It is the job of PCA to extract the individual components. A component is the weighted combination of variables which explains the maximum amount of remaining variation in the original variables (orthogonal to the previous components). That is, each component is mutually independent from all other components.
Mathematically, the problem is one of explaining the variance-covariance structure of the variables through linear combinations of the variables. Primary interest lies in the algebraic sign and magnitude of the principal component coefficients (loadings), and in the total variation in the dependency structure explained by a component.
Grouping of variables is based on the magnitude of the loadings associated with each principal component. Loadings below .30 are usually disregarded for purposes of interpretation. Loadings are comparable to standardized partial regression coefficients. The sign and magnitude of each loading reveal how the particular variable is associated with that principal component. A loading may be interpreted like a correlation coefficient in that it shows the strength and direction of the relationship between a given variable and the principal component.
Generally, during study design, concepts or constructs are identified as part of the study goals. Variables are developed to answer the study goals. Principal components analysis can be used to evaluate how well each variable is associated with the construct it was designed to measure. In a well structured survey, each variable will have a high loading on only one construct (the one it was designed to measure). When a variable has a high loading on more than one principal component, the variable did not do a good job of discriminating between two or more constructs.
Principal Components
Either the correlation matrix or covariance matrix may be used for deriving principal components.
If the responses are in similar units, the covariance matrix has a greater statistical appeal. When the responses are in widely different units (age in years, weight in kilograms, height in centimeters, etc.) the correlation matrix should be used. In practice, the use of the correlation matrix is more common. The PC option may be set to one of four values:
A = Correlation matrix
B = Covariance matrix with mean correction
C = Covariance matrix without mean correction
D = PCA not requested
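To make the correlation-matrix case concrete, the following Python sketch extracts the first principal component of a small correlation matrix by power iteration. This is an illustration only: power iteration is a stand-in technique chosen for brevity, the scaling of loadings as eigenvector times the square root of the eigenvalue is a common convention assumed here, and the two data columns are hypothetical.

```python
import math


def correlation_matrix(columns):
    """Pearson correlations between equal-length columns of numbers."""
    n = len(columns[0])
    centered = []
    for column in columns:
        mean = sum(column) / n
        centered.append([v - mean for v in column])
    norms = [math.sqrt(sum(d * d for d in c)) for c in centered]
    size = len(columns)
    return [[sum(a * b for a, b in zip(centered[i], centered[j]))
             / (norms[i] * norms[j]) for j in range(size)]
            for i in range(size)]


def first_component(matrix, iterations=500):
    """Largest eigenvalue and its unit eigenvector, found by power iteration."""
    size = len(matrix)
    # Uneven start values avoid starting exactly on a lesser eigenvector.
    vector = [1.0 / (i + 1) for i in range(size)]
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vector[j] for j in range(size))
                   for i in range(size)]
        length = math.sqrt(sum(p * p for p in product))
        vector = [p / length for p in product]
    eigenvalue = sum(vector[i] * sum(matrix[i][j] * vector[j] for j in range(size))
                     for i in range(size))
    return eigenvalue, vector


# Hypothetical data: two variables with a correlation of .8.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]
r = correlation_matrix([x, y])
eigenvalue, vector = first_component(r)
loadings = [v * math.sqrt(eigenvalue) for v in vector]
proportion = eigenvalue / len(r)  # share of total variation explained
print(round(eigenvalue, 4), [round(v, 4) for v in loadings], round(proportion, 4))
```

With a correlation matrix the total variation equals the number of variables, so dividing an eigenvalue by that number gives the proportion of variation the component explains.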
Example of Principal Components Printout
In the example above, PCA was run on variables 2 to 7 of the Longley data. The correlation matrix was used (PC=A) because the variables were in widely different units. Note that the first two principal components account for 96.3% of the total variation in the dependency structure of the six variables. The first PC has equally high loadings on all variables except variable 3 (size of the armed forces). This component can be interpreted as an economic performance indicator. It is common for the first PC to load equally on most variables. The second component has high loadings on variable 3 (.80) and variable 4 (.60) and the signs are different, implying that there is a strong relationship between size of the armed forces and unemployment. The fact that the signs are different implies that as size of the armed forces goes up, unemployment goes down (as expected). The remaining four PC's account for less than 4% of the total variability and are ignored.
Following are some examples of the common uses of principal components analysis:
1. The most common use of principal components analysis is to derive a small number of linear combinations (principal components) from a set of variables that retain as much of the information in the original variables as possible. Often a small number of principal components can be used in place of the original variables for plotting, regression, clustering, etc.
2. Principal components analysis can also be viewed as an attempt to uncover approximate linear dependencies among variables (i.e., to understand the correlation structure). The multicollinearity diagnostics described below are based on the principal components.
3. It is often impossible to measure certain theoretical concepts, but many (highly interrelated) variables may be available to provide a mathematical formula for the theoretical concept. Principal components can be used to determine appropriate weights associated with each of these variables to provide an "optimum" measure of a theoretical concept, such as mathematical ability. In essence, PCA obtains components which may be given special meaning.
4. Because the principal components are independent of each other, they may help to circumvent the problem of multicollinearity. Most of the variation in the variables is accounted for by the first few principal components. The last few principal components define dimensions of the regressor space that are not very prominent; the components are so flimsy that they can be blown around wildly by small perturbations in the data. To get rid of the instability of the estimates, you throw away the last few components. There are as many principal components as original variables, but usually only a small number of components are retained.
5. Principal components analysis is similar to factor analysis in that they both provide analysis of the interdependence structure of a set of variables. In factor analysis, it is assumed that each original variable is influenced by various factors. Some are shared by other variables in the set (common factors), while others are not shared by any other variable (unique factors). In PCA, on the other hand, no assumptions about the underlying structure of the variables are made. We define new hypothetical variables that are exact mathematical transformations of the original variables, but that are independent of each other. That is, we seek that set of linear combinations of the original variables that absorb and account for the maximum possible proportion of total variation in those variables. The first principal component is the single best summary of the total variance; the second principal component is the best summary of the variance remaining after the first principal component has been extracted. Subsequent components are defined similarly until all the variance in the data is exhausted.
Descriptive Statistics
Descriptive statistics may be printed or suppressed using the DS=Y or DS=N option respectively.
Example of Descriptive Statistics Printout
Simple Correlation Matrix
The simple correlation matrix (SC=Y) is the easiest way of examining linear dependencies between the variables. High intercorrelations between variables are a warning sign that collinearity might exist.
Example of a Simple Correlation Printout
Collinearity Diagnostics
Multicollinearity refers to the presence of highly intercorrelated predictor variables in regression models, and its effect is to invalidate some of the basic assumptions underlying their mathematical estimation. It is not surprising that it is considered to be one of the most severe problems in multiple regression models and is often referred to by social modelers as the "familiar curse". Collinearity diagnostics measure how much regressors are related to other regressors and how this affects the stability and variance of the regression estimates.
Signs of multicollinearity in a regression analysis include:
1. Large standard errors on the regression coefficients, so that estimates of the true model parameters become unstable and low t-values prevail.
2. The parameter estimates vary considerably from sample to sample.
3. Often there will be drastic changes in the regression estimates after only minor data revision.
4. Conflicting conclusions will be reached from the usual tests of significance (such as the wrong sign for a parameter).
5. Extreme correlations between pairs of variables.
6. Omitting a variable from the equation results in smaller regression standard errors.
7. A good fit not providing good forecasts.
We use the multicollinearity diagnostics:
1. To produce a set of condition indices that signal the presence of one or more near dependencies among the variables. (Linear dependency, an extreme form of multicollinearity, occurs when there is an exact linear relationship among the variables.)
2. To uncover those variables that are involved in particular near dependencies and to assess the degree to which the estimated regression coefficients are being degraded by the presence of the near dependencies.
In practice, if one independent variable has a high squared multiple correlation (r-squared) with the other independent variables, it is extremely unlikely that the independent variable in question contributes significantly to the prediction equation. When the r-squared is too high, the variable is, in essence, redundant.
When the collinearity analysis is requested with the CD=Y option, the statistics attributed to Belsley, Kuh and Welsch (1980) are printed (namely the eigenvalues, condition indices and the decomposition of the variances of the estimates with respect to each eigenvalue).
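The link between eigenvalues and condition indices can be sketched briefly. In the conventional definition, each condition index is the square root of the largest eigenvalue divided by the eigenvalue in question. The Python example below assumes that definition and uses hypothetical eigenvalues; how StatPac scales the data before extracting eigenvalues is not covered by this sketch.

```python
import math


def condition_indices(eigenvalues):
    """Square root of (largest eigenvalue / each eigenvalue).

    Eigenvalues must be positive; an eigenvalue of zero would mean an
    exact linear dependency and an infinite condition index.
    """
    largest = max(eigenvalues)
    return [math.sqrt(largest / value) for value in eigenvalues]


# Hypothetical eigenvalues, largest first.
eigenvalues = [4.0, 1.0, 0.0025, 0.0004]
indices = condition_indices(eigenvalues)
flagged = [round(index, 2) for index in indices if index > 30]
print([round(index, 2) for index in indices])
print("indices above 30:", flagged)
```

Small eigenvalues produce large condition indices, so the last two hypothetical eigenvalues here would be flagged as possible near dependencies.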
Example of Collinearity Diagnostics
Note that variables 2, 3, 6 and 7 are highly correlated and the VIF's for all variables (except variable 4) are greater than 10 with one of them being greater than 1000. Examination of the condition index column reveals a dominating dependency situation with high numbers for several indices. Further regressions on subsets of the independent variables are called for.
The following steps are generally recommended in diagnosing multicollinearity:
1. Inspection of the correlation matrix for high pairwise correlations; this is not sufficient, however, since multicollinearity can exist with no pairwise correlations being high.
2. VIF's greater than 10 are a sign of multicollinearity. The higher the value of VIF's, the more severe the problem. In the StatPac output, any VIF greater than 999.99999 is set to the value 999.99999.
3. Condition indices of 30 to 100 (generally indicating moderate to strong collinearities) combined with at least 2 high numbers (say greater than 0.5) in a "variance proportion" row are a sign of multicollinearity. The higher the condition indices, the more severe the multicollinearity problem.
Three cases can be distinguished:
Case 1: Only one near dependency present
This occurs when only one condition index is greater than 30. A variable is involved in, and its estimated coefficient degraded by, the single near dependency if it is one of two or more variables in a row with "variance proportion" numbers in excess of some threshold value, such as .5.
Case 2: Competing dependencies
This occurs with more than one condition index of roughly the same magnitude and greater than 30. Here, involvement is determined by aggregating the "variance proportion" numbers of each variable over the high condition index rows. The variables whose aggregate proportions exceed 0.5 are involved in at least one of the dependencies, and therefore, have degraded coefficient estimates. The number of near dependencies present corresponds to the number of competing indices.
Case 3: Dominating dependencies
Dominating dependencies exist when high condition indices (over 30) coexist with even larger condition indices. The dominating dependency can become the prime determinant of the variance of a given coefficient and obscure information about the simultaneous involvement in a weaker dependency. In this case, other variables can have their joint involvement obscured by the dominating near dependency. With this dominating near dependency removed, the obscured relationship may reappear. In this case, additional analysis, such as auxiliary regressions, is warranted to investigate the descriptive relations among all of the variables potentially involved.
Since the variance of any regression coefficient depends on regression residual error, sample size, and the extent of multicollinearity, the following are suggested as possibilities for increasing the precision of the regression coefficients:
1. Can the precision of measurement of any variable be improved? This has the effect of reducing regression residual error.
2. Can the model specification be improved? Have we, for example, omitted an important variable or transformed the variables appropriately (using logs, reciprocal, etc.) to match theory?
3. Can we increase the sample size, thereby decreasing mean square residual error?
4. Can we replace a variable with another less correlated with the current set of independent variables, but as correlated with the dependent variable?
5. A group of variables (highly intercorrelated) may be aggregated (or averaged), using principal components or factor analysis to find appropriate weights. This is especially true of variables measured in the same units, such as income.
6. Because multicollinearity indicates that some independent variables convey little information over that of the other variables, one way to scale down variables is to drop a redundant variable.
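The VIF rule of thumb used in the diagnostic steps above can be tied to the squared multiple correlation of a predictor with the other predictors through the usual definition VIF = 1 / (1 - r-squared). The Python sketch below assumes that definition; the cap at 999.99999 mirrors the StatPac output limit mentioned earlier, and the r-squared values are hypothetical.

```python
def variance_inflation_factor(r_squared, cap=999.99999):
    """VIF from the r-squared of one predictor regressed on the others.

    Values above the cap are reported as the cap, as described for the
    StatPac output.
    """
    if not 0.0 <= r_squared <= 1.0:
        raise ValueError("r-squared must lie between 0 and 1")
    if r_squared == 1.0:
        return cap  # exact linear dependency: the VIF is unbounded
    return min(1.0 / (1.0 - r_squared), cap)


# Hypothetical r-squared values for four predictors.
for r2 in (0.0, 0.5, 0.9, 0.9999):
    print(r2, round(variance_inflation_factor(r2), 5))
```

An r-squared of .9 with the other predictors already gives a VIF of about 10, the level at which multicollinearity is usually suspected.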
Include Intercept
A collinearity with the constant term occurs because some linear combination of two or more variables is essentially constant. This situation can be detected by omitting the intercept from the collinearity analysis (set with the IC=N option), and then examining the standard deviations or coefficients of variation (standard deviation/mean) of the variables.
Variance Inflation Factors
Variance inflation factors show the degree to which a regression coefficient will be affected because of the variable's redundancy with other independent variables. As the squared multiple correlation of any predictor variable with the other predictors approaches unity, the corresponding VIF becomes infinite. For any predictor orthogonal (independent) to all other predictors, the variance inflation factor is 1.0. The VIF for the ith regression coefficient thus provides us with a measure of how many times larger the variance of that coefficient will be for multicollinear data than for orthogonal data (where each VIF is 1.0). If the VIF's are not unusually larger than 1.0, multicollinearity is not a problem. An advantage of knowing the VIF for each variable is that it gives a tangible idea of how much the variances of the estimated coefficients are degraded by the multicollinearity. VIF's may be printed using the VI=Y option.
Save Scores
Researchers often want to create principal component scores for respondents. This provides an indication of how strongly an individual loads on each component. The data is first standardized and the principal component loadings are used as coefficients for calculating the component score. You can save the component scores with the option SS=Y. At the completion of the analysis you will be given the opportunity to merge the component scores with the original data file.
Mean Substitution
Mean substitution is one method often used to reduce the problem of missing information. Usually, if one variable is not known, it is necessary to exclude the whole record from the analysis. This can substantially reduce the number of records that are included in the analysis. Mean substitution overcomes this problem by replacing any missing variable with the mean of that variable. While this technique has the possibility of slightly distorting the results, it can make it possible to perform a principal components analysis with substantial missing data.
Labeling and Spacing Options
Factor analysis is similar to principal components analysis. It is another way of examining the relationships between variables. Factor analysis differs from principal components in that there are usually fewer factors than variables. Up to 200 variables may be included. StatPac contains two different methods for extracting factors from a set of variables: varimax and powered-vector. Both methods extract factors that are independent (orthogonal) from other factors. This is known as a simple structure analysis. The program also contains an option to improve the simple structure by allowing the factors to be correlated with each other. When factors are correlated, it is known as an oblique reference structure analysis. The syntax to run a factor analysis is:
FACTOR <Variable list>
FA <Variable list>   (Note: FACTOR may be abbreviated as FA)
Factor analysis is essentially a way of examining the correlation matrix. While the two techniques (varimax and powered-vector) are quite different from each other in methodology, they have the same first step, which is to decide what values are to be used for the communality estimates. For each variable, the communality is defined as the sum of the squares of all the factor loadings for that variable. A factor loading itself can be thought of as the correlation of a variable with the factor. The communalities are placed on the diagonal of the correlation matrix.
Initial Communality Estimates for the Diagonal
There are three commonly used communality estimates: units, absolute row maximums and squared multiple correlation coefficients. Units is probably the most commonly used; a value of one is simply placed on the diagonal of the correlation matrix. The absolute row maximum is the highest absolute correlation coefficient in each row (excluding the diagonal itself). The squared multiple correlation coefficient measures the relationship of each variable with the remainder of the variables. That is, the diagonal variable is regressed against all other variables, and the squared coefficient of multiple correlation is placed on the diagonal. You can select the initial diagonal values by setting the DI option to one of three values:
1 = Use units (i.e., 1)
2 = Use highest absolute row correlation
3 = Use squared multiple correlation coefficient
Type of Solution
There are two different techniques to extract factors. The varimax solution (TY=V) is most commonly used. The first step with the varimax technique is known as "principal factor analysis". It is the same as a principal components analysis except that the extraction of principal components is stopped by some predetermined criterion. When a sufficient number of principal components has been extracted, a rotational technique called "varimax" is used to create the simple structure factor loadings.
The powered-vector technique (TY=P) uses an entirely different approach. This approach is faster than the varimax technique because it does not require a principal components analysis first. Instead, it uses a cluster technique to extract factors directly from the correlation matrix. An additional technique called "weighted cross-factor" rotation is often used with the powered-vector solution to provide a cleaner separation among the factors.
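The communality definition and the "highest absolute row correlation" rule (DI=2) described earlier can be checked by hand on a small example. The following is a minimal Python sketch, not StatPac code; the function names, the loadings and the correlation matrix are all hypothetical values chosen for illustration.

```python
# Hypothetical loadings: one row per variable, one column per factor.
loadings = [
    [0.80, 0.10],
    [0.70, 0.20],
    [0.15, 0.90],
]

# Hypothetical correlation matrix for three variables.
corr = [
    [1.00, 0.60, -0.20],
    [0.60, 1.00, 0.30],
    [-0.20, 0.30, 1.00],
]


def communalities(loadings):
    """Sum of the squared factor loadings for each variable (row)."""
    return [sum(a * a for a in row) for row in loadings]


def highest_abs_row_correlation(corr):
    """Largest absolute off-diagonal correlation in each row."""
    return [
        max(abs(r) for j, r in enumerate(row) if j != i)
        for i, row in enumerate(corr)
    ]


h2 = communalities(loadings)              # one communality per variable
diag = highest_abs_row_correlation(corr)  # candidate diagonal values

print("Communalities:", [round(h, 4) for h in h2])
print("Row-maximum diagonal estimates:", diag)
```

The first variable, for example, has a communality of 0.80 squared plus 0.10 squared, or 0.65.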
Example of a Factor Analysis Printout
Descriptive Statistics
Occasionally, you may be interested in the means and variances of the variables in the analysis. Descriptive statistics may be printed or excluded from the output by using the options DS=Y and DS=N, respectively.
Example of a Descriptive Statistics Printout
Simple Correlation Matrix
It is often desirable to print the simple correlation matrix when performing a factor analysis (SC=Y). It can provide a good initial understanding of the interrelationships in the data.
Example of a Simple Correlation Matrix Printout
Principal Components Analysis
The results of the principal components analysis may be printed (PC=Y) or not (PC=N). This option only controls printing of the output. The remaining principal components options must be set appropriately regardless of the PC setting.
Cross-Factor Rotation
Cross-factor rotation is often used in conjunction with the powered-vector technique to improve upon the simple structure. It usually provides a "cleaner" structure, that is, a clearer separation of the factors. Select cross-factor rotation with the option CR=Y.
Oblique Simple Structure Factor Loadings
After performing either a varimax or powered-vector solution, you may want to perform an oblique rotation. Both the varimax and powered-vector solutions make an arbitrary assumption that factors are unique and independent of one another. The oblique rotation removes this restriction and allows factors to be correlated with each other. The oblique simple structure factor loadings may be printed (OS=Y) or excluded (OS=N). The oblique rotation is often used to test the uniqueness of factors. If the resulting factors have low intercorrelations after an oblique rotation, it is fairly certain that the factors are orthogonal (independent of each other).
Example of Oblique Simple Structure Factor Loadings
Oblique Factor Correlation Matrix
The correlation matrix of the factors can be printed (OC=Y) or excluded (OC=N). It represents relationships between factors after the oblique rotation. If the correlations are low (less than .3 in absolute value), we can be confident that the varimax or powered-vector solution produced unique and unrelated factors.
Example of an Oblique Correlation Matrix Printout
Number of Factors
When EX=1, StatPac will extract NF factors from the data. This is often used to test a specific hypothesis that has been developed. For example, you might believe that twenty items on a survey are really measuring only two major factors. You could test this hypothesis by using the option:
OPTIONS EX=1 NF=2
In this case, only two factors would be extracted. They could then be examined as to how well they "hang together". The only time you use the NF option is to test a specific hypothesis.
Percent of Total Variance to Explain
When EX=2, StatPac will continue to extract factors until it has accounted for PR percent of the variance. For example, to extract factors until 95 percent of the variance has been accounted for, enter the option:
OPTIONS EX=2 PR=95
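The stopping rule for EX=2 can be illustrated outside StatPac. The following Python sketch assumes that each factor's share of the variance is its eigenvalue divided by the sum of all eigenvalues; the function name and the eigenvalues are hypothetical, and StatPac's internal calculation may differ in detail.

```python
def factors_for_percent(eigenvalues, percent):
    """Count the largest-first eigenvalues needed to reach `percent` of the variance."""
    total = sum(eigenvalues)
    running = 0.0
    for count, value in enumerate(sorted(eigenvalues, reverse=True), start=1):
        running += value
        if 100.0 * running / total >= percent:
            return count
    return len(eigenvalues)


# Hypothetical eigenvalues for a five-variable problem (they sum to 5).
eigenvalues = [2.5, 1.5, 0.7, 0.2, 0.1]

# Cumulative percentages are 50, 80, 94, 98 and 100.
n_95 = factors_for_percent(eigenvalues, 95)
n_75 = factors_for_percent(eigenvalues, 75)
print("Factors needed for 95 percent:", n_95)
print("Factors needed for 75 percent:", n_75)
```

With these values, three factors explain only 94 percent of the variance, so a fourth is required to reach the 95 percent target.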
Minimum Variance Proportion for Principal Inclusion
When EX=3, StatPac will continue to extract components until the next component would not account for a minimum proportion of the total variance. For example, if you set MP=5, the program will extract all the components that account for at least five percent of the total variance. Any component that does not account for five percent of the variance will not be included in the analysis.
Varimax Rotational Factor Angle
The varimax rotational technique is one of the better factoring methods. It usually provides a clearer separation of factors than other methods. The technique extracts factors using normalized factor loadings during the iterations. These factors are assumed to be unique (orthogonal) and not correlated with each other. The technique attempts to maximize the variance of the squared loadings for each factor. A factor loading itself can be thought of as the correlation of a variable with the factor. Using this method, the angle of rotation is calculated in each iteration. In other words, each iteration is a rotation in an attempt to improve the simple structure. When the rotational angle is less than the varimax rotational factor angle (RF), the exit criterion has been met and the process is complete. The most often used value is one degree. The RF option can be used to set the angle to any value. For example, the following options command sets the exit criterion to one and a half degrees:
OPTIONS RF=1.5
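The role of the rotational angle as an exit test can be seen in a small two-factor example. The following Python sketch uses Kaiser's closed-form angle for a single pair of factors and, for brevity, works on raw rather than normalized loadings; it is an illustration under those assumptions, not StatPac's routine. The function names, the `rf_degrees` parameter and the loadings are hypothetical.

```python
import math


def varimax_angle(x, y):
    """Kaiser's rotation angle (radians) that maximizes varimax for one factor pair."""
    n = len(x)
    u = [a * a - b * b for a, b in zip(x, y)]
    v = [2.0 * a * b for a, b in zip(x, y)]
    A, B = sum(u), sum(v)
    C = sum(ui * ui - vi * vi for ui, vi in zip(u, v))
    D = 2.0 * sum(ui * vi for ui, vi in zip(u, v))
    return math.atan2(D - 2.0 * A * B / n, C - (A * A - B * B) / n) / 4.0


def rotate(x, y, phi):
    """Rotate the two loading columns through the angle phi."""
    c, s = math.cos(phi), math.sin(phi)
    return ([a * c + b * s for a, b in zip(x, y)],
            [-a * s + b * c for a, b in zip(x, y)])


def varimax_criterion(x, y):
    """Sum over both factors of the variance of the squared loadings."""
    total = 0.0
    for col in (x, y):
        sq = [a * a for a in col]
        total += sum(s * s for s in sq) / len(sq) - (sum(sq) / len(sq)) ** 2
    return total


def varimax_two_factors(x, y, rf_degrees=1.0, max_iter=100):
    """Rotate until the computed angle falls below rf_degrees (the exit test)."""
    angles = []
    for _ in range(max_iter):
        phi = varimax_angle(x, y)
        angles.append(math.degrees(phi))
        if abs(angles[-1]) < rf_degrees:
            break
        x, y = rotate(x, y, phi)
    return x, y, angles


# Hypothetical unrotated loadings for four variables on two factors.
x0 = [0.7, 0.6, 0.5, 0.6]
y0 = [0.5, 0.6, -0.5, -0.4]
x1, y1, angles = varimax_two_factors(x0, y0, rf_degrees=1.0)
print("Angles by iteration (degrees):", [round(a, 3) for a in angles])
print("Criterion before and after:", varimax_criterion(x0, y0), varimax_criterion(x1, y1))
```

With only two factors, the first rotation is already optimal, so the second computed angle is essentially zero and the loop exits. With more than two factors, each iteration would cycle through every pair of factors, and several iterations would normally be needed before all angles fall below the RF value.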
Convergence Tolerance for Principal Components
The convergence tolerance is the way that the "resolution" of the components is controlled. In other words, it determines the point at which the program decides it has finished extracting a component. Setting the convergence tolerance too low will result in a very large number of calculations and, in the worst case, may cause the program to exceed the limit on the number of iterations. Setting the value too high would cause the program to stop prematurely, before a component has been fully extracted. A good starting value for the convergence tolerance is TL=.000001.
Convergence Tolerance for Oblique Rotation
Each iteration in the oblique rotation improves the simple structure by successively decreasing amounts. The convergence tolerance (OT) places a limit on the convergence process, and is used as the exit criterion for oblique rotation. The iterations will continue until additional rotations do not improve the simple structure by more than the convergence tolerance value. While the convergence tolerance could be any number greater than zero, a typical value might be .000001. The options command to set this value is:
OPTIONS OT=.000001
Iteration Limit
The maximum number of iterations for any of the factoring algorithms is included to limit the time that the computer could be working on the problem. Convergence usually occurs in fewer than ten iterations; however, as the convergence tolerance is set to lower values, more iterations are required to achieve a solution. Generally, the maximum number of iterations is set to 100 (IT=100). This seems to allow most solutions, and at the same time, prevents unreasonable calculation times.
Save Scores
Researchers often want to create factor scores for respondents. This provides an indication of how strongly an individual loads on each factor. The data is first standardized and the factor loadings are used as coefficients for calculating the factor scores. It should be noted that the loadings will be those of the last rotation requested with the other options. Thus, if an oblique rotation was requested (i.e., the last rotation to be performed), the saved scores will be those obtained from the oblique rotation. You can save the factor scores with the options command SS=Y. At the completion of the analysis, you will be given the opportunity to merge the saved scores with the original data file.
Mean Substitution
Missing data can become a problem when there are few cases. In the worst case, missing data may make it impossible to perform an analysis. Mean substitution is one method often used to combat the problem of missing data. When mean substitution is used (MS=Y), any data that is missing is replaced with the mean of the variable.
Labeling and Spacing Options
The objective of cluster analysis is to separate the observations into different groups (clusters) so that the members of any one group differ from one another as little as possible, whereas observations across clusters tend to be dissimilar. The grouping can be used to summarize the data or as a basis for further analysis. In discriminant analysis, the groups are already defined, whereas in cluster analysis the purpose is to define the groups.
The syntax of the command to run cluster analysis is:
CLUSTER <Variable list>
As an example, a researcher studying iris flowers wanted to know if the iris would group into types based on length and width of their sepals and petals. The four clustering variables are:
V1   SEPAL LENGTH   length of sepal
V2   SEPAL WIDTH    width of sepal
V3   PETAL LENGTH   length of petal
V4   PETAL WIDTH    width of petal
The command to run cluster analysis could be specified in several ways:
CLUSTER SEPAL LENGTH, SEPAL WIDTH, PETAL LENGTH, PETAL WIDTH
CLUSTER V1 V2 V3 V4
CLUSTER V1-V4
CL V1-V4   (Note: CLUSTER can be abbreviated as CL)
In the first example, a continuation line was used to extend the variable list. The variable list can consist of variable labels and/or variable numbers. Either a comma or a space can be used to separate the variables from each other.
CLUSTER provides two types of cluster analysis: agglomerative hierarchical cluster analysis and nonhierarchical cluster analysis. For hierarchical methods, the general procedure is as follows:
1. Begin with as many clusters as there are observations (i.e., each cluster consists of exactly one observation).
2. Search for the most similar pair of clusters. This involves evaluating a criterion (distance) function for each possible pair of clusters and choosing the pair of clusters for which the value of the criterion function is the smallest. The criterion function is constructed using the clustering variables; the actual formula for the criterion function depends on the clustering algorithm used. Label the chosen clusters as p and q.
3. Reduce the number of clusters by one through the merger of clusters p and q; the new cluster is labeled q.
4. Repeat steps 2 and 3 until all the observations are in one cluster. At each stage, the identity of the merged clusters as well as the value of the criterion function is stored.
Nonhierarchical clustering methods begin with the number of clusters given. Their primary use is to refine the clusters obtained by hierarchical methods. The nonhierarchical cluster analysis methods used by StatPac are convergent K-means methods and generally follow this sequence of steps:
1. Begin with an initial partition of data units into clusters. There are several different initial partitions available in StatPac.
2. Take each data unit in sequence and compute the distances to all cluster centroids; if the nearest centroid is not that of the data unit's current cluster, then reassign the data unit and update the centroids of the losing and gaining clusters.
3. Repeat step 2 until convergence is achieved; that is, continue until a full cycle through the data set fails to cause any changes in cluster membership.
Cluster analysis performed on small data sets (a few hundred cases) will run relatively fast. However, the time to run the analysis grows much faster than the number of cases in the data file. When thousands of cases are involved, it may take several hours to complete the analysis. Therefore, during preliminary analyses on large data sets, it may be desirable to use a SELECT statement to limit the number of records being analyzed.
Type Of Clustering Algorithm
There are six different clustering algorithms available in StatPac. The TY option is used to select the clustering method. Algorithms 1-3 are agglomerative hierarchical clustering algorithms while algorithms 4-6 are nonhierarchical clustering algorithms. Each of the six algorithms is described below:
Minimum average sum of squares cluster analysis TY=1
With this algorithm, the clusters merged at each stage are chosen so as to minimize the average contribution to the error sum of squares for each member in the cluster. This quantity is also the variance in each cluster and is similar to average linkage in that it tends to produce clusters of approximately equal variance. Consequently, if the clusters are all of approximately the same density, then there will be a tendency for large natural groups to appear as several smaller clusters, or for small natural groups to merge into larger clusters.
Ward's method TY=2
At each stage, this method minimizes the increase in the within-cluster sum of squares resulting from the merger of clusters p and q. This method tends to join clusters with a small number of observations and is biased towards producing clusters with roughly the same number of observations.
Centroid method TY=3
This method minimizes the squared Euclidean distance between clusters at each stage.
The centroid method is not as sensitive to the presence of outliers, but does not perform as well as the first two methods if there are no outliers. If there are no outliers, one of the first two methods should be used. The first method performs better than Ward's method under certain types of errors (Milligan, 1980).
The three nonhierarchical clustering algorithms are all based on the convergent K-means method (Anderberg, 1973) and differ only in terms of their starting values.
Convergent K-means using minimum average sum of squares centroids TY=4
This algorithm first runs the minimum average sum of squares hierarchical cluster analysis method and uses the centroids from this method as input to the convergent K-means procedure. The distance measure used to allocate an observation to a cluster in the convergent K-means procedure is the Euclidean distance obtained from the clustering variables for that observation.
Convergent K-means using Ward method centroids TY=5
This algorithm first runs the Ward hierarchical cluster analysis method and uses the centroids from this method as input to the convergent K-means procedure. The distance measure used to allocate an observation to a cluster in the convergent K-means procedure is the Euclidean distance obtained from the clustering variables for that observation.
Convergent K-means using centroids from the centroid method TY=6
This algorithm first runs the centroid hierarchical cluster analysis method and uses the centroids from this method as input to the convergent K-means procedure. The distance measure used to allocate an observation to a cluster in the convergent K-means procedure is the Euclidean distance obtained from the clustering variables for that observation.
Nonhierarchical methods generally perform better than hierarchical methods if nonrandom starting clusters are used.
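The convergent K-means steps described earlier (start from a supplied partition, reassign each data unit to its nearest centroid, stop when a full pass moves nothing) can be sketched in a few lines of Python. This is an illustration only and not StatPac's implementation. The function names, the data and the starting partition are hypothetical, and the rule that the last member of a cluster is never moved is an added assumption to keep every cluster non-empty.

```python
def centroid(points):
    """Mean of each clustering variable over the given points."""
    return [sum(col) / len(points) for col in zip(*points)]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def convergent_kmeans(data, labels, max_iter=10):
    """Reassign units one at a time until a full pass changes no membership."""
    labels = list(labels)
    k = max(labels) + 1  # assumes clusters are numbered 0..k-1, none empty
    for iteration in range(1, max_iter + 1):
        moved = 0
        for i, point in enumerate(data):
            if labels.count(labels[i]) == 1:
                continue  # assumption: never empty a one-member cluster
            # Recomputing all centroids here has the same effect as updating
            # only the losing and gaining clusters after each move.
            centroids = [
                centroid([p for p, lab in zip(data, labels) if lab == c])
                for c in range(k)
            ]
            nearest = min(range(k), key=lambda c: sq_dist(point, centroids[c]))
            if nearest != labels[i]:
                labels[i] = nearest
                moved += 1
        if moved == 0:
            return labels, iteration
    return labels, max_iter


# Hypothetical data: two obvious groups, with the third case started in the wrong one.
data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
start = [0, 0, 1, 1, 1, 1]
final_labels, passes = convergent_kmeans(data, start)
print("Final membership:", final_labels)
print("Passes through the data:", passes)
```

The first pass moves the misplaced case and the second pass confirms that nothing else changes, which is why a good starting partition needs so few iterations.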
When random starting clusters are used (for example, the first p observations are used as centroids for the p starting clusters), the nonhierarchical clustering methods perform rather poorly. The random start methods were, therefore, not implemented in StatPac. K-means procedures appear more robust than any hierarchical methods with respect to the presence of outliers, error perturbations of distance measures and choice of distance metric. However, nonhierarchical methods require the number of clusters to be given.
Many studies recommend the following series of steps in running cluster analysis:
1. Run cluster analysis using one of the first two hierarchical cluster analysis algorithms (minimum average sum of squares or Ward methods).
2. Remove outliers from the data set. Outliers can be located by looking at the distance from the cluster centroids (CC option), or the hierarchical tree diagram (one-observation clusters that are late in merging with other clusters). Outliers often represent segments of the population that are underrepresented and, therefore, should not be discarded without examination.
3. Delete dormant clustering variables. These can be located using the decomposition of sum of squares (DC option).
4. Determine the number of clusters. This can be done using the criterion function column in the decomposition of sum of squares (DC option), as well as the hierarchical tree diagram (TD option).
5. Once outliers are discarded, dormant variables omitted and the number of clusters determined, run one of the first two nonhierarchical methods (TY=4 or 5) several times, varying the number of clusters.
Number of Clusters
In the first cluster analysis run on a data set, you should choose one of the three hierarchical clustering algorithms (TY=1, 2 or 3), and set the number of clusters equal to 99 (NC=99). This run should print the hierarchical tree diagram and the decomposition of sum of squares, with the other output options left off (i.e., TD=Y, DC=Y, CC=N, CM=N). Once you have examined the hierarchical tree diagram and the decomposition of sum of squares, you would select the number of clusters using the NC option. Cluster analysis is an exploratory technique, and you will probably have to rerun the cluster analysis several times, varying the number of clusters as well as the clustering algorithm and the set of clustering variables. It is not uncommon to set the number of clusters to a few more than you suspect there are, in an attempt to discover outliers. Nonhierarchical clustering algorithms are more effective in spotting outliers by this method than their hierarchical counterparts.
Descriptive Statistics
Descriptive statistics may be printed or suppressed using the DS=Y or DS=N option, respectively.
Example of Descriptive Statistics Printout
Hierarchical Tree Diagram
The hierarchical tree diagram provides the analyst with an effective visual condensation of the clustering results. It is one of the most commonly used methods of determining the number of clusters. It is also useful in spotting outliers, as these will appear as one-member clusters that are joined later in the clustering process. The numbers at the top and bottom of the hierarchical tree diagram represent equally spaced values of the criterion function, so the diagram gives a pictorial representation of the criterion function information. If two or more clusters in a set of data are distinguished very well from each other, all merges but the last few (where "true" clusters are joined) will be clumped to the left of the tree diagram because of the extreme dissimilarity of the "true" clusters (i.e., most of the criterion function is accounted for by these clusters). To better understand the internal structure of these "true" clusters, it may be necessary to rerun cluster analysis separately on each of these "true" clusters.
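The merge history that a tree diagram summarizes can be reproduced in miniature. The following Python sketch follows the general hierarchical procedure (start with one cluster per observation, merge the closest pair, record the criterion value) using the squared Euclidean distance between centroids as the criterion, which is an assumed stand-in for the centroid method. The function names and the one-variable data are hypothetical, and StatPac's own criterion values are scaled differently.

```python
def centroid(points):
    """Mean of each clustering variable over the given points."""
    return [sum(col) / len(points) for col in zip(*points)]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def centroid_linkage(data):
    """Merge the closest pair of clusters until one remains; return the history."""
    clusters = {i: [i] for i in range(len(data))}  # one cluster per observation
    history = []
    while len(clusters) > 1:
        ids = sorted(clusters)
        best = None
        for pos, p in enumerate(ids):
            for q in ids[pos + 1:]:
                d = sq_dist(centroid([data[i] for i in clusters[p]]),
                            centroid([data[i] for i in clusters[q]]))
                if best is None or d < best[0]:
                    best = (d, p, q)
        d, p, q = best
        clusters[q] = clusters.pop(p) + clusters[q]  # the merged cluster keeps label q
        history.append((p, q, d))
    return history


# Hypothetical one-variable data with two well-separated groups.
data = [(0.0,), (1.0,), (10.0,), (11.5,)]
history = centroid_linkage(data)
for p, q, d in history:
    print("merge", p, "into", q, "criterion", d)
```

The first two merges have small criterion values, while the last merge, which joins the two "true" groups, has a value many times larger. That jump is the pattern to look for in the tree diagram.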
Example of a Hierarchical Tree Diagram
Cluster Centroids
The cluster centroids are simply the means of each clustering variable for each cluster. The cluster centroids are probably the most useful multivariate characterization of the clusters.
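As a small illustration of this definition, the following Python sketch computes the centroid of each cluster from a membership list. The function name and the measurements are hypothetical (they are made-up values, not the iris data set).

```python
def cluster_centroids(data, labels):
    """Mean of each clustering variable within each cluster."""
    groups = {}
    for row, label in zip(data, labels):
        groups.setdefault(label, []).append(row)
    return {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in groups.items()
    }


# Hypothetical sepal length, sepal width, petal length and petal width values.
data = [
    (5.0, 3.4, 1.5, 0.2),
    (5.2, 3.6, 1.4, 0.2),
    (6.4, 2.9, 4.3, 1.3),
    (6.0, 2.7, 4.5, 1.5),
]
labels = [1, 1, 2, 2]

centroids = cluster_centroids(data, labels)
for label in sorted(centroids):
    print("Cluster", label, [round(m, 2) for m in centroids[label]])
```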
Example of a Cluster Centroids Printout
Decomposition of Sum of Squares
This option combines two types of information: criterion function information and decomposition of sum of squares information. The criterion function is useful in determining the number of clusters. It is expressed, at each clustering stage, as a proportion of the value of the criterion function when all observations are joined in one cluster (the last stage in a hierarchical cluster analysis). The criterion function, at a given clustering stage, is a measure of the distance between all observations in all clusters. Consequently, at the start (when there are as many clusters as observations), the value of the criterion function is zero (each cluster has zero variance because it contains only one observation). As clusters are joined, the value of the criterion function increases. The criterion function rises slowly in the first stages, as the most similar clusters are joined and there is very little within-cluster variability. However, as true distinct groups are joined, the within-cluster variability increases and the criterion function rises sharply. As an example, suppose you are analyzing a data set with four clearly defined groups (clusters). The value of the criterion function should rise very slowly until you reach three clusters, at which point two "true" clusters are joined. This would be the clue as to the number of clusters (i.e., the sharp rise in the criterion function when you reach three clusters).
A "random" variable (a variable not useful in separating clusters) can have a detrimental effect in cluster analysis and should be eliminated. One way of evaluating the relationship between a given hierarchical classification and each of the clustering variables is through the examination of the growth in unexplained sum of squares as the clustering progresses through increasing levels of aggregation.
At the beginning of clustering, each observation is represented perfectly by the mean vector to which it belongs and there is no within-cluster error. At the highest level of aggregation, there is only one cluster and it contains every observation. The proportion of unexplained sum of squares is, therefore, 1.0. At any stage between these two extremes, the within-cluster error sum of squares is that portion of the total variance unexplained by the current set of clusters.
To locate "random" variables, one compares the step-by-step growth in the proportion of unexplained sum of squares for each clustering variable. For a few variables, the fractions may remain small up to the last few stages, whereas, for other variables, the fractions may get large at a fairly early stage. The former variables may be thought of as being dominant in the results, while the latter are dormant. Repeating the clustering with dormant variables eliminated should have little effect on the results. However, deleting a dominant variable probably will have a marked influence on the clustering. This kind of analysis can be an especially useful device for generating a parsimonious set of variables to be used in subsequent attempts to cluster the data. It should be noted that if the data set has very distinct clusters, the unexplained sum of squares will rise slowly, even when a variable is dominant. It is not until the size of the clusters increases and/or "true" clusters are joined that the proportion of unexplained sum of squares rises sharply. Indirectly, the decomposition of sum of squares can also be used as an indicator of the number of true clusters.
As this option generates a line for each observation, the number of clustering variables "decomposed" is restricted to what will fit on one line. This option is only possible with hierarchical clustering algorithms (TY=1, 2, or 3). The decomposition of sum of squares will remain the same, regardless of the number of clusters chosen, if you use the same data set and the same clustering algorithm.
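The proportion of unexplained sum of squares for each clustering variable can be computed by hand for any one partition. The following Python sketch takes the proportion to be the within-cluster sum of squares divided by the total sum of squares for that variable, which matches the two extremes described above (zero with one observation per cluster, 1.0 with a single cluster). The function name and the data are hypothetical, and the sketch assumes no variable is constant.

```python
def unexplained_proportions(data, labels):
    """Within-cluster SS divided by total SS, for each clustering variable."""
    proportions = []
    for v in range(len(data[0])):
        column = [row[v] for row in data]
        grand_mean = sum(column) / len(column)
        total_ss = sum((x - grand_mean) ** 2 for x in column)  # assumed non-zero
        within_ss = 0.0
        for label in set(labels):
            members = [x for x, lab in zip(column, labels) if lab == label]
            mean = sum(members) / len(members)
            within_ss += sum((x - mean) ** 2 for x in members)
        proportions.append(within_ss / total_ss)
    return proportions


# Hypothetical data: the first variable separates the two clusters, the second does not.
data = [(1.0, 5.0), (2.0, 1.0), (9.0, 4.0), (10.0, 2.0)]
two_clusters = unexplained_proportions(data, [0, 0, 1, 1])
singletons = unexplained_proportions(data, [0, 1, 2, 3])
one_cluster = unexplained_proportions(data, [0, 0, 0, 0])
print("Two clusters:", [round(p, 4) for p in two_clusters])
```

With two clusters, the first variable leaves under two percent of its sum of squares unexplained (a dominant variable), while the second variable is already completely unexplained (a dormant variable).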
Example of A Decomposition of Sum of Squares Printout
Cluster Membership and Distance to Cluster Centroids
This option will list the members of each cluster as well as the Euclidean distance of each member from its cluster centroid. This output provides useful information on how homogeneous the clusters are, and provides an aid in the detection of outliers.
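A listing of this kind can be approximated with a short Python sketch that computes each case's Euclidean distance from the centroid of its own cluster. The function name, the data and the cluster numbers are hypothetical, and the layout is not that of the StatPac printout.

```python
import math


def membership_with_distances(data, labels):
    """Return (case number, cluster, distance to own centroid) for every case."""
    groups = {}
    for row, label in zip(data, labels):
        groups.setdefault(label, []).append(row)
    centroids = {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in groups.items()
    }
    listing = []
    for case, (row, label) in enumerate(zip(data, labels), start=1):
        distance = math.sqrt(sum((x - c) ** 2 for x, c in zip(row, centroids[label])))
        listing.append((case, label, distance))
    return listing


# Hypothetical data: case 3 sits well away from the other members of cluster 1.
data = [(0.0, 0.0), (0.0, 2.0), (6.0, 1.0), (20.0, 20.0), (20.0, 22.0)]
labels = [1, 1, 1, 2, 2]

listing = membership_with_distances(data, labels)
for case, label, distance in listing:
    print("case", case, "cluster", label, "distance", round(distance, 3))
```

Case 3 has the largest distance from its centroid, which makes it the first candidate to examine as a possible outlier.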
An Example of a Cluster Membership and Distance to Centroids Printout
Standardize Clustering Variables
Standardizing the clustering variables consists of subtracting the variable mean, and dividing by the variable standard deviation. The data should be standardized if the clustering variables are in widely different units (age in years, weight in kilograms, height in centimeters, etc.) to avoid giving variables with higher variances more weight in the clustering process. With the hierarchical clustering algorithms especially, standardizing the clustering variables tends to reduce the effect of outliers on the final clustering solution.
Mean Substitution
Mean substitution is one method often used to reduce the problem of missing information. Often, cluster analysis is difficult because if one clustering variable is not known, it is necessary to exclude the whole record from the analysis. This can substantially reduce the number of records that are included in the analysis. Mean substitution overcomes this problem by replacing any missing clustering variable with the mean of that variable. While this technique has the possibility of slightly distorting the results, it can make it possible to perform a cluster analysis with substantial missing data. If an observation for which one or more missing clustering variables were replaced by the mean shows up as an outlier, then this observation should be eliminated from future cluster analysis runs.
Iteration Limit
The nonhierarchical methods require that the maximum number of iterations be specified. The nonhierarchical clustering process will stop either when the maximum number of iterations has been reached, or when an iteration has resulted in no observation being moved from one cluster to another. Since the nonhierarchical clustering algorithms in StatPac start with hierarchical cluster centroids, they will rarely require more than a few iterations. Setting the maximum number of iterations to ten should be sufficient in most cases. Should the iteration limit be reached, the iteration summary output should provide guidance as to how much the iteration limit should be increased.
Save Cluster Membership Variable
Cluster analysis is usually a first step for running other statistical techniques. The next step in the analysis often involves one of the following:
1. Run analysis within each of the clusters.
2. Run discriminant analysis on the clusters, thus allowing one to get multivariate statistics on the clusters as well as a plot of the first two canonical axes.
The SM option allows you to save the cluster membership variable for further analysis of the clusters. At the end of the analysis you will be given the opportunity to merge the cluster membership variable into the original data.
Labeling and Spacing Options
Discriminant function analysis is a technique for the multivariate study of group differences. It is similar to multiple regression in that both involve a set of independent variables and a dependent variable. In multiple regression, the dependent variable is a continuous variable, whereas in discriminant analysis, the dependent variable (often called the grouping variable) is categorical. Discriminant analysis can be seen as an extension of probit or logistic regression. In probit and logistic regression, the dependent variable is numerically coded as 0 or 1; in discriminant analysis the grouping variable may be numeric or alpha (e.g., 1, 2, 3 or A, B, C). When there are only two groups, many researchers use probit or logistic regression and code the two groups as 0 and 1.
Discriminant analysis is used to:
1. Describe, summarize and understand the differences between groups.
2. Determine which set of independent variables best captures or characterizes group differences.
3. Classify new subjects into groups or categories.
Canonical correlation analysis (an option) can be used to reduce the dimensionality of the independent variables, similar to principal components analysis. Canonical analysis also makes it possible to determine how well the groups are separated, using two linear combinations of the independent variables in the discriminant equations. The syntax of the command to run a stepwise discriminant analysis is:
DISCRIMINANT <Dependent variable> <Independent variable list>
The maximum number of categories (groups) for the dependent variable is 24. Up to 200 independent variables may be specified. As an example, a researcher studying three types of iris flowers wanted to know if the type of iris could be determined from the length and width of their sepals and petals. The IRIS TYPE variable (V1) is coded 1=Setosa, 2=Veriscol and 3=Virginic. Note that this is a categorical variable; an iris is one type or another (we normally won't have a mixed breed iris). We also have four independent variables: SEPAL LENGTH (V2) is the length of sepal, SEPAL WIDTH (V3) is the width of sepal, PETAL LENGTH (V4) is the length of petal, and PETAL WIDTH (V5) is the width of petal. The command to run the discriminant analysis could be specified in several different ways:
DISCRIMINANT IRIS TYPE, SEPAL LENGTH, SEPAL WIDTH, PETAL LENGTH, PETAL WIDTH
DISCRIMINANT V1 V2 V3 V4 V5
DISCRIMINANT IRIS TYPE V2-V5
DI V1-V5   (Note: DISCRIMINANT can be abbreviated as DI)
In each example, the dependent variable (IRIS TYPE) was specified first, followed by the independent variable list. The variable list itself can consist of variable labels and/or variable numbers. Either a comma or a space can be used to separate the variables from each other. The dependent variable may be alpha or numeric. If it is numeric, it must be coded 1 through 24. If it is alpha, it must be coded A, B, C, etc. If the study design does not contain value codes and labels for the dependent variable, the program will use the data itself to determine the value codes. A discriminant function equation is used to obtain the posterior probability that an observation came from each of the groups. The discriminant analysis then classifies each observation into the group with the highest posterior probability.
Descriptive Statistics
The means and standard deviations for all the independent variables can be printed with the descriptive statistics option (DS=Y). When DS=C, the output will contain descriptive statistics broken down by each of the dependent variable groups.
Example of a Descriptive Statistics Printout
Simple Correlation Matrix
The within group correlation matrix is obtained by pooling the correlation matrix from each of the groups (SC=Y). If two variables are highly correlated, it is possible that the matrices are not well conditioned, and it might be beneficial to run the discriminant analysis again without one of the variables. If Wilks' lambda does not show a significant increase, you might want to leave the variable out of the discriminant analysis.
Example of a Within Group Correlation Matrix Printout
Group Discriminant Function Coefficients
The group discriminant function (classification) coefficients can be printed with the CO option (CO=Y). The output includes a constant and a coefficient for each independent variable, for each value of the grouping variable. Each coefficient provides an estimate of the effect of that variable (in the units of the raw score) for classifying an observation into each group.
Example of Group Discriminant Function Coefficients Printout
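The way classification coefficients turn into posterior probabilities can be illustrated with a short Python sketch. It assumes the conventional linear classification functions, where each group's score is its constant plus the weighted sum of the raw scores and the posterior probability is proportional to the exponential of that score (with any prior probability already folded into the constant). The function name and the coefficient values are hypothetical and are not StatPac output.

```python
import math

def posterior_probabilities(coefficients, observation):
    """Return {group: posterior probability} from linear classification functions.

    coefficients: {group: (constant, [coefficient per independent variable])}
    observation:  raw scores, in the same variable order as the coefficients.
    """
    scores = {
        group: constant + sum(b * x for b, x in zip(coefs, observation))
        for group, (constant, coefs) in coefficients.items()
    }
    # Subtract the largest score before exponentiating to avoid overflow.
    top = max(scores.values())
    weights = {g: math.exp(s - top) for g, s in scores.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}

# Hypothetical coefficients for two independent variables (illustration only).
coefficients = {
    "1": (-10.0, [2.0, 1.0]),
    "2": (-14.0, [1.0, 3.0]),
    "3": (-20.0, [0.5, 4.0]),
}
posterior = posterior_probabilities(coefficients, [5.0, 1.5])
# The observation is assigned to the group with the highest posterior probability.
predicted = max(posterior, key=posterior.get)
print(predicted, posterior)
```

With these made-up numbers the scores are 1.5, -4.5 and -11.5, so the observation is assigned to group 1.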
Classification Matrix
The classification matrix may be selected with the option CM=Y. It gives the frequency distribution of the observed group versus its predicted group, based on the set of independent variables in the discriminant function. This option also calculates the percent correctly classified in each group, as well as over all groups. If the group is well predicted by the set of independent variables, we expect to find most observations falling on the diagonal of this matrix (i.e., observations in group i would be classified as belonging to group i). The classification matrix also provides valuable insight into which groups are well separated and which groups are harder to separate.
Example of a Classification Matrix
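As a rough illustration of how such a matrix is tallied, the following Python sketch cross-tabulates observed against predicted groups and computes the percent correctly classified. The function names and the ten sample cases are invented for the example.

```python
def classification_matrix(observed, predicted, groups):
    """Cross-tabulate observed group (rows) against predicted group (columns)."""
    matrix = {g: {h: 0 for h in groups} for g in groups}
    for obs, pred in zip(observed, predicted):
        matrix[obs][pred] += 1
    return matrix

def percent_correct(matrix):
    """Return (percent correct per group, percent correct over all groups)."""
    per_group = {}
    hits = total = 0
    for group, row in matrix.items():
        n = sum(row.values())
        per_group[group] = 100.0 * row[group] / n if n else 0.0
        hits += row[group]          # diagonal cell: correctly classified
        total += n
    overall = 100.0 * hits / total if total else 0.0
    return per_group, overall

# Invented data: ten cases in three groups.
groups = [1, 2, 3]
observed = [1, 1, 1, 2, 2, 2, 2, 3, 3, 3]
predicted = [1, 1, 1, 2, 2, 3, 2, 3, 2, 3]

matrix = classification_matrix(observed, predicted, groups)
per_group, overall = percent_correct(matrix)
print(matrix)
print(per_group, overall)
```

Here eight of the ten cases fall on the diagonal, so 80 percent are correctly classified overall.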
List Incorrectly Classified Cases
This listing may be selected with the option IC=Y. For each case that was incorrectly classified, this option gives the case number, the group that case came from, and the predicted group based on the independent variables in the discriminant function. This listing can be used to check for errors in one or more of the independent variables.
Example of Incorrectly Classified Cases Listing
Print Each Step
You can print the statistics for each step using the option PS=Y. This may be important when you want to study how the inclusion or deletion of a variable affects other variables.

Wilks' lambda is the ratio of the determinant of the within-groups cross-product matrix to the determinant of the total cross-product matrix. It has values between 0 and 1. Wilks' lambda is similar to the coefficient of multiple determination (r-squared) in multiple regression, except that it moves in the opposite direction. Where r-squared gets larger as the equation improves, Wilks' lambda gets smaller as the equation improves. Thus, Wilks' lambda could be interpreted as the proportion of variance in the dependent variable that is not explained by the discriminant function model. Large values of Wilks' lambda indicate that the independent variables in the equation are not doing a good job of predicting the dependent variable group. Small values of Wilks' lambda indicate good separation between (at least some) groups.

The overall F-ratio measures whether the variables in the equation are useful in classifying cases. Typically, a probability of .05 or less leads us to reject the hypothesis that the discriminant function does not improve our ability to classify cases. The F-to-enter value for any variable not in the equation tests whether adding this variable to the equation would lead to a significant decrease in Wilks' lambda, while the F-to-remove value for any variable in the equation tests whether removing this variable would lead to a significant increase in Wilks' lambda. These values are used to determine the independent variable to enter or delete in the next step.
Example of the Print Each Step Output
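The definition of Wilks' lambda as a ratio of determinants can be made concrete with a small Python sketch. It builds the within-groups and total cross-product matrices directly from grouped data and divides their determinants. The helper names and the two tiny groups are assumptions made for illustration, and the cofactor-expansion determinant is only practical for a handful of variables.

```python
def mean_vector(rows):
    """Column means of a list of equal-length rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def cross_products(rows, center):
    """Sum of outer products of (row - center)."""
    p = len(center)
    m = [[0.0] * p for _ in range(p)]
    for row in rows:
        d = [x - c for x, c in zip(row, center)]
        for i in range(p):
            for j in range(p):
                m[i][j] += d[i] * d[j]
    return m

def determinant(m):
    """Determinant by cofactor expansion along the first row."""
    if len(m) == 1:
        return m[0][0]
    total = 0.0
    for j, a in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * a * determinant(minor)
    return total

def wilks_lambda(groups):
    """groups: {group: [rows of independent-variable values]}."""
    all_rows = [row for rows in groups.values() for row in rows]
    p = len(all_rows[0])
    total = cross_products(all_rows, mean_vector(all_rows))
    within = [[0.0] * p for _ in range(p)]
    for rows in groups.values():
        w = cross_products(rows, mean_vector(rows))   # centered on the group mean
        for i in range(p):
            for j in range(p):
                within[i][j] += w[i][j]
    return determinant(within) / determinant(total)

# Invented data: two well-separated groups measured on two variables.
separated = {
    "A": [[1.0, 2.0], [2.0, 1.0], [2.0, 2.0]],
    "B": [[6.0, 7.0], [7.0, 6.0], [7.0, 7.0]],
}
# Two groups with identical values: the grouping explains nothing.
overlapping = {"A": separated["A"], "B": separated["A"]}

lam_separated = wilks_lambda(separated)
lam_overlapping = wilks_lambda(overlapping)
print(lam_separated, lam_overlapping)
```

The well-separated groups give a lambda close to 0, while the identical groups give a lambda of 1, matching the interpretation above.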
Summary Table
A good way to get an overview of how the steps proceeded, and what effect each step had upon Wilks' lambda, is to print the summary table. To print the summary table, use the option ST=Y.
Example of a Summary Table
Canonical Variable Analysis
Canonical analysis can be used to reduce the dimensionality of the independent variables, and is similar to principal components. The first canonical variable is the linear combination of independent variables that best summarizes the differences among the groups. The second canonical variable is the next best linear combination orthogonal to the first one, and so on. You can print the canonical variable analysis by entering the option CV=Y. This option provides two tables. The first table is a summary of the eigenvalues associated with each canonical variable, as well as the proportion of the "between-group variability" accounted for by each canonical variable.
Example of a Canonical Variable Summary Table
The second table gives the coefficients of the canonical variables. This is similar to the eigenvalue printout in principal components. The number of canonical variables reported is the lesser of (the number of groups minus 1) and the number of variables entered in the discriminant function.
Example of a Canonical Variable Coefficients Table
Canonical Variables Evaluated at Group Means
You can print a table of the canonical variables, evaluated at the group means, using the option GM=Y.
Example of Canonical Variables Evaluated at Group Means Printout
Save Canonical Pair
Researchers often want to save the first two canonical variables for future analysis. You can save them with the option SP=Y. At the completion of the analysis, you will be given the opportunity to merge the canonical variable pair and predicted group into the original data.

Prior Probabilities
Prior probabilities are the initial values that will be placed on the diagonal of the matrix. Prior probabilities may be set to equal (PP=E), automatic (PP=A), or individual probabilities may be specified. When PP=E, the prior probability for each category will be equal to one divided by the number of categories (so each category has an equal prior probability). Setting the prior probabilities option to automatic (PP=A) will assign a prior probability to each category based on the frequency of that category. The sum of the prior probabilities will be one. The other method of setting the prior probabilities is to explicitly specify them. When this method is used, a prior probability must be assigned to each category. They do not need to sum to one. The following option would assign prior probabilities to three alpha categories. Note that commas are used to separate them from each other.
OPTIONS PP=(A=3.5, B=4.7, C=2.9)
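The three ways of setting priors can be sketched in Python. The guide states only that explicitly specified values need not sum to one; treating them as relative weights that are rescaled to sum to one is an assumption of this sketch, as are the function names and the category counts.

```python
def normalize_priors(weights):
    """Rescale non-negative weights so that they sum to one (assumed behavior)."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("prior weights must sum to a positive number")
    return {group: w / total for group, w in weights.items()}

def equal_priors(groups):
    """PP=E: one divided by the number of categories."""
    return {g: 1.0 / len(groups) for g in groups}

def automatic_priors(counts):
    """PP=A: priors proportional to the frequency of each category."""
    return normalize_priors(counts)

explicit = normalize_priors({"A": 3.5, "B": 4.7, "C": 2.9})   # values from the example
equal = equal_priors(["A", "B", "C"])
automatic = automatic_priors({"A": 50, "B": 30, "C": 20})     # hypothetical counts
print(explicit, equal, automatic)
```

Under that assumption, the example values 3.5, 4.7 and 2.9 sum to 11.1, so category A would receive a prior of 3.5 / 11.1.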
Number of Variables to Force
The ability to force variables into an equation is important for several reasons:
1. A researcher often wishes to replicate the analysis of another study and, therefore, to force certain core variables into the equation, letting stepwise discriminant analysis choose from the remaining set.
2. Some variables may be cheaper or easier to measure, and the user may want to see whether the remaining variables add anything to the equation.
3. When independent variables are highly correlated, one of them may be more accurate than the rest, and you may want to force this variable into the equation.
The syntax for the number of variables to force in is FO=n, where n is the number of variables to force. For example, FO=3 will force the first 3 variables from the independent variable list into the equation. An option such as FO=200 may be used to perform a non-stepwise analysis (all variables will be included in the analysis).

Category Creation
The actual categories (dependent variable groups) can be created either from the study design value labels (CC=L) or from the data itself (CC=D). When the categories are created from the labels, the value labels themselves will be used to define the dependent variable groups. Any data that does not match up with a value label (e.g., miskeyed data) will be counted as missing. When categories are created from the data, all data will be considered valid, whether or not there is a value label for it.

F to Enter & F to Remove
When faced with a large number of possible explanatory variables, two opposed criteria for selecting variables for a discriminant analysis are usually involved:
1. To make the equation useful for classification purposes, we would like our model to include as many of the independent variables as possible so that reliable group classification can be determined.
2. Because of the costs involved in obtaining information on a large number of independent variables, we would like the equation to include as few of the independent variables as possible.
The compromise between these two extremes is generally called "selecting the best set of independent variables". This involves multiple executions of discriminant analysis, in an attempt to add variables to improve classification or remove variables to simplify the classification equations. Stepwise discriminant analysis provides a partial automation of this procedure. An important property of the stepwise procedure is that a variable may be indicated to be significant at an early stage and, thus, be entered in the equation. After several other variables are added to the equation, however, the initial variable may be indicated to be insignificant (redundant), and thus removed from the model. This method is often referred to as forward inclusion with backward elimination.
The algorithm used by StatPac is as follows:
1. First, enter into the discriminant analysis all variables which the user wishes to force into the equation.
2. Enter the predictor that produces the greatest decrease in Wilks' lambda from all the remaining predictors whose entry is not inhibited by the F-to-enter.
3. Remove the predictor that makes the least increase in Wilks' lambda from all (non-forced) predictors whose removal is not inhibited by the F-to-remove.
Note that step 2 is executed only when it is not possible to execute step 3. If neither can be executed, the stepping is complete.
The following should be considered when setting F-to-enter and F-to-remove values in the parameter table:
1. A variable is removed if the F-value associated with that variable is less than the F-to-remove value set in the parameter table. Similarly, a variable is added if the F-value associated with that variable would be greater than the F-to-enter value set in the parameter table, if that variable were entered in the current equation.
2. Care should be taken to ensure that the F-to-remove is less than the F-to-enter; otherwise, a variable would be entered and then removed at alternate steps.
3. The default values for the F-to-enter and F-to-remove for many mainframe packages, and StatPac, are 4.0 and 3.9, respectively, which provide useful starting values.
4. Setting the F-to-remove value low yields the forward inclusion method.
5. For the first run on a data set, it is common to set the F-to-enter and F-to-remove values low, to execute a large number of steps.

Steps Limit
The steps limit is the maximum number of steps that can occur. Each inclusion and deletion of a variable is counted as one step. The purpose of the steps limit option is to limit computer time. The syntax for the steps limit option is SL=n, where n is the maximum number of steps allowed.

Mean Substitution
Mean substitution is one method often used to reduce the problem of missing information. Often, discriminant analysis is difficult because if one independent variable is not known, it is necessary to exclude the whole record from the analysis. This can substantially reduce the number of records that are included in the analysis. Mean substitution overcomes this problem by replacing any missing independent variable with the mean of that variable. While this technique has the possibility of slightly distorting the results, it can make it possible to perform a discriminant analysis with substantial missing data. If the value of the dependent (grouping) variable is missing, the whole record is deleted from the analysis.
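The mean substitution rule described above can be sketched in a few lines of Python. The sketch drops records whose grouping variable is missing and replaces each missing independent value with that variable's overall mean. Computing the mean only from the records that are kept is an assumption, and the function name and sample records are hypothetical.

```python
def mean_substitute(records, dependent_index=0):
    """Apply mean substitution to a list of records.

    records: list of lists; None marks a missing value.
    Records with a missing dependent (grouping) value are dropped.
    """
    kept = [r for r in records if r[dependent_index] is not None]
    n_cols = len(kept[0]) if kept else 0

    # Mean of each independent variable over the values that are present.
    means = {}
    for col in range(n_cols):
        if col == dependent_index:
            continue
        present = [r[col] for r in kept if r[col] is not None]
        means[col] = sum(present) / len(present) if present else None

    filled = []
    for r in kept:
        filled.append([
            means[col] if (col != dependent_index and value is None) else value
            for col, value in enumerate(r)
        ])
    return filled

# Hypothetical records: group code followed by two independent variables.
records = [
    [1, 5.0, 3.0],
    [1, None, 3.4],     # first independent variable missing
    [2, 6.0, None],     # second independent variable missing
    [None, 5.5, 2.9],   # grouping variable missing: record is dropped
    [2, 7.0, 2.8],
]
filled = mean_substitute(records)
print(filled)
```

In this example four records survive, and the missing first variable is replaced by 6.0, the mean of 5.0, 6.0 and 7.0.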
Predict Interactively
After performing a discriminant analysis, you may want the posterior probabilities associated with each group for a new observation, or for an observation in your data for which the value of the dependent variable was missing. This is known as interactive prediction. You can select interactive prediction by entering the option PR=Y. At the end of the analysis you will then be prompted to enter a value for each independent variable, and the computer will use the discriminant function coefficients to estimate the posterior probabilities associated with each group. The observation is then assigned to the group with the highest posterior probability. You can skip over any independent variable by just pressing <enter>; the value of the group mean for that independent variable will be used.

The analysis of variance procedure provides a systematic way of studying variability. Usually, we are interested in how much of the variability of scores on the dependent variable can be explained by the differences between scores (levels) on the experimental variables (factors). An analysis may contain up to three factors and up to 90 levels for each factor.

Type of Design
StatPac Gold contains eleven different ANOVA designs (or models). Choosing the appropriate design for a particular experiment requires careful evaluation. It is quite possible to perform an inappropriate statistical procedure by choosing the wrong model. Since StatPac has no way of knowing that the model is wrong, it will produce erroneous results. The following types of models are available:
1. One Factor Completely Randomized Design
2. Randomized Complete Block Design
3. Randomized Complete Block Design With Sampling
4. Two Factor Factorial in Completely Randomized Design
5. Two Factor Factorial in Randomized Complete Block Design
6. Three Factor Factorial in Completely Randomized Design
7. Three Factor Nested Design
8. Split-Plot With Completely Randomized Design of Main Plots
9. Split-Plot With Randomized Complete Block Design of Main Plot
10. Split-Plot With Sub-Units Arranged in Strips
The number relating to the type of design has no intrinsic meaning in and of itself. It is simply a number used to specify which model you want StatPac to use for the analysis.

Missing Data in Anova Designs
Many analysis of variance experimental designs involve assigning an equal number of cases to each cell. When all cells do not contain the same number of cases, the model is said to be "unbalanced". Unbalanced designs usually occur because of differential attrition (e.g., some of the crop dies, respondents become unavailable or refuse to participate, recording errors, etc.). StatPac uses an unweighted means solution when there is not an equal number of cases in each cell. This involves the use of the harmonic mean to adjust the sums of squares. If there is an equal number of observations in each cell, the unweighted means solution is equivalent to the usual least squares approach. The unweighted means solution requires at least one case in every cell. There may not be any cells where all the data is missing.

The unweighted means approach is an approximation technique and does not produce exact results when the design is unbalanced. Although the F statistics may not be exact, researchers have found that the F-ratios are acceptable unless the design is highly unbalanced. As a measure of departure from a balanced design, use the ratio Ni/Nj, where Ni is the greatest number of observations in any cell and Nj is the smallest. A ratio of 4:1 is tolerable, but a ratio of 16:1 should not be accepted. In such cases, an exact solution can be obtained by creating appropriate dummy variables and performing a regression analysis. For a detailed discussion, see D.G. Gosslee and H.L. Lucas (Biometrics, Volume 21, pp. 115-133).

Command Syntax and the Data File Structure
The syntax for the ANOVA command is similar for all eleven different models. The only difference is in the number of factors that are specified as part of the command. There are two general forms of the command:
ANOVA (Type) <Dependent Variable> (<Fact. 1>) (<Fact. 2>) (<Fact. 3>)
This format is used when each record contains a single value for the dependent variable. The type of design is specified first and must be enclosed in parentheses. It may be a number between one and eleven. The dependent variable is specified next. Finally, a variable is specified for each factor. Each of these variables is also enclosed in parentheses. If a design only contains one (or two) factors, only one (or two) need be specified. The second form of the ANOVA command is used when each record contains several values for the dependent variable (one for each level of one of the factors). In this case, the dependent variable is not a single variable, but rather a variable list. Since any one of the factors may be the dependent variable list, the syntax may take on three different forms:
ANOVA (Type) (<Factor 1 var. list>) (<Factor 2>) (<Factor 3>)
ANOVA (Type) (<Factor 1>) (<Factor 2 var. list>) (<Factor 3>)
ANOVA (Type) (<Factor 1>) (<Factor 2>) (<Factor 3 var. list>)
Which form of the ANOVA command you use depends upon how the data file is arranged. The following three examples illustrate the different forms of the command. The first example is a completely randomized one-way design (Type 1). There is only one factor for this kind of model. As stated above, there are two possible formats for the command. They are:
ANOVA (1) <Dependent variable> (<Factor 1>)
ANOVA (1) (<Factor 1 variable list>)
For example, let's say we are interested in studying the effect of training after one hour, two hours, and three hours. There are two ways the data file might be organized. In the first format, each record contains the score and the time of measurement. It would appear like this:
34 1 (record 1 - score at time 1 for case 1)
43 2 (record 2 - score at time 2 for case 1)
55 3 (record 3 - score at time 3 for case 1)
41 1 (record 4 - score at time 1 for case 2)
52 2 (record 5 - score at time 2 for case 2)
63 3 (record 6 - score at time 3 for case 2)
37 1 (record 7 - score at time 1 for case 3)
59 2 (record 8 - score at time 2 for case 3)
61 3 (record 9 - score at time 3 for case 3)
In this example, the dependent variable (SCORE) is variable one and the factor (TIMEPERIOD) is variable two. The values for variable two represent the levels of the variable (i.e., what time the score was taken). Two examples of the syntax to run a one-way ANOVA using the above data file would be:
ANOVA (1) V1 (V2)
ANOVA (1) SCORE (TIMEPERIOD)
In the other type of data file format, each record contains the score for each hour of training. With this format, the above data file would appear like this:
34 43 55 (rec 1 - score after each hour of training for case 1)
41 52 63 (rec 2 - score after each hour of training for case 2)
37 59 61 (rec 3 - score after each hour of training for case 3)
This data file format differs from the previous format, but the data is the same. The dependent variable is no longer just held in a single variable. There is a dependent variable for each level of the factor. Variable one is the score for time one (TIME1), variable two is the score for time two (TIME2), and variable three is the score for time three (TIME3). Three examples of the syntax to run a one-way analysis of variance for this type of data file format are:
AN (1) (V1-V3) (Note: ANOVA may be abbreviated as AN)
ANOVA (1) (V1,V2,V3)
ANOVA (1) (TIME1, TIME2, TIME3)
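To see that the two layouts carry the same information, the following Python sketch converts the one-record-per-case layout into the one-record-per-score layout and then groups the scores by level. It only illustrates the data arrangement and is not how StatPac reads files; the function names are hypothetical.

```python
def wide_to_long(wide_records):
    """Turn one record per case into one (score, level) record per score."""
    long_records = []
    for scores in wide_records:
        # The position of a score within the record identifies the factor level.
        for level, score in enumerate(scores, start=1):
            long_records.append((score, level))
    return long_records

def long_to_groups(long_records):
    """Collect the scores belonging to each factor level."""
    groups = {}
    for score, level in long_records:
        groups.setdefault(level, []).append(score)
    return groups

# The three training records from the second layout.
wide = [(34, 43, 55), (41, 52, 63), (37, 59, 61)]

long_format = wide_to_long(wide)
groups = long_to_groups(long_format)
print(long_format)
print(groups)
```

The converted records reproduce the nine records of the first layout, and each time period ends up with the same three scores either way.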
The only difference between the two data files is in the way in which they are coded. The actual data is the same in both files, and the results of the analysis of variance will be identical.

The second example is a two-way ANOVA in a completely randomized design (Type 4). A two-way analysis of variance is used to examine the effect that two independent variables have on the dependent variable. Because of the high cost of conducting experiments and the possibility of interaction effects, researchers often use a two-way design to get the most out of each experiment. For example, let's say we are studying the GROWTH (dependent variable) of four different hybrid SEEDS. We are also interested in whether any brand of FERTILIZER works better than the others. Instead of conducting two separate experiments, we conduct only one and analyze the results with a two-way ANOVA. This has an added advantage because it will take into account the interaction between fertilizer and seed type. The first form of the command syntax for the two-way ANOVA is:
ANOVA (4) <Dependent variable> (<Factor 1>) (<Factor 2>)
Since there are two factors in this design, the second form of the command could specify either factor as the variable list:
ANOVA (4) (<Factor 1 variable list>) (<Factor 2>)
ANOVA (4) (<Factor 1>) (<Factor 2 variable list>)
Notice that the last two forms of the command are variations of the same syntax. Since there are two factors, they both must be specified in the command syntax. The actual syntax depends upon the way the data file is structured. The first form of the command is used when there is a dependent variable and a variable for each factor. A sample data file format for this experimental design might look like this:
67 1 A (record 1 - yield this acre is 67; fertilizer used is coded as 1; type of hybrid seed is coded as A)
54 2 A (record 2 - yield this acre is 54; fertilizer used is coded as 2; type of hybrid seed is coded as A)
   3 A (record 3 - yield this acre is missing, crop died; fertilizer used is coded as 3; type of hybrid seed is coded as A)
52 1 B (record 4 - yield this acre is 52; fertilizer used is coded as 1; type of hybrid seed is coded as B)
61 2 B (record 5 - yield this acre is 61; fertilizer used is coded as 2; type of hybrid seed is coded as B)
27 3 B (record 6 - yield this acre is 27; fertilizer used is coded as 3; type of hybrid seed is coded as B)
Two commands to run a two-way ANOVA with this data file are:
ANOVA (4) V1 (V2) (V3)
ANOVA (4) GROWTH (FERTILIZER) (SEED)
The data file could contain the same information formatted in another way. For example, the following data file contains the same information as the previous file except that the dependent variable (GROWTH) is specified for each type of fertilizer.
67 54    A (record 1 - growth for all 3 fertilizers with seed type A; the third value is missing)
52 61 27 B (record 2 - growth for all 3 fertilizers with seed type B)
Variable one is the growth for FERTILIZER1, variable two is the growth for FERTILIZER2, and variable three is the growth for FERTILIZER3. Variable four is the SEED type. Three commands to perform a two-way ANOVA with this type of data file format are:
ANOVA (4) (V1, V2, V3) (V4)
ANOVA (4) (V1-V3) (V4)
ANOVA (4) (FERTILIZER1, FERTILIZER2, FERTILIZER3) (SEED)
For a final example of data file formatting, we'll look at a typical three-factor factorial experiment (Type = 6). This model is similar to the previous model except that three experimental factors are being examined. There are now four possible data file formats corresponding to four forms of syntax:
ANOVA (6) <Dep. var.> (<Factor A>) (<Factor B>) (<Factor C>)
ANOVA (6) (<Factor A variable list>) (<Factor B>) (<Factor C>)
ANOVA (6) (<Factor A>) (<Factor B variable list>) (<Factor C>)
ANOVA (6) (<Factor A>) (<Factor B>) (<Factor C variable list>)
A medical researcher was studying the pain-relieving properties of two different drugs (A and B). In addition to testing the drugs themselves, she wanted to compare high and low doses as well as the method of administration (oral or intravenous). The dependent variable is a measure of pain RELIEF (on a scale of 0 to 9), factor A is the DRUG type, factor B is the DOSE and factor C is the METHOD of administration. In the first data file format, the dependent variable and each factor represent unique variables in the data file.
8 A H O (Pain relief 8 - drug A - high dose - oral admin.)
6 A L O (Pain relief 6 - drug A - low dose - oral admin.)
9 A H I (Pain relief 9 - drug A - high dose - iv. admin.)
8 A L I (Pain relief 8 - drug A - low dose - iv. admin.)
5 B H O (Pain relief 5 - drug B - high dose - oral admin.)
4 B L O (Pain relief 4 - drug B - low dose - oral admin.)
6 B H I (Pain relief 6 - drug B - high dose - iv. admin.)
5 B L I (Pain relief 5 - drug B - low dose - iv. admin.)
The commands to run the ANOVA with this data file organization are:
ANOVA (6) V1 (V2) (V3) (V4)
ANOVA (6) RELIEF (DRUG) (DOSE) (METHOD)
The last three forms of the command are used when one of the factors is specified as part of a variable list. For example, both methods of administering the drug could be contained in the same data record. The data file would appear like this:
A H 8 9 (Drug A - high dose - relief for oral & iv admin.)
A L 6 8 (Drug A - low dose - relief for oral & iv admin.)
B H 5 6 (Drug B - high dose - relief for oral & iv admin.)
B L 4 5 (Drug B - low dose - relief for oral & iv admin.)
The commands to run the ANOVA with this data file format would be:
ANOVA (6) (V1) (V2) (V3, V4)
ANOVA (6) (DRUG) (DOSE) (ORAL, IV)
It is especially important to match the appropriate command syntax with the format of the data file. The versatility of StatPac to read several different data file formats makes it easy to analyze most data sets. The best advice is to plan the analysis before entering data.

Descriptive Statistics
Descriptive statistics for each cell may be printed or suppressed with the option DS=Y or DS=N, respectively. In multi-factor experiments it is often desirable to print descriptive statistics for each factor controlled for the other factors. To print controlled descriptive statistics, use the option DS=C. The descriptive statistics printout will contain the count, mean and unbiased standard deviation.
Example of Descriptive Statistics
Anova Table
The ANOVA table is the heart of the analysis. The F-test reveals whether or not there are significant differences between the levels of the experimental factor(s). The actual terms that appear in the ANOVA table depend on the type of design. Generally, experiments involve assigning cases to groups on the basis of some experimental condition and observing the differences between the groups on the dependent variable. As the differences between the groups increase, so will the F-ratio. The actual formula for a particular F-test depends upon the ANOVA design and whether the factors are fixed or random. A significant F-ratio means that there is a significant difference between the means of the dependent variable for at least two groups.

For example, in a completely randomized two-factor factorial analysis (Type = 4), there are three F-ratios: one for the A factor, one for the B factor and one for the AB interaction. Their interpretation is as follows:
1. The F-ratio for factor A tests whether the factor A variable has a significant effect on the response of the dependent variable, averaged over all levels of the factor B variable. A probability less than .05 is generally considered significant.
2. The F-ratio for factor B tests whether the factor B variable has a significant effect on the response of the dependent variable, averaged over all levels of the factor A variable.
3. The F-ratio for interaction tests whether there is significant interaction between the factor A and factor B variables. Interaction results from the failure of differences between responses at the different levels of one of the variables to remain constant over the different levels of the other variable. If the interaction term is significant, the F-tests for factors A and B should be interpreted with care. The next step is generally to examine the means for all pairs of levels of the two variables.
Example of an ANOVA Table
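As a worked illustration of where an F-ratio comes from, the Python sketch below computes the one-way (Type 1) ANOVA terms for the training scores used earlier in the data file examples. It uses the textbook sums-of-squares formulas for a one-factor completely randomized design. The function name and the layout of the result are assumptions, and no probability is computed because that requires the F distribution.

```python
def one_way_anova(groups):
    """Return the one-way ANOVA terms for {level: [scores]}."""
    all_scores = [x for scores in groups.values() for x in scores]
    n_total = len(all_scores)
    grand_mean = sum(all_scores) / n_total

    ss_between = 0.0   # variability of the group means around the grand mean
    ss_within = 0.0    # variability of the scores around their own group mean
    for scores in groups.values():
        mean = sum(scores) / len(scores)
        ss_between += len(scores) * (mean - grand_mean) ** 2
        ss_within += sum((x - mean) ** 2 for x in scores)

    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "ss_between": ss_between, "df_between": df_between,
        "ss_within": ss_within, "df_within": df_within,
        "ms_between": ms_between, "ms_within": ms_within,
        "f_ratio": ms_between / ms_within,
    }

# Training scores grouped by time period (from the one-way data file example).
training = {1: [34, 41, 37], 2: [43, 52, 59], 3: [55, 63, 61]}
table = one_way_anova(training)
print(table)
```

For these scores the between-groups sum of squares is about 764.22 on 2 degrees of freedom and the within-groups sum of squares is 188 on 6 degrees of freedom, giving an F-ratio of about 12.2.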
Fixed and Random Factors
When conducting tests of significance in multi-factor designs, you must specify whether each of the factors is fixed or random. This will determine which of the mean squares in the analysis of variance table is used for the denominator of the F-ratio. Whether a factor is fixed or random can be determined using the following reasoning.

Assume a factor has a potential (or population) of P levels, which may be quite large. The experimenter may group the P potential levels into p effective levels by either combining adjoining levels or deliberately selecting what are considered to be representative levels. While p is less than P, the effective levels still represent the entire potential (or population). Whenever the selection of the p levels from the potential P levels is determined by some systematic non-random procedure, the factor is fixed. In the special case where the number of levels (p) of a factor is equal to P (no levels were grouped), the factor is also fixed. Examples of fixed factors include rates of application, varieties, types of compound, etc. With fixed factors, we are generally interested in estimating fixed effects associated with the specific levels of the fixed factors.

In contrast to this systematic selection procedure, if the p levels of a factor included in the experiment represent a random sample from the potential P levels, the factor is considered to be a random factor. For example, if a random sample of p of the P potential hospitals is included in the experiment, the factor (hospitals) is a random factor. In most practical situations in which random factors are encountered, p is quite small relative to P. Examples of random factors include people, herds, plants, lots, hospitals, etc. With random factors, we are generally interested in estimating the variability present in these factors.

The F1, F2 and F3 options may be used to set a factor as fixed or random.
For example, to set factors 2 and 3 as random factors, you would use the option:
OPTIONS F2=R F3=R
The formulas for the F tests depend upon the type of design and whether the factors are fixed or random. If a factor is inappropriately specified as random, StatPac will simply continue processing it as if it were random and will be unable to detect the error.

Critical F Probability
The analysis of variance by itself can reveal that differences exist between different levels of the experimental condition. That is, a significant F-ratio indicates a significant difference between at least two of the levels. It does not actually tell where the differences occur. The lsd t-tests between all the combinations of means reveal where the actual difference(s) is (are). The t-tests will only be performed if the F-ratio is significant at the critical F probability. For example, if the critical F probability is equal to .05 (CF=.05), the t-tests will be performed only if the probability of the F-ratio is less than or equal to .05. The t-tests will be performed for only those pairs of means that have a significant F-ratio. Take the example where factor A has a significant F-ratio and factor B does not. The lsd t-tests will be performed between all combinations of levels of factor A, while no t-tests will be performed between the levels of factor B.

Critical T Probability
If an F-ratio is significant at the critical F probability, StatPac will run through each pair of cell means and compute the lsd t-statistic and probability of t. We are usually only interested in those combinations where the t-statistic is significant. The critical t probability allows the selective printing of the t-statistics depending on the probability of t. For example, if the critical t probability is set to .05 (CT=.05), only those t-values that have probabilities of .05 or less will be printed. The t statistic will reveal differences between the group means. If any t is significant, it will be printed. This procedure, called the new lsd (least significant difference) t-test, is considered to be one of the most conservative post-hoc tests.
Example of a t-Test Printout
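The idea behind pairwise lsd t-tests can be sketched in Python using the textbook least significant difference statistic, which divides the difference between two group means by a standard error built from the within-groups mean square. This is an assumption about the general technique; the exact "new lsd" computation in StatPac may differ. The function name is hypothetical, and the critical value is supplied from a t table rather than computed.

```python
import math
from itertools import combinations

def lsd_t_tests(groups, critical_t):
    """Return {(level_a, level_b): (t, is_significant)} for every pair of levels."""
    n_total = sum(len(s) for s in groups.values())
    means = {g: sum(s) / len(s) for g, s in groups.items()}
    ss_within = sum((x - means[g]) ** 2 for g, s in groups.items() for x in s)
    ms_within = ss_within / (n_total - len(groups))   # pooled error mean square

    results = {}
    for a, b in combinations(sorted(groups), 2):
        se = math.sqrt(ms_within * (1 / len(groups[a]) + 1 / len(groups[b])))
        t = (means[a] - means[b]) / se
        results[(a, b)] = (t, abs(t) >= critical_t)
    return results

# Training scores grouped by time period (from the one-way data file example).
training = {1: [34, 41, 37], 2: [43, 52, 59], 3: [55, 63, 61]}

# 2.447 is the two-tailed .05 critical value of t with 6 degrees of freedom,
# taken from a standard t table.
tests = lsd_t_tests(training, critical_t=2.447)
significant = [pair for pair, (t, flag) in tests.items() if flag]
print(tests)
print(significant)
```

With these scores, time 1 differs significantly from both time 2 and time 3, while the difference between time 2 and time 3 does not reach the .05 level.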
Category CreationThe actual categories (levels for each factor) can be created either from the study design value labels (CC=L) or from the data itself (CC=D). When the categories are created from the labels, the value labels themselves will be used to define the levels for each factor. Any data that does not match up with a value label (e.g., mispunched data) will be counted as missing. When categories are created from the data, the data itself will be used to define the levels, whether or not there is a matching value label. Print CodesThe code for each level can be printed or suppressed with the PC=Y and PC=N options, respectively. KruskalWallis TestThe nonparametric equivalent of an analysis of variance is the KruskalWallis test. The data is ranked, and the sum of the ranks for each of the groups is used to calculate the statistic. The probability is determined using the chisquare distribution with the degrees of freedom equal to the number of groups minus one. Labeling and Spacing Options
ANOVA Examples
The following pages give a brief description of the eleven analysis of variance designs that StatPac can analyze, along with simple examples and the statistical tests for each of these designs. It is important to note that, in many cases, more than one design may be appropriate for a given data set.
1. One-Factor Completely Randomized Design
Syntax:
ANOVA (1) <Dependent variable> (<Factor A>)
ANOVA (1) (<Factor A variable list>)
Discussion: This is the simplest design and the easiest to carry out. The design contains only one factor, and can handle unequal numbers of observations per level. An Example: In an attempt to study fat absorption in doughnuts, 24 doughnuts were prepared (six doughnuts from each of four kinds of fats). The dependent variable is grams of fat absorbed, and the factor variable is the type of fat. The factor contains four levels (four types of fat were tested). The researcher accidentally dropped one of the doughnuts from the second type of fat, so the second type of fat contains five observations instead of six.
The analysis of variance table follows:
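For readers who want to see where the entries of a one-factor table come from, the following minimal Python sketch computes the sums of squares, degrees of freedom, mean squares and F-ratio for an unbalanced one-factor design. The function name and the measurements are made up for illustration (they are not the doughnut data, and this is not StatPac's code); the second level has five observations to mimic the unequal group sizes in the example.

```python
def one_way_anova(groups):
    """One-factor completely randomized ANOVA.

    groups: one list of observations per level of factor A.
    Unequal group sizes are allowed.
    """
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Between-level (factor A) and within-level (error) sums of squares.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)

    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "ss_between": ss_between, "df_between": df_between,
        "ss_within": ss_within, "df_within": df_within,
        "ms_between": ms_between, "ms_within": ms_within,
        "F": ms_between / ms_within,
    }

# Hypothetical data: four levels, the second with one observation missing.
data = [
    [4, 5, 6, 5, 4, 6],
    [7, 8, 9, 8, 8],
    [5, 6, 5, 7, 6, 7],
    [3, 4, 2, 4, 3, 2],
]
table = one_way_anova(data)
for name, value in table.items():
    print(name, round(value, 4))
```

With 23 observations in four levels, the factor has 3 degrees of freedom and the error term has 19.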
2. Randomized Complete Block Design
Syntax:
ANOVA (2) <Dependent variable> (<Factor A>) (<Factor B>)
ANOVA (2) (<Factor A variable list>) (<Factor B>)
ANOVA (2) (<Factor A>) (<Factor B variable list>)
Note: Factor A is always the experimental treatment
Factor B is always the replication or block
Discussion: This design is easy to carry out. It is essentially a one-way analysis of variance with replications (blocks). This design always contains exactly one observation per cell. Units assigned to the same block are as similar as possible in responsiveness, thus increasing the precision of treatment comparisons by eliminating block-to-block variation. Blocks can represent time, location or experimental material. Examples of blocks include repeated testing over time, littermates, and groups of experimental plots as similar as possible in terms of fertility, drainage, and liability to attack by insects. An Example: A researcher wants to study the effects of four seed treatments and a control group (a total of five treatment levels) on the germination of soybean seeds. The factor variable is the type of treatment (1 to 5). Five germination beds were prepared for each level of treatment and 100 seeds were planted in each bed. Thus, the replications are the five beds. The dependent variable is the number of seeds in each bed that failed to germinate. There are two factors: treatment and replication. The analysis of variance table follows:
3. Randomized Complete Block Design With Sampling
Syntax:
ANOVA (3) <Dep. var.> (<Factor A>) (<Factor B>) (<Factor C>)
ANOVA (3) (<Factor A variable list>) (<Factor B>) (<Factor C>)
ANOVA (3) (<Factor A>) (<Factor B variable list>) (<Factor C>)
ANOVA (3) (<Factor A>) (<Factor B>) (<Factor C variable list>)
Note: Factor A is always the experimental treatment
Factor B is always the replicate or block
Factor C is always the sampling or determinations
Discussion: This design is the same as the previous design with more than one observation per experimental unit. Experiments often contain more than one observation per experimental unit when the researcher wishes to estimate the reliability of measurement. With this design, the error term is broken down into experimental error and sampling error. Sampling error measures the failure of observations made on any experimental unit to be precisely alike. Experimental error is usually expected to be larger than sampling error. In other words, variation among experimental units is expected to be larger than variation among subsamples of the same unit. An Example: The objective of an experiment was to study the effect of corn variety on protein content. A field was divided into three similar plots. Each plot was subdivided into fourteen sections. Fourteen different varieties of corn were planted in each plot (one in each section). After harvest, two protein determinations were made on each variety of corn in each plot. The dependent variable was the protein content. Factor A was the type of corn and contained 14 levels. Factor B was the replication (the three plots). Factor C was the two determinations. The analysis of variance table follows:
4. Two-Factor Factorial in Completely Randomized Design
Syntax:
ANOVA (4) <Dependent variable> (<Factor A>) (<Factor B>)
ANOVA (4) (<Factor A variable list>) (<Factor B>)
ANOVA (4) (<Factor A>) (<Factor B variable list>)
Discussion: When compared to a one-factor-at-a-time approach, factorial designs are superior because they enable interactions between different factors to be explored. Instead of performing two experiments (one for each factor), the researcher can perform one experiment to determine the effects of each factor and their interaction. Unbalanced designs are acceptable. An Example: Sixty baby male rats were randomly assigned to one of six feeding treatments. The dependent variable is the weight gain of the rats. The feeding treatments were a combination of two factors, source and level of protein. Three of the rats died before the experiment was completed. The six feeding treatments were a combination of two factors:
Factor A (3 levels): Source of protein: Beef, Cereal, Pork
Factor B (2 levels): Level of protein: High, Low
The analysis of variance table follows:
5. Two-Factor Factorial in Randomized Complete Block Design
Syntax:
ANOVA (5) <Dep. var.> (<Factor A>) (<Factor B>) (<Factor C>)
ANOVA (5) (<Factor A variable list>) (<Factor B>) (<Factor C>)
ANOVA (5) (<Factor A>) (<Factor B variable list>) (<Factor C>)
ANOVA (5) (<Factor A>) (<Factor B>) (<Factor C variable list>)
Note: Factor A is always an experimental treatment
Factor B is always an experimental treatment
Factor C is always the replicate or block
Discussion: This ANOVA design is the same as the previous two-way factorial design except that replications have been added. This design will always contain exactly one observation per cell. It should be noted that for small experiments, the degrees of freedom for error may be quite small with this design. An Example: Riboflavin content of collard leaves can be determined by a chemical technique known as fluorometric determination. An experiment was designed to study this technique. Factor A is the size of leaf used to make the determination (0.25 grams and 1.00 grams), and factor B is the effect of the inclusion of a permanganate-peroxide clarification step in the chemical process. The procedure was replicated on three successive days. There is one observation for each cell of the design. The dependent variable is apparent riboflavin concentration (mg./gm.) in collard leaves. The analysis of variance table follows:
6. Three-Factor Factorial in Completely Randomized Design
Syntax:
ANOVA (6) <Dep. var.> (<Factor A>) (<Factor B>) (<Factor C>)
ANOVA (6) (<Factor A variable list>) (<Factor B>) (<Factor C>)
ANOVA (6) (<Factor A>) (<Factor B variable list>) (<Factor C>)
ANOVA (6) (<Factor A>) (<Factor B>) (<Factor C variable list>)
Note: Factor A is always an experimental treatment
Factor B is always an experimental treatment
Factor C is always an experimental treatment
Discussion: The three-way design is used when there are three factors to investigate. There are three main effects (A, B and C) and four interaction effects (AB, AC, BC and ABC). Unbalanced designs are acceptable. In the special case where there is only one observation per cell, the error term becomes equal to zero. In this case, it is usually assumed that the three-way interaction term is not significantly different from the error term, and the three-way interaction is used in place of the error term. With small numbers of levels for the factors, this can leave very few degrees of freedom for the error (3-way interaction) term. An Example: A researcher was investigating the effect of various fertilizers on the growth of carrots. The three factors were nitrogen (N), potassium (K) and phosphorus (P). Three concentration levels were tested for each factor (low, medium, and high). This resulted in 27 different fertilizer combinations. The dependent variable is the weight of the carrot root (in grams) grown under each of the fertilizer conditions. Note that in the special case of one observation per cell (i.e., one carrot for each fertilizer combination), no error term appears in the ANOVA table.
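The degrees-of-freedom bookkeeping behind the one-observation-per-cell case can be checked with a short Python sketch. It assumes a balanced three-factor factorial; the function name is hypothetical and this is an illustration rather than StatPac's own calculation.

```python
def three_factor_df(a, b, c, n_per_cell):
    """Degrees of freedom for a balanced three-factor factorial design.

    a, b, c: number of levels of factors A, B and C.
    n_per_cell: observations in each of the a*b*c cells.
    """
    df = {
        "A": a - 1,
        "B": b - 1,
        "C": c - 1,
        "AB": (a - 1) * (b - 1),
        "AC": (a - 1) * (c - 1),
        "BC": (b - 1) * (c - 1),
        "ABC": (a - 1) * (b - 1) * (c - 1),
    }
    # Error comes only from replication within cells.
    df["Error"] = a * b * c * (n_per_cell - 1)
    df["Total"] = a * b * c * n_per_cell - 1
    return df

# The carrot example: three factors at three levels, one carrot per cell.
carrot_df = three_factor_df(3, 3, 3, n_per_cell=1)
print(carrot_df)
```

For the 27 fertilizer combinations the seven effects use up all 26 total degrees of freedom, leaving none for error, so the 8 degrees of freedom of the three-way interaction are what stands in for the error term.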
The analysis of variance table is as follows:
7. Three-Factor Nested Design
Syntax:
ANOVA (7) <Dep. var.> (<Factor A>) (<Factor B>) (<Factor C>)
ANOVA (7) (<Factor A variable list>) (<Factor B>) (<Factor C>)
ANOVA (7) (<Factor A>) (<Factor B variable list>) (<Factor C>)
ANOVA (7) (<Factor A>) (<Factor B>) (<Factor C variable list>)
Note: Factor A is always the experimental unit
Factor B is always a subunit of Factor A
Factor C is always a subunit of Factor B
Discussion: When each sample is composed of subsamples, we have a nested or hierarchical design. The objective of this design is to estimate the variance components associated with the various nested factors. Unbalanced designs are acceptable. In the special case of exactly one observation per cell, the error term is zero, and it will not be printed. An Example: An investigator wanted to estimate calcium concentration in leaves of turnip plants. Four plants were taken at random and three leaves (samples) were randomly selected from each plant. Two subsamples were then taken from each leaf, and calcium was determined by microchemical methods. The objectives of the experiment were to estimate the variability in concentration across plants, between leaves of the same plant, and within subsamples of the same leaf.
The analysis of variance table follows:
8. Split-Plot With Completely Randomized Design of Main Plots
Syntax:
ANOVA (8) <Dep. var.> (<Factor A>) (<Factor B>) (<Factor C>)
ANOVA (8) (<Factor A variable list>) (<Factor B>) (<Factor C>)
ANOVA (8) (<Factor A>) (<Factor B variable list>) (<Factor C>)
ANOVA (8) (<Factor A>) (<Factor B>) (<Factor C variable list>)
Note: Factor A is always an experimental treatment
Factor B is always an experimental treatment
Factor C is always the replicate or block
Discussion: The term split-plot comes from agricultural experimentation. Split-plot designs contain two treatment factors. The main plots are the experimental units for one of the factors, and the subplots are the experimental units for the other factor. Split-plots are a repeated measures design. An Example: In this experiment, six subjects were divided into two groups according to the method they were told to use for calibrating dials. Three subjects used method A1 to calibrate the dials, and three subjects used method A2. Each subject was told to calibrate four differently shaped dials (B1, B2, B3 and B4). The dependent variable is the accuracy of each calibration attempt. Factor A is the method of calibrating dials, and factor B is the shape of dials. The three subjects in each group are the replicates. This design will always contain exactly one observation per cell. The analysis of variance table follows:
9. Split-Plot With Randomized Complete Block Design of Main Plots
Syntax:
ANOVA (9) <Dep. var.> (<Factor A>) (<Factor B>) (<Factor C>)
ANOVA (9) (<Factor A variable list>) (<Factor B>) (<Factor C>)
ANOVA (9) (<Factor A>) (<Factor B variable list>) (<Factor C>)
ANOVA (9) (<Factor A>) (<Factor B>) (<Factor C variable list>)
Note: Factor A is always an experimental treatment
Factor B is always an experimental treatment
Factor C is always the replicate or block
Discussion: This design is preferable to the previous design when homogeneous blocks of experimental units are available, thus allowing a more accurate comparison of treatments by eliminating block-to-block variability. Each AB treatment combination is replicated in each block. There will always be exactly one observation per cell with this design. An Example: This example studies the effects of alfalfa variety and the date of harvest on yields. Six plots (replicates) were used. Factor A is alfalfa variety and factor B is the date of harvest. Factor C is the replicates (six plots). The analysis of variance table follows:
10. Split-Plot With Sub-Unit Treatments Arranged in Strips
Syntax:
ANOVA (10) <Dep. var.> (<Factor A>) (<Factor B>) (<Factor C>)
ANOVA (10) (<Factor A variable list>) (<Factor B>) (<Factor C>)
ANOVA (10) (<Factor A>) (<Factor B variable list>) (<Factor C>)
ANOVA (10) (<Factor A>) (<Factor B>) (<Factor C variable list>)
Note: Factor A is always an experimental treatment
Factor B is always an experimental treatment
Factor C is always the replicate or block
Discussion: Instead of randomizing the subunit treatment independently within each unit, it is often necessary (or desirable) to have the subunit treatment arranged in strips across each replication. This design has an advantage over the previous design because it allows the determination of experimental error (error not attributable to either main factor). The layout may be particularly convenient for some field experiments. This design always contains exactly one observation per cell. This design sacrifices precision on the main effects of A and B in order to provide higher precision on the interaction term, which will generally be more accurately determined than in either randomized blocks or the simple split-plot design. For a 5 by 3 design, the appropriate arrangement (after randomization) might be as shown below for 2 replications:
Replication 1 Replication 2 a3 a1 a2 a0 a4 a1 a4 a0 a2 a3 b2 b1 b0 b2 b1 b0
An Example: The researcher used ten varieties and three generations of corn seed to study the effect on yield. The generations (a, b and c) appear in strips across blocks as well as the hybrid number. The analysis of variance table follows:
11. Latin Square Design
Syntax:
ANOVA (11) <Dep. var.> (<Factor A>) (<Factor B>) (<Factor C>)
ANOVA (11) (<Factor A variable list>) (<Factor B>) (<Factor C>)
ANOVA (11) (<Factor A>) (<Factor B variable list>) (<Factor C>)
ANOVA (11) (<Factor A>) (<Factor B>) (<Factor C variable list>)
Note: Factor A is always the row factor
Factor B is always the column factor
Factor C is always the treatment factor
Discussion: Latin square designs are very efficient when a small number of treatments are being tested because treatment comparisons are made more precise by eliminating row and column effects. The basic characteristic of a Latin square design is that each treatment appears once in each row and once in each column. With small numbers of treatments, there are few degrees of freedom for error. The limitation of the Latin square for a large number of treatments is due to the requirement that there be the same number of replications as treatments. Thus, the most generally used Latin squares vary from 4 by 4 to 10 by 10. Latin square designs are also useful to study sequences of treatments and/or carryover of treatments. This design will always contain exactly one observation per cell. An Example: The field layout for 5 irrigation treatments (A, B, C, D and E) was as follows:
         Columns
         1  2  3  4  5
Row 1    E  D  A  B  C
Row 2    C  E  D  A  B
Row 3    A  C  B  E  D
Row 4    D  B  E  C  A
Row 5    B  A  C  D  E
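The defining property of the layout (each treatment once in every row and once in every column) is easy to verify mechanically. The following Python sketch checks the irrigation layout given above; the function name is hypothetical and the check is only an illustration of the property, not part of StatPac.

```python
def is_latin_square(rows):
    """Return True if every symbol appears exactly once in each row and column.

    rows: list of equal-length strings (or lists), one per row of the square.
    """
    n = len(rows)
    symbols = set(rows[0])
    # A valid n-by-n square uses exactly n distinct treatments.
    if len(symbols) != n or any(len(r) != n for r in rows):
        return False
    rows_ok = all(set(r) == symbols for r in rows)
    cols_ok = all({r[j] for r in rows} == symbols for j in range(n))
    return rows_ok and cols_ok

# Field layout for the 5 irrigation treatments, rows 1 to 5.
layout = [
    "EDABC",
    "CEDAB",
    "ACBED",
    "DBECA",
    "BACDE",
]
print(is_latin_square(layout))
```

A check like this can be useful before data entry, since a layout that repeats a treatment within a row or column no longer has exactly one observation per cell of the row, column and treatment classification.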
The analysis of variance table follows:
Summary of Anova Designs
Anova Type=1 One-Factor Completely Randomized Design
Factor A - experimental treatment
Balanced or unbalanced design
Anova Type=2 Randomized Complete Block Design
Factor A - experimental treatment
Factor B - replicates or blocks
Balanced design only
Anova Type=3 Randomized Complete Block Design With Sampling
Factor A - experimental treatment
Factor B - replicates or blocks
Factor C - sampling or determinations
Balanced design only
Anova Type=4 Two-Factor Factorial in Completely Randomized Design
Factor A - experimental treatment
Factor B - experimental treatment
Balanced or unbalanced design
Anova Type=5 Two-Factor Factorial in Randomized Complete Block Design
Factor A - experimental treatment
Factor B - experimental treatment
Factor C - replicates or blocks
Balanced design only
Anova Type=6 Three-Factor Factorial in Completely Randomized Design
Factor A - experimental treatment
Factor B - experimental treatment
Factor C - experimental treatment
Balanced or unbalanced design
Anova Type=7 Three-Factor Nested Design
Factor A - experimental unit
Factor B - subunit of factor A
Factor C - subunit of factor B
Balanced or unbalanced design
Anova Type=8 Split-Plot With Completely Randomized Design of Main Plots
Factor A - experimental treatment
Factor B - experimental treatment
Factor C - replicates or blocks
Balanced design only
Anova Type=9 Split-Plot With Randomized Complete Block Design of Main Plots
Factor A - experimental treatment
Factor B - experimental treatment
Factor C - replicates or blocks
Balanced design only
Anova Type=10 Split-Plot With Sub-Unit Treatments Arranged in Strips
Factor A - experimental treatment
Factor B - experimental treatment
Factor C - replicates or blocks
Balanced design only
Anova Type=11 Latin Square Design
Factor A - row factor
Factor B - column factor
Factor C - experimental treatment
Balanced design only
Canonical correlation analysis is a powerful multivariate statistical technique to study the intercorrelation structure of two sets of variables (variable set 1 and variable set 2). Each "set" of variables must contain at least two variables, although the number of variables in each set does not need to be the same. StatPac defines set 1 as the set with the larger number of variables, and set 2 as the set with the smaller number of variables. This convention is used to speed up the execution of the analysis. Often (but not necessarily), one set is regarded as a "dependent" set and the other set is regarded as an "independent" set. For example, buying behavior could be a "dependent" variable set, and various personality characteristics of the buyers could be the "independent" variable set. Usually, the variables in each set represent two distinct variable "domains" that are conceptually different and measured on the same individuals. Canonical correlation analysis is related to many other statistical techniques. Consider the following analogy. If each set of variables were to contain only one variable, simple correlation analysis would be equivalent to canonical correlation analysis. If one of the sets has one variable and the other set has two or more variables, then multiple regression analysis would be equivalent to canonical correlation analysis. Canonical correlation analysis can be seen as an extension of multiple regression when a single dependent variable is replaced by a set of dependent variables. Canonical correlation analysis first calculates two new variables called the "first canonical variable pair". One of the new variables is calculated from set 1, and the other new variable is calculated from set 2. These new variables are constructed through linear combinations of the original variables so that the new canonical variable from set 1 has maximum correlation with the new canonical variable from set 2. 
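As a rough illustration of how the canonical correlations arise, the Python sketch below handles the smallest permitted case (two variables in each set) using only the standard library. It relies on a standard result: the squared canonical correlations are the eigenvalues of R11^-1 R12 R22^-1 R21, where R11 and R22 are the within-set correlation matrices and R12 is the between-set correlation matrix. The function names and the eight records of data are made up for illustration; nothing here is claimed to match StatPac's internal algorithm.

```python
import math

def corr(a, b):
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)

def inv2(m):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def mul2(p, q):
    """Product of two 2x2 matrices."""
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def canonical_correlations(set1, set2):
    """Canonical correlations (largest first) for two sets of two variables."""
    r11 = [[corr(u, v) for v in set1] for u in set1]
    r22 = [[corr(u, v) for v in set2] for u in set2]
    r12 = [[corr(u, v) for v in set2] for u in set1]
    r21 = [[r12[j][i] for j in range(2)] for i in range(2)]
    m = mul2(mul2(inv2(r11), r12), mul2(inv2(r22), r21))
    # Eigenvalues of a 2x2 matrix from its trace and determinant.
    trace = m[0][0] + m[1][1]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    disc = math.sqrt(max(trace * trace / 4.0 - det, 0.0))
    eigenvalues = [trace / 2.0 + disc, trace / 2.0 - disc]
    return [math.sqrt(max(e, 0.0)) for e in eigenvalues]

# Hypothetical records: two physiological and two exercise variables.
weight = [150, 162, 171, 180, 193, 205, 168, 177]
waist = [31, 34, 33, 35, 36, 40, 34, 33]
chins = [14, 11, 12, 8, 6, 3, 13, 9]
situps = [180, 160, 210, 140, 150, 90, 170, 120]

r = canonical_correlations([weight, waist], [chins, situps])
print([round(v, 4) for v in r])
```

The first value is the correlation of the first canonical variable pair and can never be smaller than the absolute value of any simple correlation between a variable in one set and a variable in the other. The second value belongs to the second pair.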
The second canonical variable pair is created in a similar fashion. One canonical variable is constructed from the first set of original variables and has maximum correlation with the second canonical variable created from the second set of original variables. This process is subject to the constraint that the second canonical variable pair is uncorrelated with the first canonical variable pair. In essence, the first canonical pair (one canonical variable from each set) is chosen so that the correlation between the two sets of original variables is maximized. The second set of canonical variables maximizes the "remaining correlation" between the two sets of variables (not picked up by the first set of canonical variables). The same philosophy applies to subsequent canonical variable pairs created. This concept of independence of each canonical variable pair is similar to that involved in calculating principal components. The difference lies in the way the canonical variable pairs (or weights assigned to the original variables of each of the two sets) are created. In canonical correlation, the weights for each pair of canonical variables are calculated with the intent of maximizing "remaining" correlation between the two sets of variables, while in principal components the weights are calculated with the intent of maximizing the remaining "correlation (or variance) structure" in one set of variables. Canonical analysis provides a powerful method of studying the correlation structure between two sets of variables. For example, assume we are interested in studying the intercorrelation of variables from two different tests, each containing 20 variables. The correlation matrix would contain 400 different correlation coefficients to examine. This would be a very large task. The canonical correlation analysis simplifies this task by:
1. Determining the maximum correlation possible of any linear combination of the original variables in each set.
2. Deriving two sets of weighting coefficients (one for each of the two sets of variables) to arrive at a new composite variable (canonical variate) representing each set of variables. This provides a structured way of examining the contribution of each original variable (from each set). More specifically, this allows one to pinpoint groups of variables from each set that are highly correlated. Often, these groups can be given theoretical meaning.
3. Deriving additional linear functions which maximize the remaining correlation between the two sets, subject to being independent of the preceding set(s) of linear compounds.
4. Testing the statistical significance of the correlation measures, thereby determining the minimum number of linear functions required to account for the correlation structure of the two sets.
5. Assessing the overall extent of correlation between the two sets of variables.
For a detailed mathematical discussion of canonical correlation analysis (as well as the interpretation of the canonical correlation output of several examples), the user is referred to the Lohnes and Cooley (1971) book given in the bibliography. The syntax of the command to run canonical correlation analysis is:
CANONICAL <Variable list 1> WITH <Variable list 2>
As an example, a researcher wished to study the correlation between 3 physiological measurements WEIGHT (V1), WAIST (V2) and PULSE (V3) with 3 exercise variables CHINS (V4), SITUPS (V5) and JUMPS (V6). The command to run canonical correlation analysis could be specified in several ways:
CANONICAL WEIGHT, WAIST, PULSE WITH CHINS, SITUPS, JUMPS
CANONICAL V1 V2 V3 WITH V4 V5 V6
CANONICAL V1-V3 WITH V4-V6
CA V1-V3 WITH V4-V6 (Note: CANONICAL can be abbreviated as CA)
The variable list can consist of variable labels and/or variable numbers. Either a comma or a space can be used to separate the variables from each other. The keyword WITH is used to separate the two sets of variables.
Descriptive Statistics
The mean and standard deviations for all variables in both sets can be printed with the descriptive statistics option (DS=Y).
Example of Descriptive Statistics
Simple Correlation Matrix
The simple correlation matrix is printed with the SC=Y option.
Example of a Simple Correlation Matrix
R-Squared Table
This option (RS=Y) provides the multiple correlation (r-squared) analysis of each variable in each set, regressed on all variables in the other set. The r-squared value given in the first row is the squared multiple correlation of the first variable in the first set with the entire set of variables in the second set. The r-squared value in the last row is the squared multiple correlation of the last variable in the second set with the entire set of variables in the first set, and so forth. This output is a useful supplement to canonical correlation analysis, especially when used in conjunction with the redundancy analysis output described below. The output from this option allows one to measure whether certain variables are more correlated than other variables (within the same set) to the other set as a whole.
Example of an R-Squared Table
Standardized Coefficients
Standardized coefficients for all canonical variables in both sets are printed with the option RC=Y. The standardized canonical variable coefficients are printed for the first set of variables, followed by those of the second set. These are the weights assigned to each original variable (standardized by subtracting the mean and dividing by the standard deviation) to construct the new variables (standardized canonical variable pairs) for each set.
Example of a Standardized Coefficients Table
Correlation of Canonical Pair with Original Variables
The canonical variable loadings are provided for each set in a separate table. These loadings are the correlation of the canonical variables with the original variables in the set and are similar to factor loadings in factor analysis. The loadings (correlations) therefore indicate the relative contribution of each original variable (in each set) in constructing the new canonical variables. This is the most useful table in understanding the correlation structure of the two sets of variables, and it provides insight into interpretation of the canonical variable pairs.
Example of Canonical Variable Loadings
Redundancy Table
Canonical redundancy analysis examines 1) how well the original variables can be predicted from the canonical variables, and 2) how much of the "variance" of the two sets is in common (i.e., how much of the "variance" in one set can be accounted for by the other set). If RT=Y, a redundancy table is printed for each of the 2 sets of variables. Each table provides two statistics for each canonical variable. The statistic given in the first column represents the proportion of the "variance" in the set of variables being analyzed that is accounted for by each canonical variable in that set. Since all canonical variables are independent, the total figure (last row) presents a measure of how well the canonical variables have been able to account for the "variance" in the original variables. It should be noted that, for the second set of variables, this total figure will always be 1 since there are as many new variables as original variables and the new variables are independent. The statistic in the second column gives the proportion of the "variance" in the set that is explained by each canonical variable in the other set. This statistic is usually referred to as the "redundancy of a set given the other", and measures the extent of overlap between the two sets with respect to that canonical variable pair. A small redundancy value indicates that practically no overlap across sets is present in that particular pair of canonical variables. The total figure (last row of the second column) provides an overall measure of the extent to which one set is a good predictor of the other (based on the canonical variable pairs constructed). The redundancy measure is important because a very large canonical coefficient could be the result of a very large zero-order correlation of just one variable of one set with just one variable of the other set, and the remainder of the variables in the two sets could be essentially uninvolved. In this case, the canonical correlation for that pair of canonical variables would be high but the redundancy for that pair of canonical variables would be low. The redundancy criteria would therefore be a better measure of the extent of overlap in the two "sets" than the measure provided by the canonical correlation associated with that canonical correlation pair.
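The two columns of a redundancy table can be sketched from the loadings and the canonical correlations. The Python sketch below assumes the commonly used definitions (the manual does not state StatPac's formulas): the first column is the mean of the squared loadings of a canonical variable on the variables of its own set, and the redundancy is that proportion multiplied by the squared canonical correlation. The function name and the numbers are hypothetical.

```python
def redundancy_table(loadings, canonical_r):
    """Build the two columns of a redundancy table for one set of variables.

    loadings[k][j]: correlation of canonical variable k with original
                    variable j of the same set (assumed input).
    canonical_r[k]: canonical correlation of pair k.
    Returns (rows, totals); each row is (variance proportion, redundancy).
    """
    rows = []
    for load, r in zip(loadings, canonical_r):
        proportion = sum(value * value for value in load) / len(load)
        rows.append((proportion, proportion * r * r))
    totals = (sum(p for p, _ in rows), sum(rd for _, rd in rows))
    return rows, totals

# Hypothetical case: the first canonical variable is dominated by a single
# original variable, so a high canonical correlation gives low redundancy.
loadings = [[0.95, 0.10, 0.05]]
canonical_r = [0.90]
rows, totals = redundancy_table(loadings, canonical_r)
print(rows)
```

In this made-up case the canonical correlation is .90, yet the canonical variable accounts for only about 30 percent of the variance in its own set and the redundancy is roughly .25, which is the situation the paragraph above warns about.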
Example of a Redundancy Table
Rao's F-Statistic Table
Various statistics, including Rao's F-statistic, are included in this table. It provides information to evaluate the extent of the correlation between the two sets of variables, the number of canonical variable pairs that are required to represent this correlation, and how successful the canonical variables have been in summarizing the interdependence of the two sets of variables. Until recently, Rao's F-statistic table was virtually the only tool a researcher had to evaluate the above criteria. More recently, however, researchers have come to realize that these statistics can be misleading in the absence of the redundancy analysis table. The statistics in this table, combined with the redundancy table statistics, provide a more powerful approach to studying the performance of the canonical correlation analysis. The canonical correlation given in the first row of this table (second column) is the correlation between the first pair of canonical variables (i.e., the correlation of the first new variable created from the first set of original variables with that of the first new variable created from the second set of variables). By construction, this is the maximum correlation that can be obtained between any linear combination of the first set of variables and any linear combinations of variables in the second set. The canonical correlation given in the second row of this table is the correlation between the second pair of canonical variables, and so forth. The maximum number of canonical variable pairs that can be constructed is equal to the number of variables in the second set.
Remember that, by construction, the i-th canonical variable pair is independent of all the canonical variable pairs before it, thus implying that the second canonical variable pair represents two new variables that are maximally correlated, subject to the condition that they are not picking out any of the correlation structure that was already picked up by a previous canonical variable pair. The third column (canonical r-squared) is simply the square of the canonical correlation and represents the amount of variance in one canonical variate that is accounted for by its canonical variable counterpart in the other set. By definition, the canonical r-squared given in the first row should be at least as large as the r-squared value obtained by regressing only one variable from a set (say the first) with all the variables from the other set (the second set). This can be verified with the r-squared table (option RS). Should any of the values in the r-squared table be close to the first canonical r-squared, it is doubtful that the other variables in that set are adding much to the correlation with the other set. The fourth column provides Rao's F-statistic to test how
many canonical variable pairs are needed to adequately represent the
correlation structure of the two sets of variables. Some software uses
The fifth and sixth columns give the degrees of freedom associated with this F-statistic, while the seventh column gives the significance level (probability value) associated with the F-statistic. The probability value in the first row provides a test of whether the canonical variables have accounted for a significant amount of the correlation structure between the two sets of variables (i.e., at least the first canonical correlation is significantly different from zero). The probability value given in the second row tests whether all the canonical correlations (except the first) are significantly different from zero. The probability value in the third row tests whether all canonical correlations (except the first two) are significantly different from zero, and so forth. StatPac provides output on all canonical variable pairs possible; it is up to the user to determine (based on magnitude of the canonical correlations as well as the significance level of the F-statistics) how many of these pairs are relevant. The last column provides Wilks' lambda as each of the canonical variables is added; this criterion is used in calculating the F-statistic.
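The lambda criterion can be illustrated with a short Python sketch. It assumes the conventional definition, in which the value for row k is the product of (1 - r squared) over canonical pair k and all later pairs; how StatPac arranges the printed column may differ, and the conversion of lambda to Rao's F-statistic is not shown. The function name and the two canonical correlations are hypothetical.

```python
def wilks_lambdas(canonical_r):
    """Wilks' lambda for each row of the table (conventional definition).

    Row k uses canonical correlation k and every later one, so it tests
    the correlation structure left after the first k-1 pairs are removed.
    """
    lambdas = []
    for k in range(len(canonical_r)):
        lam = 1.0
        for r in canonical_r[k:]:
            lam *= 1.0 - r * r
        lambdas.append(lam)
    return lambdas

# Hypothetical canonical correlations for two canonical variable pairs.
lams = wilks_lambdas([0.8, 0.5])
print([round(v, 4) for v in lams])
```

Values near zero indicate that the remaining canonical correlations are strong, while values near one indicate that little correlation structure is left to explain.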
Example of a Rao's F-Statistic Table
Mean Substitution
Mean substitution is one method often used to reduce the problem of missing information. Sometimes, canonical correlation analysis is difficult because if one variable is not known, the whole record must be excluded from the analysis. This could substantially reduce the number of records that are included in the analysis, especially when there are many variables in the analysis. Mean substitution overcomes this problem by replacing any missing variable with the mean of that variable.
Save Canonical Pair
Researchers often want to save the first two canonical variables for future analysis. You can save them with the options command SP=Y. At the conclusion of the analysis you will be given the opportunity to merge the new variables into the original data.
Labeling and Spacing Options
Perceptual mapping refers to a broad range of market research techniques used to study consumer perceptions of products ("brands") in a class based on the product attributes. To achieve this objective, an attempt is made to reduce the dimensionality of the product/attribute space; plots (perceptual maps) are then used to graphically display consumer perceptions of "brands" in a category. Marketing researchers use perceptual mapping to identify new opportunities and marketing strategies. One of its primary applications is to analyze the effectiveness of marketing campaigns designed to modify people's perceptions of a product.
The four most frequently used attribute-based statistical tools for perceptual mapping are discriminant analysis, principal components analysis, factor analysis and correspondence analysis. The MAP command uses the correspondence analysis approach to perceptual mapping.
Correspondence analysis is an exploratory statistical technique for examining multivariate categorical data (usually in the form of a banners table) and graphically displaying both the rows and columns of the table in the same multidimensional space. The object of correspondence analysis is to graphically summarize the correlations among row and column categories. In marketing applications, columns are usually brands, and rows are image attributes. However, since the row and column categories are treated equally, the axes of the banners table could be swapped and the results would be the same. Correspondence analysis produces perceptual maps showing how consumers perceive various brands in relation to a set of image attributes. Brands with similar image profiles appear closer together on the map.
Correspondence analysis is also known by the following names: dual scaling, method of reciprocal averages, optimal scaling, pick-any scaling, canonical analysis of contingency tables, categorical discriminant analysis, homogeneity analysis, and quantification of qualitative data.
Correspondence analysis starts with the usual banners (crosstabs) table. The table always contains one row variable and one column variable. These are referred to as the active variables. The analysis begins by calculating the usual (contingency table) chi-square statistic, or more specifically, the chi-square statistic divided by sample size (Pearson's mean-square contingency coefficient). This quantity is often referred to as total inertia. A large chi-square statistic implies association between rows and columns (i.e., lack of independence between rows and columns). The purpose of correspondence analysis is to plot the nature of this association in a reduced dimensionality. It is similar to principal components in that new components (or factors) are extracted, each independent of the others. If the first few components account for most of the "association in the table", we have been successful in reducing the dimensionality of the rows and columns.
The difference between principal components and correspondence analysis is the "variability" we are trying to explain. In principal components analysis, we are trying to explain the "correlation" or "covariance" structure of a set of variables, while in correspondence analysis we are trying to explain the "association" between rows and columns of a table of frequencies.
The decomposition of the chi-square proceeds by calculating two factors or vectors (one for the column levels and the other for the row levels). These factors are extracted such that the association between them is maximized (i.e., explains as much of the chi-square statistic as possible). Next, a second pair of factors is extracted. Their association is maximized to explain as much of the remaining chi-square statistic as possible (subject to the constraint that this second factor pair is uncorrelated with the first factor pair). The same philosophy applies to subsequent factor pairs created.
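The starting quantity described above, total inertia, can be reproduced by hand. The following Python sketch is an illustration only and is not part of StatPac: it computes the usual Pearson chi-square statistic for a small table of counts and divides it by the sample size. The function name and the 3 by 3 table of counts are hypothetical.

```python
def chi_square_and_n(table):
    """Return (chi-square statistic, sample size) for a table of counts."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    n = sum(row_totals)
    chi_square = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * column_totals[j] / n
            chi_square += (observed - expected) ** 2 / expected
    return chi_square, n


# Hypothetical banners table: three row categories by three column categories.
counts = [
    [20, 10, 5],
    [10, 20, 10],
    [5, 10, 20],
]

chi_square, sample_size = chi_square_and_n(counts)
total_inertia = chi_square / sample_size

print(f"Chi-square    = {chi_square:.3f}")
print(f"Sample size   = {sample_size}")
print(f"Total inertia = {total_inertia:.4f}")
```

A table whose rows are exact multiples of one another has a chi-square of zero, so it has no inertia and nothing to map.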
This concept of independence of each factor pair is similar to the calculation of principal components. The difference lies in the way the factor pairs (or weights assigned to the levels of each row and column) are created. In correspondence analysis, the weights for each factor pair are calculated with the intent of maximizing "remaining" association between row and column levels, while in principal components the weights are calculated with the intent of maximizing the remaining "correlation (or variance) structure" in the variables. The syntax of the command to run correspondence analysis is:
MAP <Active row variable> BY <Active column variable>
The only restriction on the data is that both variables must be categorical (alpha or numeric), and there must be at least three categories on each axis. However, since the row and column categories are treated equally, both can be either objects (brands) or attributes. A simple example might be to map the relationship between purchasing DECISION (V1) and INCOME (V2). The command might be expressed in many ways:
MAP INCOME BY DECISION
MAP DECISION BY INCOME
MAP V1 BY V2
MA V2 BY V1 (Note: MAP may be abbreviated as MA)
Passive Variables
StatPac can perform multiple correspondence analysis. This means that passive rows and columns (often referred to as supplemental rows and columns) can be added to the perceptual maps. These are not included in the actual analysis, but rather, are superimposed upon the perceptual map of the active variables. Passive rows and columns are requested by adding them to the MAP command. To include passive variables, the syntax for the MAP command becomes:
MAP <Active row variable> <Passive row variables> BY <Active column variable> <Passive column variables>
Passive rows and columns are often used as reference points for active points in the plot. For example, if the active columns were brands, an example of a passive column point could be a hypothetical brand or a brand from a previous similar study.

Stacking Variables
Correspondence analysis always uses one active row variable and one active column variable (each containing at least three categories). It is often desirable, however, to perform correspondence analysis on contingency tables that contain more than two dimensions. Stacking variables allows you to combine two or more variables into a single variable. The STACK command may be used to create a new variable that represents all possible combinations of two or more other variables. The following example would create a single variable called DEMOGRAPHICS, and it would be used as the active column variable. The number of categories in the new stacked variable will be the product of the number of categories in each of the individual variables.
STACK DEMOGRAPHICS = AGE SEX
MAP DEMOGRAPHICS BY BRAND
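The way stacked categories multiply can be illustrated outside StatPac with a short Python sketch. It is only an illustration of the combinatorics and does not show how the STACK command is implemented. The category labels and the function name are hypothetical.

```python
from itertools import product

# Hypothetical value labels; a real study would use its own labels.
age_categories = ["18-24", "25-34", "35-44", "45-54", "55+"]
sex_categories = ["Male", "Female"]


def stack_categories(*variables):
    """Return one combined label for every combination of the input categories."""
    return [" / ".join(combination) for combination in product(*variables)]


demographics = stack_categories(age_categories, sex_categories)

# The stacked variable has (number of AGE categories) x (number of SEX categories) levels.
print(len(demographics))
for label in demographics:
    print(label)
```

Adding a third variable with four categories to the same call would produce forty combined categories, which shows how quickly a stacked variable can grow.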
If AGE had five categories and SEX had two categories, the resultant column variable (DEMOGRAPHICS) would contain ten categories. This would cover all the possible combinations of categories. Stacking variables should be used carefully because of the potential for a huge number of rows or columns. The maximum number of rows or columns is 150.

Count/Percent Table
Correspondence analysis always begins with a banners table. The banners table itself is called the count/percent table. It may be printed with the CP=Y option or suppressed with the CP=N option. The options to control the appearance of the Count/Percent table are identical to the options for the Banner command. In the following example, the active row is Party Affiliation and the active column is Attitude on the New Government Proposal. Annual Income is a passive row, and Liberal/Conservative is a passive column.
Example of a Count/Percent Table
Design Summary for Abbreviated Labeling on Map
Because perceptual maps can easily become cluttered, it is sometimes necessary to abbreviate the value labels with just a number. The design summary will contain the "plot number" for each row and column variable (i.e., a numeric abbreviation that can be used when creating large perceptual maps). The AB=Y option may be used to abbreviate the map so it uses numbers instead of value labels. A design summary will then be printed that lists the value labels for all active and passive rows and columns. This feature is only important when there are many rows and columns in the banners table.
Example of a Design Summary
Correspondence Analysis Eigenvalue Summary
The correspondence analysis eigenvalue summary table is always included in the output. Each nontrivial eigenvalue represents a dimension (factor pair). The maximum number of eigenvalues for any contingency table is one less than the minimum of the number of rows and the number of columns in the contingency table. The sum of all eigenvalues equals the chi-square statistic of independence divided by sample size (total inertia). The eigenvectors of the nontrivial eigenvalues define the coordinates of the factor pairs. If the first few eigenvalues are large relative to the remaining eigenvalues, it is then possible to display the association between rows and columns in a one- or two-dimensional plot. The eigenvalue table allows us to determine how much "information" is lost by ignoring all dimensions except for the first (or first and second). Ideally, the first two dimensions account for nearly 100% of the total row and column association (i.e., they explain almost all of the association between the rows and columns).
Example of Correspondence Analysis Eigenvalue Summary Table
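The arithmetic behind an eigenvalue summary is simple enough to sketch in Python. The example below is illustrative only and does not reproduce StatPac's output layout: it assumes the eigenvalues are already known and shows the maximum number of dimensions for a table and the percentage of total inertia that each dimension explains. The function names and the eigenvalues are hypothetical.

```python
def max_dimensions(n_rows, n_columns):
    """Maximum number of nontrivial eigenvalues for an n_rows by n_columns table."""
    return min(n_rows, n_columns) - 1


def inertia_summary(eigenvalues):
    """Return (dimension, eigenvalue, percent, cumulative percent) for each dimension."""
    total_inertia = sum(eigenvalues)
    summary = []
    cumulative = 0.0
    for dimension, eigenvalue in enumerate(eigenvalues, start=1):
        percent = 100.0 * eigenvalue / total_inertia
        cumulative += percent
        summary.append((dimension, eigenvalue, percent, cumulative))
    return summary


# Hypothetical eigenvalues for a table with 4 rows and 5 columns.
eigenvalues = [0.150, 0.040, 0.010]
assert len(eigenvalues) <= max_dimensions(4, 5)

summary = inertia_summary(eigenvalues)
for dimension, eigenvalue, percent, cumulative in summary:
    print(f"Dimension {dimension}: eigenvalue {eigenvalue:.3f}, "
          f"{percent:.1f}% of inertia, {cumulative:.1f}% cumulative")
```

The cumulative percentage after the second dimension is the figure to check before relying on a two-dimensional map.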
Summary of Row and Column Points
The correspondence analysis summary of row and column points table presents useful insight into the nature of the association between rows and columns, and is printed for each analysis. This table provides information on the first two factor pairs (i.e., the first 2 dimensions) only.
Example of Summary of Row and Column Points Table
The first column (Coordinate) gives the factor scores for the first dimension. These are the coordinates that are used to plot the first dimension or axis. By definition, these coordinates are calculated to account for as much of the association in the contingency table as possible. Row and column levels with coordinates close to zero do not account for any of the association explained by the first dimension; they may however be important in the second dimension or axis.
The first correlation (Corr.) column gives the correlation of each row and column with the first dimension. A high value implies that the particular row (column) is important in describing the first dimension. The first contribution (Contr.) column gives the row and column contribution to the association "picked up" by the first dimension (expressed as a percentage). A row or column with a high marginal tends to have a larger contribution than a row or column with a low marginal.
It is important to note the difference between the concepts measured by the Corr. and Contr. columns. Corr. simply measures the correlation of the rows and columns with the first dimension. It is therefore useful in defining the first dimension. Contr., on the other hand, provides a measure of how useful a particular row or column is in explaining the contingency table association for the first dimension. A row or column may be highly correlated with the first dimension but explain very little of the association between rows and columns.
The Contr. column is useful in locating outliers. When a row or column has a very large absolute contribution and a large coordinate, it can be considered an outlier and has a major role in determining the first and/or second coordinates. Thought should be given to redefining this outlier point as a passive (supplementary) point and performing the analysis again without this point, thereby eliminating that point's influence in the creation of the first few axes.
The point can be superimposed on the axes calculated for the remaining points.
The second column of coordinates gives the factor scores for the second dimension. These are the coordinates that are used to plot the second dimension or axis. By definition, these coordinates are calculated to account for as much as possible of the association in the contingency table not accounted for by the first dimension. The second Corr. column gives the correlation of each row and column level with the second dimension. A high value implies that the particular row (column) is important in describing the second dimension. The second Contr. column is the row's (and column's) contribution to the association "picked up" by the second dimension (expressed as a percentage).
If a row or column has a low contribution on both the first and second dimensions, this can be due to one or more of the following reasons:
1. There is no association between that row (column) and the levels of columns (rows).
2. The third (and higher) dimensions account for a significant portion of the association in the contingency table and cannot be ignored.
3. The row or column marginal is small and therefore has only a small effect on the chi-square statistic.

Carroll-Green-Schaffer Scaling of Coordinates
The option CG=Y requests that the Carroll-Green-Schaffer scaling of coordinates be used. If CG=N, the usual "French school" correspondence analysis coordinates are calculated. These "French school" coordinates do not allow one to compare distances between column and row points but rather only distances between column points or between row points. Carroll, Green and Schaffer claim that their method of scaling coordinates allows one to compare both within-group and across-group (row/column) distances. If the first two eigenvalues are almost equal, the CGS and "French school" plots look almost the same, but if the first eigenvalue is considerably greater than the second, the plots can look very different.
Plots
The correspondence analysis plots show the relationship between a row and all columns or between a column and all rows. When using the "French school" technique (CG=N), distances between a row and a column point cannot be interpreted. When using the Carroll-Green-Schaffer coordinates, distances between a row and a column point can be examined. Four types of maps can be created with the PL option.
A. two-dimension row-column plot
B. one-dimension row-column plot
C. two-dimension row plot
D. two-dimension column plot
For example, the following option would create a two-dimension row-column plot, a two-dimension row plot, and a two-dimension column plot.
OPTIONS PL=ACD
The two-dimension row-column plot will show both row and column variables using the first two dimensions. If the first two eigenvalues (dimensions) account for a large portion of the total chi-square, then this plot provides a useful summary of the association between row and column points.
Example of a Two-Dimension Row-Column Map
In general, the more the row and column points are spread out, the higher the association and hence the higher the chi-square statistic. Columns and rows represented by points in the plot relatively far from the origin account for relatively large portions of the lack of independence in the contingency table. Rows positioned close together in the map have similar profiles across the columns. Analogously, columns positioned close together in the plot exhibit similar profiles across the rows.
The two-dimension row plot is especially useful when there are many row and column points to plot and/or one wishes to examine the distance between row points only. The two-dimension column plot is especially useful when there are many row and column points to plot and/or one wishes to examine the distance between column points only. The one-dimension row-column plot is especially useful when the first dimension explains most of the association in the table. In this case the value of the first eigenvalue would be close to Pearson's mean-square contingency coefficient.
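The idea that rows with similar profiles plot close together can be checked numerically. The Python sketch below is an illustration under stated assumptions and is not StatPac's algorithm: it uses the chi-square distance between row profiles, which is the distance that standard correspondence analysis maps approximate between row points. The function names and the table of counts are hypothetical.

```python
from math import sqrt


def row_profiles(table):
    """Divide each row of counts by its row total."""
    return [[cell / sum(row) for cell in row] for row in table]


def column_masses(table):
    """Column totals as proportions of the grand total."""
    n = sum(sum(row) for row in table)
    return [sum(column) / n for column in zip(*table)]


def chi_square_distance(profile_a, profile_b, masses):
    """Chi-square distance between two row profiles, weighted by column masses."""
    return sqrt(sum((a - b) ** 2 / m
                    for a, b, m in zip(profile_a, profile_b, masses)))


# Hypothetical counts: rows A and B have the same profile, row C differs.
counts = [
    [20, 10, 10],  # row A
    [40, 20, 20],  # row B (twice row A, so the profile is identical)
    [5, 10, 25],   # row C
]

profiles = row_profiles(counts)
masses = column_masses(counts)

distance_a_b = chi_square_distance(profiles[0], profiles[1], masses)
distance_a_c = chi_square_distance(profiles[0], profiles[2], masses)

print(f"Distance A to B = {distance_a_b:.4f}")
print(f"Distance A to C = {distance_a_c:.4f}")
```

Rows A and B differ only in size, not in shape, so their distance is zero and they would plot at the same point; row C has a different profile and would plot away from them.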
The following bibliography is recommended for more detailed information on the advanced analyses module of StatPac for Windows:
Regression Analysis
Draper, N.R. and H. Smith - Applied Regression Analysis, 2nd Ed.,
Henderson, H.V. and P.F. Velleman, (1981) - Building Multiple Regression Models Interactively. Biometrics, Vol. 37, pp. 391-411.
Jennrich, R.I. - Stepwise Regression in Statistical Methods for Digital Computers, edited by Kurt Enslein, Anthony Ralston, Herbert S. Wilf, (1977).
Lewis, C.D. - Industrial and Business Forecasting Methods.
Probit and Logistic Regression Analysis
Gunderson, M., (1974) - Retention of Trainees - A Study with Dichotomous Dependent Variables. Journal of Econometrics, Vol. 2, pp. 79-93.
Tobin, J. - The Application of Multivariate Probit Analysis to Economic Survey Data. Cowles Foundation Discussion Paper No. 1, December (1955).

Principal Components, Factor and Multicollinearity Analysis
Afifi, A.A. and Azen, S.P. - Statistical Analysis: A Computer Oriented Approach. New York, Academic Press, Inc. (1972).
Belsley, D.A., Kuh, E. and Welsch, R.E. - Regression Diagnostics: Identifying Influential Data and Sources of Collinearity.
Hocking, R.R. and Pendleton, O.J., (1983) - The Regression Dilemma. Communications in Statistics, Vol. A12, pp. 497-527.
Kim, Jae-On and C.W. Mueller - Factor Analysis: Statistical Methods and Practical Issues. California, Sage Publications, Inc. (1978).
Kramer, C.Y., (1978) - An Overview of Multivariate Analysis. Journal of Dairy Science, Vol. 61, pp. 848-854.
Veldman, D.J. - Fortran Programming for the Behavioral Sciences.
Cluster Analysis
Anderberg, M.R. - Cluster Analysis for Applications.
Milligan, G.W., (1980) - An Examination of the Effect of Six Types of Error Perturbation on Fifteen Clustering Algorithms. Psychometrika, Vol. 45, pp. 325-342.
Punj, G. and Stewart, D.W., (1983) - Cluster Analysis in Marketing Research: Review and Suggestions for Applications. Journal of Marketing Research, Vol. 20 (May), pp. 134-148.
Spath, H. - Cluster Analysis Algorithms.
Analysis of Variance
Cochran, W.G. and G.M. Cox - Experimental Designs, 2nd Edition, Wiley Books, (1957).
Finney, D.J. - Experimental Design and its Statistical Basis, The
Quenouille, M.H. - The Design and Analysis of Experiments, Charles Griffin and Co. Ltd., (1953).
Snedecor, G.W. and W.G. Cochran - Statistical Methods. Sixth Edition, The
Steel, R.G.D. and J.H. Torrie - Principles and Procedures of Statistics, McGraw-Hill Book Company, Inc., (1960).
Winer, B.J. - Statistical Principles in Experimental Design, McGraw-Hill Book Company, Inc., (1962).
Canonical Correlation Analysis
Cooley, W.W. and Lohnes, P.R. - Multivariate Data Analysis (Chapter 6).
Green, P.E., Halbert, M.H. and Robinson, P.J., (1966) - Canonical Analysis: An Exposition and Illustrative Application. Journal of Marketing Research, Vol. 3, pp. 32-39.
Correspondence Analysis
Carroll-Green-Schaffer, (1986) - Journal of Marketing Research, Vol. 23, pp. 271-280.
Dillon, William R., Frederick, Donald G., Tangpanichdee, Vanchai (1982) - A Note on Accounting for Sources of Variation in Perceptual Maps. Journal of Marketing Research, Vol. 19 (August), pp. 302-311.
Fox, Richard J., (1988) - Perceptual Mapping Using the Basic Structure Matrix Decomposition. Journal of the
Greenacre, M.J., - Theory and Applications of Correspondence Analysis.
Greenacre, M.J. (1989) - Journal of Marketing Research, Vol. 26, pp. 358-368.