StatPac for Windows User's Guide

Overview

System Requirements and Installation

System Requirements

Installation

Unregistering & Removing the Software from a PC

Network Operation

Updating to a More Recent Version

Backing-Up a Study

Processing Time

Server Demands and Security

Technical Support

Notice of Liability

Paper & Pencil and CATI Survey Process

Internet Survey Process

Basic File Types

Codebooks (.cod)

Data Manager Forms (.frm)

Data Files (.dat)

Internet Response Files (.asc or .txt)

Email Address Lists (.lst or .txt)

Email Logs (.log)

Rich Text Files (.rtf)

HTML Files (.htm)

Perl Script (.pl)

Password Files (.text)

Exported Data Files (.txt and .csv and .mdb)

Email Body Files (.txt or .htm)

Sample File Naming Scheme for a Survey

Customizing the Package

Problem Recognition and Definition

Creating the Research Design

Methods of Research

Sampling

Data Collection

Reporting the Results

Validity

Reliability

Systematic and Random Error

Formulating Hypotheses from Research Questions

Type I and Type II Errors

Types of Data

Significance

One-Tailed and Two-Tailed Tests

Procedure for Significance Testing

Bonferroni's Theorem

Central Tendency

Variability

Standard Error of the Mean

Inferences with Small Sample Sizes

Degrees of Freedom

Components of a Study Design

Elements of a Variable

Variable Format

Variable Name

Variable Label

Value Labels

Valid Codes

Skip Codes for Branching

Data Entry Control Parameters

Missing OK

Auto Advance

Caps Only

Codebook Tools

The Grid

Codebook Libraries

Duplicating Variables

Insert & Delete Variables

Move Variables

Starting Columns

Print a Codebook

Variable Detail Window

Codebook Creation Process

Method 1 - Create a Codebook from Scratch

Method 2 – Create a Codebook from a Word-Processed Document

Spell Check a Codebook

Multiple Response Variables

Missing Data

Changing Information in a Codebook

Overview

Data Input Fields

Form Naming Conventions

Form Creation Process

Using the Codebook to Create a Form

Using a Word-Processed Document to Create a Form

Variable Text Formatting

Field Placement

Value Labels

Variable Separation

Variable Label Indent

Value Labels Indent

Space between Columns

Valid Codes

Skip Codes

Variable Numbers

Variable List and Detail Windows

Data Input Settings

Select a Specific Variable

Finding Text in the Form

Replacing Text in the Form

Saving the Codebook or Workspace

Overview

Keyboard And Mouse Functions

Create A New Data File

Edit Or Add To An Existing Data File

Select A Different Data File

Change Fields

Change Records

Enter A New Data Record

View Data For A Specified Record Number

Find Records That Contain Specified Data

Duplicate A Field From The Previous Record

Delete A Record

Data Input Settings

Compact Data File

Double Entry Verification

Print A Data Record

Variable List & Detail Windows

Data File Format

Overview

HTML Email Surveys

Plain Text Email Surveys

Brackets

Item Numbering

Codebook Design for a Plain Text Email Survey

Capturing a Respondent's Email Address

Filtering Email to a Mailbox

General Considerations for Plain Text Email

Overview

Internet Survey Process

Server Setup

Create the HTML Survey Pages

Upload the Files to the Web server

Test the survey

Download and import the test data

Delete the test data from the server

Conduct the survey

Download and import the data

Display a survey closed message

Server Setup

FTP Login Information

Paths & Folder Information

Design Considerations for Internet Surveys

Special Variables for Internet Surveys

Script to Create the HTML

Command Syntax & Help

Saving and Loading Styles

Survey Generation Procedure

Script Editor

Imbedded HTML Tags

Primary Settings

HTML Name (HTMLName=)

Banner Image(s)  (BannerImage=)

Heading  (Heading=)

Finish Text & Finish URL (FinishText= and FinishURL=)

Cookie (Cookie=)

URL to Survey Folder  (WebFolderURL=)

Advanced Settings - Header & Footer

RepeatBannerImage

RepeatHeading

PageNumbers

ContinueButtonText

SubmitButtonText

ProgressBar

FootnoteText & FootnoteURL

Advanced Settings - Finish & Popups

Thanks

Closed

HelpWindowWidth & HelpWindowHeight

HelpLinkText

LinkText

PopupBannerImage

PopupFullScreen

Advanced Settings - Control

Method

Email

RestartSeconds

MaximizeWindow

BreakFrame

AutoAdvance

BranchDelay

Cache

Index

ForceLoaderSubmit

ExtraTallBlankLine

RadioTextPosition

TextBoxTextPosition

LargeTextBoxPosition

LargeTextBoxProgressBar

Advanced Settings - Fonts & Colors

Global Attributes

Heading, Title, Text, & Footnote Attributes

Instructions, Question, and Response Attributes

Advanced Settings - Passwords - Color & Banner Image

LoginBannerImage

LoginBGColor

LoginWallpaper

LoginWindowColor

Advanced Settings - Passwords - Text & Control

PasswordType

LoginText

PasswordText

LoginButtonText

FailText

FailButtonText

ShowLink

EmailMe

KeepLog

Advanced Settings - Passwords - Single vs. Multiple

Password (single password method)

PasswordFile (multiple passwords method)

PasswordField & ID Field (multiple passwords method)

Advanced Settings - Passwords - Technical Notes

Advanced Settings - Server Overrides

ActionTag

StorageFolder

ScriptFolder

Perl

MailProgram

Branching and Piping

Randomization (Rotations)

Survey Creation Script - Overview

Using Commands More than Once in a Script

Survey Creation - Specify Text

Heading

Title

Text

FootnoteText

Instructions

Question

Survey Creation - Spacing and pagination

BlankLine

NewPage

Survey Creation - Images and Links

Image

Link

Survey Creation - Help Windows

Survey Creation - Popup Windows

Survey Creation - Objects

Radio Buttons for a Single Variable

Radio Buttons for Grouped Variables (matrix style)

DropDown Menu

TextBox for a Single Variable

TextBoxes for Grouped Variables

CheckBox for Multiple Response Variables

ListBox

Uploading and Downloading Files from the Server

Auto Transfer

FTP

Summary of the Most Common Script Commands

Overview

Format of an Email Address File

Extract Email Addresses

List Statistics

Join Two or More Lists

Split a List

Clean, Sort, and Eliminate Duplicates

Add ID Numbers to a List

Create a List of Nonresponders

Subtract One List From Another List

Merge an Email List into a StatPac Data File

Send Email Invitations

Using an ID Number to Track Responses

Email Address File

Body Text File

Sending Email

Overview

Mouse and Keyboard Functions

Designing Analyses

Continuation Lines

Comment Lines

V Numbers

Keywords

Analyses

Variable List

Variable Detail

Find Text

Replace Text

Options

Load, Save, and Merge Procedure Files

Print a Procedure File

Run a Procedure File

Results Editor

Graphics

Table of Contents

Keyword Index

Keywords Overview

Categories of Keywords

Keyword Help

Ordering Keywords

Global and Temporary Keywords

Permanently Change a Codebook and Data File

Backup a Study

STUDY Command

DATA Command

SAVE Command

WRITE Command

MERGE Command

HEADING Command

TITLE Command

FOOTNOTE Command

LABELS Command

OPTIONS Command

SELECT and REJECT Commands

NEW Command

LET Command

STACK Command

RECODE Command

COMPUTE Command

AVERAGE, COUNT and SUM Commands

IF-THEN … ELSE Command

SORT Command

WEIGHT Command

NORMALIZE Command

LAG Command

DIFFERENCE Command

DUMMY Command

RUN Command

REM Command

Reserved Words

Reserved Word RECORD

Reserved Word TOTAL

Reserved Word MEAN

Reserved Word TIME

Analyses Index

Analyses Overview

LIST Command

FREQUENCIES Command

CROSSTABS Command

BANNERS Command

DESCRIPTIVE Command

BREAKDOWN Command

TTEST Command

CORRELATE Command

Advanced Analyses Index

REGRESS Command

STEPWISE Command

LOGIT and PROBIT Commands

PCA Command

FACTOR Command

CLUSTER Command

DISCRIMINANT Command

ANOVA Command

CANONICAL Command

MAP Command

Advanced Analyses Bibliography

Utility Programs

Import and Export

StatPac and Prior Versions of StatPac Gold

Access and Excel

Comma Delimited and Tab Delimited Files

Files Containing Multiple Data Records per Case

Internet Files

Email Surveys

Merging Data Files

Concatenate Data Files

Merge Variables and Data

Aggregate

Codebook

Quick Codebook Creation

Check Codebook and Data

Sampling

Random Number Table

Random Digit Dialing Table

Select Random Records from Data File

Compare Data Files

Conversions

Date Conversions

Currency Conversion

Statistics Calculator Menu

Distributions Menu

Normal distribution

T distribution

F distribution

Chi-square distribution

Counts Menu

Chi-square test

Fisher's Exact Test

Binomial Test

Poisson Distribution Events Test

Percents Menu

Choosing the Proper Test

One Sample t-Test between Percents

Two Sample t-Test between Percents

Confidence Intervals around a Percent

Means Menu

Mean and Standard Deviation of a Sample

Matched Pairs t-Test between Means

Independent Groups t-Test between Means

Confidence Interval around a Mean

Compare a Sample Mean to a Population Mean

Compare Two Standard Deviations

Compare Three or more Means

Correlation Menu

Sampling Menu

Sample Size for Percents

Sample Size for Means

Advanced Multivariate Statistics

Advanced Analyses Index

The Advanced Analyses module adds multivariate procedures that are not available in the basic StatPac for Windows package. These commands may be used in a procedure when the Advanced Analyses module has been installed.

 

ANOVA

Canonical

Cluster

Discriminant

Factor

Logit

Map

PCA

Probit

Regress

Stepwise

 

 

REGRESS Command

The REGRESS command may be used to perform ordinary least squares regression and curve fitting. Simple regression, which fits a line by ordinary least squares, is used to examine the relationship between one independent and one dependent variable. After performing an analysis, the regression statistics can be used to predict the dependent variable when the independent variable is known.

People use regression on an intuitive level every day. In business, a well-dressed man is thought to be financially successful. A mother knows that more sugar in her children's diet results in higher energy levels. The ease of waking up in the morning often depends on how late you went to bed the night before.

Quantitative regression adds precision by developing a mathematical formula that can be used for predictive purposes.

The syntax of the command to run a simple regression analysis is:

 

REGRESS  <Dependent variable> <Independent variable>

 - or -

REGRESS  <Dependent variable list> WITH <Independent variable list>

 

For example, a medical researcher might want to use body weight (V1=WEIGHT) to predict the most appropriate dose for a new drug (V2=DOSE). The command to run the regression could be specified in any of the following ways:

 

REGRESS DOSE WITH WEIGHT

REGRESS DOSE WEIGHT

RE V2 WITH V1     (Note: REGRESS may be abbreviated as RE)

RE V2 V1

 

Notice that the keyword WITH is an optional part of the syntax. However, if you specify a variable list for either the dependent or independent variable, the use of the WITH keyword is mandatory. When a variable list is specified, a separate regression will be performed for each combination of dependent and independent variables.

The purpose of running the regression is to find a formula that fits the relationship between the two variables. Then you can use that formula to predict values for the dependent variable when only the independent variable is known.

The general formula for the linear regression equation is:

 

                y = a + bx

 

where:

                 x is the independent variable

                 y is the dependent variable

                a is the intercept

                b is the slope
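The intercept and slope can be computed directly from the data with the standard least squares formulas. The following is a minimal Python sketch of that calculation, not StatPac's own code; the function name fit_line and the weight and dose values are hypothetical and are used only for illustration.

```python
def fit_line(x, y):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares about the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx              # slope
    a = mean_y - b * mean_x    # intercept
    return a, b


# Hypothetical body weights (x) and doses (y); not taken from the manual.
weight = [50.0, 60.0, 70.0, 80.0, 90.0]
dose = [105.0, 125.0, 145.0, 165.0, 185.0]

a, b = fit_line(weight, dose)
predicted_dose = a + b * 75.0  # predict the dose for a weight of 75
```

Once a and b are known, any new value of the independent variable can be substituted into y = a + bx to obtain a prediction.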

 

Curve Fitting

Frequently, the relationship between the independent and dependent variable is not linear. Classical examples include the traditional sales curve, learning curve and population growth curve. In each case, linear (straight-line) regression would present a distorted picture of the actual relationship.

Several classical non-linear curves are built into StatPac and you can simply ask the program to find the best one. Transformations are used to find an equation that will make the relationship between the variables linear. A linear least squares regression can then be performed on the transformed data.

The process of finding the best transformation is known as "curve fitting". Basically, it is an attempt to find a transformation that can be made to the dependent and/or independent variable so that a least squares regression will fit the transformed data. This can be expressed by the equation:

 

(transformed y) = intercept + slope * (transformed x)

 

Notice the similarity to the least squares equation. The difference is that we are transforming the independent variable and predicting a transformed dependent variable. To solve for y, use the formula to untransform y and apply it to both sides of the equation.

 

y = Untransformation of (intercept + slope * (transformed x))

 

In addition to the built-in transformations, any non-linear relationship that can be expressed as a mathematical formula can be explored with the COMPUTE statement. It is possible to transform both the dependent and independent variables with the COMPUTE statement.

The transformations that are built into StatPac are known as Box-Cox transformations. They are:

 

Option               Transformation
TX=A  or  TY=A       Automatic
TX=B  or  TY=B       Reciprocal
TX=C  or  TY=C       Reciprocal root
TX=D  or  TY=D       Reciprocal fourth root
TX=E  or  TY=E       Log
TX=F  or  TY=F       Fourth root
TX=G  or  TY=G       Square root
TX=H  or  TY=H       No transformation
TX=I  or  TY=I       Robust technique (no transformation)

 

For example, to apply a square root transformation to the independent variable and no transformation to the dependent variable, the options statement would be:

 

OPTIONS TX=G TY=H

 

The following option statement would try to fit your data to a classical S-Curve. It says to apply a reciprocal transformation to the independent variable and a log transformation to the dependent variable:

 

OPTIONS TX=B TY=E

 

The program also contains an automatic feature to search for the best transformation. When TX or TY is set to automatic, the program will select the transformation that produces the highest r-squared. To get a complete table of all combinations of the transformations, set both TX and TY to automatic.

 

OPTIONS TX=A TY=A

 

The result will produce a table of all possible combinations of transformations and the R-Squared statistics:

 

Example of a Transformation Table
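The automatic search can be pictured as a loop over every pairing of transformations that keeps the pairing with the highest r-squared. The Python sketch below is an illustration under stated assumptions, not StatPac's implementation: it assumes all data values are positive (so the constant k is zero), it omits the robust technique, and the names TRANSFORMS, r_squared and best_transformations are hypothetical.

```python
import math

# Option letter -> transformation, assuming positive data (k = 0).
TRANSFORMS = {
    "B": lambda v: 1.0 / v,           # reciprocal
    "C": lambda v: 1.0 / v ** 0.5,    # reciprocal root
    "D": lambda v: 1.0 / v ** 0.25,   # reciprocal fourth root
    "E": math.log,                    # log
    "F": lambda v: v ** 0.25,         # fourth root
    "G": lambda v: v ** 0.5,          # square root
    "H": lambda v: v,                 # no transformation
}


def r_squared(x, y):
    """Squared Pearson correlation between two lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def best_transformations(x, y):
    """Return the best (TX, TY) pair and the full table of r-squared values."""
    table = {}
    for tx, fx in TRANSFORMS.items():
        for ty, fy in TRANSFORMS.items():
            table[(tx, ty)] = r_squared([fx(v) for v in x], [fy(v) for v in y])
    best = max(table, key=table.get)
    return best, table


# Hypothetical S-curve data generated from y = exp(1 + 2/x).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [math.exp(1.0 + 2.0 / v) for v in x]
best, table = best_transformations(x, y)
```

For this made-up S-curve data the search settles on a reciprocal transformation of x and a log transformation of y, which matches the OPTIONS TX=B TY=E example.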

 

The Box-Cox transformations can be expressed mathematically:

 

 

                          Transformation        Untransformation
Reciprocal                z = 1/(y+k)           y = (z)^-1 - k
Reciprocal root           z = 1/(y+k)^.5        y = (z)^-2 - k
Reciprocal fourth root    z = 1/(y+k)^.25       y = (z)^-4 - k
Log                       z = Log(y+k)          y = Exp(z) - k
Fourth root               z = (y+k)^.25         y = (z)^4 - k
Square root               z = (y+k)^.5          y = (z)^2 - k

Notes:

        z    is the transformed data value

        y    is the original data value

        k    is a constant used in the transformation
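Each transformation and its untransformation form an inverse pair, which is what allows a prediction made on the transformed scale to be converted back to the original units. The Python sketch below encodes the table as pairs of functions; the names BOX_COX, round_trip and predict are hypothetical, and the value of k is supplied by the caller because the manual does not state how StatPac chooses it.

```python
import math

# Option letter -> (transformation, untransformation), as in the table above.
BOX_COX = {
    "B": (lambda y, k: 1.0 / (y + k),         lambda z, k: z ** -1 - k),
    "C": (lambda y, k: 1.0 / (y + k) ** 0.5,  lambda z, k: z ** -2 - k),
    "D": (lambda y, k: 1.0 / (y + k) ** 0.25, lambda z, k: z ** -4 - k),
    "E": (lambda y, k: math.log(y + k),       lambda z, k: math.exp(z) - k),
    "F": (lambda y, k: (y + k) ** 0.25,       lambda z, k: z ** 4 - k),
    "G": (lambda y, k: (y + k) ** 0.5,        lambda z, k: z ** 2 - k),
}


def round_trip(code, y, k=0.0):
    """Transform y and then untransform it; the result should equal y."""
    transform, untransform = BOX_COX[code]
    return untransform(transform(y, k), k)


def predict(code_x, code_y, intercept, slope, x, kx=0.0, ky=0.0):
    """y = Untransformation of (intercept + slope * (transformed x))."""
    transform_x, _ = BOX_COX[code_x]
    _, untransform_y = BOX_COX[code_y]
    return untransform_y(intercept + slope * transform_x(x, kx), ky)


# S-curve (TX=B, TY=E) with a hypothetical intercept of 1 and slope of 2.
s_curve_value = predict("B", "E", 1.0, 2.0, 2.0)
```

With TX=B and TY=E the prediction reduces to y = Exp(intercept + slope/x), the classical S-curve form.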

 

Two cautions should be noted when using transformations.

1. When a reciprocal transformation is used, the sign of the correlation coefficient may no longer indicate the direction of the relationship in the untransformed data.

2. Some transformations may not be possible for some data. For example, it is not possible to take the log or square root of a negative number or the reciprocal of zero. When necessary, StatPac will automatically add a constant to the data to prevent this type of error.

One problem with least squares regression is its susceptibility to extreme or unusual data values. In many cases, even a single extreme data value can distort the regression results. A technique called robust regression is included in StatPac to overcome this problem. Robust regression mathematically adjusts extreme data values through an iterative process. The effect is to reduce the distortion in the regression line caused by the outlying data value(s).

The robust process makes successive adjustments to extreme data values by examining the median residual and using a weighted least squares regression to adjust the outliers.

If robust regression is specified for either the x or y transformation, no other built-in transformations will be used (even if a transformation is specified for the other variable).

Statistics

The regression statistics provide the information needed to adequately evaluate the "goodness-of-fit"; that is, how well the regression line explains the actual data. The statistics include correlation information, descriptive statistics, error measures and regression coefficients. This data can be used to predict future values for the dependent variable and to develop confidence intervals for the predictions.

 

Example of a Statistics Printout

 

Correlation is a measure of association between two variables. StatPac calculates the Pearson product-moment correlation coefficient. Its value may vary from minus one to plus one. A minus one indicates a perfect negative correlation, while a plus one indicates a perfect positive correlation. A correlation of zero means there is no linear relationship between the two variables. When a transformation has been specified, the correlation coefficient refers to the relationship of the transformed data.

The coefficient of determination (r-squared) is the square of the correlation coefficient. Its value may vary from zero to one. It has the advantage over the correlation coefficient in that it may be interpreted directly as the proportion of variance in the dependent variable that can be accounted for by the regression equation. For example, an r-squared value of .49 means that 49% of the variance in the dependent variable can be explained by the regression equation. The other 51% is unexplained error.

The standard error of estimate for regression measures the amount of variability in the points around the regression line. It is the standard deviation of the data points as they are distributed around the regression line. The standard error of the estimate can be used to specify the limits and confidence of any prediction made and is useful to obtain confidence intervals for y' given a fixed value of x.

Regression analysis enables us to predict one variable if the other is known. The regression line (known as the "least squares line") is a plot of the expected value of the dependent variable for all values of the independent variable.

The difference between the observed and expected value is called the residual. It can be used to calculate various measures of error. The measures of error in StatPac are the mean percent error (MPE), mean absolute percent error (MAPE) and the mean squared error (MSE).
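The three error measures can be sketched from the residuals as follows. This Python example uses common textbook definitions (percent error taken relative to the actual value, and averages taken over all records); the manual does not give StatPac's exact formulas, so these definitions and the name error_measures are assumptions.

```python
def error_measures(actual, predicted):
    """Return MPE, MAPE and MSE using common textbook definitions."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    n = len(residuals)
    # Percent error of each record relative to its actual value.
    percent_errors = [100.0 * r / a for r, a in zip(residuals, actual)]
    return {
        "MPE": sum(percent_errors) / n,
        "MAPE": sum(abs(p) for p in percent_errors) / n,
        "MSE": sum(r * r for r in residuals) / n,
    }


# Hypothetical actual and predicted values.
measures = error_measures([100.0, 200.0], [110.0, 180.0])
```

In this made-up example the positive and negative percent errors cancel in the MPE but not in the MAPE, which is why the two measures are usually read together.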

Using the regression equation, the variable on the y axis may be predicted from the score on the x axis. The slope of the regression line (b) is defined as the rise divided by the run. The y intercept (a) is the point where the regression line crosses the y axis. The slope and y intercept are incorporated into the regression equation as:  y = a + bx

The significance of the slope of the regression line is determined from the Student's t statistic. The associated probability is the chance of observing a correlation coefficient at least this large if the true correlation is zero. StatPac uses a two-tailed test to derive this probability from the t distribution. Probabilities of .05 or less are generally considered significant, implying that there is a relationship between the two variables. Although StatPac does not calculate the F statistic, it is simply the square of the t statistic for the slope.
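For simple regression, the t statistic for the slope can be obtained from the correlation coefficient and the number of records, with n - 2 degrees of freedom. The Python sketch below uses that standard identity and then squares the result to get the F statistic; the function name and the example values of r and n are hypothetical.

```python
import math


def slope_t_statistic(r, n):
    """t statistic for the slope in simple regression (n - 2 degrees of freedom)."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


# Hypothetical correlation of .6 based on 18 records.
t_value = slope_t_statistic(0.6, 18)
f_value = t_value ** 2  # the F statistic is the square of t
```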

Data Table

The data table provides a detailed method to examine the errors between the predicted and actual values of the dependent variable. StatPac allows printing of the table to more closely study the residuals. Typing the option DT=Y will cause the output to include a data table.

 

Example of a Data Table

 

Outlier Definition and Adjustment

Outliers (extreme data points) can have dramatic effects on the slope of the regression line. One method to deal with outliers is to use robust regression (TX=I or TY=I). There are two other common methods to deal with outliers. The first is to simply eliminate any records that contain an outlier and then rerun the regression without those data records. The other method is known as data trimming, where the highest and lowest extreme values are replaced with a value that limits the standardized residual to a predetermined value. The OA option is used to set the outlier adjustment method. It may be set to OA=N (none), OA=D (delete), or OA=A (adjust).

Both methods use a two-step process. First, the regression is performed using the actual values for the dependent variable, and a standardized residual is calculated for each record. When the absolute value of a standardized residual exceeds a given z-value, the record is flagged. Then the regression is run again, and the flagged records are either eliminated (OA=D) or the value of the dependent variable is adjusted to the value defined by the outlier definition z-value (OA=A). For example, if the outlier definition is set to 1.96 standard deviations (OD=1.96), records whose standardized residuals fall in roughly the upper and lower two and a half percent of the distribution would be flagged. The dependent variable for each flagged record would then be modified to a value that produces a standardized residual of plus or minus 1.96. Finally, the regression would be rerun using the modified dependent variable values for the flagged records. Flagged data records will be shown with an asterisk in the data table.
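The two-step process can be sketched in Python as shown below. This is an illustration only: it assumes a standardized residual is the residual divided by the standard error of estimate, which the manual does not spell out, and the names fit_line and adjust_outliers and the data are hypothetical.

```python
import math


def fit_line(x, y):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    return mean_y - b * mean_x, b


def adjust_outliers(x, y, od=1.96, oa="A"):
    """One pass of outlier deletion (oa='D') or adjustment (oa='A')."""
    # Step 1: fit with the actual values and flag extreme residuals.
    a, b = fit_line(x, y)
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    see = math.sqrt(sum(r * r for r in residuals) / (len(x) - 2))
    flagged = [abs(r / see) > od for r in residuals]
    # Step 2: delete or trim the flagged records, then refit once.
    if oa == "D":
        x2 = [xi for xi, f in zip(x, flagged) if not f]
        y2 = [yi for yi, f in zip(y, flagged) if not f]
    else:
        x2 = list(x)
        y2 = [
            (a + b * xi) + math.copysign(od * see, r) if f else yi
            for xi, yi, r, f in zip(x, y, residuals, flagged)
        ]
    return fit_line(x2, y2), flagged


# Hypothetical data on the line y = 2x with one extreme value at x = 6.
x = [float(v) for v in range(1, 12)]
y = [2.0 * v for v in x]
y[5] += 30.0

(deleted_a, deleted_b), flags = adjust_outliers(x, y, oa="D")
(trimmed_a, trimmed_b), _ = adjust_outliers(x, y, oa="A")
```

The refit happens exactly once in this sketch, mirroring the single-pass behavior described in the text.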

It is important to note that the outlier adjustment process is only performed once because each regression would produce a new set of standardized residuals that would exceed the outlier definition value (OD=z). That is, any set of data with sufficient sample size will yield a set of outliers, even if the data has already been adjusted. Allowing the outlier adjustment process to repeat indefinitely would eventually result in an adjustment to nearly every data record.

When outlier adjustment is used, the program will also report an adjusted mean and standard deviation for the dependent variable. These are the values recalculated after deleting or adjusting the flagged records.

Confidence Intervals & Confidence Level

Confidence intervals provide an estimate of variability around the regression line. Narrow confidence intervals indicate less variability around the regression line. The option CI=Y will include the confidence intervals in the data table.

Prediction intervals, rather than confidence intervals, should be used if you intend to use the regression information to predict new values for the dependent variable. Both the confidence intervals and the prediction intervals are centered on the regression line, but the prediction intervals are much wider. The option CI=P will print the prediction intervals in the data table.

The actual confidence or prediction interval is set with the CL option. The CL option specifies the percentage level of the interval. For example, if CI=P and CL=95, the 95% prediction intervals would be printed in the data table.
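The difference in width between the two kinds of interval comes from one extra term under the square root. The Python sketch below uses the standard textbook formulas for simple regression and should be read as an assumption about the calculation rather than StatPac's code. The critical t value is supplied by the caller (3.182 is the two-tailed 95% value for 3 degrees of freedom), and the function name and data are hypothetical.

```python
import math


def interval_half_widths(x, y, x0, t_crit):
    """Return (confidence, prediction) interval half-widths at x0."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    # Standard error of estimate with n - 2 degrees of freedom.
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    see = math.sqrt(sse / (n - 2))
    spread = 1.0 / n + (x0 - mean_x) ** 2 / sxx
    confidence = t_crit * see * math.sqrt(spread)
    prediction = t_crit * see * math.sqrt(1.0 + spread)  # extra 1 for a new value
    return confidence, prediction


# Hypothetical data with five records (3 degrees of freedom).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
ci_mid, pi_mid = interval_half_widths(x, y, 3.0, 3.182)
ci_end, pi_end = interval_half_widths(x, y, 5.0, 3.182)
```

Both intervals are narrowest at the mean of the independent variable and widen as the value moves away from it.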

Residual Autocorrelation Function Table

Examining the autocorrelation of the residuals is often used in time-series analysis to evaluate how well the regression worked. It is a way of looking at the "goodness-of-fit" of the regression line. If the residuals contain a pattern, the regression did not do as well as we might have desired.

A residual autocorrelation table shows the correlation between residuals that occur at various time lags. For example, at time lag one, you are looking at the correlation between adjacent residuals; at time lag two, you are looking at the correlation between every other residual, etc. To select the residual autocorrelation function table, type the option AC=Y.
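The autocorrelation at each lag can be sketched with the usual sample autocorrelation formula. The Python example below is an illustration that assumes StatPac uses this standard estimator; the function name and the residual values are hypothetical.

```python
def autocorrelation(residuals, max_lag):
    """Sample autocorrelation of the residuals for lags 1..max_lag."""
    n = len(residuals)
    mean = sum(residuals) / n
    deviations = [e - mean for e in residuals]
    total = sum(d * d for d in deviations)
    return [
        sum(deviations[t] * deviations[t + lag] for t in range(n - lag)) / total
        for lag in range(1, max_lag + 1)
    ]


# Hypothetical residuals that alternate in sign, a clear pattern.
acf = autocorrelation([1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0], 2)
```

Residuals that alternate in sign give a strongly negative value at lag one and a strongly positive value at lag two, which signals a pattern the regression did not capture.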

 

Example of a Residual Autocorrelation Function Table

 

Expanding Standard Error

You may use the EX option to set the standard error limits of the residual autocorrelation function to a fixed value or to expand with increasing time lags.

Bartlett studied the sampling distributions of autocorrelated time series in 1946 and found that the standard error of the autocorrelation increases with successive time lags. ("On the Theoretical Specification and Sampling Properties of Autocorrelated Time-Series", Bartlett, 1946.) It is only in recent years that his findings have been accepted by the forecasting community.

When EX=Y, the residual autocorrelation function error limits will widen with each successive time lag. If EX=N, the standard error limits will remain constant.
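The effect of the EX option can be sketched with Bartlett's approximation, in which the standard error at each lag grows with the squared autocorrelations of the earlier lags. The Python example below assumes this textbook form and a constant limit of one over the square root of n when EX=N; the manual does not give StatPac's exact formulas, and the function name is hypothetical.

```python
import math


def acf_standard_errors(acf, n, expanding=True):
    """Standard error for each lag of an autocorrelation function."""
    errors = []
    cumulative = 0.0  # sum of squared autocorrelations at earlier lags
    for r in acf:
        if expanding:
            errors.append(math.sqrt((1.0 + 2.0 * cumulative) / n))
            cumulative += r * r
        else:
            errors.append(1.0 / math.sqrt(n))
    return errors


# Hypothetical autocorrelations at lags one to three from 100 residuals.
expanding_limits = acf_standard_errors([0.5, 0.5, 0.2], 100, expanding=True)
constant_limits = acf_standard_errors([0.5, 0.5, 0.2], 100, expanding=False)
```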

Force Constant to Zero

The option CZ=Y can be used to calculate a regression equation with the constant equal to zero. If this is done, the regression line is forced through the origin. Note that forcing the constant to zero disables calculation of the correlation coefficient, r-squared, and the standard error of estimate. For this reason, it is not possible to set the transformation parameter for either the independent or dependent variable to automatic, because there are no r-squared statistics to compare. Furthermore, confidence intervals, which are calculated from the standard error of estimate, cannot be computed. The option CZ=N results in a standard regression equation.
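When the constant is forced to zero, the least squares slope has a simple closed form. The Python sketch below shows that standard formula; the function name and data are hypothetical, and the example is not taken from StatPac itself.

```python
def fit_through_origin(x, y):
    """Least squares slope for the line y = b*x (constant fixed at zero)."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


# Hypothetical data lying exactly on y = 2x.
slope = fit_through_origin([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```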

Save Results

Many times researchers want to save results for future study. By using the option SR=Y, the predicted values, residuals and confidence or prediction intervals can be saved so they can be merged with the original data file. At the completion of the analysis, you will be given the opportunity to merge the predictions and residuals.

Predict Interactively

When performing a regression, you may want to predict the dependent variable for specific values of the independent variable. This is known as interactive prediction. Select interactive prediction by entering the option PR=Y. After the completion of the tabular outputs, you will be prompted to enter a value for the independent variable. The program will predict the value for the dependent variable based on the regression equation. Confidence and/or prediction intervals will also be given.

Labeling and Spacing Options

 

Option            Code    Function
Labeling          LB      Sets the labeling for descriptive statistics to print the variable label (LB=E), the variable name (LB=N), or the variable number (LB=C).
Decimal Places    DP      Sets the number of decimal digits that will be shown.

 

STEPWISE Command

Multiple regression is an extension of simple regression. It examines the relationship between a dependent variable and two or more explanatory variables (also called independent or predictor variables). Multiple regression is used to:

1. Predict the value of a dependent variable using some or all of the independent variables. The aim is generally to explain the dependent variable accurately with as few independent variables as possible.

2. Examine the influence and relative importance of each independent variable on the dependent variable. This involves looking at the magnitude and sign of the standardized regression coefficients as well as the significance of the individual regression coefficients.

The syntax of the command to run a stepwise regression is:

 

STEPWISE <Dependent variable> <Independent variable list>

 

For example, we might try to predict annual income (V1=INCOME) from age (V2=AGE), number of years of school (V3=SCHOOL), and IQ score (V4=IQ). The command to run the regression could be specified in several different ways:

 

STEPWISE INCOME, AGE, SCHOOL, IQ

STEPWISE V1,V2,V3,V4

ST INCOME V2-V4    (Note: STEPWISE may be abbreviated as ST)

STEPWISE V1-V4

 

In each example, the dependent variable was specified first, followed by the independent variable list. The variable list itself may contain up to 200 independent variables and can consist of variable names and/or variable numbers. Either a comma or a space can be used to separate the variables from each other.

The multiple regression equation is similar to the simple regression equation. The only difference is that there are several predictor variables and each one has its own regression coefficient.

The multiple regression equation is:

 

Y' = a + b1 x1 + b2 x2 + b3 x3 + ... + bn xn

 

where

 

Y'     is the predicted value

a     is a constant

b1   is the estimated regression coefficient for variable 1

x1   is the score for variable 1

b2   is the estimated regression coefficient for variable 2

x2   is the score for variable 2

 

Descriptive Statistics

The means and standard deviations for all the variables in the equation can be printed with the DS=Y option.

 

Example of a Descriptive Statistics Printout

 

Regression Statistics

The regression statistics can be selected with option RS=Y. They give us an overall picture of how successful the regression was.

The coefficient of multiple determination, frequently referred to as r-squared, can be interpreted directly as the proportion of variance in the dependent variable that can be accounted for by the combination of predictor variables. A coefficient of multiple determination of .85 means that 85 percent of the variance in the dependent variable can be explained by the combined effects of the independent variables; the remaining 15 percent would be unexplained.

The coefficient of multiple correlation is the square root of the coefficient of multiple determination. Its interpretation is similar to the simple correlation coefficient. It is basically a measure of association between the predicted value and the actual value.

The standard error of the multiple estimate provides an estimate of the standard deviation of the actual values around the predicted values. It is used in conjunction with the inverted matrix to calculate confidence intervals and statistical tests of significance.

When there are fewer than 100 records, StatPac will apply an adjustment to the above three statistics, and the adjusted values will be printed. The adjustment corrects for the small sample size, and the adjusted values should be used.

The variability of the dependent variable is made up of variation produced by the joint effects of the independent variables and some unexplained variance. The overall F-test is performed to determine the probability that the true coefficient of multiple determination is zero. Typically, a probability of .05 or less leads us to reject the hypothesis that the regression equation does not improve our ability to predict the dependent variable.
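Both the small-sample adjustment and the overall F-test can be written in terms of the coefficient of multiple determination, the number of records (n) and the number of independent variables (k). The Python sketch below uses the common textbook formulas. The manual does not state the exact adjustment StatPac applies, so the adjusted r-squared shown here is an assumption, and the function names and example values are hypothetical.

```python
def adjusted_r_squared(r2, n, k):
    """Textbook small-sample adjustment to r-squared (assumed form)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


def overall_f(r2, n, k):
    """Overall F statistic with k and n - k - 1 degrees of freedom."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))


# Hypothetical study: r-squared of .85 from 21 records and 4 predictors.
adjusted = adjusted_r_squared(0.85, 21, 4)
f_ratio = overall_f(0.85, 21, 4)
```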

 

Example of the Regression Statistics Printout

 

Regression Coefficients

The regression coefficients can be printed with the RC=Y option. The output includes the constant, coefficient, beta weight, F-ratio, probability, and standard error for each independent variable.

Each coefficient provides an estimate of the effect of that variable (in the units of the raw score) for predicting the dependent variable. The beta weights, on the other hand, are the standardized regression coefficients and represent the relative importance of each independent variable in predicting the dependent variable.
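A beta weight is the raw coefficient rescaled by the standard deviations of the two variables, which is what makes beta weights comparable across predictors measured in different units. The Python sketch below shows that standard conversion; the function name and data are hypothetical.

```python
import statistics


def beta_weight(coefficient, x_values, y_values):
    """Convert a raw regression coefficient to a standardized beta weight."""
    return coefficient * statistics.stdev(x_values) / statistics.stdev(y_values)


# Hypothetical predictor and dependent values with a raw coefficient of 2.
beta = beta_weight(2.0, [1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 4.0, 6.0, 8.0, 10.0])
```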

The F-ratio allows us to calculate the probability that the influence of the predictor variable occurred by chance. The t-statistic for each independent variable is equal to the square root of its F-ratio.

The standard error of the ith regression coefficient can be used to obtain confidence intervals about each regression coefficient in conjunction with its F-ratio.

 

Example of Regression Coefficients Printout

 

Simple Correlation Matrix

After performing a regression analysis, it is a good idea to review the simple correlation matrix (SC=Y). If two variables are highly correlated, it is possible that the matrix is not well conditioned and it might be beneficial to run the regression again without one of the variables. If the coefficient of multiple determination does not show a significant change, you might want to leave the variable out of the equation.

 

Example of a Simple Correlation Printout

 

Partial Correlation Matrix

The partial correlation matrix is obtained from the inverse of the simple correlation matrix. It can be selected with the option PC=Y. This is useful in studying the correlation between two variables while holding all the other variables constant.

A significant partial correlation between variables A and B would be interpreted as follows: When all other variables are held constant, there is a significant relationship between A and B. The partial correlation matrix will be printed for those variables remaining in the equation after the stepwise procedure.
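For the simplest case of three variables, the partial correlation can be written directly in terms of the three simple correlations. The sketch below uses that closed-form formula, which gives the same result as working from the inverted matrix; the function name and the correlations are made up for illustration.

```python
import math

def partial_correlation(r_ab, r_ac, r_bc):
    """Correlation between A and B with a third variable C held constant."""
    numerator = r_ab - r_ac * r_bc
    denominator = math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))
    return numerator / denominator

# Hypothetical simple correlations among A, B and C
r_ab_given_c = partial_correlation(r_ab=0.60, r_ac=0.50, r_bc=0.40)
print(round(r_ab_given_c, 3))
```

In this made-up example, part of the simple correlation between A and B is shared with C, so the partial correlation is smaller than the simple correlation of .60.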

 

Example of a Partial Correlation Matrix Printout

 

Inverted Correlation Matrix

The solution to a multiple regression problem is obtained through a technique known as matrix inversion. The inverted correlation matrix is the inversion of the simple correlation matrix. It may be selected with the option IC=Y.

In examining the inverted matrix, we are specifically interested in the values along the diagonal. They provide a measure of how successful the matrix inversion was. If all the values on the diagonal are close to one, the inversion was very successful and we say the matrix is "well conditioned". If, however, we have one or more diagonal values that are high (greater than ten), we have a problem with collinearity (high correlations between independent variables).
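The diagonal check can be illustrated with a small Gauss-Jordan inversion. This is only a sketch: it assumes the matrix holds the correlations among the independent variables, the correlations are made up, and StatPac's own inversion routine may differ.

```python
def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment each row with the matching row of the identity matrix.
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

# Made-up correlations: X1 and X2 are nearly redundant (r = .97)
correlations = [[1.00, 0.97, 0.30],
                [0.97, 1.00, 0.30],
                [0.30, 0.30, 1.00]]
diagonal = [row[i] for i, row in enumerate(invert(correlations))]
for name, value in zip(["X1", "X2", "X3"], diagonal):
    note = "possible collinearity" if value > 10 else "ok"
    print(f"{name}: {value:.2f} ({note})")
```

Here the diagonal values for X1 and X2 exceed ten while the value for X3 stays close to one, which points to the X1 and X2 pair as the source of the problem.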

 

Example of an Inverted Matrix Printout

 

Print Each Step

You can print the statistics for each step of the stepwise procedure using the option PS=Y. This may be important when you want to study how the inclusion or deletion of a variable affects the other variables.

 

Example of the Print Steps Output

 

Summary Table

A good way to get an overview of how the steps proceeded and what effect each step had upon the r-squared is to print a summary table. To print the summary table, use the option ST=Y.

    

Example of a Summary Table

 

Data Table

The data table provides a detailed, record-by-record way to examine the residuals. Using the option DT=Y will cause the output to include a data table.

A "residual" is the difference between the observed value and the predicted value for the dependent variable (the error in the prediction). The standardized residuals which appear in the data table are the residuals divided by the standard error of the multiple estimate. Therefore, the standardized residuals are in standard deviation units. In large samples, we would expect 95 percent of the standardized residuals to lie between -1.96 and 1.96.
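The calculation of a standardized residual can be sketched as follows. The observed values, predicted values and standard error are hypothetical, and the function is an illustration rather than StatPac's own code.

```python
def standardize_residuals(observed, predicted, standard_error):
    """Divide each residual (observed minus predicted) by the standard error."""
    return [(obs - pred) / standard_error for obs, pred in zip(observed, predicted)]

# Hypothetical values for four records
observed = [12.0, 15.5, 9.0, 20.0]
predicted = [11.0, 15.0, 13.0, 19.5]
standard_error = 2.0

z_scores = standardize_residuals(observed, predicted, standard_error)
for record, z in enumerate(z_scores, start=1):
    marker = "*" if abs(z) > 1.96 else " "
    print(f"record {record}: {z:6.2f} {marker}")
```

Only the third record has a standardized residual outside the range of -1.96 to 1.96.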

 

Example of a Data Table

 

Outlier Definition and Adjustment

Outliers (extreme data points) can have a dramatic effect on the stability of a multiple regression model. There are two common methods of dealing with outliers in multiple regression models. The first is to simply eliminate any records that contain an outlier and then rerun the regression without those records. With OA=D (outlier adjustment set to delete), records containing the highest and lowest extreme residuals are deleted from the analysis. The second method adjusts the dependent variable for records with the highest and lowest extreme residuals. That is, the dependent variable is modified to a value that limits the standardized residual to a predetermined value. The OA option sets the outlier adjustment method. It may be set to OA=N (none), OA=D (delete), or OA=A (adjust).

Both methods use a two-step process. First, the regression is performed using the actual values of the dependent variable, and a standardized residual is calculated for each record. When the absolute value of a standardized residual exceeds the outlier definition z-value, the record is flagged. Then the regression is run again, and the flagged records are either eliminated (OA=D) or have their dependent variable adjusted to the value defined by the outlier definition z-value (OA=A). For example, if the outlier definition is set to 1.96 standard deviations (OD=1.96), records with standardized residuals in roughly the upper and lower two and a half percent of the distribution would be flagged. With OA=A, the dependent variable for each flagged record would be modified to a value that produces a standardized residual of plus or minus 1.96. Finally, the regression would be rerun using the modified dependent variable values for the flagged records. Flagged data records are shown with an asterisk in the data table.
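The two-step process can be sketched for a regression with a single independent variable. This is an illustration only: the function names are hypothetical, the data are made up, and it is an assumption that the adjusted value is the first-pass predicted value plus or minus the outlier definition times the first-pass standard error.

```python
import math

def fit_line(x, y):
    """Least-squares line; returns slope, intercept, predictions, standard error."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    predicted = [intercept + slope * xi for xi in x]
    sse = sum((yi - pred) ** 2 for yi, pred in zip(y, predicted))
    std_error = math.sqrt(sse / (n - 2))  # one predictor, so n - 2 degrees of freedom
    return slope, intercept, predicted, std_error

def rerun_with_outliers_handled(x, y, od=1.96, oa="A"):
    # Step 1: regression on the actual values; flag large standardized residuals.
    _, _, predicted, std_error = fit_line(x, y)
    z = [(yi - pred) / std_error for yi, pred in zip(y, predicted)]
    flagged = [abs(zi) > od for zi in z]
    # Step 2: delete (D) or adjust (A) the flagged records, then rerun once.
    if oa == "D":
        x2 = [xi for xi, flag in zip(x, flagged) if not flag]
        y2 = [yi for yi, flag in zip(y, flagged) if not flag]
    elif oa == "A":
        x2 = list(x)
        y2 = [pred + math.copysign(od * std_error, zi) if flag else yi
              for yi, pred, zi, flag in zip(y, predicted, z, flagged)]
    else:  # "N": no outlier adjustment
        x2, y2 = list(x), list(y)
    slope, intercept, _, _ = fit_line(x2, y2)
    return flagged, y2, slope, intercept

# Made-up data: y is roughly 2x, except for the fifth record.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 3.9, 6.2, 7.8, 25.0, 12.1, 13.8, 16.2, 17.9, 20.1]
flags, adjusted_y, adj_slope, _ = rerun_with_outliers_handled(x, y, oa="A")
_, _, del_slope, _ = rerun_with_outliers_handled(x, y, oa="D")
print("flagged records:", [i + 1 for i, flag in enumerate(flags) if flag])
print(f"slope after adjust: {adj_slope:.3f}, after delete: {del_slope:.3f}")
```

Note that the second regression is run only once and its residuals are not examined for new outliers.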

The stepwise procedure presents a problem for data trimming. The stepwise procedure often reduces the number of independent variables to a subset of the original list of independent variables. Data trimming involves using the standardized residuals to adjust the value of the dependent variable for some records. Rerunning the stepwise procedure with different values for some of the dependent variables could result in a different set of independent variables being stepped into the model, especially when there are highly correlated independent variables. To avoid this problem, StatPac reruns the multiple regression using the same independent variables selected in the first stepwise procedure. These variables are forced into the model so that the analysis runs in the non-stepwise mode.

It is important to note that the outlier adjustment process is performed only once. Each rerun of the regression produces a new set of standardized residuals, some of which would again exceed the outlier definition value (OD=z). That is, any set of data with sufficient sample size will yield a set of outliers, even if the data has already been adjusted. Allowing the outlier adjustment process to repeat indefinitely would eventually result in an adjustment to nearly every data record.

It is suggested that the user actually examine records that are flagged as extreme outliers before allowing the program to make any adjustments. Outlier adjustments assume that the data for all independent variables is acceptable. A mispunched data value for an independent variable could result in an extreme prediction that gets flagged as an outlier. Therefore, visual inspection is the best way to guarantee the successful handling of outliers.

Confidence Intervals & Confidence Level

Confidence intervals provide an estimate of variability around the regression line. Narrow confidence intervals indicate less variability around the regression line. The option CI=Y will include the confidence intervals in the data table.

Prediction intervals, instead of confidence intervals, should be used if you intend to use the regression information to predict new values for the dependent variable. Both the confidence intervals and the prediction intervals are centered on the regression line, but the prediction intervals are much wider. The option CI=P will print the prediction intervals in the data table.

The level of the confidence or prediction interval is set with the CL option, which specifies the percentage level of the interval. For example, if CI=P and CL=95, the 95% prediction intervals would be printed in the data table.
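The difference between the two kinds of interval can be sketched for a regression with one independent variable, using the standard textbook formulas. The function name and data are hypothetical, and the critical t-value is supplied by the user (2.306 is the two-tailed 95% value for 8 degrees of freedom).

```python
import math

def interval_half_widths(x, y, x_new, t_critical):
    """Predicted value at x_new with confidence and prediction half-widths."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    std_error = math.sqrt(sse / (n - 2))
    leverage = 1.0 / n + (x_new - mean_x) ** 2 / sxx
    confidence = t_critical * std_error * math.sqrt(leverage)        # for the mean response
    prediction = t_critical * std_error * math.sqrt(1.0 + leverage)  # for a single new value
    return intercept + slope * x_new, confidence, prediction

# Made-up data with 10 records (8 degrees of freedom)
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.3, 3.8, 6.4, 7.5, 10.6, 11.7, 14.5, 15.8, 18.4, 19.6]
center, ci_half, pi_half = interval_half_widths(x, y, x_new=5.5, t_critical=2.306)
print(f"confidence interval: {center - ci_half:.2f} to {center + ci_half:.2f}")
print(f"prediction interval: {center - pi_half:.2f} to {center + pi_half:.2f}")
```

Both intervals share the same center, but the prediction interval includes the extra variability of an individual observation and is therefore wider.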

Number of Variables to Force

The ability to force variables into an equation is important for several reasons:

1. A researcher often wishes to replicate the analysis of another study and, therefore, to force certain core variables into the equation, letting stepwise regression choose from the remaining set.

2. Some variables may be cheaper or easier to measure, and the user may want to see whether the remaining variables add anything to the equation.

3. It is common to force certain design variables into the equation.

4. When independent variables are highly correlated, one of them may be more accurate than the rest, and you may want to force this variable into the equation.

The FO option specifies the number of variables to force into the regression equation. To perform a standard (non-stepwise) multiple regression, set the FO option to the number of independent variables or higher. FO=200 will always force all independent variables into the equation. Thus, the FO option may be used to eliminate the stepwise part of the multiple regression procedure. If you force all variables into the equation, the multiple regression will contain only one step, where all variables are included in the equation.

The variables to be forced are taken in order from the list of independent variables. For instance, the option FO=3 forces the first three variables from the list of independent variables. Therefore, any variables you want to force should be specified at the beginning of the independent variable list.
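The way the FO value divides the independent variable list can be pictured with a short sketch. The function and the variable names are hypothetical and only illustrate the ordering rule described above.

```python
def split_forced(independent_variables, fo):
    """Split the list into forced variables and stepwise candidates."""
    forced = independent_variables[:fo]       # the first FO variables are always in the equation
    candidates = independent_variables[fo:]   # the rest are left to the stepwise procedure
    return forced, candidates

variables = ["AGE", "INCOME", "EDUCATION", "REGION", "TENURE"]

forced, candidates = split_forced(variables, 3)
print("forced:", forced, "stepwise candidates:", candidates)

# A value at least as large as the list forces everything (non-stepwise regression).
all_forced, none_left = split_forced(variables, 200)
print("forced:", all_forced, "stepwise candidates:", none_left)
```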

F to Enter & F to Remove

When faced with a large number of possible explanatory variables, two opposing criteria for selecting a regression equation are usually involved:

1. To make the equation useful for predictive purposes, we would like our model to include as many of the independent variables as possible so that reliable fitted values can be determined.

2. Because of the costs involved in obtaining information on a large number of independent variables, and subsequently monitoring them, we would like the equation to include as few of the independent variables as possible.

The compromise between these two extremes is generally called "selecting the best regression". This involves multiple executions of multiple regression in an attempt to add variables to improve prediction or remove variables to simplify the regression function. Stepwise regression provides a partial automation of this procedur