A00-240: SAS Certified Statistical Business Analyst Using SAS 9 – Regression and Modeling

Exam ID: A00-240

Exam Name: SAS Certified Statistical Business Analyst Using SAS 9 – Regression and Modeling

Successful candidates should have experience in:

  • Analysis of variance.
  • Linear and logistic regression.
  • Preparing inputs for predictive models.
  • Measuring model performance.

SAS A00-240 Exam Summary:

Exam Name   SAS Certified Statistical Business Analyst Using SAS 9
Exam Code   A00-240
Exam Duration   120 minutes
Exam Questions   60
Passing Score   68%
Exam Price   $180 (USD)
Books   Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Sample Questions   SAS Statistical Business Analyst Certification Sample Question
Practice Exam   SAS Statistical Business Analyst Certification Practice Exam

SAS A00-240 Exam Topics:

Objective   Details
ANOVA – 10%
Verify the assumptions of ANOVA
– Explain the central limit theorem and when it must be applied
– Examine the distribution of continuous variables (histogram, box-whisker, Q-Q plots)
– Describe the effect of skewness on the normal distribution
– Define H0, H1, Type I/II error, statistical power, p-value
– Describe the effect of sample size on p-value and power
– Interpret the results of hypothesis testing
– Interpret histograms and normal probability charts
– Draw conclusions about your data from histogram, box-whisker, and Q-Q plots
– Identify the kinds of problems that may be present in the data (biased sample, outliers, extreme values)
– For a given experiment, verify that the observations are independent
– For a given experiment, verify the errors are normally distributed
– Use the UNIVARIATE procedure to examine residuals
– For a given experiment, verify all groups have equal response variance
– Use the HOVTEST option of the MEANS statement in PROC GLM to assess response variance
Analyze differences between population means using the GLM and TTEST procedures
– Use the GLM procedure to perform ANOVA (CLASS, MODEL, MEANS, and OUTPUT statements)
– Evaluate the null hypothesis using the output of the GLM procedure
– Interpret the statistical output of the GLM procedure (variance derived from MSE, F value, p-value, R-squared, Levene’s test)
– Interpret the graphical output of the GLM procedure
– Use the TTEST Procedure to compare means
Perform ANOVA post hoc test to evaluate treatment effect
– Use the LSMEANS statement in the GLM or PLM procedure to perform pairwise comparisons
– Use the PDIFF option of the LSMEANS statement
– Use the ADJUST option of the LSMEANS statement (TUKEY and DUNNETT)
– Interpret diffograms to evaluate pairwise comparisons
– Interpret control plots to evaluate pairwise comparisons
– Compare and contrast the use of pairwise t-tests, Tukey, and Dunnett comparison methods
Detect and analyze interactions between factors
– Use the GLM procedure to produce reports that will help determine the significance of the interaction between factors (MODEL statement)
– Use the LSMEANS statement with the SLICE= option (also available through PROC PLM)
– Interpret the output of the GLM procedure to identify interaction between factors (p-value, F value, R-squared)
Linear Regression – 20%
Fit a multiple linear regression model using the REG and GLM procedures
– Use the REG procedure to fit a multiple linear regression model
– Use the GLM procedure to fit a multiple linear regression model
Analyze the output of the REG, PLM, and GLM procedures for multiple linear regression models
– Interpret REG or GLM procedure output for a multiple linear regression model:
– Convert models to algebraic expressions
– Identify missing degrees of freedom
– Identify variance due to model/error, and total variance
– Calculate a missing F value
– Identify variable with largest impact to model
– For output from two models, identify which model is better
– Identify how much of the variation in the dependent variable is explained by the model
– Draw conclusions from REG, GLM, or PLM output (about H0, model quality, graphics)
Use the REG or GLMSELECT procedure to perform model selection
– Use the SELECTION option of the MODEL statement in the GLMSELECT procedure
– Compare the different model selection methods (STEPWISE, FORWARD, BACKWARD)
– Enable ODS graphics to display graphs from the REG or GLMSELECT procedure
– Identify best models by examining the graphical output (fit criterion from the REG or GLMSELECT procedure)
– Assign names to models in the REG procedure (multiple MODEL statements)
Assess the validity of a given regression model through the use of diagnostic and residual analysis
– Explain the assumptions for linear regression
– From a set of residual plots, assess which assumption about the error terms has been violated
– Use REG procedure MODEL statement options to identify influential observations (Student Residuals, Cook’s D, DFFITS, DFBETAS)
– Explain options for handling influential observations
– Identify collinearity problems by examining REG procedure output
– Use MODEL statement options to diagnose collinearity problems (VIF, COLLIN, COLLINOINT)
Logistic Regression – 25%
Perform logistic regression with the LOGISTIC procedure
– Identify experiments that require analysis via logistic regression
– Identify logistic regression assumptions
– Explain logistic regression concepts (log odds, logit transformation, sigmoidal relationship between p and X)
– Use the LOGISTIC procedure to fit a binary logistic regression model (MODEL and CLASS statements)
Optimize model performance through input selection
– Use the LOGISTIC procedure to fit a multiple logistic regression model
– Perform Model Selection (STEPWISE, FORWARD, BACKWARD) within the LOGISTIC procedure
Interpret the output of the LOGISTIC procedure
– Interpret the output from the LOGISTIC procedure for binary logistic regression models:
– Model Convergence section
– Testing Global Null Hypothesis table
– Type 3 Analysis of Effects table
– Analysis of Maximum Likelihood Estimates table
– Association of Predicted Probabilities and Observed Responses
Score new data sets using the LOGISTIC and PLM procedures
– Use the SCORE statement in the PLM procedure to score new cases
– Use the CODE statement in PROC LOGISTIC to score new data
– Describe when you would use the SCORE statement vs the CODE statement in PROC LOGISTIC
– Explain how to score new data when you have developed a model from a biased sample
Prepare Inputs for Predictive Model Performance – 20%
Identify the potential challenges when preparing input data for a model
– Identify problems that missing values can cause in creating predictive models and scoring new data sets
– Identify limitations of Complete Case Analysis
– Explain problems caused by categorical variables with numerous levels
– Discuss the problem of irrelevant and redundant variables
– Discuss non-linearities and the problems they create in predictive models
– Discuss outliers and the problems they create in predictive models
– Describe quasi-complete separation
– Discuss the effect of interactions
– Determine when it is necessary to oversample data
Use the DATA step to manipulate data with loops, arrays, conditional statements and functions
– Use arrays to create missing indicators
– Use arrays, DO loops, IF statements, and explicit OUTPUT statements
Improve the predictive power of categorical inputs
– Reduce the number of levels of a categorical variable
– Explain thresholding
– Explain Greenacre’s method
– Cluster the levels of a categorical variable via Greenacre’s method using the CLUSTER procedure (FREQ, VAR, and ID statements; ODS output to create an output data set)
– Convert categorical variables to continuous using smooth weight of evidence 
Screen variables for irrelevance and non-linear association using the CORR procedure
– Explain how Hoeffding’s D and Spearman statistics can be used to find irrelevant variables and non-linear associations
– Produce Spearman and Hoeffding’s D statistics using the CORR procedure (VAR and WITH statements)
– Interpret a scatter plot of Hoeffding’s D and Spearman statistic to identify irrelevant variables and non-linear associations
Screen variables for non-linearity using empirical logit plots
– Use the RANK procedure to bin continuous input variables (GROUPS= and OUT= options; VAR and RANKS statements)
– Interpret RANK procedure output
– Use the MEANS procedure to calculate the sum and means for the target cases and total events (NWAY option; CLASS, VAR, OUTPUT statements)
– Create empirical logit plots with the SGPLOT procedure
– Interpret empirical logit plots 
Measure Model Performance – 25%
Apply the principles of honest assessment to model performance measurement
– Explain techniques to honestly assess classifier performance
– Explain overfitting
– Explain differences between validation and test data
– Identify the impact of performing data preparation before data is split 
Assess classifier performance using the confusion matrix
– Explain the confusion matrix
– Define: Accuracy, Error Rate, Sensitivity, Specificity, PV+, PV-
– Explain the effect of oversampling on the confusion matrix
– Adjust the confusion matrix for oversampling
Model selection and validation using training and validation data
– Divide data into training and validation data sets using the SURVEYSELECT procedure
– Discuss the subset selection methods available in PROC LOGISTIC
– Discuss methods to determine interactions (forward selection, with bar and @ notation)
– Create an interaction plot with the results from PROC LOGISTIC
– Select the model with fit statistics (BIC, AIC, KS, Brier score) 
Create and interpret graphs (ROC, lift, and gains charts) for model comparison and selection
– Explain and interpret charts (ROC, Lift, Gains)
– Create a ROC curve (OUTROC option of the SCORE statement in the LOGISTIC procedure)
– Use the ROC and ROCCONTRAST statements to create an overlay plot of ROC curves for two or more models
– Explain the concept of depth as it relates to the gains chart 
Establish effective decision cut-off values for scoring
– Illustrate a decision rule that maximizes the expected profit
– Explain the profit matrix and how to use it to estimate the profit per scored customer
– Calculate decision cutoffs using Bayes rule, given a profit matrix
– Determine optimum cutoff values from profit plots
– Given a profit matrix and model results, determine the model with the highest average profit
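
The decision cut-off items above can be illustrated with a small numeric sketch. It is written in Python rather than SAS, purely to show the arithmetic; the function names `bayes_cutoff` and `average_profit`, the profit figures, and the scored probabilities are all hypothetical. It assumes a 2x2 profit matrix in which a correct decision is worth at least as much as the corresponding wrong one, so the cutoff lies between 0 and 1.

```python
def bayes_cutoff(profit_tp, profit_fp, profit_fn, profit_tn):
    """Probability above which deciding 'event' has the higher expected profit.

    Decide 'event' when
        p*profit_tp + (1-p)*profit_fp > p*profit_fn + (1-p)*profit_tn,
    which rearranges to p > gain_neg / (gain_pos + gain_neg).
    """
    gain_pos = profit_tp - profit_fn  # extra profit from acting on a true event
    gain_neg = profit_tn - profit_fp  # extra profit from leaving a non-event alone
    return gain_neg / (gain_pos + gain_neg)


def average_profit(probs, actuals, cutoff, profit_tp, profit_fp, profit_fn, profit_tn):
    """Average profit per scored case when 'event' is decided for p > cutoff."""
    total = 0
    for p, actual in zip(probs, actuals):
        decide_event = p > cutoff
        if decide_event and actual == 1:
            total += profit_tp
        elif decide_event and actual == 0:
            total += profit_fp
        elif actual == 1:
            total += profit_fn
        else:
            total += profit_tn
    return total / len(probs)


# Hypothetical profit matrix: a solicited responder earns 99, a solicited
# non-responder costs 1, and not soliciting earns nothing either way.
matrix = dict(profit_tp=99, profit_fp=-1, profit_fn=0, profit_tn=0)
cutoff = bayes_cutoff(**matrix)
print(cutoff)  # 0.01

# Hypothetical scored cases: predicted probabilities and actual outcomes.
probs = [0.005, 0.02, 0.30, 0.008]
actuals = [0, 0, 1, 0]
print(average_profit(probs, actuals, cutoff, **matrix))  # 24.5
print(average_profit(probs, actuals, 0.5, **matrix))     # 0.0
```

With these made-up figures, the profit-based cutoff of 0.01 yields a higher average profit than the default 0.5 cutoff, because a missed responder costs far more than a wasted solicitation. Comparing models by average profit works the same way: compute `average_profit` for each model's scores at its own best cutoff and keep the larger value.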