MWSUG 2008 Paper Presentations

Paper presentations are the heart of any SAS® users group conference. The MWSUG 2008 conference will include both contributed and invited papers on a variety of topics. These papers will be organized into the sections listed below.

Click on the section name to view the paper titles, authors, and abstracts.



Application Development

Author Paper Title (click for abstract)
Goodman, Tony Development of Forecast Accuracy Comparison Tool (FACT) for Air Force Legacy Systems Using SAS
Karafa, Matthew Data Set Investigator - Automate Exception Reporting for an electronic data dictionary with %DSI()
King, Barry Scheduling College Classes Using Operations Research Techniques
Li, Shihong Using PSI to monitor predictive model stability in the database marketing industry
Patel, Piyush Making Sense of Enterprise Business Intelligence (EBI) Log Configuration Files to Gain Valuable Insight on User Behavior
Zheng, Steve How to generate dynamical and flexible codes in a Clinical Trial



Data Visualization

Author Paper Title (click for abstract)
Osborne, Anastasiya Let Me Look At It! Graphic Presentation of Any Numeric Variable
Ciarlariello, Paul A Picture is Worth a Thousand Data Points: Increase Understanding by Mapping Data
Li, Shiqun (Stan) Creating Special Symbols in SAS® Graph
Wright, Wendi HOW version of Graphing Class



Pharmaceutical Applications

Author Paper Title (click for abstract)
Justice, E.E. Risk Factors and Risk Associated with Hospital Stays in Patients with Myalgia
Allmer, Christine Inverse Prediction Using SAS Software: A Clinical Application
Ramamurthy, Amurthur Global Clinical Data Classification: A Discriminant Analysis
Redick, Jacob Factors Influencing the Length of Stay in a General Hospital for Inpatients Diagnosed with Depression
Tang, Guoxin Data Mining and Analysis to Lung Disease Data
Ugiliweneza, Beatrice Mastectomy versus Lumpectomy in Breast Cancer Treatment
Zhou, Jenny Blinding Sponsors or Open Label Studies: Challenges and Solutions
Dunnigan, Keith Confidence Interval Calculation for Binomial Proportions



SAS Presents

Author Paper Title (click for abstract)
Hemedinger, Chris Find Out What You're Missing: Programming with SAS® Enterprise Guide
Rodriguez, Robert N. Getting Started with ODS Statistical Graphics in SAS® 9.2
Secosky, Jason Sampler of What's New in Base SAS® 9.2
Wicklin, Rick SAS® Stat Studio: A Programming Environment for High-End Data Analysts
DelGobbo, Vincent Tips and Tricks for Creating Multi-Sheet Microsoft Excel Workbooks the Easy Way with SAS®
Whitcher, Mike New SAS® Performance Optimizations to Enhance Your SAS® Client and Solution - Access to the Database
DelGobbo, Vincent You Want ME to use SAS® Enterprise Guide® ??



Statistics and Data Analysis

Author Paper Title (click for abstract)
Ames, Jeff Long-Term Value Modeling in the Automobile Industry
Bena, James Survival Methods for Correlated Time-to-Event Data
Cerrito, Patricia The Difference Between Predictive Modeling and Regression
Cerrito, Patricia The Over-Reliance on the Central Limit Theorem
Finch, Holmes Imputation of Categorical Missing Data: A Comparison of Multivariate Normal and Multinomial Methods
Liu, Dachao A Comparison between correlated ROC Curves Using SAS and DBM MRMC - A Case Study in Image Comparison
Ghosh, Jagannath Use of System Function in Creating Macro for Survival Analysis
Ramamurthy, Amurthur SIPOC and Recursive Partitioning



Tutorials

Author Paper Title (click for abstract)
Carpenter, Art Advanced PROC REPORT: Doing More in the Compute Block
Gao, Yubo Exploring Efficient Ways to Collapse Variables
Chiang, Alan A Multivariate Ranking Procedure to Assess Treatment Effects
Lafler, Kirk SAS® Tips, Tricks and Techniques
Rice, Daniel Reduced Error Logistic Regression: Completely Automated Reduced Error Data Mining in SAS
Mo, Daojun Using Direct Standardization SAS® Macro for a Valid Comparison in Observational Studies




Presentation Abstracts


A01. Development of Forecast Accuracy Comparison Tool (FACT) for Air Force Legacy Systems Using SAS
Tony Goodman, Dynamics Research Corporation

The Requirements Integration Process Improvement Team (RIPIT) at Wright Patterson Air Force Base, using the SAS language and procedures coupled with Adobe Flash multimedia technology, has developed the Forecast Accuracy Comparison Tool, or FACT Plus, enabling Air Force legacy systems to reduce the time and cost of forecasting supply and maintenance needs for Air Force spare parts by 99.5%.

In September 2006, RIPIT was approached by functional analysts and statisticians working on legacy supply chain management systems at Wright Patterson Air Force Base. The challenge: could we use new technology to build a web-based application to help with forecasting supply and maintenance needs for Air Force spare parts? At the time, forecasting in the Secondary Items Requirements System (SIRS) (D200A) required subject matter experts to juggle myriad screens and interfaces, and users employed an array of methodologies to determine the forecast accuracy for a single Air Force spare part. With existing legacy tools, it could typically take an item manager or equipment specialist hours per stock number to retrieve and analyze the information needed for each forecast. In addition, this process was susceptible to human error, and users had no way of examining all of their information as a whole. These disparate processes were too complex for commercial off-the-shelf (COTS) tools to handle. D200A, consisting of more than 100,000 records and thousands of data elements, called for the development of a custom web application to integrate these techniques and information. The tool has automated these various processes and methodologies and incorporated them into a one-stop centralized application. A process that formerly took one man-hour or more can now be executed in 15 seconds. In the year FACT Plus has been used, it has resulted in a manpower cost avoidance of over $304,000 to the Air Force.*

In addition, D200A uses multiple forecasting methods to satisfy differing forecast requirements. In the past, an analyst would be forced to repeat tasks when moving from one method to the next. FACT Plus offers the users the capability to manipulate key elements within the interface via interactive Flash technology to make changes and execute “What If” scenarios for any given stock number. Users can toggle between methods and mix and match techniques on the fly. These results are automatically saved via Local Shared Objects in Flash. Along with increased efficiency, FACT Plus grants users the additional time to work on the problem.

FACT Plus integrates a mix of technologies including Base SAS, SAS/IntrNet, SAS/GRAPH, htmSQL, JavaScript, and Adobe Flash. All of these technologies are intertwined and communicate with one another. We will demonstrate how these various technologies are utilized within one powerful application which we feel serves a valuable purpose for its users and is indeed unique in its field.

* For the period March 2007 through March 2008, we used hit counts as our measure. FACT Plus had 11,500 hits, or visitors to the site. We assume that each hit represents one analyst querying FACT Plus to acquire forecast information, and that an analyst at a pay rate of $26.56 per hour (GS-9, Step 5, the manpower standard used by the Air Force) would spend about one hour gathering all the salient data. Using the 11,500 hits as our baseline, we calculate a yearly manpower cost of $305,440. With FACT Plus offering the same information in only 15 seconds, or 0.416% of the time it would have taken using legacy methods, the cost has been reduced to $1,272.66 for the same period, for a cost avoidance to the Air Force of $304,167.34.


A02. Data Set Investigator - Automate Exception Reporting for an electronic data dictionary with %DSI()
Matthew Karafa, Cleveland Clinic Foundation

Data cleaning and data sleuthing can be the most tedious and time-consuming parts of any analysis. As part of the data elicitation process, we often collect metadata about valid values and other data rules that can be used to quickly check for such problems. By using a small amount of this metadata about the data set's variables, the provided macro, %DSI(), can produce a data exception report, both by rule and by record ID, which can be quickly turned back to the client for data correction. %DSI() takes a comma-separated data file defining data “rules” with fields including variable name, type, valid values, and ranges. These rules are applied to the data set using internal macros that report the records and values that violate the given set of rules. The violations are then organized into a fairly simple, MS Word-compatible HTML document, which can be returned to the client for action.
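
The abstract describes the general pattern rather than the macro itself. As a rough sketch of the idea, and not the authors' %DSI() implementation, the fragment below applies one numeric range rule per row of a hypothetical RULES data set (columns VARNAME, LOW, HIGH) to a hypothetical STUDY data set keyed by SUBJID:

   %macro check_range(data=, id=, var=, low=, high=);
      title "Rule violations for &var (valid range &low to &high)";
      proc print data=&data noobs;
         where &var is not missing and (&var < &low or &var > &high);
         var &id &var;
      run;
   %mend check_range;

   /* Generate one call per rule; CALL EXECUTE stacks the code for execution */
   data _null_;
      set rules;
      call execute(cats('%check_range(data=study,id=subjid,var=',
                        varname, ',low=', low, ',high=', high, ')'));
   run;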


A03. Scheduling College Classes Using Operations Research Techniques
Barry King, Butler University

This paper presents efforts taken at Butler University’s College of Business Administration to construct semester class schedules using a mixed integer linear programming procedure (PROC OPTMILP) to develop a schedule of classes and an assignment procedure (PROC ASSIGN) to assign faculty to the schedule. It also discusses our experiences with the effort and presents suggestions for improvement.

At Butler University, multiple sections of classes are scheduled in an attempt to give students a wide choice of options in constructing their individual class schedules. Many constraints are imposed, such as not having senior-level classes on Fridays so seniors can better participate in internship activities, and not having junior and senior classes in the same discipline offered on the same day and time. There is also a desire from the administration to spread classes throughout the day, avoiding an unusually large number of classes during the prime late-morning and early-afternoon hours and thus making better use of classroom facilities.

Our work departs from previous efforts in that it takes a two-stage optimization approach to the problem faced at Butler University and solves the problem with solvers available through SAS.

Each of the two programs has a data import and development stage, a solution stage, and a reporting stage all written in the SAS programming language.


A04. Using PSI to monitor predictive model stability in the database marketing industry
Shihong Li, ChoicePoint Precision Marketing

Predictive models help streamline the decision-making process by improving the quality and consistency of decisions being made. In order to achieve maximum effectiveness and ensure maximum profitability for the client, front-end reports must be put in place to track model stability through the model’s entire life cycle. PSI applications can be developed to serve this important business need for database marketing customers. PSI stands for Population Stability Index; population stability indices are calculated and monitored using a methodology known as “Entropy.” The PSI application is a tool for creating front-end reports that track model stability. PSI is widely used as a stability metric and has proved to be very effective. Moreover, by incorporating model metadata management in database marketing environments, PSI provides great flexibility in creating time-series-based front-end reporting that can leverage dynamic model attribute metadata tables to simplify new model implementation and old model retirement. The PSI application helps proactively inform customers of changes in data that may affect the performance of their predictive models and allows them to make well-informed preemptive adjustments if needed. PSI reporting allows clients to dynamically adjust marketing strategies and quickly react to in-market changes, providing them with a valuable edge over competitors who have not implemented an effective application to measure model stability.

Benefits:

  1. Creates standardized front-end reports to monitor predictive model stability;
  2. Utilizes the PSI methodology, which has proved very effective in detecting data shifts;
  3. When combined with model metadata, enables time-series comparison reports and simplifies new model implementation and old model retirement;
  4. Updates regularly to provide dynamic monitoring;
  5. Maximizes model effectiveness and enables maximum client profitability;
  6. Adds a competitive edge over analytic modeling services alone.
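
For readers unfamiliar with the index itself, the usual PSI formula is the sum over score bins of (actual% - expected%) x ln(actual% / expected%), with values near 0.10 and 0.25 commonly treated as warning and action thresholds. The step below is only a sketch of that calculation, assuming a hypothetical SCORE_BINS data set with one row per bin and columns DEV_PCT (expected proportion) and CUR_PCT (actual proportion); the production application described above is considerably more elaborate.

   data psi_total;
      set score_bins end=last;
      /* running sum of (actual - expected) * ln(actual / expected) */
      psi + (cur_pct - dev_pct) * log(cur_pct / dev_pct);
      if last then output;          /* keep only the overall index */
      keep psi;
   run;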


A05. Making Sense of Enterprise Business Intelligence (EBI) Log Configuration Files to Gain Valuable Insight on User Behavior
Piyush Patel, ChoicePoint

This paper will focus on the benefits of monitoring login activity. Without reliable logs, it can be very difficult to keep track of usage. From a security point of view logs are one of the most important assets contained on the server. After all, without logs, how will you know who is accessing your system and what information has been accessed from your server? By reading logging information, you can determine how much time business associates and/or clients are spending on your portal, creating reports or performing analysis. Therefore, it is imperative that your logs not miss a beat.

SAS EBI includes some very powerful features, yet sometimes they are not always enabled. This paper describes how to modify the logging configuration files for Portal Web Applications and how to parse through the log file to monitor and track users’ logon activity. This paper will also provide a step-by-step guide through the use of Information Delivery Portal or Web Report Studio to analyze portal log usage.

Benefits:

  1. Monitor user logon activity to gain insight into usage trends
  2. Compute portal usage
  3. Analyze server usage


A06. How to generate dynamical and flexible codes in a Clinical Trial
Steve Zheng, Eli Lilly

In clinical trial studies, we typically develop and move TFLs (tables, figures, and listings) to production before data lock to preserve blinding. Consequently, much of the information associated with this process is unknown at that point, and we try our best to minimize occasions to recheck programs in production. Achieving this goal requires many technical skills, especially advanced macro skills: it is essential to know how to read existing data as parameters, use those parameters as conditions, and pass them to code that runs later. I will use two of the biggest trials in our company as examples to demonstrate how to utilize those skills in real studies.
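
As a generic sketch of the technique described above (not the code used for the trials mentioned), the fragment below reads treatment group values from an existing data set into macro variables and then uses them to drive later code; the data set and variable names (ADSL, TRT01P, AGE) are assumed for illustration.

   proc sql noprint;
      select distinct trt01p
         into :trtlist separated by '|'
         from adsl;
   quit;
   %let ntrt = %sysfunc(countw(&trtlist, |));

   %macro by_treatment;
      %do i = 1 %to &ntrt;
         %let trt = %scan(&trtlist, &i, |);
         title "Summary for treatment group: &trt";
         proc means data=adsl n mean std;
            where trt01p = "&trt";
            var age;
         run;
      %end;
   %mend by_treatment;

   %by_treatment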


D01. Let Me Look At It! Graphic Presentation of Any Numeric Variable
Anastasiya Osborne, Farm Service Agency (USDA)

Have you ever been asked to produce a high-quality, management-friendly report in record time? Have you ever spent time typing ranges for PROC FORMAT to apply in tables or maps? During Congressional hearings, the U.S. Department of Agriculture (USDA) often gets urgent requests to graphically represent politically sensitive data. This paper presents a SAS® macro that was developed to allow flexibility in choosing a dataset, a variable in question, and a number of groups for statistical analysis. The macro then produces the results in an Excel spreadsheet and in ODS output. It also automatically creates a format for the variable that can be used in PROC GMAP to produce an impressive map. The macro reduces programming time by eliminating the time-consuming tasks of analyzing the variable and manually typing ranges for PROC FORMAT.

Being a member of the Economic and Policy Analysis Staff at the Farm Service Agency (FSA), USDA, requires stamina and creativity. A stream of urgent requests to produce ad-hoc reports with statistical analysis of data can come at any time. Creating these reports can be time-consuming and inefficient, especially when analysis of unfamiliar data is needed within a short period of time, as, for example, during Congressional deliberations. This is when the SAS macro facility comes in handy. The macro saves time and automates the tedious, mistake-prone process of typing format ranges, freeing the analyst's mind to tackle more complicated issues. This automated approach to analyzing the variable, creating a user-defined format, and mapping the data drastically reduces the staff time needed to produce a report.
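
A condensed sketch of the central idea, not the authors' macro: derive group boundaries from the data itself and feed them to PROC FORMAT through a CNTLIN= data set, so that no ranges ever have to be typed by hand. The data set and variable names (COUNTY_DATA, PAYMENT) are assumed, and the variable is cut into five quintile groups.

   proc univariate data=county_data noprint;
      var payment;
      output out=cuts pctlpts=0 20 40 60 80 100 pctlpre=p;
   run;

   data fmt;                          /* control data set for PROC FORMAT */
      set cuts;
      retain fmtname 'paygrp' type 'N';
      length label $40 eexcl $1;
      array p{6} p0 p20 p40 p60 p80 p100;
      do i = 1 to 5;
         start = p{i};
         end   = p{i+1};
         eexcl = ifc(i < 5, 'Y', 'N');   /* upper bound exclusive except for the last group */
         label = catx(' to ', put(start, comma12.), put(end, comma12.));
         output;
      end;
      keep fmtname type start end label eexcl;
   run;

   proc format cntlin=fmt;            /* PAYGRP. can now be used in PROC GMAP, tables, etc. */
   run;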


D02. A Picture is Worth a Thousand Data Points: Increase Understanding by Mapping Data
Paul Ciarlariello, Sinclair Community College

It is sometimes difficult to understand what the data is telling you, especially when you are staring at a page full of numbers. Help get your message across more clearly and effectively by painting a pretty picture. This paper presents step-by-step instructions for transforming numeric data to visual data by using SAS, mapping software, and the SAS portal. This example shows how student enrollment differences between one year and the next can be plotted onto a map by ZIP code. Note that although this example relates specifically to the education market, this process can easily be translated for use in many other areas.


D03. Creating Special Symbols in SAS® Graph
Shiqun (Stan) Li, Minimax Information

This paper will present several techniques to embed special characters and special symbols into SAS® graphs. The special symbols can be Greek letters, mathematical symbols, subscripts, superscripts, underlines, and user-designed symbols. The symbols can be created in the titles, footnotes, axis labels, or the graph area of a SAS graph. This presentation is prepared for an intermediate to advanced audience.
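
As a small taste of the kind of technique the paper covers (not necessarily the authors' own examples), SAS/GRAPH TITLE and FOOTNOTE statements accept in-line FONT=, HEIGHT=, and MOVE= options, which is one way to mix Greek letters and superscripts into graph text. The sketch below assumes the GREEK software font's letter mapping (for example, 'b' renders as beta in that font).

   goptions ftext=swiss;

   /* Greek letter by switching to the GREEK software font mid-title */
   title1 font=swiss 'Mean change in ' font=greek 'b' font=swiss '-carotene';

   /* superscript by nudging the text baseline up and shrinking it */
   title2 height=1.5 font=swiss 'Concentration (mg/m'
          move=(+0,+0.5) height=1 '3'
          move=(+0,-0.5) height=1.5 ')';

   proc gplot data=sashelp.class;
      plot weight*height;
   run;
   quit;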


D04. HOW version of Graphing Class
Wendi Wright, CTB McGraw-Hill

Starting with a SAS PLOT program, we will convert the plot to PROC GPLOT, and I will show you the many and varied ways you can improve the look of the plot using SAS/GRAPH statements. We will make the plot really shine by customizing titles, footnotes, symbols, legends, axes, and even the reference line. At each step, a hands-on example will be presented in which users choose their own features, such as symbol colors and placement of the legend. In the end, you will have built your own personalized graph using the Title, Footnote, Symbol, Legend, and Axis statements.


P01. Risk Factors and Risk Associated with Hospital Stays in Patients with Myalgia
E.E. Justice, University of Louisville

OBJECTIVE: The risk factors for myalgia were examined along with other data associated with these risk factors involving the hospital stay of patients with myalgia. METHOD: Data were collected from hospitals around the United States through the NIS, and these data were narrowed down to those patients suffering from myalgia. These data were then analyzed using SAS Enterprise Guide 4. Data visualization techniques, logistic regression and linear models were used to achieve the desired results.

RESULTS: It was determined that females make up the majority of myalgia sufferers, with a peak age around 56. The male subjects with myalgia had a broad peak of 43 to 65 years of age. This condition has occurred in most women by the age of 58. It was also determined that Asian/Pacific Islanders demonstrated a peak age of around 70, in comparison to the average age of 58.3. Asians have the lowest probability of accumulating less than $20,000 in total charges, and Whites, Blacks, and Native Americans have the highest. Asians also have the highest probability among the races of accumulating between $38,000 and $58,000 in charges. Whites were determined to have the lowest probability of staying less than five days in the hospital, and Asians have the highest probability of staying between 11 and 16 days. A linear model revealed that the following DX and DRG codes are significant in predicting total charges and relate to heart and blood conditions: transfusion of packed cells, anemia (unspecified), venous catheterization (not elsewhere classified), of native coronary artery, congestive heart failure (unspecified), and atrial fibrillation.

CONCLUSION: There is currently limited data on the risk factors of myalgia and these results will hopefully be a start to learning more about the condition.


P02. Inverse Prediction Using SAS Software: A Clinical Application
Christine Allmer, Mayo Clinic

An important application of regression methodology is in the area of prediction. Oftentimes investigators are interested in predicting a value of a response variable (Y) based on the known value of the predictor variable (X). However, sometimes there is a need to predict a value of the predictor variable (X) based on the known value of the response variable (Y). In such situations, it is improper to simply switch the roles of the response and predictor variables to get the desired predictions, i.e., to regress X on Y. A method that accounts for the underlying assumptions while estimating or predicting X from known Y is known as inverse prediction. This approach will be illustrated using the PROC CORR, PROC REG, and PROC GPLOT procedures in SAS®. The calculations for the 95% confidence limits for a predicted X from a known Y will also be presented. The macro and its application will be demonstrated using data from clinical / laboratory studies.
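
As a bare-bones sketch of the point estimate only (the paper also covers the 95% confidence limits), the fragment below fits the usual regression of Y on X with PROC REG and then inverts the fitted line to estimate the X that corresponds to a known response Y0. The data set and variable names (CALIB, RESPONSE, DOSE) and the Y0 value are assumed for illustration.

   proc reg data=calib outest=est noprint;
      model response = dose;           /* ordinary regression of Y on X */
   run;
   quit;

   %let y0 = 42;                       /* known response value */

   data xhat;
      set est;                         /* EST holds the Intercept and DOSE coefficients */
      xhat = (&y0 - intercept) / dose; /* inverse prediction: X0 = (Y0 - b0) / b1 */
   run;

   proc print data=xhat noobs;
      var xhat;
   run;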


P03. Global Clinical Data Classification: A Discriminant Analysis
Amurthur Ramamurthy, Covance

The variation in data across geographies and over time has not been well documented. A data set appropriate for this analysis requires consistent global methods and consistent global calibration, which are difficult to document. Covance Central Laboratories operates with a single global method and with consistent global method calibration.

Global population data sets were analyzed using Discriminant Analysis to document the differences in clinical trial populations around the world. Analyses performed using Average of Normals (AON) and 99% and 95% non-parametric truncated data show minimal geographical differences.


P04. Factors Influencing the Length of Stay in a General Hospital for Inpatients Diagnosed with Depression
Jacob Redick, University of Louisville

Hospital inpatients are consumers of hospital bed time, a precious commodity in our society; the length of stay in a general hospital by an inpatient is of interest to the medical community (doctors, nurses, and patients) as well as the businesses that bring them together (hospitals, insurers, HMOs, etc.). Furthermore, the financial costs of hospital stays are of particular interest to any parties responsible for their payment. Comorbid depression is linked to extended stays, and among depressed patients, labor & delivery patients are the most common; however, labor & delivery patients average significantly shorter lengths of stay (2.5 days vs. 5.4 days) and lower costs ($6,900 vs. $18,900) than other inpatients with comorbid depression. Among the twenty most prevalent diagnoses, urinary tract infections, anemia, fluid & electrolyte disorders, and hypertension extended the lengths of stay most; among the top twenty procedures performed, diagnostic vascular catheterization, respiratory intubation & mechanical ventilation, hemodialysis, and blood transfusions were correlated with longer stays and higher expenses in the hospital. Alcohol & drug rehabilitation/detoxification were shown to decrease an inpatient’s length of stay, although this may be partly due to transfer to another facility. There were fewer significant diagnoses or procedures related to labor & delivery patients, with lesser impact in that subgroup. It was determined that, while longer lengths of stay can significantly inflate total charges, the reliability of the length of stay as a predictor is usually below 95%. The variation in length of stay and total charges remains largely unexplained by these findings.


P05. Data Mining and Analysis to Lung Disease Data
Guoxin Tang, University of Louisville

Objective: To examine the relationship between patient outcomes and conditions of the patients undergoing different treatments for lung disease.

Method: SAS Enterprise Guide was used to obtain lung disease data from the NIS (National Inpatient Sample) by using the CATX and RXMATCH functions. We first concatenate all 15 columns of diagnosis codes into one text string using the CATX function. RXMATCH then looks for codes beginning with ‘162’, which finds all patients with a diagnosis of lung disease; ICD-9 code 162 denotes malignant neoplasm of the trachea, bronchus, and lung. Kernel Density Estimation was used to examine lung disease by Age, Length of Stay, and Total Charges, showing the relationships among these outcomes through data visualization. We then used SAS Text Miner to investigate relationships in co-morbid diagnoses by defining clusters of diagnoses, and inspected the results by defining a severity measure using text analysis.
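
A simplified sketch of the filtering step only, using PRXMATCH in place of the RXMATCH routine the authors mention; the libref and column names (NIS.INPATIENT, DX1-DX15) are assumed.

   data lung;
      set nis.inpatient;
      length alldx $200;
      alldx = catx(' ', of dx1-dx15);      /* concatenate the 15 diagnosis columns */
      if prxmatch('/\b162/', alldx) > 0;   /* keep records with a code starting in 162 */
   run;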

Results: After filtering by lung disease, there were more than 8,000 observations in the data. The examination reveals a clear relationship between lung disease and Age, Length of Stay, and Total Charges. Inpatient events for patients with lung disease increase starting at age 35, accelerate at age 42, and decrease at age 74. These patients have a higher probability of a four-day stay, indicating a higher probability of higher cost.

Conclusion: By using Kernel Density Estimation and Text Miner, we obtained statistical information about Age, Length of Stay, and Total Charges for patients with lung diseases. Cluster analysis also gave us five diagnosis clusters ranked by the severity measure.


P06. Mastectomy versus Lumpectomy in Breast Cancer Treatment
Beatrice Ugiliweneza, University of Louisville

Objective: To extract information and analyze the cost of mastectomy and lumpectomy as breast cancer treatments using SAS.

Methods: The data used are from the National Inpatient Sample (NIS), which contains a stratified sample of all hospital patient visits from 37 participating states. First, we extract breast cancer cases from all the data and then focus on those treated by mastectomy and lumpectomy. Then, data analysis techniques are used to examine and compare these two major surgical treatments. We used linear models in SAS/STAT and PROC GPLOT to examine the data.

Results: For the data used, the study shows that the cost of mastectomy treatment is lower than the cost of lumpectomy treatment. Moreover, the analysis shows that mastectomy is used more often than lumpectomy.

Conclusion: SAS is a good tool for statistical data analysis, data mining and data visualization. Further study will include claims data to investigate longitudinal patient outcomes.


P07. Blinding Sponsors or Open Label Studies: Challenges and Solutions
Jenny Zhou, Eli Lilly and Company

Although double blinding (blinding treating physicians and patients) is the optimal approach to minimizing bias in clinical research, it’s not always feasible to conduct double blind studies. For open label studies, it’s often desirable to blind the study sponsors to reduce potential bias and increase the credibility of trial results. However, open label studies usually create challenges for blinding sponsors. In this paper, we go over various types of CRF data that could unblind sponsors and then propose some methods to scramble the data in order to blind sponsors. We implement the proposed methods with three SAS macros and also provide a real example for illustration.

In some therapeutic areas, although desirable, it can be difficult, sometimes even not feasible or ethical, to conduct a double blind trial. To minimize potential bias due to knowing treatment-level aggregate data while the trial is ongoing, it is important to blind/scramble the database during the course of the trial, especially if the trial is for the purpose of registration. We begin this paper by introducing various kinds of data collected on the CRF that can unblind sponsors, continue with several methods that can be used separately or in combination to blind the CRF data, and implement these methods with three SAS macros. In addition, we demonstrate with examples how these macros can be used to serve the blinding purpose in a clinical trial.


P08. Confidence Interval Calculation for Binomial Proportions
Keith Dunnigan, Statking Consulting, Inc.

Some of the most common and first-learned calculations in statistics involve estimating proportions and calculating confidence intervals. The Wald method, which is easy to calculate and common to most statistics textbooks, has significant issues for a large range of n and p. The Wald method will be presented and contrasted with the Wilson score method and the exact Clopper-Pearson method. SAS code will be presented for calculating confidence intervals by each of the three methods. In addition, SAS code for sample size calculation by the Wald and Wilson score methods will be given. Finally, illustrative examples will be presented.
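
As a sketch of the three interval formulas for a single observed proportion (x successes out of n), and not the author's code, the step below computes the Wald, Wilson score, and exact Clopper-Pearson limits directly; the example counts are arbitrary.

   data ci;
      x = 27; n = 80; alpha = 0.05;
      p = x / n;
      z = probit(1 - alpha/2);

      /* Wald */
      wald_l = p - z*sqrt(p*(1-p)/n);
      wald_u = p + z*sqrt(p*(1-p)/n);

      /* Wilson score */
      center  = (p + z**2/(2*n)) / (1 + z**2/n);
      halfwid = z*sqrt(p*(1-p)/n + z**2/(4*n**2)) / (1 + z**2/n);
      wils_l  = center - halfwid;
      wils_u  = center + halfwid;

      /* Exact Clopper-Pearson, via beta distribution quantiles */
      if x = 0 then cp_l = 0;
      else cp_l = betainv(alpha/2, x, n - x + 1);
      if x = n then cp_u = 1;
      else cp_u = betainv(1 - alpha/2, x + 1, n - x);
   run;

   proc print data=ci noobs;
      var p wald_l wald_u wils_l wils_u cp_l cp_u;
      format p wald_l wald_u wils_l wils_u cp_l cp_u 6.4;
   run;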


SAS01. Find Out What You're Missing: Programming with SAS® Enterprise Guide
Chris Hemedinger, SAS Institute

In this paper, you can read about the productivity gains that you can enjoy when you add SAS® Enterprise Guide® to your SAS programming toolbox. You will see how to perform old tasks in a new way as well as how to accomplish some tasks that would have been very difficult—if not impossible—without the benefit of an integrated tool like SAS Enterprise Guide. Topics in this paper include:


SAS02. Getting Started with ODS Statistical Graphics in SAS® 9.2
Robert N. Rodriguez, SAS Institute

ODS Statistical Graphics (or ODS Graphics for short) is major new functionality for creating statistical graphics that is available in a number of SAS software products, including SAS/STAT®, SAS/ETS®, SAS/QC®, and SAS/GRAPH®. With the production release of ODS Graphics in SAS 9.2, over sixty statistical procedures have been modified to use this functionality, and they now produce graphs as automatically as they produce tables. In addition, new procedures in SAS/GRAPH use this functionality to produce plots for exploratory data analysis and for customized statistical displays.

SAS/GRAPH is required for ODS Graphics functionality in SAS 9.2. This paper presents the essential information you need to get started with ODS Graphics in SAS 9.2. ODS Graphics is an extension of ODS (the Output Delivery System), which manages procedure output and lets you display it in a variety of destinations, such as HTML and RTF. Consequently, many familiar features of ODS for tabular output apply equally to graphs. For statistical procedures that support ODS Graphics, you invoke this functionality with the ods graphics on statement. Graphs and tables created by these procedures are then integrated in your ODS output destination. ODS Graphics produces graphs in standard image file formats, and the consistent appearance and individual layout of these graphs are controlled by ODS styles and templates, respectively. Since the default templates for procedure graphs are provided by SAS, you do not need to know the details of templates to create statistical graphics. However, with some understanding of the underlying Graph Template Language, you can modify the default templates to make changes to graphs that are permanently in effect each time you run the procedure.
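
A minimal example of that pattern (any ODS Graphics-enabled procedure works the same way; PROC REG and the SASHELP.CLASS sample data are just convenient choices):

   ods html file='reg_graphs.html';
   ods graphics on;

   proc reg data=sashelp.class;
      model weight = height;   /* fit and diagnostic plots are produced automatically */
   run;
   quit;

   ods graphics off;
   ods html close;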

Alternatively, to facilitate making immediate changes to a particular graph, SAS 9.2 introduces the ODS Graphics Editor, a point-and-click interface with which you can customize titles, annotate points, and make other enhancements.


SAS03. Sampler of What's New in Base SAS® 9.2
Jason Secosky, SAS Institute

Coding with SAS is easier than ever with SAS 9.2. This paper highlights the top new features and performance improvements in DATA step, PROC SQL, and PROC SORT. Included are writing functions with DATA step syntax, improved performance when accessing an external database from PROC SQL, more intuitive and culturally acceptable sorting with PROC SORT, and several "Top 10" SASware Ballot items.


SAS04. SAS® Stat Studio: A Programming Environment for High-End Data Analysts
Rick Wicklin, SAS Institute

SAS Stat Studio 3.1 is new statistical software in SAS 9.2 that is designed to meet the needs of high-end data analysts—innovative problem solvers who are familiar with SAS/STAT® and SAS/IML® but need more versatility to try out new methods. Stat Studio provides a rich programming language, called IMLPlus, that blends an interactive matrix language (IML) with the ability to call SAS procedures as functions and to create customized dynamic graphics. For standard tasks, Stat Studio provides the same interactive graphics and statistical capabilities available in SAS/INSIGHT®, and so it serves as a programmable successor to SAS/INSIGHT.

With Stat Studio, you can build on your familiarity with SAS/STAT or SAS/IML to write programs that explore data, fit models, and relate the results to the data with linked graphics. You can programmatically add legends, curves, maps, or other custom features to plots. You can write interactive analyses that respond to your input to analyze only selected subsets of the data. You can move seamlessly between programming and interactive analysis.

A previous paper (Wicklin and Rowe, 2007) introduced Stat Studio and presented examples of the point-and-click interface. This paper focuses on programming aspects of Stat Studio; the goal is to demonstrate techniques that are straightforward in Stat Studio but might be difficult to implement in other software. Not all programming statements are described in detail in this paper; for more information see the Stat Studio documentation. The main ideas in this paper are illustrated by using meteorological data.


SAS05. Tips and Tricks for Creating Multi-Sheet Microsoft Excel Workbooks the Easy Way with SAS®
Vincent DelGobbo, SAS Institute

Transferring SAS® data and analytical results between SAS and Microsoft Excel can be difficult, especially when SAS is not installed on a Windows platform. This paper discusses using the new XML support in Base SAS® 9 software to create multi-sheet Microsoft Excel workbooks (versions 2002 and later). You will learn step-by-step techniques for quickly and easily creating attractive multi-sheet Excel workbooks that contain your SAS output, and also tips and tricks for working with the ExcelXP ODS tagset. Most importantly, the techniques that are presented in this paper can be used regardless of the platform on which SAS software is installed. You can even use them on a mainframe! The use of SAS server technology is also discussed. Although the title is similar to previous papers by this author, this paper contains new and revised material not previously presented.
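
The basic ExcelXP pattern looks like the sketch below (a minimal example; the paper itself covers many more tagset options). Each procedure step becomes a separate worksheet, and the resulting XML workbook opens directly in Excel 2002 or later.

   ods listing close;
   ods tagsets.excelxp file='class_report.xml' style=minimal
       options(sheet_name='Listing' embedded_titles='yes');

   title 'Student Listing';
   proc print data=sashelp.class noobs;
   run;

   ods tagsets.excelxp options(sheet_name='Summary');
   title 'Summary Statistics';
   proc means data=sashelp.class n mean std;
      var height weight;
   run;

   ods tagsets.excelxp close;
   ods listing;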


SAS06. New SAS® Performance Optimizations to Enhance Your SAS® Client and Solution - Access to the Database
Mike Whitcher, SAS Institute

The SQL procedure has been used for years as the way many SAS clients and solutions query for their data. Examine the new SQL performance optimizations that have been added to this bellwether procedure, optimizations designed to greatly expand query pass-through capability to databases and shorten your SAS client and solution query response times. Also, see the new SQL enhancements for use with SAS data sets. Whether you use SAS® Web Report Studio, SAS® Marketing Automation or other SAS clients or solutions, or still submit your SQL queries in batch, you owe it to yourself to see how you can make them run faster.


SAS07. You Want ME to use SAS® Enterprise Guide® ??
Vincent DelGobbo, SAS Institute

Starting with SAS® 9, one copy of SAS Enterprise Guide is included with each PC SAS license. At some sites, desktop PC SAS licenses are being replaced with a single server-based SAS license and desktop versions of Enterprise Guide. This presentation will introduce you to the Enterprise Guide product, and provide you with some good reasons why you should consider using it.


S01. Long-Term Value Modeling in the Automobile Industry
Jeff Ames, Ford Motor Company

Businesses often classify their customer base in terms of the customers' predicted long-term value (LTV). LTV may influence marketing strategies, particularly CRM and concern resolution. This paper describes an approach to LTV calculations in the automobile industry. The emphasis of this presentation is on one aspect of LTV: the choice of "next new-vehicle segment". SAS code related to "next-segment" predictive modeling is outlined. The implementation of this predictive model within a SAS scoring platform is presented.


S02. Survival Methods for Correlated Time-to-Event Data
James Bena, Cleveland Clinic

The use of product-limit (Kaplan-Meier) estimation and Cox proportional hazards modeling is common when measuring time-to-event data, especially in the presence of censoring. Presenting results from both methods provides the magnitude of loss within the levels of a given variable and a relative measure of failure risk between the levels. However, since both of the above methods assume independence of the observations, correlated measurements require adjustment to avoid underestimating the variance and overestimating the statistical significance.

In this paper, we focus on the case of clustered results, where a single observed unit may have several unique observations. The variances of Kaplan-Meier estimates from PROC LIFETEST are adjusted for the clustering using a Taylor-series approximation. The standard errors of estimated hazard ratios from Cox proportional hazards models fit using PROC TPHREG are altered using the sandwich estimator, effectively fitting a marginal model. Application of these methods is described using a medical example, assessing the quality of stents implanted in patients with vascular disease.
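
For the Cox portion, the marginal-model fit with a cluster-robust sandwich variance can be sketched with production PROC PHREG as below (the paper uses the experimental PROC TPHREG); the data set and variable names are hypothetical, with PATIENT as the clustering unit and one record per stent.

   proc phreg data=stents covsandwich(aggregate);
      model days_to_event*event(0) = stent_type;
      id patient;   /* observations sharing an ID are treated as one cluster */
   run;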

A SAS macro is described that performs both of these adjusted analyses, and then creates a table using the Kaplan-Meier survival estimates at specified time points and hazard ratios from the marginal Cox proportional hazards model.


S03. The Difference Between Predictive Modeling and Regression
Patricia Cerrito, University of Louisville

Predictive modeling includes regression, both logistic and linear, depending upon the type of outcome variable. However, as the datasets are generally too large for a p-value to have meaning, predictive modeling uses other measures of model fit. Generally, too, there are enough observations so that the data can be partitioned into two or more datasets. The first subset is used to define (or train) the model. The second subset can be used in an iterative process to improve the model. The third subset is used to test the model for accuracy.

The definition of “best” model needs to be considered as well. In a regression model, the “best” model is one that satisfies the criteria of uniform minimum variance unbiased estimator. In other words, it is only “best” in the class of unbiased estimators. As soon as the class of estimators is expanded, “best” no longer exists, and we must define the criteria that we will use to determine a “best” fit. There are several criteria to consider. For a binary outcome variable, we can use the misclassification rate. However, especially in medicine, misclassification can have different costs. A false positive error is not as costly as a false negative error if the outcome involves the diagnosis of a terminal disease. We will discuss the similarities and differences between the types of modeling.


S04. The Over-Reliance on the Central Limit Theorem
Patricia Cerrito, University of Louisville

The objective is to demonstrate the theoretical and practical implications of the central limit theorem. The theorem states that as n approaches infinity, the distribution of the sample mean approaches normality with mean equal to the population mean and variance equal to the population variance divided by n. However, as n approaches infinity, the variance of the mean approaches zero. In practice, the population variance is unknown, and so the sample variance is used to estimate it. In that case, we assume a t-distribution, which requires the assumption that the population is itself normally distributed. In this presentation, we use data visualization to show some problems that can occur when assuming that n is sufficiently large for the sample mean to be normally distributed. In particular, we use PROC SURVEYSELECT to sample data from non-normal distributions and compare the distribution of the sample mean to that of the population.
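
A small sketch of that kind of simulation: draw many simple random samples from a strongly skewed population with PROC SURVEYSELECT, compute the mean of each sample, and inspect how (non-)normal the means look for a small n. The population, sample size, and seed below are arbitrary choices.

   data population;
      call streaminit(2008);
      do id = 1 to 100000;
         x = rand('exponential');      /* heavily right-skewed population */
         output;
      end;
   run;

   proc surveyselect data=population out=samples
                     method=srs n=10 reps=1000 seed=2008;
   run;

   proc means data=samples noprint nway;
      class replicate;
      var x;
      output out=sample_means mean=xbar;
   run;

   proc univariate data=sample_means;   /* inspect the distribution of the means */
      var xbar;
      histogram xbar / normal;
   run;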


S05. Imputation of Categorical Missing Data: A Comparison of Multivariate Normal and Multinomial Methods
Holmes Finch, Ball State University

Missing data are a common problem for data analysts in a variety of fields. Researchers have demonstrated that ignoring data points with missing data (listwise or pairwise deletion) can result in biased parameter estimates as well as a reduction in power for hypothesis testing. A number of methods have been developed for imputing values for missing data, some of which have been subsequently shown to be less than optimal (e.g., mean substitution, hot deck imputation). On the other hand, more sophisticated methods for imputing missing values have demonstrated their utility with continuous data. Perhaps foremost among these imputation methods is Multiple Imputation based on data augmentation, which can be carried out using PROC MI in SAS. While this methodology, which is based on the normal distribution, has proven to be very effective for dealing with missing data in the case of continuous variables, there remain questions about how useful it is when the variables in question are categorical in nature. Prior research has found that when this multiple imputation method is used with dichotomous or polytomous data and the results are rounded to fit within the confines of the existing data structure, resulting estimates of proportions are biased. This bias is largely not present when these values are not rounded. A method for imputing missing categorical data responses has been developed and is based on the multinomial distribution. However, the computational burden for this approach is such that it can be difficult to use for a large number of variables. Nonetheless, because it is based upon a true categorical data distribution, it may be superior to the normal based approach when dealing with missing data for dichotomous or polytomous variables. To this point, very little research in the way of direct comparison of the effectiveness of the normal and multinomial approaches to imputing missing categorical data has been published. The current simulation study made such a direct comparison of the estimates of proportions for dichotomous data with missing responses when the normal (with and without rounding) and multinomial based imputation methods were used. The goal of the research was to compare the quality of these proportion estimates for the normal based approach (as carried out by SAS PROC MI), which theoretically is not appropriate, to those based on data imputed using the theoretically more appropriate multinomial distribution (as carried out using functions in the R software program). A variety of missing data structures, sample sizes and population proportion values were studied.
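
For reference, the normal-model-plus-rounding approach discussed above can be sketched as below (the multinomial comparison was carried out in R); the data set and item names are hypothetical, with Y1-Y5 as 0/1 items containing missing values.

   proc mi data=survey out=imputed nimpute=5 seed=20081011;
      mcmc;                     /* data augmentation under a multivariate normal model */
      var y1-y5;
   run;

   data imputed_rounded;         /* force imputed values back onto the 0/1 scale */
      set imputed;
      array y{*} y1-y5;
      do i = 1 to dim(y);
         if not missing(y{i}) then y{i} = (y{i} >= 0.5);
      end;
      drop i;
   run;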


S06. A Comparison between correlated ROC Curves Using SAS and DBM MRMC - A Case Study in Image Comparison
Dachao Liu, Northwestern University

Receiver Operating Characteristic (ROC) analysis is often used in the evaluation of a diagnostic test, related to decision making in cost/benefit analysis or medical practice. ROC curves in a multi-reader, multi-case diagnostic setting are correlated, and there are ways to compare these correlated ROC curves. In this case study, we have 3 LCD imaging machines or display configurations and 12 readers. Each reader read 180 cases, 60 being cancer cases and 120 being normal or benign cases. As a case study, this paper will discuss how the data were processed in SAS and how the comparison of ROC curves was made in both SAS and DBM MRMC.


S07. Use of System Function in Creating Macro for Survival Analysis
Jagannath Ghosh, MedFocus LLC

In this paper, we will show the use of some system functions and will write a macro that automatically finds the variables and creates new variables according to study-specific requirements. Basically, we will create two variables (for simplicity) showing overall survival time and time to disease progression for one particular study, and then perform survival analysis based on these two variables and the censoring information. Please note that the study data are fake data; however, they convey a real-world example of how survival analysis is performed for various studies. The purpose of writing this paper is to show the power and usefulness of SAS in clinical research (particularly studies that require death and survival information, such as cancer and HIV studies).


S08. SIPOC and Recursive Partitioning
Amurthur Ramamurthy, Covance Central Laboratories

SIPOC, a frequently used tool, stands for Suppliers, Inputs, Process, Outputs, and Customer. This tool is commonly used in the Define phase of Six Sigma to map the flow (at a high level) of the process relating Suppliers/Inputs to Outputs/Customer. Listed uses of this tool include identification of process boundaries (scoping) and gaps. In this work we describe a novel roadmap that exploits the gap-identification capability of SIPOC and uses it as a problem-solving precursor to the statistical tools that follow in the Analyze phase of a Six Sigma project.

The Analyze phase of a Six Sigma project typically involves the use of brainstorming tools to list potential Key Process Input Variables (X’s) and to prioritize these inputs using an FMEA-type framework. This is followed by statistical validation using one of many hypothesis tests, linking potential X’s to the response variable, also referred to as the big Y. In this work we have used Recursive Partitioning (RP), a powerful data-mining tool, to validate SIPOC outputs (gaps).

Transactional projects present an abundance of discrete independent variables (X’s) and frequently discrete responses (Y’s). A combination of SIPOC gap analysis coupled with the use of a general-purpose tool such as Recursive partitioning has brought about rapid closure of transactional Six Sigma projects.


T01. Advanced PROC REPORT: Doing More in the Compute Block
Art Carpenter, CA Occidental Consultants

One of the unique features of the REPORT procedure is the Compute Block. This PROC step tool allows the use of most DATA step statements, logic, and functions, and through the use of the compute block you can modify existing columns, create new columns, write text, and more! This provides the SAS programmer a level of control and flexibility that is unavailable in virtually all other procedures. Along with this flexibility comes complexity and this complexity often thwarts us as we try to write increasingly interesting compute blocks. The complexity of the compute block includes a number of column identification and timing issues that can confound the PROC REPORT user. Of course to make matters even more interesting, there can be multiple compute blocks that can interact with each other and these can execute for different portions of the report table. This tutorial will discuss the essential elements of the compute block, its relationship to the processing phases, and how it interacts with temporary variables.


T02. Exploring Efficient Ways to Collapse Variables
Yubo Gao, University of Iowa Hospitals and Clinics

There are many situations where we want to find the frequency distributions of discrete results, and usually more than one method is available in SAS. With today’s computer technology, data files under a million records are not a problem. But if the data file is much larger than that, as is common at major credit card companies, retailers, universities, and hospitals, efficient methods should be sought in order to reduce resource utilization, such as the time expended. Based on a frequency distribution question posted to the SAS-L community, this paper first reviews the methods suggested by SAS-L subscribers and then proposes two other methods. Next, a performance comparison among these methods in terms of CPU time usage was made by solving an expanded example. The comparison showed that the ratio of CPU time usage between the slowest method and the fastest is about 12, and the result may serve as a benchmark when solving similar problems.
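
Two of the usual approaches are sketched below for a single classification variable (this is not the paper's benchmark code, and BIG and CODE are hypothetical names; CODE is assumed to be character):

   /* 1. PROC FREQ: the simplest to write */
   proc freq data=big noprint;
      tables code / out=counts_freq (drop=percent);
   run;

   /* 2. DATA step hash object: counts accumulated in a single sequential pass */
   data _null_;
      if 0 then set big;                   /* pick up CODE's attributes */
      declare hash h(ordered:'a');
      h.defineKey('code');
      h.defineData('code', 'count');
      h.defineDone();

      do until (eof);
         set big end=eof;
         if h.find() ne 0 then count = 1;  /* first occurrence of this value */
         else count + 1;
         h.replace();
      end;
      h.output(dataset:'counts_hash');
      stop;
   run;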


T03. A Multivariate Ranking Procedure to Assess Treatment Effects
Alan Chiang, Eli Lilly and Company

In early phase clinical studies, it is often difficult to assess the effects of a set of biomarker variables when the individual variables do not appear to have statistically significant effects. To address this situation, we propose a method of U-scores applied to subsets of multivariate data. We illustrate the usefulness of this approach through simulations, considering various combinations of correlations and underlying distributions, and compare the statistical power to the existing tests: Hotelling's T2 and nonparametric rank sum tests. Finally we apply this approach in a Phase I clinical study to help assess the treatment effects of an investigative drug on rheumatoid arthritis.


T04. SAS® Tips, Tricks and Techniques
Kirk Lafler, Software Intelligence Corporation

The Base SAS® System offers users the power of a comprehensive DATA step programming language, an assortment of powerful PROCs, a macro language that extends the capabilities of the SAS System, and user-friendly interfaces including the SAS Display Manager. This presentation highlights numerous SAS tips, tricks, and techniques using a collection of proven code examples related to effectively using the SAS Display Manager and its many features; processing DATA step statements to handle subroutines and code libraries; delivering output in a variety of formats; constructing reusable code; troubleshooting and debugging code; and an assortment of other topics. This paper illustrates several tips, tricks, and techniques related to the usage of Base SAS software. We will examine a variety of topics including SAS System options, DATA step programming techniques, logic conditions, output delivery and ODS, macro programming, and an assortment of other techniques.


T06. Reduced Error Logistic Regression: Completely Automated Reduced Error Data Mining in SAS
Daniel Rice, Rice Analytics

Reduced Error Logistic Regression (RELR) is a new, 100% automated machine learning method that is fully implemented in SAS software and was featured at the SAS M2007 Conference. RELR “is not your grandfather’s logistic regression”, as it can reduce error significantly compared to other predictive modeling methods. RELR’s automation arises because it has no arbitrary or validation-sample parameters and it reduces error automatically. RELR's error reduction results from symmetrical constraints consistent with Extreme Value properties of the Logit error. These constraints also lead to a prior ordering of the importance of variables, so the vast majority of variables are excluded to avoid the curse of dimensionality. Hence, RELR can solve very high dimensional problems quite efficiently. RELR allows higher order polynomial terms and interactions to whatever order is specified, but can give very parsimonious solutions with reasonable stability. This paper introduces RELR by comparing it to another machine learning method based upon logistic regression: Penalized Logistic Regression (PLR). Results from difficult predictive modeling problems will then be presented to show that RELR can yield significantly better fit accuracy and less overfitting compared to Penalized Logistic Regression, Support Vector Machines, Decision Trees, Partial Least Squares, Neural Networks, and Forward-Select Logistic Regression. It will be seen that RELR’s advantage is most apparent in abundant-error problems such as those with smaller sample segments and/or a large number of correlated variables.


T07. Using Direct Standardization SAS® Macro for a Valid Comparison in Observational Studies
Daojun Mo, Eli Lilly and Company

Observational studies are usually imbalanced in the factors associated with the outcome measures. Simply presenting the descriptive results or the P values from an unadjusted between-group comparison could lead to a biased conclusion. Direct standardization is one of the methods for binary data that reveals a valid association between comparison groups. Direct standardization is often implemented in a spreadsheet by copying and pasting the data, which becomes tedious in a study that explores multiple outcome measures. We thus developed a SAS® macro that is adaptable to many types of observational studies that consider binary outcome measures. Examples are given to demonstrate the concept of direct standardization and how to use the macro.
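
The calculation at the core of direct standardization is just a weighted average of stratum-specific rates, with the weights taken from a common standard population. A sketch (not the authors' macro), with all data set and variable names assumed: RATES holds one row per comparison group (COHORT) and stratum with an event RATE, and STDPOP holds the standard population count STD_N per stratum.

   proc sql;
      create table adjusted as
      select r.cohort,
             sum(r.rate * s.std_n) / sum(s.std_n) as adj_rate
                label = 'Directly standardized rate'
      from rates as r
           inner join stdpop as s
           on r.stratum = s.stratum
      group by r.cohort;
   quit;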




Content is © MWSUG 2008