
Archive of Seminar Abstracts

2009 Seminars

July 15, 2009

Kevin Coombes, Ph.D.
Associate Professor, Department of Bioinformatics and Computational Biology
M. D. Anderson Cancer Center

Is it possible to find believable clusters in array data?

As everyone knows, clustering algorithms always find clusters. And different clustering algorithms (even on the same data set) always find different clusters. How do you decide which algorithm to use? How can you tell which clusters are "better"---that is, more robust, more reproducible, or more believable?

June 24, 2009

Krishna Sankhavaram
Director, Research Information Systems
M. D. Anderson Cancer Center

Research Station, TREX, and Future Plans

I will briefly describe the architecture of Research Station and give a short demonstration of the current TREX application. I will then discuss our future plans, how bioinformatics researchers and analysts can work with Research Station, and how advanced bioinformatics analysis methods might be incorporated into it.

June 22, 2009

Susan Tucker, Ph.D., Professor
Yuan Ji, Ph.D., Assistant Professor
Department of Bioinformatics and Computational Biology
M. D. Anderson Cancer Center

BCB External Review – Reprise 3

This is the last of three Mondays on which the BCB faculty who met with the external review board will reprise their presentations for the benefit of the department as a whole.

The speakers this week are:
Dr. Susan Tucker: Projects in Computational Biology
Dr. Yuan Ji: Bayesian Predictive Approaches in Bioinformatics and Translational Medicine
Dr. John Weinstein: Integrative Bioinformatics for Biomarker and Biosignature Discovery

June 15, 2009

Kevin Coombes, Ph.D., Associate Professor
Keith Baggerly, Ph.D., Associate Professor
Jonas Almeida, Ph.D., Professor
Department of Bioinformatics and Computational Biology
M. D. Anderson Cancer Center

BCB External Review – Reprise 2

This is the second of three Mondays on which the BCB faculty who met with the external review board will reprise their presentations for the benefit of the department as a whole.

The speakers this week are:
Dr. Kevin Coombes: Translational Bioinformatics: Bench to Bedside via Computer
Dr. Keith Baggerly: Forensic Bioinformatics and Reproducible Research
Dr. Jonas Almeida: Semantic Web and Cloud Computing

June 8, 2009

John Weinstein, M.D., Ph.D., Professor and Chair
Shoudan Liang, Ph.D., Professor
Department of Bioinformatics and Computational Biology
M. D. Anderson Cancer Center

BCB External Review – Reprise 1

The BCB faculty who met with the external review board have agreed to reprise their presentations for the benefit of all those who could not attend. This week, Dr. John Weinstein will present his overview of BCB, the challenges we face, and our future plans. Dr. Shoudan Liang will follow with a short presentation highlighting some of his recent research on next generation sequencing technology.

May 18, 2009

John Weinstein, M.D., Ph.D.
Professor and Chair, Department of Bioinformatics and Computational Biology
M. D. Anderson Cancer Center

Into the integromic maelstrom: Merging different molecular data types for biomarker discovery

May 13, 2009

John Patrick Ferguson, Ph.D.
Postdoctoral Fellow, Department of Statistics
Yale University, New Haven, Connecticut

Bayesian alternatives to P-values and their use in Selecting and Ranking Populations (pdf)

April 23, 2009

Jordan Hiller
JMP Genomics Application Scientist
SAS Institute Inc., Cary, North Carolina

Combined Analysis of Copy Number and Expression Data in JMP Genomics 4

As genomic technologies mature and their costs decrease, it is now common to have multiple types of measurements conducted on the same biological samples. This wealth of data presents new challenges for analysis, but it also presents opportunities for deeper understanding of normal biological and pathological processes. This seminar will present an integrative analysis in JMP Genomics of copy number variation (CNV) and gene expression data on a publicly available dataset of hepatocellular carcinoma and non-tumoral cirrhotic liver samples.

JMP Genomics software from SAS brings statistical analysis and visual exploration of genomics data to the desktop. Designed to help research scientists and biostatisticians understand data generated from large genetics, expression or exon microarray and proteomics studies, JMP Genomics links interactive graphics to advanced statistics. By tying together JMP and SAS with over 50,000 lines of custom code, JMP Genomics marries trusted analytics with the point-and-click interface and interactive visualization capabilities of JMP.

April 15, 2009

Joseph C. Gardiner, Ph.D.
Professor of Epidemiology and Chairperson, Department of Epidemiology
Michigan State University, East Lansing, Michigan

Stochastic Models in Cost-Effectiveness Analysis

Cost-effectiveness analysis (CEA) is a collection of techniques for structuring comparisons between competing interventions. It can inform decision-making by providing means for optimizing health benefits from a specified budget, or finding the lowest cost strategy for a specified health benefit. Markov processes are useful in modeling the dynamics of patient health outcomes as they unfold over time. States of the process represent health conditions or health states. We use a continuous-time finite-state Markov process to incorporate patient costs as they are incurred during sojourn in health states and in transition from one health state to another. By combining these expenditure streams, the net present value is the discounted expected total cost over a specified time period. Other metrics widely used in CEA such as net health benefit, net health cost and the cost-effectiveness ratio and measures of health benefit such as life expectancy and quality-adjusted life years are defined as functions of expected values. We outline approaches to estimation of these summary statistics from health outcome and cost data that might be incompletely ascertained in some patients. Regression models are used to incorporate patient-specific demographic and clinical characteristics and their impact on the metrics used in CEA can be assessed.

April 6, 2009

Parsa Mirhaji, M.D.
Assistant Professor, School of Health Information Sciences
The University of Texas Health Science Center at Houston

BLUE: Clinical Text Understanding Meets Translational Clinical Research, A Minimal Syntactic Semantic System for Biomedical Language Understanding and Extraction

Although many techniques have been introduced for processing of unconstrained text in clinical settings, natural language processing (NLP) is not yet conceived as an integral component of electronic health record (EHR) systems, and its utilization and adoption do not match its potential. Current methods of NLP are generally specialized for limited use in certain domains (e.g., tumor detection in chest radiography reports) and are not easily and efficiently extensible to new domains. However, recent initiatives advocating for translational research call for generation of technologies that can integrate unstructured clinical data with structured data and provide a unified interface for queries, search and information retrieval. It is also critical to be able to contextualize and repurpose clinical information from electronic health record systems and research databases for multidisciplinary research in a collaborative and distributed environment envisioned by the CTSA program. That is, technologies for the NLP of clinical texts should be evaluated not only in terms of their validity and reliability in their intended environment, but also in light of their interoperability and their ability to support information sharing, integration and contextualization in a network of loosely coupled, disparate information systems.

The Biomedical Language Understanding and Extraction system (code named BLUE-Text) aims at construction of a dynamically flexible, customizable, consistent, formal, and explicit representation of unconstrained clinical text. BLUE-Text uses a formal information and knowledge representation framework to represent a self-descriptive output that is immediately ‘understandable’ for computer programs, for automated processing, and computations.

April 1, 2009

Elena B. Elkin, Ph.D.
Assistant Attending Outcomes Research Scientist
Memorial Sloan-Kettering Cancer Center, New York, New York

Geographic Access and the Use of Screening Mammography

Screening mammography rates vary geographically and have recently declined. Inadequate mammography resources in some areas may impair access to this technology. We assessed the relationship between availability of mammography machines and the use of screening in two cohorts: female respondents age 40 or older to the Behavioral Risk Factor Surveillance System survey and a 5% nationwide sample of female Medicare beneficiaries. Our findings suggest that in counties with few or no mammography machines, limited availability of imaging resources is a barrier to screening.

March 19, 2009

Geert Molenberghs, Ph.D., and Geert Verbeke, Ph.D.
Professors, Interuniversity Institute for Biostatistics and Statistical Bioinformatics
Hasselt University, Diepenbeek, Belgium
Katholieke Universiteit Leuven, Belgium

On the Identifiability of the Random-Effects Distribution in Mixed Models

Mixed models can be viewed as models in which the observed data are augmented with unobservable constructs, i.e., random effects, and where inference is based on the marginal distribution of the observations, obtained from integrating over a pre-specified, often parametric, distribution for the random effects, called the mixing distribution. In some contexts, sensitivity of the inferences with respect to the assumed mixing distribution has been reported. On the other hand, it is not clear to what extent the unobservable random effects are truly identified from the observed data. In this presentation, focus will be on two aspects of the identifiability of random effects in mixed models.

First, it will be shown that, for any given mixed model, a new mixed model can be constructed with the same marginal fit to the observed data, but with a different random-effects distribution. This implies that the mixing distribution cannot be identified from the observations, unless additional restrictions are imposed. One possible restriction is the assumption that the conditional model for the data, given the random effects, is correct. It will be indicated that this issue is not specific to random effects, but applies to all latent structures (latent classes, latent variables, frailties, incomplete data, censoring, etc.).
In the second part of the presentation, it will be shown how one then can use the so-called gradient function as a simple graphical diagnostic tool to assess whether the assumed mixing distribution produces an adequate fit to the data, in terms of marginal likelihood. The method does not require any calculations beyond those needed to fit the model, and can be applied to every type of mixed model (linear, generalized linear, non-linear), with univariate as well as multivariate random effects.

All results will be illustrated using real data.

March 16, 2009

Edward Dougherty, Ph.D.
Robert M. Kennedy ’26 Chair, Professor, Department of Electrical and Computer Engineering
Texas A&M University, College Station, Texas

High-Throughput Genomics: Epistemological Impediments and the Promise of Systems-Based Medicine

The accumulation of high-throughput genomic data has fueled a desire for the development of systems biology. As recognized by Norbert Wiener more than sixty years ago, the theoretical backbone of systems biology will lie in systems theory. To the extent that genomics plays a role in systems biology, genomic signal processing (GSP), including its diversity of engineering aspects, such as stochastic processes, control theory, information theory, and pattern recognition, will be critical. Having experienced more than a decade of activity, the great promise of GSP has clearly been revealed in the potential for molecular-based diagnostics and optimal therapeutic intervention strategies in the context of gene regulatory networks, but obstacles to successful development have also been exposed. This talk will discuss epistemological impediments to progress in translational genomics and the revolutionary change in treatment that will ultimately be brought about by a medicine grounded in the mathematical theory of dynamical systems.

March 11, 2009

Satoshi Morita
Professor, Department of Biostatistics and Epidemiology
Yokohama City University Medical Center, Yokohama, Japan

A Bayesian hierarchical mixture model for platelet derived growth factor receptor phosphorylation to improve estimation of progression-free survival in prostate cancer

Advances in understanding the biological underpinnings of many cancers have led increasingly to the use of molecularly targeted anti-cancer therapies. Because the platelet-derived growth factor receptor (PDGFR) has been implicated in the progression of prostate cancer bone metastases, it is of great interest to examine possible relationships between PDGFR inhibition and therapeutic outcomes. Here, we analyze the association between change in activated PDGFR (p-PDGFR) and progression free survival (PFS) time based on large within-patient samples of cell-specific p-PDGFR values taken before and after treatment from each of 88 prostate cancer patients. To utilize these paired samples as covariate data in a regression model for PFS time, and because the p-PDGFR distributions are bimodal, we first employ a Bayesian hierarchical mixture model to obtain a deconvolution of the pre-treatment and post-treatment within-patient p-PDGFR distributions. We evaluate fits of the mixture model and a non-mixture model that ignores the bimodality by using a supnorm metric to compare the empirical distribution of each p-PDGFR data set with the corresponding fitted distribution under each model. Our results show that first using the mixture model to account for the bimodality of the within-patient p-PDGFR distributions, and then using the posterior within-patient component mean changes in p-PDGFR so obtained as covariates in the regression model for PFS time provides an improved estimation.

March 4, 2009

Murali Haran, Ph.D.
Assistant Professor, Department of Statistics
The Pennsylvania State University, State College, Pennsylvania

Towards Automating MCMC Algorithms for Spatial Generalized Linear Models

Markov chain Monte Carlo (MCMC) algorithms provide a very general recipe for estimating properties of complicated distributions. While their use has become commonplace and there is a large literature on MCMC theory and practice, MCMC users still have to contend with several challenges --- determining how to construct a good algorithm, deciding whether an MCMC algorithm is producing accurate estimates and determining an appropriate length (stopping rule) for the Markov chain. I will describe some approaches for automating these decisions in the context of spatial generalized linear models, an important class of models that result in challenging posterior distributions. These approaches combine analytical approximations for constructing provably fast mixing MCMC algorithms and take advantage of recent developments in MCMC theory. I will conclude with a description of the application of these algorithms to data.

February 23, 2009

Robin D. Dowell-Deen, D.Sc.
Postdoctoral Fellow, Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology, Cambridge, Massachusetts

From Genotype to Phenotype with Functional Genomics Comparisons

We have conducted a functional comparison of two Saccharomyces cerevisiae yeast strains that exhibit dramatically different phenotypic behaviors despite being approximately as close in evolutionary distance as two individual humans. We sequenced, annotated, transcription profiled, and constructed a complete deletion library (functional analysis) for the Σ1278b strain. By comparison to the yeast reference strain (S288C) we both obtain mechanistic explanations for known differences between the strains and accurately predict additional phenotypes. A number of the observed genomic differences between Σ1278b and S288C have implications for cell surface phenotypes. The (mostly) subtelomeric FLO genes confer a number of cell-surface phenotypes that we have previously shown to be under both genetic and epigenetic control. We identify two new FLO genes within the Σ1278b genome. Furthermore, our whole genome tiling expression data identify competitive non-coding intergenic transcripts at a number of the FLO genes in Σ1278b. A detailed study of these transcripts at the FLO11 (MUC1) locus sheds light on the complex regulatory relationship between transcription factors, chromatin remodelers, and transcription. The net effect of all regulatory stimuli on the FLO genes is variegated transcription, which results in phenotypic variation in cell adhesion and filamentation phenotypes within clonal populations of yeast cells. Our findings have broader implications for how pathogenic fungi may evolve their genomes to switch their cell-surface molecules and evade the immune system.

February 16, 2009

Ming Yuan, Ph.D.
Assistant Professor, School of Industrial and Systems Engineering
Georgia Institute of Technology, Atlanta, Georgia

Sparse Gaussian Graphical Model Estimation using l1 Regularization

We propose penalized likelihood methods for estimating the concentration matrix in the Gaussian graphical model. The methods lead to a sparse and shrinkage estimator of the concentration matrix that is positive definite, and thus conduct model selection and estimation simultaneously. The implementation of the methods is nontrivial because of the positive definite constraint on the concentration matrix, but we show that the computation can be done effectively by taking advantage of recent advances in convex optimization.
Simulations and real examples demonstrate the competitive performance of the new methods.

February 9, 2009

Xiaoning Qian, Ph.D.
Research Assistant Professor (Statistics), Associate Research Scientist (Electrical Engineering), Department of Electrical and Computer Engineering
Texas A&M University, College Station, Texas

Analysis and Control for Gene Regulatory Networks

Probabilistic Boolean networks constitute a class of gene regulatory networks to model biological processes with the network dynamics determined by logic-rule regulatory functions in conjunction with probabilistic parameters involved in network transitions. Since our ultimate purpose for studying networks is to apply intervention to living organisms, it is incumbent on us to analyze long-run network behavior based on the underlying Markov chain. Its steady-state distribution reflects the long-run behavior of the network and it can give insight into the dynamics or momentum existing in a system. The change of steady-state distribution caused by possible perturbations to a network is the key measure for intervention. We derive analytic results for changes in the steady-state distributions of probabilistic Boolean networks resulting from modifications to the underlying regulatory rules and probabilistic parameters. From these analytic results, we derive both optimal and greedy intervention strategies to obtain therapeutic benefits for future drug design or gene therapy design, and analyze the sensitivity of gene regulatory networks. The preliminary results in two real biological networks have shown that our methods can potentially serve as future intervention strategies to identify potential drug targets and design gene-based therapeutic strategies.
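
As a rough illustration of the steady-state idea in this abstract, the sketch below computes the stationary distribution of a small Markov chain by power iteration and shows how it shifts when one row of the transition matrix is perturbed. The three-state chain, its probabilities, and the function name steady_state are hypothetical choices made for this example; the sketch uses numerical iteration and does not reproduce the analytic results described in the talk.

```python
def steady_state(P, tol=1e-12, max_iter=100_000):
    """Stationary distribution of a row-stochastic matrix P via power iteration."""
    n = len(P)
    pi = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(max_iter):
        # One step of the chain: new_pi[j] = sum_i pi[i] * P[i][j]
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi

# Hypothetical 3-state chain (each row sums to 1).
P = [[0.90, 0.10, 0.00],
     [0.05, 0.90, 0.05],
     [0.00, 0.10, 0.90]]

# Hypothetical perturbation: state 1 now moves to state 2 more often.
P_perturbed = [[0.90, 0.10, 0.00],
               [0.05, 0.80, 0.15],
               [0.00, 0.10, 0.90]]

pi_before = steady_state(P)
pi_after = steady_state(P_perturbed)

# Change in long-run probability mass for each state.
shift = [after - before for before, after in zip(pi_before, pi_after)]
print(pi_before, pi_after, shift)
```

In this toy chain the perturbation moves long-run mass toward state 2, which is the kind of steady-state change an intervention strategy would try to quantify.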

February 2, 2009

Elizabeth Purdom, Ph.D.
NSF Bioinformatics Post-Doctoral Fellow, Department of Statistics
University of California at Berkeley, Berkeley, California

Statistical Problems in Estimating Alternative Splicing

The biological process of alternative splicing allows cells to encode different protein products in the same stretch of DNA by including or excluding segments of DNA when translating the DNA into proteins. It is now understood that alternative splicing and transcript events are an important method of regulation of gene expression and can also be important in the study of disease, particularly cancer. Many high-throughput technologies, generally used for measuring overall levels of gene expression, have been adapted to measure alternative splicing. We will discuss the statistical and computational challenges in detecting and quantifying alternative splicing. In particular we will focus on analyzing data from two different types of technologies: Affymetrix's Exon microarray and Next-generation Sequencing. We will focus on two questions: 1) detecting the regions of the gene (exons) that undergo alternative splicing and 2) quantifying expression levels for the resulting products (isoforms) that are created by piecing these exons together. We will discuss our methodologies for addressing these problems as well as the questions of quality control and robustness necessary in the analysis of these platforms.

January 26, 2009

Romesh C. Stanislaus, Ph.D.
Instructor, Department of Bioinformatics and Computational Biology
M. D. Anderson Cancer Center

RPPAML/RIMS: A meta data format and an information management system for Reverse Phase Protein Arrays

In this report an RPPA Information Management System (RIMS) is described and made available with open source software. In order to implement the proposed system, we propose a metadata format known as reverse phase protein array markup language (RPPAML). RPPAML would enable researchers to describe, document and disseminate RPPA data. The complexity of the data structure needed to describe the results and the graphic tools necessary to visualize them require a software deployment distributed between a client and a server application. This was achieved without sacrificing interoperability between individual deployments through the use of an open source semantic database, S3DB. This data service backbone is available to multiple client side applications that can also access other server side deployments. The RIMS platform was designed to interoperate with other data analysis and data visualization tools such as Cytoscape. The proposed RPPAML data format hopes to standardize RPPA data. Standardization of data would result in diverse client applications being able to operate on the same set of data. Additionally, having data in a standard format would enable data dissemination and data analysis. (BMC Bioinformatics 2008, 9:555)

January 7, 2009

Soma S. Dhavala
Graduate Student, Department of Statistics
Texas A&M University, College Station, Texas

A Bayesian Semiparametric Model to Analyze Discrete Gene-expression Data

In this talk, we focus on developing new statistical methods to understand Salmonella infection in bovine ligated loops. We propose a Bayesian semi-parametric hierarchical model to analyze the differential gene expression data produced using the massively parallel signature sequencing (MPSS) technology. A zero-inflated Poisson likelihood is assumed and a Dirichlet process prior is elicited for the treatment effects. Pairwise tests for differential expression are carried out using the Kullback-Leibler distances. Individual tests are adjusted for multiple hypothesis testing by controlling the pFDR. A pairwise association matrix estimated from Markov chain Monte Carlo samples is used to form clusters. We present the results of our analysis, which have been validated by biologists. We conclude the talk by briefly discussing our ongoing work on generalizing the model to accommodate cross-platform data.
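
For readers unfamiliar with the zero-inflated Poisson likelihood mentioned above, the following minimal sketch evaluates it for a handful of counts. The parameter names lam (Poisson mean) and pi0 (probability of a structural zero) and the example counts are assumptions made for illustration; the hierarchical model and Dirichlet process prior from the talk are not represented here.

```python
import math

def zip_pmf(k, lam, pi0):
    """P(Y = k) for a zero-inflated Poisson.

    With probability pi0 the count is a structural zero; otherwise it is
    drawn from a Poisson distribution with mean lam.
    """
    if k < 0:
        return 0.0
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    return (pi0 if k == 0 else 0.0) + (1.0 - pi0) * poisson

def zip_loglik(counts, lam, pi0):
    """Log-likelihood of independent counts under the same ZIP parameters."""
    return sum(math.log(zip_pmf(k, lam, pi0)) for k in counts)

# Hypothetical tag counts for one gene; many zeros, a few positive counts.
example_counts = [0, 0, 0, 1, 3, 0, 2]
print(zip_loglik(example_counts, lam=2.0, pi0=0.3))
```

Compared with a plain Poisson model, the extra pi0 term lets the likelihood account for more zeros than the Poisson mean alone would predict.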

2008 Seminars

December 10, 2008

Jong Soo Lee, Ph.D.
Visiting Assistant Professor, Department of Statistics
Carnegie Mellon University, Pittsburgh, Pennsylvania

Pointwise Testing with Functional Data Using the Westfall-Young Randomization Method 

I will present a consideration of hypothesis testing with smooth functional data by performing pointwise tests and applying a multiple comparisons procedure. Methods based on general inequalities (e.g., Bonferroni's method) do not perform well because of the high correlation between observations at nearby points. I will consider the multiple comparison procedure proposed by Westfall and Young (1993) and show that it approximates a multiple comparison correction for a continuum of comparisons as the grid for pointwise comparisons becomes finer. I will describe simulations and an application to real data that have verified the method’s applicability to practical settings. I will also discuss possible future directions of this research.
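
The pointwise testing idea can be sketched with a single-step max-statistic permutation procedure in the spirit of Westfall and Young. Everything specific below is an assumption made for the example: the two-group design, the absolute difference of group means as the pointwise statistic, the noisy toy curves, and the function names. The speaker's actual statistic and data may differ.

```python
import random

def mean_diff(curves_a, curves_b):
    """Absolute difference of group means at each grid point."""
    n_pts = len(curves_a[0])
    return [abs(sum(c[t] for c in curves_a) / len(curves_a)
                - sum(c[t] for c in curves_b) / len(curves_b))
            for t in range(n_pts)]

def maxT_adjusted_pvalues(curves_a, curves_b, n_perm=2000, seed=0):
    """Single-step max-statistic adjusted p-values from label permutations."""
    rng = random.Random(seed)
    observed = mean_diff(curves_a, curves_b)
    pooled = curves_a + curves_b  # new list; inputs are not modified
    n_a = len(curves_a)
    exceed = [0] * len(observed)
    for _ in range(n_perm):
        rng.shuffle(pooled)  # permute whole curves, keeping each curve intact
        max_stat = max(mean_diff(pooled[:n_a], pooled[n_a:]))
        for t, obs in enumerate(observed):
            if max_stat >= obs:
                exceed[t] += 1
    # Count the observed labeling as one of the permutations.
    return [(e + 1) / (n_perm + 1) for e in exceed]

# Toy data: 10 noisy curves per group on a 20-point grid; group B is
# shifted upward on grid points 8 to 11 only.
data_rng = random.Random(1)
grid = 20
group_a = [[data_rng.gauss(0.0, 1.0) for _ in range(grid)] for _ in range(10)]
group_b = [[data_rng.gauss(2.0 if 8 <= t < 12 else 0.0, 1.0) for t in range(grid)]
           for _ in range(10)]

adj_p = maxT_adjusted_pvalues(group_a, group_b, n_perm=500)
print([round(p, 3) for p in adj_p])
```

Because whole curves are permuted together, the correlation between nearby grid points is carried into the null distribution of the maximum, which is the feature that general inequalities such as Bonferroni's do not exploit.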

December 3, 2008

Melanie M. Wall, Ph.D.
Associate Professor, Division of Biostatistics
University of Minnesota, Minneapolis-St Paul, Minnesota

Structural Equation Modeling of Latent Classes 

Latent class analysis typically involves modeling a set of several observed measurements via a single underlying (latent) categorical (class) variable that is meant to capture the associations found among the observed variables. This latent class variable can then be modeled as either an outcome or predictor variable in order to address some research question of interest. Latent class models can be seen applied within the health sciences to multiple diagnostic tests without a gold standard, multiple source or informant data, and multiple symptom assessments.  As applications of this type of modeling of a single latent class variable are becoming more common, it is natural to consider models involving multiple latent class variables. In particular, structural equation models (SEM) of latent class variables will be considered, differing from traditional SEM in that all the latent variables are categorical rather than continuous. In addition to basic main effects type models, models involving interaction effects between different latent class variables on outcomes will be demonstrated as well as structural model relationships between multiple latent class processes (over time). Examples of traditional applications of the single latent class variable models will be given and an application relating social, familial, environmental, and personal factors associated with adolescent obesity will be used to demonstrate the new SEM of latent classes.

November 19, 2008

Mikyoung Jun, Ph.D.
Assistant Professor, Department of Statistics
Texas A&M University, College Station, Texas

Nonstationary Covariance Models for Global Data 

The widespread availability of satellite-based instruments has allowed investigators to measure many geophysical processes on a global scale. Such assessments often show strong nonstationarity in the covariance structure.  I present a flexible class of parametric covariance models that can capture the nonstationarity in global data, and in particular, the strong dependency of covariance structure on latitudes. I apply the discrete Fourier transform to data on regular grids, which enables me to calculate the exact likelihood for large data sets. I apply the proposed covariance model to global total column ozone level data on a given day, and discuss how the model compares with some existing models.

November 12, 2008

Ori Rosen, D.Sc.
Associate Professor, Department of Mathematical Sciences
The University of Texas at El Paso

A Bayesian Regression Model for Multivariate Functional Data           

I will describe a method for analyzing multivariate functional data with unequally spaced observation times that may differ among subjects. Fitting multivariate observations simultaneously rather than fitting each variable separately may be advantageous if the error terms corresponding to each variable are correlated. The proposed method is formulated as a Bayesian mixed-effects model in which the fixed part corresponds to the mean functions, and the random part corresponds to individual deviations from these mean functions. Covariates can be incorporated into both the fixed and the random effects. I will present the results of simulation studies, and will apply the methodology to real data for illustration.

November 7, 2008

Yijian Eugene Huang, Ph.D.
Associate Professor of Biostatistics and Bioinformatics, Rollins School of Public Health
Emory University, Atlanta, Georgia

Quantile Regression with Censored Data            

Quantile regression has been advocated in survival analysis to assess evolving covariate effects. However, challenges arise when the censoring time is neither always observed nor independent of the covariates. In spite of several recent advances attempting to resolve this problem, existing methods either involve complicated algorithms, which lead to difficulties of implementation and asymptotics; or impose a cumulative-probability grid that introduces undesirable grid-dependence of the estimation. To resolve these issues, I introduce fundamental and general quantile calculus on a cumulative probability scale. These results give rise to a novel estimation procedure for censored quantile regression, based on estimating integral equations. I will propose a numerically reliable and efficient algorithm for the computation. This procedure reduces to the Kaplan-Meier method in the k-sample problem, and to standard uncensored quantile regression in the absence of censoring. The proposed regression quantile estimator is uniformly consistent and converges weakly to a Gaussian process. I will describe simulation studies, which have shown good numerical and statistical performance of the proposed method. I will illustrate the method through its application to data from a clinical study.
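
Since the abstract notes that the procedure reduces to the Kaplan-Meier method in the k-sample problem, a minimal sketch of the Kaplan-Meier product-limit estimator for one sample may be a useful reference point. The function name and the five example observations are hypothetical, and this is only the classical estimator, not the censored quantile regression procedure itself.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.

    times  : follow-up time for each subject
    events : True if the event was observed, False if censored
    Returns a list of (time, survival probability) at each distinct event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all subjects sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += 1 if data[i][1] else 0
            removed += 1
            i += 1
        if deaths:
            surv *= 1.0 - deaths / at_risk
            curve.append((t, surv))
        # Both events and censored subjects leave the risk set after time t.
        at_risk -= removed
    return curve

# Hypothetical sample: five subjects, two of them censored.
km_curve = kaplan_meier([2, 3, 3, 5, 7], [True, True, False, True, False])
print(km_curve)
```

Subjects censored at a given time are treated as still at risk at that time, which is the usual convention.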

October 22, 2008

Xihong Lin, Ph.D.
Professor, Department of Statistics, School of Public Health
Harvard University, Boston, Massachusetts   

Genomic-Feature–Based Analysis of Genome-Wide Association Studies 

Conducting a genome-wide association study (GWAS) has become an increasingly popular way to identify genetic variants of a disease through the examination of hundreds of thousands of SNPs across a genome. Investigators can use a GWAS to significantly accelerate the discovery of genetic variants associated with a disease. A common approach to analyzing a GWAS dataset is to test for a single SNP at a time and adjust for multiple comparisons. This approach has been found to have several limitations, including a lack of power and high false positives, which have made it difficult to replicate the findings of top SNPs in validation studies. I will present a biological feature-based GWAS analysis using the kernel machine method through its connection with generalized linear mixed models, and will illustrate the method using the CGEMS breast cancer GWAS data.

October 15, 2008

Robert McCulloch, Ph.D.
Professor, Risk Analysis and Decision Making, McCombs School of Business
The University of Texas at Austin

BART:  Bayesian Additive Regression Trees, with Application to Classification 

In Bayesian Additive Regression Trees (BART), Chipman, George, and McCulloch developed a fully Bayesian approach to the model: y = f(x) + e, where the errors may be drawn from any symmetric distribution. In the spirit of “ensemble models” the unknown function f was modeled as the sum of many simple tree models. The contribution of each individual tree was kept small through the use of a strong regularization prior. The BART methodology was shown to be very competitive in terms of out-of-sample prediction. However, the BART model, prior, and MCMC algorithm are all geared toward the case in which the response is numeric. I will explore the use of the BART methodology in classification problems, and will discuss different approaches to extending BART to classification.

October 14, 2008

Song Zhang, Ph.D.
Assistant Professor of Clinical Sciences
The University of Texas Southwestern Medical Center, Dallas, Texas

A Bayesian Approach to Ranking and Rater Evaluation with Application to Grant Reviews 

I will describe a Bayesian hierarchical model for the analysis of ordinal data from multirater ranking studies. The model for an item's score includes four latent factors: one is a latent trait determining the true ordering of the items, and the other three are the rater's performance characteristics, including bias, discrimination, and measurement error. The fitted model can be used to rank items based on their latent trait and to evaluate the performance of raters based on their characteristics. I will also describe a simulation-based decision-theoretic approach to determining the optimal number of raters. A loss function is specified accounting for the penalty of incorrect ranking and the cost of raters. I will identify the optimal number of raters for which the loss function is minimized. I will present the results of a simulation study and an application of this method to a grant review dataset.

October 8, 2008

Marco Ferreira, Ph.D.
Assistant Professor, Department of Statistics
University of Missouri - Columbia

Dynamic Multiscale Modeling 

This represents a joint effort with Scott Holan and Adelmo Bertolde. 

I will describe a new class of multiscale spatio-temporal models for Gaussian data. The framework we use decomposes the spatio-temporal observations and underlying process into several scales of resolution. Under this decomposition, the model evolves the multiscale coefficients through time with structural state-space equations. The multiscale decomposition we consider, which includes wavelet decompositions as a particular case, is able to accommodate irregular grids and heteroscedastic errors. The multiscale spatio-temporal framework we developed has several salient attributes. First, the multiscale decomposition leads to an extremely efficient divide-and-conquer estimation algorithm. Second, the multiscale coefficients have an interpretation of their own; thus, the multiscale spatio-temporal framework may offer new insight into understudied multiscale aspects of spatio-temporal observations. Finally, deterministic relationships between different resolution levels are automatically respected for the observations, the latent process, and the estimated latent process. I use two examples to illustrate the use of our multiscale framework. First, I will describe our analysis of a simulated dataset of functional data with temporally evolving functions; then, our analysis of a spatio-temporal dataset on agriculture production in the state of Espirito Santo, Brazil.

October 1, 2008

Sining Chen, Ph.D.
Assistant Professor, Department of Environmental Health Sciences & Biostatistics
Johns Hopkins School of Public Health, Baltimore, Maryland

Estimation, Prediction and Screening of Colorectal Cancer Risk in Lynch Syndrome 

I will give an overview of the statistical issues surrounding Lynch syndrome, the most common hereditary colorectal cancer syndrome, which also involves several other cancer sites. We will look at (1) the estimation of genetic risks from large, heavily ascertained families with cancer, (2) building mutation carrier probability models including MMRpro, and (3) individualized colonoscopy schedules based on personal risks for colorectal cancer.

September 10, 2008

Bradley P. Carlin, Ph.D.
Professor, Division of Biostatistics, School of Public Health
University of Minnesota, Minneapolis-St. Paul, Minnesota   

Analysis of Marked Point Patterns with Spatial and Nonspatial Covariate Information 

Hierarchical modeling of spatial point process data has historically been plagued by computational difficulties. Likelihoods feature intractable integrals that are themselves nested within a Markov chain Monte Carlo (MCMC) algorithm. I extend customary spatial point pattern analysis in the context of a log-Gaussian Cox process model to accommodate spatially referenced covariates, individual-level risk factors, and individual-level covariates of interest that mark the process. I also use multivariate process realizations to capture dependence among the intensity surfaces across the marks. I illustrate this method using data of breast cancer case locations collected throughout the mostly rural northern part of Minnesota, which are marked by the selection of mastectomy or breast conserving surgery (“lumpectomy”) for breast cancer treatment. The key substantive covariate (driving distance to the nearest radiation treatment facility) is spatially referenced, but other important covariates (notably age and stage) are not. This approach facilitates the mapping of marginal log-relative intensity surfaces for the two treatment options, and resolves the issue of whether women who face long driving distances are significantly more likely to opt for mastectomy while still accounting for all sources of spatial and nonspatial variability in the data. I also briefly discuss methods for statistical boundary analysis (“wombling”) in such settings.

July 28, 2008

David Rossell, Ph.D.
Unit Manager, Biostatistics and Bioinformatics
IRB Barcelona

GaGa: Microarray Differential Expression, Gene Clustering and Class Prediction

A typical microarray study analysis requires multiple complementary analyses, such as gene differential expression, gene clustering, class prediction, gene ontology, and network/pathway analyses. I propose a simple and computationally efficient model that unifies several of these tasks in a single framework. The model generalizes the hierarchical gamma/gamma model, first introduced by Newton and Kendziorski, to address several issues that limit the quality of the fit and therefore the reliability of the inference. The main advantage of a unified framework is that information can be shared between different analyses. When building a classifier, the model weights the contribution of each gene by the posterior probability that the gene is differentially expressed. When clustering genes, the clusters are defined according to biologically interpretable expression patterns. I illustrate the approach by walking through several real datasets.

May 29, 2008

María Eglée Pérez, Ph.D.
Associate Professor, Department of Mathematics
University of Puerto Rico - Rio Piedras Campus
San Juan, Puerto Rico

Intrinsic Priors for Testing the Hardy-Weinberg Equilibrium

Testing Hardy-Weinberg equilibrium is a relevant concern, for example, in studies relating genetic configurations with health conditions. The selection of prior distributions for testing Hardy-Weinberg equilibrium is a challenging issue as we are dealing with a low-dimensional null hypothesis for a discrete model. In this work, intrinsic priors for testing Hardy-Weinberg equilibrium are calculated using hypothetical training samples from uniform and Haldane priors. Properties of both priors are discussed, and their performances are compared on hypothetical data sets, and on real data from a case-control study of risk factors for gastric cancer in Western Venezuela. Analysis of sensitivity to different training sample sizes is shown, and possible criteria for the selection of the training sample size are discussed.
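
As a reminder of what the null hypothesis says, here is a minimal sketch for a single biallelic locus. It computes the genotype counts expected under Hardy-Weinberg equilibrium and the classical Pearson chi-squared statistic. This is the standard frequentist check, shown only for orientation; it is not the intrinsic-prior method of the talk, and the function name and argument names are hypothetical.

```python
def hwe_expected_and_chi2(n_AA, n_Aa, n_aa):
    """Expected genotype counts under Hardy-Weinberg equilibrium and the
    Pearson chi-squared statistic for one biallelic locus.
    (Illustrative helper; not part of the method described in the abstract.)"""
    n = n_AA + n_Aa + n_aa
    # Estimated frequency of allele A: each AA carries two copies, each Aa one.
    p = (2 * n_AA + n_Aa) / (2 * n)
    q = 1.0 - p
    # Under equilibrium the genotype proportions are p^2, 2pq, q^2.
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_AA, n_Aa, n_aa)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return expected, chi2


expected, chi2 = hwe_expected_and_chi2(30, 40, 30)
print(expected, chi2)
```

With observed counts (30, 40, 30) the estimated allele frequency is 0.5, so the expected counts are (25, 50, 25) and the statistic is 4.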

May 28, 2008

William Rosenberger, Ph.D.
Professor and Chairman, Department of Statistics
George Mason University, Fairfax, Virginia

Handling Covariates in the Design of Clinical Trials

There has been a split in the statistics community about the need to take into account covariates when designing a clinical trial. There are many advocates of using stratification and covariate-adaptive randomization to promote balance on certain known covariates. However, balance does not always promote efficiency or ensure that more patients are assigned to the better treatment. I describe procedures, including model-based procedures, for incorporating covariates into the design of a clinical trial, and give examples where balance, efficiency, and ethical considerations may be in conflict. I advocate covariate-adjusted response-adaptive (CARA) randomization procedures, a new class of procedures that attempts to optimize both efficiency and ethical considerations while maintaining randomization. I review the philosophy and procedures, and present a few new simulation studies for illustration. 

May 21, 2008

Alejandro A. Vallejos, Ph.D.
Postdoctoral Researcher, Department of Statistics
Pontificia Universidad Católica de Chile, Santiago, Chile

DPpackage: An R Package for Bayesian Nonparametric Inference

Although Bayesian nonparametric methods are extremely powerful and have a wide range of applicability within several prominent domains of statistics, they are not as widely used as one might guess. At least part of the reason for this has been the gap between the type of software that many applied users would like to have for fitting models and the software that is currently available. I introduce an R package, DPpackage, that is designed to help bridge this gap. DPpackage allows the user to perform Bayesian inference via simulation from the posterior distributions for models considering Dirichlet processes (DP), Dirichlet process mixtures (DPM), Polya trees (PT), mixtures of triangular distributions, and random Bernstein polynomial priors. The package also includes generalized additive models considering penalized B-splines. I discuss the general syntax and design philosophy of the package, and demonstrate its usage and main features using several examples with an emphasis on semiparametric generalized linear mixed models.

May 19, 2008

Damien Chaussabel, Ph.D., Associate Investigator
Baylor Institute for Immunology Research, Dallas, Texas

A Modular Framework for Biomarker and Knowledge Discovery from Blood Transcriptional Profiling Studies

The analysis of patient blood transcriptional profiles offers a means to investigate immunological mechanisms relevant to human diseases on a genome-wide scale. Such studies also provide a basis for the discovery of clinically relevant biomarker signatures. However, mining large-scale data for knowledge that is immunologically and/or clinically relevant is a challenge. I present a strategy with the goal of reducing the dimension of microarray data. The strategy is based on the identification of transcriptional modules formed by genes coordinately expressed in multiple disease datasets.

May 14, 2008

Jing Cao, Ph.D.
Assistant Professor, Department of Statistical Science
Southern Methodist University, Dallas, Texas

Bayesian Chi-Squared Goodness-of-Fit Tests for Censored Data Models

Censored survival data can be viewed as a special case of missing data. In problems with missing data, it is common to first impute the unobserved data and then perform the model-checking procedure based on the complete data. For censored data, the complete data include both the uncensored data and the imputed censored data. When there is heavy censoring, it is possible that several partitioning cells of a goodness-of-fit test will contain a high proportion of counts that correspond to imputed data. Such cells can dramatically reduce the power of the resulting test. I present general methodology for testing the adequacy of parametric statistical models applied to data with censoring. The statistic is calculated from posterior samples of probability-transformed Bayesian residuals based on uncensored observations. Under the null hypothesis, the Bayesian residuals are independently and identically distributed according to a uniform distribution. These uniform deviates are then used to construct a statistic that has an asymptotic chi-squared distribution (Johnson, 2004; 2007). I show that under heavy censoring, the test based on uncensored observations is more powerful than the test based on complete data. Under moderate or light censoring, the two tests are comparable in power. Another advantage of the proposed test is that the diagnostics apply to both simple and composite null hypotheses, and to i.i.d. and general regression settings.

May 12, 2008

Su-Chun Cheng, Ph.D.
Associate Professor, Epidemiology and Biostatistics
University of California, San Francisco

Combination of Multiple Diagnostic Tests for Classifying Censored Event Times

When there are multiple sources of information available, it is often of interest to construct a composite score that can provide classification accuracy better than any individual measurement. In this collaborative work, I present robust procedures for optimally combining tests when test results are measured prior to disease onset and disease status evolves over time. To account for censoring of the time of disease onset, the most commonly used approach to combining tests to detect subsequent disease status is to fit a proportional hazards model (Cox, 1972) and use the estimated risk score. However, simulation studies suggest that such a risk score may have poor accuracy when the proportional hazards assumption fails. I propose using a nonparametric transformation model (Han, 1987) as a working model to derive an optimal composite score with theoretical justification. I demonstrate that the proposed score is the optimal score when the model holds and is optimal "on average" among linear scores even if the model fails. Time-dependent sensitivity, specificity, and receiver operating characteristic curve functions are used to quantify the accuracy of the resulting composite score. The model provides consistent and asymptotically Gaussian estimators of these accuracy measures. I present a simple model-free resampling procedure to obtain consistent variance estimators, and illustrate the new proposals with simulation studies and an analysis of a breast cancer gene expression data set.

May 7, 2008

Lorenzo Trippa
Graduate Student
Department of Biostatistics
M. D. Anderson Cancer Center

A Truly Nonparametric Proportional Hazards Model

I present collaborative work on a novel Bayesian model for event time data. Many recent papers discuss Bayesian inference for the proportional hazards model and other semiparametric models. The semiparametric approach, by definition, still includes some rigid parametric assumptions. This study was motivated by practical limitations of such assumptions. A fully nonparametric prior is defined by extending the Polya tree model to a family of random survival functions indexed by covariates. An important feature of the proposed approach is that the random survival functions are a priori centered on a proportional hazards model. This allows us to report inference on covariate effects. I analyze a clinical study in which the semiparametric approach appears inadequate.

April 30, 2008

David B. Dahl, Ph.D.
Assistant Professor, Department of Statistics
Texas A&M University, College Station, Texas

Using Prior Information in Bayesian Nonparametric Models

Integration of data from several sources and technologies is a burgeoning field in bioinformatics. Some data naturally lead to formal statistical models, yet others may merely convey proximity among observations. In the context of clustering, methods are often either model-based or distance-based. In many cases, however, both types of information are available. I propose a hybrid approach that is simultaneously model-based and distance-based. Specifically, I show how the usual Dirichlet process mixture model framework can be adapted to incorporate pairwise distances between observations. One application area is incorporating gene annotation information in statistical models for gene expression. Another application is protein structure prediction, wherein one can estimate protein torsion angle distributions using both (phi, psi) angle pairs and RMSD distances from peptides.

April 9, 2008

Wesley O. Johnson, Ph.D.
Professor, Department of Statistics
University of California-Irvine

Non-Proportional Hazards Regression: Survival Curves Can (Be) Cross

This work represents a collaborative effort with Maria De Iorio, Peter Mueller, and Gary Rosner.
I present a dependent Dirichlet process model for survival analysis data. The model extends the ANOVA DDP that was presented by De Iorio et al. in 2004 in JASA to handle continuous covariates and censored data. A major feature of the work is that there is no necessity for the resulting survival curve estimates to satisfy the ubiquitous proportional hazards assumption. I provide an illustration based on a cancer clinical trial in which the survival probabilities for times early in the study are estimated to be lower for those on the high-dose treatment regimen than for those on the low-dose treatment regimen; and the reverse is true for later times. This is possibly due to the greater toxicity of the high dose in patients who are not as healthy at the beginning of the study.

March 20, 2008

Guosheng Yin, Ph.D.
Assistant Professor, Department of Biostatistics
M. D. Anderson Cancer Center

Bayesian Dose-Finding Trial Designs for Drug Combinations

Treating patients with a combination of agents is becoming commonplace in clinical trials, with biochemical synergism often the primary focus. In a typical drug combination trial, the toxicity profile of each individual drug has already been thoroughly studied in single-agent trials, which naturally offers rich prior information. We propose Bayesian adaptive designs for dose finding to account for the synergistic effect of two or more drugs in combination. To search for the maximum tolerated dose combination, we continuously update the posterior estimates for the toxicity probabilities of the combined doses. By reordering the dose toxicities in the two-dimensional probability space, we adaptively assign each new cohort of patients to the most appropriate dose. We conduct extensive simulation studies to examine the operating characteristics of the designs.

March 12, 2008

Valen E. Johnson, Ph.D.
Professor and Deputy Chair, Department of Biostatistics
M. D. Anderson Cancer Center

Better Bayes Factors, with Applications to Clinical Trial Design

Most Bayesian hypothesis tests result in exponential accumulation of evidence in favor of the alternative hypothesis when the alternative hypothesis is true, but only sub-linear accumulation of evidence in favor of the point null hypothesis when the null hypothesis is true. Thus, it is often impossible for an experiment to provide “strong evidence” in favor of the null hypothesis even when moderately large sample sizes have been obtained. Because Bayesian hypothesis tests yield probability statements regarding the truth of the null hypothesis (rather than a frequentist decision to simply “not reject”), this imbalance in the rates of accumulation of evidence is highly problematic. I review and contrast asymptotic convergence rates of Bayes factors for different classes of objective prior distributions and propose two new classes of prior densities that correct the imbalance inherent in standard objective priors. I illustrate the performance of hypothesis tests defined using these new prior distributions in the context of phase II clinical trials.
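
The imbalance can be seen in a minimal sketch under a textbook setup that is assumed here and not taken from the talk: normal data with known standard deviation sigma, a point null H0: mu = 0, and a conventional N(0, tau^2) prior on mu under the alternative. The function name and default values are illustrative.

```python
import math


def log_bf01(xbar, n, sigma=1.0, tau=1.0):
    """Log Bayes factor in favor of H0: mu = 0 against H1: mu ~ N(0, tau^2),
    given the mean xbar of n normal observations with known sd sigma.
    (Assumed textbook setup; not one of the new priors proposed in the talk.)"""
    v0 = sigma ** 2 / n   # variance of xbar under H0
    v1 = tau ** 2 + v0    # marginal variance of xbar under H1
    # Ratio of the N(0, v0) and N(0, v1) densities evaluated at xbar, on the log scale.
    return 0.5 * math.log(v1 / v0) - 0.5 * xbar ** 2 * (1.0 / v0 - 1.0 / v1)


for n in (100, 10_000):
    # xbar = 0 is the most null-friendly outcome; xbar = 0.5 mimics a true effect.
    print(n, log_bf01(0.0, n), -log_bf01(0.5, n))
```

Even when the sample mean is exactly zero, the log Bayes factor for the null grows only like (1/2) log n, so going from 100 to 10,000 observations merely doubles it (about 2.3 to about 4.6). With the sample mean held at 0.5, the log Bayes factor for the alternative grows roughly in proportion to n.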

March 5, 2008

Yu Ryan Yue
Doctoral Candidate, Department of Statistics
University of Missouri at Columbia
Columbia, Missouri

Nonstationary Gaussian Markov Random Fields for Regression and Spatial Modeling

The smoothing spline is one of the most popular curve-fitting methods. Its two-dimensional version, the thin-plate spline, is a well-known surface fitting model that has been used intensively in spatial smoothing areas. These two splines, however, both suffer from having only one global smoothing parameter that controls the smoothness of the fitted function. This becomes an issue when the function of interest is highly variable through the input space. To overcome this inadequacy, we have developed a class of priors for smoothing splines and thin-plate splines that are spatially adaptive. These priors extend Gaussian Markov random fields (GMRFs) by using a spatially adaptive variance component and taking a further GMRF prior for this variance function. Fully Bayesian inference can be carried out through efficient Markov chain Monte Carlo simulation. The performance is demonstrated with simulation examples and an application to a set of U.S. rainfall data.

February 27, 2008

Song Yang, Ph.D.
Senior Mathematical Statistician, Office of Biostatistics Research
National Heart, Lung and Blood Institute
Bethesda, Maryland

Semiparametric Estimation of the Hazard Ratio Function

The hazard ratio provides a valuable tool for assessing a treatment effect with survival data, with the proportional hazards special case of the Cox model as a widely used example. In general, the hazard ratio is a function of time, and provides a visual display of the temporal pattern of the treatment effect. The proportional hazards assumption is often too restrictive, at least for the initial exploration of a treatment effect, while a nonparametric estimate of the hazard ratio function requires a bandwidth selection, and may result in increased variance or bias. On the other hand, most semiparametric hazards models proposed so far imply certain restrictions on the hazard ratio that limit their utility. We investigate a model that allows monotone increasing or decreasing hazard ratio functions, including crossing hazards. This model provides a sufficient level of flexibility for many applications. The point estimates, point-wise confidence intervals and simultaneous confidence intervals, or confidence bands, of the hazard ratio, are proposed under this model. We demonstrate the inference procedures using data of coronary heart disease from the Women’s Health Initiative clinical trial on estrogen plus progestin, in addition to other data examples. These examples, with a diverse range of time dependence of the hazard ratio from mild to severe, suggest that the hazard ratio under this class of models, its confidence intervals and confidence bands, provide very useful visual display tools for assessing the treatment effect with survival data.

February 21, 2008

Guoqing Diao, Ph.D.
Assistant Professor, Department of Statistics
George Mason University
Fairfax, Virginia

Semiparametric Cure Rate Models with Random Effects

Joint work with Dr. Guosheng Yin, M. D. Anderson Cancer Center

We propose a novel class of cure rate models for multivariate failure time data with a survival fraction. The class is formulated through a transformation on the unknown population survival function. It incorporates random effects to account for the underlying correlation, and includes the mixture cure model structure and the proportional hazards cure model structure as two special cases. We propose a general form of the covariate structure that automatically satisfies an inherent parameter constraint. Moreover, it accommodates the corresponding binomial and exponential covariate structures in the two main formulations of cure models. The proposed class provides a natural link between the mixture and proportional hazards cure models, and it offers a wide variety of new modeling structures. We show that the nonparametric maximum likelihood estimators for the parameters of these models are consistent and asymptotically normal. The limiting variances achieve the semiparametric efficiency bounds and can be consistently estimated. Simulation studies demonstrated that the proposed methods perform well in practical situations. We use real data to illustrate this class of models.

February 20, 2008

Sonia Petrone, Ph.D.
Associate Professor, Department of Decision Sciences
Bocconi University
Milan, Italy

Bayesian Nonparametric Methods for Complex Heterogeneous Data

Bayesian nonparametric methods are increasingly applied to heterogeneous data in an extremely wide range of fields. I present recent developments of nonparametric allocation rules with multivariate and functional data, where several kinds of heterogeneity have to be taken into account.

February 13, 2008

Wen Ye, Ph.D.
Research Assistant Professor, Department of Biostatistics
University of Michigan
Ann Arbor, Michigan

Semi-Parametric Joint Modeling of Longitudinal and Time-to-Event Data Using P-Spline: A Penalized Likelihood Approach

Longitudinal studies in medical research often generate repeated measurements of biomarkers, and possibly censored survival data. Several joint models recently developed deal with the challenges arising from this type of data. A linear mixed model is commonly used to model the longitudinal covariate in joint models. However, in some cases, the longitudinal covariate time trajectory is not linear. We propose a joint model using penalized cubic B-splines to accommodate the nonlinear trajectory of longitudinal covariate measurements. To ease computation, the estimation procedure maximizes a penalized joint likelihood generated by a Laplace approximation of the joint likelihood, which combines the likelihood of the longitudinal data and the partial likelihood of the time-to-event data. We investigated the properties of the parameter estimators in simulation studies.

February 5, 2008

Sudipto Banerjee, Ph.D.
Associate Professor, Division of Biostatistics
School of Public Health
University of Minnesota
Minneapolis, Minnesota

Gaussian Predictive Processes Models for Large Spatial Datasets

With accessibility to geocoded locations that involve the use of geographical information systems (GIS) to collect scientific data, investigators are increasingly turning to spatial process models to carry out statistical inference. Over the last decade, hierarchical models implemented through Markov chain Monte Carlo (MCMC) methods have become especially popular for spatial modeling due to their flexibility and power to estimate models (and, hence, to address scientific hypotheses) that would be infeasible using classical methods. However, fitting hierarchical spatial models often involves expensive matrix decompositions, the computational complexity of which increases cubically with the number of spatial locations. This renders them infeasible for large spatial data sets.

I propose using a predictive process derived from the original spatial process that projects process realizations to a lower-dimensional subspace, thereby reducing the computational burden. I discuss the attractive theoretical properties of this predictive process, as well as its greater modeling flexibility compared to that of existing methods. In particular, I show how the predictive process seamlessly adapts to settings with nonstationary processes, with richer and more complex space-varying regression models, and with multivariate spatial models. I present a computationally feasible template that encompasses these diverse settings.
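
One way to write the projection in symbols is sketched below. It assumes the knot-based construction found in the published predictive process literature: a set of m knots s*_1, ..., s*_m, with m much smaller than the number of observed locations n. The knots and all symbols are assumptions for illustration and do not appear in the abstract.

```latex
\[
  \tilde{w}(s) \;=\; \mathrm{E}\!\left[\, w(s) \mid w^{*} \right]
  \;=\; c(s)^{\top} \, C^{*\,-1} \, w^{*},
  \qquad
  w^{*} = \bigl( w(s^{*}_{1}), \ldots, w(s^{*}_{m}) \bigr)^{\top}
\]
% w(s):  the original (parent) Gaussian spatial process
% w^*:   its realization at the m knots
% C^*:   the m x m covariance matrix of w^*
% c(s):  the m-vector of covariances between w(s) and w^*
% Matrix decompositions then involve m x m rather than n x n matrices.
```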

2007 Seminars

December 5, 2007

Richard J. Chappell, Ph.D.
Professor of Biostatistics and Medical Informatics and Statistics
University of Wisconsin, Madison, Wisconsin

Delta What? Choice of Outcome Scale in Noninferiority Trials

Noninferiority trials (often loosely called equivalence trials) are experiments that attempt to show that one intervention is only a little inferior to another on some quantitative scale. The cutoff value is commonly denoted as Delta. For example, we might want to show that the hazard ratio of disease-free survival among patients given an experimental chemotherapy versus those given an approved regimen is Delta = 1.3 or less, particularly if the former treatment is thought to be less toxic than or otherwise advantageous compared to the latter treatment.

Naturally, a lot of attention is given to the choice of Delta. The scale of Delta in equivalence trials, even more than in superiority clinical trials, must be carefully chosen. Since null hypotheses in superiority studies generally imply no effect, they are often identical or at least compatible when formulated on different scales. However, nonzero Deltas on one scale usually conflict with those on another. For example, the four hypotheses of arithmetic or multiplicative difference of either survival or hazard in general all mean different things unless Delta = 0 for differences or 1 for ratios. This can lead to interpretation problems when the clinically natural scale is not a statistically convenient one.
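
A small numerical sketch of the scale problem, under an assumption not stated in the abstract: hazards are proportional, so survival in the experimental arm equals control-arm survival raised to the power of the hazard ratio. The function names and the survival values 0.9 and 0.5 are hypothetical.

```python
def survival_under_hr(s_control, hazard_ratio):
    """Experimental-arm survival implied by a hazard ratio, assuming
    proportional hazards: S_exp(t) = S_control(t) ** hazard_ratio."""
    return s_control ** hazard_ratio


def survival_difference(s_control, hazard_ratio):
    """Absolute loss in survival probability implied by the hazard ratio."""
    return s_control - survival_under_hr(s_control, hazard_ratio)


# The same margin, Delta = 1.3 on the hazard-ratio scale, at two control-arm survival levels.
for s0 in (0.9, 0.5):
    print(s0, round(survival_difference(s0, 1.3), 3))
```

A single hazard-ratio margin of 1.3 corresponds to an absolute survival loss of about 0.03 when control survival is 0.9 but about 0.09 when it is 0.5, so a Delta fixed on one scale does not translate into a single Delta on the other.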

November 14, 2007 

Gang Li, Ph.D.
Professor of Biostatistics
School of Public Health, UCLA, Los Angeles, California

Joint Analysis of Longitudinal and Survival Data

The joint analysis of longitudinal measurements and survival data has received much attention in recent years. However, such work has primarily focused on a single failure type for the event time with independent censoring. I will discuss joint modeling of a longitudinal outcome together with survival data involving competing risks. This approach allows us to analyze more than one type of failure and provides a simple means to handle dependent censoring. I will also discuss robust procedures for this method, and will illustrate the method through its application to the analysis of data from a clinical trial of patients with scleroderma involving the lungs.

October 31, 2007

Mayetri Gupta, Ph.D.
Visiting Assistant Professor, Boston University
Assistant Professor of Biostatistics
North Carolina Center for Genome Sciences
University of North Carolina at Chapel Hill

Bayesian Methods for Detecting Nucleosome Positioning from Genome Tiling Arrays

Innovative experimental techniques for studying genome-wide protein-DNA interactions have recently been developed. This has led to the generation of new types of data, and has helped researchers gain novel biological insights into gene regulation and other intracellular processes. From a Bayesian point of view, I will discuss a framework and statistical methodology for determining chromatin features using genome tiling array data. Taking examples from yeast and human genomes, I will also illustrate how the structural knowledge gained from this method can be used to significantly improve predictions of transcription factor binding sites.

October 17, 2007

Luis Nieto-Barajas, Ph.D.
Visiting Associate Professor, Department of Biostatistics
M. D. Anderson Cancer Center, Houston, Texas

Bayesian Semiparametric Cure Rate Model with an Unknown Threshold 

I will propose a Bayesian semiparametric model for the analysis of survival data with a cure fraction. Explicitly considering a finite cure time in the model will allow us to separate the cured and the uncured populations. I will use a mixture prior of a Markov gamma process and a point mass at zero to model the baseline hazard rate function of the entire population. I will focus on estimating the cure threshold after which subjects are considered cured. This model allows us to incorporate covariates through a structure that is similar to the proportional hazards model, and also allows the cure threshold to depend on the covariates. I will illustrate the model through its application to simulation studies and to a full Bayesian analysis of data from a study of bone marrow transplantation.

October 10, 2007

Stefano Monni, Ph.D.
Postdoctoral Researcher, Biostatistics & Epidemiology
University of Pennsylvania, Philadelphia

Associating High-Dimensional Response and Covariate Data

I will discuss a Bayesian method for multivariate analysis that can be applied to data sets with a large number of covariates and outcomes. I will illustrate the method through its application to the analysis of gene expression quantitative trait loci (eQTL).

October 3, 2007

Wesley O. Johnson, Ph.D.
Professor, Department of Statistics
University of California - Irvine

Semiparametric Survival Analysis with Time-Dependent Covariates

I will discuss semiparametric modeling of survival data with time-dependent covariates, and will consider the traditional Cox model, the Cox and Oakes model, and extensions of the proportional odds model and the accelerated failure time model. I will model baseline survival with a mixture of finite Polya trees in each instance. I will use the log pseudo marginal likelihood approach, presented by Geisser and Eddy (1979), to select models among the semiparametric families. I will discuss modeling longitudinal and survival data jointly, and will apply this approach to the analysis of a particular data set in order to compare the results from using fixed versus imputed values for the longitudinal process.

September 28, 2007

Erica E. M. Moodie, Ph.D.
Assistant Professor of Biostatistics
Department of Epidemiology, Biostatistics & Occupational Health
McGill University, Montreal, QC, Canada

Optimal Adaptive Treatment Strategies: Using Structural Nested Models to Estimate the Optimal Duration of Breastfeeding

I will discuss optimal adaptive treatment strategies. An adaptive treatment strategy is a function that takes treatment and covariate history to the current time for an individual as arguments and then determines the treatment to be given. One approach to finding the optimal adaptive strategy is to use structural nested models (SNM). I will develop an SNM to determine the optimal duration of breastfeeding, using data from the Promotion of Breastfeeding Intervention Trial (PROBIT). This trial compared groups of women who were exposed to a program that encouraged breastfeeding and women who were offered standard maternal care. The intent-to-treat analysis showed no significant difference in weight between the randomization groups through one year. However, there was substantial overlap in breastfeeding behavior between women in the two intervention arms, and decisions to discontinue breastfeeding correlated with maternal and infant characteristics, which suggests confounding.

September 26, 2007

David A. Stephens, Ph.D.
Professor, Department of Mathematics & Statistics
McGill University, Montreal, QC, Canada

A Bayesian View of Some Causal Inference Procedures

Causal inference methods have been demonstrated to improve the estimation of treatment effects in studies in which confounding may be present and the confounded covariate vector is of a high dimension. I will briefly review some of the most common causal inference procedures, and will then outline Bayesian versions that can be implemented, albeit at the cost of making model assumptions that are not necessary in the frequentist version.

The main application of the work is in a longitudinal dose-response study into the treatment of amblyopia, a condition of visual impairment that is commonly treated by occlusion (patching of one eye). Until recently the efficacy of this treatment had not been quantified. I will report on two studies -- one observational, one experimental -- that have attempted to make this quantification. Even in the randomized study, the effect of the "dose" of occlusion is potentially confounded as the amount of dose received is controlled by the subject, and thus causal methods must be used. I will demonstrate the similarities and differences between frequentist and Bayesian causal procedures.

September 19, 2007

Lei Liu, Ph.D.
Assistant Professor
Department of Public Health Sciences
Division of Biostatistics & Epidemiology
School of Medicine, University of Virginia, Charlottesville, Virginia

Statistical Analysis of Longitudinal Medical Cost Data

Modeling longitudinal medical costs is of great interest in health economics studies. I will propose several novel models for the analysis of longitudinal medical costs, e.g., monthly or annual medical costs. Of particular interest are multi-part random effects models to distinguish the zero costs, outpatient costs, and inpatient costs. I will introduce several estimation methods and examples of simulation studies conducted to evaluate their performance. To illustrate the models, I will use them to analyze data on monthly medical costs incurred by patients with chronic heart failure. These data were supplied by the clinical data repository (CDR) at the University of Virginia Health System.

July 11, 2007

R. Todd Ogden, Ph.D., Associate Professor of Biostatistics (Psychiatry)
Mailman School of Public Health, Columbia University, New York, New York

Regression Models with Curves or Images as Predictors

Regression of a scalar response on functional predictors (or signals), such as spectra or images, presents a major challenge when the dimension of the signals far exceeds the number of signals in the data set. Meaningful fitting of such a model requires some form of dimension reduction. A proposed approach to this problem extends common multivariate methods to handle functional data by also incorporating a roughness penalty. Common multivariate methods include principal component regression (PCR) and partial least squares (PLS) methods. I will briefly discuss a number of alternative estimation strategies, as well as sufficient conditions for consistency. I will illustrate these methods using data from near infrared (NIR) spectra from chemical samples and data from a brain imaging study. 

June 6, 2007

Nadine Houédé, M.D., Professor
Institut Bergonié, Bordeaux, France

New Methods in Early Drug Development: Bringing Biostatistics into the Clinic

Compared to the evaluation of classical cytotoxic agents, that of new targeted therapies in early phase trials raises some difficult issues in terms of the assessment of efficacy and toxicity. Targeted therapies do not necessarily induce cell death and tumor shrinkage. This is probably due to their mechanisms of action, including cell signaling pathways, angiogenesis or cell cycle progression. In contrast with classical cytotoxic agents, most targeted agents have moderate or no toxicity. Moreover, the relationship between dose and efficacy for targeted therapies is often not linear or even monotone, as efficacy sometimes reaches a plateau or even decreases with higher doses of such therapies. Thus, the efficacy of targeted therapies combined with cytotoxic agents cannot be evaluated effectively using RECIST criteria, and classical trial designs and endpoints may not be adequate to validate these new combinations. The problem becomes even more complex when a targeted agent is combined with a cytotoxic agent.

I will present some examples of development failures for new targeted therapies and discuss what we may learn from them to improve the methodologies used in early phase clinical trials. I will propose a new phase I-II strategy for choosing dose pairs of a combination of a cytotoxic agent and a cytostatic agent based on efficacy and toxicity. Under a Bayesian model, the method chooses the dose pair for each successive cohort of patients adaptively to maximize the expected utility, based on observed data from patients treated previously in the trial. The Delphi method is used to obtain this utility (the quality of each toxicity-efficacy result being useful to the patient) from members of an international group of experts.

For illustration, I will apply the method to data from a trial combining targeted therapy and chemotherapy as first-line treatment of bladder cancer.

April 25, 2007

Xiaohui Sophie Wang, Ph.D.
Assistant Professor, Department of Mathematics
The University of Texas-Pan American, Edinburg, Texas

Classification of Curve Data Using Bayesian Wavelet Methods

I propose classification models for binary and multicategory data for which the predictor is a random function. I will use Bayesian modeling with wavelet basis functions that have nice approximation properties over a large class of functional spaces and can accommodate a variety of functional forms that are observed in practical applications. I will develop a unified hierarchical model to encompass the adaptive wavelet-based function estimation model as well as the logistic classification model. I will analyze simulated and real data sets to compare the performance of the proposed model with other classification methods, such as the existing naive plug-in methods.

April 18, 2007

Michael J Daniels, Sc.D.
Associate Professor and Chief, Division of Biostatistics
Department of Epidemiology and Biostatistics
University of Florida, Gainesville, Florida

Joint Models for the Association of Longitudinal Binary and Continuous Processes with Application to a Smoking Cessation Trial

Collaborative research with Xuefeng Liu, Ph.D., Wayne State University.

I propose joint models for use when it is of interest to determine the association of a longitudinal binary and a longitudinal continuous process. The models are parameterized such that the dependence between the two processes is characterized by unconstrained regression coefficients. I use Bayesian variable selection techniques to parsimoniously model the coefficients. I develop an MCMC sampling algorithm for sampling from the posterior distribution, using data augmentation steps to handle missing data. I address several technical issues regarding efficient implementation of the MCMC algorithm. The motivation for developing the models was the analysis of a smoking cessation clinical trial, for which an important question of interest was the effect of the treatment (exercise) on the relationship between smoking cessation and weight gain.

April 3, 2007

Yanyuan Ma, Ph.D.
Professor of Statistics
University of Neuchâtel, Neuchâtel, Switzerland

Parameter and Functional Estimation in Semiparametric Models

This presentation represents work done jointly with Yedong Wang, UC-Santa Barbara, and Arnab Maity and Ray Carroll, Texas A&M University.

I will illustrate a semiparametric approach to estimating parameters in a variance function, and will also demonstrate the properties of a semiparametric estimator for functional estimation. I will then illustrate the differences and links in the underlying geometric structures.

March 28, 2007

Scott Holan, Ph.D.
Assistant Professor
Department of Statistics
University of Missouri, Columbia, Missouri

Using Bayesian Markov Switching Models to Predict the Spawning Success of Shovelnose Sturgeons

This work represents a collaborative effort with Ginger Davis of the University of Virginia, Mark Wildhaber, Aaron DeLonay, Dianna Papoulias and Janice Bryan of the U. S. Geological Survey.

Sturgeon spawning is linked to environmental patterns, rhythms and cues. The recent endangerment of shovelnose sturgeons has led to efforts to support increased spawning, but little biological or ecological information specific to the sturgeons has been available to guide such efforts. It has not been known where, when, or under what conditions shovelnose sturgeons spawn in the Missouri River, nor to what degree such spawning is successful. Using measurements of biological variables associated with readiness to spawn, as well as longitudinal behavioral data collected using telemetry and data storage device sensors, we have applied a hierarchical Bayesian model to the prediction of sturgeon spawning success. The model uses an eigenvalue predictor from the transition probability matrix in a two-state Markov switching model with GARCH dynamics as a generated regressor in a linear regression model.

March 27, 2007

Montserrat Fuentes, Ph.D.
Associate Professor
Department of Statistics
North Carolina State University, Raleigh, North Carolina

Spatial Association Between Speciated Fine Particles and Mortality

This work represents a collaborative effort with H. R. Song (University of South Carolina), S. Ghosh (NCSU), David Holland (EPA) and J. Davis (EPA and Marine Earth Atmospheric Science Dept, NCSU).

Particulate matter (PM) has been linked to a range of serious cardiovascular and respiratory health problems, including premature mortality. The main objective of this research is to quantify uncertainties about the relationship between fine PM exposure and mortality. Together with my colleagues, I have developed a multivariate spatial regression model for the estimation of the risk of mortality associated with fine PM and its components across all counties in the coterminous United States. I will describe how we characterize different sources of uncertainty in the data and model the spatial structure of the mortality data and the speciated fine PM. I will consider a flexible Bayesian hierarchical model for a space-time series of counts (mortality) by constructing a likelihood-based version of a generalized Poisson regression model that combines methods for point-level misaligned data and change of support regression. Our results seem to suggest an increased risk of mortality by a factor of two due to fine particles with respect to coarse particles. Our study also shows that in the western U. S., the nitrate and crustal components of the speciated fine PM seem to have more impact on mortality than other components. In the eastern U. S., sulfate and ammonium account for most of the fine PM effect.

February 28, 2007

Yumming Mu, Ph.D.
Assistant Professor, Department of Statistics
Texas A&M University, College Station, Texas

Quantile Regression Transformation Models for Longitudinal Data

I will describe a flexible nonparametric quantile regression model for longitudinal data. The basic elements of the model are a time-dependent power transformation on the longitudinal dependent variable and a varying-coefficient model for conditional quantile functions. I propose a two-step estimation procedure to fit the model, and establish its consistency property. I choose the tuning parameters with generalized cross validation in conjunction with a Schwarz-type information criterion. I will illustrate this method through its application to data on the progression of CD4 cell counts in patients infected with HIV-1 who are undergoing three different treatments. The quantile regression approach for longitudinal data enables the construction of a pointwise prediction band of individual trajectories without requiring parametric distributional assumptions.

February 7, 2007

Donatello Telesca
Graduate Student, Department of Statistics
University of Washington, Seattle, Washington

Bayesian Hierarchical Self-Modeling Warping Regression with Application to Network Inferences

Functional data often exhibit a common shape and variations in amplitude and phase across curves, and the data analysis often proceeds by synchronization of the data through curve registration. I propose a Bayesian hierarchical model for curve registration. The model provides a formal account of amplitude and phase variability, while borrowing strength from the data across curves in the estimation of the model parameters. I discuss extensions of the model using penalized B-splines in the representation of the shape and time transformation functions, and allowing random image sets in the time transformation. I discuss applications of the model to simulated data, as well as to two data sets. In particular, I illustrate the model in a nonstandard analysis investigating regulatory networks in time-course microarray data.

January 31, 2007

Xinlei Sherry Wang, Ph.D.
Assistant Professor, Department of Statistics
Southern Methodist University, Dallas, Texas

Adaptive Bayesian Criteria in Variable Selection for Generalized Linear Models

For the problem of variable selection in generalized linear models, I develop various adaptive Bayesian criteria. Using a hierarchical mixture setup for model uncertainty, combined with an integrated Laplace approximation, I derive empirical Bayes and fully Bayes criteria that can be computed easily and quickly. I use simulation studies to assess the performance of the criteria, and compare it to that of other criteria, such as AIC and BIC, on normal, logistic and Poisson regression model classes. A fully Bayes criterion based on a restricted region hyperprior seems to be the most promising. I apply the proposed criteria and compare them to the competitors using an example data set.

January 24, 2007

Birgir Hrafnkelsson, Ph.D.
Assistant Professor, Department of Mechanical & Industrial Engineering
University of Iceland, Reykjavik, Iceland

Bayesian Modeling of Spatially Correlated Extreme Values

Annual extreme values of environmental variables, such as temperature, exhibit spatial correlation. The marginal distribution at each site can be modeled with the generalized extreme value distribution. The temporal correlation is not significant when there is a strong spatial correlation between the annual extreme values within the same period. I model the spatial correlation using the Gaussian copula with a correlation matrix based on the Matern correlation function. I also model the correlation with the Clayton copula, and compare the two correlation models. Further, I model the data and the parameters of the marginal generalized extreme value distributions with a Bayesian hierarchical model.

January 17, 2007

Steven MacEachern, Ph.D.
Professor, Department of Statistics
The Ohio State University, Columbus, Ohio

Dependent Race Models and Conjoint Choice Analysis

Conjoint choice experiments are a basic tool underlying much of market research. One goal of such an experiment is to assess the viability of a new product offering a combination of levels of various attributes. A viable offering will have high utility for a broad segment of consumers. In a choice experiment, subjects are presented with a set of products and asked which product they prefer. This process is repeated many times, with varying product sets. From these data, one hopes to extract a distribution of choice probabilities (across subjects) for competing products.

The primary modeling issues for a conjoint analysis are (1) how to model choice probabilities at the individual level, and (2) how to synthesize the information across a heterogeneous pool of subjects. I focus on improving the models for individual choice probabilities. The models generalize the so-called "horse race" models from psychology. The new models carefully treat the dependence structure among the alternatives, drawing a distinction between dependence arising from conditional independence and dependence arising from a shared realization of dependence. This structure is induced by the description of a product as a collection of levels of attributes. The models yield a natural means of modeling dominance of one product over another. Experimental data were collected to assess the performance of the new models relative to current methods. The new models outperform current methods on a variety of out-of-sample measures of fit. Most importantly, the new models perform quite well on tasks of extrapolation.

January 15, 2007

Fernando Quintana, Ph.D.
Adjunct Professor, Department of Statistics
Pontificia Universidad Catolica de Chile, Santiago, Chile

Semiparametric Modeling of Skewed Distributions

I present a model for univariate outcomes, constructed by weighting a symmetrical density by a skewing function that is modeled using Bernoulli polynomials. One basic idea of the model is the ability to gain control of the resulting distribution, particularly of the moments. I present some of the properties of the model, and discuss its practical applications.

January 10, 2007

Satoshi Morita, Ph.D.
Associate Professor
Nagoya University Graduate School of Medicine, Nagoya, Japan

Determining the Effective Sample Size of a Parametric Prior

I present a definition for the effective sample size of a parametric prior distribution in a Bayesian analysis, and propose methods for computing the effective sample size in a variety of settings.

January 4, 2007

Jing Ning, M.S.
Department of Biostatistics
Bloomberg School of Public Health
The Johns Hopkins University, Baltimore, Maryland

Estimating a Causal Treatment Effect on a Mark Variable with Complications of Failure Time Censoring

In a randomized study, a treatment effect on a mark variable measured at the time of event failure is an important index for evaluating treatment efficiency. That the values of mark variables are not observable when the failure events are censored makes it difficult to analyze data of this type. Also, conditioning on the occurrence of the failure event that occurs after treatment, observation of the mark variable is typically subject to selection bias. Thus, in general, comparisons based on mark variables measured at the post-treatment event do not have a causal interpretation. Furthermore, when failure time censoring is present, the marginal distribution of the mark variable may not be fully identifiable. This non-identifiability problem makes evaluating causal treatment effects even more difficult. I consider models and required assumptions for nonparametric estimation of causal treatment effects on mark variables. I develop analytic procedures by borrowing information from failure time data to correct the selection bias. Formulating the problem by the principal stratification framework of causal inference, I verify that the proposed treatment effects are principal causal effects. This method identifies the causal effects based on a conditional distribution of a mark variable, rather than an unconditional distribution. I establish asymptotic properties of the proposed estimators, and use numerical studies to demonstrate the performance of the estimators with practical sample sizes. I apply the methodology to data from an AIDS clinical trial.

2006 Seminars

December 13, 2006

Susan A. Murphy, Ph.D.
H Robbins Professor of Statistics, Professor of Psychiatry
University of Michigan, Ann Arbor

SMART Designs for Developing Dynamic Treatment Regimens

Dynamic treatment regimens are individually tailored treatments that mimic the adaptive nature of clinical practice, allowing for repeated adaptation of the treatment in response to patient outcomes. I describe an experimental design that yields useful data for constructing dynamic treatment regimens, and discuss potential primary and secondary data analyses. I illustrate the analyses with data from a SMART experimental design concerning patients with treatment-resistant depression.

December 6, 2006

John Cook, Ph.D.
Associate Director of Software Development, Division of Quantitative Sciences
The University of Texas M. D. Anderson Cancer Center

Details of Designing a Trial Using Adaptive Randomization

I present in detail the adaptive randomization design for a protocol conducted last year testing clofarabine +/- AraC for patients with AML/MDS. The trial illustrates some phenomena that researchers may not anticipate when designing adaptive randomization trials.

November 29, 2006

Sonia Jain, Ph.D.
Assistant Professor of Family & Preventive Medicine
Cancer Prevention & Control Program, University of California, San Diego

Analysis of Gene Expression Data Using a Split-merge Markov Chain Monte Carlo Technique

This presentation represents work done jointly with Radford M. Neal, of the University of Toronto.

The inferential problem of associating data in high dimensions to mixture components is difficult when components are nearby or overlapping. I introduce a new split-merge Markov chain Monte Carlo technique that efficiently classifies observations by splitting and merging mixture components of a nonconjugate Bayesian mixture model. This method, a Metropolis-Hastings procedure with split-merge functions, samples clusters of observations simultaneously rather than incrementally assigning observations to mixture components. Split-merge moves are produced by exploiting the properties of a restricted Gibbs sampling scan. I demonstrate the split-merge technique through its application to gene expression data that were part of a cancer classification problem. The data were collected in microarray experiments evaluating samples from patients diagnosed with leukemia, which were clustered according to the type of leukemia.

November 15, 2006

Erning Li, Ph.D.
Assistant Professor, Department of Statistics, Texas A&M University, College Station, Texas

Joint Models for a Primary Endpoint and Longitudinal Data

The relationship between a primary endpoint and longitudinal processes is often of interest in medical and public health research. Joint models that represent the association through shared dependence of the primary and longitudinal data on random effects are increasingly popular. Naive implementation by imputing subject-specific effects from individual regression fits yields biased inference, and several methods for reducing this bias have been proposed. These require a parametric (normality) assumption on the random effects, which may be unrealistic. Moreover, the existing methods routinely assume independent within-subject measurement errors in the longitudinal covariate processes. I propose conditional estimation procedures that require neither a distributional or covariance structural assumption on random effects, nor an independence assumption on within-subject measurement errors. The new procedures readily cover scenarios that have multivariate longitudinal covariate processes and can be calculated using available software. I present performance evaluations of the new estimators through simulations and analysis of data from a study of hypertension. Alternatively, I discuss a semiparametric joint model that makes only mild assumptions on the random effects distribution, and develop likelihood-based inference on the association and distribution. The estimated distribution can reveal interesting population features, as I demonstrate using data from a study of the association between longitudinal hormone levels and bone mineral status in perimenopausal women.

November 6, 2006

Dan Steinberg, Ph.D.
President and founder of Salford Systems, San Diego, California

Application of Data Mining Tools (MAR Splines, Random Forests and TreeNet) in Medical / Cancer Studies

After a brief overview of the technology, I discuss recent applications of these modern methods to uncover relationships not revealed through conventional statistical analysis. I draw upon applications from epidemiological work conducted at the Johns Hopkins University Medical School and at Harvard Medical School, among others.

November 1, 2006

Guosheng Yin, Ph.D.
Assistant Professor, Department of Biostatistics
The University of Texas M. D. Anderson Cancer Center

Generalized Method of Moments for Linear Regression with Multivariate Failure Time Data

The generalized method of moments (GMM) has an attractive structure and is particularly useful for improving estimation efficiency when the likelihood formulation is difficult and the moment conditions are obtained relatively easily. With multivariate failure time data, it is difficult to obtain efficient estimators using conventional estimating equations. I propose taking the GMM approach to the linear regression or accelerated failure time model with correlated survival data. Using martingale-based moments, I have studied the semiparametric rank estimator. To improve efficiency, I have concatenated the moments and built up a quadratic objective function by circumventing a direct estimation of the correlation parameters. I have established the consistency and asymptotic normality properties for the parameter estimates, and derived the limiting distribution for the objective function. Simulation studies I have carried out allow us to examine the finite sample properties of the GMM estimation and inference procedures, and to demonstrate its substantial efficiency gain over the conventional method. Real data from a study of diabetic retinopathy serve as an example of the application of this approach.

October 18, 2006

Ming-Ying Leung, Ph.D.
Professor and Director of Bioinformatics Program
The University of Texas at El Paso

Palindromes in Viral Genomes

A viral genome is typically a DNA or RNA molecule, made up of a sequence of nucleotide bases, usually represented as A, C, G and T. Bases A and T form a complementary pair, as do bases C and G. Palindromes are "words" formed by the base letters on a nucleotide sequence that are symmetrical in the sense that they read exactly the same as their reverse complements. As palindromes are frequently involved in DNA-protein binding, as well as in RNA secondary structure formation, many important functional sites on viral genomes contain unusually high concentrations of palindromes. In order to provide statistical criteria for identifying regions of the viral genomes with significantly high concentrations of palindromes, we obtain the mean and standard deviation for the number of palindromes at or above a given length, and a Poisson process approximation for their distribution on a random nucleotide sequence. These results have been applied to predict replication origins for DNA viruses and to assist our current efforts to develop a more consistent and efficient approach for predicting pseudoknots for RNA viruses.
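The definition of a palindrome given above can be made concrete with a short sketch. The code below is only an illustration, not the speaker's method: the function names are hypothetical, sequences are assumed to contain only the letters A, C, G and T, and the counting convention (every substring at or above a minimum length is counted separately) is an assumption made for simplicity.

```python
# Illustrative sketch; names and counting convention are assumptions.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq):
    """Return the reverse complement of a sequence of A, C, G, T."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def is_palindrome(seq):
    """True if the sequence reads the same as its reverse complement."""
    return seq.upper() == reverse_complement(seq)


def count_palindromes(seq, min_len=4):
    """Count substrings of length >= min_len that are palindromes.

    Brute force over all substrings, so suitable only for short sequences.
    Odd-length substrings never qualify, because the middle base would
    have to be its own complement.
    """
    count = 0
    for start in range(len(seq)):
        for end in range(start + min_len, len(seq) + 1):
            if is_palindrome(seq[start:end]):
                count += 1
    return count
```

For example, `GAATTC` is a palindrome in this sense, since its reverse complement is again `GAATTC`, and it contains the shorter palindrome `AATT`.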

October 18, 2006

Mei-Ling Ting Lee, Ph.D.
Chair and Professor, Division of Biostatistics
The Ohio State University, Columbus, Ohio

Modeling Protein Mass Spectra Using an Ion Flight Mixture Model

I propose a mathematical mixture of first-hit time distributions as a unifying statistical model for the analysis of mass spectrometry data in proteomic studies. The model recognizes the time of flight of an ion as a first-hit time and models the ion stream as a stochastic process. The model guides the deconvolution of a target protein mass spectrum into signatures of known ions from protein and peptide databases. In collaborative research, I have conducted a mass spectrum experiment to illustrate the mixture model and deconvolution methodology. I discuss tests for differential relative abundance, and illustrate the model, methods and ideas using a data set from a mass spectrometry experiment and a published data set from the study of ovarian cancer.

October 5, 2006

Yugo Cheng, Ph.D.
Assistant Professor, Department of Statistics
University of Illinois at Urbana-Champaign

Sampling for Conditional Inference on Multiway Tables

I describe an efficient sequential Monte Carlo method for sampling multiway tables with given constraints, which can be used to approximate exact conditional inference on contingency tables. An essential feature of this method is that it samples table entries sequentially according to an appropriate proposal distribution. The sequential sampling approach "divides and conquers" the difficult task of finding an appropriate proposal distribution for a multiway table with complex constraints. The model uses computational commutative algebra to provide conditions that guarantee certain good properties. I apply this method to a range of examples from the social and medical sciences to demonstrate its efficiency for real problems.

October 4, 2006

Feng Liang, Ph.D.
Assistant Professor, Department of Statistics
University of Illinois at Urbana-Champaign
and Institute of Statistics & Decision Sciences, Duke University, Durham, North Carolina

Nonparametric Bayesian Kernel Models

The reproducing kernel Hilbert space (RKHS) is a popular tool used in machine learning and data mining. I present a fully Bayesian framework and theory that coherently embeds kernel regression and classification in a general nonparametric model. The theory behind this approach relates the model to statistical learning methods, showing that the new class of priors supports the full range of functions in the RKHS. Key practical features of this approach include the use of shrinking priors to address problems of a "large p," the use of mixture priors for feature selection, coherent updating as sample sizes change and an understanding of "unlabelled data."

September 27, 2006

David B Dunson, Ph.D.
Senior Investigator, NIEHS
Research Triangle Park, North Carolina

A General Nonparametric Bayesian Method for Random Probability Measures Indexed by Predictors

In many applications, interest focuses on relating one or more predictors to the distribution of an outcome variable. Typically, the conditional response distribution given the predictors is unknown, but simplifying parametric assumptions are made, such as normality, linearity or a constant residual distribution. Hierarchical mixtures of expert models have been proposed, which avoid such assumptions through the use of locally-weighted mixtures of parametric models. For example, for a continuous response, a mixture of normal linear regressions could be used, with the mixture weights varying with predictor values. To avoid restrictive assumptions, such as a known number of mixture components or a known structure for the mixture weights, I propose a general nonparametric Bayesian method for uncountable collections of random probability measures indexed by predictors. This approach expresses the unknown mixture distribution as a kernel convolution of Dirichlet process basis distributions, with a random stick-breaking measure placed on the basis locations. A key property of the proposed structure is sparseness, with the method automatically tending to favor fewer components. I discuss additional properties, develop an efficient retrospective MCMC algorithm for posterior computation, and illustrate the method through application to reproductive epidemiology data.
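As background for the stick-breaking measure mentioned above, the following sketch draws weights from the standard truncated stick-breaking construction of a Dirichlet process. It is not the predictor-dependent method proposed in the talk: the function name, the truncation level and the Beta(1, alpha) breaks are assumptions chosen to show the basic idea only.

```python
import random


def stick_breaking_weights(alpha, n_components, rng):
    """Draw truncated stick-breaking weights (illustrative assumption).

    Each break v_k ~ Beta(1, alpha) takes a fraction of the stick that
    remains, so w_k = v_k * prod_{j<k} (1 - v_j). Returns the weights
    and the unassigned remainder of the stick.
    """
    weights = []
    remaining = 1.0
    for _ in range(n_components):
        v = rng.betavariate(1.0, alpha)
        weights.append(remaining * v)
        remaining *= 1.0 - v
    return weights, remaining


if __name__ == "__main__":
    rng = random.Random(2006)  # fixed seed so the draw is repeatable
    weights, leftover = stick_breaking_weights(alpha=1.0, n_components=10, rng=rng)
    print(weights, leftover)
```

Smaller values of `alpha` tend to put most of the mass on the first few weights, which is one informal way to see how constructions of this kind can favor fewer components.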

September 13, 2006

Alan Dabney, Ph.D.
Assistant Professor, Department of Statistics
Texas A&M University, College Station, Texas

Functional ANOVA Normalization of Two-Channel Microarrays

This work represents a collaborative effort with J. D. Storey.

I present a new, general method for normalizing two-channel microarray data, partially drawing on ideas from two widely used approaches. Whereas the ANOVA approach carefully distinguishes different sources of signal bias through explicit terms in its model, the approach based on the MA plot assumes that all intensity-dependent trends are due to unwanted bias, each leading to inaccurate normalization in fairly common scenarios. The approach I propose, eCADS, captures the strengths of the ANOVA and MA plot approaches, while avoiding their weaknesses. I replace the fixed coefficients in the ANOVA model with functions of the underlying RNA amount, thereby incorporating intensity-dependent relationships like those evident in MA plots. The normalization method fits this "functional ANOVA" model and subtracts terms representing bias to retain the biological signal of interest. By requiring a simple balance in experimental design, I show that the proposed method preserves differential expression relationships in expectation. A consequence of this work is the statistical justification of a more efficient dye-swap design that requires only one array per sample pair. I demonstrate the proposed method through its application to an experiment measuring expression in developing mice.
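
For readers unfamiliar with the MA plot referred to in this abstract, the sketch below computes the usual M (log ratio) and A (average log intensity) values for one spot from its two channel intensities. It illustrates the standard MA transformation only, not the eCADS method; the function name and the sample intensities are assumptions.

```python
import math


def ma_values(red, green):
    """Return (M, A) for one spot from two-channel intensities."""
    log_r = math.log2(red)
    log_g = math.log2(green)
    m = log_r - log_g          # M: log ratio of the two channels
    a = 0.5 * (log_r + log_g)  # A: average log intensity
    return m, a


print(ma_values(8.0, 2.0))  # (2.0, 2.0)
```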

August 30, 2006

David Rossell-Ribera
Doctoral Student, Department of Statistics
Rice University, Houston, Texas

Boundary-based Optimal Sequential Designs

This work represents a collaborative effort with Peter Müller and Gary Rosner.

I address the problem of designing sequential clinical trials that are optimal with respect to the researcher's goals and constraints. This can be formalized in a decision-theoretic framework as the maximization of some expected utility function, possibly restricting the search to designs that satisfy some properties, e.g., false-positive and false-negative error probabilities. Every time that data are observed, one must make the sequential decision of stopping or continuing data collection. When a stopping decision is made, one must make a terminal decision, e.g., declaring whether a treatment is effective or not.

A full solution to the problem requires backward induction, but this is too computationally expensive, even in fairly simplistic situations. I propose a simplification based on sequential boundaries. Each time data are observed, one computes a summary statistic and decides to stop or continue data collection, depending on the region into which the statistic falls. I use simple parametric forms to define these regions; thus, finding the optimal design is reduced to finding the optimal parameter values. I illustrate this approach in the context of screening designs and microarray data analysis.

May 24, 2006

Jiajie Zhang, Ph.D.
Professor, Associate Dean for Research, School of Health Information Sciences
The University of Texas Health Science Center-Houston

Biomedical Informatics: Current Challenges

Biomedical informatics is a rapidly growing interdisciplinary field, concerned with methods of data collection, storage, processing, communication and presentation within the biomedical sciences and health care industry. I discuss four current challenges facing biomedical informatics: infrastructure, ontology, translation informatics and human-centered computing. Human-centered computing is presented in detail, including its relationship with the high failure rate of health information technology (HIT) projects. Rather than flawed technology, the more common cause of failed HIT projects is a lack of systematic consideration of human and other non-technologic issues in the design and implementation of the projects. Human-centered computing considers human-computer interaction, workflow, organizational change, and process reengineering. I present a theoretical framework of human-centered computing, along with its successful application in several domains.

May 17, 2006

Sumihiro Suzuki
Doctoral Student and Teaching Assistant
Department of Mathematical Sciences
The University of Texas at Dallas

Methods of Sequentially Planned Procedures

Classical sequential procedures are often impractical because of the requirement of taking a single observation at a time. Such sampling schemes are often expensive and time consuming. Sequentially planned procedures, or simply sequential plans, extend and generalize these schemes by allowing observations to be collected in groups of variable sizes. After every group of observations, all the previously collected data are used to determine the next course of action. An optimal (Bayesian) sequential plan minimizes the (Bayes) risk function that takes into account the decision loss, observation (variable) cost, and group (fixed) cost.

In general, determining the optimal sequential plan remains an open problem, mainly because it requires risk optimization over a rather unstructured set of all plans. For a simple class of problems, such as one that arises in testing a treatment for a rare but severe adverse effect, I prove a number of properties of the Bayesian sequential plan, such as transitivity and monotonicity. This allows us to reduce the overall scope of the search to a small, manageable set of plans. Then, for a more general situation, I derive the upper and lower bounds for the Bayes risk, and use them to obtain epsilon-Bayes sequential plans. Additionally, I show that the epsilon-Bayes sequential plans are applicable to certain situations outside the original class of problems.

May 10, 2006

Ying Kuen (Ken) Cheung, Ph.D.
Assistant Professor of Biostatistics
Mailman School of Public Health, Columbia University, New York, NY

Two Guiding Principles for Dose-Finding Designs: Coherence and Rigidity

I introduce and discuss the coherence conditions and rigidity of dose-finding methods in the context of a simple phase I trial setting, where the objective is to estimate a targeted quantile of the unknown dose-toxicity curve in a homogeneous patient population. Most phase I methods are outcome-adaptive, and thus escalate or de-escalate doses for future patients based on the previous observations. An escalation for a new patient is said to be coherent only when the previous patient did not show signs of toxicity. Likewise, a de-escalation is coherent only when the most recent outcome indicated toxicity. The coherence conditions, motivated by ethical concerns in clinical trials, are satisfied by many statistical designs in the literature, but not by some commonly used method modifications. I show examples when coherence is violated, and discuss how the coherence principles may be applied to calibrate a two-stage trial design and to deal with situations with delayed toxicity. I present a few examples in which commonly used phase I methods cause rigidity in the outcome sequences with a non-negligible probability. A necessary, albeit somewhat irrelevant, consequence of rigidity in the context of phase I trials is inconsistency. It is interesting to note that rigidity occurs in phase I designs that avoid strong parametric assumptions on the dose toxicity. This is counterintuitive because one may expect that making fewer assumptions will avoid bias and lead to consistency by adding flexibility to the model. I also discuss some practical recommendations.
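
The coherence conditions stated in this abstract can be checked mechanically on a completed trial. The sketch below flags each patient whose dose was escalated right after a toxicity, or de-escalated right after a non-toxic outcome. The function name, the integer dose levels and the sample trial are hypothetical and serve only as illustration.

```python
def coherence_violations(doses, toxicities):
    """Return indices of patients whose dose assignment is incoherent.

    doses[i] is the dose level given to patient i; toxicities[i] is True
    if patient i experienced toxicity.
    """
    violations = []
    for i in range(1, len(doses)):
        escalated = doses[i] > doses[i - 1]
        de_escalated = doses[i] < doses[i - 1]
        if escalated and toxicities[i - 1]:
            # Escalation immediately after an observed toxicity.
            violations.append(i)
        elif de_escalated and not toxicities[i - 1]:
            # De-escalation immediately after a non-toxic outcome.
            violations.append(i)
    return violations


doses = [1, 2, 3, 2, 3]
toxicities = [False, False, True, True, False]
print(coherence_violations(doses, toxicities))  # [4]
```

In this made-up sequence, patient 4 receives a higher dose even though patient 3 experienced toxicity, so only that assignment is flagged.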

April 12, 2006

Qingzhao Yu, M.S.
Doctoral Student, Department of Statistics
Texas A&M University, College Station, TX

Bayesian Synthesis

In the practical implementation of a Bayesian analysis, we often face the problem of using the data multiple times: we examine the data in an exploratory fashion to select a model (likelihood and prior), and we obtain the posterior distribution using the same set of data. This is contrary to the foundation of the Bayesian paradigm. Also, when several analysts use different methods to analyze the same data, we should be able to efficiently combine the models they produce. The ensuing aggregation of information should provide improved predictive performance. We tackle these problems through the use of a novel modeling method based on data splitting. In a standard implementation of this method, several data analysts work independently on portions of a data set, eliciting separate models that are eventually updated and combined through Bayesian averaging.

I present theoretical results that characterize the general conditions under which modeling with data splitting improves estimation. These results are suggestive of the general principles of good modeling practice. Application of this method to a popular real data set and to simulated data sets shows a predictive performance superior to that of many automatic modeling techniques, including AIC, BIC, smoothing splines, CART and random forests. Compared to competing modeling methods, the data-splitting modeling approach (1) exhibits superior predictive performance for real data sets and simulations; (2) makes more efficient use of human knowledge; (3) selects sparser models with better explanatory ability; and (4) avoids multiple uses of the data in the Bayesian framework.

April 5, 2006

Samiran Sinha, Ph.D.
Assistant Professor, Department of Statistics
Texas A&M University, College Station, TX

Analysis of Matched Case-Control Data in the Presence of a Nonignorable Missing Exposure Variable

I focus on the informative missing exposure variable in matched case-control studies. When a missingness mechanism depends on the unobserved exposure values, one needs to model the missing mechanism in order to prevent biased and inconsistent estimates of the parameters. For handling informative missing (IM) data, I propose an approach based on a full likelihood by posing a model for selection probability and a parametric model for the partially missing exposure variable among the control population, along with a disease risk model. I propose an EM algorithm to estimate the model parameters. I discuss two special scenarios, one for a binary variable and another for a normal exposure variable. I illustrate this method through its application to real data, and through a simulation study that explores the advantages of the proposed method compared to those of existing methods under different missingness mechanisms.

March 22, 2006

Michael R. Kosorok, Ph.D.
Professor, Department of Statistics and Biostatistics & Medical Informatics
University of Wisconsin-Madison

Large p, Small n Asymptotics for Statistical Analysis of High-Dimensional Data

False discovery rate (FDR) techniques are extremely useful in microarray studies, image analysis, high-throughput molecular screening, astronomy, and in many other applications involving high-dimensional outcome data. Consider, for example, a cDNA microarray study where p-values are computed for each of p genes using data from n arrays. For FDR methods to be valid for identifying differentially expressed genes, it is necessary that the p-values for the non-differentially expressed genes simultaneously have uniform distributions marginally. While this is feasible for permutation-based p-values, it is unclear whether this also holds for p-values based on asymptotic approximations or on post-normalized data. The issue is that the number of p-values involved goes to infinity, and intuition suggests that at least some of the p-values should behave erratically. I examine this neglected issue when n is allowed to increase slowly and p is allowed to increase almost exponentially relative to n. I show the somewhat surprising result that the p-values, under very general dependency structures and for a variety of marginal test statistics and normalization procedures, are indeed simultaneously valid in a manner that allows accurate control of the FDR. I apply this result to establish the validity of a least-absolute-deviation method for normalization and significance analysis that is robust to contamination in the expression levels. I demonstrate the practical utility of the proposed method with an analysis of human placenta cDNA microarray data.
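
For context, the sketch below implements the standard Benjamini-Hochberg step-up procedure, one widely used FDR-controlling method that takes a vector of p-values as input. It is not the normalization or significance method developed in the talk; the function name, the default level q and the sample p-values are assumptions.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k such that p_(k) <= q * k / m.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= q * rank / m:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff.
    reject = [False] * m
    for idx in order[:cutoff_rank]:
        reject[idx] = True
    return reject


print(benjamini_hochberg([0.2, 0.001, 0.5, 0.008]))
# [False, True, False, True]
```

The procedure relies on the null p-values being valid, which is the property whose large-p behavior the abstract examines.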

March 15, 2006

Jianhua Hu, Ph.D.
Assistant Professor, Section of Bioinformatics
Department of Biostatistics & Applied Mathematics
The University of Texas M. D. Anderson Cancer Center, Houston TX

Bayesian Model Selection Using Test Statistics

Existing Bayesian model selection procedures depend critically on the specification of proper prior distributions for the parameters of each of the models considered. I propose a new approach for Bayesian model selection that uses test statistics to compute Bayes factors between models. This greatly reduces the difficulty associated with implementing model selection procedures and eliminates much of the sensitivity of the selection procedure to ad hoc prior specifications. In several test cases, this approach produces results that are competitive with the Bayesian model selection and model averaging techniques that have been previously proposed. The method also offers important computation advantages over existing simulation-based methods. Compared to that of other Bayesian procedures, its implementation requires less subjective input.

March 8, 2006

Danyu Lin, Ph.D.
Dennis Gillings Distinguished Professor, Department of Biostatistics
School of Public Health, The University of North Carolina at Chapel Hill

Maximum Likelihood Estimation of Haplotype Effects and Haplotype-Environment Interactions in Genetic Association Studies

A haplotype is a specific sequence of nucleotides on a single chromosome. The population associations between haplotypes and disease phenotypes provide critical information about the genetic basis of complex human diseases. Standard genotyping techniques cannot distinguish the two homologous chromosomes of an individual, so that only the unphased genotype (the combination of the two homologous haplotypes) is directly observable. Statistical inference about haplotype-phenotype associations based on unphased genotype data presents a very interesting and challenging missing data problem, especially when the sampling depends on the disease status. I provide a comprehensive and rigorous treatment of this problem, and consider all commonly used study designs, including cross-sectional, case-control, nested case-control and case-cohort study designs. The phenotype can be a disease indicator, a quantitative trait or a potentially censored time-to-disease variable. I formulate the effects of haplotypes on the phenotype through flexible regression models, which can accommodate a variety of genetic mechanisms and gene-environment parameters. The corresponding maximum likelihood estimators are consistent, asymptotically normal and asymptotically efficient. I develop simple and efficient numerical algorithms, and use simulation studies to demonstrate that the likelihood-based procedures perform well in practical settings. I provide applications to two major genetic studies, and discuss areas in need of further development.

March 1, 2006

Bradley P. Carlin, Ph.D.
Mayo Professor in Public Health, Division of Biostatistics
University of Minnesota, Minneapolis, MN

Using R and BRugs in Bayesian Clinical Trial Design and Analysis

Thanks in large part to the rapid development of Markov chain Monte Carlo (MCMC) methods and software for their implementation, Bayesian methods have become ubiquitous in modern biostatistical analysis. In submissions to the U. S. FDA Center for Devices and Radiological Health, where data on new devices are often scanty but researchers typically have access to large historical databases, Bayesian methods have been in use for over a decade. Statisticians and regulators on the drug side of the FDA are also now coming to appreciate the value of these methods, especially their ability to combine information from separate but related sources, reduce sample size, and directly measure the effects of interest while protecting the overall error rates.

I review the implementation of a variety of Bayesian clinical trial design and analysis methods in R and BRugs (the version of the OpenBUGS package callable from within R). In particular, I illustrate how a Bayesian might think about "power" when designing a trial, and how a Bayesian procedure may be calibrated to guarantee good, long-run frequentist performance (i.e., low type I and II error rates), a subject of keen interest to the FDA. The presentation should be accessible to a broad audience, and should generate discussion regarding areas requiring further development before Bayesian clinical trial design and analysis can be realistically considered for routine adoption by practitioners.

February 22, 2006

Xuelin Huang, Ph.D.
Assistant Professor, Department of Biostatistics & Applied Mathematics
The University of Texas M. D. Anderson Cancer Center, Houston TX

A Parallel Phase I/II Clinical Trial Design for Combination Therapies

This work represents a joint project undertaken with Biswas Swati, Yashuri Oki, Jean-Pierre Issa and Donald Berry.

The use of multiple drugs in a single clinical trial or as a therapeutic strategy has become common, particularly in the treatment of cancer. Interactions between drugs may impart specific benefits to the patient that are not available when the drugs are used individually. Because traditional trials are designed to evaluate one agent at a time, the evaluation of therapies in combination requires specialized trial designs. In place of the traditional, separate phase I and II trials, we propose using a parallel phase I/II clinical trial design to evaluate simultaneously the safety and efficacy of combination therapies. The proposed design applies Bayesian methods, uses all data accumulated from the beginning of the trial to update the prior distributions for the toxicity and efficacy parameters, and uses a new method to determine assignment probabilities. After an initial period of dose escalation, patients are randomly assigned to admissible dose levels. Combination doses with lower efficacy or intolerable toxicity are eliminated from the trial. The trial is stopped if the posterior probability for safety, efficacy or futility crosses a prespecified boundary. For illustration, we apply the design to a combination chemotherapy trial for leukemia. We use simulation studies to assess the operating characteristics of the parallel phase I/II trial design, and compare it to a conventional design for a standard phase I and phase II trial. The simulations show that the proposed design saves sample size, has better power and efficiently assigns more patients to doses with higher efficacy levels.

February 8, 2006

Kaye E. Basford, Ph.D.
Professor of Biometry, Head, School of Land and Food Sciences
The University of Queensland, Brisbane, Queensland, Australia

Analysis of Allergens: From Genomics to Medicine

This work was conducted jointly with Dr. Vladimir Brusic.

What represents high quality food today? Broadly speaking, it is something that improves our life through what we eat, touch and breathe. Agricultural and food sciences originally focused on yield (through the green revolution), then on nutritional value. Now, they focus on safety and environmental impact. There is, thus, an increasing overlap between food production and health and medical considerations, such as those involving allergens. In addition to traditional clinical approaches to the analysis of allergens, recent developments in genomics, proteomics and bioinformatics have allowed large-scale studies of the sources of allergens. For this work, we focused on the bioinformatics of specialist databases, and the computational analyses of allergenicity and allergic cross-reactivity.

February 6, 2006

Jeremy M. G. Taylor, Ph.D.
Professor of Biostatistics and Radiation Oncology
University of Michigan, Ann Arbor, MI

Survival Analysis Using Auxiliary Variables via Multiple Imputation, with Application to AIDS Clinical Trial Data

This work represents a joint undertaking with Dr. Paul Hsu at the University of Arizona.

We developed an approach, based on multiple imputations, that estimates the marginal survival distribution in survival analysis using auxiliary variables to recover information for censored observations. To conduct the imputation, we used two working survival models to define a nearest neighbor imputing risk set. One model is for the event times, and the other is for the censoring times. Based on the imputing risk set, we considered two nonparametric multiple imputation methods: risk set imputation and Kaplan-Meier imputation. For both methods, we imputed a future event or censoring time for each censored observation. With a categorical auxiliary variable, we showed that a large number of imputes of estimates from the Kaplan-Meier imputation method correspond to the weighted Kaplan-Meier estimator. We also showed that the Kaplan-Meier imputation method is robust to misspecification of either one of the two working models. In a simulation study with time-independent and time-dependent auxiliary variables, we compared the multiple imputation approaches with an inverse probability of censoring weighted method. We showed that all approaches can reduce bias due to dependent censoring and improve efficiency. We applied the approaches to AIDS clinical trial data, comparing ZDV and placebo, in which the CD4 count was the time-dependent auxiliary variable.
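
As background for the Kaplan-Meier-based methods named in this abstract, the sketch below computes the ordinary (unweighted) Kaplan-Meier estimate from right-censored data. It does not implement the imputation methods of the talk; the function name and the toy data are assumptions.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct observed event time.

    events[i] is True for an observed event and False for censoring.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Gather all subjects whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += 1 if data[i][1] else 0
            removed += 1
            i += 1
        if deaths > 0:
            # Multiply by the conditional probability of surviving past t.
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= removed
    return curve


print(kaplan_meier([1, 2, 3], [True, False, True]))
```

In the toy data, the subject censored at time 2 leaves the risk set without changing the estimate, so the curve drops only at times 1 and 3.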

February 1, 2006

Kaye E. Basford, Ph.D.
Professor of Biometry, Head, School of Land and Food Sciences
The University of Queensland, Brisbane, Queensland, Australia

Clustering Incomplete, Mixed, Three-Mode, Three-Way Data

This work was conducted jointly with Dr. Lyn Hunt.

We demonstrated the finite mixture approach to clustering three-mode, three-way data where some of the attributes are continuous (and assumed to have a normal distribution), and some are categorical (and assumed to have a multinomial distribution). We extended this approach to accommodate a situation with randomly missing data. We illustrated this by clustering the genotypes in a three-way data set where various attributes were measured on genotypes grown in several environments, and where there was a moderate amount of missing data.

January 27, 2006

Xiao-Hua Andrew Zhou, Ph.D.
Professor, Department of Biostatistics, School of Public Health and Community Medicine, University of Washington, Seattle WA

Double Semiparametric ROC Regression Models

This work was conducted jointly with Dr. Huazen Lin.

ROC regression methodology offers an opportunity to investigate how factors such as characteristics of study subjects or test environment affect test accuracy. Direct modeling of the ROC curves has received recent attention. Parametric and semiparametric methods have been developed for directly modeling ROC curves: (1) parametric methods specifying the link and baseline functions; and (2) semiparametric methods specifying the link function, but not the baseline function. Of course, the misspecification of either the link or the baseline function can lead to substantial bias for ROC curve estimates. We extended the existing direct ROC regression models to allow arbitrary link and baseline functions. We show that the proposed estimators for the regression parameters and ROC curves are asymptotically normal and consistent with the parametric convergence rate, and illustrate our approach with a real data set from a hearing test.

January 11, 2006

Simon Sheather, Ph.D.
Professor & Head, Department of Statistics
Texas A&M University, College Station, TX

Sliced Mean Variance Covariance Inverse Regression (SMVCIR)

A number of authors have established the direct connection between linear discriminant analysis and sliced inverse regression (SIR). These procedures rely on the assumption of homogeneity of the variance-covariance matrices. As such, these procedures only focus on location differences between groups. Cook and Yin (2001, Australian and New Zealand Journal of Statistics) studied the properties of sliced average variance estimation (SAVE), which expands the SIR space by also including contrasts between the groups' estimated variance-covariance matrices. In a discussion of the work of Cook and Yin, Hastie and Zhu showed that the SAVE directions are not always ordered in a natural way. Further, Zhu and Hastie showed (2003, Journal of Computational and Graphical Statistics) that SAVE can overemphasize second order differences between the groups when a location difference is dominant. I describe a new procedure, SMVCIR, that combines the information from SIR and SAVE and produces a set of ordered directions that capture the differences in location, variance, and covariance between the groups. I use the new procedure to analyze two data sets involving wine.

2005 Seminars

December 7, 2005

Merlise A. Clyde, Ph.D.
Associate Professor, Institute of Statistics and Decision Sciences
Duke University, Durham, North Carolina

Bayesian Nonparametric Models for Proteomic Peak Identification, Quantification, and Classification 

I present model-based inference for proteomic peak identification, quantification, and classification from mass spectroscopy, focusing on nonparametric Bayesian models. Using experimental data generated from MALDI-TOF (matrix-assisted laser desorption ionization - time of flight) mass spectroscopy, I model observed intensities in spectra with a hierarchical nonparametric model of expected intensity as a function of the time of flight. In particular, I express the unknown intensity function as a sum of kernel functions, a natural choice of basis functions for modeling spectral peaks. I discuss how to place priors on the unknown functions using Levy random field priors, and describe posterior inference via a reversible jump Markov chain Monte Carlo algorithm.

November 16, 2005

Dr. Julie Goldberg
Assistant Professor, Clinical Decision Making
University of Illinois at Chicago

The Role of Patients' Experiences in Prostate Cancer Decision Making

One of the major unanswered problems in medical decision making is how to help patients newly diagnosed with disease make high-stakes decisions about treatment choices and potential health outcomes that they have never experienced. The aim of the study is to understand how patients' experiences affect their abilities to anticipate uncertain future decisions and consequences. My work focuses on prostate cancer decision making. There are multiple treatment options for prostate cancer, all of which have great impact on the quality of life, but differ only minimally in their ability to extend life. Research suggests, therefore, that the decision should be based on a patient's preferences and values. My work attempts to understand why our current theories fail to predict patients' actual choices.

I extended decision analytic theory, a quantitative tool for predicting choice, in my first study. This theory assumes that patients' choices are independent of context. I hypothesized and found that patients' choices are very sensitive to context, in particular, their experiences of a disease state. Importantly, the pattern of findings suggests that the impact of patients' experiences was not pervasive; it was selective, informing their decision making only when their experiences were relevant to the task.

In a second study, I tried to examine what was in the "black box" of experience that mattered in men's decision making regarding prostate cancer. Given that research suggests that these decisions are preference-based, medical organizations have recommended a shared decision-making approach in which men reach their own, individualized decisions. Until now, knowledge about the factors that comprise an "individualized" decision has been lacking. I employed a theoretically grounded, qualitative approach, a method uniquely designed to capture this kind of complexity, to address this question. The study supported the findings of my previous work: it was patients' life experiences, rather than medical information, that systematically drove their decisions to screen for and treat prostate cancer. Moreover, many of these factors were outside the purview of a health professional's conceptualization of the disease. These studies offer insight into the cognitive and emotional processes that may underlie this decision and which could serve to inform the conversation between doctor and patient. 

November 9, 2005

Cigdem Gunduz Demir
Ph.D. Candidate, Research Assistant, Department of Computer Science
Rensselaer Polytechnic Institute, Troy, N.Y.

The Cell Graphs of Cancer

In the current practice of medicine, pathologists traditionally diagnose cancer from tissue samples. Examining such biopsy samples under a microscope, a pathologist makes assessments based on a visual interpretation of cell morphology and tissue distribution. This, however, leads to a certain level of subjectivity, possibly resulting in some inter-observer variability. To circumvent this problem, it is important to develop computational diagnostic tools that operate on quantitative measures. Such automated diagnostic tools facilitate fast, objective, mathematical judgment that is complementary to that of a pathologist, while reducing the subjectivity. For the purpose of automated cancer diagnoses, we introduce a new tissue representation model based on the generation of cell graphs from tissue samples. In this approach, we employ machine learning algorithms to automatically distinguish cancerous tissues from their counterparts by making use of the distinctive topological properties of "cluster" formation in cancerous cells.

We present the methodology of cell-graph generation along with a theoretical framework and experimental demonstrations. We introduce the definitions of different sets of distinctive cell-graph features (such as the clustering coefficient, giant connected component ratio, spectral radius, and number of connected components). We also report on the experimental demonstrations obtained on clinical data for the diagnosis of brain cancer (glioma). Despite the complex dynamic nature of glioma formation, we have successfully demonstrated that the self-organizing clusters of cancerous cells in the human brain exhibit distinctive local and global graph properties and, hence, that a machine-learning algorithm is able to differentiate cancerous tissue from non-cancerous tissue, for example, from healthy tissue or from a benign inflammatory process, with high accuracy.
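
Two of the cell-graph features listed in this abstract, the number of connected components and the giant connected component ratio, can be sketched in a few lines. The abstract does not say how edges are defined, so the rule used here (link two cells when they lie closer than a fixed radius) is purely an assumption, as are the function name and the toy coordinates.

```python
import math


def cell_graph_components(points, radius):
    """Link cells closer than `radius`; return component sizes, largest first."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) < radius:
                parent[find(i)] = find(j)

    sizes = {}
    for i in range(n):
        root = find(i)
        sizes[root] = sizes.get(root, 0) + 1
    return sorted(sizes.values(), reverse=True)


cells = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (30, 30)]
sizes = cell_graph_components(cells, radius=2.0)
print(len(sizes))              # number of connected components: 3
print(sizes[0] / len(cells))   # giant connected component ratio: 0.5
```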

November 2, 2005

Loki Natarajan, Ph.D.
Assistant Adjunct Professor, Family and Preventive Medicine
Cancer Prevention & Control Program, University of California at San Diego

Methods for Estimating Mutation Rates in Cell Populations

Spontaneous or randomly occurring mutations play a key role in cancer progression. Estimation of the mutation rate of cancer cells can provide useful information about the disease. We describe a discrete time stochastic model for a mutational birth process. We assume that mutations occur concurrently with mitosis so that when a non-mutant parent cell splits into two progeny, one of the daughter cells might carry a mutation. We propose an estimator for the mutation rate and investigate its statistical properties (bias and mean-squared error) via theory and simulations. We also explore the sensitivity of the proposed estimator to deviations from modeling assumptions. We describe the existing methods and extensions of the proposed methods, and discuss an application to human colorectal cancer cell lines.
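
The mutational birth process described in this abstract can be simulated directly. The sketch below assumes synchronous divisions starting from a single non-mutant cell, no cell death, and no back-mutation; these simplifications, the function name and the parameter values are assumptions. The estimator proposed in the talk is not reproduced here.

```python
import random


def simulate_mutants(generations, mutation_prob, rng):
    """Return (non_mutants, mutants) after synchronous divisions from one cell."""
    non_mutants, mutants = 1, 0
    for _ in range(generations):
        # Each non-mutant division yields one mutant daughter
        # with probability mutation_prob.
        new_mutants = sum(rng.random() < mutation_prob
                          for _ in range(non_mutants))
        # Mutants double; a mutating division gives one mutant
        # daughter and one non-mutant daughter.
        mutants = 2 * mutants + new_mutants
        non_mutants = 2 * non_mutants - new_mutants
    return non_mutants, mutants


rng = random.Random(42)
non_mutants, mutants = simulate_mutants(generations=10, mutation_prob=0.01, rng=rng)
print(non_mutants + mutants)  # 1024 cells in total after 10 divisions
```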

October 26, 2005

Franziska L. Michor, Ph.D.
Junior Fellow, Department for Organismic and Evolutionary Biology
Harvard University

Dynamics of Chronic Myeloid Leukemia 

The clinical success of the ABL tyrosine kinase inhibitor imatinib in chronic myeloid leukemia (CML) serves as a model for molecularly targeted therapy. However, at least two critical questions remain: (1) Can imatinib eradicate leukemic stem cells? (2) What are the dynamics of relapse due to imatinib resistance, which is caused by mutations in the ABL kinase domain? Understanding how imatinib exerts its therapeutic effect in CML and measuring disease burden by quantitative PCR provide an opportunity to develop a mathematical approach to answering these questions. We find that a four-compartment model based on the known biology of hematopoietic differentiation can explain the kinetics of the molecular response to imatinib in a data set of 169 patients. Successful therapy leads to a biphasic exponential decline of leukemic cells. The first slope of 0.05 per day represents the turnover rate of differentiated leukemic cells, while the second slope of 0.008 per day represents the turnover rate of leukemic progenitors. The model suggests that imatinib is a potent inhibitor of the production of differentiated leukemic cells, but does not deplete leukemic stem cells. We calculate the probability of developing imatinib resistance mutations and estimate the time until detection of resistance. Our model provides the first quantitative insights into the in vivo kinetics of a human cancer.
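
A biphasic exponential decline of the kind described in this abstract can be written as the sum of a fast and a slow exponential. The sketch below uses the two slopes quoted in the abstract and assumes they are decay rates on the natural-log scale; the amplitudes, the function names and the two-term form are illustrative assumptions, not the four-compartment model of the talk.

```python
import math

FAST_RATE = 0.05   # per day, first slope quoted in the abstract
SLOW_RATE = 0.008  # per day, second slope quoted in the abstract


def leukemic_burden(t_days, fast_amount=1.0, slow_amount=0.01):
    """Sum of two exponentials; the amplitudes are hypothetical."""
    return (fast_amount * math.exp(-FAST_RATE * t_days)
            + slow_amount * math.exp(-SLOW_RATE * t_days))


def half_life(rate_per_day):
    """Half-life in days implied by an exponential decay rate."""
    return math.log(2) / rate_per_day


print(round(half_life(FAST_RATE), 1))  # about 13.9 days
print(round(half_life(SLOW_RATE), 1))  # about 86.6 days
```

Under these assumptions the fast term dominates early in therapy and the slow term dominates later, which produces the two distinct slopes on a log scale.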

October 19, 2005

Xuming He, Ph.D. 
Professor of Statistics
University of Illinois at Urbana-Champaign

Inference for Quantile Regression Models

Quantile regression models have a wide range of applications, since regression models that focus on conditional means are often inadequate to reflect inhomogeneity or to capture interesting information from part of the population. As the quantile regression approach gains popularity in econometrics, statistics and biostatistics, it is important that we develop reliable inference tools for this approach. I will review a number of existing methods for estimating standard errors and for constructing confidence intervals, and will explain why it has been difficult for software developers to choose a default method. I will then introduce the Markov chain marginal bootstrap (MCMB) algorithm, and assess its performance in terms of accuracy, speed, and reliability. This overview of the MCMB algorithm is not about Bayesian computation; rather, it describes the algorithm's appeal for handling high-dimensional problems. The current version of the MCMB algorithm for quantile regression is available as an R package or an SAS procedure.
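For readers unfamiliar with the objective that quantile regression minimizes, the sketch below shows the standard check (pinball) loss in the simplest intercept-only case. It does not implement MCMB or any of the inference methods discussed in the talk, and the function names are hypothetical.

```python
def check_loss(residual, tau):
    """Check (pinball) loss: tau * r if r >= 0, (tau - 1) * r otherwise."""
    return residual * (tau - (1.0 if residual < 0 else 0.0))


def total_loss(q, ys, tau):
    """Sum of check losses when every observation is predicted by q."""
    return sum(check_loss(y - q, tau) for y in ys)


def best_constant(ys, tau):
    """Minimize the total check loss over constants.

    The loss is piecewise linear and convex, so a minimizer is attained
    at one of the observed values; a search over the data is enough.
    """
    return min(sorted(ys), key=lambda q: total_loss(q, ys, tau))


if __name__ == "__main__":
    data = [1, 2, 3, 4, 100]
    print(best_constant(data, 0.5))  # a sample median
    print(best_constant(data, 0.9))  # an upper quantile
```

Replacing the constant `q` with a linear predictor gives the usual quantile regression fit; the conditional mean, by contrast, would be pulled toward the outlying value.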

October 12, 2005

Siva Sivaganesan, Ph.D.
Visiting Professor, Department of Biostatistics & Applied Mathematics
M. D. Anderson Cancer Center

On a Model-Based Clustering Approach

Several methods are available for clustering different types of data, e.g., for clustering gene expression data. We review the "infinite" mixture model using a Dirichlet process prior, and use an example to show that this approach has certain advantages over the finite mixture model in terms of sensitivity. We show through simulation that when replicate measurements are available, it is advantageous to include the full data set in the model, rather than to take the average of the measurements, as is commonly done.

Then we address the issue of context-specific clustering, where certain experimental conditions are known to group together. Such knowledge may be used to improve clustering. We extend the Dirichlet process model to context-specific clustering, and use simulated data to show that this modeling method can improve clustering.
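The partition prior implied by a Dirichlet process can be illustrated with the Chinese restaurant process, in which the number of clusters is not fixed in advance. This is a sketch of the prior only, with a hypothetical concentration parameter `alpha`; it does not include the likelihood, the replicate model, or the context-specific extension described above.

```python
import random


def crp_partition(n, alpha, seed=0):
    """Draw a random partition of n items from a Chinese restaurant process.

    Item i (0-based) joins existing cluster k with probability
    sizes[k] / (i + alpha) and opens a new cluster with probability
    alpha / (i + alpha). Requires alpha > 0.
    """
    rng = random.Random(seed)
    sizes = []   # current size of each cluster
    labels = []  # cluster label assigned to each item
    for i in range(n):
        u = rng.random() * (i + alpha)
        cumulative = 0.0
        choice = len(sizes)  # default: open a new cluster
        for k, size in enumerate(sizes):
            cumulative += size
            if u < cumulative:
                choice = k
                break
        if choice == len(sizes):
            sizes.append(0)
        sizes[choice] += 1
        labels.append(choice)
    return labels, sizes


if __name__ == "__main__":
    labels, sizes = crp_partition(n=30, alpha=1.0)
    print(len(sizes), sizes)
```

Larger values of `alpha` tend to produce more clusters, which is how the "infinite" mixture lets the data inform the number of groups.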

October 5, 2005

Steve Horvath, Ph.D., Sc.D.
Assistant Professor of Biostatistics and Human Genetics
UCLA

Using Gene Co-Expression Network Methods in Cancer Genetics, with Application to Detecting Therapeutic Targets in Brain Cancer

Microarray gene expression profiles are proving useful for classifying subsets of patients or tumors, and for predicting survival and response to therapy. However, genes identified as predictive in one microarray study are frequently not validated in other studies. Identifying individual genes from complex molecular signatures for diagnosis, prognosis and therapeutic intervention remains a significant challenge.

We propose to use gene co-expression networks to identify therapeutic targets. Gene co-expression network construction is conceptually straightforward: nodes represent genes and are connected if the corresponding genes are significantly co-expressed across appropriately chosen tissue samples. In reality, it is tricky to define the connections between the nodes in such networks. An important question is whether it is biologically meaningful to encode gene co-expression using binary information (connected = 1; unconnected = 0). We describe a general framework for "soft thresholding" that assigns a connection weight to each gene pair. This leads us to define the notion of a weighted gene co-expression network. For soft thresholding, we propose several adjacency functions that convert the co-expression measure to a connection weight. For determining the parameters of the adjacency function, we propose a biologically motivated criterion, to which we refer as the scale-free topology criterion.
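The contrast between hard and soft thresholding can be sketched as follows. The abstract does not specify the adjacency functions, so the power function below, and the default values of `tau` and `beta`, are assumptions chosen for illustration.

```python
def hard_adjacency(cor, tau=0.7):
    """Binary adjacency: connected (1) if |cor| reaches the threshold tau."""
    return 1.0 if abs(cor) >= tau else 0.0


def power_adjacency(cor, beta=6):
    """Soft adjacency (assumed form): connection weight |cor| ** beta."""
    return abs(cor) ** beta


def adjacency_matrix(cor_matrix, adjacency_fn):
    """Apply an adjacency function to a correlation matrix.

    The diagonal is set to 0 so that a gene is not connected to itself.
    """
    n = len(cor_matrix)
    return [
        [0.0 if i == j else adjacency_fn(cor_matrix[i][j]) for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    cor = [[1.0, 0.69, 0.2], [0.69, 1.0, 0.71], [0.2, 0.71, 1.0]]
    print(adjacency_matrix(cor, hard_adjacency))
    print(adjacency_matrix(cor, power_adjacency))
```

In the example, correlations of 0.69 and 0.71 fall on opposite sides of the hard threshold but receive nearly equal soft weights, which is the motivation for the weighted network.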

We generalize the following important network concepts to the case of weighted networks. First, we introduce several node connectivity measures and provide empirical evidence that they can be important for predicting the biological significance of a gene. Second, we provide theoretical and empirical evidence that the "weighted" topological overlap measure (used to define gene modules) leads to more cohesive modules than its "unweighted" counterpart. Third, we generalize the clustering coefficient to weighted networks. Unlike the unweighted clustering coefficient, the weighted clustering coefficient is not inversely related to the connectivity. We provide a model that shows how an inverse relationship between the clustering coefficient and connectivity arises from hard thresholding.
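The three concepts can be written down for a symmetric adjacency matrix with entries in [0, 1] and a zero diagonal. The abstract gives no formulas, so the definitions below follow one commonly used formulation from the weighted-network literature and should be read as assumptions rather than as the speaker's exact definitions.

```python
def connectivity(adj, i):
    """Connectivity of node i: the sum of its connection weights."""
    return sum(adj[i][u] for u in range(len(adj)) if u != i)


def topological_overlap(adj, i, j):
    """Topological overlap of nodes i and j (assumed formulation).

    Shared-neighbor weight plus the direct connection, normalized by
    the smaller connectivity.
    """
    n = len(adj)
    shared = sum(adj[i][u] * adj[u][j] for u in range(n) if u not in (i, j))
    k_min = min(connectivity(adj, i), connectivity(adj, j))
    return (shared + adj[i][j]) / (k_min + 1.0 - adj[i][j])


def clustering_coefficient(adj, i):
    """Clustering coefficient of node i, valid for weighted networks."""
    others = [u for u in range(len(adj)) if u != i]
    numerator = sum(
        adj[i][u] * adj[u][v] * adj[v][i]
        for u in others for v in others if u != v
    )
    denominator = (
        sum(adj[i][u] for u in others) ** 2
        - sum(adj[i][u] ** 2 for u in others)
    )
    return numerator / denominator if denominator > 0 else 0.0


if __name__ == "__main__":
    triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
    print(connectivity(triangle, 0))
    print(topological_overlap(triangle, 0, 1))
    print(clustering_coefficient(triangle, 0))
```

With 0/1 entries these expressions reduce to the usual unweighted quantities, so the same code covers both hard-thresholded and soft-thresholded networks.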

Pioneering the use of a weighted gene co-expression network, we analyzed two independent glioblastoma gene expression microarray data sets (n = 55 and n = 65). We detected five highly reproducible gene co-expression modules, including one enriched for mitosis/cell cycle genes. The high-degree nodes in this module, the hub genes, were strikingly and reproducibly associated with patient survival (p < 1.0 x 10^-22). This network-based approach strikingly improved the validation success rate of individual genes predictive of survival (64%; 95% CI = 45-83%) in an independent data set relative to approaches based on p-value alone (14%; 95% CI = 13-16%) or an enrichment for genes associated with mitosis by gene ontology without reference to network concepts (32%; 95% CI = 25-45%).

These results demonstrate that gene co-expression networks contain valuable information for identifying prognostically and therapeutically important individual genes from complex molecular signatures. Detection of these individual genes may be a critical step towards integrating gene expression data into clinical practice.

September 14, 2005

Craig C. Earle, M.D.
Associate Professor of Medicine
Dana-Farber Cancer Institute Center for Outcomes & Policy Research, Harvard Medical School

Trends in the Aggressiveness of Cancer Care: The Importance of Steady State Conditions in Time Trend Analysis

In the last decade, we have seen an expansion in therapeutic possibilities for patients with advanced cancer. However, with more options comes the possibility that treatment may be too aggressive or continue for too long for some patients. Recently, we examined trends in the aggressiveness of cancer care near the end of life. Our findings revealed a number of pitfalls in time trend analysis due to secular trends and a lack of steady state conditions among the patient population. We use this example as well as others in the recent literature to discuss analytic approaches to time trends in cancer treatment and outcome. 


© 2009 The University of Texas M. D. Anderson Cancer Center