Invited Speakers and Guest Lecturers

Invited Speakers and Guest Lecturers 2022-2023

Xuming He, H.C. Carver Professor of Statistics at the University of Michigan

Noon-1 p.m. Tuesday, Feb. 21, in WH-100E

"How Good is Your Best Selected Subgroup" 

This talk is hosted by the Department of Mathematics and Statistics and is co-sponsored by the Data Science Transdisciplinary Area of Excellence.

Subgroup analysis is often performed by “slicing and dicing” the data to find one or more subgroups that show distinctive characteristics. However, evaluation of the best selected subgroup tends to be overly optimistic. In this presentation, we use the subgroup evaluation in clinical trials as an example to discuss the risk of selection bias in subgroup evaluations. In particular, we propose a novel bootstrap-based inference procedure for the best selected subgroup effect. The proposed inference procedure is model-free, easy to compute and asymptotically sharp. We show, through both theory and empirical investigations, that how a subgroup is selected post hoc should play an important role in any statistical analysis. Much of the talk is based on joint work with Xinzhou Guo.


Richard W. DiSalvo, associate research professor in the School of Public and International Affairs at Princeton University

Noon-1 p.m. Friday, Dec. 9, in AA-340

"Separate, but Better? Measuring School Spending Progressivity and its Association with School Segregation"

This talk is organized by the Data Science Transdisciplinary Area of Excellence.


Recent public discussions and legal decisions suggest that school segregation will remain persistent in the United States, but increased transparency may help monitor spending across schools. These circumstances revive an old question: is it possible to achieve an educational system that is separate but equal, or better, in terms of spending? This question motivates further understanding the measurement of spending progressivity and its association with segregation. Focusing on economic disadvantage, we compare two commonly used measures of spending progressivity: exposure-based and slope-based. Using nationwide U.S. school-level data on public education spending from the National Education Resource Database on Schools (NERD$), and school enrollments and rates of free/reduced-price lunch from the Longitudinal School Demographic Dataset (LSDD), we empirically examine school spending progressivity and its properties for the 2018-19 school year. Consistent with our theory, the exposure-based measure is the slope-based measure shrunk inversely by economic school segregation. This property makes more segregated school districts look more progressive on the exposure-based measure, representing a seemingly “separate but better” relationship. However, we show that this provocative pattern may be reversed by relatively modest poor-versus-nonpoor differences in unobserved parental contributions. We discuss implications for the measurement of progressivity, and for theory on public educational investments broadly.


Brandon Stewart, assistant professor in the Department of Sociology and also affiliated with the Department of Politics and the Office of Population Research at Princeton University

1:15-2:15 p.m. Thursday, Nov. 17, in WH-100E

"How to Make Causal Inferences Using Texts"

This talk is organized by the Statistics Seminar of the Department of Mathematics and Statistics at Binghamton University and the Data Science Transdisciplinary Area of Excellence.

Texts are increasingly used to make causal inferences, with the document serving as the outcome, treatment or confounder. I give an overview of two recent papers on causal inference with text-based latent representations. We demonstrate that all text-based causal inferences depend upon a latent representation of the text, and we provide a framework to learn the latent representation. Estimating this latent representation, however, creates new risks: we may unintentionally create a dependency across observations or create opportunities to fish for large effects. To address these risks, we introduce a train/test split framework and apply it to estimate causal effects from an experiment on immigration attitudes and a study on bureaucratic responsiveness. I then describe a framework for text-based confounding adjustment using text matching. (Based on joint work with Egami, Fong, Grimmer, Nielsen and Roberts.)


Invited Speakers and Guest Lecturers 2021-2022

Hongtu Zhu, professor of biostatistics, computer science and genetics at the University of North Carolina at Chapel Hill

Noon-1 p.m. Tuesday, May 3, via Zoom

"Challenges in Biobank-scale: Imaging Genetics and Beyond"

This talk is organized by the Data Science Seminar of the Department of Mathematics and Statistics at Binghamton University and is endorsed by the Data Science Transdisciplinary Area of Excellence and the Center for Imaging, Acoustics and Perception Science.

Recently the UK Biobank study has conducted brain magnetic resonance imaging (MRI) scans of over 40,000 participants. In addition, publicly available imaging genetic datasets have emerged from several other independent studies. We collected massive individual-level MRI data from different data resources, harmonized image processing procedures, and conducted the largest genetic studies so far for various neuroimaging traits from different structural and functional modalities. In this talk, we showcase novel clinical findings from our analyses, such as the shared genetic influences among brain structures and functions, and the genetic overlaps with a wide spectrum of clinical outcomes. We also discuss the challenges we have faced when analyzing these biobank-scale datasets and highlight opportunities for future research. This presentation is based on a series of works with members of the BIG-S2 lab of the University of North Carolina at Chapel Hill.


Clio Andris, assistant professor in the School of City and Regional Planning and the School of Interactive Computing at Georgia Institute of Technology

2:30 p.m. Thursday, April 7, via Zoom

"Measuring McCities: Landscapes of chain and independent restaurants in the United States"

This talk is co-sponsored by the Department of Geography, the Smart Communities for Social Good Working Group and the Data Science Transdisciplinary Area of Excellence.


We explored which cities in the U.S. maintain an independent food culture. We used a dataset of nearly 800,000 independent and chain restaurants for the continental United States. We found that car dependency, low walkability, a high percentage of voters for Donald Trump, concentrations of college-age students, and nearness to highways were associated with high rates of chainness. These high-chainness McCities are prevalent in the Midwestern and Southeastern United States.


Bo Zhao, associate professor in the Department of Geography at the University of Washington
2:30-3:30 p.m. Thursday, March 10, via Zoom

"Humanistic GIS: Towards a Research Agenda"

This talk is co-sponsored by the Department of Geography, the Smart Communities for Social Good Working Group and the Data Science Transdisciplinary Area of Excellence.



Zhao will introduce a newly proposed research perspective, humanistic GIS, that can better encompass the expanded category of GIS technology as well as the accompanying opportunities and challenges. Deeply rooted in humanistic geography, humanistic GIS offers a systematic framework that situates GIS in its mediation of human experience and further categorizes GIS through its embodiment, hermeneutic, autonomous, and background relations with the involved human and place. Humanistic GIS represents a shift from earlier waves of making or doing GIS, when GIS was primarily a technology of research and representation and did not mediate everyday emplaced human life as immediately as it does today.


Annie Qu, Chancellor's Professor of Statistics at the University of California at Irvine
Noon-1 p.m. Tuesday, Dec. 7

"Correlation Tensor Decomposition and its Application in Spatial Imaging Data"

This talk is part of the Data Science Seminar in the Department of Mathematical Sciences and is co-endorsed by the Data Science Transdisciplinary Area of Excellence and the Center for Imaging, Acoustics, and Perception Science at Binghamton University.

Meeting passcode, if needed: 13902

About the speaker: Before joining UC Irvine, Qu was Data Science Founder Professor of Statistics and the director of the Illinois Statistics Office at the University of Illinois at Urbana-Champaign. She was named a Brad and Karen Smith Professorial Scholar by the College of LAS at UIUC, received the NSF CAREER award for 2004-2009, and is a Fellow of the Institute of Mathematical Statistics and a Fellow of the American Statistical Association. She obtained her PhD from the Department of Statistics at the Pennsylvania State University. Qu's research focuses on solving fundamental issues regarding structured and unstructured large-scale data, and on developing cutting-edge statistical methods and theory in machine learning and algorithms for personalized medicine, text mining, recommender systems, medical imaging data and network data analyses for complex heterogeneous data. The newly developed methods are able to extract essential and relevant information from large-volume, high-dimensional data. Her research has had an impact in many fields, such as biomedical studies, genomic research, public health research, and social and political sciences.


Cen Wu, associate professor of statistics at Kansas State University and a faculty scientist at its Johnson Cancer Research Center
Noon-1 p.m. Tuesday, Nov. 16

This talk is organized by the Mathematical Sciences Department Data Science Seminar. 

"Robust Bayesian variable selection for gene-environment interactions"

Meeting ID: 986 2507 3234, Passcode 13902

Gene-environment (G×E) interactions have important implications for elucidating the etiology of complex diseases beyond the main genetic and environmental effects. Outliers and data contamination in disease phenotypes of G×E studies are commonly encountered, which has led to the development of a broad spectrum of robust regularization methods. Nevertheless, within the Bayesian framework, the issue has not been addressed in existing studies. In this talk, I will present a robust Bayesian variable selection method for G×E interaction studies. The proposed Bayesian method can effectively accommodate heavy-tailed errors and outliers in the response variable while conducting variable selection by accounting for structural sparsity. In particular, for the robust sparse group selection, spike-and-slab priors have been imposed on both individual and group levels to identify important main and interaction effects robustly. An efficient Gibbs sampler has been developed to facilitate fast computation. Extensive simulation studies and analysis of both the diabetes data with SNP measurements from the Nurses' Health Study and TCGA melanoma data with gene expression measurements demonstrate the superior performance of the proposed method over multiple competing alternatives.

About the speaker: Wu's current research has mainly been motivated by data contamination and heavy-tailed distributions that widely exist in disease phenotypes and multi-level omics measurements from cancers and other complex diseases. Tackling these problems in a high-dimensional setting demands robust variable selection methods, within both the frequentist and Bayesian frameworks. Dr. Wu’s statistical methodological work includes Bayesian sparse learning, high/ultra-high dimensional robust variable selection, and integrative analysis of cancer genomics data from multiple platforms.


Sergei V. Kalinin, group leader for the Data NanoAnalytics Group in the Center for Nanophase Materials Sciences at Oak Ridge National Laboratory
11 a.m.-noon Monday, Nov. 1

This talk is co-sponsored by the Data Science TAE; the Center for Imaging, Acoustics and Perception Science (CIAPS); and the Department of Physics.

"Machine Learning for Scanning Probe and Electron Microscopy: From Imaging to Atomic Fabrication"

Machine learning and artificial intelligence (ML/AI) are rapidly becoming an indispensable part of physics research, with domain applications ranging from theory and materials prediction to high-throughput data analysis. However, the constantly emerging question is how to match the correlative nature of classical ML with hypothesis-driven causal nature of physical sciences. In parallel, the recent successes in applying ML/AI methods for autonomous systems from robotics through self-driving cars to organic and inorganic synthesis are generating enthusiasm for the potential of these techniques to enable automated and autonomous experiment (AE) in imaging.

In this presentation, I will discuss recent progress in automated experiment in electron and scanning probe microscopy, ranging from feature to physics discovery via active learning. The applications of classical deep learning methods in streaming image analysis are strongly affected by out-of-distribution drift effects, and approaches to minimize these are discussed. We further present invariant variational autoencoders as a method to disentangle affine distortions and rotational degrees of freedom from other latent variables in imaging and spectral data. The analysis of the latent space of autoencoders further allows establishing physically relevant transformation mechanisms. Extension of the encoder approach towards establishing structure-property relationships will be illustrated on the example of plasmonic structures. I will briefly discuss the transition from correlative ML to physics discovery, incorporating prior knowledge and yielding generative physical models of observed phenomena. Finally, I illustrate the transition from post-experiment data analysis to an active learning process. Here, strategies based on simple Gaussian processes often tend to produce sub-optimal results due to the lack of prior knowledge and the very simplified (via learned kernel function) representation of the spatial complexity of the system. In comparison, deep kernel learning (DKL) methods allow both the exploration of complex systems towards the discovery of structure-property relationships and automated experiments targeting physics (rather than simple spatial feature) discovery. The latter is illustrated via experimental discovery of edge plasmons in STEM/EELS and ferroelectric domain dynamics in PFM.

This research is supported by the U.S. Department of Energy, Basic Energy Sciences, Materials Sciences and Engineering Division and the Center for Nanophase Materials Sciences, which is sponsored at Oak Ridge National Laboratory by the Scientific User Facilities Division, BES DOE.

Sergei Kalinin

About the speaker: Kalinin is a corporate fellow and a group leader at the Center for Nanophase Materials Sciences at Oak Ridge National Laboratory. He received his MS degree from Moscow State University in 1998 and PhD from the University of Pennsylvania (with Dawn Bonnell) in 2002. His research presently focuses on the applications of big data and artificial intelligence methods in atomically resolved imaging by scanning transmission electron microscopy and scanning probes for applications including physics discovery and atomic fabrication, as well as mesoscopic studies of electrochemical, ferroelectric, and transport phenomena via scanning probe microscopy.

He has co-authored more than 650 publications, with more than 33,000 total citations and an h-index of more than 94. He is a fellow of MRS, APS, IoP, IEEE, Foresight Institute, and AVS, and a recipient of the Blavatnik Award for Physical Sciences (2018), the RMS medal for Scanning Probe Microscopy (2015), the Burton medal of the Microscopy Society of America (2010), the Presidential Early Career Award for Scientists and Engineers (PECASE) (2009), four R&D100 Awards (2008, 2010, 2016, and 2018), and a number of other distinctions.


Giles Hooker, professor of statistics at the University of California, Berkeley
Noon-1 p.m. Tuesday, Oct. 26

This talk is organized by the Mathematical Sciences Department Data Science Seminar. 

"There is No Free Variable Importance: Traps in Interpreting Black Box Functions"

The field of machine learning – loosely defined as nonparametric statistical modeling – has become enormously successful over the past fifty years, partly by forgoing the parametric models familiar to statisticians. A consequence of this philosophy has been that these methods result in algebraically complex models that provide little human-accessible insight into the workings of the model, or what it might say about the underlying processes generating the data. As these methods have been taken up in high-stakes decision making, demands to “x-ray the black box” have become more prevalent, resulting in a wide variety of approaches to understand what signal the model is capturing or to provide explanations of individual predictions. Unfortunately, many of these methods produce results that can lead to mistaken conclusions about the model, or the underlying processes, or both. This talk reviews two sources of error: distorting the covariate distribution beyond the range where the model performs well and estimating structured surrogates using insufficient data. We show that many popular interpretation/explanation methods suffer from these, potentially resulting in mistaken conclusions or advice, and review the properties necessary to generate reliable explanation or interpretation.

About the speaker: Giles Hooker is a professor of statistics at the University of California, Berkeley. His work has focused on statistical methods using dynamical systems models, inference with machine learning models, functional data analysis, and robust statistics. He is the author of “Dynamic Data Analysis: Modeling Data with Differential Equations” and “Functional Data Analysis in R and MATLAB.” Much of his work has been inspired by collaborations, particularly in ecology, human movement and citizen science data.


Haoda Fu, research fellow and an enterprise lead for machine learning, artificial intelligence and digital connected care for Eli Lilly and Company
Noon-1 p.m. Tuesday, Sept. 21

This talk is organized by the Mathematical Sciences Department Data Science Seminar. 

"Our Recent Development on Cost Constrained Machine Learning Models"

Suppose we can only pay $100 to diagnose a disease subtype for selecting the best treatments. We can either measure 10 cheap biomarkers or 2 expensive ones. How can we pick the optimal combinations to achieve the highest diagnostic accuracy? This is a nontrivial problem. In a special case where each variable costs the same, the total cost constraint reduces to an L0 penalty, which is the best subset selection problem. Until recently, there was no good solution even for this special case. Traditional algorithms can only solve best subset selection for up to ~35 variables. Thanks to algorithmic breakthroughs in the field of optimization research, we have modified and extended a recently developed algorithm to handle our cost constraint problems with thousands of variables. In this talk, we will introduce the background of this problem, methods development, and theoretical results. We will also show an impressive example of dynamic programming. It will tell a story of how algorithms can make a difference in computing. We hope that through this presentation, the audience can get a feel for modern statistics, which combines computer science, statistics, and algorithms.

Haoda Fu

Haoda Fu is a research fellow and an enterprise lead for machine learning, artificial intelligence and digital connected care at Eli Lilly and Company. Fu is a Fellow of ASA (American Statistical Association). He is also an adjunct professor of biostatistics at the University of North Carolina at Chapel Hill and Indiana University School of Medicine. He received his PhD in statistics from the University of Wisconsin - Madison in 2007, and joined Lilly after that. Since he joined Lilly, he has been very active in statistics methodology research. He has more than 90 publications in such areas as Bayesian adaptive design, survival analysis, recurrent event modeling, personalized medicine, indirect and mixed treatment comparison, joint modeling, Bayesian decision making and rare events analysis. In recent years, his research has focused on machine learning and artificial intelligence. His research has been published in various top journals including JASA, JRSS, Biometrika, Biometrics, ACM, IEEE, JAMA, Annals of Internal Medicine, etc. He has taught machine learning and AI topics at large industry conferences and FDA workshops. He was on the board of directors for statistics organizations and was the program chair and committee chair of ICSA, ENAR and the ASA Biopharm section.


Invited Speakers and Guest Lecturers 2020-2021

Catherine D'Ignazio, assistant professor in the Department of Urban Studies and Planning, and director of the Data + Feminism Lab at MIT; and Lauren Klein, associate professor in the departments of English and Quantitative Theory and Methods, and director of the Digital Humanities Lab at Emory University
2-4 p.m. Friday, March 12 (presentation and Q&A to end at 3:30 p.m., followed by an informal coffee hour)

"Data Feminism"

As data are increasingly mobilized in the service of governments and corporations, their unequal conditions of production, their asymmetrical methods of application, and their unequal effects on both individuals and groups have become increasingly difficult for data scientists--and others who rely on data in their work--to ignore. But it is precisely this power that makes it worth asking: "Data science by whom? Data science for whom? Data science with whose interests in mind?" These are some of the questions that emerge from what we call data feminism, a way of thinking about data science and its communication that is informed by the past several decades of intersectional feminist activism and critical thought. Illustrating data feminism in action, this talk will show how challenges to the male/female binary can help to challenge other hierarchical (and empirically wrong) classification systems; it will explain how an understanding of emotion can expand our ideas about effective data visualization; how the concept of invisible labor can expose the significant human efforts required by our automated systems; and why the data never, ever “speak for themselves.” The goal of this talk, as with the project of data feminism, is to model how scholarship can be transformed into action: how feminist thinking can be operationalized in order to imagine more ethical and equitable data practices.

Catherine D'Ignazio
Catherine D’Ignazio is a scholar, artist/designer and hacker mama who focuses on feminist technology, data literacy and civic engagement. She has run reproductive justice hackathons, designed global news recommendation systems, created talking and tweeting water quality sculptures and led walking data visualizations to envision the future of sea level rise. With Rahul Bhargava, she built the platform Databasic.io, a suite of tools and activities to introduce newcomers to data science. Her 2020 book from MIT Press, Data Feminism, co-authored with Lauren Klein, charts a course for more ethical and empowering data science practices. Her research at the intersection of technology, design & social justice has been published in the Journal of Peer Production, the Journal of Community Informatics, and the proceedings of Human Factors in Computing Systems (ACM SIGCHI). Her art and design projects have won awards from the Tanne Foundation, Turbulence.org and the Knight Foundation and exhibited at the Venice Biennial and the ICA Boston. D’Ignazio is an assistant professor of urban science and planning in the Department of Urban Studies and Planning at MIT. She is also Director of the Data + Feminism Lab which uses data and computational methods to work towards gender and racial equity, particularly in relation to space and place.

Lauren Klein
Lauren Klein is an associate professor in the departments of English and Quantitative Theory & Methods at Emory University, where she also directs the Digital Humanities Lab. She works at the intersection of data science, digital humanities, and early American literature, with a research focus on issues of race and gender. She has designed platforms for exploring the contents of historical newspapers, recreated forgotten visualization schemes with fabric and addressable LEDs, and, with her students, cooked meals from early American recipes — and then visualized the results. In 2017, she was named one of the “rising stars in digital humanities” by Inside Higher Ed. She is the author of An Archive of Taste: Race and Eating in the Early United States (University of Minnesota Press, 2020) and, with Catherine D’Ignazio, Data Feminism (MIT Press, 2020). With Matthew K. Gold, she edits Debates in the Digital Humanities, a hybrid print-digital publication stream that explores debates in the field as they emerge. Her current project, "Data by Design: An Interactive History of Data Visualization, 1786-1900," was recently funded by an NEH-Mellon Fellowship for Digital Publication.


Toby Burrows, senior research fellow at the Oxford e-Research Centre, University of Oxford, and the School of Humanities, University of Western Australia

This talk is in cooperation with the Center for Medieval and Renaissance Studies (CEMERS) at Binghamton University.

3 p.m. Wednesday, March 10

Zoom meeting ID is 970 8903 8153

"Mapping Manuscript Migrations: Tracking the Travels of 220,000 Medieval and Renaissance Manuscripts"

Hundreds of thousands of medieval and Renaissance manuscripts still survive today, and detailed information about their history and provenance is scattered across a large number of databases and Web sites. Combining this kind of data was the focus of the Mapping Manuscript Migrations (MMM) project, which was funded by the Digging into Data program of the Trans-Atlantic Partnership between 2017 and 2020, and brought together manuscript researchers, curators, librarians, and computing specialists from institutions in Oxford, Philadelphia, Paris, and Helsinki. The project combined three large collections of data relating to the history and provenance of more than 220,000 medieval and Renaissance manuscripts.

This talk will discuss the work done by the MMM project, especially its deployment of Semantic Web and Linked Open Data technologies in order to transform, aggregate, and harmonize such a large body of data. It will also examine the various ways in which the data have been published: through a public Web portal, as a searchable Linked Open Data store, and as a downloadable dataset. It will demonstrate some of the ways in which the data can be used to answer research questions, including creating visualizations through the Web portal, and running SPARQL queries against the data store. We will also look at the way in which the project was organized, and how the contributions of specialists from such diverse fields were brought together. We will finish with some thoughts about how the lessons learned from the MMM project can be applied and developed in the future.


Hengchen Dai, assistant professor of management and organizations and behavioral decision making at the UCLA Anderson School of Management
2-3 p.m. Friday, Feb. 26, 2021

Join the Zoom meeting:

Meeting ID: 926 2283 7791
Passcode: 5-digit zip code for Binghamton University

"The Value of Customer-Related Information on Service Platforms: Evidence From a Large Field Experiment"

As digitization enables service platforms to access users' information, important questions arise about how digital service platforms should disseminate information to improve service capacity and enjoyment. We examine a strategy that involves providing customer-related information to individual service providers at the beginning of a service encounter. We causally evaluate this strategy via a field experiment on a large live-streaming platform that connects viewers and individual broadcasters. When viewers entered shows, we provided viewer-related information to broadcasters who were randomly assigned to the treatment condition (but not to control broadcasters). Our analysis, involving a subsample of 49,998 broadcasters, demonstrates that relative to control broadcasters, treatment broadcasters expanded service capacity by 12.62% by increasing both show frequency (3.31%) and show length (7.10%), thus earning 10.44% more based on our conservative estimate. Moreover, our intervention increased service enjoyment (measured by viewer watch time) by 4.51%. Two surveys and additional analyses provide evidence for two mechanisms and rule out several alternative explanations. Our low-cost, information-based intervention has important implications for digital service platforms that have little control over service providers’ work schedules and service quality.

Hengchen Dai
Hengchen Dai is an assistant professor of management and organizations as well as a faculty member in the behavioral decision making area at the Anderson School of Management at UCLA. She received her bachelor's degree from Peking University and her PhD from the University of Pennsylvania.

Her research primarily applies insights from behavioral economics and psychology to motivate people to behave in line with their long-term best interests and pursue their personal and professional goals. Her research also examines how different social forces, incentives, and technology affect users’ judgments and behaviors on online platforms.

She has published in leading academic journals such as Academy of Management Journal, Management Science, The Journal of Applied Psychology, Journal of Consumer Research, Journal of Marketing Research, and Psychological Science. Her research has been covered in major media outlets such as The Financial Times, The Wall Street Journal, Harvard Business Review, The New York Times, The Huffington Post, and The New Yorker.


Dennis Zhang, associate professor of operations and manufacturing management at the Olin Business School at Washington University in St. Louis
2-3 p.m. Friday, Oct. 30, 2020

"Customer Choice Models versus Machine-Learning: Finding Optimal Product Displays on Alibaba"

We compare the performance of two approaches for finding the optimal set of products to display to customers landing on Alibaba's two online marketplaces, Tmall and Taobao. Both approaches were placed online simultaneously and tested on real customers for one week. The first approach we test is Alibaba's current practice. This procedure embeds thousands of product and customer features within a sophisticated machine learning algorithm that is used to estimate the purchase probabilities of each product for the customer at hand. The products with the largest expected revenue (revenue * predicted purchase probability) are then made available for purchase. The downside of this approach is that it does not incorporate customer substitution patterns; the estimates of the purchase probabilities are independent of the set of products that eventually are displayed. Our second approach uses a featurized multinomial logit (MNL) model to predict purchase probabilities for each arriving customer. In this way we use less sophisticated machinery to estimate purchase probabilities, but we employ a model that was built to capture customer purchasing behavior and, more specifically, substitution patterns. We use historical sales data to fit the MNL model and then, for each arriving customer, we solve the cardinality-constrained assortment optimization problem under the MNL model online to find the optimal set of products to display. Our experiments show that despite the lower prediction power of our MNL-based approach, it generates significantly higher revenue per visit compared to the current machine learning algorithm with the same set of features. We also conduct various heterogeneous-treatment-effect analyses to demonstrate that the current MNL approach performs best for sellers whose customers generally only make a single purchase.

Dennis Zhang is a tenured associate professor of operations and manufacturing management at the Olin Business School. His research focuses on data-driven operations in the digital economy and platforms. He implements field experiments and uses observational data to improve operations.

Join the Zoom meeting at

Meeting ID: 926 2283 7791
Passcode: 5-digit zip code for Ƶ


Interdisciplinary Dean's Speaker Series in Data Science, 2019-2020

The Interdisciplinary Dean's Speaker Series in Data Science was in place for the 2019-2020 academic year and brought in the following speakers:

Amanda Larracuente, Assistant Professor and Stephen Biggar and Elisabeth Asaro Fellow in Data Science at the University of Rochester
Feb. 21, 2020

"Intragenomic Conflict in Drosophila: Satellite DNA and Drive"

Conflicts arise within genomes when genetic elements are selfish and fail to play by the rules. Meiotic drivers are selfish genetic elements found in a wide variety of taxa that cheat meiosis to bias their transmission to the next generation. One of the best-studied drive systems is an autosomal male driver found on the 2nd chromosome of Drosophila melanogaster called Segregation Distorter (SD). Males heterozygous for SD and sensitive wild type chromosomes transmit SD to >95% of their progeny, whereas female heterozygotes transmit SD fairly, to 50% of their progeny. SD is a sperm killer that targets sperm with large blocks of tandem satellite repeats (called Responder) for destruction through a chromatin condensation defect after meiosis. The molecular mechanism of drive is unknown. We combine genomic, cytological, and molecular methods to study the population dynamics of this system and how the driver and the target satellite DNA interact. These interactions provide insight into the regulation of satellite DNAs in spermatogenesis and the mechanisms of meiotic drive.


Arthur Spirling, Professor in the Department of Politics and Center for Data Science at New York University
Nov. 19, 2019

"Word Embeddings: What works, what doesn't, and how to tell the difference for applied research"

We consider the properties and performance of word embeddings techniques in the context of political science research. In particular, we explore key parameter choices — including context window length, embedding vector dimensions and the use of pre-trained vs locally fit variants — with respect to efficiency and quality of inferences possible with these models. Reassuringly, we show that results are generally robust to such choices for political corpora of various sizes and in various languages. Beyond reporting extensive technical findings, we provide a novel, crowdsourced “Turing test”-style method for examining the relative performance of any two models that produce substantive, text-based outputs. Encouragingly, we show that popular, easily available pre-trained embeddings perform at a level close to — or surpassing — both human coders and more complicated locally-fit models. For completeness, we provide best practice advice for cases where local fitting is required.

Spirling is professor of politics and data science at New York University. He is the deputy director and the director of graduate studies (MSDS) at the Center for Data Science, and chair of the executive committee of the Moore-Sloan Data Science Environment. He studies British political development and legislative politics more generally. His particular interests lie in the application of text-as-data/natural language processing, Bayesian statistics, machine learning, item response theory and generalized linear models. His substantive field is comparative politics, and he focuses primarily on the United Kingdom. Spirling received his PhD from the University of Rochester, Department of Political Science, in 2008. From 2008 to 2015, he was an assistant professor and then the John L. Loeb Associate Professor of the Social Sciences in the Department of Government at Harvard University. He is the faculty coordinator for the NYU Text-as-Data speaker series.

Andrew Gordon Wilson, Assistant Professor at the Courant Institute of Mathematical Sciences and Center for Data Science at New York University
Nov. 8, 2019

"How do we build models that learn and generalize?"

To answer scientific questions, and reason about data, we must build models and perform inference within those models. But how should we approach model construction and inference to make the most successful predictions? How do we represent uncertainty and prior knowledge? How flexible should our models be? Should we use a single model, or multiple different models? Should we follow a different procedure depending on how much data are available?

In this talk, he will present a philosophy for model construction, grounded in probability theory. He will exemplify this approach for scalable kernel learning and Gaussian processes, Bayesian deep learning, and understanding human learning.

Andrew Gordon Wilson is a faculty member in the Courant Institute and Center for Data Science at NYU. Before joining NYU, he was an assistant professor at Cornell University from 2016 to 2019. He was a research fellow in the Machine Learning Department at Carnegie Mellon University from 2014 to 2016, and completed his PhD at the University of Cambridge in 2014. His interests include probabilistic modelling, scientific computing, Gaussian processes, Bayesian statistics, and loss surfaces and generalization in deep learning.

Joseph Hogan, Carole and Lawrence Sirovich Professor of Public Health and Deputy Director of the Data Science Initiative at Brown University
Oct. 9, 2019

"Using Electronic Health Records Data for Predictive and Causal Inference Ƶ the HIV Care Cascade"

The HIV care cascade is a conceptual model describing essential steps in the continuum of HIV care. The cascade framework has been widely applied to define population-level metrics and milestones for monitoring and assessing strategies designed to identify new HIV cases, link individuals to care, initiate antiviral treatment and ultimately suppress viral load. Comprehensive modeling of the entire cascade is challenging because data on key stages of the cascade are sparse. Many approaches rely on simulations of assumed dynamical systems, frequently using data from disparate sources as inputs. However, growing availability of large-scale longitudinal cohorts of individuals in HIV care affords an opportunity to develop and fit coherent statistical models using single sources of data, and to use these models for both predictive and causal inferences. Using data from 90,000 individuals in HIV care in Kenya, we model progression through the cascade using a multistate transition model fitted using Bayesian Additive Regression Trees (BART), which allows considerable flexibility for the predictive component of the model. We show how to use the fitted model for predictive inference about important milestones and causal inference for comparing treatment policies. Connections to agent-based mathematical modeling are made. This is joint work with Yizhen Xu, Tao Liu, Rami Kantor and Ann Mwangi.

Hogan's research concerns development and application of statistical methods for large-scale observational data with emphasis on applications in HIV/AIDS. He is program director for the Moi-Brown Partnership for Biostatistics Training, which focuses on research capacity building at Moi University in Kenya.

Invited speakers from prior years

Ivo D. Dinov
Associate Director, Michigan Institute for Data Science
Director, Statistics Online Computational Resource
Professor, Computational Medicine and Bioinformatics, Human Behavior and Biological Sciences, University of Michigan

Ivo Dinov is an expert in mathematical modeling, statistical analysis, computational processing and visualization of Big Data. He is involved in longitudinal morphometric studies of human development (e.g., Autism, Schizophrenia), maturation (e.g., depression, pain) and aging (e.g., Alzheimer's and Parkinson's diseases). Dinov is developing, validating and disseminating novel technology-enhanced pedagogical approaches for scientific education and active learning.

Dinov will give two talks.

Michigan Institute for Data Science – Organization, Education Challenges and Research Opportunities
April 24, 2018

I will present the Michigan Institute for Data Science (MIDAS), a trans-collegiate institute at the University of Michigan. I will start by describing the multidisciplinary activities in data science at the University of Michigan. Then I will cover some of the scientific pursuits (development of concepts, methods, and technology) for data collection, management, analysis, and interpretation as well as their innovative use to address important problems in science, engineering, business, and other areas. We will end with an open-ended discussion of educational challenges, research opportunities and infrastructure demands in data science.

Compressive Big Data Analytics
April 24, 2018

I will start by showing examples of specific Big Data driving biomedical and health challenges. These will help us identify the common characteristics of Big Biomedical Data. We will also provide working definitions for "Data Science" and "Predictive Analytics". The core of the talk will be the mathematical foundation for analytically representing multisource, complex, incongruent, and multi-scale information as computable data objects. Specifically, I will describe the Compressive Big Data Analytics (CBDA) technique. Several applications to neurodegenerative disorders will be presented as case studies.

Invited speaker

J. S. Marron
Amos Hawley Distinguished Professor of Statistics and Operations Research and Professor of Biostatistics
University of North Carolina at Chapel Hill

J.S. Marron is widely recognized as a world research leader in the statistical disciplines of high-dimensional, functional and object-oriented data analysis, as well as data visualization. He has made broad major contributions ranging from the invention of innovative new statistical methods, through software development and on to statistical and mathematical theory. His research continues with a number of ongoing deep, interdisciplinary research collaborations with colleagues in computer science, genetics, medicine, mathematics and biology. A special strength is his strong record of mentoring graduate students, postdocs and junior faculty, in both statistics and related disciplinary fields.

Data Integration by JIVE: Joint and Individual Variation Explained
March 15, 2018

Abstract: A major challenge in the age of Big Data is the integration of disparate data types into a data analysis. That is tackled here in the context of data blocks measured on a common set of experimental subjects. This data structure motivates the simultaneous exploration of the joint and individual variation within each data block. This is done here in a way that scales well to large data sets (with blocks of wildly disparate size), using principal angle analysis, careful formulation of the underlying linear algebra, and differing outputs depending on the analytical goals. Ideas are illustrated using mortality, cancer and neuroimaging data sets.

OODA of Tree Structured Data Objects Using Persistent Homology
March 15, 2018

The field of Object Oriented Data Analysis has made a lot of progress on the statistical analysis of the variation in populations of complex objects. A particularly challenging example of this type is populations of tree-structured objects. Deep challenges arise, whose solutions involve a marriage of ideas from statistics, geometry, and numerical analysis, because the space of trees is strongly non-Euclidean in nature. Here these challenges are addressed using the approach of persistent homology from topological data analysis. The benefits of this data object representation are illustrated using a real data set, where each data point is the tree of blood arteries in one person's brain. Persistent homology gives much better results than those obtained in previous studies.

Object Oriented Data Analysis
March 16, 2018

Object Oriented Data Analysis is the statistical analysis of populations of complex objects. In the special case of Functional Data Analysis, these data objects are curves, where standard Euclidean approaches, such as principal components analysis, have been very successful. Challenges in modern medical image analysis motivate the statistical analysis of populations of more complex data objects which are elements of mildly non-Euclidean spaces, such as Lie Groups and Symmetric Spaces, or of strongly non-Euclidean spaces, such as spaces of tree-structured data objects. These new contexts for Object Oriented Data Analysis create several potentially large new interfaces between mathematics and statistics. The notion of Object Oriented Data Analysis also impacts data analysis, through providing a language for discussion of the many choices needed in many modern complex data analyses.

Invited speaker

Henry Kautz
Robin & Tim Wentworth Director of the Goergen Institute for Data Science and Professor
School of Computing, University of Rochester

Henry Kautz has served as department head at AT&T Bell Labs in Murray Hill, N.J., and as a full professor at the University of Washington, Seattle. In 2010 he was elected president of the Association for the Advancement of Artificial Intelligence (AAAI), and in 2016 he was elected chair of the AAAS Section on Information, Computing and Communication. His research in artificial intelligence, pervasive computing and healthcare applications has led him to be honored as a Fellow of the American Association for the Advancement of Science, among other fellowships.

Kautz will visit Ƶ on Nov. 2-3, 2017, and give two talks. The first is a technical talk and the second is an overview presentation for a general audience.

Mining Social Media to Improve Public Health
Nov. 2

Abstract: People posting to social media on smartphones can be viewed as an organic sensor network for public health data, picking up information about the spread of disease, lifestyle factors that influence health, and pinpointing sources of disease. We show how a faint but actionable signal can be detected in vast amounts of social media data using statistical natural language and social network models. We present case studies of predicting influenza transmission and per-city rates, discovering patterns of alcohol consumption in different neighborhoods, and tracking down the sources of foodborne illness.

Data Science: Foundation for the Future of Science, Healthcare, Business, and Education
Nov. 3

Abstract: Data science is the synthesis of computer science and statistics that is driving fundamental changes in essentially all aspects of society. While the applications of data science are incredibly broad, the discipline has a surprisingly small and coherent intellectual core, based on principles of statistical prediction and information management. In 2013, the University of Rochester adopted data science as the unifying theme for its five-year strategic plan, and created the Goergen Institute for Data Science. The Institute has created undergraduate and graduate degree programs in data science, helped hire faculty engaged in interdisciplinary research, seeded new research efforts, and grown partnerships with industry. As the University works on its 2018 five-year strategic plan, data science remains a key priority.

Guest Lecture

Joshua White, Defense Consultant

Joshua White is VP of engineering for Rsignia Inc., engaged primarily in data science-related activities as they relate to terrorism studies, social media/social sciences, high-performance computing, and high-speed network and protocol analysis research for both the defense and intelligence communities. He has served as an adjunct professor at the State University of New York Polytechnic Institute Utica/Rome campus in the Network Computer Security Department, Utica College in the Social Data Science Program and MVCC in the Data Analytics Micro Credential Program since 2014.

Social Networks and Big Data Analysis Techniques
Dec. 1, 2017

Abstract: "The #bluewhalechallenge is presented as a test analysis case for various social network and big data analysis techniques. In this presentation, we present the techniques used for collecting, indexing, and analyzing billions of documents in an attempt to discover who was controlling the challenge and who is participating in the challenge. Various techniques are not suitable for true large-scale analysis given the time and resources required. We identify those techniques that require no more than a reasonable amount of time and resources to compute while still resulting in reasonable results."