Statistical Inference as Severe Testing


  Your interest may be in improving statistical pedagogy, which requires, to begin with, recognizing that no matter how sophisticated the technology has become, the nature and meaning of basic statistical concepts are more unsettled than ever. You could be teaching a methods course in psychology wishing to intersperse philosophy of science in a way that is both serious and connected to immediate issues of practice. You might be an introspective statistician, focused on applications, but wanting your arguments to be on surer philosophical grounds.

  Viewing statistical inference as severe testing will offer philosophers of science new avenues to employ statistical ideas to solve philosophical problems of induction, falsification, and demarcating science from pseudoscience. Philosophers of experiment should find insight into how statistical modeling bridges gaps between scientific theories and data. Scientists often question the relevance of philosophy of science to scientific practice. Through a series of excursions, tours, and exhibits, tools from the philosophy and history of statistics will be put directly to work to illuminate and solve problems of practice. I hope to galvanize philosophers of science and experimental philosophers to further engage with the burgeoning field of data science and reproducibility research.

  Fittingly, the deepest debates over statistical foundations revolve around very simple examples, and I stick to those. This allows getting to the nitty-gritty logical issues with minimal technical complexity. If there’s disagreement even there, there’s little hope with more complex problems. (I try to use the notation of discussants, leading to some variation.) The book would serve as a one-semester course, or as a companion to courses on research methodology, philosophy of science, or interdisciplinary research in science and society. Each tour gives a small set of central works from statistics or philosophy, but since the field is immense, I reserve many important references for further reading on the CUP-hosted webpage for this book, www.cambridge.org/mayo.

  Relation to Previous Work

  While (1) philosophy of science provides important resources to tackle foundational problems of statistical practice, at the same time, (2) the statistical method offers tools for solving philosophical problems of evidence and inference. My earlier work, such as Error and the Growth of Experimental Knowledge (1996), falls under the umbrella of (2), using statistical science for philosophy of science: to model scientific inference, solve problems about evidence (problem of induction), and evaluate methodological rules (does more weight accrue to a hypothesis if it is prespecified?). Error and Inference (2010), with its joint work and exchanges with philosophers and statisticians, aimed to bridge the two-way street of (1) and (2). This work, by contrast, falls under goal (1): tackling foundational problems of statistical practice. While doing so will constantly find us entwined with philosophical problems of inference, it is the arguments and debates currently engaging practitioners that take the lead for our journey.

  Join me, then, on a series of six excursions and 16 tours, during which we will visit three leading museums of statistical science and philosophy of science, and engage with a host of tribes marked by family quarrels, peace treaties, and shifting alliances. 1

  Acknowledgments

  I am deeply grateful to my colleague and frequent co-author, Aris Spanos. More than anyone else, he is to be credited for encouraging the connection between the error statistical philosophy of Mayo (1996) and statistical practice, and for developing an error statistical account of misspecification testing. I thank him for rescuing me time and again from being stalled by one or another obstacle. He has given me massive help with the technical aspects of this book, and with revisions to countless drafts of the entire manuscript.

  My ideas were importantly influenced by Sir David Cox. My debt is to his considerable work on statistical principles, to his involvement in conferences in 2006 (at Virginia Tech) and 2010 (at the London School of Economics), and to our writing joint papers (2006,² 2010³). I thank him for his steadfast confidence in this project, and for discussions leading to my identifying the unsoundness in arguments for the (strong) Likelihood Principle (Mayo 2014b) – an important backdrop to the evidential interpretation of error probabilities that figures importantly within.

  I have several people to thank for valuable ideas on many of the topics in this book through extensive blog comments (errorstatistics.com), and/or discussions on portions of this work: John Byrd, Nancy Cartwright, Robert Cousins, Andrew Gelman, Gerd Gigerenzer, Richard Gill, Prakash Gorroochurn, Sander Greenland, Brian Haig, David Hand, Christian Hennig, Thomas Kepler, Daniël Lakens, Michael Lew, Oliver Maclaren, Steven McKinney, Richard Morey, Keith O’Rourke, Caitlin Parker, Christian Robert, Nathan Schachtman, Stephen Senn, Cosma Shalizi, Kent Staley, Niels Waller, Larry Wasserman, Corey Yanofsky, and Stanley Young.

  Older debts recalled are to discussions and correspondence with George Barnard, Ronald Giere, Erich Lehmann, Paul Meehl, and Karl Popper.

  Key ideas in this work grew out of exchanges with Peter Achinstein, Alan Chalmers, Clark Glymour, Larry Laudan, Alan Musgrave, and John Worrall, published in Mayo and Spanos (2010). I am grateful for the stimulating conversations on aspects of this research during seminars and conferences at the London School of Economics, Centre for Philosophy of Natural and Social Science, in 2008, 2010, 2012, and 2016. I thank Virginia Tech and Doug Lind, chair of the philosophy department, for support and professional accommodations which were essential to this project. I obtained valuable feedback from graduate students of a 2014 seminar (with A. Spanos) on Statistical Inference and Modeling at Virginia Tech.

  I owe special thanks to Diana Gillooly and Cambridge University Press for supporting this project even when it existed only as a ten-page summary, and for her immense help throughout. I thank Esther Migueliz, Margaret Patterson, and Adam Kratoska for assistance in the production and preparation of this manuscript. For the figures in this work, I’m very appreciative of all Marcos Jiménez’ work. I am grateful to Mickey Mayo for graphics for the online component. I thank Madeleine Avirov, Mary Cato, Michael Fay, Nicole Jinn, Caitlin Parker, and Ellen Woodall for help with the copy-editing. For insightful comments and a scrupulous review of this manuscript, copy-editing, library, and indexing work, I owe mammoth thanks to Jean Anne Miller. For other essential support, I am indebted to Melodie Givens and William Hendricks.

  I am grateful to my son, Isaac Chatfield, for technical assistance, proofing, and being the main person to cook real food. My deepest debt is to my husband, George W. Chatfield, for his magnificent support of me and the study of E.R.R.O.R.S. 4 I dedicate this book to him.

  1 A bit of travel trivia for those who not only read to the end of prefaces, but check their footnotes: two museums will be visited twice, and one excursion will have no museums. With one exception, we engage current work through interaction with tribes, not museums. There’s no extra cost for the 26 souvenirs: A–Z.

  2 First Symposium on Philosophy, History, and Methodology of Error (ERROR 06), Virginia Tech.

  3 Statistical Science & Philosophy of Science: Where Do/Should They Meet in 2010 (and Beyond)? The London School of Economics and Political Science, Centre for the Philosophy of Natural and Social Science.

  4 Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science.

  Excursion 1

  How to Tell What’s True about Statistical Inference

  Itinerary

  Tour I Beyond Probabilism and Performance

  1.1 Severity Requirement: Bad Evidence, No Test (BENT)

  1.2 Probabilism, Performance, and Probativeness

  1.3 The Current State of Play in Statistical Foundations: A View From a Hot-Air Balloon

  Tour II Error Probing Tools versus Logics of Evidence

  1.4 The Law of Likelihood and Error Statistics

  1.5 Trying and Trying Again: The Likelihood Principle

  Tour I

  Beyond Probabilism and Performance

  I’m talking about a specific, extra type of integrity that is [beyond] not lying, but bending over backwards to show how you’re maybe wrong, that you ought to have when acting as a scientist.

  (Feynman 1974/1985, p. 387)

  It is easy to lie with statistics. Or so the cliché goes. It is also very difficult to uncover these lies without statistical methods – at least of the right kind. Self-correcting statistical methods are needed, and, with minimal technical fanfare, that’s what I aim to illuminate. Since Darrell Huff wrote How to Lie with Statistics in 1954, ways of lying with statistics are so well worn as to have emerged in reverberating slogans:

  Association is not causation.

  Statistical significance is not substantive significance.

  No evidence of risk is not evidence of no risk.

  If you torture the data enough, they will confess.

  Exposés of fallacies and foibles ranging from professional manuals and task forces to more popularized debunking treatises are legion. New evidence has piled up showing lack of replication and all manner of selection and publication biases. Even expanded “evidence-based” practices, whose very rationale is to emulate experimental controls, are not immune from allegations of illicit cherry picking, significance seeking, P-hacking, and assorted modes of extraordinary rendition of data. Attempts to restore credibility have gone far beyond the cottage industries of just a few years ago, to entirely new research programs: statistical fraud-busting, statistical forensics, technical activism, and widespread reproducibility studies. There are proposed methodological reforms – many are generally welcome (preregistration of experiments, transparency about data collection, discouraging mechanical uses of statistics), some are quite radical. If we are to appraise these evidence policy reforms, a much better grasp of some central statistical problems is needed.

  Getting Philosophical

  Are philosophies about science, evidence, and inference relevant here? Because the problems involve questions about uncertain evidence, probabilistic models, science, and pseudoscience – all of which are intertwined with technical statistical concepts and presuppositions – they certainly ought to be. Even in an open-access world in which we have become increasingly fearless about taking on scientific complexities, a certain trepidation and groupthink take over when it comes to philosophically tinged notions such as inductive reasoning, objectivity, rationality, and science versus pseudoscience. The general area of philosophy that deals with knowledge, evidence, inference, and rationality is called epistemology. The epistemological standpoints of leaders, be they philosophers or scientists, are too readily taken as canon by others. We want to understand what’s true about some of the popular memes: “All models are false,” “Everything is equally subjective and objective,” “P-values exaggerate evidence,” and “[M]ost published research findings are false” (Ioannidis 2005) – at least if you publish a single statistically significant result after data finagling. (Do people do that? Shame on them.) Yet R. A. Fisher, founder of modern statistical tests, denied that an isolated statistically significant result counts.

  [W]e need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result.

  (Fisher 1935b/1947, p. 14)

  Satisfying this requirement depends on the proper use of background knowledge and deliberate design and modeling.
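  To give a hedged numerical reading of Fisher’s “rarely fail” criterion, the sketch below simulates repeated experiments (the effect size, sample size, and one-sample t-test are my own illustrative assumptions, not anything stated by Fisher): with a genuine effect and a reasonably powered design, significance is reached in the large majority of runs, whereas with no effect “significant” results appear only sporadically, at roughly the alpha rate.

```python
# Illustrative simulation only (my assumptions, not Fisher's): a phenomenon is
# "experimentally demonstrable" when a competent experiment will rarely fail to
# reach significance. Compare repeated experiments with and without a real effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
alpha, n_obs, n_reps = 0.05, 50, 5000

def rejection_rate(true_mean):
    """Fraction of simulated experiments with p < alpha (one-sample t-test of mean = 0)."""
    rejections = 0
    for _ in range(n_reps):
        sample = rng.normal(loc=true_mean, scale=1.0, size=n_obs)
        if stats.ttest_1samp(sample, 0.0).pvalue < alpha:
            rejections += 1
    return rejections / n_reps

print(f"Rejection rate with a real effect (mean 0.5): {rejection_rate(0.5):.2f}")  # high: rarely fails
print(f"Rejection rate with no effect (mean 0.0):     {rejection_rate(0.0):.2f}")  # near alpha: sporadic
```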

  This opening excursion will launch us into the main themes we will encounter. You mustn’t suppose, by its title, that I will be talking about how to tell the truth using statistics. Although I expect to make some progress there, my goal is to tell what’s true about statistical methods themselves! There are so many misrepresentations of those methods that telling what is true about them is no mean feat. It may be thought that the basic statistical concepts are well understood. But I show that this is simply not true.

  Nor can you just open a statistical text or advice manual for the goal at hand. The issues run deeper. Here’s where I come in. Having long had one foot in philosophy of science and the other in foundations of statistics, I will zero in on the central philosophical issues that lie below the surface of today’s raging debates. “Getting philosophical” is not about articulating rarified concepts divorced from statistical practice. It is to provide tools to avoid obfuscating the terms and issues being bandied about. Readers should be empowered to understand the core presuppositions on which rival positions are based – and on which they depend.

  Do I hear a protest? “There is nothing philosophical about our criticism of statistical significance tests,” someone might say. “The problem is that a small P-value is invariably, and erroneously, interpreted as giving a small probability to the null hypothesis.” Really? P-values are not intended to be used this way; presupposing they ought to be so interpreted grows out of a specific conception of the role of probability in statistical inference. That conception is philosophical. Methods characterized through the lens of over-simple epistemological orthodoxies are methods misapplied and mischaracterized. This may lead one to lie, however unwittingly, about the nature and goals of statistical inference, when what we want is to tell what’s true about them.

  1.1 Severity Requirement: Bad Evidence, No Test (BENT)

  Fisher observed long ago, “[t]he political principle that anything can be proved by statistics arises from the practice of presenting only a selected subset of the data available” (Fisher 1955, p. 75). If you report results selectively, it becomes easy to prejudge hypotheses: yes, the data may accord amazingly well with a hypothesis H, but such a method is practically guaranteed to issue so good a fit even if H is false and not warranted by the evidence. If it is predetermined that a way will be found to either obtain or interpret data as evidence for H, then data are not being taken seriously in appraising H. H is essentially immune to having its flaws uncovered by the data. H might be said to have “passed” the test, but it is a test that lacks stringency or severity. Everyone understands that this is bad evidence, or no test at all. I call this the severity requirement. In its weakest form it supplies a minimal requirement for evidence:

  Severity Requirement (weak): One does not have evidence for a claim if nothing has been done to rule out ways the claim may be false. If data x agree with a claim C but the method used is practically guaranteed to find such agreement, and had little or no capability of finding flaws with C even if they exist, then we have bad evidence, no test (BENT).

  The “practically guaranteed” acknowledges that even if the method had some slim chance of producing a disagreement when C is false, we still regard the evidence as lousy. Little if anything has been done to rule out erroneous construals of data. We’ll need many different ways to state this minimal principle of evidence, depending on context.
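  To see the “practically guaranteed” clause in miniature, here is a minimal simulation sketch (my own illustration; the twenty candidate variables and one-sample t-tests are assumptions, not anything in the text). A procedure that hunts through many unrelated variables and reports whichever agrees best with a hypothesis will declare “success” most of the time even though every null hypothesis is true: good fit, bad evidence, no test.

```python
# Minimal BENT illustration (not from the text): a "method" that searches many
# pure-noise variables for a nominally significant result will find one far more
# often than not, so a "pass" says little about the claim being probed.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_trials = 2000      # simulated research projects
n_vars = 20          # candidate variables searched per project
n_obs = 30           # observations per variable
alpha = 0.05

hits = 0
for _ in range(n_trials):
    data = rng.normal(size=(n_vars, n_obs))                        # no real effects anywhere
    pvals = [stats.ttest_1samp(row, 0.0).pvalue for row in data]   # test each variable
    if min(pvals) < alpha:                                         # report the best-looking one
        hits += 1

print(f"Probability the search 'finds' a significant effect: {hits / n_trials:.2f}")
# With 20 tries at alpha = 0.05, roughly 1 - 0.95**20, about 0.64: the procedure
# has little or no capability of failing to find agreement, even with no effect.
```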

  A Scandal Involving Personalized Medicine

  A recent scandal offers an example. Over 100 patients signed up for the chance to participate in the Duke University (2007–10) clinical trials that promised a custom-tailored cancer treatment. A cutting-edge prediction model developed by Anil Potti and Joseph Nevins purported to predict your response to one or another chemotherapy based on large data sets correlating properties of various tumors and positive responses to different regimens (Potti et al. 2006). Gross errors and data manipulation eventually forced the trials to be halted. It was revealed in 2014 that a whistleblower – a student – had expressed concerns that

  … in developing the model, only those samples which fit the model best in cross validation were included. Over half of the original samples were removed. … This was an incredibly biased approach.

  (Perez 2015)

  In order to avoid the overly rosy predictions that ensue from a model built to fit the data (called the training set), a portion of the data (called the test set) is to be held out to “cross validate” the model. If any unwelcome test data are simply excluded, the technique has obviously not done its job. Unsurprisingly, when researchers at a different cancer center, Baggerly and Coombes, set out to avail themselves of this prediction model, they were badly disappointed: “When we apply the same methods but maintain the separation of training and test sets, predictions are poor” (Coombes et al. 2007, p. 1277). Predicting which treatment would work was no better than chance.
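  The mechanics are worth seeing in miniature. The sketch below is purely illustrative (synthetic noise data and a scikit-learn logistic regression are my own assumptions; it is not the Potti and Nevins model): dropping the development samples that fit worst in cross-validation typically inflates the apparent cross-validated accuracy, while evaluation on an untouched held-out test set, the separation Baggerly and Coombes maintained, stays near chance.

```python
# Generic illustration (not the Potti-Nevins model): features are pure noise, so
# honest accuracy should hover near 0.5. Keeping only the development samples the
# model fits best in cross-validation typically inflates the apparent CV accuracy,
# while an untouched held-out test set stays near chance.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, cross_val_score, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))        # synthetic "genomic" features, no real signal
y = rng.integers(0, 2, size=300)      # treatment-response labels, pure noise

X_dev, X_holdout, y_dev, y_holdout = train_test_split(X, y, test_size=0.33, random_state=0)
model = LogisticRegression(max_iter=1000)

# Honest 5-fold cross-validated accuracy on the development data: near chance.
honest_cv = cross_val_score(model, X_dev, y_dev, cv=5).mean()

# Biased move: discard the development samples the model gets wrong in CV, then re-run CV.
cv_pred = cross_val_predict(model, X_dev, y_dev, cv=5)
keep = cv_pred == y_dev
biased_cv = cross_val_score(model, X_dev[keep], y_dev[keep], cv=5).mean()

# The real check: fit on all development data, evaluate on the untouched holdout.
holdout_acc = model.fit(X_dev, y_dev).score(X_holdout, y_holdout)

print(f"Honest 5-fold CV accuracy:                     {honest_cv:.2f}")
print(f"CV accuracy after keeping only 'good' samples: {biased_cv:.2f}")
print(f"Accuracy on untouched held-out test set:       {holdout_acc:.2f}")
```

The point is not the particular numbers but the structure: the biased estimate is computed after the data have been screened by the very model being evaluated, so it has little or no capability of revealing the model’s unreliability.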

  You might be surprised to learn that Potti dismissed their failed replication on grounds that they didn’t use his method (Potti and Nevins 2007)! But his technique had little or no ability to reveal the unreliability of the model, and thus failed utterly as a cross check. By contrast, Baggerly and Coombes’ approach informed about what it would be like to apply the model to brand new patients – the intended function of the cross validation. Medical journals were reluctant to publish Baggerly and Coombes’ failed replications and report of critical flaws. (It eventually appeared in a statistics journal, Annals of Applied Statistics 2009, thanks to editor Brad Efron.) The clinical trials – yes on patients – were only shut down when it was discovered Potti had exaggerated his honors in his CV! The bottom line is, tactics that stand in the way of discovering weak spots, whether for prediction or explanation, create obstacles to the severity requirement; it would be puzzling if accounts of statistical inference failed to place this requirement, or something akin to it, right at the center – or even worse, permitted loopholes to enable such moves. Wouldn’t it?