Friday, January 27, 2012

SPSP 2012: Watchdogs, Witch-hunts, and What to do about False-Positive Findings

The field of social-personality psychology has recently become sensitive to the data-reporting and analytic strategies that go into the publication of a research paper. Today at the “False Positive Findings are Frequent, Findable, and Fixable” symposium at SPSP, the three speakers presented some very polarizing observations about this trend in our field.

The Summary

In the first talk, Leslie John of Harvard University bravely discussed the prevalence of questionable data analysis practices in our field. The short answer: People engage in many data collection and analysis strategies that bias hypothesis testing and contribute to the publication of false-positive findings.

In the second talk, Joe Simmons from UPenn presented findings from a research paper suggesting that false positives are preventable with a few key changes in the way people report results and journals review papers. In summary, Simmons suggested that researchers should justify their sample-size cutoffs, conduct analyses both with and without covariates, report a list of all measures used in a study, and collect samples of sufficient size (at least n = 20 per cell), among other things. In addition, Simmons urged reviewers to put more emphasis on exact (rather than conceptual) replications, and to give authors a little slack for imperfect findings. Simmons further suggested that our natural tendency to justify whichever data-analytic strategy “works” as the correct one necessitates clear rules for data reporting.
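Why does this flexibility matter so much? A minimal simulation makes the point. The sketch below is my own simplification, not the paper’s exact scenarios: a hypothetical researcher compares two groups drawn from the *same* distribution (so any “effect” is a false positive), but measures two dependent variables and, if neither is significant at n = 20, adds 10 more subjects per group and re-tests. The z-test here treats sample variances as known, which is fine for illustration.

```python
import math
import random

random.seed(1)

def z_test_p(xs, ys):
    """Two-sided z-test p-value for a difference in means,
    treating sample variances as known (a sketch, not a real t-test)."""
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)
    z = (mx - my) / math.sqrt(vx / nx + vy / ny)
    # 2 * P(Z > |z|) under the standard normal
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def flexible_experiment():
    """Both groups come from the SAME distribution, so every
    'significant' result is a false positive. The simulated researcher
    has two DVs and peeks once before collecting more data."""
    a = [random.gauss(0, 1) for _ in range(20)]
    b = [random.gauss(0, 1) for _ in range(20)]
    c = [random.gauss(0, 1) for _ in range(20)]  # second DV, group 1
    d = [random.gauss(0, 1) for _ in range(20)]  # second DV, group 2
    if z_test_p(a, b) < .05 or z_test_p(c, d) < .05:
        return True  # stop early and report the 'effect'
    # Not significant yet: add 10 more subjects per group and re-test.
    for grp in (a, b, c, d):
        grp.extend(random.gauss(0, 1) for _ in range(10))
    return z_test_p(a, b) < .05 or z_test_p(c, d) < .05

trials = 2000
rate = sum(flexible_experiment() for _ in range(trials)) / trials
print(f"false-positive rate with flexibility: {rate:.3f}")
```

Even with just these two innocuous-looking choices, the realized false-positive rate lands well above the nominal 5%, which is the core of Simmons et al.’s argument.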

In the final talk, Uri Simonsohn of UPenn discussed what he refers to as “p-hacking.” The idea is that if researchers are engaging in questionable analysis practices, then they should have a disproportionate number of findings at or just below the p < .05 threshold for statistical significance, and that this pattern is relatively easy to detect. Simonsohn then presented an analysis of p-value distributions suggesting that Daniel Kahneman’s published findings show no signs of p-hacking, and proposed other uses for the technique—including assessing whether a journal publishes many false-positive findings, or whether a job candidate’s data can be trusted.
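The intuition behind the detection idea can be sketched in a few lines. This is a deliberately crude toy diagnostic of my own, not Simonsohn’s actual statistical test: among results reported as significant, real effects should produce a pile-up of very small p-values (a right-skewed distribution), whereas p-hacked results tend to cluster just under the .05 threshold.

```python
def p_curve_shape(significant_ps):
    """Toy p-curve diagnostic over p-values reported as < .05.
    Compares the count of very small p-values (< .01) against the
    count sitting just below the threshold (.04 <= p < .05)."""
    low = sum(1 for p in significant_ps if p < .01)
    high = sum(1 for p in significant_ps if .04 <= p < .05)
    if low > high:
        return "right-skewed (consistent with real effects)"
    if high > low:
        return "left-skewed (suspicious clustering near .05)"
    return "flat (ambiguous)"

# A literature with mostly tiny p-values vs. one hugging the threshold:
print(p_curve_shape([.001, .003, .008, .020, .040]))
print(p_curve_shape([.049, .048, .041, .020, .004]))
```

The real method uses formal tests on the full distribution and many findings per author or journal, but the skew comparison is the heart of it.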

My Reaction

I think that overall, there are some very interesting talking points worth considering in this symposium. For instance, the premium reviewers sometimes place on perfect data would be nice to rein in. I also think the practice of collecting a million exploratory variables, correlating them all, and seeing what relations emerge needs to stop. And I find the idea of p-hacking absolutely fascinating—I’m planning to apply it to my own findings when I get home from the conference (a topic for a later blog entry).

I do, however, have some reservations about the critical points in the symposium. First, I don’t think the speakers are considering the real monetary and time costs of some forms of data collection. For example, if my research question involves a special sample, and my limited access to that sample requires collecting all the data during a set window, I might reasonably collect many variables at once. Then, if the original hypothesis is not supported by the data, I cannot justify (1) not exploring the data or (2) exploring the data but not publishing those results [both of these possibilities were proposed by the speakers]. That’s just not a practical solution given the money and time it costs to recruit such a special sample.

I also wondered where a person should stop when reporting findings. Do they report the order of the measures? The race and gender of the experimenters? The day of the week? What is enough for total transparency?

Third, when Simonsohn mentioned using p-hacking to check whether researchers are faking data, I became concerned: a job candidate has very few papers to their name, and I wondered whether the speakers even know how many papers it would take for this methodology to yield a reliable estimate of whether someone is falsifying data.

Finally, isn’t putting greater emphasis on exact replications a more parsimonious solution? If a person’s findings can be replicated, then by definition the findings are real. The other virtue of exact replications is that they spare researchers a potential witch hunt.

What do you think about these ideas? I’d love to read your comments!

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366. PMID: 22006061


  1. Some system for rewarding researchers who do replications or reviews or meta-analyses might be called for. As of now, there's too much emphasis on originality.

    Of course, groundbreaking findings advance science, but our enthusiasm for them needs to be kept in check.

    What if there were some collectively maintained database all a field's major journals participated in keeping up to date, on which every theory was catalogued along with some sort of rating based on how many times the findings supporting it have been replicated?

    Researchers could go to the database and see if the idea they're studying has been looked into before and by how many other researchers. Contradictions could be highlighted and used to encourage studies to sort them out.

    Most important, the rating based on replication (and N-size) could serve as a guide to how much credence ought to be placed on the ideas.

    1. Hi Dennis, these are some interesting and innovative ways to solve the false-positive findings problem in psychology.

      Thanks for reading!

  2. "Finally, isn’t putting greater emphasis on exact replications a more parsimonious solution? If a person’s findings can be replicated, then by definition, the findings are real."

But what if they can't be replicated? Null findings are difficult to publish. I remember a lecturer suggesting that 'psychology is a Type I error'.

  3. Instead of wasting resources on witch hunts, maybe listen to Paul Meehl and stop relying on p-values in the first place.

  4. I have an interest in p-hacking from the opposite perspective - i.e., when can we rely on null data to support "evidence of absence" or "affirmative evidence against harm." I am a consultant who works with various clients who are accused of harming people with products and exposures.

    I suspect lots of the science used to support my adversaries' cases is flawed based on (inadvertent p-hacking).

    This is a very interesting topic to me.