Sloppy Science: Still Someone Else’s Problem?

“The Somebody Else’s Problem field is much simpler and more effective, and what’s more can be run for over a hundred years on a single torch battery… An SEP is something we can’t see, or don’t see, or our brain doesn’t let us see, because we think that it’s somebody else’s problem…. The brain just edits it out, it’s like a blind spot”.

Douglas Adams (1952 – 2001) Life, The Universe, and Everything

The very first blog post I wrote (back in March 2013), for the Institute of Physics’ now sadly defunct physicsfocus project, was titled “Are Flaws in Peer Review Someone Else’s Problem?” and cited the passage above from the incomparable, and sadly missed, Mr. Adams. The post described the trials and tribulations my colleagues and I were experiencing at the time in trying to critique some seriously sloppy science, on the subject of ostensibly “striped” nanoparticles, that had been published in very high profile journals by a very high profile group. Not that I suspected it at the time of writing the post, but that particular saga ended up dragging on and on, involving a litany of frustrations in our attempts to correct the scientific record.

I’ve been put in mind of the stripy saga, and that six-year-old post, for a number of reasons lately. First, the most recent stripe-related paper from the group whose work we critiqued makes absolutely no mention of the debate. It’s as if our criticism never existed; the issues we raised, and the surrounding controversy, are simply ignored in that group’s most recent work.

More importantly, however, I have been following Ken Rice‘s (and others’) heated exchange with the authors of a similarly fundamentally flawed paper very recently published in Scientific Reports [Oscillations of the baseline of solar magnetic field and solar irradiance on a millennial timescale, VV Zharkova, SJ Shepherd, SI Zharkov, and E Popova, Sci. Rep. 9 9197 (2019)]. Ken’s blog post on the matter is here, and the ever-expanding PubPeer thread (225 comments at the time of writing, and counting) is here. Michael Brown‘s take-no-prisoners take-down tweets on the matter are also worth reading…

The debate made it into the pages — sorry, pixels — of The Independent a few days ago: “Journal to investigate controversial study claiming global temperature rise is due to Earth moving closer to Sun”.

Although the controversy in this case is related to physics happening on astronomically larger length scales than those at the heart of our stripy squabble, there are quite a number of parallels (and not just in terms of traffic to the PubPeer site and the tenor of the authors’ responses). Some of these are laid out in the following Tweet thread by Ken…

The Zharkova et al. paper makes fundamental errors that should never have passed through peer review. But then we all know that peer review is far from perfect. The question is what should happen to a paper that is not fraudulent but still makes it to publication containing misleadingly sloppy and/or incorrect science? Should it remain in the scientific record? Or should it be retracted?

It turns out that this is a much more contested issue than it might appear at first blush. For what it’s worth, I am firmly of the opinion that a paper containing fundamental errors in the science and/or based on mistakes due to clearly definable f**k-ups/corner-cutting in experimental procedure should be retracted. End of story. It is unfair on other researchers — and, I would argue, blatantly unethical in many cases — to leave a paper in the literature that is fundamentally flawed. (Note that even retracted papers continue to accrue citations.) It is also a massive waste of taxpayers’ money to fund new research based on flawed work.

Here’s one example of what I mean, taken from personal, and embarrassing, experience. I screwed up the calibration of a tuning fork sensor used in a set of atomic force microscopy experiments. We discovered this screw-up after publication of the paper that was based on measurements with that particular sensor. Should that paper have remained in the literature? Absolutely not.

Some, however, including my friend and colleague Mike Merrifield, who is also Head of School here and with whom I enjoy the ever-so-occasional spat, have a slightly different take on the question of retractions:

Mike and I discussed the Zharkova et al. controversy both briefly at tea break and via an e-mail exchange last week, and it seems that there are distinct cultural differences between different sub-fields of physics when it comes to correcting the scientific record. I put the Gedankenexperiment described below to Mike and asked him whether we should retract the Gedankenpaper. The particular scenario outlined in the following stems from an exchange I had with Alessandro Strumia a few months back, and subsequently with a number of my particle physicist colleagues (both at Nottingham and elsewhere), re. the so-called 750 GeV anomaly at CERN…

“Mike, let’s say that some of us from the Nanoscience Group go to the Diamond Light Source to do a series of experiments. We acquire a set of X-ray absorption spectra that are rather noisy because, as ever, the experiment didn’t bloody well work until the last day of beamtime and we had to pack our measurements into the final few hours. Our signal-to-noise ratio is poor but we decide to not only interpret a bump in a spectrum as a true peak, but to develop a sophisticated (and perhaps even compelling) theory to explain that “peak”. We publish the paper in a prestigious journal, because the theory supporting our “peak” suggests the existence of an exciting new type of quasiparticle. 

We return to the synchrotron six months or a year later, repeat the experiment over and over but find no hint of the “peak” on which we based our (now reasonably well-cited) analysis. We realise that we had over-interpreted a statistical noise blip.

Should we retract the paper?”
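(Before we get to the answers, a quick aside on just how easy it is to conjure a convincing-looking “peak” out of pure noise when you scan enough bins — the so-called look-elsewhere effect. The following is a minimal, purely illustrative sketch; the number of spectra, the number of bins, and the 3-sigma threshold are all invented for the purpose and have nothing to do with any real beamtime data.)

```python
# A rough illustration of the look-elsewhere effect: scan enough independent
# bins of pure noise and an apparently "significant" bump will turn up
# somewhere surprisingly often. All numbers here are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
n_spectra, n_bins = 10_000, 200          # 10,000 fake spectra, 200 bins each

spectra = rng.normal(loc=0.0, scale=1.0, size=(n_spectra, n_bins))
max_excess = spectra.max(axis=1)         # largest upward fluctuation per spectrum

frac_with_3sigma_bump = np.mean(max_excess > 3.0)
print(f"Fraction of pure-noise spectra containing a >3-sigma 'peak': "
      f"{frac_with_3sigma_bump:.2f}")
```

With those (entirely arbitrary) settings, roughly one in four featureless spectra contains at least one bump that looks, locally, like a three-sigma effect. Low signal-to-noise plus a wide scan range is a recipe for fooling ourselves.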

I am firmly of the opinion that the paper should be retracted. After all, we could not reproduce our results when we did the experiment correctly. We didn’t bend over backwards in the initial experiment to convince ourselves that our data were robust and reliable and instead rushed to publish (because we were so eager to get a paper out of the beamtime). So now we should eat humble pie for jumping the gun — the paper should be retracted and the scientific record should be corrected accordingly.

Mike, and others, were of a different opinion, however. They argued that the flawed paper should remain in the scientific literature, sometimes for the reasons to which Mike alludes in his tweet above [1]. In my conversations with particle physicists re. the 750 GeV anomaly, which arose from a similarly over-enthusiastically interpreted bump in a spectrum that turned out to be noise, there was similarly strong resistance to correcting the scientific record. There appeared to be a feeling that only if the data were fabricated or fraudulent should the paper be retracted.

During the e-mail exchanges with my particle physics colleagues, I was struck on more than one occasion by a disturbing disconnect between theory and experiment. (This is hardly the most original take on the particle physics field, I know. I’ll take a moment to plug Sabine Hossenfelder’s Lost In Math once again.) There was an unsettling (for me) feeling among some that it didn’t matter if experimental noise had been misinterpreted, as long as the paper led to some new theoretical insights. This, I’ll stress, was not an opinion universally held — some of my colleagues said they didn’t go anywhere near the 750 GeV excess because of the lack of strong experimental evidence. Others, however, were more than willing to enthusiastically over-interpret the 750 GeV “bump” and, unsurprisingly, baulked at the suggestion that their papers should be retracted or censured in any way. If their sloppy, credulous approach to accepting noise in lieu of experimental data had advanced the field, then what’s wrong with that? After all, we need intrepid pioneers who will cross the Pillars of Hercules…

I’m a dyed-in-the-wool experimentalist; science should be driven by a strong and consistent feedback loop between experiment and theory. If a scientist mistakes experimental noise (or well-understood experimental artefacts) for valid data, or if they get fundamental physics wrong à la Zharkova et al., then there should be — must be — some censure for this. After all, we’d censure our undergrad students under similar circumstances, wouldn’t we? One student carries out an experiment for her final year project carefully and systematically, repeating measurements, driving her signal-to-noise ratio up, putting in the hours to carefully refine and redefine the experimental protocols and procedures, refusing to make claims that are not entirely supported by the data. Another student instead gets over-excited when he sees a “signal” that chimes with his expectations, and instead of doing his utmost to make sure he’s not fooling himself, leaps to a new and exciting interpretation of the noisy data. Which student should receive the higher grade? Which student is the better scientist?

As that grand empiricist Francis Bacon put it centuries ago,

The understanding must not therefore be supplied with wings, but rather hung with weights, to keep it from leaping and flying.

It’s up to not just individual scientists but the scientific community as a whole to hang our collective understanding with weights. Sloppy science is not just someone else’s problem. It’s everyone’s problem.

[1] Mike’s suggestion in his tweet that the journal would like to retract the paper to spare their blushes doesn’t chime with our experience of journals’ reactions during the stripy saga. Retraction is the last thing they want because it impacts their brand.

 

At sixes and sevens about 3* and 4*

The post below appears in today’s Times Higher Education under the title “The REF’s star system leaves a black hole in fairness.” My original draft was improved immensely by Paul Jump‘s edits (but I am slightly miffed that my choice of title (above) was rejected by the sub-editors.) I’m posting the article here for those who don’t have a subscription to the THE. (I should note that the interview panel scenario described below actually happened. The question I asked was suggested in the interview pack supplied by the “University of True Excellence”.)


“In your field of study, Professor Aspire, just how does one distinguish a 3* from a 4* paper in the research excellence framework?”

The interviewee for a senior position at the University of True Excellence – names have been changed to protect the guilty – shuffled in his seat. I leaned slightly forward after posing the question, keen to hear his response to this perennial puzzler that has exercised some of the UK’s great and not-so-great academic minds.

He coughed. The panel – on which I was the external reviewer – waited expectantly.

“Well, a 4* paper is a 3* paper except that your mate is one of the REF panel members,” he answered.

I smiled and suppressed a giggle.

Other members of the panel were less amused. After all, the rating and ranking of academics’ outputs is serious stuff. Careers – indeed, the viability of entire departments, schools, institutes and universities – depend critically on the judgements made by peers on the REF panels.

Not only do the ratings directly influence the intangible benefits arising from the prestige of a high REF ranking, they also translate into cold, hard cash. An analysis by the University of Sheffield suggests that in my subject area, physics, the average annual value of a 3* paper for REF 2021 is likely to be roughly £4,300, whereas that of a 4* paper is £17,100. In other words, the formula for allocating “quality-related” research funding is such that a paper deemed 4* is worth four times one judged to be 3*; as for 2* (“internationally recognised”) or 1* (“nationally recognised”) papers, they are literally worthless.
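To make the arithmetic explicit, here’s a back-of-the-envelope sketch using the approximate per-paper values quoted above. (These are the Sheffield estimates for physics, not official funding-council figures, and the six-paper “submission” is entirely hypothetical.)

```python
# Back-of-the-envelope QR funding arithmetic using the approximate per-paper
# values quoted above (Sheffield's estimates for physics, REF 2021). These
# figures are illustrative only -- not official funding-council numbers.

ANNUAL_VALUE = {4: 17_100, 3: 4_300, 2: 0, 1: 0, 0: 0}   # pounds per paper per year

def annual_qr_income(star_ratings):
    """Notional annual QR income for a list of per-paper star ratings."""
    return sum(ANNUAL_VALUE[stars] for stars in star_ratings)

submission = [4, 4, 3, 3, 3, 2]              # a hypothetical six-paper return
print(annual_qr_income(submission))          # 47,100 per year on these figures

# The non-linearity is what makes a one-star error so costly:
print(ANNUAL_VALUE[4] - ANNUAL_VALUE[3])     # 12,800 lost if a 4* is scored as 3*
print(ANNUAL_VALUE[3] - ANNUAL_VALUE[2])     # 4,300 lost if a 3* is scored as 2*
```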

We might have hoped that before divvying up more than £1 billion of public funds a year, the objectivity, reliability and robustness of the ranking process would be established beyond question. But, without wanting to cast any aspersions on the integrity of REF panels, I’ve got to admit that, from where I was sitting, Professor Aspire’s tongue-in-cheek answer regarding the difference between 3* and 4* papers seemed about as good as any – apart from, perhaps, “I don’t know”.

The solution certainly isn’t to reach for simplistic bibliometric numerology such as impact factors or SNIP indicators; anyone making that suggestion is not displaying even the level of critical thinking we expect of our undergraduates. But every academic also knows, deep in their studious soul, that peer review is far from wholly objective. Nevertheless, university senior managers – many of them practising or former academics themselves – are often all too willing, as part of their REF preparations, to credulously accept internal assessors’ star ratings at face value, with sometimes worrying consequences for the researcher in question (especially if the verdict is 2* or less).

Fortunately, my institution, the University of Nottingham, is a little more enlightened – last year it had the good sense to check the consistency of the internal verdicts on potential REF 2021 submissions by asking independent reviewers to score each paper. The results were sobering. Across seven scientific units of assessment, the level of full agreement between reviewers varied from 50 per cent to 75 per cent. In other words, in the worst cases, reviewers agreed on the star rating for no more than half of the papers they reviewed.

Granted, the vast majority of the disagreements were by just a single star; very few pairs of reviewers were “out” by two stars, and none disagreed by more. But this is cold comfort. The REF’s credibility is based on an assumption that reviewers can quantitatively assess the quality of a paper with a precision better than one star. As our exercise shows, the effective error bar is actually ± 1*.
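A toy model makes the point. Suppose each reviewer lands on the “true” star rating with some probability and is otherwise out by one star, up or down with equal likelihood. The sketch below (entirely illustrative — this is not the Nottingham data, and it deliberately ignores boundary effects at 1* and 4*) shows the level of full agreement between two such reviewers:

```python
# Toy model: each reviewer hits the "true" star rating with probability
# p_correct and is otherwise off by one star (up or down, equally likely).
# How often do two such independent reviewers fully agree? Illustrative only;
# boundary effects (you can't score 5*) are deliberately ignored.
import numpy as np

rng = np.random.default_rng(1)

def agreement_rate(p_correct, n_papers=100_000):
    """Fraction of papers on which two noisy, independent reviewers agree."""
    true_scores = rng.integers(2, 5, size=n_papers)     # true ratings of 2*, 3* or 4*
    def noisy(truth):
        error = rng.choice([-1, 0, 1], size=truth.size,
                           p=[(1 - p_correct) / 2, p_correct, (1 - p_correct) / 2])
        return truth + error
    return np.mean(noisy(true_scores) == noisy(true_scores))

for p in (0.7, 0.8, 0.9):
    print(f"individual 'hit rate' {p:.0%}  ->  full agreement ~ {agreement_rate(p):.0%}")
```

Even reviewers who individually land on the “right” star 80 or 90 per cent of the time fully agree with each other only around 65 to 80 per cent of the time, so the observed 50 to 75 per cent agreement is entirely consistent with an effective error bar of about one star per reviewer.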

That would be worrying enough if there were a linear scaling of financial reward. But the problem is exacerbated dramatically by both the 4x multiplier for 4* papers and the total lack of financial reward for anything deemed to be below 3*.

The Nottingham analysis also examined the extent to which reviewers’ ratings agreed with authors’ self-scoring (let’s leave aside any disagreement between co-authors on that). The level of full agreement here was similarly patchy, varying between 47 per cent and 71 per cent. Unsurprisingly, there was an overall tendency for authors to “overscore” their papers, although underscoring was also common.

Some argue that what’s important is the aggregate REF score for a department, rather than the ratings of individual papers, because, according to the central limit theorem, any wayward ratings will “wash out” at the macro level. I disagree entirely. Individual academics across the UK continue to be coaxed and cajoled into producing 4* papers; there are even dedicated funding schemes to help them do so. And the repercussions arising from failure can be severe.

It is vital in any game of consequence that participants be able to agree when a goal has been scored or a boundary hit. Yet, in the case of research quality, there are far too many cases in which we just can’t. So the question must be asked: why are we still playing?

“The drum beats out of time…”

Far back in the mists of time, in those halcyon days when the Brexit referendum was still but a comfortably distant blot on the horizon and Trump’s lie tally was a measly sub-five-figures, I had the immense fun of working with Brady Haran and Sean Riley on this…

As that video describes, we tried an experiment in crowd-sourcing data via YouTube for an analysis of the extent to which fluctuations in timing might be a signature characteristic of a particular drummer (or drumming style). Those Sixty Symbols viewers who very kindly sent us samples of their drumming — all 78 of you [1] — have been waiting a very, very long time for this update. My sincere thanks for contributing and my profuse apologies for the exceptionally long delay in letting you know just what happened to the data you sent us. The good news is that a paper, Rushing or Dragging? An Analysis of the “Universality” of Correlated Fluctuations in Hi-hat Timing and Dynamics (which was uploaded to the arXiv last week), has resulted from the drumming fluctuations project. The abstract reads as follows.

A previous analysis of fluctuations in a virtuoso (Jeff Porcaro) drum performance [Räsänen et al., PLoS ONE 10(6): e0127902 (2015)] demonstrated that the rhythmic signal comprised both long range correlations and short range anti-correlations, with a characteristic timescale distinguishing the two regimes. We have extended Räsänen et al.’s approach to a much larger number of drum samples (N=132, provided by a total of 58 participants) and to a different performance (viz., Rush’s Tom Sawyer). A key focus of our study was to test whether the fluctuation dynamics discovered by Räsänen et al. are “universal” in the following sense: is the crossover from short-range to long-range correlated fluctuations a general phenomenon or is it restricted to particular drum patterns and/or specific drummers? We find no compelling evidence to suggest that the short-range to long-range correlation crossover that is characteristic of Porcaro’s performance is a common feature of temporal fluctuations in drum patterns. Moreover, level of experience and/or playing technique surprisingly do not play a role in influencing a short-range to long-range correlation cross-over. Our study also highlights that a great deal of caution needs to be taken when using the detrended fluctuation analysis technique, particularly with regard to anti-correlated signals.

There’s also some bad news. We’ll get to that. First, a few words on the background to the project.

Inspired by a fascinating paper published by Esa Räsänen (of Tampere University) and colleagues back in 2015, a few months before the Sixty Symbols video was uploaded, we were keen to determine whether the correlations observed by Esa et al. in the fluctuations of an iconic drummer’s performance — that of the late, great Jeff Porcaro — were a common feature of drumming.

Why do we care — and why should you care — about fluctuations in drumming? Surely we physicists should be doing something much more important with our time, like, um, curing cancer…

OK, maybe not.

More seriously, there are very many good reasons why we should study fluctuations (aka noise) in quite some detail. Often, noise is the bane of an experimental physicist’s life. We spend inordinate amounts of time chasing down and attempting to eliminate sources of noise, be it contamination at a specific frequency (e.g. mains “hum” at 50 Hz or 60 Hz [2]) or, sometimes more frustratingly, noise spread right across the frequency spectrum, forming what’s known as white noise. (Noise can be of many colours other than white — just as with a spectrum of light, it all depends on which frequencies are present.)

But noise is most definitely not always just a nuisance to be avoided/eliminated at all costs; there can be a wealth of information embedded in the apparent messiness. Pink noise, for example, crops up in many weird and wonderful — and, indeed, many not-so-weird-and-not-so-wonderful — places, from climate change, to fluctuations in our heartbeats, to variations in the stock exchange, to current flow in electronic devices, and to mutations occurring during the expansion of a cancerous tumour. An analysis of the character and colour of noise can provide compelling insights into the physics and maths underpinning the behaviour of everything from molecular self-assembly to the influence and impact of social media.
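For anyone who’d like to see what “colour” means in practice, here’s a minimal sketch (not connected to any of the studies mentioned here) that generates white and pink noise and estimates the spectral exponent β in S(f) ∝ 1/f^β from a log–log fit: β ≈ 0 for white noise, β ≈ 1 for pink.

```python
# A quick sketch of noise "colour": white noise has a flat power spectrum,
# pink (1/f) noise has power falling off with frequency. The spectral
# exponent beta in S(f) ~ 1/f^beta is what distinguishes the colours.
import numpy as np

rng = np.random.default_rng(2)
n = 2 ** 16

white = rng.normal(size=n)

# Quick-and-dirty pink noise: shape the spectrum of white noise by 1/sqrt(f).
spectrum = np.fft.rfft(rng.normal(size=n))
freqs = np.fft.rfftfreq(n)
spectrum[1:] /= np.sqrt(freqs[1:])
pink = np.fft.irfft(spectrum, n)

def spectral_exponent(x):
    """Estimate beta in S(f) ~ 1/f^beta from a log-log fit to the periodogram."""
    power = np.abs(np.fft.rfft(x)) ** 2
    f = np.fft.rfftfreq(len(x))[1:]
    slope, _ = np.polyfit(np.log(f), np.log(power[1:]), 1)
    return -slope

print(f"white noise: beta ~ {spectral_exponent(white):.2f}")   # close to 0
print(f"pink noise:  beta ~ {spectral_exponent(pink):.2f}")    # close to 1
```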

The Porcaro performance that Esa and colleagues analysed for their paper is the impressive single-handed 16th note groove that drives Michael McDonald’s “I Keep Forgettin’…” I wanted to analyse a similar single-handed 16th note pattern, but in a rock rather than pop context, to ascertain whether Porcaro’s pattern of fluctuations in interbeat timing was characteristic only of his virtuoso style or if it was a general feature of drumming. I’m also, coincidentally, a massive Rush fan. An iconic and influential track from the Canadian trio with the right type of drum pattern immediately sprang to mind: Tom Sawyer.

So we asked Sixty Symbols viewers to send in audio samples of their drumming along to Tom Sawyer, which we subsequently attempted to evaluate using a technique called detrended fluctuation analysis. When I say “we”, I mean a number of undergraduate students here at the University of Nottingham (who were aided, but more generally abetted, by me in the analysis). I’ve set a 3rd year undergraduate project on fluctuations in drumming for the last three years; the first six authors on the arXiv paper were (or are) all undergraduate students.
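For the curious, here’s a bare-bones sketch of the DFA procedure itself, applied to a synthetic series of inter-beat intervals rather than to any of the real drum data (and emphatically not the analysis code used for the paper). The quantity of interest is the scaling exponent α: α ≈ 0.5 for uncorrelated intervals, α > 0.5 for long-range correlated fluctuations, and α < 0.5 for the anti-correlated behaviour mentioned in the abstract above.

```python
# Bare-bones detrended fluctuation analysis (DFA), applied here to a synthetic,
# uncorrelated series of inter-beat intervals (mean interval 0.35 s with 10 ms
# of jitter). For such a series the DFA exponent alpha should come out ~0.5.
import numpy as np

def dfa_exponent(x, scales):
    """Return the DFA scaling exponent alpha for the series x."""
    profile = np.cumsum(x - np.mean(x))        # integrated, mean-subtracted series
    fluctuations = []
    for s in scales:
        n_windows = len(profile) // s
        rms_sq = []
        for w in range(n_windows):
            segment = profile[w * s:(w + 1) * s]
            t = np.arange(s)
            trend = np.polyval(np.polyfit(t, segment, 1), t)   # linear detrend
            rms_sq.append(np.mean((segment - trend) ** 2))
        fluctuations.append(np.sqrt(np.mean(rms_sq)))
    alpha, _ = np.polyfit(np.log(scales), np.log(fluctuations), 1)
    return alpha

rng = np.random.default_rng(3)
intervals = 0.35 + 0.01 * rng.normal(size=4096)   # synthetic inter-beat intervals (s)
scales = [16, 32, 64, 128, 256, 512]              # window sizes, in beats
print(f"DFA exponent alpha ~ {dfa_exponent(intervals, scales):.2f}")   # expect ~0.5
```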

Unfortunately, the sound quality (and/or the duration) of many of the samples submitted in response to the Sixty Symbols video was just not sufficient for the task. That’s not a criticism, in any way, of the drummers who submitted audio files; it’s entirely my fault for not being more specific in the video. We worked with what we could, but in the end, the lead authors on the arXiv paper, Oli(ver) Gordon and Dom(inic) Coy, adopted a different and much more productive strategy for their version of the project: they invited a number of drummers (twenty-two in total) to play along with Tom Sawyer using only a hi-hat (so as to ensure that each and every beat could be isolated and tracked) and under exactly the same recording conditions.

You can read all of the details of the data acquisition and analysis in the arXiv paper. It also features the lengthiest acknowledgements section I’ve ever had to write. I think I’ve thanked everyone who provided data in there but if you sent me an MP3 or a .wav file (or some other audio format) and you don’t see your name in there, please let me know by leaving a comment below this post. (Assuming, of course, that you’d like to be acknowledged!)

We submitted the paper to the Journal of New Music Research last year and received some very helpful referees’ comments. I am waiting for permission from the editor of the journal to make those (anonymous) comments public. If that permission is given, I’ll post the referees’ reports here.

In hindsight, Tom Sawyer was not the best choice of track to analyse. It’s a difficult groove to get right and even Neil Peart himself has said that it’s the song he finds most challenging to play live. In our analysis, we found very little evidence of the type of characteristic “crossover” in the correlations of the drumming fluctuations that emerged from Esa and colleagues’ study of Porcaro’s drumming. Our results are also at odds with the more recent work by Mathias Sogorski, Theo Geisel, and Viola Priesemann (of the Max Planck Institute for Dynamics and Self-Organization, and the Bernstein Center for Computational Neuroscience, Göttingen, Germany) — a comprehensive and systematic analysis of microtiming variations spanning more than 100 jazz and rock recordings.

The conditions under which we recorded the tracks — in particular, the rather “unnatural” hi-hat-only performance — may well have washed out the type of correlations observed by others. Nonetheless, this arguably negative result is a useful insight into the extent to which correlated fluctuations are robust (or not) with respect to performance environment and style. It was clear from our results, in line with previous work by Holger Hennig, Theo Geisel and colleagues, that the fluctuations are not so much characteristic of an individual drummer as of a performance; the same drummer could produce different fluctuation distributions and spectra under different performing conditions.

So where do we go from here? What’s the next stage of this research? I’m delighted to say that the Sixty Symbols video was directly responsible for kicking off an exciting collaboration with Esa and colleagues at Tampere that involves a number of students and researchers here at Nottingham. In particular, two final year project students, Ellie Hill and Lucy Edwards, have just returned from a week-long visit to Esa’s group at Tampere University. Their project, which is jointly supervised by my colleague Matt Brookes, Esa, and myself, focuses on going that one step further in the analysis of drumming fluctuations to incorporate brain imaging. Using this wonderful device.

I’m also rather chuffed that another nascent collaboration has stemmed from the Sixty Symbols video (and the subsequent data analysis) — this time from the music side of the so-called “two cultures” divide. The obscenely talented David Domminney Fowler, of Australian Pink Floyd fame, has kindly provided exceptionally high quality mixing desk recordings of “Another Brick In The Wall (Part 2)” from concert performances. (Thanks, Dave. [3]) Given the sensitivity of drumming fluctuations to the precise performance environment, the analysis of the same drummer (in this case, Paul Bonney) over multiple performances could prove very informative. We’re also hoping that Bonney will be able to make it to the Sir Peter Mansfield Imaging Centre here in the not-too-distant future so that Matt and colleagues can image his brain as he drums. (Knock yourself out with drummer jokes at this point. Dave certainly has.) I’m also particularly keen to compare results from my instrument of choice at the moment, Aerodrums, with those from a traditional kit.

And finally, the Sixty Symbols video also prompted George Datseris, professional drummer and PhD researcher, also at the Max Planck Institute for Dynamics & Self-Organisation, to get in touch to let us know about his intriguing work with the Geisel group: Does it Swing? Microtiming Deviations and Swing Feeling in Jazz. Esa and George will both be visiting Nottingham later this year and I am very enthusiastic indeed about the prospects for a European network on drum/rhythm research.

What’s remarkable is that all of this collaborative effort stemmed from Sixty Symbols. Public engagement is very often thought of exclusively in terms of scientists doing the research and then presenting the work as a fait accompli. What I’ve always loved about working with Brady on Sixty Symbols, and with Sean on Computerphile, is that they want to make the communication of science a great deal more open and engaging than that; they want to involve viewers (who are often the taxpayers who fund the work) in the trials and tribulations of the day-to-day research process itself. Brady and I have our spats on occasion, but on this point I am in complete and absolute agreement with him. Here he is, hitting the back of the net in describing the benefits of a warts-and-all approach to science communication…

They don’t engage with one paper every year or two, and a press release. I think if people knew what went into that paper and that press release…and they see the ups and the downs… even when it’s boring… And they see the emotion of it, and the humanity of it…people will become more engaged and more interested…

With the drumming project, Sixty Symbols went one step further and brought the viewers in so they were part of the story — they drove the direction of the science. While YouTube has its many failings, Sixty Symbols and channels like it enable connections with the world outside the lab that were simply unimaginable when I started my PhD back in (gulp…) 1990. And in these days of narrow-minded, naive nationalism, we need all the international connections we can get. Marching to the beat of your own drum ain’t all it’s cracked up to be…

Source of cartoon: https://xkcd.com/1736/


[1] 78. “Seven eight”.

[2] 50 Hz or 60 Hz depending on which side of the pond you fall. Any experimental physicist or electrical/electronic engineer who might be reading will also know full well that mains noise is generally not only present at 50 (or 60) Hz — there are all those wonderful harmonics to consider. (And the strongest peak may well not even be at 50 (60) Hz, but at one of those harmonics. And not all harmonics will contribute equally.  Experimental physics is such a joy at times…)
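(For anyone who hasn’t had the pleasure: here’s a tiny synthetic illustration. The sampling rate, amplitudes, and harmonic content below are entirely made up; the only point is that the “mains” peaks show up at the fundamental and its harmonics, and the biggest of them needn’t be at 50 Hz.)

```python
# Synthetic illustration of mains pickup with harmonics: a distorted 50 Hz
# "hum" buried in broadband noise, with the 3rd harmonic deliberately made
# the strongest component. All values are invented for illustration.
import numpy as np

fs, duration = 10_000, 2.0                       # 10 kHz sampling, 2 s of "data"
t = np.arange(0, duration, 1 / fs)
rng = np.random.default_rng(4)

signal = (0.4 * np.sin(2 * np.pi * 50 * t)       # 50 Hz fundamental
          + 1.0 * np.sin(2 * np.pi * 150 * t)    # strong 3rd harmonic
          + 0.2 * np.sin(2 * np.pi * 250 * t)    # weaker 5th harmonic
          + rng.normal(scale=0.5, size=t.size))  # broadband (white) noise

power = np.abs(np.fft.rfft(signal)) ** 2
freqs = np.fft.rfftfreq(t.size, d=1 / fs)

# The three strongest spectral peaks (ignoring the DC bin):
strongest = np.sort(freqs[1:][np.argsort(power[1:])[-3:]])
print(strongest)                                 # peaks at (or very near) 50, 150, 250 Hz
```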

[3] In the interests of full disclosure I should note that Dave is a friend, a fan of Sixty Symbols, Numberphile, etc., and an occasional contributor to Computerphile. He and I have spent quite a few tea-fuelled hours setting the world to rights.

 

 

Politics. Perception. Philosophy. And Physics.

Today is the start of the new academic year at the University of Nottingham (UoN) and, as ever, it crept up on me and then leapt out with a fulsome “Gotcha”. Summer flies by so very quickly. I’ll be meeting my new 1st year tutees this afternoon to sort out when we’re going to have tutorials and, of course, to get to know them. One of the great things about the academic life is watching tutees progress over the course of their degree from that first “getting to know each other” meeting to when they graduate.

The UoN has introduced a considerable number of changes to the “student experience” of late via its Project Transform process. I’ve vented my spleen about this previously but it’s a subject to which I’ll be returning in the coming weeks because Transform says an awful lot about the state of modern universities.

For now, I’m preparing for a module entitled “The Politics, Perception and Philosophy of Physics” (F34PPP) that I run in the autumn semester. This is a somewhat untraditional physics module because, for one thing, it’s almost entirely devoid of mathematics. I thoroughly enjoy F34PPP each year (despite this amathematical heresy) because of the engagement and enthusiasm of the students. The module is very much based on their contributions — I am more of a mediator than a lecturer.

STEM students are sometimes criticised (usually by Simon Jenkins) for having poorly developed communication skills. This is an especially irritating stereotype in the context of the PPP module, where I have been deeply impressed by the quality of the writing the students submit. As I discuss in the video below (an overview of the module), I’m not alone in recognising this: articles submitted as F34PPP coursework have been published in Physics World, the flagship magazine of the Institute of Physics.

 

In the video I note that my intention is to upload a weekly video for each session of the module. I’m going to do my utmost to keep this promise and, moreover, to accompany each of those videos with a short(ish) blog post. (But, to cover my back, I’ll just note in advance that the best laid schemes gang aft agley…)

Addicted to the brand: The hypocrisy of a publishing academic

Back in December I gave a talk at the Power, Acceleration and Metrics in Academic Life conference in Prague, which was organised by Filip Vostal and Mark Carrigan. The LSE Impact blog is publishing a series of posts from those of us who spoke at the conference. They uploaded my post this morning. Here it is…


 

I’m going to put this as bluntly as I can; it’s been niggling and nagging at me for quite a while and it’s about time I got it off my chest. When it comes to publishing research, I have to come clean: I’m a hypocrite. I spend quite some time railing about the deficiencies in the traditional publishing system, and all the while I’m bolstering that self-same system by my selection of the “appropriate” journals to target.

Despite bemoaning the statistical illiteracy of academia’s reliance on nonsensical metrics like impact factors, and despite regularly venting my spleen during talks at conferences about the too-slow evolution of academic publishing towards a more open and honest system, I nonetheless continue to contribute to the problem. (And I take little comfort in knowing that I’m not alone in this.)

One of those spleen-venting conferences was a fascinating and important event held in Prague back in December, organized by Filip Vostal and Mark Carrigan: “Power, Acceleration, and Metrics in Academic Life”. My presentation, The Power, Perils and Pitfalls of Peer Review in Public – please excuse the Partridgian overkill on the alliteration – largely focused on the question of post-publication peer review (PPPR) via online channels such as PubPeer. I’ve written at length, however, on PPPR previously (here, here, and here) so I’m not going to rehearse and rehash those arguments. I instead want to explain just why I levelled the accusation of hypocrisy and why I am far from confident that we’ll see a meaningful revolution in academic publishing any time soon.

Let’s start with a few ‘axioms’/principles that, while perhaps not being entirely self-evident in each case, could at least be said to represent some sort of consensus among academics:

  • A journal’s impact factor (JIF) is clearly not a good indicator of the quality of a paper published in that journal. The JIF has been skewered many, many times with some of the more memorable and important critiques coming from Stephen Curry, Dorothy Bishop, David Colquhoun, Jenny Rohn, and, most recently, this illuminating post from Stuart Cantrill. Yet its very strong influence tenaciously persists and pervades academia. I regularly receive CVs from potential postdocs where they ‘helpfully’ highlight the JIF for each of the papers in their list of publications. Indeed, some go so far as to rank their publications on the basis of the JIF.
  • Given that the majority of research is publicly funded, it is important to ensure that open access publication becomes the norm. This one is arguably rather more contentious and there are clear differences in the appreciation of open access (OA) publishing between disciplines, with the arts and humanities being rather less welcoming of OA than the sciences. Nonetheless, the key importance of OA has laudably been recognized by Research Councils UK (RCUK) and all researchers funded by any of the seven UK research councils are mandated to make their papers available via either a green or gold OA route (with the gold OA route, seen by many as a sop to the publishing industry, often being prohibitively expensive).

With these “axioms” in place, it now seems rather straightforward to make a decision as to the journal(s) our research group should choose as the appropriate forum for our work. We should put aside any consideration of impact factor and aim to select those journals which eschew the traditional for-(large)-profit publishing model and provide cost-effective open access publication, right?

Indeed, we’re particularly fortunate because there’s an exemplar of open access publishing in our research area: The Beilstein Journal of Nanotechnology. Not only are papers in the Beilstein J. Nanotech free to the reader (and easy to locate and download online), but publishing there is free: no exorbitant gold OA costs nor, indeed, any type of charge to the author(s) for publication. (The Beilstein Foundation has very deep pockets and laudably shoulders all of the costs).

But take a look at our list of publications — although we indeed publish in the Beilstein J. Nanotech., the number of our papers appearing there can be counted on the fingers of (less than) one hand. So, while I espouse the principles listed above, I hypocritically don’t practice what I preach. What’s my excuse?

In academia, journal brand is everything. I have sat in many committees, read many CVs, and participated in many discussions where candidates for a postdoctoral position, a fellowship, or other roles at various rungs of the academic career ladder have been compared. And very often, the committee members will say something along the lines of “Well, Candidate X has got much better publications than Candidate Y”…without ever having read the papers of either candidate. The judgment of quality is lazily “outsourced” to the brand-name of the journal. If it’s in a Nature journal, it’s obviously of higher quality than something published in one of those, ahem, “lesser” journals.

If, as principal investigator, I were to advise the PhD students and postdocs in the group here at Nottingham that, in line with the principles above, they should publish all of their work in the Beilstein J. Nanotech., it would be career suicide for them. To hammer this point home, here’s the advice from one referee of a paper we recently submitted:

“I recommend re-submission of the manuscript to the Beilstein Journal of Nanotechnology, where works of similar quality can be found. The work is definitively well below the standards of [Journal Name].”

There is very clearly a well-established hierarchy here. Journal ‘branding’, and, worse, journal impact factor, remain exceptionally important in (falsely) establishing the perceived quality of a piece of research, despite many efforts to counter this perception, including, most notably, DORA. My hypocritical approach to publishing research stems directly from this perception. I know that if I want the researchers in my group to stand a chance of competing with their peers, we have to target “those” journals. The same is true for all the other PIs out there. While we all complain bitterly about the impact factor monkey on our back, we’re locked into the addiction to journal brand.

And it’s very difficult to see how to break the cycle…

Are flaws in peer review someone else’s problem?


That stack of fellowship applications piled up on the coffee table isn’t going to review itself. You’ve got twenty-five to read before the rapidly approaching deadline, and you knew before you accepted the reviewing job that many of the proposals would fall outside your area of expertise. Sigh. Time to grab a coffee and get on with it.

As a professor of physics with some thirty-five years’ experience in condensed matter research, you’re fairly confident that you can make insightful and perceptive comments on that application about manipulating electron spin in nanostructures (from that talented postdoc you met at a conference last year). But what about the proposal on membrane proteins? Or, worse, the treatment of arcane aspects of string theory by the mathematician claiming a radical new approach to supersymmetry? Can you really comment on those applications with any type of authority?

Of course, thanks to Thomson Reuters there’s no need for you to be too concerned about your lack of expertise in those fields. You log on to Web of Knowledge and check the publication records. Hmmm. The membrane protein work has made quite an impact – the applicant’s Science paper from a couple of years back has already picked up a few hundred citations and her h-index is rising rapidly. She looks to be a real ‘star’ in her community. The string theorist is also blazing a trail.

Shame about the guy doing the electron spin stuff. You’d been very excited about that work when you attended his excellent talk at the conference in the U.S. but it’s picked up hardly any citations at all. Can you really rank it alongside the membrane protein proposal? After all, how could you justify that decision on any sort of objective basis to the other members of the interdisciplinary panel…?

Bibliometrics are the bane of academics’ lives. We regularly moan about the rate at which metrics such as the journal impact factor and the notorious h-index are increasing their stranglehold on the assessment of research. And, yet, as the hypothetical example above shows, we can be our own worst enemy in reaching for citation statistics to assess work outside – or even firmly inside – our ‘comfort zone’ of expertise.

David Colquhoun, a world-leading pharmacologist at University College London and a blogger of quite some repute, has repeatedly pointed out the dangers of lazily relying on citation analyses to assess research and researchers. One article in particular, How to get good science, is a searingly honest account of the correlation (or lack thereof) between citations and the relative importance of a number of his, and others’, papers. It should be required reading for all those involved in research assessment at universities, research councils, funding bodies, and government departments – particularly those who are of the opinion that bibliometrics represent an appropriate method of ranking the ‘outputs’ of scientists.

Colquhoun, in refreshingly ‘robust’ language, puts it as follows:

“All this shows what is obvious to everyone but bone-headed bean counters. The only way to assess the merit of a paper is to ask a selection of experts in the field.

“Nothing else works.

“Nothing.”

An ongoing controversy in my area of research, nanoscience, has thrown Colquhoun’s statement into sharp relief. The controversial work in question represents a particularly compelling example of the fallacy of citation statistics as a measure of research quality. It has also provided worrying insights into scientific publishing, and has severely damaged my confidence in the peer review system.

The minutiae of the case in question are covered in great detail at Raphael Levy’s blog so I won’t rehash the arguments here. In a nutshell, the problem is as follows. The authors of a series of papers in the highest profile journals in science – including Science and the Nature Publishing Group family – have claimed that stripes form on the surfaces of nanoparticles due to phase separation of different ligand types. The only direct evidence for the formation of those stripes comes from scanning probe microscopy (SPM) data. (SPM forms the bedrock of our research in the Nanoscience group at the University of Nottingham, hence my keen interest in this particular story.)

But those SPM data display features which appear remarkably similar to well known instrumental artifacts, and the associated data analyses appear less than rigorous at best. In my experience the work would be poorly graded even as an undergraduate project report, yet it’s been published in what are generally considered to be the most important journals in science. (And let’s be clear – those journals indeed have an impressive track record of publishing exciting and pioneering breakthroughs in science.)

So what? Isn’t this just a storm in a teacup about some arcane aspect of nanoscience? Why should we care? Won’t the problem be rooted out when others fail to reproduce the work? After all, isn’t science self-correcting in the end?

Good points. Bear with me – I’ll consider those questions in a second. Take a moment, however, to return to the academic sitting at home with that pile of proposals to review. Let’s say that she had a fellowship application related to the striped nanoparticle work to rank amongst the others. A cursory glance at the citation statistics at Web of Knowledge would indicate that this work has had a major impact over a very short period. Ipso facto, it must be of high quality.

And yet, if an expert – or, in this particular case, even a relative SPM novice – were to take a couple of minutes to read one of the ‘stripy nanoparticle’ papers, they’d be far from convinced by the conclusions reached by the authors. What was it that Colquhoun said again? “The only way to assess the merit of a paper is to ask a selection of experts in the field. Nothing else works. Nothing.”

In principle, science is indeed self-correcting. But if there are flaws in published work, who fixes them? Perhaps the most troublesome aspect of the striped nanoparticle controversy was highlighted by a comment left by Mathias Brust, a pioneer in the field of nanoparticle research, under an article in the Times Higher Education:

“I have [talked to senior experts about this controversy] … and let me tell you what they have told me. About 80% of senior gold nanoparticle scientists don’t give much of a damn about the stripes and find it unwise that Levy engages in such a potentially career damaging dispute. About 10% think that … fellow scientists should be friendlier to each other. After all, you never know [who] referees your next paper. About 5% welcome this dispute, needless to say predominantly those who feel critical about the stripes. This now includes me. I was initially with the first 80% and did advise Raphael accordingly.”

[Disclaimer: I know Mathias Brust very well and have collaborated, and co-authored papers, with him in the past].

I am well aware that the plural of anecdote is not data but Brust’s comment resonates strongly with me. I have heard very similar arguments at times from colleagues in physics. The most troubling of all is the idea that critiquing published work is somehow at best unseemly, and, at worst, career-damaging.  Has science really come to this?

Douglas Adams, in an inspired passage in Life, The Universe, and Everything, takes the psychological concept known as “someone else’s problem (SEP)” and uses it as the basis of an invisibility ‘cloak’ in the form of an SEP-field. (Thanks to Dave Fernig, a fellow fan of Douglas Adams, for reminding me about the Someone Else’s Problem field.) As Adams puts it, instead of attempting the mind-bogglingly complex task of actually making something invisible, an SEP is much easier to implement. “An SEP is something we can’t see, or don’t see, or our brain doesn’t let us see, because we think that it’s somebody else’s problem…. The brain just edits it out, it’s like a blind spot”.

The 80% of researchers to which Brust refers are apparently of the opinion that flaws in the literature are someone else’s problem. We have enough to be getting on with in terms of our own original research, without repeating measurements that have already been published in the highest quality journals, right?

Wrong. This is not someone else’s problem. This is our problem and we need to address it.

Image: Paper pile. Credit: https://pixnio.com/it/oggetti/libri/libri-documento-educazione-informazioni-conoscenza-leggere-ricerca-scuola-stack-studio-lavoro