Guilty Confessions of a REFeree

#4 of an occasional series

At the start of this week I spent a day in a room in a university somewhat north of Nottingham with a stack of research papers and a pile of grading sheets. Along with a fellow physicist from a different university (located even further north of Nottingham), I had been asked to act as an external reviewer for the department’s mock REF assessment.

I found it a deeply uncomfortable experience. My discomfort had nothing to do, of course, with our wonderfully genial hosts — thank you all for the hospitality, the conversation, the professionalism, and, of course, lunch. But I’ve vented my spleen previously on the lack of consistency in mock REF ratings (it’s been the most-viewed post at Symptoms… since I resurrected the blog in June last year) and I agreed to participate in the mock assessment so I could see for myself how the process works in practice.

Overall, I’d say that the degree of agreement on “star ratings” before moderation of my co-marker’s grading and mine was at the 70% level, give or take. This is in line with the consistency we observed at Nottingham for independent reviewers in Physics and is therefore, at least, somewhat encouraging. (Other units of assessment for Nottingham’s mock REF review had only 50% agreement.) But what set my teeth on edge for a not-insignificant number of papers — including quite a few of those on which my gradings agreed with those of my co-marker — was that I simply did not feel at all qualified to comment.

Even though I’m a condensed matter physicist and we were asked to assess condensed matter physics papers, I simply don’t have the necessary level of hubris to pretend that I can expertly assess any paper in any CMP sub-field. The question that went through my head repeatedly was “If I got this paper from Physical Review Letters (or Phys. Rev. B, or Nature, or Nature Comms, or Advanced Materials, or J. Phys. Chem. C…etc…) would I accept the reviewing invitation or would I decline, telling them it was out of my field of expertise?” And for the majority of papers the answer to that question was a resounding “I’d decline the invitation.”

So if a paper I was asked to review wasn’t in my (sub-)field of expertise, how did I gauge its reception in the relevant scientific community?

I can’t quite believe I’m admitting this, given my severe misgivings about citation metrics, but, yes, I held my nose and turned to Web of Science. Citation metrics also played a role in the decisions my co-marker made, and in our moderation — despite the fact that we had no way of normalising those metrics to the prevailing citation culture of each sub-field, nor of ranking the quality, as distinct from the impact, of each paper. (One of my absolute favourite papers of all time — a truly elegant and pioneering piece of work — has picked up a surprisingly low number of citations compared with much more pedestrian work in the field.)

Only when I had to face a stack of papers and grade them for myself did I realise just how exceptionally difficult it is to pass numerical judgment on a piece of work in an area that lies outside my rather small sphere of research. I was, of course, asked to comment on publications in condensed matter physics, ostensibly my area of expertise. But that’s a huge field. Not only is no-one a world-leading expert in all areas of condensed matter physics, it’s almost impossible to keep up with developments in our own narrow sub-fields of interest let alone be au fait with the state of the art in all other sub-fields.

We therefore turn to citations to try to gauge the extent to which a paper has made ripples — or perhaps even sent shockwaves — through a sub-field in which we have no expertise. My co-marker and I are hardly alone in adopting this citation-counting strategy. But that’s of course no excuse — we were relying on exactly the type of pseudoquantitative heuristic that I have criticised in the past, and I felt rather “grubby” at the end of the (rather tiring) day. David Colquhoun made the following point time and again in the run-up to the last REF (and well before):

All this shows what is obvious to everyone but bone-headed bean counters. The only way to assess the merit of a paper is to ask a selection of experts in the field.

Nothing else works.


Bibliometrics are a measure of visibility and “clout” in a particular (yet often nebulously defined) research community; they’re not a quantification of scientific quality. Therefore, very many scientists, and this most definitely includes me, have deep misgivings about using citations to judge a paper’s — let alone a scientist’s — worth.

Although I agree with that quote from David above, the problem is that we need to somehow choose the correct “boundary conditions” for each expert; I can have a reasonable level of expertise in one sub-area of a field — say, scanning probe microscopy or self-assembly or semiconductor surface physics — and a distinct lack of working knowledge, let alone expertise, in another sub-area of that self-same field. I could list literally hundreds of topics where I would, in fact, be winging it.

For many years, and because of my deep aversion to simplistic citation-counting and bibliometrics, I’ve been guilty of the type of not-particularly-joined-up thinking that Dorothy Bishop rightly chastises in this tweet…

We can’t trust the bibliometrics in isolation (for all the reasons (and others) that David Colquhoun lays out here), so when it comes to the REF the argument is that we have to supplement the metrics with “quality control” via another round of ostensibly expert peer review. But the problem is that it’s often not expert peer review; I was certainly not an expert in the subject areas of very many of the papers I was asked to judge. And I’ll hold that no-one can be a world-leading expert in every sub-field of a given area of physics (or any other discipline).

So what are the alternatives?

David has suggested that we should, in essence, retire what’s known as the “dual support” system for research funding (see the video embedded below): “…abolish the REF, and give the money to research councils, with precautions to prevent people being fired because their research wasn’t expensive enough.” I have quite some sympathy with that view because the common argument that the so-called QR funding awarded via the REF is used to support “unpopular” areas of research that wouldn’t necessarily be supported by the research councils is not at all compelling (to put it mildly). Universities demonstrably align their funding priorities and programmes very closely with research council strategic areas; they don’t hand out QR money for research that doesn’t fall within their latest Universal Targetified Globalised Research Themes.

Prof. Bishop has a different suggestion for revamping how QR funding is divvied up, which initially (and naively, for the reasons outlined above) I found a little unsettling. My first-hand experience earlier this week with the publication grading methodology used by the REF — albeit in a mock assessment — has made me significantly more comfortable with Dorothy’s strategy:

“…dispense with the review of quality, and you can obtain similar outcomes by allocating funding at institutional level in relation to research volume.”

Given that grant income is often taken as yet another proxy for research quality, and that there’s a clear Matthew effect (rightly or wrongly) at play in science funding, this correlation between research volume and REF placement is not surprising. As the Times Higher Education article on Dorothy’s proposals went on to report,

The government should, therefore, consider allocating block funding in proportion to the number of research-active staff at a university because that would shrink the burden on universities and reduce perverse incentives in the system, [Prof Bishop] said.

Before reacting strongly one way or another, I recommend that you take the time to listen to Prof. Bishop eloquently detail her arguments in the video below.

Here’s the final slide of that presentation:


So much rests on that final point. Ultimately, the immense time and effort devoted to — or wasted on — the REF boils down to a lack of trust, by government, funding bodies, and, depressingly, often university senior management, in academics’ ability to motivate themselves without perverse incentives like aiming for a 4* paper. That would be bad enough if we could all agree on what a 4* paper looks like…

At sixes and sevens about 3* and 4*

The post below appears in today’s Times Higher Education under the title “The REF’s star system leaves a black hole in fairness.” My original draft was improved immensely by Paul Jump‘s edits (but I am slightly miffed that my choice of title (above) was rejected by the sub-editors.) I’m posting the article here for those who don’t have a subscription to the THE. (I should note that the interview panel scenario described below actually happened. The question I asked was suggested in the interview pack supplied by the “University of True Excellence”.)

“In your field of study, Professor Aspire, just how does one distinguish a 3* from a 4* paper in the research excellence framework?”

The interviewee for a senior position at the University of True Excellence – names have been changed to protect the guilty – shuffled in his seat. I leaned slightly forward after posing the question, keen to hear his response to this perennial puzzler that has exercised some of the UK’s great and not-so-great academic minds.

He coughed. The panel – on which I was the external reviewer – waited expectantly.

“Well, a 4* paper is a 3* paper except that your mate is one of the REF panel members,” he answered.

I smiled and suppressed a giggle.

Other members of the panel were less amused. After all, the rating and ranking of academics’ outputs is serious stuff. Careers – indeed, the viability of entire departments, schools, institutes and universities – depend critically on the judgements made by peers on the REF panels.

Not only do the ratings directly influence the intangible benefits arising from the prestige of a high REF ranking, they also translate into cold, hard cash. An analysis by the University of Sheffield suggests that in my subject area, physics, the average annual value of a 3* paper for REF 2021 is likely to be roughly £4,300, whereas that of a 4* paper is £17,100. In other words, the formula for allocating “quality-related” research funding is such that a paper deemed 4* is worth four times one judged to be 3*; as for 2* (“internationally recognised”) or 1* (“nationally recognised”) papers, they are literally worthless.
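The weighting implied by those figures can be made concrete with a short sketch. To be clear, this is my own hedged illustration, not the official QR funding formula: the £4,300 and £17,100 per-paper values are the Sheffield estimates quoted above, and the function name and dictionary are invented for the example.

```python
# Hypothetical sketch of the QR funding weights implied by the figures above:
# a 4* paper attracts roughly four times the annual value of a 3* paper,
# and 2*/1* (and unclassified) papers attract nothing at all.
ANNUAL_VALUE = {4: 17_100, 3: 4_300, 2: 0, 1: 0, 0: 0}  # pounds per paper per year

def annual_qr_income(star_ratings):
    """Total annual QR income for a list of per-paper star ratings."""
    return sum(ANNUAL_VALUE[stars] for stars in star_ratings)

# A department submitting ten solid 3* papers earns less than one
# submitting three 4* papers alongside seven unfunded 2* papers.
print(annual_qr_income([3] * 10))           # 43000
print(annual_qr_income([4] * 3 + [2] * 7))  # 51300
```

The cliff edges in that dictionary — nothing below 3*, a fourfold jump at 4* — are what make the star boundaries matter so much.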

We might have hoped that before divvying up more than £1 billion of public funds a year, the objectivity, reliability and robustness of the ranking process would be established beyond question. But, without wanting to cast any aspersions on the integrity of REF panels, I’ve got to admit that, from where I was sitting, Professor Aspire’s tongue-in-cheek answer regarding the difference between 3* and 4* papers seemed about as good as any – apart from, perhaps, “I don’t know”.

The solution certainly isn’t to reach for simplistic bibliometric numerology such as impact factors or SNIP indicators; anyone making that suggestion is not displaying even the level of critical thinking we expect of our undergraduates. But every academic also knows, deep in their studious soul, that peer review is far from wholly objective. Nevertheless, university senior managers – many of them practising or former academics themselves – are often all too willing, as part of their REF preparations, to credulously accept internal assessors’ star ratings at face value, with sometimes worrying consequences for the researcher in question (especially if the verdict is 2* or less).

Fortunately, my institution, the University of Nottingham, is a little more enlightened – last year it had the good sense to check the consistency of the internal verdicts on potential REF 2021 submissions via the use of independent reviewers for each paper. The results were sobering. Across seven scientific units of assessment, the level of full agreement between reviewers varied from 50 per cent to 75 per cent. In other words, in the worst cases, reviewers agreed on the star rating for no more than half of the papers they reviewed.

Granted, the vast majority of the disagreement was at the 1* level; very few pairs of reviewers were “out” by two stars, and none disagreed by more. But this is cold comfort. The REF’s credibility is based on an assumption that reviewers can quantitatively assess the quality of a paper with a precision better than one star. As our exercise shows, the effective error bar is actually ± 1*.

That would be worrying enough if there were a linear scaling of financial reward. But the problem is exacerbated dramatically by both the 4x multiplier for 4* papers and the total lack of financial reward for anything deemed to be below 3*.
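To see how dramatically, consider what a one-star misrating costs at each point on the scale (again using the Sheffield per-paper estimates quoted above; the framing of the calculation is mine, not part of the REF methodology):

```python
# Hypothetical illustration: with a +/-1* error bar, the financial
# consequence of rating a paper one star too low depends wildly on
# where on the scale the error lands.
ANNUAL_VALUE = {4: 17_100, 3: 4_300, 2: 0, 1: 0}  # pounds per paper per year

for true_stars in (4, 3, 2):
    loss = ANNUAL_VALUE[true_stars] - ANNUAL_VALUE[true_stars - 1]
    print(f"{true_stars}* paper rated one star too low: loses £{loss:,}/year")
# 4* -> 3*: loses £12,800/year
# 3* -> 2*: loses £4,300/year
# 2* -> 1*: loses £0/year
```

An error bar of ±1* is tolerable under a linear payout; under this payout it can erase three-quarters of a paper’s value in a single misjudged boundary call.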

The Nottingham analysis also examined the extent to which reviewers’ ratings agreed with authors’ self-scoring (let’s leave aside any disagreement between co-authors on that). The level of full agreement here was similarly patchy, varying between 47 per cent and 71 per cent. Unsurprisingly, there was an overall tendency for authors to “overscore” their papers, although underscoring was also common.

Some argue that what’s important is the aggregate REF score for a department, rather than the ratings of individual papers, because, according to the central limit theorem, any wayward ratings will “wash out” at the macro level. I disagree entirely. Individual academics across the UK continue to be coaxed and cajoled into producing 4* papers; there are even dedicated funding schemes to help them do so. And the repercussions arising from failure can be severe.

It is vital in any game of consequence that participants be able to agree when a goal has been scored or a boundary hit. Yet, in the case of research quality, there are far too many cases in which we just can’t. So the question must be asked: why are we still playing?