The post below appears in today’s Times Higher Education under the title “The REF’s star system leaves a black hole in fairness.” My original draft was improved immensely by Paul Jump’s edits, though I am slightly miffed that my choice of title (above) was rejected by the sub-editors. I’m posting the article here for those who don’t have a subscription to the THE. (I should note that the interview panel scenario described below actually happened. The question I asked was suggested in the interview pack supplied by the “University of True Excellence”.)
“In your field of study, Professor Aspire, just how does one distinguish a 3* from a 4* paper in the research excellence framework?”
The interviewee for a senior position at the University of True Excellence – names have been changed to protect the guilty – shuffled in his seat. I leaned slightly forward after posing the question, keen to hear his response to this perennial puzzler that has exercised some of the UK’s great and not-so-great academic minds.
He coughed. The panel – on which I was the external reviewer – waited expectantly.
“Well, a 4* paper is a 3* paper except that your mate is one of the REF panel members,” he answered.
I smiled and suppressed a giggle.
Other members of the panel were less amused. After all, the rating and ranking of academics’ outputs is serious stuff. Careers – indeed, the viability of entire departments, schools, institutes and universities – depend critically on the judgements made by peers on the REF panels.
Not only do the ratings directly influence the intangible benefits arising from the prestige of a high REF ranking, they also translate into cold, hard cash. An analysis by the University of Sheffield suggests that in my subject area, physics, the average annual value of a 3* paper for REF 2021 is likely to be roughly £4,300, whereas that of a 4* paper is £17,100. In other words, the formula for allocating “quality-related” research funding is such that a paper deemed 4* is worth four times as much as one judged to be 3*; as for 2* (“internationally recognised”) or 1* (“nationally recognised”) papers, they are literally worthless.
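To make that weighting concrete, here is a minimal sketch of the per-paper arithmetic. The values are the Sheffield estimates quoted above; the zero weighting for 1* and 2* papers follows from the funding formula as described. The submission in the example is entirely hypothetical.

```python
# Approximate annual QR value per physics paper, per the Sheffield
# analysis quoted in the article (REF 2021, illustrative figures only).
QR_VALUE = {4: 17_100, 3: 4_300, 2: 0, 1: 0}

def qr_income(star_counts):
    """Total annual QR income for a submission, given a mapping
    from star rating to number of papers at that rating."""
    return sum(QR_VALUE[stars] * n for stars, n in star_counts.items())

# A hypothetical submission: ten 4* papers, twenty 3* and ten 2*.
# Note that the ten 2* papers contribute nothing at all.
print(qr_income({4: 10, 3: 20, 2: 10}))  # -> 257000
```

The point the sketch makes plain is that only the 3* and 4* rows of the table carry any weight: a department’s QR income is decided entirely by how many papers clear the 3* bar, and by how many of those are nudged up to 4*.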
We might have hoped that before divvying up more than £1 billion of public funds a year, the objectivity, reliability and robustness of the ranking process would be established beyond question. But, without wanting to cast any aspersions on the integrity of REF panels, I’ve got to admit that, from where I was sitting, Professor Aspire’s tongue-in-cheek answer regarding the difference between 3* and 4* papers seemed about as good as any – apart from, perhaps, “I don’t know”.
The solution certainly isn’t to reach for simplistic bibliometric numerology such as impact factors or SNIP indicators; anyone making that suggestion is not displaying even the level of critical thinking we expect of our undergraduates. But every academic also knows, deep in their studious soul, that peer review is far from wholly objective. Nevertheless, university senior managers – many of them practising or former academics themselves – are often all too willing, as part of their REF preparations, to credulously accept internal assessors’ star ratings at face value, with sometimes worrying consequences for the researcher in question (especially if the verdict is 2* or less).
Fortunately, my institution, the University of Nottingham, is a little more enlightened – last year it had the good sense to check the consistency of the internal verdicts on potential REF 2021 submissions via the use of independent reviewers for each paper. The results were sobering. Across seven scientific units of assessment, the level of full agreement between reviewers varied from 50 per cent to 75 per cent. In other words, in the worst cases, reviewers agreed on the star rating for no more than half of the papers they reviewed.
Granted, the vast majority of disagreements were by a single star; very few pairs of reviewers were “out” by two stars, and none disagreed by more. But this is cold comfort. The REF’s credibility rests on the assumption that reviewers can assess the quality of a paper with a precision better than one star. As our exercise shows, the effective error bar is actually ±1*.
That would be worrying enough if there were a linear scaling of financial reward. But the problem is exacerbated dramatically by both the 4x multiplier for 4* papers and the total lack of financial reward for anything deemed to be below 3*.
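The interaction between a ±1* error bar and the step-shaped reward can be sketched in a few lines. Using the same illustrative Sheffield per-paper values as above, this computes the annual cost of a single paper being rated one star below its “true” level:

```python
# Illustrative annual QR value per paper (Sheffield estimates for
# physics, REF 2021); 2* and 1* papers attract no funding.
QR_VALUE = {4: 17_100, 3: 4_300, 2: 0, 1: 0}

# Annual cost of one paper being underrated by a single star --
# i.e. exactly the error bar the consistency exercise revealed.
for true_stars in (4, 3, 2):
    loss = QR_VALUE[true_stars] - QR_VALUE[true_stars - 1]
    print(f"{true_stars}* paper rated {true_stars - 1}*: "
          f"loses £{loss:,} per year")
```

A one-star slip at the top of the scale costs £12,800 a year for a single paper; at the 3*/2* boundary it costs the full £4,300, because the paper falls off the funding cliff altogether. The same ±1* uncertainty that would be tolerable under a linear scheme is anything but tolerable here.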
The Nottingham analysis also examined the extent to which reviewers’ ratings agreed with authors’ self-scoring (let’s leave aside any disagreement between co-authors on that). The level of full agreement here was similarly patchy, varying between 47 per cent and 71 per cent. Unsurprisingly, there was an overall tendency for authors to “overscore” their papers, although underscoring was also common.
Some argue that what’s important is the aggregate REF score for a department, rather than the ratings of individual papers, because, according to the central limit theorem, any wayward ratings will “wash out” at the macro level. I disagree entirely. Individual academics across the UK continue to be coaxed and cajoled into producing 4* papers; there are even dedicated funding schemes to help them do so. And the repercussions arising from failure can be severe.
It is vital in any game of consequence that participants be able to agree when a goal has been scored or a boundary hit. Yet, in the case of research quality, there are far too many cases in which we just can’t. So the question must be asked: why are we still playing?