As I was finishing my PhD, it was clear to me that I wanted to work in industry. The reason was simple: in a company, success is objectively measurable. $$ in the short, medium, or long term is success. In academia, on the other hand, “successful” research is highly subjective and hard (if not impossible) to measure.
A widely accepted way of measuring how “successful” a paper or a journal is comes down to counting the citations it gets. This manifests itself in summary measures such as the h-index for researchers and the impact factor for journals. More recently, normalized metrics such as the Field-Weighted Citation Impact have appeared. The numbers are different, but they all do one thing: count citations.
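To make concrete just how much these summaries reduce a career to citation counting, here is a minimal sketch (in Python, purely as an illustration with made-up numbers) of the h-index: the largest h such that at least h of a researcher's papers have at least h citations each.

```python
def h_index(citations):
    """Return the largest h such that at least h papers have >= h citations each."""
    counts = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(counts, start=1):
        if cites >= rank:
            h = rank  # the top `rank` papers all have at least `rank` citations
        else:
            break
    return h

# A hypothetical record of five papers and their citation counts.
print(h_index([10, 8, 5, 2, 1]))  # -> 3
```

Everything about the work itself, including what the papers say and whether anyone built on them meaningfully, is invisible to this number; only the citation tallies go in.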
Though publications such as the Leiden Manifesto clearly state that these metrics should not be the sole indicator of anything, some funding agencies put hard limits on who can apply for certain funding based on such metrics. Others, I am sure, are doing the same in the background without being open about it.
The idea behind this is very straightforward. A high-quality paper is one that attracts attention from the wider community. Something that is cited as a reference must be well written and respected. Setting aside the Goodhart principle for a moment, I would like to challenge this assumption with a couple of high-profile cases from electrochemistry.
I am certainly not the first to challenge the fundamental assumption behind counting citations. There is an interesting experiment in which researchers, debating the issue, published a junk paper to test it out. Read it for yourself here.
In another analysis, published as an editorial in Nature in 2017, the authors list a number of papers that became “impactful” after years of not being cited. Ironically, their measure of impact is the number of citations those papers received more recently. Ultimately, they are using the very metric they are criticizing in order to measure impact.
It is painful to admit that I don’t have much in the way of a solution. The standard way promotion and tenure decisions are made is to rely on the opinions of people in the field whose judgment can be trusted. A file containing the candidate’s accomplishments is sent to reviewers who judge whether the candidate has achieved enough. This is typically not considered peer review, since the reviewer is an expert with more experience than the person being evaluated. And whenever a person is doing the reviewing, the review is by definition subjective.
One way of making peer review better is collective review. When the review is done by a large group of peers, the end result is typically much better. In an example of “Vox populi, vox dei”, a couple of recent high-profile incidents showed the power of collective reviewing over social media.
In both cases (and, I am sure, in others I haven’t followed), high-profile publications received due attention and collective review. This is a great way of assessing the quality of the science; however, it can only happen for the very few high-profile cases.
Before I close, I have to come back to the Goodhart principle. Attaching a numerical value to anything is dangerous, as Goodhart pointed out. Recent phrasings of the principle run along the lines of “any measure that becomes a target stops being a good measure”; the original statement from Goodhart is that “any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes”. He was specifically talking about economic measures, in reference to UK monetary management in the 1970s. Though the wording is different, the recent version makes the same point clearly: as long as a numerical measure of any quantity is also a target for some human endeavour, that target will be manipulated by various means. This is true for my original $$ argument as well, but it is much less obvious how, and much harder, to manipulate tangible, hard numbers than a vaguely defined “impact”.
Ultimately, “success” in academia is hard to measure because the definition of success is highly subjective, field-dependent, and ill-defined. Evaluations are unavoidably influenced by interpersonal relations and biases. I don’t know what the right way forward is, but I know it is definitely not counting citations.