April 2020

Pursuing COVID-19 at Internet Speed

As scientists grapple with the global pandemic, preprint servers let information be free—and fast

By Corey S. Powell

It is the slowest of times, it is the fastest of times. 

The hours drag heavily during the COVID-19 pandemic, as we worry about our safety and about the safety of the people we care about; the wait for effective drugs and vaccines often feels agonizing. And yet, by any traditional measure, the research on the novel coronavirus responsible for the pandemic is progressing at an astonishing pace. On January 9, health officials in China and at the World Health Organization announced the discovery of a novel coronavirus (now known as SARS-CoV-2) responsible for the outbreak that began in Wuhan, China. Just three days later, Chinese officials had sequenced and shared the genome of the virus. 

Since then, scientists around the world have been working furiously to understand and to control COVID-19, releasing results almost as soon as they emerge from the lab. In a recent interview with The New York Times, vaccine researcher Paul Duprex of the University of Pittsburgh describes it as a time to "cut the crap," leaving the formality of journal publishing as something to worry about later. 

A critical part of that accelerated response is preprint publishing—specifically, two resources known as bioRxiv and medRxiv (pronounced "bio-archive" and "med-archive"). These archives, co-created by John Inglis and Richard Sever at Cold Spring Harbor Laboratory, provide a forum where researchers can post their papers freely and publicly. Both sites have grown steadily since the launch of bioRxiv in 2013 and the addition of medRxiv six years later. Together they now host some 80,000 papers, a tally that already includes more than 1,100 papers related to SARS-CoV-2. 

Here, Richard Sever shares some insights about the creation of bioRxiv and medRxiv, how they operate, and their prominent role in the global effort to combat the pandemic.

BioRxiv and medRxiv go against the long tradition of requiring researchers to put their papers through peer review before sharing them. Why bypass one of the central filters of science publishing?

The overall mission is to get research out more quickly. The responsible process of peer review is needed, but it delays the dissemination of work considerably. On average, it takes about eight months to publish a paper in a journal, and a lot of the time people have to go through more than one journal to get published. A rule of thumb is that when you submit a paper it'll probably be about a year before anyone can read it.

What bioRxiv and medRxiv do is say: Look, it's really important to get the information out to experts first, and many of them are perfectly equipped to judge the paper themselves. Why not decouple the evaluation from the dissemination? Disseminate the paper first so that everybody can read it, and have the seal of approval come later.

What effect has the COVID-19 pandemic had on the way people are using the two archives?

We're seeing a huge amount of traffic right now because of the coronavirus pandemic, and we're getting a lot of papers coming into medRxiv. MedRxiv has been growing slowly but surely since its launch midway through last year, but since January of this year we've been receiving a very high volume of papers. Most of the papers coming in now are on COVID-19, and we're seeing a big spike in the number of submissions from China, because obviously they were the country first affected by the virus.

Rapid sharing of scientific information is especially important during a health crisis. Did you have that kind of rapid response in mind from the start?

The aspiration for both of them was always to speed up science by disseminating information more quickly. You're seeing that writ large in the midst of the COVID-19 pandemic. There's even more reason to want to disseminate information, because the consequences of rapid dissemination should be rapid discovery. I follow Twitter quite closely and watch the responses from scientists. An MD just tweeted, "Do you want to plug into the biggest ongoing multi-professional conference in human history? #COVID-19medRxiv."

What are some of the most notable COVID-19 findings being shared on medRxiv and bioRxiv?

We've seen things like estimates of how long coronavirus will remain viable on different surfaces getting huge amounts of traffic. There was the announcement of a serological test for people who had antibodies to the virus, which is something that you want to happen so you can test the people who’ve had the virus. And of course the epidemiology—the studies coming out of China saying how many people have had the virus in different regions, what the outcomes were, the effectiveness of the different strategies—people are very interested in that.

On the bioRxiv side, you've got the fundamental biology of the virus. It's amazing to think how long some of these studies would ordinarily take. We already have papers on the key structures of the proteins involved [in COVID-19 infection] posted on bioRxiv, which is incredible. There’s an unprecedented speed and level of dissemination.

There has also been a lot of mathematical modeling looking at the infectivity [of SARS-CoV-2]. You have all these data coming out about the number of people getting infected, how fast they're getting infected, how many people they are infecting. All of these things are helpful.

How does the publishing response to COVID-19 compare to that for previous epidemics and pandemics? 

There was an article a while ago in the journal PLOS Medicine pointing out that when the first SARS outbreak happened in 2002-2003, 93% of the papers about SARS came out after the epidemic was over. When you got to the Zika epidemic [in 2015], we were seeing papers on bioRxiv in the midst of the epidemic. But bioRxiv was still fairly young, medRxiv didn't exist, and the practice of posting preprints in biology was not familiar to that many people. 

Now with the SARS-CoV-2 virus [responsible for COVID-19], posting preprints, reading preprints, citing them in job applications, putting them on your CV, et cetera, has become much more commonplace in the biomedical sciences. People have immediately turned to medRxiv and bioRxiv in the midst of this epidemic. Chinese scientists were less familiar, but they’ve quickly learned of the practice and adopted it. On medRxiv, we’re now getting something like 20 papers a day on coronavirus alone.

What are the differences between bioRxiv and medRxiv—why do you need two of them?

BioRxiv was designed to handle basic biological research, at a stage way before anything involving patients. It was for people doing basic work in genetics, cell biology, neuroscience, ecology, that type of thing, following a [preprint-posting] model that’s been widespread in physics and computational science for almost 30 years. We thought, we will trial this in basic research first.

Then people like Eric Topol at Scripps started saying, This is really important, we should also do it for clinical research. We’d always excluded clinical research from bioRxiv because there are issues around health claims, registration for clinical trials, et cetera. There are a bunch of things that you worry about when you’re disseminating material that's not been evaluated in any way. So medRxiv came a bit later, and it had a slightly different process.

How do you vet papers to prevent the archives from getting inundated with junk science?

For both archives, there’s a two-stage process. First, we have what are essentially clerical checks. We make sure the thing looks like a paper: It’s not spam, it has all the necessary information, it's run through a plagiarism check to ensure that somebody hasn't taken somebody else's paper and put their name on it.

Then there's a second phase where the paper is examined by scientists. At bioRxiv, there's a group of about 150 active scientists, all of whom have lab leadership positions. They look at each paper to give it a quick thumbs-up or to see if they have any concerns. They’re charged with asking, Is this science? You know: It might be one of the worst papers I've ever seen, but it's a science paper. It's a coarse filter. 

BioRxiv and its companion, medRxiv, have become go-to destinations for sharing COVID-19 research.

Many of the papers on medRxiv deal directly with medicine and human health. Do those require an extra level of scrutiny?

A similar process operates on medRxiv, but there are more hoops that the authors have to jump through. If there is a treatment claim, they have to provide a clinical trial ID from an accepted registry like clinicaltrials.gov. They have to make a series of declarations that they've got consent from any patients or other research participants. They have to make a conflict of interest declaration. They have to declare that they've received approval by an institutional review board, and provide details.

Then we go through the basic screening process. Is this science? Is this nonsense? Is this spam? But also, Is this something that has the potential to cause harm by changing public behavior? If we got a paper that said, for example, the MMR vaccine is causing autism, or cigarettes don't cause cancer, we would turn it away. We’d say, We are making no judgment on the accuracy or quality of the paper, but some things should go through peer review first because the consequences of being wrong are severe.

Do you have an equivalent of a retraction process for dealing with flawed or disputed papers?

No, not as such. We have a withdrawal process. The paper's not actually removed, but it is marked as withdrawn and there's a reason given so that there's a transparent record of what transpired. The most common reason is that somebody posts a paper and then later they do more experiments and say, You know what, we were wrong, some of our assumptions weren't correct. They can then go back to the bioRxiv paper and withdraw it, and say This paper has been withdrawn. We don't want people to cite it because we don't believe the conclusions hold up. It's pretty rare. 

The whole point of the server is to put things out there so that the scientific community can read them, discuss them, and evaluate them. For every bioRxiv paper, below the abstract there is a Comments section. The ideal situation is that people can then comment on the paper and the community can make a decision. We also have links to any blogs or news articles discussing the paper and any Twitter mentions. Since the papers have not undergone peer review, they're all posted caveat emptor.

Also, it's not like the evaluation process at journals isn't flawed. We can all think of many, many papers that had to be retracted [from journals] because they were wrong. The difference with bioRxiv and medRxiv is that there was no kind of claim that they were right in the first place.

Researchers clearly understand that papers in bioRxiv and medRxiv haven't gone through peer review, but do journalists get that as well?

Actually, journalists seem to be doing a very good job. When they write about a paper [from one of the archives], they will go to a bunch of other scientists and ask, Look, there's this thing up there, what do you think?

Do you see the archives as part of a broader change in how researchers share their findings with the world?

That’s an interesting question. Up until now, the purpose of the preprint was the quick dissemination of the work, with the general expectation that at some point down the line there will be a published, peer-reviewed journal article, whether in a few months' time or in a year's time. When we look back after two years, more than 70% of papers [posted in bioRxiv] end up in journals.

Now, what happens in a pandemic when you've got a much more real-time view of science coming through? If you're a Chinese scientist writing an epidemiological summary of something that's coming out of Wuhan on February the 2nd, and you post it as a preprint, are you going to want to publish that preprint as a formal journal article a year later? It may be that there is a whole category of paper that is only ever a preprint—not because of quality but because of timing.

At some point we will be able to do a retrospective analysis for the COVID-19 papers as opposed to regular papers. We've always had a small fraction of papers where people say, You know what, a preprint is good enough; I want to get it out there, but I don't need it to send to a journal for formal publication. My guess is that, in the context of a pandemic, the fraction of papers that fall into that category will go up.

What effect is preprint publishing having on traditional scientific journals?

There are two slightly different answers. For regular papers, aside from the epidemic, it does pose a question to journals: Should they try to do this more, should they try and do this faster? But another question is: Does the fact that the information is already out there mean that a journal can actually do it more thoroughly?

Authors are desperate to get their papers through peer review and published in a journal so that people can read them. If everybody is able to read them already, it could allow journals to sit back and think, What is the most effective way to evaluate these papers? It’s an opportunity for people to think about how we do peer review.

Right now, journals are being bombarded with COVID-19 papers. They feel a responsibility to get the material out as fast as possible, but there are always trade-offs between speed and how thorough you are. We've already seen this in the case of two or three published papers on coronavirus that have had some problems. In this tension between speed and thoroughness, preprints offer the opportunity to tilt the balance.

Journals don’t feel like you are trying to bypass them?

There were a few journals that used to consider preprints prior publication, but we were effectively able to make the case that this wasn't competing with journals. Journals are thorough vetters of work. They use peer review to scrutinize, assess, and certify papers. The service they perform is still valuable. There are a lot of journals that now automatically send papers to bioRxiv, and there's a vast number of journals that will receive papers directly from bioRxiv. Particularly in the context of the COVID-19 epidemic, some journals have actively endorsed the sharing of preprints.

Physicists began publishing preprints way back in 1991, but biologists and medical researchers had long been reluctant to follow suit. How did you overcome that resistance? 

The first step was to acclimatize biologists to this process and get them to accept that biology is no different from physics. There were some claims that in biology you needed to peer-review everything before anybody would read it. There were arguments that biologists were more competitive and wouldn’t reveal stuff before it's peer-reviewed. The physicists said, “You've obviously never met a physicist if you don’t think they are competitive!”

People had concerns: Will I be scooped? Actually, what you're using is an anti-scooping device. Once you put the work online with your name on, it's there and you can point to it. If somebody says, I was the first to do X, you can say, Here's this thing on bioRxiv from a year ago with my name on it. It’s permanent and is citable. Now you’ll go to a scientific conference and the day before people talk about their work they'll post a paper on bioRxiv so they have their marker down.

Another concern was, Is this going to increase noise? Is there going to be loads of misinformation? Not really, because if a paper is no good, nobody's making you read it. More importantly, in academia your reputation is key. Physicists talk about sweating bullets before they press the Submit button on arXiv [the physics preprint server], and I think that's the same in biology. You’re judged on your work, so if you start putting out half-finished crap on bioRxiv, you will get a reputation as the guy who does half-finished crappy work.

COVID-19 has greatly raised the profile of bioRxiv and medRxiv. Where do you go from here?

The most important answer is the somewhat boring one: We just want more and more scientists to use them. There are between 1,000,000 and 2,000,000 papers published every year in biomedical science. The ideal would be that all of those go on bioRxiv and medRxiv, and then they will be evaluated by a constellation of different journals. We want to speed up science by making sure everybody releases stuff as soon as possible.

Beyond that, there are technological challenges. We have 77,000 papers on bioRxiv right now; we have to think about what it looks like if you have 1,000,000. It’s like the Red Queen hypothesis: You have to run to stand still. Meanwhile, all kinds of other archives have come along. After we launched bioRxiv, immediately you got ChemRxiv, SocArXiv, engrXiv, PsyArXiv, a whole constellation of them. What’s nice about them is that they are subject-specific, nonprofit, community-based repositories.

We’re not planning to build a whole load more archives. We're very mindful that being able to do what we’re doing with 10 times as many papers will be a huge effort. We want to keep our eye on the main mission for biology and health sciences.