THE FUTURE OF REVIEWING IN THE ACL COMMUNITY
We are seeking community feedback on the content of this document in a survey available here: https://forms.office.com/r/44K0sFkrBA. The deadline for taking the survey is June 15th.
You can find information about the panel on the future of reviewing held at ACL 2022 in Dublin here.
This document was written by or with input from the ACL reviewing committee, the ACL and NAACL 2022 program chairs, the ARR team, as well as members of the ACL community (in particular a thread started by Ani Nenkova, Djame Seddah, Ryan Cotterell and Hal Daume proved useful). The writing process was coordinated by Hinrich Schuetze who also crafted the final version you see here. An attempt has been made to collect all important decision points from this input (resulting in 16 dimensions) and to make recommendations that reflect the consensus across all contributors.
Point of contact: Hinrich Schütze, hinrich@hotmail.com
An important role of ACL in the NLP community is to
organize the review and publication of
research papers.
We believe that ACL and the ACL community want this reviewing and publication process to be:
● transparent
● timely
● consistent
● fair
● anonymous
● conducted through peer review
● held to high standards of scientific quality
● designed to minimize overhead for the community and organizers
We can only achieve this if there is high participation by the community, most importantly, the motivation to submit work to venues organized by ACL and the willingness to serve as a reviewer.
If these goals can be achieved, then ACL will continue to be one of the drivers of scientific progress in the field, organize venues that foster exchange and collaboration, and support members of the community in advancing through the different stages of their careers.
There was a general sense in the ACL community in 2019 that our reviewing and publication process was broken. Specifically, many community members believed that reviewing and publication were not timely, consistent, anonymous, and fair. This coincided with rapid community growth that changed the field's demographics, and with a dramatic increase in the number of submissions and incremental resubmissions, which heavily overloaded the reviewers and the reviewing infrastructure available at the time. The complexity of reviewing itself has also increased greatly as more stakeholders have become involved, the PCs have adopted a hierarchical structure (SAC, AC), and new requirements, such as the reproducibility of research, continuously enter the arena. At the same time, conference organization and reviewing are largely carried out by a small number of volunteers with relatively little institutionalized support behind them.
To address this, the ACL reviewing committee was created at the ACL meeting in Florence in 2019. It proposed ACL Rolling Review (ARR) as a way to improve reviewing and publication at ACL. After receiving community feedback, ACL started ARR as an experiment in May 2021.
Based on the experience with ARR so far, this document presents options
for how to develop reviewing in the ACL community in the future. It is organized along a set of dimensions,
where we give either a recommendation or a set
of options with associated pros/cons for each dimension. Even though we cannot freely mix and match different elements of the complex design space of
reviewing/publishing, our hope is that the
information we provide will be a good basis
for the ACL Exec and the ACL community to
develop a strategy for the future of reviewing in
NLP.
● infrastructure: The infrastructure created by the ARR tech director and his team. See the following sections for a summary of this system.
● integrated reviewing system: The reviewing for all ACL conferences is handled
by a single integrated reviewing system; this integrated reviewing system requires an infrastructure of the type that was created for ARR.
● rolling review (RR): Rolling deadlines for submission of papers, revise-and-resubmit etc.; rolling review requires an integrated reviewing system.
● ARR: ACL Rolling Review, the system that is currently running as a pilot and that was used for ACL 2022 and NAACL 2022. ARR is an instance of rolling review (RR), but one could also imagine other implementations of RR.
● fragmented reviewing system: A system where the infrastructure is re-created for each individual conference, with no continuity or visibility into what is happening.
● AC: area chair
● AE: action editor
● SAE: senior action editor
● SAC: senior area chair
● PC: program chair
ARR has built an infrastructure, on top of OpenReview, for keeping track of
data associated with reviewing and supporting
a range of functionality. This includes:
● keeping track of submissions over time (this avoids the "starting from scratch" problem in which information about submissions and reviews is not shared across conferences)
● managing resubmissions (keeping track of the different revisions of the paper, the corresponding reviews and meta-reviews, making sure an attempt is made to assign the paper to the same reviewers and meta-reviewers)
● managing reviewers (including tracking reviewing load and reviewer demographics over time)
● managing action editors
● monitoring the health of reviewing over time (e.g., is the quality of reviews getting better or worse?)
● managing load balance over longer periods of time
● automatic assignment of papers to reviewers, reviewers to action editors, and meta-reviewers to senior action editors
● COI handling
● generation of letters of recommendation/recognition for service provided to the ACL community. Reviewers use these, depending on their country of location and affiliation type, in end-of-year performance reviews, tenure and promotion cases, and for visa applications.
● enforcing responsible and ethical NLP research (e.g., through the Responsible NLP research checklist)
● an ethically approved data donation workflow for all authors and reviewers to collect a dataset for scaffolding and intelligent support of the reviewing process over time
We believe that we want to have most (if not all) of this
infrastructure / functionality available for the
foreseeable future. The alternative
would be what we define above as a “fragmented reviewing system”: a system
where the infrastructure is re-created for each individual conference, with no
continuity or visibility into what is happening.
The ARR infrastructure still requires improvement before it can optimally serve the community.
(For example, OpenReview is currently not set up for resubmissions, which
causes a lot of pain.) However, rebuilding the
infrastructure on top of another system or
building an entirely separate system just for
ACL does not seem feasible to us.
● ACL should keep the infrastructure.
● Maintaining the infrastructure with volunteer work alone is not sustainable. ACL should support the infrastructure by paying for 1 FTE, who would take over a large part of the workload that is currently handled by volunteers.
Initially, RR was proposed as a mechanism for handling all
reviewing in NLP. However, RR has turned out to be a
good fit for some venues, but not for others.
When reviewing is not anonymous, RR is not a good fit because of the overhead of offering both anonymous and
non-anonymous reviewing in an integrated system. For
example, submissions for shared tasks and demo
tracks are often non-anonymous.
In our community, submissions to a workshop are often intended for that particular workshop, e.g., because the workshop will be a gathering of the subcommunity working on a specific subtopic and members of that subcommunity want to participate in order to present their ongoing work. This reasoning also applies to special themes at conferences.
Finally, the question has been raised whether it is a good use of our resources to support high-quality reviews for workshops -- it is unclear whether this is possible given the limited personpower of ARR action editors and reviewers, at least at the present moment.
If ARR is continued, it should focus on the major ACL conferences (ACL, NAACL, EACL, AACL, EMNLP) while the system is still under development. ARR should be used for “general” submissions
that would be a good fit for any of the major conferences. These are
submissions that are most likely to benefit from the possibility of
resubmission to the next conference in the next cycle. Workshops should run
their reviewing separately. However, they are free to accept papers reviewed in
a rolling review system (these will generally be papers whose evaluation makes
it unlikely they will be accepted by a major conference).
The original ARR proposal envisioned a one-month cycle.
Currently, ARR uses a 6-week cycle.
Longer cycles have been proposed because they potentially reduce stress and workload for all involved.
Bidding, manual reviewer reassignment, author response and reviewer discussion are only possible with cycles that are at least 6 weeks long.
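As a purely illustrative back-of-the-envelope calculation (the phase durations below are our assumptions, not ARR's actual schedule), the following sketch shows why these activities are hard to fit into a cycle much shorter than 6 weeks:

# Purely illustrative: hypothetical durations (in days) for the activities a
# review cycle would need to accommodate. These numbers are assumptions made
# for the sake of the argument, not ARR's actual schedule.
phases = {
    "reviewer assignment and bidding": 7,
    "reviewing": 14,
    "author response": 5,
    "reviewer discussion": 5,
    "meta-reviewing": 7,
    "buffer for late reviews / manual reassignment": 4,
}
total_days = sum(phases.values())
print(f"{total_days} days, i.e. about {total_days / 7:.0f} weeks")  # 42 days, i.e. about 6 weeks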
We recommend an 8-week cycle because it seems to be the
best compromise between timeliness of
decisions on the one hand and the
considerations listed above on the other (reduction of workload and stress, and the possibility of reintroducing bidding, manual reviewer reassignment, author response and reviewer discussion).
However, whoever makes decisions on cycle length should be given a fair amount of discretion. One complicating factor is the timing of conferences. For example, if two conferences are held very shortly one after the other (as was the case with ACL 2022 in late May 2022 and NAACL 2022 in mid-July 2022, i.e., separated by less than 8 weeks), a longer cycle may not be feasible. If ARR is to continue, the conference calendars should take into consideration the length of the ARR cycle (ideally, two conferences would not be as close in the calendar as ACL 2022 and NAACL 2022 were).
Jonathan Kummerfeld has proposed an alternative way of handling cycles that incentivizes a smoother distribution of submissions across months, to more evenly balance the reviewing load. You can find it here:
Jonathan Kummerfeld’s proposal
One of the key motivators for RR was that there is a lot of wasted effort if a paper is rereviewed from scratch when it is resubmitted to a new conference.
On the other hand, a major criticism of RR has been that
community members liked a fragmented reviewing
system (each conference's reviewing is separate from
previous conferences) because it allows a
reset if a bad set of reviewers was assigned. Here, "bad" may mean that the reviewers were not competent, but it can also just mean that the subjective component present in any reviewing worked against the submitted paper.
In ARR, authors may request new reviewers/action editors by clicking a radio button in the ARR Submission form. No justification for the request is necessary.
While keeping the same
review(er)s should be encouraged, it should
also be possible without too much effort to ask for
a new set of reviewers. We recommend that this policy be adopted as an ACL policy, regardless of which reviewing
framework is instituted.
This could also be extended to action editors. It probably
is not a good idea to extend it to senior action
editors (SAEs), especially if senior action
editors are identified with tracks (see
below).
A defining feature of RR is the decoupling of reviewing and acceptance decisions. If the two are closely coupled,
then the reviews and metareviews of the first
cycle at conference A cannot be reused in the
next cycle for conference B because a rereview
of the paper would be necessary.
A considerable number of RR reviews are less useful for making acceptance decisions than reviews in a “fragmented reviewing system”, in which the (meta)reviewer writes a (meta)review for a specific conference, with a complete understanding of what the acceptance criteria for that conference are.
(i) Return to a fragmented
reviewing system: each (meta)review is only
usable for a single conference.
(ii) Rewrite metareviews for each conference. Only the reviews would be "permanent"; the metareviews would be one-off, usable for a single conference only. Arguably, the effort of rewriting, say, an ACL metareview for NAACL should be low, so the additional cost of doing this should be manageable.
(iii) Harmonize standards for our major conferences and
make explicit reference to those standards.
For example, (metareview) "Based on the
strengths/weaknesses listed above, I
recommend accepting this paper as a regular conference
paper at one of our major conferences: ACL,
NAACL, EACL, AACL, EMNLP."
(iv) Keep the current
design of ARR, i.e., generic (meta)reviews
that characterize a paper's strengths and weaknesses
without reference to a particular venue. Perhaps this original design can be made to work by tweaking the reviewing instructions. If the metareview is well written,
a senior area chair at a conference should be able to
make a decision based on it even though the
metareview is not specific to the conference.
There are divergent opinions about these four options.
(iii) could be the most elegant and clean option, but may
be difficult to implement. (ii) would be an
incremental fix that
should address some of the issues that program chairs
/ SAEs at conferences pointed out, at a limited cost in terms of additional load (i.e., no additional
load for reviewers, lightweight additional load for metareviewers).
One of the key motivators for RR was to reduce the overhead
of reviewing at ACL conferences.
A lot has
been learned in the ARR experiment. The system was stress-tested. We have seen
that an RR system can handle 3000 submissions in a month. Of course, during this
test, we also found a large number of problems. (Highlighting and discussing
these is one purpose of this document.) While not all goals of the experiment have been achieved during the
first year of ARR, it seems clear that an RR system,
once it runs smoothly, would indeed reduce the overhead
of reviewing and conference organization.
In a fragmented reviewing system, the organizers of a conference start from scratch without prior experience. If instead the system is run by a group of people who handle several review cycles and conferences in a row, then this is expected to be more efficient.
Apart from bugs, start-up problems and OpenReview issues, there are at least two elements of the current ARR setup that work against reducing the overhead.
First, there is perceived to be a mismatch between the
(meta)reviews ARR delivers and the (meta)reviews the
conferences need for decision making. See discussion
above.
Second, there are currently 4 reviewers assigned to each paper, not 3. This policy was instituted because only around 75% of reviews were arriving on time within a one-month cycle (at that rate, assigning four reviewers yields roughly three on-time reviews per paper in expectation). With a longer cycle (6 weeks or 8 weeks), the completion rate of reviews should approach 100% over time, and it may be feasible to assign only 3 review(er)s to each submission.
The perceived mismatch between ARR metareviews and what conferences need was addressed above.
One would hope that we only need three reviewers per paper
in an ARR system that runs smoothly.
Apart from the issue of 3 vs 4 reviews: it seems unlikely
that returning to a fragmented reviewing system would
reduce load on everybody involved in the
process.
Given these considerations, we recommend keeping some form
of RR (or at least an integrated system) to reduce
the load on the community in the future.
Peer review only works if community members are willing
to serve as reviewers. Peer review should be fair:
everybody should do their share -- as opposed to a
few reviewing a lot and many not at all. Peer
reviewing also needs to be flexible: if someone cannot review for a period of time due to illness, family obligations, etc., then we should be able to accommodate this.
The infrastructure created for ARR supports the desired functionality outlined above. We recommend keeping
it in support of effective, fair and flexible peer
review.
The infrastructure could also be extended easily to support additional functionality. For example, it could be used to ensure that someone who submits more papers also signs up for more reviews (see the sketch below).
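A minimal sketch of such a rule, assuming a hypothetical quota function and illustrative numbers (neither is ARR policy):

# Hypothetical rule tying reviewing load to the number of own submissions in a
# cycle. The "1 review owed per submission" rule and the cap of 6 are
# assumptions for the sake of the example, not ARR policy.

def required_reviews(num_submissions: int, per_submission: int = 1, cap: int = 6) -> int:
    """Number of reviews an author would owe in a cycle under this rule."""
    return min(num_submissions * per_submission, cap)

# An author who submits 3 papers in a cycle would owe 3 reviews:
assert required_reviews(3) == 3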
OpenReview supports anonymous preprints. One could also
contemplate non-anonymous preprints within an RR
framework, e.g., once a review cycle for a
paper has been concluded. Finally, the ACL
anonymity policy could also be adapted to take
account of changes in ACL reviewing in the last
five years or so.
A separate document has been created by the ACL reviewing
committee that summarizes the options and makes
recommendations.
PROPOSAL FOR RR PUBLICATION POLICY
The infrastructure created for ARR supports COI handling, modulo a number of bugs that have been or will be fixed. We recommend keeping the ARR infrastructure for COI handling.
Other things being equal, the reviewers assigned to a paper should be diverse in at least two respects: there should be at least one senior reviewer (not just junior reviewers), and no two reviewers should be from the same group. This has not been enforced in the past and is also not enforced in ARR.
These two constraints should be mandatory for reviewer assignments: at least one senior reviewer and no two reviewers from the same group. We recommend that this policy be adopted and enforced.
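As an illustration, the two constraints can be expressed as a simple check on a candidate assignment. The data model below (a reviewer with a seniority flag and a group identifier) is an assumption made for this sketch, not the ARR schema:

# Illustrative check for the two proposed assignment constraints:
# (1) at least one senior reviewer, (2) no two reviewers from the same group.
from dataclasses import dataclass

@dataclass
class Reviewer:
    name: str
    is_senior: bool
    group: str  # e.g., a lab or research-group identifier

def assignment_is_valid(reviewers: list[Reviewer]) -> bool:
    has_senior = any(r.is_senior for r in reviewers)
    groups = [r.group for r in reviewers]
    no_group_clash = len(groups) == len(set(groups))
    return has_senior and no_group_clash

# Example: two reviewers from the same group -> invalid assignment.
candidate = [Reviewer("A", True, "lab1"),
             Reviewer("B", False, "lab1"),
             Reviewer("C", False, "lab2")]
assert not assignment_is_valid(candidate)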
The original ARR proposal envisioned "track-less"
reviewing, i.e., similar to reviewing at many
journals, there is no track structure.
The main reason for abandoning tracks was that doing so considerably simplifies setting up a system like ARR and eliminates some known disadvantages of tracks, such as inflexibility and differences in quality and size among tracks.
However, the Program Chairs of ACL 2022 and NAACL 2022, as well as members of the community, have voiced some concerns about abolishing tracks, specifically:
● Manual assignment of reviewers is more difficult without tracks, especially when dealing with emergency assignments.
● Balancing the program of a conference (traditionally done through tracks) is more difficult without tracks.
● We know that there is a lot of randomness to reviewing even in the best reviewing setup. This randomness can be controlled by tracks. Example: If a small subfield like Morphology does not have its own track, then the randomness of reviewing decisions may result in almost no papers accepted in one review cycle and many more than average in another. This is undesirable for those who depend on accepted papers for advancement in their career as well as for a balanced conference program. Tracks smooth out these random variations.
● Submissions in some small subfields of NLP like morphology may get worse reviews if they are reviewed by the general reviewing pool.
● When an SAC puts together the ranked list of papers for a track, they compare metareviews. Metareviews are more easily comparable if they are from a small set of action editors. At ACL 2022, one track had metareviews from 64 action editors. This makes comparison of papers based on metareviews harder.
(i) If an RR-type system continues: introduce tracks. All
conferences will be locked into those tracks.
(ii) Use senior action editors as stand-ins for tracks
(this is what ARR has begun doing).
(iii) No tracks, but improve assignment in another
way.
If reviewing is managed in an integrated system, then there
are many opportunities to measure and improve the
performance of the system that do not exist in a
fragmented approach.
Currently, ARR has implemented and is working to improve:
1) Reviewer profiles -- critical so that authors and reviewers do not have to create new profiles again and again, and critical for COI handling and reviewer assignment
a) Reviewer load tracking
b) Review ‘balance’ tracking
2) Sustainable reviewer assignment with integrity
a) COI detection
b) Expertise-driven reviewer assignment
3) Ethics reviewing (with the ACL ethics committee)
4) Reviews as data
a) Collecting reviews as data for the NLP community
b) Collecting assessments of review quality from authors and metareviewers
5) Consistent review and submission statistics
a) For the first time, we know the number of resubmissions over a calendar year, and we know the outcomes for many of those resubmissions
We recommend keeping the
common infrastructure created for ARR in order
to measure and improve the performance of reviewing for ACL venues.
A critical element of a high-performing reviewing system is
mentoring and training of reviewers -- junior
reviewers as well as reviewers that do not
perform at the highest level. Mentoring and
training should also be extended to action
editors.
The scope of mentoring/training should obviously include the (meta)reviews themselves (their utility as a basis for acceptance decisions, their helpfulness to authors, and observance of basic rules such as politeness and giving citations for missing related work). Timeliness would be another important subject.
For action editors, reviewer assignment should be covered.
Reviewing/publication are of great interest as research
topics to our community. To conduct research, we need
to collect the data. An integrated system
makes data collection and dissemination
easier. In the long run, this research will lead to better tools, especially to
support inexperienced reviewers and improve reviewing workflows.
Keep the integrated system, encourage collection and
dissemination of datasets for research on
reviewing/publication.
A basic convention at NLP workshops and conferences is that if you submit a paper, its review will be completed in time for the venue to make an acceptance decision.
ARR’s default currently is not to give such a guarantee. The reason is that it is easy to reach a high completion rate of 90% (or
perhaps 95% once the system is debugged), but to get
to 100%
takes considerable resources. Note that journals generally
do not guarantee completion of reviewing by a certain
date -- precisely because they do not want to
expend the resources on an issue that affects
a small percentage of submissions.
Allow ARR-reviewed submissions with fewer than the necessary number of reviews (fewer than 3 reviews or missing a meta-review) to nonetheless be committed to a venue -- the decision-making process of the venue would then proceed in the same manner as it did before (in the fragmented reviewing system) for those few papers.
ARR has adopted an opt-out policy: whoever submits a paper
for a cycle must also review in that cycle unless
they have a good reason why they cannot.
Some community members (as well as some PC chairs) would prefer opt-in. They feel that many people have complicated lives and should not be forced to review, even when it is not easy for them to justify a request to opt out (e.g., for reasons of privacy).
Peer review only works if most of those who submit papers
also contribute reviews. This argues for opt-out. Of
course, there needs to be an effective mechanism for
opting out if someone has a good reason for
not being able to help with the reviewing.
We therefore recommend opt-out.
There are many aspects of reviewing in the ACL community
that should be handled consistently across
conferences: ethics, reproducibility, COIs,
collecting data for research, review(er) stickiness, mentoring, load balancing,
reviewer recognition, ...
RECOMMENDATION
An integrated reviewing system supports consistency (as
well as lasting innovative changes to how these
matters are handled) and should therefore be
made permanent.