THE FUTURE OF REVIEWING IN THE ACL COMMUNITY
We are seeking community feedback on the content of this document in a survey available here: https://forms.office.com/r/44K0sFkrBA. The deadline for taking the survey is June 15th.
You can find information about the panel on the future of reviewing held at ACL 2022 in Dublin here.
This document was written by or with input from the ACL reviewing committee, the ACL and NAACL 2022 program chairs, the ARR team, as well as members of the ACL community (in particular a thread started by Ani Nenkova, Djame Seddah, Ryan Cotterell and Hal Daume proved useful). The writing process was coordinated by Hinrich Schuetze who also crafted the final version you see here. An attempt has been made to collect all important decision points from this input (resulting in 16 dimensions) and to make recommendations that reflect the consensus across all contributors.
Point of contact: Hinrich Schütze, hinrich@hotmail.com
An important role of ACL in the NLP community is to
organize the review and publication of
research papers.
We believe that ACL and the ACL community want this reviewing and publication process to be:
● transparent
● timely
● consistent
● fair
● anonymous
● conducted through peer review
● held to high standards of scientific quality
● designed to minimize overhead for the community and organizers
We can only achieve this if there is high participation by the community, most importantly, the motivation to submit work to venues organized by ACL and the willingness to serve as a reviewer.
If these goals can be achieved, then ACL will continue to be one of the drivers of scientific progress in the field, organize venues that foster exchange and collaboration, and support members of the community in advancing through the different stages of their careers.
There was a general sense in the ACL community in 2019 that our reviewing and publication process was broken. Specifically, many community members believed that reviewing and publication were not timely, consistent, anonymous, and fair. This coincided with rapid community growth that changed the field's demographics, and with a dramatic increase in the number of submissions and incremental resubmissions, which heavily overloaded the reviewers and the reviewing infrastructure available at the time. The complexity of reviewing itself has also increased greatly as more stakeholders have become involved, the PCs have adopted a hierarchical structure (SAC, AC), and new requirements, such as the reproducibility of research, continuously enter the arena. At the same time, conference organization and reviewing are largely carried out by a small number of volunteers with relatively little institutionalized support behind them.
To address this, the ACL reviewing committee was created at the ACL meeting in Florence in 2019. It proposed ACL Rolling Review (ARR) as a way to improve reviewing and publication at ACL. After receiving community feedback, ACL started ARR as an experiment in May 2021.
Based on the experience with ARR so far, this document presents options
for how to develop reviewing in the ACL community in the future. It is organized along a set of dimensions,
where we give either a recommendation or a set
of options with associated pros/cons for each dimension. Even though we cannot freely mix and match different elements of the complex design space of
reviewing/publishing, our hope is that the
information we provide will be a good basis
for the ACL Exec and the ACL community to
develop a strategy for the future of reviewing in
NLP.
● infrastructure: The infrastructure created by the ARR tech director and his team. See the following sections for a summary of this system.
● integrated reviewing system: The reviewing for all ACL conferences is handled
by a single integrated reviewing system; this integrated reviewing system requires an infrastructure of the type that was created for ARR.
● rolling review (RR): Rolling deadlines for submission of papers, revise-and-resubmit etc.; rolling review requires an integrated reviewing system.
● ARR: ACL Rolling Review, the system that is currently running as a pilot and that was used for ACL 2022 and NAACL 2022. ARR is an instance of rolling review (RR), but one could also imagine other implementations of RR.
● fragmented reviewing system: A system where the infrastructure is re-created for each individual conference, with no continuity or visibility into what is happening.
● AC: area chair
● AE: action editor
● SAE: senior action editor
● SAC: senior area chair
● PC: program chair
ARR has built an infrastructure, on top of OpenReview, for keeping track of
data associated with reviewing and supporting
a range of functionality. This includes:
● keeping track of submissions over time (this avoids the "starting from scratch" problem in which information about submissions and reviews is not shared across conferences)
● managing resubmissions (keeping track of the different revisions of the paper, the corresponding reviews and meta-reviews, making sure an attempt is made to assign the paper to the same reviewers and meta-reviewers)
● managing reviewers (including tracking reviewing load and reviewer demographics over time)
● managing action editors
● monitoring the health of reviewing over time (e.g., is the quality of reviews getting better or worse?)
● managing load balance over longer periods of time
● automatic assignment of papers to reviewers, reviewers to action editors, and meta-reviewers to senior action editors
● COI handling
● generation of letters of recommendation/recognition for service provided to the ACL community. Reviewers use these, depending on their country of location and affiliation type, in end-of-year performance reviews, tenure and promotion cases, and for visa applications.
● enforcing responsible and ethical NLP research (e.g., through the Responsible NLP research checklist)
● an ethically approved data donation workflow for all authors and reviewers to collect a dataset for scaffolding and intelligent support of the reviewing process over time
We believe that we want to have most (if not all) of this
infrastructure / functionality available for the
foreseeable future. The alternative
would be what we define above as a “fragmented reviewing system”: a system
where the infrastructure is re-created for each individual conference, with no
continuity or visibility into what is happening.
The ARR infrastructure still requires improvement before it can optimally serve the community.
(For example, OpenReview is currently not set up for resubmissions, which
causes a lot of pain.) However, rebuilding the
infrastructure on top of another system or
building an entirely separate system just for
ACL does not seem feasible to us.
● ACL should keep the infrastructure.
● Maintaining the infrastructure with volunteer work alone is not sustainable. ACL should support the infrastructure by paying for 1 FTE, who would take over a large part of the workload that is currently handled by volunteers.
Initially, RR was proposed as a mechanism for handling all
reviewing in NLP. However, RR has turned out to be a
good fit for some venues, but not for others.
When reviewing is not anonymous, RR is not a good fit because of the overhead of offering both anonymous and
non-anonymous reviewing in an integrated system. For
example, submissions for shared tasks and demo
tracks are often non-anonymous.
In our community, submissions to a workshop are often intended for that particular workshop, e.g., because the workshop will be a gathering of the subcommunity working on a specific subtopic and members of that subcommunity want to participate in order to present their ongoing work. This reasoning also applies to special themes at conferences.
Finally, the question has been raised whether it is a good use of our resources to support high-quality reviews for workshops -- it is unclear whether this is possible given the limited personpower of ARR action editors and reviewers, at least at the present moment.
If ARR is continued, it should focus on the major ACL conferences (ACL, NAACL, EACL, AACL, EMNLP) while the system is still under development. ARR should be used for “general” submissions
that would be a good fit for any of the major conferences. These are
submissions that are most likely to benefit from the possibility of
resubmission to the next conference in the next cycle. Workshops should run
their reviewing separately. However, they are free to accept papers reviewed in
a rolling review system (these will generally be papers whose evaluation makes
it unlikely they will be accepted by a major conference).
The original ARR proposal envisioned a one-month cycle.
Currently, ARR uses a 6-week cycle.
Longer cycles have been proposed because they potentially reduce stress and workload for all involved.
Bidding, manual reviewer reassignment, author response and reviewer discussion are only possible with cycles that are at least 6 weeks long.
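As a purely illustrative back-of-the-envelope calculation (the phase durations below are our assumptions, not ARR's actual schedule), the following sketch shows why these activities are hard to fit into a cycle much shorter than 6 weeks:

# Purely illustrative: hypothetical durations (in days) for the activities a
# review cycle would need to accommodate. These numbers are assumptions made
# for the sake of the argument, not ARR's actual schedule.
phases = {
    "reviewer assignment and bidding": 7,
    "reviewing": 14,
    "author response": 5,
    "reviewer discussion": 5,
    "meta-reviewing": 7,
    "buffer for late reviews / manual reassignment": 4,
}
total_days = sum(phases.values())
print(f"{total_days} days, i.e. about {total_days / 7:.0f} weeks")  # 42 days, i.e. about 6 weeks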
We recommend an 8-week cycle because it seems to be the
best compromise between timeliness of
decisions on the one hand and the
considerations listed above on the other (reduction of workload and stress, and the possibility of reintroducing bidding, manual reviewer reassignment, author response and reviewer discussion).
However, whoever makes decisions on cycle length should be given a fair amount of discretion. One complicating factor is the timing of conferences. For example, if two conferences are held very shortly one after the other (as was the case with ACL 2022 in late May 2022 and NAACL 2022 in mid-July 2022, i.e., separated by less than 8 weeks), a longer cycle may not be feasible. If ARR is to continue, the conference calendars should take into consideration the length of the ARR cycle (ideally, two conferences would not be as close in the calendar as ACL 2022 and NAACL 2022 were).
Jonathan Kummerfeld has proposed an alternative way of handling cycles that incentivizes a smoother distribution of submissions across months, to more evenly balance the reviewing load. You can find it here:
Jonathan Kummerfeld’s proposal
One of the key motivators for RR was that there is a lot of wasted effort if a paper is rereviewed from scratch when it is resubmitted to a new conference.
On the other hand, a major criticism of RR has been that
community members liked a fragmented reviewing
system (each conference's reviewing is separate from
previous conferences) because it allows a
reset if a bad set of reviewers was assigned. Here, "bad" may mean that the reviewers were not competent, but it can also just mean that the subjective component present in any reviewing worked against the submitted paper.
In ARR, authors may request new reviewers/action editors by clicking a radio button in the ARR Submission form. No justification for the request is necessary.
While keeping the same
review(er)s should be encouraged, it should
also be possible without too much effort to ask for
a new set of reviewers. We recommend that this policy be adopted as an ACL policy, regardless of which reviewing
framework is instituted.
This could also be extended to action editors. It probably
is not a good idea to extend it to senior action
editors (SAEs), especially if senior action
editors are identified with tracks (see
below).
A defining feature of RR is the decoupling of reviewing and acceptance decisions. If the two are closely coupled,
then the reviews and metareviews of the first
cycle at conference A cannot be reused in the
next cycle for conference B because a rereview
of the paper would be necessary.
A considerable number of RR reviews are less useful for making acceptance decisions than reviews in a “fragmented reviewing system”, in which the (meta)reviewer writes a (meta)review for a specific conference, with a complete understanding of what the acceptance criteria for that conference are.
(i) Return to a fragmented
reviewing system: each (meta)review is only
usable for a single conference.
(ii) Rewrite metareviews for each conference. Only the reviews would be "permanent"; the metareviews would be one-off, usable for a single conference only. Arguably, the effort of rewriting, say, an ACL metareview for NAACL should be low, so the additional cost of doing this should be manageable.
(iii) Harmonize standards for our major conferences and
make explicit reference to those standards.
For example, (metareview) "Based on the
strengths/weaknesses listed above, I
recommend accepting this paper as a regular conference
paper at one of our major conferences: ACL,
NAACL, EACL, AACL, EMNLP."
(iv) Keep the current
design of ARR, i.e., generic (meta)reviews
that characterize a paper's strengths and weaknesses
without reference to a particular venue. Perhaps this original design can be made to work by tweaking the reviewing instructions. If the metareview is well written,
a senior area chair at a conference should be able to
make a decision based on it even though the
metareview is not specific to the conference.
There are divergent opinions about these four options.
(iii) could be the most elegant and clean option, but may
be difficult to implement. (ii) would be an
incremental fix that
should address some of the issues that program chairs
/ SAEs at conferences pointed out, at a limited cost in terms of additional load (i.e., no additional
load for reviewers, lightweight additional load for metareviewers).
One of the key motivators for RR was to reduce the overhead
of reviewing at ACL conferences.
A lot has
been learned in the ARR experiment. The system was stress-tested. We have seen
that an RR system can handle 3000 submissions in a month. Of course, during this
test, we also found a large number of problems. (Highlighting and discussing
these is one purpose of this document.) While not all goals of the experiment have been achieved during the
first year of ARR, it seems clear that an RR system,
once it runs smoothly, would indeed reduce the overhead
of reviewing and conference organization.
In a fragmented reviewing system, the organizers of a conference start from scratch without prior experience. If instead the system is run by a group of people who handle several review cycles and conferences in a row, then this is expected to be more efficient.
Apart from bugs, start-up problems and OpenReview issues, there are at least two elements of the current ARR setup that work against reducing the overhead.
First, there is perceived to be a mismatch between the
(meta)reviews ARR delivers and the (meta)reviews the
conferences need for decision making. See discussion
above.
Second, there are currently 4 reviewers assigned to each paper, not 3. This policy was instituted because only around 75% of reviews were arriving on time within a one-month cycle (at that rate, assigning four reviewers yields roughly three on-time reviews per paper in expectation). With a longer cycle (6 weeks or 8 weeks), the completion rate of reviews should approach 100% over time, and it may be feasible to assign only 3 review(er)s to each submission.
The perceived mismatch between ARR metareviews and what conferences need was addressed above.
One would hope that we only need three reviewers per paper
in an ARR system that runs smoothly.
Apart from the issue of 3 vs 4 reviews: it seems unlikely
that returning to a fragmented reviewing system would
reduce load on everybody involved in the
process.
Given these considerations, we recommend keeping some form
of RR (or at least an integrated system) to reduce
the load on the community in the future.
Peer review only works if community members are willing
to serve as reviewers. Peer review should be fair:
everybody should do their share -- as opposed to a
few reviewing a lot and many not at all. Peer
reviewing also needs to be flexible: if someone cannot review for a period of time due to illness, family obligations, etc., then we should be able to accommodate this.
The infrastructure created for ARR supports the desired functionality outlined above. We recommend keeping
it in support of effective, fair and flexible peer
review.
The infrastructure could also be extended easily to support additional functionality. For example, it could be used to ensure that someone who submits more papers also signs up for more reviews (see the sketch below).
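A minimal sketch of such a rule, assuming a hypothetical quota function and illustrative numbers (neither is ARR policy):

# Hypothetical rule tying reviewing load to the number of own submissions in a
# cycle. The "1 review owed per submission" rule and the cap of 6 are
# assumptions for the sake of the example, not ARR policy.

def required_reviews(num_submissions: int, per_submission: int = 1, cap: int = 6) -> int:
    """Number of reviews an author would owe in a cycle under this rule."""
    return min(num_submissions * per_submission, cap)

# An author who submits 3 papers in a cycle would owe 3 reviews:
assert required_reviews(3) == 3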
OpenReview supports anonymous preprints. One could also
contemplate non-anonymous preprints within an RR
framework, e.g., once a review cycle for a
paper has been concluded. Finally, the ACL
anonymity policy could also be adapted to take
account of changes in ACL reviewing in the last
five years or so.
A separate document has been created by the ACL reviewing
committee that summarizes the options and makes
recommendations.
PROPOSAL FOR RR PUBLICATION POLICY
The infrastructure created for ARR supports COI handling, modulo a number of bugs that have been or will be fixed. We recommend keeping the ARR infrastructure for COI handling.
Other things being equal, the reviewers assigned to a paper should be diverse in at least two respects: there should be at least one senior reviewer (not just junior reviewers), and no two reviewers should be from the same group. This has not been enforced in the past and is also not enforced in ARR.
These two constraints should be mandatory for reviewer assignments: at least one senior reviewer and no two reviewers from the same group. We recommend that this policy be adopted and enforced.
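As an illustration, the two constraints can be expressed as a simple check on a candidate assignment. The data model below (a reviewer with a seniority flag and a group identifier) is an assumption made for this sketch, not the ARR schema:

# Illustrative check for the two proposed assignment constraints:
# (1) at least one senior reviewer, (2) no two reviewers from the same group.
from dataclasses import dataclass

@dataclass
class Reviewer:
    name: str
    is_senior: bool
    group: str  # e.g., a lab or research-group identifier

def assignment_is_valid(reviewers: list[Reviewer]) -> bool:
    has_senior = any(r.is_senior for r in reviewers)
    groups = [r.group for r in reviewers]
    no_group_clash = len(groups) == len(set(groups))
    return has_senior and no_group_clash

# Example: two reviewers from the same group -> invalid assignment.
candidate = [Reviewer("A", True, "lab1"),
             Reviewer("B", False, "lab1"),
             Reviewer("C", False, "lab2")]
assert not assignment_is_valid(candidate)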
The original ARR proposal envisioned "track-less"
reviewing, i.e., similar to reviewing at many
journals, there is no track structure.
The main reason for abandoning tracks was that doing so considerably simplifies setting up a system like ARR and eliminates some known disadvantages of tracks, such as inflexibility and differences in quality and size among tracks.
However, the Program Chairs of ACL 2022 and NAACL 2022, as well as members of the community, have voiced some concerns about abolishing tracks, specifically:
● Manual assignment of reviewers is more difficult without tracks, especially when dealing with emergency assignments.
● Balancing the program of a conference (traditionally done through tracks) is more difficult without tracks.
● We know that there is a lot of randomness to reviewing even in the best reviewing setup. This randomness can be controlled by tracks. Example: If a small subfield like Morphology does not have its own track, then the randomness of reviewing decisions may result in almost no papers accepted in one review cycle and many more than average in another. This is undesirable for those who depend on accepted papers for advancement in their career as well as for a balanced conference program. Tracks smooth out these random variations.
● Submissions in some small subfields of NLP like morphology may get worse reviews if they are reviewed by the general reviewing pool.
● When an SAC puts together the ranked list of papers for a track, they compare metareviews. Metareviews are more easily comparable if they are from a small set of action editors. At ACL 2022, one track had metareviews from 64 action editors. This makes comparison of papers based on metareviews harder.
(i) If an RR-type system continues: introduce tracks. All
conferences will be locked into those tracks.
(ii) Use senior action editors as stand-ins for tracks
(this is what ARR has begun doing).
(iii) No tracks, but improve assignment in another
way.
If reviewing is managed in an integrated system, then there
are many opportunities to measure and improve the
performance of the system that do not exist in a
fragmented approach.
Currently, ARR has implemented and is working to improve:
1) Reviewer profiles -- critical so that authors and reviewers do not have to create new profiles again and again, and critical for COI handling and reviewer assignment
a) Reviewer load tracking
b) Review ‘balance’ tracking
2) Sustainable reviewer assignment with integrity
a) COI detection
b) Expertise-driven reviewer assignment
3) Ethics reviewing (with the ACL ethics committee)
4) Reviews as data
a) Collecting reviews as data for the NLP community
b) Collecting assessments of review quality from authors and metareviewers
5) Consistent review and submission statistics
a) For the first time, we know the number of resubmissions over a calendar year, and we know the outcomes for many of those resubmissions
We recommend keeping the
common infrastructure created for ARR in order
to measure and improve the performance of reviewing for ACL venues.
A critical element of a high-performing reviewing system is
mentoring and training of reviewers -- junior
reviewers as well as reviewers that do not
perform at the highest level. Mentoring and
training should also be extended to action
editors.
The scope of mentoring/training should obviously include the (meta)reviews themselves (their utility as a basis for acceptance decisions, their helpfulness to authors, and observance of basic rules such as politeness and giving citations for missing related work). Timeliness would be another important subject.
For action editors, reviewer assignment should be covered.
Reviewing/publication are of great interest as research
topics to our community. To conduct research, we need
to collect the data. An integrated system
makes data collection and dissemination
easier. In the long run, this research will lead to better tools, especially to
support inexperienced reviewers and improve reviewing workflows.
Keep the integrated system, encourage collection and
dissemination of datasets for research on
reviewing/publication.
A basic convention at NLP workshops and conferences is that if you submit a paper, its review will be completed in time for the venue to make an acceptance decision.
ARR’s default currently is not to give such a guarantee. The reason is that it is easy to reach a high completion rate of 90% (or
perhaps 95% once the system is debugged), but to get
to 100%
takes considerable resources. Note that journals generally
do not guarantee completion of reviewing by a certain
date -- precisely because they do not want to
expend the resources on an issue that affects
a small percentage of submissions.
Allow ARR-reviewed submissions with fewer than the necessary number of reviews (fewer than 3 reviews or missing a meta-review) to nonetheless be committed to a venue -- the decision-making process of the venue would then proceed in the same manner as it did before (in the fragmented reviewing system) for those few papers.
ARR has adopted an opt-out policy: whoever submits a paper
for a cycle must also review in that cycle unless
they have a good reason why they cannot.
Some community members (as well as some PC chairs) would prefer opt-in. They feel that many people have complicated lives and should not be forced to review, even when it is not easy for them to justify a request to opt out (e.g., for reasons of privacy).
Peer review only works if most of those who submit papers
also contribute reviews. This argues for opt-out. Of
course, there needs to be an effective mechanism for
opting out if someone has a good reason for
not being able to help with the reviewing.
We therefore recommend opt-out.
There are many aspects of reviewing in the ACL community
that should be handled consistently across
conferences: ethics, reproducibility, COIs,
collecting data for research, review(er) stickiness, mentoring, load balancing,
reviewer recognition, ...
RECOMMENDATION
An integrated reviewing system supports consistency (as
well as lasting innovative changes to how these
matters are handled) and should therefore be
made permanent.