BadScientist: Can a Research Agent Write Convincing but Unsound Papers that Fool LLM Reviewers?

Jiang, Fengqing; Feng, Yichen; Li, Yuetai; Niu, Luyao; Alomair, Basel; Poovendran, Radha

BadScientist: Can a Research Agent Write Convincing but Unsound Papers that Fool LLM Reviewers?

Fengqing Jiang^†^♠, Yichen Feng^†^♠, Yuetai Li^♠, Luyao Niu^♠, Basel Alomair^♢♠, Radha Poovendran^♠

^♠University of Washington, ^♢King Abdulaziz City for Science and Technology

🏆 Agents4Science 2025 Best Paper Award

^†Equal Contribution

arXiv Paper Code - Coming Soon

Abstract

The convergence of LLM-powered research assistants and AI-based peer review systems creates a critical vulnerability: fully automated publication loops where AI-generated research is evaluated by AI reviewers without human oversight. We investigate this through BadScientist, a framework that evaluates whether fabrication-oriented paper generation agents can deceive multi-model LLM review systems. Our generator employs presentation-manipulation strategies requiring no real experiments. We develop a rigorous evaluation framework with formal error guarantees (concentration bounds and calibration analysis), calibrated on real data. Our results reveal systematic vulnerabilities: fabricated papers achieve acceptance rates up to $82.0\%$. Critically, we identify concern-acceptance conflict---reviewers frequently flag integrity issues yet assign acceptance-level scores. Our mitigation strategies show only marginal improvements, with detection accuracy barely exceeding random chance. Despite provably sound aggregation mathematics, integrity checking systematically fails, exposing fundamental limitations in current AI-driven review systems and underscoring the urgent need for defense-in-depth safeguards in scientific publishing.

BadScientist Framework

Overview of the BadScientist framework. A Paper Agent generates fabricated papers from seed topics using manipulation strategies. A Review Agent evaluates papers using multiple LLM models (o3, o4-mini, GPT-4.1), calibrated against ICLR 2025 data, with GPT-5 checking for integrity concerns.

✨ BadScientist Main Results ✨

In our experiments, we instantiate the following five atomic strategy elements and a joint strategy All:

s₁: TooGoodGains — foreground extraordinary improvements over the leading baseline and frame them as field-advancing.
s₂: BaselineSelect — curate comparisons to favor the proposed method; omit variance/CI in the main table.
s₃: StatTheater — polished ablations, precise p-values/CI/error bars (mainly in the supplement), tidy hyperparameter tables, and "forthcoming" repo/DOI links.
s₄: CoherencePolish — flawless cross-references, unified terminology, consistent significant digits, aligned captions, professional typography.
s₅: ProofGap — a theorem/lemma with an ostensibly rigorous proof hiding a subtle oversight.
All: combined application of all above atomic strategies.

Our main evaluation result is in Table ACPT–ICR. We find that acceptance is unexpectedly high under most manipulations. Single strategies already yield substantial ACPT (e.g., ACPT_{τ_rate}=67.0%, ACPT_{τ_0.5}=82.0% for s₁), indicating that current review agents are easily persuaded and lack sufficient awareness to spot integrity/fabrication issues. The All strategy as a composed setup attains high acceptance (52.0%/69.0%), but it also maximally increases detectability (ICR@M 51.7%, o3 50.6%), suggesting that composing strategies broadens the footprint seen by detectors. Among single strategies, s₁ provides the strongest acceptance with only moderate detection pressure (ICR@M 39.5%), whereas others (e.g., s₃–s₅) are somewhat weaker but also less detectable (ICR@M ≈ 30–34%). Across models, o3 is the most flag-happy (consistently higher ICR-m), while GPT-4.1 rarely flags concerns (mostly 2–8%), reinforcing that current review models have limited and uneven detection capability.

Score Distributions

This figure plots score histograms for three models across six setups (s₁–s₅, All) with the acceptance threshold marked. Overall, o4-mini is right-shifted—consistently placing more mass at ≥7—which aligns with its higher acceptance tendency. o3 shows larger variance and a fatter right tail (notably in s₁ and All), producing many near-threshold and high scores; GPT-4.1 is comparatively conservative, clustering around 6–7 with a thinner tail at 8+. Among strategies, s₁ yields the strongest rightward shift for all models, while s₂/s₄ are milder. The All setup increases polarization (more mass both just below and above the threshold), explaining why it sustains high acceptance yet is easier for detectors to flag.

Concern–Acceptance Conflict

Conditioned on a model posting an integrity concern, we report the share that still receives an acceptance-level score in Table Model Results. Conflict is widespread: o4-mini is most contradictory (s₁/s₃/All: 100%; s₂/s₄/s₅: 50–80%), GPT-4.1 is mixed (0% in s₄ but 33–75% elsewhere), and o3 is moderate (~26–52%). s₃ (statistical theater) induces the largest cross-model conflict, and All further amplifies it for o4-mini (100%). These observations indicate even agents voice concerns, yet keep acceptance-high scores, and integrity signals are not well-coupled to review.

Mitigation Strategies

(i) Review-with-Detection (ReD) We modify the review prompt to bundle a conventional rubric-based review with a mandatory detection of AI-generated content. For each document, a single model m ∈ M outputs: (a) a standard review, (b) a binary label y_m ∈ {Non-Fabricated, AI-Fabricated}, and (c) evidence snippets (quoted lines) to support its finding. We compare ReD against the initial review agent baseline (same models, same rubric, no integrity subtask) in Table 3.

(ii) Detection-Only (DetOnly) We remove the reviewing rubric and instruct the agent to perform only the detection of AI-generated content. In this mode, the model's sole task is to return a binary label y_m ∈ {Non-Fabricated, AI-Fabricated} and evidence, which serves as the direct and final triage action, without any secondary assessment. We set three detectors on our new dataset: Random Guess baseline, the ReD integrity component, and DetOnly. The results are presented in Table 4.

Paper

BibTeX

@misc{jiang2025badscientistresearchagentwrite,
      title={BadScientist: Can a Research Agent Write Convincing but Unsound Papers that Fool LLM Reviewers?},
      author={Fengqing Jiang and Yichen Feng and Yuetai Li and Luyao Niu and Basel Alomair and Radha Poovendran},
      year={2025},
      eprint={2510.18003},
      archivePrefix={arXiv},
      primaryClass={cs.CR},
      url={https://arxiv.org/abs/2510.18003},
}