ADR-006: LLM as First Reviewer, Not Final Judge

Status

Accepted (implementation deferred to v0.2.0)

Context and Problem Statement

DAGpedia aims to maintain the scientific quality of contributed DAGs at scale. Manual expert review of every submission is the gold standard but does not scale with community growth. Large language models (LLMs) have broad domain knowledge and can reason about causal structures, making them a candidate for automated quality checks.

The design question is: what role should LLMs play in the DAG validation pipeline?

Considered Options

  • LLM as final gatekeeper — DAGs are accepted or rejected automatically based on LLM judgment
  • LLM as first reviewer — LLM produces a structured report of potential issues; a human makes the final decision
  • No LLM involvement — all validation is human-driven

Decision Outcome

Chosen option: LLM as first reviewer, not final judge

LLMs are well-suited to flagging common structural and semantic issues in DAGs before human review. However, LLMs can hallucinate citations, misread context-specific causal reasoning, and lack access to unpublished or non-English literature. Granting them final authority would introduce systematic errors that undermine DAGpedia's scientific credibility.

The role of the LLM is to reduce the cognitive load on human reviewers by surfacing obvious problems — not to replace expert judgment.

Validation checks the LLM performs

Category                    | Example check
----------------------------|----------------------------------------------------------------
Temporal ordering           | Does the outcome precede the exposure?
Missing common causes       | Are obvious confounders absent?
Edge direction plausibility | Is the causal direction biologically/socially implausible?
Collider risk               | Is a collider included in the adjustment set?
Over-adjustment             | Does the adjustment set include a mediator on the causal path?
Unmeasured confounding      | Should unmeasured confounders (U nodes) be present?
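
The structural half of several of these checks can be expressed directly against the graph. A minimal sketch using networkx, assuming nodes may carry an optional integer `time` attribute; the function name, attribute names, and flag schema are illustrative, not DAGpedia's actual format:

```python
import networkx as nx

def structural_flags(dag: nx.DiGraph, exposure: str, outcome: str,
                     adjustment_set: set[str]) -> list[dict]:
    """Graph-level pre-checks; the flag dicts mirror the report categories."""
    flags = []

    # Temporal ordering: flag any edge that points backwards in time,
    # assuming nodes may carry an optional integer "time" attribute.
    for u, v in dag.edges:
        t_u, t_v = dag.nodes[u].get("time"), dag.nodes[v].get("time")
        if t_u is not None and t_v is not None and t_v < t_u:
            flags.append({"category": "temporal-ordering", "severity": "high",
                          "message": f"edge {u} -> {v} runs backwards in time"})

    # Over-adjustment: an adjustment-set member lying on a directed
    # exposure -> node -> outcome path is a mediator.
    for node in adjustment_set - {exposure, outcome}:
        if nx.has_path(dag, exposure, node) and nx.has_path(dag, node, outcome):
            flags.append({"category": "over-adjustment", "severity": "high",
                          "message": f"{node} mediates {exposure} -> {outcome}"})

    return flags
```

The semantic variants (plausibility of an edge direction, whether a missing confounder is "obvious") cannot be read off the graph, which is where the LLM pass earns its place.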

Output format

The LLM produces a structured YAML report with categorized issues, severity levels (high / medium / low), and a confidence score. Human reviewers can confirm or dismiss each flagged issue; dismissals are logged with a reason, creating an audit trail consistent with the living DAGs epistemic framework.

Dismissed issues are retained in the record — the act of explicitly overriding a flag is itself scientifically meaningful information.
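
A sketch of what such a report and its dismissal record could look like, parsed here with PyYAML; every field name (`dag_id`, `severity`, `dismissal_reason`, and so on) is hypothetical, chosen only to illustrate the severity, confidence, and audit-trail structure described above:

```python
import yaml  # PyYAML

# Hypothetical report: every field name here is illustrative only.
SAMPLE_REPORT = """
dag_id: dagp-000123
issues:
  - id: issue-1
    category: over-adjustment
    severity: high
    confidence: 0.82
    message: "'BMI' appears to mediate the smoking -> CVD path"
    status: dismissed
    dismissal_reason: "BMI is a confounder, not a mediator, in this design"
    dismissed_by: reviewer-42
"""

report = yaml.safe_load(SAMPLE_REPORT)
for issue in report["issues"]:
    # Dismissed issues stay in the record together with the stated reason.
    print(issue["category"], issue["severity"], issue["status"])
```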

Grounding strategy

To reduce hallucination risk in evidence grading, LLM calls that assess evidence strength are grounded with a PubMed API lookup before the model is asked to judge: the LLM reasons over the retrieved abstracts, not its training-data memory.
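
A minimal sketch of that retrieval step against NCBI's public E-utilities endpoints; how the pipeline actually constructs queries and splices abstracts into the prompt is not specified in this ADR, so the query below is illustrative:

```python
import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def fetch_abstracts(query: str, max_results: int = 5) -> str:
    """Fetch PubMed abstracts for a query via NCBI E-utilities."""
    # Step 1: esearch returns the PMIDs matching the query.
    search = requests.get(f"{EUTILS}/esearch.fcgi", params={
        "db": "pubmed", "term": query,
        "retmax": max_results, "retmode": "json",
    }, timeout=30)
    pmids = search.json()["esearchresult"]["idlist"]
    if not pmids:
        return ""

    # Step 2: efetch returns the abstracts as plain text.
    fetch = requests.get(f"{EUTILS}/efetch.fcgi", params={
        "db": "pubmed", "id": ",".join(pmids),
        "rettype": "abstract", "retmode": "text",
    }, timeout=30)
    return fetch.text

# The retrieved text is placed in the prompt, so the model grades evidence
# against these abstracts rather than recalling citations from memory.
abstracts = fetch_abstracts("smoking cardiovascular disease cohort")
```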

Consequences

  • Good: Scales review capacity without sacrificing human oversight
  • Good: Consistent first-pass checks regardless of reviewer expertise
  • Good: Dismissed flags create a transparent record of scientific reasoning
  • Bad: LLM API calls introduce latency and cost per submission
  • Bad: Requires careful prompt engineering to avoid over-flagging (reviewer fatigue) or under-flagging (false confidence)

References

  • Related: ADR-005 (Tier system — LLM validation feeds into Reviewed tier)
  • Reynolds RJ. Am J Epidemiol. 2026;195(5):1365–1367. https://doi.org/10.1093/aje/kwag029