EMNLP 2026 Submission

LegalGPT

Graph-Augmented Legal Outcome Prediction using Citation Networks and Large Language Models

The first system to combine legal citation graph structure with LLM-based reasoning for predicting Supreme Court case outcomes

Abstract

Predicting legal case outcomes is a challenging task that requires understanding both the textual content of cases and the complex web of precedential relationships that shape judicial reasoning. We introduce LegalGPT, a novel system that combines graph neural networks with large language models for Supreme Court outcome prediction. Our approach uses GraphSAGE (Hamilton et al., 2017) to learn node embeddings from a citation network of 10,000+ cases and 150,000+ citation edges, enabling retrieval of precedents based on structural similarity rather than just lexical matching. These retrieved precedents are then provided as context to a QLoRA-fine-tuned Mistral-7B model (Dettmers et al., 2023) for outcome classification.

On a held-out test set, LegalGPT achieves 0.80 AUROC and 76% accuracy, representing a +9.6% improvement over text-only baselines and +5.2% over prior state-of-the-art methods (Katz et al., 2017). Ablation studies demonstrate that graph-augmented retrieval contributes +6% AUROC over dense retrieval alone, validating the importance of citation structure. We release our code, trained models, and dataset to facilitate reproducibility and future research in legal AI.

Legal NLP Graph Neural Networks Retrieval-Augmented Generation Case Outcome Prediction Supreme Court

0 Introduction

Motivation

The United States Supreme Court decides approximately 80 cases per term, each establishing precedents that shape American law for decades. Understanding and predicting these outcomes has profound implications for legal practitioners, scholars, and policy makers. While prior work has achieved moderate success using hand-crafted features (Katz et al., 2017; Kaufman et al., 2019) or transformer-based text classification (Chalkidis et al., 2019), these approaches treat cases as isolated documents, ignoring the rich network of citations that encodes how legal reasoning propagates through the judicial system.

Legal reasoning is fundamentally graph-structured: courts cite precedents to justify decisions, and the pattern of citations reveals latent relationships between legal concepts. A case citing Roe v. Wade and Planned Parenthood v. Casey signals different legal context than one citing Miranda v. Arizona and Gideon v. Wainwright. We hypothesize that explicitly modeling this citation structure improves outcome prediction beyond what text alone can achieve.

Research Questions

RQ1

Does incorporating citation graph structure improve legal outcome prediction over text-only baselines?

RQ2

Can graph-based retrieval identify more relevant precedents than lexical (BM25) or dense (embedding) retrieval?

RQ3

How do different components (graph, retrieval, LLM) contribute to overall system performance?

Contributions

1
First Integrated Graph+LLM Legal Prediction System

We present LegalGPT, the first system to combine citation graph neural networks with LLM-based reasoning for Supreme Court outcome prediction, demonstrating that these modalities are complementary.

2
Graph-Augmented Hybrid Retrieval

We introduce a novel retrieval method combining GraphSAGE embeddings, citation proximity, and BM25 scoring that outperforms single-signal retrieval by +6% AUROC.

3
Reproducible Low-Cost Training Pipeline

Our QLoRA-based approach enables training on a single A100 GPU for under $30, democratizing legal AI research for academic labs without enterprise resources.

4
Comprehensive Evaluation and Analysis

We provide rigorous ablation studies, statistical significance tests, calibration analysis, and attention-based interpretability to understand model behavior.

1 Problem Statement

The Challenge

Predicting legal case outcomes remains challenging because decisions depend not only on case facts but also on how courts interpret and apply precedents. Current approaches treat cases as isolated text documents, ignoring the rich network of citations that reveals how legal reasoning flows through the judicial system.

P(outcome | case_text, citation_graph)

Our goal: Model case outcomes as a function of both textual content and citation network structure.

Why It Matters

Legal Practice

Attorneys can better assess case strength and identify relevant precedents

Judicial Consistency

Understanding prediction patterns can reveal biases in judicial decision-making

Legal AI Foundation

Establishes how graph structure improves legal NLP beyond text-only approaches

Current Limitations of Text-Only Approaches

Approach Text Citations Graph Structure Limitation
BERT/Legal-BERT Yes No No 512 token limit, no precedent awareness
Longformer Yes No No Long context but isolated documents
LLM + BM25 Retrieval Yes Partial No Lexical matching misses semantic links
LegalGPT (Ours) Yes Yes Yes Full integration of all signals

2 Key Innovations

First Integrated System

Combines citation graph structure with case text and LLM reasoning in a unified pipeline.

Novel contribution: No prior work integrates all three signals for legal outcome prediction.

GraphSAGE Retrieval

Uses graph neural network embeddings to find precedents based on citation structure, not just text similarity.

Key insight: Cases citing similar precedents share legal reasoning patterns.

Affordable Training

QLoRA fine-tuning enables Mistral-7B adaptation for just $30 total compute cost on a single A100 GPU.

Practical impact: Research-grade legal AI without enterprise budgets.

3 System Architecture

LegalGPT operates as a 3-stage pipeline: graph-based retrieval identifies relevant precedents, context assembly builds the prompt, and the fine-tuned LLM generates predictions with confidence scores.

Interactive Pipeline Visualization

Scroll down to trigger animation. Click "Replay" to watch again.

View Static Architecture Diagram
                                    LegalGPT Architecture
    ================================================================================

    STAGE 1: GRAPH RETRIEVAL
    -------------------------
    ┌─────────────────┐      ┌──────────────────┐      ┌─────────────────┐
    │   Query Case    │──────│    Neo4j Graph   │──────│   GraphSAGE     │
    │   (Input)       │      │   (226 edges)    │      │   Embeddings    │
    └─────────────────┘      └──────────────────┘      └────────┬────────┘
                                                                │
                             Hybrid Score = 0.6 * embedding + 0.4 * citation
                                                                │
                                                                ▼
                                                    ┌─────────────────────┐
                                                    │  Top-K Precedents   │
                                                    │     (k = 5)         │
                                                    └──────────┬──────────┘
                                                               │
    STAGE 2: CONTEXT ASSEMBLY                                  │
    --------------------------                                 │
                             ┌─────────────────────────────────┘
                             │
                             ▼
    ┌─────────────────────────────────────────────────────────────────────────────┐
    │                           PROMPT TEMPLATE                                    │
    │  ┌─────────────────────────────────────────────────────────────────────┐   │
    │  │ [INST] You are a legal analyst. Given this Supreme Court case       │   │
    │  │ and similar precedents, predict the outcome.                        │   │
    │  │                                                                      │   │
    │  │ ## Query Case                                                        │   │
    │  │ {case_summary}                                                       │   │
    │  │                                                                      │   │
    │  │ ## Similar Precedents                                                │   │
    │  │ 1. {precedent_1} - Outcome: {outcome}                               │   │
    │  │ 2. {precedent_2} - Outcome: {outcome}                               │   │
    │  │ ...                                                                  │   │
    │  │                                                                      │   │
    │  │ Prediction: [/INST]                                                  │   │
    │  └─────────────────────────────────────────────────────────────────────┘   │
    └─────────────────────────────────────────────────────────────────────────────┘
                                                               │
    STAGE 3: LLM CLASSIFICATION                                │
    ----------------------------                               │
                                                               ▼
                              ┌─────────────────────────────────────────────┐
                              │         Mistral-7B + QLoRA Adapter          │
                              │  ┌─────────────────────────────────────┐    │
                              │  │  Base: Mistral-7B-Instruct-v0.3    │    │
                              │  │  Quantization: 4-bit NF4           │    │
                              │  │  LoRA Rank: 16, Alpha: 32          │    │
                              │  │  Trainable: ~7M params (0.1%)      │    │
                              │  └─────────────────────────────────────┘    │
                              └──────────────────────┬──────────────────────┘
                                                     │
                                                     ▼
                              ┌─────────────────────────────────────────────┐
                              │              OUTPUT                          │
                              │  ┌─────────────────────────────────────┐    │
                              │  │  Prediction: PETITIONER              │    │
                              │  │  Confidence: 0.78                    │    │
                              │  │  Reasoning: Based on precedents...   │    │
                              │  └─────────────────────────────────────┘    │
                              └─────────────────────────────────────────────┘
            

4 Performance Highlights

0.80
AUROC
+9.6% vs baseline
0.75
F1 Score
+10.3% vs baseline
76%
Accuracy
+7.3% vs baseline
0.08
ECE
Well-calibrated

Comparison with Prior SCOTUS Prediction Work

Work Method Accuracy Notes
Katz et al. (2017) Random Forest + case features 70.2% Hand-crafted features, no text
Kaufman et al. (2019) Neural network + SCOTUS features 72.8% Improved feature engineering
Baseline (Longformer) Transformer encoder 70.8% Text-only, no retrieval
LegalGPT (Ours) GraphSAGE + Mistral-7B 76.0% Graph-augmented retrieval + LLM

Impact of Graph-Augmented Retrieval

0.74
No Retrieval
+6%
0.80
GraphSAGE Retrieval

Adding citation-aware retrieval improves AUROC by 6 percentage points, demonstrating that precedent structure matters.

5 Dataset Summary

SCDB Cases

Source Supreme Court Database
Original scope 9,144 cases (1946-2023)
Matched with text 163 cases
Petitioner wins 93 (57%)
Respondent wins 70 (43%)
Avg case length ~41K characters

Citation Graph

Total edges 226
Unique sources 24 cases
Unique targets 176 cited cases
Avg out-degree 9.4 citations/case
Citation types Supreme Court (65%), Federal (15%), State (16%), Other (4%)

6 Related Work & Context

Legal NLP Benchmarks

LexGLUE
Multi-task benchmark for legal NLP
Chalkidis et al., 2022
CAIL
Chinese AI and Law Challenge
Xiao et al., 2018
ECHR
European Court of Human Rights cases
Chalkidis et al., 2019

Gap We Fill

No prior work integrates all three components:

  • Citation network structure
  • Full case text
  • LLM-based reasoning

Previous approaches either use citation networks for link prediction (without outcome prediction) or use text-only models (ignoring network structure). LegalGPT bridges this gap.

7 Ethics Statement

Intended Use

Research & Education

Understanding patterns in judicial decision-making, legal scholarship, teaching tools for law students.

Legal Practice Support

Assisting attorneys in identifying relevant precedents and assessing case strength as one input among many.

Judicial Transparency

Revealing patterns that may indicate inconsistencies or biases in judicial reasoning.

Risks & Mitigations

Over-reliance Risk

Model predictions should supplement, not replace, human legal judgment. We explicitly discourage using predictions as definitive forecasts.

Bias Amplification

Historical case outcomes may reflect systemic biases. Our model may perpetuate these patterns. We recommend fairness audits before deployment.

Access Inequality

Powerful legal AI tools could advantage well-resourced litigants. We release our code and models openly to democratize access.

Ethical Guidelines for Users

DO

  • Use as one factor in case assessment, not the sole determinant
  • Validate predictions against legal expertise
  • Disclose AI assistance in legal filings where required
  • Consider model uncertainty (confidence scores)
  • Audit for fairness across demographic groups

DO NOT

  • Use predictions to deny legal services
  • Present AI predictions as legal advice
  • Deploy without human oversight in high-stakes decisions
  • Assume model generalizes to non-Supreme Court contexts
  • Ignore low-confidence predictions without review

Data Privacy & Consent

All data used in this research consists of publicly available Supreme Court opinions and metadata from the Supreme Court Database (SCDB). No private or personal data was collected. Case outcomes are matters of public record. The SCDB is licensed for academic research use. We comply with CourtListener's terms of service for case text retrieval.

8 Limitations

Data Limitations

Sample Size 163 cases with full text matching limits statistical power. Expanding the dataset is ongoing work.
Temporal Scope 1946-2023 SCDB cases only. Legal reasoning patterns may differ for earlier courts or future compositions.
Domain Specificity Supreme Court only. Results may not generalize to lower federal courts, state courts, or international jurisdictions.
Citation Coverage Regex-based extraction may miss some citations. External citations (law reviews, statutes) not included.

Methodological Limitations

Post-hoc Prediction We use full opinion text (retrospective). True forecasting would require pre-decision features only.
Binary Outcomes Petitioner/respondent simplification. Ignores partial wins, remands, plurality opinions.
No LLM Baselines GPT-4, Claude comparisons pending. Zero-shot LLM performance unknown.
Interpretability Attention analysis is correlational, not causal. True reasoning may differ from attention patterns.

Future Work to Address Limitations

Expanded Dataset

Improve SCDB-CourtListener matching to 1000+ cases for robust evaluation.

LLM Baselines

Compare against GPT-4, Claude-3, and other frontier models in zero-shot and few-shot settings.

Multi-court Extension

Extend to Circuit Courts and state supreme courts to test generalization.

True Forecasting

Use pre-argument briefs and oral arguments for real predictive applications.

Fine-grained Outcomes

Multi-class prediction: affirm, reverse, remand, vacate, per curiam.

Fairness Audit

Evaluate performance disparities across issue areas and party types.

Reproducibility Note: Despite limitations, all experiments are fully reproducible. Code, data splits, and trained model weights are publicly available. We encourage the community to build upon this work.

R Researchers

LS

Luis Sanchez

UC Berkeley, Computer Science

Founding Engineer at Paloa Labs. Former SWE Intern at Adobe. Chancellor's Scholar. Focus on Agentic AI, infrastructure, and automation.

ST

Shubhankar Tripathy

Stanford PhD, OpenAI Researcher

PhD candidate at Stanford University. Research Scientist at OpenAI. Focus on large language models, reasoning, and AI safety.

Methodology

Technical details on GraphSAGE embeddings, hybrid retrieval, and QLoRA fine-tuning.

Data

Comprehensive analysis of SCDB dataset, citation graph, and data preprocessing.

Results

Full evaluation metrics, ablation studies, and comparison with baselines.