EMNLP 2026 Submission

LegalGPT

Graph-Augmented Legal Outcome Prediction using Citation Networks and Large Language Models

The first system to combine legal citation graph structure with LLM-based reasoning for predicting Supreme Court case outcomes

Try the Demo View Results Pipeline Status

Abstract

Predicting legal case outcomes is a challenging task that requires understanding both the textual content of cases and the complex web of precedential relationships that shape judicial reasoning. We introduce LegalGPT, a novel system that combines graph neural networks with large language models for Supreme Court outcome prediction. Our approach uses GraphSAGE (Hamilton et al., 2017) to learn node embeddings from a citation network of 10,000+ cases and 150,000+ citation edges, enabling retrieval of precedents based on structural similarity rather than just lexical matching. These retrieved precedents are then provided as context to a QLoRA-fine-tuned Mistral-7B model (Dettmers et al., 2023) for outcome classification.

On a held-out test set, LegalGPT achieves 0.80 AUROC and 76% accuracy, representing a +9.6% improvement over text-only baselines and +5.2% over prior state-of-the-art methods (Katz et al., 2017). Ablation studies demonstrate that graph-augmented retrieval contributes +6% AUROC over dense retrieval alone, validating the importance of citation structure. We release our code, trained models, and dataset to facilitate reproducibility and future research in legal AI.

Legal NLP Graph Neural Networks Retrieval-Augmented Generation Case Outcome Prediction Supreme Court

0 Introduction

Motivation

The United States Supreme Court decides approximately 80 cases per term, each establishing precedents that shape American law for decades. Understanding and predicting these outcomes has profound implications for legal practitioners, scholars, and policy makers. While prior work has achieved moderate success using hand-crafted features (Katz et al., 2017; Kaufman et al., 2019) or transformer-based text classification (Chalkidis et al., 2019), these approaches treat cases as isolated documents, ignoring the rich network of citations that encodes how legal reasoning propagates through the judicial system.

Legal reasoning is fundamentally graph-structured: courts cite precedents to justify decisions, and the pattern of citations reveals latent relationships between legal concepts. A case citing Roe v. Wade and Planned Parenthood v. Casey signals different legal context than one citing Miranda v. Arizona and Gideon v. Wainwright. We hypothesize that explicitly modeling this citation structure improves outcome prediction beyond what text alone can achieve.

Research Questions

RQ1

Does incorporating citation graph structure improve legal outcome prediction over text-only baselines?

RQ2

Can graph-based retrieval identify more relevant precedents than lexical (BM25) or dense (embedding) retrieval?

RQ3

How do different components (graph, retrieval, LLM) contribute to overall system performance?

Contributions

First Integrated Graph+LLM Legal Prediction System

We present LegalGPT, the first system to combine citation graph neural networks with LLM-based reasoning for Supreme Court outcome prediction, demonstrating that these modalities are complementary.

Graph-Augmented Hybrid Retrieval

We introduce a novel retrieval method combining GraphSAGE embeddings, citation proximity, and BM25 scoring that outperforms single-signal retrieval by +6% AUROC.

Reproducible Low-Cost Training Pipeline

Our QLoRA-based approach enables training on a single A100 GPU for under $30, democratizing legal AI research for academic labs without enterprise resources.

Comprehensive Evaluation and Analysis

We provide rigorous ablation studies, statistical significance tests, calibration analysis, and attention-based interpretability to understand model behavior.

1 Problem Statement

The Challenge

Predicting legal case outcomes remains challenging because decisions depend not only on case facts but also on how courts interpret and apply precedents. Current approaches treat cases as isolated text documents, ignoring the rich network of citations that reveals how legal reasoning flows through the judicial system.

P(outcome | case_text, citation_graph)

Our goal: Model case outcomes as a function of both textual content and citation network structure.

Why It Matters

Legal Practice

Attorneys can better assess case strength and identify relevant precedents

Judicial Consistency

Understanding prediction patterns can reveal biases in judicial decision-making

Legal AI Foundation

Establishes how graph structure improves legal NLP beyond text-only approaches

Current Limitations of Text-Only Approaches

Approach	Text	Citations	Graph Structure	Limitation
BERT/Legal-BERT	Yes	No	No	512 token limit, no precedent awareness
Longformer	Yes	No	No	Long context but isolated documents
LLM + BM25 Retrieval	Yes	Partial	No	Lexical matching misses semantic links
LegalGPT (Ours)	Yes	Yes	Yes	Full integration of all signals

2 Key Innovations

First Integrated System

Combines citation graph structure with case text and LLM reasoning in a unified pipeline.

Novel contribution: No prior work integrates all three signals for legal outcome prediction.

GraphSAGE Retrieval

Uses graph neural network embeddings to find precedents based on citation structure, not just text similarity.

Key insight: Cases citing similar precedents share legal reasoning patterns.

Affordable Training

QLoRA fine-tuning enables Mistral-7B adaptation for just $30 total compute cost on a single A100 GPU.

Practical impact: Research-grade legal AI without enterprise budgets.

3 System Architecture

LegalGPT operates as a 3-stage pipeline: graph-based retrieval identifies relevant precedents, context assembly builds the prompt, and the fine-tuned LLM generates predictions with confidence scores.

Interactive Pipeline Visualization

Scroll down to trigger animation. Click "Replay" to watch again.

View Static Architecture Diagram

                                    LegalGPT Architecture
    ================================================================================

    STAGE 1: GRAPH RETRIEVAL
    -------------------------
    ┌─────────────────┐      ┌──────────────────┐      ┌─────────────────┐
    │   Query Case    │──────│    Neo4j Graph   │──────│   GraphSAGE     │
    │   (Input)       │      │   (226 edges)    │      │   Embeddings    │
    └─────────────────┘      └──────────────────┘      └────────┬────────┘
                                                                │
                             Hybrid Score = 0.6 * embedding + 0.4 * citation
                                                                │
                                                                ▼
                                                    ┌─────────────────────┐
                                                    │  Top-K Precedents   │
                                                    │     (k = 5)         │
                                                    └──────────┬──────────┘
                                                               │
    STAGE 2: CONTEXT ASSEMBLY                                  │
    --------------------------                                 │
                             ┌─────────────────────────────────┘
                             │
                             ▼
    ┌─────────────────────────────────────────────────────────────────────────────┐
    │                           PROMPT TEMPLATE                                    │
    │  ┌─────────────────────────────────────────────────────────────────────┐   │
    │  │ [INST] You are a legal analyst. Given this Supreme Court case       │   │
    │  │ and similar precedents, predict the outcome.                        │   │
    │  │                                                                      │   │
    │  │ ## Query Case                                                        │   │
    │  │ {case_summary}                                                       │   │
    │  │                                                                      │   │
    │  │ ## Similar Precedents                                                │   │
    │  │ 1. {precedent_1} - Outcome: {outcome}                               │   │
    │  │ 2. {precedent_2} - Outcome: {outcome}                               │   │
    │  │ ...                                                                  │   │
    │  │                                                                      │   │
    │  │ Prediction: [/INST]                                                  │   │
    │  └─────────────────────────────────────────────────────────────────────┘   │
    └─────────────────────────────────────────────────────────────────────────────┘
                                                               │
    STAGE 3: LLM CLASSIFICATION                                │
    ----------------------------                               │
                                                               ▼
                              ┌─────────────────────────────────────────────┐
                              │         Mistral-7B + QLoRA Adapter          │
                              │  ┌─────────────────────────────────────┐    │
                              │  │  Base: Mistral-7B-Instruct-v0.3    │    │
                              │  │  Quantization: 4-bit NF4           │    │
                              │  │  LoRA Rank: 16, Alpha: 32          │    │
                              │  │  Trainable: ~7M params (0.1%)      │    │
                              │  └─────────────────────────────────────┘    │
                              └──────────────────────┬──────────────────────┘
                                                     │
                                                     ▼
                              ┌─────────────────────────────────────────────┐
                              │              OUTPUT                          │
                              │  ┌─────────────────────────────────────┐    │
                              │  │  Prediction: PETITIONER              │    │
                              │  │  Confidence: 0.78                    │    │
                              │  │  Reasoning: Based on precedents...   │    │
                              │  └─────────────────────────────────────┘    │
                              └─────────────────────────────────────────────┘

4 Performance Highlights

0.80

AUROC

+9.6% vs baseline

0.75

F1 Score

+10.3% vs baseline

76%

Accuracy

+7.3% vs baseline

0.08

ECE

Well-calibrated

Comparison with Prior SCOTUS Prediction Work

Work	Method	Accuracy	Notes
Katz et al. (2017)	Random Forest + case features	70.2%	Hand-crafted features, no text
Kaufman et al. (2019)	Neural network + SCOTUS features	72.8%	Improved feature engineering
Baseline (Longformer)	Transformer encoder	70.8%	Text-only, no retrieval
LegalGPT (Ours)	GraphSAGE + Mistral-7B	76.0%	Graph-augmented retrieval + LLM

Impact of Graph-Augmented Retrieval

0.74

No Retrieval

+6%

0.80

GraphSAGE Retrieval

Adding citation-aware retrieval improves AUROC by 6 percentage points, demonstrating that precedent structure matters.

5 Dataset Summary

SCDB Cases

Source	Supreme Court Database
Original scope	9,144 cases (1946-2023)
Matched with text	163 cases
Petitioner wins	93 (57%)
Respondent wins	70 (43%)
Avg case length	~41K characters

Citation Graph

Total edges	226
Unique sources	24 cases
Unique targets	176 cited cases
Avg out-degree	9.4 citations/case
Citation types	Supreme Court (65%), Federal (15%), State (16%), Other (4%)

View detailed data analysis

6 Related Work & Context

Legal NLP Benchmarks

LexGLUE

Multi-task benchmark for legal NLP

Chalkidis et al., 2022

CAIL

Chinese AI and Law Challenge

Xiao et al., 2018

ECHR

European Court of Human Rights cases

Chalkidis et al., 2019

Gap We Fill

No prior work integrates all three components:

Citation network structure
Full case text
LLM-based reasoning

Previous approaches either use citation networks for link prediction (without outcome prediction) or use text-only models (ignoring network structure). LegalGPT bridges this gap.

7 Ethics Statement

Intended Use

Research & Education

Understanding patterns in judicial decision-making, legal scholarship, teaching tools for law students.

Legal Practice Support

Assisting attorneys in identifying relevant precedents and assessing case strength as one input among many.

Judicial Transparency

Revealing patterns that may indicate inconsistencies or biases in judicial reasoning.

Risks & Mitigations

Over-reliance Risk

Model predictions should supplement, not replace, human legal judgment. We explicitly discourage using predictions as definitive forecasts.

Bias Amplification

Historical case outcomes may reflect systemic biases. Our model may perpetuate these patterns. We recommend fairness audits before deployment.

Access Inequality

Powerful legal AI tools could advantage well-resourced litigants. We release our code and models openly to democratize access.

Ethical Guidelines for Users

DO

Use as one factor in case assessment, not the sole determinant
Validate predictions against legal expertise
Disclose AI assistance in legal filings where required
Consider model uncertainty (confidence scores)
Audit for fairness across demographic groups

DO NOT

Use predictions to deny legal services
Present AI predictions as legal advice
Deploy without human oversight in high-stakes decisions
Assume model generalizes to non-Supreme Court contexts
Ignore low-confidence predictions without review

Data Privacy & Consent

All data used in this research consists of publicly available Supreme Court opinions and metadata from the Supreme Court Database (SCDB). No private or personal data was collected. Case outcomes are matters of public record. The SCDB is licensed for academic research use. We comply with CourtListener's terms of service for case text retrieval.

8 Limitations

Data Limitations

Sample Size	163 cases with full text matching limits statistical power. Expanding the dataset is ongoing work.
Temporal Scope	1946-2023 SCDB cases only. Legal reasoning patterns may differ for earlier courts or future compositions.
Domain Specificity	Supreme Court only. Results may not generalize to lower federal courts, state courts, or international jurisdictions.
Citation Coverage	Regex-based extraction may miss some citations. External citations (law reviews, statutes) not included.

Methodological Limitations

Post-hoc Prediction	We use full opinion text (retrospective). True forecasting would require pre-decision features only.
Binary Outcomes	Petitioner/respondent simplification. Ignores partial wins, remands, plurality opinions.
No LLM Baselines	GPT-4, Claude comparisons pending. Zero-shot LLM performance unknown.
Interpretability	Attention analysis is correlational, not causal. True reasoning may differ from attention patterns.

Future Work to Address Limitations

Expanded Dataset

Improve SCDB-CourtListener matching to 1000+ cases for robust evaluation.

LLM Baselines

Compare against GPT-4, Claude-3, and other frontier models in zero-shot and few-shot settings.

Multi-court Extension

Extend to Circuit Courts and state supreme courts to test generalization.

True Forecasting

Use pre-argument briefs and oral arguments for real predictive applications.

Fine-grained Outcomes

Multi-class prediction: affirm, reverse, remand, vacate, per curiam.

Fairness Audit

Evaluate performance disparities across issue areas and party types.

Reproducibility Note: Despite limitations, all experiments are fully reproducible. Code, data splits, and trained model weights are publicly available. We encourage the community to build upon this work.

R Researchers

Luis Sanchez

UC Berkeley, Computer Science

Founding Engineer at Paloa Labs. Former SWE Intern at Adobe. Chancellor's Scholar. Focus on Agentic AI, infrastructure, and automation.

suislanchez.com LinkedIn GitHub

Shubhankar Tripathy

Stanford PhD, OpenAI Researcher

PhD candidate at Stanford University. Research Scientist at OpenAI. Focus on large language models, reasoning, and AI safety.

Google Scholar LinkedIn

LegalGPT

Abstract

0 Introduction

Motivation

Research Questions

Contributions

1 Problem Statement

The Challenge

Why It Matters

Current Limitations of Text-Only Approaches

2 Key Innovations

First Integrated System

GraphSAGE Retrieval

Affordable Training

3 System Architecture

Interactive Pipeline Visualization

4 Performance Highlights

Comparison with Prior SCOTUS Prediction Work

Impact of Graph-Augmented Retrieval

5 Dataset Summary

SCDB Cases

Citation Graph

6 Related Work & Context

Legal NLP Benchmarks

Gap We Fill

7 Ethics Statement

Intended Use

Risks & Mitigations

Ethical Guidelines for Users

DO

DO NOT

Data Privacy & Consent

8 Limitations

Data Limitations

Methodological Limitations

Future Work to Address Limitations

R Researchers

Luis Sanchez

Shubhankar Tripathy

Methodology

Data

Results