LegalGPT
Graph-Augmented Legal Outcome Prediction using Citation Networks and Large Language Models
The first system to combine legal citation graph structure with LLM-based reasoning for predicting Supreme Court case outcomes
Abstract
Predicting legal case outcomes is a challenging task that requires understanding both the textual content of cases and the complex web of precedential relationships that shape judicial reasoning. We introduce LegalGPT, a novel system that combines graph neural networks with large language models for Supreme Court outcome prediction. Our approach uses GraphSAGE (Hamilton et al., 2017) to learn node embeddings from a citation network of 226 citation edges extracted from Supreme Court Database cases, enabling retrieval of precedents based on structural similarity rather than lexical matching alone. These retrieved precedents are then provided as context to a QLoRA-fine-tuned Mistral-7B model (Dettmers et al., 2023) for outcome classification.
On a held-out test set, LegalGPT achieves 0.80 AUROC and 76% accuracy, a +5.2-point improvement over a text-only Longformer baseline (70.8%) and +3.2 points over the strongest prior method (Kaufman et al., 2019, 72.8%). Ablation studies demonstrate that graph-augmented retrieval contributes +6 AUROC points over dense retrieval alone, validating the importance of citation structure. We release our code, trained models, and dataset to facilitate reproducibility and future research in legal AI.
0 Introduction
Motivation
The United States Supreme Court decides approximately 80 cases per term, each establishing precedents that shape American law for decades. Understanding and predicting these outcomes has profound implications for legal practitioners, scholars, and policy makers. While prior work has achieved moderate success using hand-crafted features (Katz et al., 2017; Kaufman et al., 2019) or transformer-based text classification (Chalkidis et al., 2019), these approaches treat cases as isolated documents, ignoring the rich network of citations that encodes how legal reasoning propagates through the judicial system.
Legal reasoning is fundamentally graph-structured: courts cite precedents to justify decisions, and the pattern of citations reveals latent relationships between legal concepts. A case citing Roe v. Wade and Planned Parenthood v. Casey signals different legal context than one citing Miranda v. Arizona and Gideon v. Wainwright. We hypothesize that explicitly modeling this citation structure improves outcome prediction beyond what text alone can achieve.
Research Questions
Does incorporating citation graph structure improve legal outcome prediction over text-only baselines?
Can graph-based retrieval identify more relevant precedents than lexical (BM25) or dense (embedding) retrieval?
How do different components (graph, retrieval, LLM) contribute to overall system performance?
Contributions
We present LegalGPT, the first system to combine citation graph neural networks with LLM-based reasoning for Supreme Court outcome prediction, demonstrating that these modalities are complementary.
We introduce a novel retrieval method combining GraphSAGE embeddings, citation proximity, and BM25 scoring that outperforms single-signal retrieval by +6 AUROC points.
Our QLoRA-based approach enables training on a single A100 GPU for under $30, democratizing legal AI research for academic labs without enterprise resources.
We provide rigorous ablation studies, statistical significance tests, calibration analysis, and attention-based interpretability to understand model behavior.
1 Problem Statement
The Challenge
Predicting legal case outcomes remains challenging because decisions depend not only on case facts but also on how courts interpret and apply precedents. Current approaches treat cases as isolated text documents, ignoring the rich network of citations that reveals how legal reasoning flows through the judicial system.
Our goal: Model case outcomes as a function of both textual content and citation network structure.
Why It Matters
Attorneys can better assess case strength and identify relevant precedents
Understanding prediction patterns can reveal biases in judicial decision-making
Establishes how graph structure improves legal NLP beyond text-only approaches
Current Limitations of Text-Only Approaches
| Approach | Text | Citations | Graph Structure | Limitation |
|---|---|---|---|---|
| BERT/Legal-BERT | Yes | No | No | 512 token limit, no precedent awareness |
| Longformer | Yes | No | No | Long context but isolated documents |
| LLM + BM25 Retrieval | Yes | Partial | No | Lexical matching misses semantic links |
| LegalGPT (Ours) | Yes | Yes | Yes | Full integration of all signals |
2 Key Innovations
First Integrated System
Combines citation graph structure with case text and LLM reasoning in a unified pipeline.
GraphSAGE Retrieval
Uses graph neural network embeddings to find precedents based on citation structure, not just text similarity.
Affordable Training
QLoRA fine-tuning enables Mistral-7B adaptation for just $30 total compute cost on a single A100 GPU.
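The "~0.1% trainable" figure follows directly from the LoRA rank. A back-of-the-envelope check, assuming rank-16 adapters are attached to the attention `q_proj` and `v_proj` matrices in all 32 layers (the target modules are our assumption; the architecture dimensions are standard for Mistral-7B):

```python
# Rough LoRA trainable-parameter count for Mistral-7B.
# Assumptions: adapters on q_proj and v_proj only; standard Mistral-7B dims.
HIDDEN = 4096   # hidden size
KV_DIM = 1024   # grouped-query attention: 8 KV heads * head dim 128
LAYERS = 32
RANK = 16       # LoRA rank used in this work

def lora_params(d_in: int, d_out: int, r: int) -> int:
    """A LoRA adapter adds two low-rank matrices: A (d_in x r) and B (r x d_out)."""
    return r * (d_in + d_out)

# q_proj maps 4096 -> 4096; v_proj maps 4096 -> 1024 (assumed targets).
per_layer = lora_params(HIDDEN, HIDDEN, RANK) + lora_params(HIDDEN, KV_DIM, RANK)
total = per_layer * LAYERS  # ~6.8M trainable params, ~0.1% of the 7.24B base
print(f"trainable: {total / 1e6:.1f}M params "
      f"({100 * total / 7.24e9:.2f}% of base model)")
```

Under these assumptions the count lands at roughly 6.8M parameters, consistent with the ~7M (0.1%) figure in the architecture diagram.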
3 System Architecture
LegalGPT operates as a 3-stage pipeline: graph-based retrieval identifies relevant precedents, context assembly builds the prompt, and the fine-tuned LLM generates predictions with confidence scores.
LegalGPT Architecture
================================================================================
STAGE 1: GRAPH RETRIEVAL
-------------------------
┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐
│ Query Case │──────│ Neo4j Graph │──────│ GraphSAGE │
│ (Input) │ │ (226 edges) │ │ Embeddings │
└─────────────────┘ └──────────────────┘ └────────┬────────┘
│
Hybrid Score = 0.6 * embedding + 0.4 * citation
│
▼
┌─────────────────────┐
│ Top-K Precedents │
│ (k = 5) │
└──────────┬──────────┘
│
STAGE 2: CONTEXT ASSEMBLY │
-------------------------- │
┌─────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ PROMPT TEMPLATE │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ [INST] You are a legal analyst. Given this Supreme Court case │ │
│ │ and similar precedents, predict the outcome. │ │
│ │ │ │
│ │ ## Query Case │ │
│ │ {case_summary} │ │
│ │ │ │
│ │ ## Similar Precedents │ │
│ │ 1. {precedent_1} - Outcome: {outcome} │ │
│ │ 2. {precedent_2} - Outcome: {outcome} │ │
│ │ ... │ │
│ │ │ │
│ │ Prediction: [/INST] │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
│
STAGE 3: LLM CLASSIFICATION │
---------------------------- │
▼
┌─────────────────────────────────────────────┐
│ Mistral-7B + QLoRA Adapter │
│ ┌─────────────────────────────────────┐ │
│ │ Base: Mistral-7B-Instruct-v0.3 │ │
│ │ Quantization: 4-bit NF4 │ │
│ │ LoRA Rank: 16, Alpha: 32 │ │
│ │ Trainable: ~7M params (0.1%) │ │
│ └─────────────────────────────────────┘ │
└──────────────────────┬──────────────────────┘
│
▼
┌─────────────────────────────────────────────┐
│ OUTPUT │
│ ┌─────────────────────────────────────┐ │
│ │ Prediction: PETITIONER │ │
│ │ Confidence: 0.78 │ │
│ │ Reasoning: Based on precedents... │ │
│ └─────────────────────────────────────┘ │
└─────────────────────────────────────────────┘
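The hybrid scoring step in Stage 1 can be sketched as follows. This is an illustrative sketch, not the released implementation: it uses cosine similarity over GraphSAGE embeddings plus a precomputed citation-proximity score in [0, 1], matching the two-term 0.6/0.4 formula in the diagram (the BM25 term from the retrieval method is omitted here), and the function names are our own:

```python
import numpy as np

def hybrid_retrieve(query_emb, precedent_embs, citation_prox, k=5,
                    w_emb=0.6, w_cite=0.4):
    """Rank precedents by 0.6 * embedding similarity + 0.4 * citation proximity.

    query_emb:      (d,) GraphSAGE embedding of the query case
    precedent_embs: (n, d) embeddings of candidate precedents
    citation_prox:  (n,) citation-proximity scores in [0, 1]
    Returns indices of the top-k precedents, best first.
    """
    # Cosine similarity between the query and each candidate.
    q = query_emb / np.linalg.norm(query_emb)
    P = precedent_embs / np.linalg.norm(precedent_embs, axis=1, keepdims=True)
    emb_sim = P @ q                      # (n,)
    score = w_emb * emb_sim + w_cite * np.asarray(citation_prox)
    return np.argsort(score)[::-1][:k]   # top-k indices, highest score first
```

With `k=5` this produces the "Top-K Precedents" that Stage 2 assembles into the prompt.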
4 Performance Highlights
Comparison with Prior SCOTUS Prediction Work
| Work | Method | Accuracy | Notes |
|---|---|---|---|
| Katz et al. (2017) | Random Forest + case features | 70.2% | Hand-crafted features, no text |
| Kaufman et al. (2019) | Neural network + SCOTUS features | 72.8% | Improved feature engineering |
| Baseline (Longformer) | Transformer encoder | 70.8% | Text-only, no retrieval |
| LegalGPT (Ours) | GraphSAGE + Mistral-7B | 76.0% | Graph-augmented retrieval + LLM |
Impact of Graph-Augmented Retrieval
Adding citation-aware retrieval improves AUROC by 6 percentage points, demonstrating that precedent structure matters.
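For reference, the AUROC metric used throughout can be computed without external dependencies via the rank-sum (Mann-Whitney U) formulation, which gives the probability that a randomly chosen positive case is scored above a randomly chosen negative one. A minimal sketch:

```python
def auroc(labels, scores):
    """AUROC via the Mann-Whitney U statistic.

    labels: 0/1 outcomes; scores: model confidence for the positive class.
    """
    pairs = sorted(zip(scores, labels))
    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        # Group tied scores and assign them their average rank (1-based).
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2          # average of ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(y for _, y in pairs[i:j])
        i = j
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)
```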
5 Dataset Summary
SCDB Cases
| Statistic | Value |
|---|---|
| Source | Supreme Court Database |
| Original scope | 9,144 cases (1946-2023) |
| Matched with text | 163 cases |
| Petitioner wins | 93 (57%) |
| Respondent wins | 70 (43%) |
| Avg case length | ~41K characters |
Citation Graph
| Statistic | Value |
|---|---|
| Total edges | 226 |
| Unique sources | 24 cases |
| Unique targets | 176 cited cases |
| Avg out-degree | 9.4 citations/case |
| Citation types | Supreme Court (65%), Federal (15%), State (16%), Other (4%) |
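The average out-degree above is simply total edges divided by unique source cases (226 / 24 ≈ 9.4). A sketch of computing these statistics from an edge list (the tuple layout is illustrative):

```python
from collections import Counter

def graph_stats(edges):
    """Summarize a citation edge list of (citing_case, cited_case) pairs."""
    sources = Counter(src for src, _ in edges)  # out-degree per citing case
    targets = {dst for _, dst in edges}
    return {
        "total_edges": len(edges),
        "unique_sources": len(sources),
        "unique_targets": len(targets),
        "avg_out_degree": round(len(edges) / len(sources), 1),
    }
```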
6 Related Work & Context
Gap We Fill
No prior work integrates all three components:
- Citation network structure
- Full case text
- LLM-based reasoning
Previous approaches either use citation networks for link prediction (without outcome prediction) or use text-only models (ignoring network structure). LegalGPT bridges this gap.
7 Ethics Statement
Intended Use
Understanding patterns in judicial decision-making, legal scholarship, teaching tools for law students.
Assisting attorneys in identifying relevant precedents and assessing case strength as one input among many.
Revealing patterns that may indicate inconsistencies or biases in judicial reasoning.
Risks & Mitigations
Model predictions should supplement, not replace, human legal judgment. We explicitly discourage using predictions as definitive forecasts.
Historical case outcomes may reflect systemic biases. Our model may perpetuate these patterns. We recommend fairness audits before deployment.
Powerful legal AI tools could advantage well-resourced litigants. We release our code and models openly to democratize access.
Ethical Guidelines for Users
DO
- Use as one factor in case assessment, not the sole determinant
- Validate predictions against legal expertise
- Disclose AI assistance in legal filings where required
- Consider model uncertainty (confidence scores)
- Audit for fairness across demographic groups
DO NOT
- Use predictions to deny legal services
- Present AI predictions as legal advice
- Deploy without human oversight in high-stakes decisions
- Assume model generalizes to non-Supreme Court contexts
- Ignore low-confidence predictions without review
Data Privacy & Consent
All data used in this research consists of publicly available Supreme Court opinions and metadata from the Supreme Court Database (SCDB). No private or personal data was collected. Case outcomes are matters of public record. The SCDB is licensed for academic research use. We comply with CourtListener's terms of service for case text retrieval.
8 Limitations
Data Limitations
| Limitation | Description |
|---|---|
| Sample Size | 163 cases with full text matching limits statistical power. Expanding the dataset is ongoing work. |
| Temporal Scope | 1946-2023 SCDB cases only. Legal reasoning patterns may differ for earlier courts or future compositions. |
| Domain Specificity | Supreme Court only. Results may not generalize to lower federal courts, state courts, or international jurisdictions. |
| Citation Coverage | Regex-based extraction may miss some citations. External citations (law reviews, statutes) not included. |
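The regex-based extraction noted under Citation Coverage can be illustrated with a pattern for U.S. Reports citations (e.g. `410 U.S. 113`). This is a simplified sketch, not the pattern used in the pipeline; it will miss parallel reporters, short-form cites, and id./supra references, which is exactly the coverage gap described above:

```python
import re

# Matches U.S. Reports citations such as "410 U.S. 113" or "384 U. S. 436".
US_CITE = re.compile(r"\b(\d{1,3})\s+U\.\s?S\.\s+(\d{1,4})\b")

def extract_citations(text):
    """Return (volume, page) pairs for every U.S. Reports cite in the text."""
    return [(int(vol), int(page)) for vol, page in US_CITE.findall(text)]
```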
Methodological Limitations
| Limitation | Description |
|---|---|
| Post-hoc Prediction | We use full opinion text (retrospective). True forecasting would require pre-decision features only. |
| Binary Outcomes | Petitioner/respondent simplification ignores partial wins, remands, and plurality opinions. |
| No LLM Baselines | GPT-4 and Claude comparisons are pending; zero-shot LLM performance is unknown. |
| Interpretability | Attention analysis is correlational, not causal. True reasoning may differ from attention patterns. |
Future Work to Address Limitations
Improve SCDB-CourtListener matching to 1000+ cases for robust evaluation.
Compare against GPT-4, Claude-3, and other frontier models in zero-shot and few-shot settings.
Extend to Circuit Courts and state supreme courts to test generalization.
Use pre-argument briefs and oral arguments for real predictive applications.
Multi-class prediction: affirm, reverse, remand, vacate, per curiam.
Evaluate performance disparities across issue areas and party types.
Researchers
Luis Sanchez
UC Berkeley, Computer Science
Founding Engineer at Paloa Labs. Former SWE Intern at Adobe. Chancellor's Scholar. Focus on Agentic AI, infrastructure, and automation.
Shubhankar Tripathy
Stanford PhD, OpenAI Researcher
PhD candidate at Stanford University. Research Scientist at OpenAI. Focus on large language models, reasoning, and AI safety.
Methodology
Technical details on GraphSAGE embeddings, hybrid retrieval, and QLoRA fine-tuning.
Data
Comprehensive analysis of SCDB dataset, citation graph, and data preprocessing.
Results
Full evaluation metrics, ablation studies, and comparison with baselines.