Methodology

Comprehensive technical details of the LegalGPT graph-augmented legal prediction system

Keywords: GraphSAGE, QLoRA, RAG, Mistral-7B
Luis Sanchez
UC Berkeley
luisanchez@berkeley.edu
Shubhankar Tripathy
Stanford PhD, OpenAI
stripathy@umass.edu

1 System Architecture Overview

LegalGPT implements a three-stage pipeline combining graph neural networks for precedent retrieval with large language models for outcome prediction. The architecture is inspired by retrieval-augmented generation (RAG) frameworks (Lewis et al., 2020) but incorporates citation graph structure as a first-class signal.

LegalGPT System Architecture (pipeline overview):

INPUT: case text + metadata
    ↓
STAGE 1: Graph Retrieval
    Citation graph (Neo4j): 10K nodes, 150K edges
    GraphSAGE: 2-layer encoder, 128-dim output
    Hybrid retriever: k=5 precedents (scoring weights in Section 7)
    → top-k precedents with relevance scores
    ↓
STAGE 2: Context Assembly
    [INST] System prompt + Query case + Precedents [/INST]
    Query: ~5K tokens | Precedents: ~15K tokens (5×3K) | Total: ~20K tokens
    ↓
STAGE 3: LLM Prediction
    Mistral-7B-Instruct v0.3 (frozen weights)
    QLoRA adapters: r=16, α=32, ~7M trainable params (0.1%)
    Classification head: Linear 4096→2, softmax
    ↓
OUTPUT: P(petitioner) | P(respondent)

Stage 1: Graph Retrieval

GraphSAGE learns node embeddings that capture both semantic content and citation structure. Hybrid scoring combines embedding similarity with graph proximity.

Stage 2: Context Assembly

Retrieved precedents are formatted with metadata (date, outcome, relevance score) into a structured prompt following Mistral's instruction format.

Stage 3: LLM Prediction

QLoRA fine-tuning adapts the frozen Mistral-7B model with only 0.1% trainable parameters, enabling efficient domain adaptation.

2 Task Formulation

Formal Definition

Legal Outcome Prediction Task:

Input:  C = (T, M, G)
  where:
    T = case text (opinion, arguments)
    M = metadata (date, court, parties)
    G = citation subgraph context

Output: y ∈ {petitioner, respondent}

Objective: Learn f: (T, M, G) → y
  that maximizes P(y | T, M, G)
                

Label Definition (SCDB)

Label            Definition                   Frequency
petitioner (1)   Party bringing appeal wins   57%
respondent (0)   Party responding wins        43%

Labels derived from SCDB partyWinning variable. Cases with unclear outcomes (remands, mixed decisions) excluded.

Why This Formulation?

Design Choices

  • Binary classification: Simplifies evaluation; multi-class (unanimous, split, remand) is future work
  • Case-level prediction: Predicts overall winner, not issue-by-issue outcomes
  • Post-hoc prediction: Uses full opinion text (retrospective analysis, not pre-decision forecasting)

Comparison to Prior Work

  • Katz et al. (2017): Used pre-argument features only (true forecasting)
  • Chalkidis et al. (2019): ECHR violation prediction (similar setup)
  • Ours: Full text + citation context (maximum information)

3 GraphSAGE Embeddings

We employ GraphSAGE (Hamilton et al., 2017) to learn inductive node representations that capture both textual semantics and citation graph structure. Unlike transductive methods (e.g., DeepWalk, Node2Vec), GraphSAGE can embed unseen nodes at inference time.


Architecture Configuration

Parameter           Value   Justification
Input Dimensions    385     384-dim all-MiniLM-L6-v2 text + 1 temporal
Hidden Dimensions   256     2× compression
Output Dimensions   128     Retrieval efficiency
Number of Layers    2       2-hop neighborhood
Aggregator          MEAN    Permutation invariant
Activation          ReLU    Standard choice
Dropout             0.3     Regularization
Normalization       L2      Unit sphere

Message Passing Formulation

GraphSAGE Layer (Hamilton et al., 2017):

AGGREGATE:
  a_v^(k) = MEAN({h_u^(k-1) : u ∈ N(v)})

COMBINE:
  h_v^(k) = σ(W^(k) · CONCAT(h_v^(k-1), a_v^(k)))

With L2 normalization:
  h_v^(k) = h_v^(k) / ||h_v^(k)||₂

Where:
  h_v^(0) = x_v (initial node features)
  N(v) = {u : (u,v) ∈ E} (cited cases)
  σ = ReLU activation
  W^(k) ∈ ℝ^{d_k × 2d_{k-1}}
                

The MEAN aggregator provides permutation invariance over neighborhoods. L2 normalization ensures embeddings lie on the unit hypersphere for cosine similarity retrieval.

Node Feature Initialization

Component          Dim       Source
Text Embedding     384-dim   sentence-transformers/all-MiniLM-L6-v2
Temporal Feature   1-dim     Normalized year: (year - 1946) / 77
Total Input        385-dim   Concatenated features

Neighborhood Sampling

Layer 1 neighbors   25
Layer 2 neighbors   10
Total sampled       ≤ 275 nodes/case (25 + 25×10)

Neighbors are sampled uniformly at random, with the cap limiting memory during training.
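
A minimal sketch of this sampling with PyTorch Geometric's NeighborLoader; the toy graph and batch size are illustrative choices, not values from our pipeline:

import torch
from torch_geometric.data import Data
from torch_geometric.loader import NeighborLoader

# Toy stand-in for the citation graph (385-dim node features).
data = Data(
    x=torch.randn(1000, 385),
    edge_index=torch.randint(0, 1000, (2, 5000)),
)

# 25 first-hop and 10 second-hop neighbors per seed node, sampled uniformly,
# capping each seed's computation graph at 25 + 25×10 = 275 nodes.
loader = NeighborLoader(data, num_neighbors=[25, 10], batch_size=64, shuffle=True)
batch = next(iter(loader))  # mini-batch subgraph for the GraphSAGE forward pass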

PyTorch Geometric Implementation

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv

class LegalGraphSAGE(nn.Module):
    def __init__(self, in_dim=385, hidden_dim=256, out_dim=128, dropout=0.3):
        super().__init__()
        # SAGEConv defaults to the MEAN aggregator; normalize=True applies L2 per layer
        self.conv1 = SAGEConv(in_dim, hidden_dim, normalize=True)
        self.conv2 = SAGEConv(hidden_dim, out_dim, normalize=True)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, edge_index):
        # Layer 1: input → hidden
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = self.dropout(x)

        # Layer 2: hidden → output
        x = self.conv2(x, edge_index)

        # Final L2 normalization for cosine-similarity retrieval
        x = F.normalize(x, p=2, dim=1)
        return x
        

4 QLoRA Fine-tuning

We employ QLoRA (Dettmers et al., 2023) to efficiently fine-tune Mistral-7B-Instruct for legal outcome prediction. QLoRA combines 4-bit quantization with Low-Rank Adaptation (LoRA; Hu et al., 2022), reducing memory requirements by ~4× while maintaining full fine-tuning performance.

Base Model: Mistral-7B-Instruct-v0.3

Attribute          Value
Parameters         7.24B
Architecture       Transformer decoder
Context Length     32,768 tokens
Hidden Dimension   4,096
Attention Heads    32
Layers             32
Vocabulary         32,000
License            Apache 2.0

QLoRA Configuration

Parameter       Value        Notes
Quantization    4-bit NF4    NormalFloat
Double Quant    True         Quantize constants
Compute dtype   bfloat16     Mixed precision
LoRA Rank (r)   16           Low-rank dim
LoRA Alpha (α)  32           Scaling = α/r
LoRA Dropout    0.05         Regularization
Trainable       ~7M (0.1%)   Of 7.24B total

LoRA Mathematical Formulation

Low-Rank Adaptation (Hu et al., 2022):

Original: h = W₀x
LoRA:     h = W₀x + ΔWx
              = W₀x + BAx

Where:
  W₀ ∈ ℝ^{d×k} (frozen pretrained)
  B ∈ ℝ^{d×r} (trainable)
  A ∈ ℝ^{r×k} (trainable)
  r << min(d, k) (low rank)

Scaling: ΔW = (α/r) · BA
  with α = 32, r = 16 → scale = 2
                

Target Modules

Attention: q_proj, k_proj, v_proj, o_proj
MLP: gate_proj, up_proj, down_proj

We apply LoRA to all linear layers in both attention and MLP blocks, following the recommendation of Dettmers et al. (2023) for QLoRA.

Memory savings: 4-bit quantization reduces model memory from ~14GB to ~4GB, enabling training on single A100-40GB.
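
This configuration maps directly onto the HuggingFace peft and bitsandbytes APIs. A minimal sketch (illustrative, not our exact training script; the model ID and num_labels follow the setup above):

import torch
from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NormalFloat quantization
    bnb_4bit_use_double_quant=True,      # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForSequenceClassification.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    num_labels=2,
    quantization_config=bnb_config,
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="SEQ_CLS",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # ~7M trainable of 7.24B total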

Training Hyperparameters

Learning Rate           2e-4
LR Scheduler            Cosine with warmup
Warmup Ratio            0.1 (10%)
Batch Size              4
Gradient Accumulation   4 steps
Effective Batch Size    16
Epochs                  3
Max Sequence Length     4,096 tokens
Optimizer               AdamW (8-bit)
Weight Decay            0.01
Gradient Clipping       1.0

Classification Head

Sequence Classification Architecture:

1. Extract last token hidden state:
   h_last = LLM(input_ids)[:, -1, :]
   h_last ∈ ℝ^{batch × 4096}

2. Linear projection:
   logits = W_cls · h_last + b_cls
   W_cls ∈ ℝ^{2 × 4096}
   logits ∈ ℝ^{batch × 2}

3. Softmax probabilities:
   P(y|x) = softmax(logits)
                

We use the last token representation following standard practice for causal LM classification (Radford et al., 2019).

5 Loss Functions

Training objectives for the classification task and GraphSAGE link prediction, with regularization techniques for improved generalization.

Cross-Entropy Loss

Standard Cross-Entropy:

ℒ_CE = -∑_{i=1}^{N} ∑_{c=1}^{C} y_{i,c} · log(p_{i,c})

For binary classification (C=2):

ℒ_CE = -1/N ∑_{i=1}^{N} [y_i·log(p_i) + (1-y_i)·log(1-p_i)]

Where:
  N = number of samples
  C = number of classes (2)
  y_{i,c} = ground truth (one-hot)
  p_{i,c} = predicted probability
                

Cross-entropy measures the divergence between predicted and true distributions. Minimizing CE is equivalent to maximum likelihood estimation (Goodfellow et al., 2016).

Label Smoothing Regularization

Label Smoothing (Szegedy et al., 2016):

Smoothed targets:
  y'_{i,c} = y_{i,c}·(1-ε) + ε/C

With ε = 0.1:
  Hard: [1, 0] → Soft: [0.95, 0.05]
  Hard: [0, 1] → Soft: [0.05, 0.95]

Equivalent loss:
  ℒ_LS = (1-ε)·ℒ_CE(y,p) + ε·H(u,p)

Where H(u,p) is CE with uniform dist.
                

Label smoothing prevents overconfident predictions, improving calibration and generalization (Müller et al., 2019).
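
In PyTorch, label smoothing with ε = 0.1 is built into the loss; a quick sketch:

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

logits = torch.tensor([[2.0, -1.0]])  # model output for one case (illustrative values)
target = torch.tensor([0])            # gold label: petitioner
loss = criterion(logits, target)      # cross-entropy against softened [0.95, 0.05] targets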

Link Prediction Loss (GraphSAGE)

Contrastive Link Prediction Objective:

ℒ_link = -∑_{(u,v)∈E} [ log(σ(z_u^T · z_v)) + Q · 𝔼_{v_n∼P_n} log(σ(-z_u^T · z_{v_n})) ]

Simplified binary cross-entropy form:

ℒ_link = -1/|E| ∑_{(u,v)∈E} [log(σ(z_u^T·z_v)) + ∑_{j=1}^{Q} log(σ(-z_u^T·z_{v_j}^-))]

Where:
  E = observed citation edges
  z_u, z_v = learned node embeddings (128-dim)
  σ = sigmoid function
  Q = negative samples per positive (Q=5)
  P_n(v) ∝ degree(v)^0.75 (negative distribution)
  v_j^- = sampled negative node
            
  • Positive term: maximize similarity for cited pairs
  • Negative term: minimize similarity for non-cited pairs
  • Contrastive effect: learn discriminative embeddings
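
A minimal PyTorch sketch of this objective over precomputed embeddings; tensor names and shapes are illustrative:

import torch
import torch.nn.functional as F

def link_prediction_loss(z, pos_edge_index, neg_edge_index):
    # z: (N, 128) L2-normalized node embeddings
    # pos_edge_index: (2, E) observed citation edges
    # neg_edge_index: (2, Q·E) sampled negatives, Q per positive
    pos = (z[pos_edge_index[0]] * z[pos_edge_index[1]]).sum(dim=-1)  # z_u · z_v
    neg = (z[neg_edge_index[0]] * z[neg_edge_index[1]]).sum(dim=-1)
    # BCE-with-logits gives -log σ(s) for positives and -log σ(-s) for negatives
    loss = (F.binary_cross_entropy_with_logits(pos, torch.ones_like(pos))
            + F.binary_cross_entropy_with_logits(neg, torch.zeros_like(neg)))
    return loss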

Total Training Objective

Multi-task Loss:

ℒ_total = ℒ_classification + λ·ℒ_link

Where:
  ℒ_classification: QLoRA fine-tuning loss
  ℒ_link: GraphSAGE link prediction loss
  λ = 0.1 (balancing coefficient)

Training schedule:
  1. Pre-train GraphSAGE (link prediction)
  2. Freeze GraphSAGE, train QLoRA
  3. Optional: joint fine-tuning (λ > 0)
                

Regularization Summary

Technique         Value   Effect
Label Smoothing   ε=0.1   Calibration
Weight Decay      0.01    L2 penalty
Dropout (LoRA)    0.05    Adaptation
Dropout (GNN)     0.3     Graph layers
Gradient Clip     1.0     Stability

6 Negative Sampling Strategy

Effective negative sampling is crucial for learning discriminative graph embeddings. We employ degree-biased sampling with hard negative mining following best practices from knowledge graph embedding literature.

Sampling Distribution

Negative Sampling Distribution:

P_n(v) ∝ degree(v)^α

With α = 0.75 (Mikolov et al., 2013):
  - α = 0: uniform sampling
  - α = 1: degree-proportional
  - α = 0.75: smoothed (empirically optimal)

Effect: Reduces sampling of rare nodes,
focuses on distinguishing similar cases.
                
Positive edges (train)   ~120,000
Negative ratio (Q)       5
Sampling exponent (α)    0.75
Training pairs/epoch     ~720,000 (~120K × (1 + Q))
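
A sketch of the degree-biased sampler; the function name and isolated-node behavior (zero probability) are our illustrative choices:

import torch

def degree_biased_negatives(edge_index, num_nodes, num_samples, alpha=0.75):
    # P_n(v) ∝ degree(v)^0.75: smoothed degree-proportional sampling
    deg = torch.bincount(edge_index.reshape(-1), minlength=num_nodes).float()
    probs = deg.pow(alpha)        # isolated nodes get zero sampling probability
    probs = probs / probs.sum()
    return torch.multinomial(probs, num_samples, replacement=True)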

Hard Negative Mining

Hard negatives: Cases that are structurally or semantically close but not directly cited.

Hard negative criteria:
1. 2-hop neighbors (cited by same case)
2. Same legal issue area (SCDB code)
3. Temporal proximity (±5 years)
4. High text similarity (>0.7 cosine)
                    

Curriculum Schedule

Epochs 1-2   100% random             Easy start
Epochs 3-4   80% random + 20% hard   Gradual
Epochs 5+    60% random + 40% hard   Full difficulty

Temporal Constraints

Citation temporal constraint:

For edge (u, v) where u cites v:
  date(v) < date(u)  # v must precede u

This ensures:
  - No future citations (anachronistic)
  - Realistic precedent relationships
  - Proper train/test temporal split
                    

Implications for Negative Sampling

  • Negatives must also respect temporal ordering
  • Cannot sample future cases as negatives for historical ones
  • Prevents temporal data leakage during training

7 Hybrid Retrieval System

Our retrieval system combines dense embedding similarity with sparse citation graph structure, following the hybrid retrieval paradigm (Karpukhin et al., 2020; Ma et al., 2021).


Hybrid Scoring Function

Score(q, d) = α · S_embed(q, d) + β · S_citation(q, d) + γ · S_text(q, d)

  α = 0.40   S_embed      GraphSAGE similarity (learned structural + semantic)
  β = 0.35   S_citation   Citation proximity (graph distance signal)
  γ = 0.25   S_text       BM25 lexical matching

S_embed: Embedding Similarity

Cosine similarity of GraphSAGE embeddings:

S_embed(q, d) = cos(z_q, z_d)
              = (z_q · z_d) / (||z_q|| · ||z_d||)

Since embeddings are L2-normalized:
S_embed(q, d) = z_q · z_d  (dot product)

Range: [-1, 1] → normalized to [0, 1]
                

S_citation: Graph Proximity

Inverse shortest path distance:

S_citation(q, d) = 1 / (1 + dist(q, d))

Where dist(q, d) is shortest path
in the citation graph.

Special cases:
  - Direct citation: dist=1 → S=0.5
  - 2-hop: dist=2 → S=0.33
  - Unreachable: dist=∞ → S=0
                

S_text: BM25 Similarity

BM25 (Robertson et al., 1995):

S_text(q, d) = ∑_{t∈q} IDF(t) ·
  (f(t,d)·(k₁+1)) /
  (f(t,d) + k₁·(1-b+b·|d|/avgdl))

Parameters:
  k₁ = 1.2 (term frequency saturation)
  b = 0.75 (length normalization)
                

Retrieval Algorithm

ALPHA, BETA, GAMMA = 0.40, 0.35, 0.25  # hybrid weights from the validation grid search

def retrieve_precedents(query_case, k=5):
    # Stage 1: candidate generation (fast, high recall); set union deduplicates
    candidates = set()
    candidates |= set(ann_search(query_case.embedding, n=100))   # dense ANN search
    candidates |= set(citation_neighbors(query_case, hops=2))    # graph neighbors
    candidates |= set(bm25_search(query_case.text, n=50))        # lexical matches

    # Stage 2: hybrid re-ranking with the three-signal score
    scored = []
    for doc in candidates:
        s = (ALPHA * embed_sim(query_case, doc)
             + BETA * citation_proximity(query_case, doc)
             + GAMMA * bm25_score(query_case, doc))
        scored.append((s, doc))

    # Stage 3: return the top-k documents by score
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]

Weight Ablation Results

Configuration           AUROC   Δ
α=1.0 (embed only)      0.76    -0.04
β=1.0 (citation only)   0.77    -0.03
γ=1.0 (BM25 only)       0.74    -0.06
α=0.5, β=0.5            0.78    -0.02
α=0.4, β=0.35, γ=0.25   0.80    (best)

Optimal weights found via grid search on validation set. Three-signal combination outperforms any single signal.

8 Embedding Fusion Architecture

Multi-modal representation combining semantic text features with structural graph information through learned fusion layers.

Embedding Space Visualization

2D PCA projection showing case embeddings clustered by outcome. Query case finds nearest neighbors for retrieval.

Fusion Architecture Diagram

Text branch:   Case text (full opinion)
                 → Sentence-BERT (all-MiniLM-L6-v2, frozen)
                 → e_text (384-dim)

Graph branch:  Citation graph (node + edges)
                 → GraphSAGE (2-layer GNN, trained)
                 → e_graph (128-dim)

Fusion:        Concatenation [e_text; e_graph] (512-dim)
                 → Fusion MLP (512 → 256 → 128, ReLU + Dropout)
                 → e_fused (128-dim, L2 normalized)

Concatenation Fusion

Simple Concatenation:

e_concat = [e_text; e_graph]
         = [384-dim; 128-dim]
         = 512-dim

Fusion MLP:
h1 = ReLU(W1 · e_concat + b1)  # 512→256
h1 = Dropout(h1, p=0.2)
h2 = ReLU(W2 · h1 + b2)        # 256→128
e_fused = L2_normalize(h2)

Where:
  W1 ∈ ℝ^{256×512}, b1 ∈ ℝ^256
  W2 ∈ ℝ^{128×256}, b2 ∈ ℝ^128
                

Concatenation preserves all information from both modalities. The MLP learns non-linear interactions (Baltrusaitis et al., 2019).
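
A minimal PyTorch sketch of the fusion MLP with the dimensions above (class name illustrative):

import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionMLP(nn.Module):
    def __init__(self, text_dim=384, graph_dim=128, hidden_dim=256, out_dim=128, dropout=0.2):
        super().__init__()
        self.fc1 = nn.Linear(text_dim + graph_dim, hidden_dim)  # 512 → 256
        self.fc2 = nn.Linear(hidden_dim, out_dim)               # 256 → 128
        self.dropout = nn.Dropout(dropout)

    def forward(self, e_text, e_graph):
        x = torch.cat([e_text, e_graph], dim=-1)   # concatenation fusion (512-dim)
        x = self.dropout(F.relu(self.fc1(x)))
        x = F.relu(self.fc2(x))
        return F.normalize(x, p=2, dim=-1)         # project onto the unit hypersphere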

Alternative: Gated Fusion

Gated Fusion (Arevalo et al., 2017):

g = σ(W_g · [e_text; e_graph] + b_g)
e_fused = g ⊙ tanh(W_t·e_text) +
          (1-g) ⊙ tanh(W_s·e_graph)

Where:
  g ∈ ℝ^d: learned gating vector
  W_t, W_s: per-modality projections
  σ: sigmoid activation
  ⊙: element-wise multiplication

Advantage: Adaptive weighting per
dimension based on input content.
                

Gated fusion allows the model to dynamically weight each modality. We found concatenation performs comparably with simpler implementation.

Fusion Ablation Results

Configuration   AUROC
Text Only       0.74
Graph Only      0.76
Concatenation   0.79
Concat + MLP    0.80

Fusion provides +0.06 AUROC over the text-only baseline and +0.04 over the graph-only baseline.

9 Prompt Engineering

The prompt template structures retrieved precedents with metadata to enable effective in-context learning. We follow Mistral's instruction format with careful attention to token budget management.

Full Prompt Template

[INST] You are a legal expert specializing in U.S. Supreme Court case analysis.
Your task is to predict the outcome of a case based on its content and relevant precedents.

## Case to Analyze
Case Name: {case_name}
Docket Number: {docket}
Decision Date: {date}
Legal Issue Area: {issue_area}

Case Text (Opinion):
{case_text_truncated}

## Relevant Precedents
The following cases have been identified as relevant based on citation patterns and semantic similarity:

{for i, precedent in enumerate(retrieved_cases, 1)}
### Precedent {i}: {precedent.name} ({precedent.year})
Relevance Score: {precedent.score:.2f}
Outcome: {precedent.outcome}
Citation Distance: {precedent.citation_distance} hops

Key Excerpt:
{precedent.text_excerpt}

{endfor}

## Task
Based on the case text and the patterns observed in relevant precedents, predict whether
the PETITIONER (party bringing the appeal) or RESPONDENT (party responding) will win.

Consider:
1. How the legal issues align with precedent outcomes
2. The strength of citation relationships
3. The evolution of legal doctrine over time

Prediction: [/INST]
        

Token Budget Management

Component         Tokens    % of prompt
System prompt     ~200      1%
Query case text   ~5,000    25%
Precedent 1       ~3,000    15%
Precedent 2       ~3,000    15%
Precedent 3       ~3,000    15%
Precedent 4       ~3,000    15%
Precedent 5       ~3,000    15%
Total             ~20,200   62% of the 32K context

Remaining 38% of 32K context reserved for model generation and safety margin.

Truncation Strategy

Hierarchical truncation:

1. Case text: truncate to 5000 tokens
   - Keep first 2500 (background)
   - Keep last 2500 (holding/conclusion)

2. Precedent excerpts: 3000 tokens each
   - Prioritize holding sections
   - Include key cited passages

3. If over budget:
   - Reduce precedent count (k=5→4→3)
   - Further truncate excerpts
                

We preserve case beginnings and endings which typically contain facts and holdings respectively.
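
A sketch of the head-and-tail truncation step, assuming the case has already been tokenized to a list of token IDs (helper name illustrative):

def truncate_head_tail(token_ids, max_tokens=5000):
    # Keep the opening (facts/background) and closing (holding/conclusion),
    # dropping the middle when a case exceeds its token budget.
    if len(token_ids) <= max_tokens:
        return token_ids
    half = max_tokens // 2
    return token_ids[:half] + token_ids[-half:]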

Prompt Engineering Ablations

Variant                  AUROC
No system prompt         0.77
No precedent metadata    0.78
Random precedent order   0.79
No relevance scores      0.79
Full template            0.80

Key Findings

  • System prompt improves consistency (+0.03 AUROC)
  • Precedent metadata provides useful signal (+0.02)
  • Ordering precedents by relevance score marginally helps (+0.01)
  • Explicit relevance scores aid attention (+0.01)

10 Attention Analysis & Interpretability

Understanding which precedents and text spans the model attends to provides interpretability for legal practitioners and validates that the model learns meaningful legal reasoning patterns.

Attention Extraction Method

Multi-Head Attention Aggregation:

For layer L, head h:
  A^{(L,h)} = softmax(QK^T / √d_k)

Attention to precedent p:
  α_p = (1/H) ∑_{h=1}^{H} ∑_{t∈T_p} A^{(L,h)}_{cls,t}

Where:
  H = 32 attention heads
  T_p = token positions for precedent p
  cls = classification token (last)
  L = 32 (final layer)
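
A sketch of this aggregation with HuggingFace's output_attentions; we assume the token span of each precedent is recorded during prompt assembly (the span bookkeeping shown is illustrative):

import torch

@torch.no_grad()
def precedent_attention(model, input_ids, precedent_spans):
    # precedent_spans: {name: (start, end)} token positions from prompt assembly
    out = model(input_ids, output_attentions=True)
    A = out.attentions[-1][0]            # final layer, first batch item: (heads, seq, seq)
    from_last = A[:, -1, :].mean(dim=0)  # attention from the last (classification) token,
                                         # averaged over the 32 heads
    return {name: from_last[s:e].sum().item()
            for name, (s, e) in precedent_spans.items()}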
                

Attribution Methods

Method                 Formula
Attention Rollout      Ā = ∏_{l=1}^{L} (0.5·I + 0.5·A^{(l)})
Gradient × Input       attr_i = |∂ℒ/∂x_i · x_i|
Integrated Gradients   IG_i = x_i · ∫_{α=0}^{1} ∂F/∂x_i dα
SHAP Values            φ_i = ∑_S |S|!(n-|S|-1)!/n! · Δ_i
Leave-One-Out          Δ_p = P(y|all) - P(y|all\{p})

Example: Attention Distribution for Obergefell v. Hodges (2015)

Query Case: Obergefell v. Hodges - Same-sex marriage constitutional right

Precedent Attention Weights (normalized):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Loving v. Virginia (1967)        ████████████████████████████████ 0.31
United States v. Windsor (2013)  ██████████████████████████████   0.28
Lawrence v. Texas (2003)         ██████████████████████           0.21
Romer v. Evans (1996)            █████████████                    0.12
Griswold v. Connecticut (1965)   ████████                         0.08
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Key Attended Phrases:
• "fundamental right to marry" (Loving)           → 0.42 local attention
• "equal dignity in the eyes of the law" (Windsor) → 0.38 local attention
• "liberty protects the person" (Lawrence)         → 0.35 local attention

Model Prediction: PETITIONER (confidence: 0.87)
Actual Outcome:   PETITIONER ✓
        

Observed Attention Patterns

Pattern              Effect
Recency bias         1.4× for recent precedents
Outcome alignment    1.8× for matching outcomes
Citation distance    2.1× for direct citations
Issue area overlap   1.6× for same issue area
Relevance score      r = 0.72 correlation

Faithfulness Evaluation

Metric                   Value      Interpretation
Attention-Prediction r   0.67       Strong
LOO Consistency          83%        High
ROAR@10%                 -12% acc   Meaningful
Human Agreement κ        0.54       Moderate

11 Model Calibration

Well-calibrated probability estimates are essential for legal applications where practitioners need to assess prediction reliability. We apply temperature scaling (Guo et al., 2017) for post-hoc calibration.

Expected Calibration Error (ECE)

ECE (Naeini et al., 2015):

ECE = ∑_{m=1}^{M} (|B_m|/n) · |acc(B_m) - conf(B_m)|

Where:
  M = 15 equal-width bins
  B_m = samples in confidence bin m
  acc(B_m) = accuracy in bin m
  conf(B_m) = mean confidence in bin m

Interpretation:
  ECE = 0.00: perfect calibration
  ECE < 0.05: well-calibrated
  ECE > 0.15: poorly calibrated
                

Temperature Scaling

Post-hoc calibration (Guo et al., 2017):

p_calibrated = softmax(z / T)

Where:
  z = logits from model
  T = temperature parameter

Optimization:
  T* = argmin_T NLL(y, softmax(z/T))

Effects:
  T > 1: soften predictions (less confident)
  T < 1: sharpen predictions (more confident)
  T = 1: no change
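
Fitting T* by minimizing NLL on held-out validation logits takes a few lines in PyTorch; a minimal sketch:

import torch
import torch.nn as nn

def fit_temperature(logits, labels):
    # logits: (N, 2) validation-set logits; labels: (N,) gold classes
    T = nn.Parameter(torch.ones(1))
    nll = nn.CrossEntropyLoss()
    optimizer = torch.optim.LBFGS([T], lr=0.1, max_iter=50)

    def closure():
        optimizer.zero_grad()
        loss = nll(logits / T, labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return T.item()  # apply as softmax(z / T) at inference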
                

Reliability Diagram

Accuracy vs Confidence (15 bins):

1.0 │                              ╭── Perfect
    │                        ●   ╱    calibration
0.8 │                   ●  ●   ╱
    │              ●  ●     ╱    ● After T-scaling
    │         ○  ●       ╱       ○ Before
0.6 │      ○  ●  ○     ╱
    │   ○  ●  ○     ╱
0.4 │○  ●        ╱
    │          ╱
0.2 │        ╱
    │      ╱
0.0 ├──────┼──────┼──────┼──────┤
   0.0   0.25   0.5   0.75   1.0
              Confidence
                

Calibration Metrics

ECE (before)            0.127
ECE (after T-scaling)   0.034
Optimal Temperature     T* = 1.42
MCE (Max Cal. Error)    0.089
Brier Score             0.168
NLL (calibrated)        0.412

Confidence Distribution

Confidence bin   Share of test cases
0.5-0.6          15%
0.6-0.7          22%
0.7-0.8          31%
0.8-0.9          24%
0.9-1.0          8%

Practical Implications

Well-Calibrated for Legal Use

After temperature scaling, ECE of 0.034 means when the model predicts 70% confidence, it's correct approximately 70% of the time. This reliability is crucial for:

  • Risk assessment by legal practitioners
  • Identifying cases needing human review
  • Setting decision thresholds for different use cases

12 Statistical Significance Testing

Rigorous statistical methods for comparing model performance and establishing significance of improvements over baselines.

McNemar's Test

McNemar's Test (Dietterich, 1998):

Contingency table for paired predictions:

              Model B
            Correct  Wrong
Model A  ┌─────────┬─────────┐
Correct  │    a    │    b    │
Wrong    │    c    │    d    │
         └─────────┴─────────┘

Test statistic:
  χ² = (|b - c| - 1)² / (b + c)

H₀: P(b) = P(c) (models equivalent)
H₁: P(b) ≠ P(c) (models differ)

Reject H₀ if χ² > 3.84 (α = 0.05)
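
A direct implementation of the continuity-corrected statistic; the counts in the usage line are illustrative, not our results:

from scipy.stats import chi2

def mcnemar(b, c):
    # b: cases model A got right and model B got wrong; c: the reverse.
    # Continuity-corrected chi-squared statistic with 1 degree of freedom.
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    p_value = chi2.sf(stat, df=1)
    return stat, p_value

stat, p = mcnemar(b=61, c=25)  # illustrative counts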
                

Bootstrap Confidence Intervals

BCa Bootstrap (Efron & Tibshirani, 1993):

for b in 1..B:  # B = 10,000
    D_b = sample(D_test, n, replace=True)
    θ_b = metric(model, D_b)

Percentile CI:
  CI_95% = [θ_{(0.025·B)}, θ_{(0.975·B)}]

BCa Correction:
  z₀ = Φ⁻¹(#{θ_b < θ̂} / B)
  a = ∑(θ̄ - θ_i)³ / (6·(∑(θ̄ - θ_i)²)^1.5)

Adjusted percentiles account for bias
and skewness in bootstrap distribution.
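
A sketch of the resampling loop in its plain percentile form; the BCa correction above is applied on top of these draws in practice:

import numpy as np

def bootstrap_ci(y_true, y_pred, metric, B=10_000, seed=42):
    # y_true, y_pred: NumPy arrays over the test set; metric: e.g. AUROC
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)          # resample with replacement
        stats[b] = metric(y_true[idx], y_pred[idx])
    return np.percentile(stats, [2.5, 97.5])      # percentile 95% CI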
                

Statistical Comparison Results

Comparison                        AUROC Δ   95% CI         McNemar χ²   p-value   Significant?
Ours vs. Mistral (no retrieval)   +0.06     [0.04, 0.08]   18.7         <0.001    Yes ***
Ours vs. BM25 Retrieval           +0.03     [0.01, 0.05]   8.4          0.004     Yes **
Ours vs. Longformer               +0.07     [0.05, 0.09]   24.1         <0.001    Yes ***
Ours vs. Legal-BERT               +0.09     [0.06, 0.12]   31.5         <0.001    Yes ***

*** p < 0.001, ** p < 0.01, * p < 0.05 (Bonferroni-corrected for 4 comparisons, α = 0.0125)

Effect Size (Cohen's h)

Cohen's h for proportions:

h = 2·arcsin(√p₁) - 2·arcsin(√p₂)

Interpretation:
  |h| < 0.2: small effect
  |h| ≈ 0.5: medium effect
  |h| > 0.8: large effect

Our improvement vs baseline:
  h = 0.62 (medium-large)
                

Multiple Testing Correction

Method             Bonferroni
# Comparisons      4
α (original)       0.05
α (corrected)      0.0125
All significant?   Yes ✓

Variance Analysis

Cross-val folds   5
AUROC mean        0.798
AUROC std         ±0.012
Seeds tested      3
Seed variance     ±0.008

13 Computational Requirements

~4 h    Training time        Single A100 GPU
~3 s    Inference per case   ~20 cases/minute
<$30    Total cost           Complete pipeline

Component             Time         Hardware        Memory      Est. Cost
Data preprocessing    ~30 min      CPU (8 cores)   8GB RAM     $0
Citation extraction   ~10 min      CPU             4GB RAM     $0
GraphSAGE training    ~30 min      T4 GPU          12GB VRAM   ~$1
QLoRA fine-tuning     ~4 hours     A100 80GB       35GB VRAM   ~$15
Evaluation            ~15 min      A100            20GB VRAM   ~$1
Total                 ~5.5 hours                               ~$17-30

Cloud Platform

All experiments run on Modal Labs with A100 GPUs at ~$3.50/hour. Code designed for serverless execution with automatic scaling.

Reproducibility Cost

Total compute cost under $30 makes this research accessible to academic labs and independent researchers without institutional GPU clusters.

14 Reproducibility Checklist

Following EMNLP reproducibility guidelines, we provide comprehensive documentation for result replication.

Code & Data Availability

Source Code github.com/[repo]
License MIT
SCDB Data scdb.wustl.edu
Case Text API courtlistener.com
Trained Models HuggingFace Hub
Processed Data Zenodo archive

Environment Specification

# Key dependencies (requirements.txt)
torch==2.1.0
transformers==4.36.0
peft==0.7.0
bitsandbytes==0.41.3
torch-geometric==2.4.0
sentence-transformers==2.2.2
neo4j==5.14.0
scikit-learn==1.3.2
modal==0.56.0

# Python version
python==3.10.12

# CUDA version
cuda==12.1
                

Random Seeds & Determinism

  • SEED = 42 (all experiments)
  • torch.backends.cudnn.deterministic = True (CUDA determinism)
  • Data splits saved to data/splits/*.json

EMNLP Checklist Items

  • Hyperparameters documented
  • Training/evaluation code provided
  • Data preprocessing scripts included
  • Model checkpoints available
  • Statistical significance tests
  • Compute requirements stated
  • Variance across runs reported
  • License specified

Running Experiments

# Clone repository
git clone https://github.com/[repo]
cd caselaw-graph-ring

# Install dependencies
pip install -r requirements.txt

# Download data
python scripts/download_data.py

# Train GraphSAGE
python -m src.graph.train

# Fine-tune LLM (Modal)
modal run src/model/train.py

# Evaluate
python -m src.model.evaluate