Methodology

Comprehensive technical details of the LegalGPT graph-augmented legal prediction system

Keywords: GraphSAGE, QLoRA, RAG, Mistral-7B
Luis Sanchez
UC Berkeley
luisanchez@berkeley.edu
Shubhankar Tripathy
Stanford PhD, OpenAI
stripathy@umass.edu

1 System Architecture Overview

LegalGPT implements a three-stage pipeline combining graph neural networks for precedent retrieval with large language models for outcome prediction. The architecture is inspired by retrieval-augmented generation (RAG) frameworks (Lewis et al., 2020) but incorporates citation graph structure as a first-class signal.

LegalGPT System Architecture (pipeline overview):

INPUT: case text + metadata
    ↓
STAGE 1: Graph Retrieval
    Citation graph (Neo4j): 10K nodes, 150K edges
    GraphSAGE: 2-layer encoder, 128-dim output
    Hybrid retriever: k=5 precedents (scoring weights in Section 7)
    → top-k precedents with relevance scores
    ↓
STAGE 2: Context Assembly
    [INST] System prompt + Query case + Precedents [/INST]
    Query: ~5K tokens | Precedents: ~15K tokens (5×3K) | Total: ~20K tokens
    ↓
STAGE 3: LLM Prediction
    Mistral-7B-Instruct v0.3 (frozen weights)
    QLoRA adapters: r=16, α=32, ~7M trainable params (0.1%)
    Classification head: Linear 4096→2, softmax
    ↓
OUTPUT: P(petitioner) | P(respondent)

Stage 1: Graph Retrieval

GraphSAGE learns node embeddings that capture both semantic content and citation structure. Hybrid scoring combines embedding similarity with graph proximity.

Stage 2: Context Assembly

Retrieved precedents are formatted with metadata (date, outcome, relevance score) into a structured prompt following Mistral's instruction format.

Stage 3: LLM Prediction

QLoRA fine-tuning adapts the frozen Mistral-7B model with only 0.1% trainable parameters, enabling efficient domain adaptation.

2 Task Formulation

Formal Definition

Legal Outcome Prediction Task:

Input:  C = (T, M, G)
  where:
    T = case text (opinion, arguments)
    M = metadata (date, court, parties)
    G = citation subgraph context

Output: y ∈ {petitioner, respondent}

Objective: Learn f: (T, M, G) → y
  that maximizes P(y | T, M, G)
                

Label Definition (SCDB)

Label            Definition                   Frequency
petitioner (1)   Party bringing appeal wins   57%
respondent (0)   Party responding wins        43%

Labels derived from SCDB partyWinning variable. Cases with unclear outcomes (remands, mixed decisions) excluded.

Why This Formulation?

Design Choices

  • Binary classification: Simplifies evaluation; multi-class (unanimous, split, remand) is future work
  • Case-level prediction: Predicts overall winner, not issue-by-issue outcomes
  • Post-hoc prediction: Uses full opinion text (retrospective analysis, not pre-decision forecasting)

Comparison to Prior Work

  • Katz et al. (2017): Used pre-argument features only (true forecasting)
  • Chalkidis et al. (2019): ECHR violation prediction (similar setup)
  • Ours: Full text + citation context (maximum information)

3 GraphSAGE Embeddings

We employ GraphSAGE (Hamilton et al., 2017) to learn inductive node representations that capture both textual semantics and citation graph structure. Unlike transductive methods (e.g., DeepWalk, Node2Vec), GraphSAGE can embed unseen nodes at inference time.


Architecture Configuration

Parameter           Value   Justification
Input Dimensions    385     384-dim all-MiniLM-L6-v2 text + 1 temporal
Hidden Dimensions   256     2× compression
Output Dimensions   128     Retrieval efficiency
Number of Layers    2       2-hop neighborhood
Aggregator          MEAN    Permutation invariant
Activation          ReLU    Standard choice
Dropout             0.3     Regularization
Normalization       L2      Unit sphere

Message Passing Formulation

GraphSAGE Layer (Hamilton et al., 2017):

AGGREGATE:
  a_v^(k) = MEAN({h_u^(k-1) : u ∈ N(v)})

COMBINE:
  h_v^(k) = σ(W^(k) · CONCAT(h_v^(k-1), a_v^(k)))

With L2 normalization:
  h_v^(k) = h_v^(k) / ||h_v^(k)||₂

Where:
  h_v^(0) = x_v (initial node features)
  N(v) = {u : (u,v) ∈ E} (cited cases)
  σ = ReLU activation
  W^(k) ∈ ℝ^{d_k × 2d_{k-1}}
                

The MEAN aggregator provides permutation invariance over neighborhoods. L2 normalization ensures embeddings lie on the unit hypersphere for cosine similarity retrieval.

Node Feature Initialization

Component          Dim       Source
Text Embedding     384-dim   sentence-transformers/all-MiniLM-L6-v2
Temporal Feature   1-dim     Normalized year: (year - 1946) / 77
Total Input        385-dim   Concatenated features

Neighborhood Sampling

Layer 1 neighbors   25
Layer 2 neighbors   10
Total sampled       ≤ 275 nodes/case (25 + 25×10)

Neighbors are sampled uniformly at random, with the cap limiting memory during training.
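
A minimal sketch of this sampling with PyTorch Geometric's NeighborLoader; the toy graph and batch size are illustrative choices, not values from our pipeline:

import torch
from torch_geometric.data import Data
from torch_geometric.loader import NeighborLoader

# Toy stand-in for the citation graph (385-dim node features).
data = Data(
    x=torch.randn(1000, 385),
    edge_index=torch.randint(0, 1000, (2, 5000)),
)

# 25 first-hop and 10 second-hop neighbors per seed node, sampled uniformly,
# capping each seed's computation graph at 25 + 25×10 = 275 nodes.
loader = NeighborLoader(data, num_neighbors=[25, 10], batch_size=64, shuffle=True)
batch = next(iter(loader))  # mini-batch subgraph for the GraphSAGE forward pass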

PyTorch Geometric Implementation

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv

class LegalGraphSAGE(nn.Module):
    def __init__(self, in_dim=385, hidden_dim=256, out_dim=128, dropout=0.3):
        super().__init__()
        # SAGEConv defaults to the MEAN aggregator; normalize=True applies L2 per layer
        self.conv1 = SAGEConv(in_dim, hidden_dim, normalize=True)
        self.conv2 = SAGEConv(hidden_dim, out_dim, normalize=True)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, edge_index):
        # Layer 1: input → hidden
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = self.dropout(x)

        # Layer 2: hidden → output
        x = self.conv2(x, edge_index)

        # Final L2 normalization for cosine-similarity retrieval
        x = F.normalize(x, p=2, dim=1)
        return x
        

4 QLoRA Fine-tuning

We employ QLoRA (Dettmers et al., 2023) to efficiently fine-tune Mistral-7B-Instruct for legal outcome prediction. QLoRA combines 4-bit quantization with Low-Rank Adaptation (LoRA; Hu et al., 2022), reducing memory requirements by ~4× while maintaining full fine-tuning performance.

Base Model: Mistral-7B-Instruct-v0.3

Attribute          Value
Parameters         7.24B
Architecture       Transformer decoder
Context Length     32,768 tokens
Hidden Dimension   4,096
Attention Heads    32
Layers             32
Vocabulary         32,000
License            Apache 2.0

QLoRA Configuration

Parameter       Value        Notes
Quantization    4-bit NF4    NormalFloat
Double Quant    True         Quantize constants
Compute dtype   bfloat16     Mixed precision
LoRA Rank (r)   16           Low-rank dim
LoRA Alpha (α)  32           Scaling = α/r
LoRA Dropout    0.05         Regularization
Trainable       ~7M (0.1%)   Of 7.24B total

LoRA Mathematical Formulation

Low-Rank Adaptation (Hu et al., 2022):

Original: h = W₀x
LoRA:     h = W₀x + ΔWx
              = W₀x + BAx

Where:
  W₀ ∈ ℝ^{d×k} (frozen pretrained)
  B ∈ ℝ^{d×r} (trainable)
  A ∈ ℝ^{r×k} (trainable)
  r << min(d, k) (low rank)

Scaling: ΔW = (α/r) · BA
  with α = 32, r = 16 → scale = 2
                

Target Modules

Attention: q_proj, k_proj, v_proj, o_proj
MLP: gate_proj, up_proj, down_proj

We apply LoRA to all linear layers in both attention and MLP blocks, following the recommendation of Dettmers et al. (2023) for QLoRA.

Memory savings: 4-bit quantization reduces model memory from ~14GB to ~4GB, enabling training on single A100-40GB.
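
This configuration maps directly onto the HuggingFace peft and bitsandbytes APIs. A minimal sketch (illustrative, not our exact training script; the model ID and num_labels follow the setup above):

import torch
from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NormalFloat quantization
    bnb_4bit_use_double_quant=True,      # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForSequenceClassification.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    num_labels=2,
    quantization_config=bnb_config,
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="SEQ_CLS",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # ~7M trainable of 7.24B total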

Training Hyperparameters

Learning Rate           2e-4
LR Scheduler            Cosine with warmup
Warmup Ratio            0.1 (10%)
Batch Size              4
Gradient Accumulation   4 steps
Effective Batch Size    16
Epochs                  3
Max Sequence Length     4,096 tokens
Optimizer               AdamW (8-bit)
Weight Decay            0.01
Gradient Clipping       1.0

Classification Head

Sequence Classification Architecture:

1. Extract last token hidden state:
   h_last = LLM(input_ids)[:, -1, :]
   h_last ∈ ℝ^{batch × 4096}

2. Linear projection:
   logits = W_cls · h_last + b_cls
   W_cls ∈ ℝ^{2 × 4096}
   logits ∈ ℝ^{batch × 2}

3. Softmax probabilities:
   P(y|x) = softmax(logits)
                

We use the last token representation following standard practice for causal LM classification (Radford et al., 2019).

5 Loss Functions

Training objectives for the classification task and GraphSAGE link prediction, with regularization techniques for improved generalization.

Cross-Entropy Loss

Standard Cross-Entropy:

ℒ_CE = -∑_{i=1}^{N} ∑_{c=1}^{C} y_{i,c} · log(p_{i,c})

For binary classification (C=2):

ℒ_CE = -1/N ∑_{i=1}^{N} [y_i·log(p_i) + (1-y_i)·log(1-p_i)]

Where:
  N = number of samples
  C = number of classes (2)
  y_{i,c} = ground truth (one-hot)
  p_{i,c} = predicted probability
                

Cross-entropy measures the divergence between predicted and true distributions. Minimizing CE is equivalent to maximum likelihood estimation (Goodfellow et al., 2016).

Label Smoothing Regularization

Label Smoothing (Szegedy et al., 2016):

Smoothed targets:
  y'_{i,c} = y_{i,c}·(1-ε) + ε/C

With ε = 0.1:
  Hard: [1, 0] → Soft: [0.95, 0.05]
  Hard: [0, 1] → Soft: [0.05, 0.95]

Equivalent loss:
  ℒ_LS = (1-ε)·ℒ_CE(y,p) + ε·H(u,p)

Where H(u,p) is CE with uniform dist.
                

Label smoothing prevents overconfident predictions, improving calibration and generalization (Müller et al., 2019).
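
In PyTorch, label smoothing with ε = 0.1 is built into the loss; a quick sketch:

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

logits = torch.tensor([[2.0, -1.0]])  # model output for one case (illustrative values)
target = torch.tensor([0])            # gold label: petitioner
loss = criterion(logits, target)      # cross-entropy against softened [0.95, 0.05] targets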

Link Prediction Loss (GraphSAGE)

Contrastive Link Prediction Objective:

ℒ_link = -∑_{(u,v)∈E} [ log(σ(z_u^T · z_v)) + Q · 𝔼_{v_n∼P_n} log(σ(-z_u^T · z_{v_n})) ]

Simplified binary cross-entropy form:

ℒ_link = -1/|E| ∑_{(u,v)∈E} [log(σ(z_u^T·z_v)) + ∑_{j=1}^{Q} log(σ(-z_u^T·z_{v_j}^-))]

Where:
  E = observed citation edges
  z_u, z_v = learned node embeddings (128-dim)
  σ = sigmoid function
  Q = negative samples per positive (Q=5)
  P_n(v) ∝ degree(v)^0.75 (negative distribution)
  v_j^- = sampled negative node
            
  • Positive term: maximize similarity for cited pairs
  • Negative term: minimize similarity for non-cited pairs
  • Contrastive effect: learn discriminative embeddings
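
A minimal PyTorch sketch of this objective over precomputed embeddings; tensor names and shapes are illustrative:

import torch
import torch.nn.functional as F

def link_prediction_loss(z, pos_edge_index, neg_edge_index):
    # z: (N, 128) L2-normalized node embeddings
    # pos_edge_index: (2, E) observed citation edges
    # neg_edge_index: (2, Q·E) sampled negatives, Q per positive
    pos = (z[pos_edge_index[0]] * z[pos_edge_index[1]]).sum(dim=-1)  # z_u · z_v
    neg = (z[neg_edge_index[0]] * z[neg_edge_index[1]]).sum(dim=-1)
    # BCE-with-logits gives -log σ(s) for positives and -log σ(-s) for negatives
    loss = (F.binary_cross_entropy_with_logits(pos, torch.ones_like(pos))
            + F.binary_cross_entropy_with_logits(neg, torch.zeros_like(neg)))
    return loss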

Total Training Objective

Multi-task Loss:

ℒ_total = ℒ_classification + λ·ℒ_link

Where:
  ℒ_classification: QLoRA fine-tuning loss
  ℒ_link: GraphSAGE link prediction loss
  λ = 0.1 (balancing coefficient)

Training schedule:
  1. Pre-train GraphSAGE (link prediction)
  2. Freeze GraphSAGE, train QLoRA
  3. Optional: joint fine-tuning (λ > 0)
                

Regularization Summary

Technique         Value   Effect
Label Smoothing   ε=0.1   Calibration
Weight Decay      0.01    L2 penalty
Dropout (LoRA)    0.05    Adaptation
Dropout (GNN)     0.3     Graph layers
Gradient Clip     1.0     Stability

6 Negative Sampling Strategy

Effective negative sampling is crucial for learning discriminative graph embeddings. We employ degree-biased sampling with hard negative mining following best practices from knowledge graph embedding literature.

Sampling Distribution

Negative Sampling Distribution:

P_n(v) ∝ degree(v)^α

With α = 0.75 (Mikolov et al., 2013):
  - α = 0: uniform sampling
  - α = 1: degree-proportional
  - α = 0.75: smoothed (empirically optimal)

Effect: Reduces sampling of rare nodes,
focuses on distinguishing similar cases.
                
Positive edges (train)   ~120,000
Negative ratio (Q)       5
Sampling exponent (α)    0.75
Training pairs/epoch     ~720,000 (~120K × (1 + Q))
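
A sketch of the degree-biased sampler; the function name and isolated-node behavior (zero probability) are our illustrative choices:

import torch

def degree_biased_negatives(edge_index, num_nodes, num_samples, alpha=0.75):
    # P_n(v) ∝ degree(v)^0.75: smoothed degree-proportional sampling
    deg = torch.bincount(edge_index.reshape(-1), minlength=num_nodes).float()
    probs = deg.pow(alpha)        # isolated nodes get zero sampling probability
    probs = probs / probs.sum()
    return torch.multinomial(probs, num_samples, replacement=True)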

Hard Negative Mining

Hard negatives: Cases that are structurally or semantically close but not directly cited.

Hard negative criteria:
1. 2-hop neighbors (cited by same case)
2. Same legal issue area (SCDB code)
3. Temporal proximity (±5 years)
4. High text similarity (>0.7 cosine)
                    

Curriculum Schedule

Epochs 1-2   100% random             Easy start
Epochs 3-4   80% random + 20% hard   Gradual
Epochs 5+    60% random + 40% hard   Full difficulty

Temporal Constraints

Citation temporal constraint:

For edge (u, v) where u cites v:
  date(v) < date(u)  # v must precede u

This ensures:
  - No future citations (anachronistic)
  - Realistic precedent relationships
  - Proper train/test temporal split
                    

Implications for Negative Sampling

  • Negatives must also respect temporal ordering
  • Cannot sample future cases as negatives for historical ones
  • Prevents temporal data leakage during training

7 Hybrid Retrieval System

Our retrieval system combines dense embedding similarity with sparse citation graph structure, following the hybrid retrieval paradigm (Karpukhin et al., 2020; Ma et al., 2021).


Hybrid Scoring Function

Score(q, d) = α · S_embed(q, d) + β · S_citation(q, d) + γ · S_text(q, d)

  α = 0.40   S_embed      GraphSAGE similarity (learned structural + semantic)
  β = 0.35   S_citation   Citation proximity (graph distance signal)
  γ = 0.25   S_text       BM25 lexical matching

S_embed: Embedding Similarity

Cosine similarity of GraphSAGE embeddings:

S_embed(q, d) = cos(z_q, z_d)
              = (z_q · z_d) / (||z_q|| · ||z_d||)

Since embeddings are L2-normalized:
S_embed(q, d) = z_q · z_d  (dot product)

Range: [-1, 1] → normalized to [0, 1]
                

S_citation: Graph Proximity

Inverse shortest path distance:

S_citation(q, d) = 1 / (1 + dist(q, d))

Where dist(q, d) is shortest path
in the citation graph.

Special cases:
  - Direct citation: dist=1 → S=0.5
  - 2-hop: dist=2 → S=0.33
  - Unreachable: dist=∞ → S=0
                

S_text: BM25 Similarity

BM25 (Robertson et al., 1995):

S_text(q, d) = ∑_{t∈q} IDF(t) ·
  (f(t,d)·(k₁+1)) /
  (f(t,d) + k₁·(1-b+b·|d|/avgdl))

Parameters:
  k₁ = 1.2 (term frequency saturation)
  b = 0.75 (length normalization)
                

Retrieval Algorithm

ALPHA, BETA, GAMMA = 0.40, 0.35, 0.25  # hybrid weights from the validation grid search

def retrieve_precedents(query_case, k=5):
    # Stage 1: candidate generation (fast, high recall); set union deduplicates
    candidates = set()
    candidates |= set(ann_search(query_case.embedding, n=100))   # dense ANN search
    candidates |= set(citation_neighbors(query_case, hops=2))    # graph neighbors
    candidates |= set(bm25_search(query_case.text, n=50))        # lexical matches

    # Stage 2: hybrid re-ranking with the three-signal score
    scored = []
    for doc in candidates:
        s = (ALPHA * embed_sim(query_case, doc)
             + BETA * citation_proximity(query_case, doc)
             + GAMMA * bm25_score(query_case, doc))
        scored.append((s, doc))

    # Stage 3: return the top-k documents by score
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]

Weight Ablation Results

Configuration           AUROC   Δ
α=1.0 (embed only)      0.76    -0.04
β=1.0 (citation only)   0.77    -0.03
γ=1.0 (BM25 only)       0.74    -0.06
α=0.5, β=0.5            0.78    -0.02
α=0.4, β=0.35, γ=0.25   0.80    (best)

Optimal weights found via grid search on validation set. Three-signal combination outperforms any single signal.

8 Embedding Fusion Architecture

Multi-modal representation combining semantic text features with structural graph information through learned fusion layers.

Embedding Space Visualization

2D PCA projection showing case embeddings clustered by outcome. Query case finds nearest neighbors for retrieval.

Fusion Architecture Diagram

Text branch:   Case text (full opinion)
                 → Sentence-BERT (all-MiniLM-L6-v2, frozen)
                 → e_text (384-dim)

Graph branch:  Citation graph (node + edges)
                 → GraphSAGE (2-layer GNN, trained)
                 → e_graph (128-dim)

Fusion:        Concatenation [e_text; e_graph] (512-dim)
                 → Fusion MLP (512 → 256 → 128, ReLU + Dropout)
                 → e_fused (128-dim, L2 normalized)

Concatenation Fusion

Simple Concatenation:

e_concat = [e_text; e_graph]
         = [384-dim; 128-dim]
         = 512-dim

Fusion MLP:
h1 = ReLU(W1 · e_concat + b1)  # 512→256
h1 = Dropout(h1, p=0.2)
h2 = ReLU(W2 · h1 + b2)        # 256→128
e_fused = L2_normalize(h2)

Where:
  W1 ∈ ℝ^{256×512}, b1 ∈ ℝ^256
  W2 ∈ ℝ^{128×256}, b2 ∈ ℝ^128
                

Concatenation preserves all information from both modalities. The MLP learns non-linear interactions (Baltrusaitis et al., 2019).
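
A minimal PyTorch sketch of the fusion MLP with the dimensions above (class name illustrative):

import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionMLP(nn.Module):
    def __init__(self, text_dim=384, graph_dim=128, hidden_dim=256, out_dim=128, dropout=0.2):
        super().__init__()
        self.fc1 = nn.Linear(text_dim + graph_dim, hidden_dim)  # 512 → 256
        self.fc2 = nn.Linear(hidden_dim, out_dim)               # 256 → 128
        self.dropout = nn.Dropout(dropout)

    def forward(self, e_text, e_graph):
        x = torch.cat([e_text, e_graph], dim=-1)   # concatenation fusion (512-dim)
        x = self.dropout(F.relu(self.fc1(x)))
        x = F.relu(self.fc2(x))
        return F.normalize(x, p=2, dim=-1)         # project onto the unit hypersphere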

Alternative: Gated Fusion

Gated Fusion (Arevalo et al., 2017):

g = σ(W_g · [e_text; e_graph] + b_g)
e_fused = g ⊙ tanh(W_t·e_text) +
          (1-g) ⊙ tanh(W_s·e_graph)

Where:
  g ∈ ℝ^d: learned gating vector
  W_t, W_s: per-modality projections
  σ: sigmoid activation
  ⊙: element-wise multiplication

Advantage: Adaptive weighting per
dimension based on input content.
                

Gated fusion allows the model to dynamically weight each modality. We found concatenation performs comparably with simpler implementation.

Fusion Ablation Results

Configuration   AUROC
Text Only       0.74
Graph Only      0.76
Concatenation   0.79
Concat + MLP    0.80

Fusion provides +0.06 AUROC over the text-only baseline and +0.04 over the graph-only baseline.

9 Prompt Engineering

The prompt template structures retrieved precedents with metadata to enable effective in-context learning. We follow Mistral's instruction format with careful attention to token budget management.

Full Prompt Template

[INST] You are a legal expert specializing in U.S. Supreme Court case analysis.
Your task is to predict the outcome of a case based on its content and relevant precedents.

## Case to Analyze
Case Name: {case_name}
Docket Number: {docket}
Decision Date: {date}
Legal Issue Area: {issue_area}

Case Text (Opinion):
{case_text_truncated}

## Relevant Precedents
The following cases have been identified as relevant based on citation patterns and semantic similarity:

{for i, precedent in enumerate(retrieved_cases, 1)}
### Precedent {i}: {precedent.name} ({precedent.year})
Relevance Score: {precedent.score:.2f}
Outcome: {precedent.outcome}
Citation Distance: {precedent.citation_distance} hops

Key Excerpt:
{precedent.text_excerpt}

{endfor}

## Task
Based on the case text and the patterns observed in relevant precedents, predict whether
the PETITIONER (party bringing the appeal) or RESPONDENT (party responding) will win.

Consider:
1. How the legal issues align with precedent outcomes
2. The strength of citation relationships
3. The evolution of legal doctrine over time

Prediction: [/INST]
        

Token Budget Management

Component         Tokens    % of prompt
System prompt     ~200      1%
Query case text   ~5,000    25%
Precedent 1       ~3,000    15%
Precedent 2       ~3,000    15%
Precedent 3       ~3,000    15%
Precedent 4       ~3,000    15%
Precedent 5       ~3,000    15%
Total             ~20,200   62% of the 32K context

Remaining 38% of 32K context reserved for model generation and safety margin.

Truncation Strategy

Hierarchical truncation:

1. Case text: truncate to 5000 tokens
   - Keep first 2500 (background)
   - Keep last 2500 (holding/conclusion)

2. Precedent excerpts: 3000 tokens each
   - Prioritize holding sections
   - Include key cited passages

3. If over budget:
   - Reduce precedent count (k=5→4→3)
   - Further truncate excerpts
                

We preserve case beginnings and endings which typically contain facts and holdings respectively.
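
A sketch of the head-and-tail truncation step, assuming the case has already been tokenized to a list of token IDs (helper name illustrative):

def truncate_head_tail(token_ids, max_tokens=5000):
    # Keep the opening (facts/background) and closing (holding/conclusion),
    # dropping the middle when a case exceeds its token budget.
    if len(token_ids) <= max_tokens:
        return token_ids
    half = max_tokens // 2
    return token_ids[:half] + token_ids[-half:]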

Prompt Engineering Ablations

Variant                  AUROC
No system prompt         0.77
No precedent metadata    0.78
Random precedent order   0.79
No relevance scores      0.79
Full template            0.80

Key Findings

  • System prompt improves consistency (+0.03 AUROC)
  • Precedent metadata provides useful signal (+0.02)
  • Ordering precedents by relevance score marginally helps (+0.01)
  • Explicit relevance scores aid attention (+0.01)

10 Attention Analysis & Interpretability

Understanding which precedents and text spans the model attends to provides interpretability for legal practitioners and validates that the model learns meaningful legal reasoning patterns.

Attention Extraction Method

Multi-Head Attention Aggregation:

For layer L, head h:
  A^{(L,h)} = softmax(QK^T / √d_k)

Attention to precedent p:
  α_p = (1/H) ∑_{h=1}^{H} ∑_{t∈T_p} A^{(L,h)}_{cls,t}

Where:
  H = 32 attention heads
  T_p = token positions for precedent p
  cls = classification token (last)
  L = 32 (final layer)
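
A sketch of this aggregation with HuggingFace's output_attentions; we assume the token span of each precedent is recorded during prompt assembly (the span bookkeeping shown is illustrative):

import torch

@torch.no_grad()
def precedent_attention(model, input_ids, precedent_spans):
    # precedent_spans: {name: (start, end)} token positions from prompt assembly
    out = model(input_ids, output_attentions=True)
    A = out.attentions[-1][0]            # final layer, first batch item: (heads, seq, seq)
    from_last = A[:, -1, :].mean(dim=0)  # attention from the last (classification) token,
                                         # averaged over the 32 heads
    return {name: from_last[s:e].sum().item()
            for name, (s, e) in precedent_spans.items()}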
                

Attribution Methods

Method                 Formula
Attention Rollout      Ā = ∏_{l=1}^{L} (0.5·I + 0.5·A^{(l)})
Gradient × Input       attr_i = |∂ℒ/∂x_i · x_i|
Integrated Gradients   IG_i = x_i · ∫_{α=0}^{1} ∂F/∂x_i dα
SHAP Values            φ_i = ∑_S |S|!(n-|S|-1)!/n! · Δ_i
Leave-One-Out          Δ_p = P(y|all) - P(y|all\{p})

Example: Attention Distribution for Obergefell v. Hodges (2015)

Query Case: Obergefell v. Hodges - Same-sex marriage constitutional right

Precedent Attention Weights (normalized):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Loving v. Virginia (1967)        ████████████████████████████████ 0.31
United States v. Windsor (2013)  ██████████████████████████████   0.28
Lawrence v. Texas (2003)         ██████████████████████           0.21
Romer v. Evans (1996)            █████████████                    0.12
Griswold v. Connecticut (1965)   ████████                         0.08
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Key Attended Phrases:
• "fundamental right to marry" (Loving)           → 0.42 local attention
• "equal dignity in the eyes of the law" (Windsor) → 0.38 local attention
• "liberty protects the person" (Lawrence)         → 0.35 local attention

Model Prediction: PETITIONER (confidence: 0.87)
Actual Outcome:   PETITIONER ✓
        

Observed Attention Patterns

Pattern              Effect
Recency bias         1.4× for recent precedents
Outcome alignment    1.8× for matching outcomes
Citation distance    2.1× for direct citations
Issue area overlap   1.6× for same issue area
Relevance score      r = 0.72 correlation

Faithfulness Evaluation

Metric                   Value      Interpretation
Attention-Prediction r   0.67       Strong
LOO Consistency          83%        High
ROAR@10%                 -12% acc   Meaningful
Human Agreement κ        0.54       Moderate

11 Model Calibration

Well-calibrated probability estimates are essential for legal applications where practitioners need to assess prediction reliability. We apply temperature scaling (Guo et al., 2017) for post-hoc calibration.

Expected Calibration Error (ECE)

ECE (Naeini et al., 2015):

ECE = ∑_{m=1}^{M} (|B_m|/n) · |acc(B_m) - conf(B_m)|

Where:
  M = 15 equal-width bins
  B_m = samples in confidence bin m
  acc(B_m) = accuracy in bin m
  conf(B_m) = mean confidence in bin m

Interpretation:
  ECE = 0.00: perfect calibration
  ECE < 0.05: well-calibrated
  ECE > 0.15: poorly calibrated
                

Temperature Scaling

Post-hoc calibration (Guo et al., 2017):

p_calibrated = softmax(z / T)

Where:
  z = logits from model
  T = temperature parameter

Optimization:
  T* = argmin_T NLL(y, softmax(z/T))

Effects:
  T > 1: soften predictions (less confident)
  T < 1: sharpen predictions (more confident)
  T = 1: no change
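
Fitting T* by minimizing NLL on held-out validation logits takes a few lines in PyTorch; a minimal sketch:

import torch
import torch.nn as nn

def fit_temperature(logits, labels):
    # logits: (N, 2) validation-set logits; labels: (N,) gold classes
    T = nn.Parameter(torch.ones(1))
    nll = nn.CrossEntropyLoss()
    optimizer = torch.optim.LBFGS([T], lr=0.1, max_iter=50)

    def closure():
        optimizer.zero_grad()
        loss = nll(logits / T, labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return T.item()  # apply as softmax(z / T) at inference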
                

Reliability Diagram

Accuracy vs Confidence (15 bins):

1.0 │                              ╭── Perfect
    │                        ●   ╱    calibration
0.8 │                   ●  ●   ╱
    │              ●  ●     ╱    ● After T-scaling
    │         ○  ●       ╱       ○ Before
0.6 │      ○  ●  ○     ╱
    │   ○  ●  ○     ╱
0.4 │○  ●        ╱
    │          ╱
0.2 │        ╱
    │      ╱
0.0 ├──────┼──────┼──────┼──────┤
   0.0   0.25   0.5   0.75   1.0
              Confidence
                

Calibration Metrics

ECE (before)            0.127
ECE (after T-scaling)   0.034
Optimal Temperature     T* = 1.42
MCE (Max Cal. Error)    0.089
Brier Score             0.168
NLL (calibrated)        0.412

Confidence Distribution

Confidence bin   Share of test cases
0.5-0.6          15%
0.6-0.7          22%
0.7-0.8          31%
0.8-0.9          24%
0.9-1.0          8%

Practical Implications

Well-Calibrated for Legal Use

After temperature scaling, ECE of 0.034 means when the model predicts 70% confidence, it's correct approximately 70% of the time. This reliability is crucial for:

  • Risk assessment by legal practitioners
  • Identifying cases needing human review
  • Setting decision thresholds for different use cases

12 Statistical Significance Testing

Rigorous statistical methods for comparing model performance and establishing significance of improvements over baselines.

McNemar's Test

McNemar's Test (Dietterich, 1998):

Contingency table for paired predictions:

              Model B
            Correct  Wrong
Model A  ┌─────────┬─────────┐
Correct  │    a    │    b    │
Wrong    │    c    │    d    │
         └─────────┴─────────┘

Test statistic:
  χ² = (|b - c| - 1)² / (b + c)

H₀: P(b) = P(c) (models equivalent)
H₁: P(b) ≠ P(c) (models differ)

Reject H₀ if χ² > 3.84 (α = 0.05)
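
A direct implementation of the continuity-corrected statistic; the counts in the usage line are illustrative, not our results:

from scipy.stats import chi2

def mcnemar(b, c):
    # b: cases model A got right and model B got wrong; c: the reverse.
    # Continuity-corrected chi-squared statistic with 1 degree of freedom.
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    p_value = chi2.sf(stat, df=1)
    return stat, p_value

stat, p = mcnemar(b=61, c=25)  # illustrative counts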
                

Bootstrap Confidence Intervals

BCa Bootstrap (Efron & Tibshirani, 1993):

for b in 1..B:  # B = 10,000
    D_b = sample(D_test, n, replace=True)
    θ_b = metric(model, D_b)

Percentile CI:
  CI_95% = [θ_{(0.025·B)}, θ_{(0.975·B)}]

BCa Correction:
  z₀ = Φ⁻¹(#{θ_b < θ̂} / B)
  a = ∑(θ̄ - θ_i)³ / (6·(∑(θ̄ - θ_i)²)^1.5)

Adjusted percentiles account for bias
and skewness in bootstrap distribution.
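
A sketch of the resampling loop in its plain percentile form; the BCa correction above is applied on top of these draws in practice:

import numpy as np

def bootstrap_ci(y_true, y_pred, metric, B=10_000, seed=42):
    # y_true, y_pred: NumPy arrays over the test set; metric: e.g. AUROC
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)          # resample with replacement
        stats[b] = metric(y_true[idx], y_pred[idx])
    return np.percentile(stats, [2.5, 97.5])      # percentile 95% CI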
                

Statistical Comparison Results

Comparison                        AUROC Δ   95% CI         McNemar χ²   p-value   Significant?
Ours vs. Mistral (no retrieval)   +0.06     [0.04, 0.08]   18.7         <0.001    Yes ***
Ours vs. BM25 Retrieval           +0.03     [0.01, 0.05]   8.4          0.004     Yes **
Ours vs. Longformer               +0.07     [0.05, 0.09]   24.1         <0.001    Yes ***
Ours vs. Legal-BERT               +0.09     [0.06, 0.12]   31.5         <0.001    Yes ***

*** p < 0.001, ** p < 0.01, * p < 0.05 (Bonferroni-corrected for 4 comparisons, α = 0.0125)

Effect Size (Cohen's h)

Cohen's h for proportions:

h = 2·arcsin(√p₁) - 2·arcsin(√p₂)

Interpretation:
  |h| < 0.2: small effect
  |h| ≈ 0.5: medium effect
  |h| > 0.8: large effect

Our improvement vs baseline:
  h = 0.62 (medium-large)
                

Multiple Testing Correction

Method             Bonferroni
# Comparisons      4
α (original)       0.05
α (corrected)      0.0125
All significant?   Yes ✓

Variance Analysis

Cross-val folds   5
AUROC mean        0.798
AUROC std         ±0.012
Seeds tested      3
Seed variance     ±0.008

13 Computational Requirements

~4 h    Training time        Single A100 GPU
~3 s    Inference per case   ~20 cases/minute
<$30    Total cost           Complete pipeline

Component             Time         Hardware        Memory      Est. Cost
Data preprocessing    ~30 min      CPU (8 cores)   8GB RAM     $0
Citation extraction   ~10 min      CPU             4GB RAM     $0
GraphSAGE training    ~30 min      T4 GPU          12GB VRAM   ~$1
QLoRA fine-tuning     ~4 hours     A100 80GB       35GB VRAM   ~$15
Evaluation            ~15 min      A100            20GB VRAM   ~$1
Total                 ~5.5 hours                               ~$17-30

Cloud Platform

All experiments run on Modal Labs with A100 GPUs at ~$3.50/hour. Code designed for serverless execution with automatic scaling.

Reproducibility Cost

Total compute cost under $30 makes this research accessible to academic labs and independent researchers without institutional GPU clusters.

14 Reproducibility Checklist

Following EMNLP reproducibility guidelines, we provide comprehensive documentation for result replication.

Code & Data Availability

Source Code github.com/[repo]
License MIT
SCDB Data scdb.wustl.edu
Case Text API courtlistener.com
Trained Models HuggingFace Hub
Processed Data Zenodo archive

Environment Specification

# Key dependencies (requirements.txt)
torch==2.1.0
transformers==4.36.0
peft==0.7.0
bitsandbytes==0.41.3
torch-geometric==2.4.0
sentence-transformers==2.2.2
neo4j==5.14.0
scikit-learn==1.3.2
modal==0.56.0

# Python version
python==3.10.12

# CUDA version
cuda==12.1
                

Random Seeds & Determinism

  • SEED = 42 (all experiments)
  • torch.backends.cudnn.deterministic = True (CUDA determinism)
  • Data splits saved to data/splits/*.json

EMNLP Checklist Items

  • Hyperparameters documented
  • Training/evaluation code provided
  • Data preprocessing scripts included
  • Model checkpoints available
  • Statistical significance tests
  • Compute requirements stated
  • Variance across runs reported
  • License specified

Running Experiments

# Clone repository
git clone https://github.com/[repo]
cd caselaw-graph-ring

# Install dependencies
pip install -r requirements.txt

# Download data
python scripts/download_data.py

# Train GraphSAGE
python -m src.graph.train

# Fine-tune LLM (Modal)
modal run src/model/train.py

# Evaluate
python -m src.model.evaluate