Methodology
Comprehensive technical details of the LegalGPT graph-augmented legal prediction system
1 System Architecture Overview
LegalGPT implements a three-stage pipeline combining graph neural networks for precedent retrieval with large language models for outcome prediction. The architecture is inspired by retrieval-augmented generation (RAG) frameworks (Lewis et al., 2020) but incorporates citation graph structure as a first-class signal.
LegalGPT System Architecture
Stage 1: Graph Retrieval
GraphSAGE learns node embeddings that capture both semantic content and citation structure. Hybrid scoring combines embedding similarity with graph proximity.
Stage 2: Context Assembly
Retrieved precedents are formatted with metadata (date, outcome, relevance score) into a structured prompt following Mistral's instruction format.
Stage 3: LLM Prediction
QLoRA fine-tuning adapts the frozen Mistral-7B model with only 0.1% trainable parameters, enabling efficient domain adaptation.
2 Task Formulation
Formal Definition
Legal Outcome Prediction Task:
Input: C = (T, M, G)
where:
T = case text (opinion, arguments)
M = metadata (date, court, parties)
G = citation subgraph context
Output: y ∈ {petitioner, respondent}
Objective: Learn f: (T, M, G) → y
that maximizes P(y | T, M, G)
Label Definition (SCDB)
| Label | Definition | Frequency |
|---|---|---|
| petitioner (1) | Party bringing appeal wins | 57% |
| respondent (0) | Party responding wins | 43% |
Labels are derived from the SCDB partyWinning variable. Cases with unclear outcomes (remands, mixed decisions) are excluded.
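A minimal sketch of how this label derivation might look, assuming the SCDB case-centered data has been exported to CSV (the file path, encoding, and column handling here are illustrative, not the project's exact preprocessing script):

```python
import pandas as pd

# Illustrative sketch: derive binary labels from the SCDB partyWinning variable.
# Path and encoding are assumptions; SCDB releases are commonly latin-1 encoded CSVs.
scdb = pd.read_csv("data/SCDB_case_centered.csv", encoding="latin-1")

# Keep only cases with a clear winner; partyWinning == 2 marks unclear outcomes
# (remands, mixed decisions), which are excluded from the task.
scdb = scdb[scdb["partyWinning"].isin([0, 1])]

# 1 = petitioner wins, 0 = respondent wins (matches the label table above).
scdb["label"] = scdb["partyWinning"].astype(int)
print(scdb["label"].value_counts(normalize=True))  # expect roughly 57% / 43%
```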
Why This Formulation?
Design Choices
- Binary classification: Simplifies evaluation; multi-class (unanimous, split, remand) is future work
- Case-level prediction: Predicts overall winner, not issue-by-issue outcomes
- Post-hoc prediction: Uses full opinion text (retrospective analysis, not pre-decision forecasting)
Comparison to Prior Work
- Katz et al. (2017): Used pre-argument features only (true forecasting)
- Chalkidis et al. (2019): ECHR violation prediction (similar setup)
- Ours: Full text + citation context (maximum information)
3 GraphSAGE Embeddings
We employ GraphSAGE (Hamilton et al., 2017) to learn inductive node representations that capture both textual semantics and citation graph structure. Unlike transductive methods (e.g., DeepWalk, Node2Vec), GraphSAGE can embed unseen nodes at inference time.
Message Passing Visualization
Neighbor information aggregates across hops to update the central node embedding.
Architecture Configuration
| Parameter | Value | Justification |
|---|---|---|
| Input Dimensions | 385 | 384-dim text (all-MiniLM-L6-v2) + 1-dim temporal |
| Hidden Dimensions | 256 | 2× compression |
| Output Dimensions | 128 | Retrieval efficiency |
| Number of Layers | 2 | 2-hop neighborhood |
| Aggregator | MEAN | Permutation invariant |
| Activation | ReLU | Standard choice |
| Dropout | 0.3 | Regularization |
| Normalization | L2 | Unit sphere |
Message Passing Formulation
GraphSAGE Layer (Hamilton et al., 2017):
AGGREGATE:
a_v^(k) = MEAN({h_u^(k-1) : u ∈ N(v)})
COMBINE:
h_v^(k) = σ(W^(k) · CONCAT(h_v^(k-1), a_v^(k)))
With L2 normalization:
h_v^(k) = h_v^(k) / ||h_v^(k)||₂
Where:
h_v^(0) = x_v (initial node features)
N(v) = {u : (u,v) ∈ E} (cited cases)
σ = ReLU activation
W^(k) ∈ ℝ^{d_k × 2d_{k-1}}
The MEAN aggregator provides permutation invariance over neighborhoods. L2 normalization ensures embeddings lie on the unit hypersphere for cosine similarity retrieval.
Node Feature Initialization
| Feature | Dimension | Source |
|---|---|---|
| Text Embedding | 384-dim | sentence-transformers/all-MiniLM-L6-v2 |
| Temporal Feature | 1-dim | Normalized year: (year - 1946) / 77 |
| Total Input | 385-dim | Concatenated features |
Neighborhood Sampling
| Parameter | Value |
|---|---|
| Layer 1 neighbors | 25 |
| Layer 2 neighbors | 10 |
| Total sampled | ≤ 275 nodes/case |
Neighbors are sampled uniformly at random; the counts are capped to limit memory during training (see the loader sketch below).
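A sketch of this sampling configuration using PyTorch Geometric's `NeighborLoader`; the graph object and batch size below are illustrative placeholders, not the project's actual data pipeline:

```python
import torch
from torch_geometric.data import Data
from torch_geometric.loader import NeighborLoader

# Illustrative graph: 385-dim node features, random citation edges.
data = Data(x=torch.randn(1000, 385), edge_index=torch.randint(0, 1000, (2, 5000)))

loader = NeighborLoader(
    data,
    num_neighbors=[25, 10],   # 25 first-hop, 10 second-hop neighbors per node
    batch_size=128,           # seed nodes per mini-batch (illustrative)
    shuffle=True,
)

for batch in loader:
    # Each batch holds the seed nodes plus their sampled 2-hop neighborhood.
    print(batch.num_nodes)
    break
```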
PyTorch Geometric Implementation
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv


class LegalGraphSAGE(nn.Module):
    def __init__(self, in_dim=385, hidden_dim=256, out_dim=128, dropout=0.3):
        super().__init__()
        self.conv1 = SAGEConv(in_dim, hidden_dim, normalize=True)
        self.conv2 = SAGEConv(hidden_dim, out_dim, normalize=True)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, edge_index):
        # Layer 1: input → hidden
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = self.dropout(x)
        # Layer 2: hidden → output
        x = self.conv2(x, edge_index)
        # L2 normalize for cosine similarity
        x = F.normalize(x, p=2, dim=1)
        return x
```
4 QLoRA Fine-tuning
We employ QLoRA (Dettmers et al., 2023) to efficiently fine-tune Mistral-7B-Instruct for legal outcome prediction. QLoRA combines 4-bit quantization with Low-Rank Adaptation (LoRA; Hu et al., 2022), reducing memory requirements by ~4× while maintaining full fine-tuning performance.
Base Model: Mistral-7B-Instruct-v0.3
| Attribute | Value |
|---|---|
| Parameters | 7.24B |
| Architecture | Transformer decoder |
| Context Length | 32,768 tokens |
| Hidden Dimension | 4,096 |
| Attention Heads | 32 |
| Layers | 32 |
| Vocabulary | 32,000 |
| License | Apache 2.0 |
QLoRA Configuration
| Parameter | Value | Notes |
|---|---|---|
| Quantization | 4-bit NF4 | NormalFloat |
| Double Quant | True | Quantize constants |
| Compute dtype | bfloat16 | Mixed precision |
| LoRA Rank (r) | 16 | Low-rank dim |
| LoRA Alpha (α) | 32 | Scaling = α/r |
| LoRA Dropout | 0.05 | Regularization |
| Trainable | ~7M (0.1%) | Of 7.24B total |
LoRA Mathematical Formulation
Low-Rank Adaptation (Hu et al., 2022):
Original: h = W₀x
LoRA: h = W₀x + ΔWx
= W₀x + BAx
Where:
W₀ ∈ ℝ^{d×k} (frozen pretrained)
B ∈ ℝ^{d×r} (trainable)
A ∈ ℝ^{r×k} (trainable)
r << min(d, k) (low rank)
Scaling: ΔW = (α/r) · BA
with α = 32, r = 16 → scale = 2
Target Modules
q_proj, k_proj, v_proj, o_proj
gate_proj, up_proj, down_proj
We apply LoRA to all linear layers in both attention and MLP blocks, following Dettmers et al. (2023) recommendation for QLoRA.
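One way to wire up this configuration with the `bitsandbytes` and `peft` libraries is sketched below; the model-loading details (and the use of the sequence-classification wrapper) are assumptions for illustration, not the exact training script:

```python
import torch
from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization with double quantization and bfloat16 compute (table above).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForSequenceClassification.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    num_labels=2,
    quantization_config=bnb_config,
)

# LoRA adapters on all attention and MLP projections (target modules listed above).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="SEQ_CLS",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```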
Training Hyperparameters
| Hyperparameter | Value |
|---|---|
| Learning Rate | 2e-4 |
| LR Scheduler | Cosine with warmup |
| Warmup Ratio | 0.1 (10%) |
| Batch Size | 4 |
| Gradient Accumulation | 4 steps |
| Effective Batch Size | 16 |
| Epochs | 3 |
| Max Sequence Length | 4,096 tokens |
| Optimizer | AdamW (8-bit) |
| Weight Decay | 0.01 |
| Gradient Clipping | 1.0 |
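These hyperparameters map roughly onto Hugging Face `TrainingArguments` as sketched below; the output directory and the specific 8-bit optimizer variant are illustrative choices:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="outputs/legalgpt-qlora",    # illustrative path
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,          # effective batch size 16
    num_train_epochs=3,
    weight_decay=0.01,
    max_grad_norm=1.0,                      # gradient clipping
    optim="paged_adamw_8bit",               # one 8-bit AdamW option via bitsandbytes
    bf16=True,
)
```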
Classification Head
Sequence Classification Architecture:
1. Extract last token hidden state:
h_last = LLM(input_ids)[:, -1, :]
h_last ∈ ℝ^{batch × 4096}
2. Linear projection:
logits = W_cls · h_last + b_cls
W_cls ∈ ℝ^{2 × 4096}
logits ∈ ℝ^{batch × 2}
3. Softmax probabilities:
P(y|x) = softmax(logits)
We use the last token representation following standard practice for causal LM classification (Radford et al., 2019).
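A standalone sketch of the last-token pooling and linear projection described above (handling right-padded batches via the attention mask; this is illustrative, independent of the Hugging Face classification wrapper):

```python
import torch
import torch.nn as nn


class LastTokenClassifier(nn.Module):
    """Project the final non-padding token's hidden state to two outcome logits."""

    def __init__(self, hidden_dim=4096, num_labels=2):
        super().__init__()
        self.cls_head = nn.Linear(hidden_dim, num_labels)

    def forward(self, hidden_states, attention_mask):
        # hidden_states: (batch, seq_len, hidden_dim) from the LLM's last layer.
        last_idx = attention_mask.sum(dim=1) - 1                       # (batch,)
        batch_idx = torch.arange(hidden_states.size(0), device=hidden_states.device)
        h_last = hidden_states[batch_idx, last_idx]                    # (batch, hidden_dim)
        return self.cls_head(h_last)                                   # (batch, 2)
```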
5 Loss Functions
Training objectives for the classification task and GraphSAGE link prediction, with regularization techniques for improved generalization.
Cross-Entropy Loss
Standard Cross-Entropy:
ℒ_CE = -∑_{i=1}^{N} ∑_{c=1}^{C} y_{i,c} · log(p_{i,c})
For binary classification (C=2):
ℒ_CE = -1/N ∑_{i=1}^{N} [y_i·log(p_i) + (1-y_i)·log(1-p_i)]
Where:
N = number of samples
C = number of classes (2)
y_{i,c} = ground truth (one-hot)
p_{i,c} = predicted probability
Cross-entropy measures the divergence between predicted and true distributions. Minimizing CE is equivalent to maximum likelihood estimation (Goodfellow et al., 2016).
Label Smoothing Regularization
Label Smoothing (Szegedy et al., 2016):
Smoothed targets:
y'_{i,c} = y_{i,c}·(1-ε) + ε/C
With ε = 0.1:
Hard: [1, 0] → Soft: [0.95, 0.05]
Hard: [0, 1] → Soft: [0.05, 0.95]
Equivalent loss:
ℒ_LS = (1-ε)·ℒ_CE(y,p) + ε·H(u,p)
Where H(u,p) is CE with uniform dist.
Label smoothing prevents overconfident predictions, improving calibration and generalization (Müller et al., 2019).
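PyTorch supports this directly; a minimal sketch with ε = 0.1 and dummy logits:

```python
import torch
import torch.nn as nn

# Label smoothing with ε = 0.1, built into PyTorch's cross-entropy loss.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

logits = torch.randn(4, 2)            # (batch, num_classes) — illustrative values
targets = torch.tensor([1, 0, 1, 1])  # 1 = petitioner, 0 = respondent
loss = criterion(logits, targets)
```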
Link Prediction Loss (GraphSAGE)
Contrastive Link Prediction Objective:
ℒ_link = -∑_{(u,v)∈E} log(σ(z_u^T · z_v)) - Q · 𝔼_{v_n∼P_n} [log(σ(-z_u^T · z_{v_n}))]
Simplified binary cross-entropy form:
ℒ_link = -1/|E| ∑_{(u,v)∈E} [log(σ(z_u^T·z_v)) + ∑_{j=1}^{Q} log(σ(-z_u^T·z_{v_j}^-))]
Where:
E = observed citation edges
z_u, z_v = learned node embeddings (128-dim)
σ = sigmoid function
Q = negative samples per positive (Q=5)
P_n(v) ∝ degree(v)^0.75 (negative distribution)
v_j^- = sampled negative node
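A sketch of this objective as binary cross-entropy over dot-product scores; the edge-index tensors are assumed to come from the sampler described in the next section:

```python
import torch
import torch.nn.functional as F


def link_prediction_loss(z, pos_edge_index, neg_edge_index):
    """BCE over positive (cited) pairs and sampled negative pairs.

    z:              (num_nodes, 128) L2-normalized node embeddings
    pos_edge_index: (2, num_pos) observed citation edges
    neg_edge_index: (2, num_pos * Q) sampled non-edges
    """
    pos_score = (z[pos_edge_index[0]] * z[pos_edge_index[1]]).sum(dim=-1)
    neg_score = (z[neg_edge_index[0]] * z[neg_edge_index[1]]).sum(dim=-1)

    # -log σ(z_u·z_v) for positives, -log σ(-z_u·z_v⁻) for negatives.
    pos_loss = F.binary_cross_entropy_with_logits(pos_score, torch.ones_like(pos_score))
    neg_loss = F.binary_cross_entropy_with_logits(neg_score, torch.zeros_like(neg_score))
    return pos_loss + neg_loss
```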
Total Training Objective
Multi-task Loss:
ℒ_total = ℒ_classification + λ·ℒ_link
Where:
ℒ_classification: QLoRA fine-tuning loss
ℒ_link: GraphSAGE link prediction loss
λ = 0.1 (balancing coefficient)
Training schedule:
1. Pre-train GraphSAGE (link prediction)
2. Freeze GraphSAGE, train QLoRA
3. Optional: joint fine-tuning (λ > 0)
Regularization Summary
| Technique | Value | Effect |
|---|---|---|
| Label Smoothing | ε=0.1 | Calibration |
| Weight Decay | 0.01 | L2 penalty |
| Dropout (LoRA) | 0.05 | Adaptation |
| Dropout (GNN) | 0.3 | Graph layers |
| Gradient Clip | 1.0 | Stability |
6 Negative Sampling Strategy
Effective negative sampling is crucial for learning discriminative graph embeddings. We employ degree-biased sampling with hard negative mining following best practices from knowledge graph embedding literature.
Sampling Distribution
Negative Sampling Distribution:
P_n(v) ∝ degree(v)^α
With α = 0.75 (Mikolov et al., 2013):
- α = 0: uniform sampling
- α = 1: degree-proportional
- α = 0.75: smoothed (empirically optimal)
Effect: Reduces sampling of rare nodes,
focuses on distinguishing similar cases.
| Parameter | Value |
|---|---|
| Positive edges (train) | ~120,000 |
| Negative ratio (Q) | 5 |
| Sampling exponent (α) | 0.75 |
| Training pairs/epoch | ~720,000 |
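A sketch of degree-biased sampling with α = 0.75; the degree sequence below is a random placeholder standing in for the citation graph's node degrees:

```python
import numpy as np


def sample_negatives(degrees, num_samples, alpha=0.75, rng=None):
    """Draw negative node indices with probability proportional to degree^alpha."""
    rng = rng or np.random.default_rng(42)
    weights = np.power(degrees.astype(np.float64), alpha)
    probs = weights / weights.sum()
    return rng.choice(len(degrees), size=num_samples, p=probs)


# Example: Q = 5 negatives for each of ~120,000 positive edges.
degrees = np.random.randint(1, 200, size=10_000)   # illustrative degree sequence
negatives = sample_negatives(degrees, num_samples=120_000 * 5)
```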
Hard Negative Mining
Hard negatives: Cases that are structurally or semantically close but not directly cited.
Hard negative criteria:
1. 2-hop neighbors (cited by same case)
2. Same legal issue area (SCDB code)
3. Temporal proximity (±5 years)
4. High text similarity (>0.7 cosine)
Curriculum Schedule
| Epochs | Negative mix | Note |
|---|---|---|
| 1–2 | 100% random | Easy start |
| 3–4 | 80% random + 20% hard | Gradual |
| 5+ | 60% random + 40% hard | Full difficulty |
Temporal Constraints
Citation temporal constraint:
For edge (u, v) where u cites v:
date(v) < date(u) # v must precede u
This ensures:
- No future citations (anachronistic)
- Realistic precedent relationships
- Proper train/test temporal split
Implications for Negative Sampling
- Negatives must also respect temporal ordering
- Cannot sample future cases as negatives for historical ones
- Prevents temporal data leakage during training
7 Hybrid Retrieval System
Our retrieval system combines dense embedding similarity with sparse citation graph structure, following the hybrid retrieval paradigm (Karpukhin et al., 2020; Ma et al., 2021).
Score Computation Visualization
Three scoring signals combine into a unified retrieval score.
Hybrid Scoring Function
S_embed: Embedding Similarity
Cosine similarity of GraphSAGE embeddings:
S_embed(q, d) = cos(z_q, z_d)
= (z_q · z_d) / (||z_q|| · ||z_d||)
Since embeddings are L2-normalized:
S_embed(q, d) = z_q · z_d (dot product)
Range: [-1, 1] → normalized to [0, 1]
S_citation: Graph Proximity
Inverse shortest path distance:
S_citation(q, d) = 1 / (1 + dist(q, d))
Where dist(q, d) is shortest path
in the citation graph.
Special cases:
- Direct citation: dist=1 → S=0.5
- 2-hop: dist=2 → S=0.33
- Unreachable: dist=∞ → S=0
S_text: BM25 Similarity
BM25 (Robertson et al., 1995):
S_text(q, d) = ∑_{t∈q} IDF(t) · f(t,d)·(k₁+1) / (f(t,d) + k₁·(1 − b + b·|d|/avgdl))
Parameters:
k₁ = 1.2 (term frequency saturation)
b = 0.75 (length normalization)
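One possible implementation of this component uses the `rank_bm25` package; the corpus, query, and naive whitespace tokenization below are illustrative, not the project's actual indexing code:

```python
from rank_bm25 import BM25Okapi

# Illustrative corpus; a real pipeline would index full case texts with a proper tokenizer.
case_texts = ["equal protection clause applies", "fourth amendment search and seizure"]
query_text = "equal protection challenge"

corpus = [doc.lower().split() for doc in case_texts]
bm25 = BM25Okapi(corpus, k1=1.2, b=0.75)
scores = bm25.get_scores(query_text.lower().split())  # one BM25 score per candidate
```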
Retrieval Algorithm
```python
def retrieve_precedents(query_case, k=5, alpha=0.4, beta=0.35, gamma=0.25):
    # Stage 1: Candidate generation (fast)
    candidates = []
    candidates += ann_search(query_case.embedding, n=100)
    candidates += citation_neighbors(query_case, hops=2)
    candidates += bm25_search(query_case.text, n=50)
    candidates = deduplicate(candidates)

    # Stage 2: Hybrid re-ranking with the three signals above
    scores = []
    for doc in candidates:
        s = alpha * embed_sim(query_case, doc)
        s += beta * citation_proximity(query_case, doc)
        s += gamma * bm25_score(query_case, doc)
        scores.append((doc, s))

    # Stage 3: Return top-k by hybrid score
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]
```
Weight Ablation Results
| Configuration | AUROC | Δ |
|---|---|---|
| α=1.0 (embed only) | 0.76 | -0.04 |
| β=1.0 (citation only) | 0.77 | -0.03 |
| γ=1.0 (BM25 only) | 0.74 | -0.06 |
| α=0.5, β=0.5 | 0.78 | -0.02 |
| α=0.4, β=0.35, γ=0.25 | 0.80 | — |
Optimal weights found via grid search on validation set. Three-signal combination outperforms any single signal.
8 Embedding Fusion Architecture
Multi-modal representation combining semantic text features with structural graph information through learned fusion layers.
Embedding Space Visualization
2D PCA projection showing case embeddings clustered by outcome. Query case finds nearest neighbors for retrieval.
Fusion Architecture Diagram
Concatenation Fusion
Simple Concatenation:
e_concat = [e_text; e_graph]
= [384-dim; 128-dim]
= 512-dim
Fusion MLP:
h1 = ReLU(W1 · e_concat + b1) # 512→256
h1 = Dropout(h1, p=0.2)
h2 = ReLU(W2 · h1 + b2) # 256→128
e_fused = L2_normalize(h2)
Where:
W1 ∈ ℝ^{256×512}, b1 ∈ ℝ^256
W2 ∈ ℝ^{128×256}, b2 ∈ ℝ^128
Concatenation preserves all information from both modalities. The MLP learns non-linear interactions (Baltrusaitis et al., 2019).
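A sketch of this concatenation-fusion MLP as a PyTorch module (dimensions follow the equations above; the module name is illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConcatFusion(nn.Module):
    """Fuse 384-dim text and 128-dim graph embeddings into a 128-dim vector."""

    def __init__(self, text_dim=384, graph_dim=128, hidden_dim=256, out_dim=128):
        super().__init__()
        self.fc1 = nn.Linear(text_dim + graph_dim, hidden_dim)   # 512 → 256
        self.fc2 = nn.Linear(hidden_dim, out_dim)                # 256 → 128
        self.dropout = nn.Dropout(0.2)

    def forward(self, e_text, e_graph):
        e_concat = torch.cat([e_text, e_graph], dim=-1)
        h1 = self.dropout(F.relu(self.fc1(e_concat)))
        h2 = F.relu(self.fc2(h1))
        return F.normalize(h2, p=2, dim=-1)   # L2-normalize for cosine retrieval
```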
Alternative: Gated Fusion
Gated Fusion (Arevalo et al., 2017):
g = σ(W_g · [e_text; e_graph] + b_g)
e_fused = g ⊙ tanh(W_t·e_text) + (1−g) ⊙ tanh(W_s·e_graph)
Where:
g ∈ ℝ^d: learned gating vector
σ: sigmoid activation
⊙: element-wise multiplication
Advantage: Adaptive weighting per
dimension based on input content.
Gated fusion allows the model to weight each modality dynamically. We found that concatenation performs comparably while being simpler to implement.
Fusion Ablation Results
Fusion provides +6% AUROC over text-only and +4% over graph-only baselines.
9 Prompt Engineering
The prompt template structures retrieved precedents with metadata to enable effective in-context learning. We follow Mistral's instruction format with careful attention to token budget management.
Full Prompt Template
```text
[INST] You are a legal expert specializing in U.S. Supreme Court case analysis. Your task is to predict the outcome of a case based on its content and relevant precedents.

## Case to Analyze
Case Name: {case_name}
Docket Number: {docket}
Decision Date: {date}
Legal Issue Area: {issue_area}

Case Text (Opinion):
{case_text_truncated}

## Relevant Precedents
The following cases have been identified as relevant based on citation patterns and semantic similarity:

{for i, precedent in enumerate(retrieved_cases, 1)}
### Precedent {i}: {precedent.name} ({precedent.year})
Relevance Score: {precedent.score:.2f}
Outcome: {precedent.outcome}
Citation Distance: {precedent.citation_distance} hops
Key Excerpt:
{precedent.text_excerpt}
{endfor}

## Task
Based on the case text and the patterns observed in relevant precedents, predict whether the PETITIONER (party bringing the appeal) or RESPONDENT (party responding) will win.

Consider:
1. How the legal issues align with precedent outcomes
2. The strength of citation relationships
3. The evolution of legal doctrine over time

Prediction: [/INST]
```
Token Budget Management
| Component | Tokens | % |
|---|---|---|
| System prompt | ~200 | 1% |
| Query case text | ~5,000 | 25% |
| Precedent 1 | ~3,000 | 15% |
| Precedent 2 | ~3,000 | 15% |
| Precedent 3 | ~3,000 | 15% |
| Precedent 4 | ~3,000 | 15% |
| Precedent 5 | ~3,000 | 15% |
| Total | ~20,200 | 62% |
Remaining 38% of 32K context reserved for model generation and safety margin.
Truncation Strategy
Hierarchical truncation:
1. Case text: truncate to 5000 tokens
- Keep first 2500 (background)
- Keep last 2500 (holding/conclusion)
2. Precedent excerpts: 3000 tokens each
- Prioritize holding sections
- Include key cited passages
3. If over budget:
- Reduce precedent count (k=5→4→3)
- Further truncate excerpts
We preserve case beginnings and endings, which typically contain the facts and the holding, respectively (see the truncation sketch below).
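A sketch of the head-plus-tail truncation for the query case text; the `tokenizer` is assumed to be a Hugging Face tokenizer (e.g., the Mistral one), and the `[...]` marker is an illustrative choice:

```python
def truncate_head_tail(text, tokenizer, max_tokens=5000):
    """Keep the first and last halves of the token budget, dropping the middle."""
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    if len(token_ids) <= max_tokens:
        return text
    head = token_ids[: max_tokens // 2]          # background / facts
    tail = token_ids[-(max_tokens // 2):]        # holding / conclusion
    return tokenizer.decode(head) + "\n[...]\n" + tokenizer.decode(tail)
```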
Prompt Engineering Ablations
| Variant | AUROC |
|---|---|
| No system prompt | 0.77 |
| No precedent metadata | 0.78 |
| Random precedent order | 0.79 |
| No relevance scores | 0.79 |
| Full template | 0.80 |
Key Findings
- System prompt improves consistency (+0.03 AUROC)
- Precedent metadata provides useful signal (+0.02)
- Ordering precedents by relevance score helps marginally (+0.01)
- Explicit relevance scores aid attention (+0.01)
10 Attention Analysis & Interpretability
Understanding which precedents and text spans the model attends to provides interpretability for legal practitioners and validates that the model learns meaningful legal reasoning patterns.
Attention Extraction Method
Multi-Head Attention Aggregation:
For layer L, head h:
A^{(L,h)} = softmax(QK^T / √d_k)
Attention to precedent p:
α_p = (1/H) ∑_{h=1}^{H} ∑_{t∈T_p} A^{(L,h)}_{cls,t}
Where:
H = 32 attention heads
T_p = token positions for precedent p
cls = classification token (last)
L = 32 (final layer)
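A sketch of this aggregation using Hugging Face attention outputs; the precedent token spans are assumed to be tracked by the prompt builder, and the function name is illustrative:

```python
import torch


@torch.no_grad()
def precedent_attention(model, input_ids, precedent_spans, layer=-1):
    """Average final-layer attention from the last token to each precedent's tokens.

    precedent_spans: list of (start, end) token positions, one per precedent.
    """
    outputs = model(input_ids, output_attentions=True)
    attn = outputs.attentions[layer]            # (batch, heads, seq, seq)
    # Attention from the final (classification) token, averaged over heads.
    last_tok = attn[:, :, -1, :].mean(dim=1)    # (batch, seq)

    weights = []
    for start, end in precedent_spans:
        weights.append(last_tok[:, start:end].sum(dim=-1))
    weights = torch.stack(weights, dim=-1)      # (batch, num_precedents)
    return weights / weights.sum(dim=-1, keepdim=True)   # normalize per example
```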
Attribution Methods
| Method | Formula |
|---|---|
| Attention Rollout | Ā = ∏_{l=1}^{L} (0.5·I + 0.5·A^{(l)}) |
| Gradient × Input | attr_i = |∂ℒ/∂x_i · x_i| |
| Integrated Gradients | IG_i = x_i · ∫_{α=0}^{1} ∂F/∂x_i dα |
| SHAP Values | φ_i = ∑_S |S|!(n-|S|-1)!/n! · Δ_i |
| Leave-One-Out | Δ_p = P(y|all) - P(y|all\{p}) |
Example: Attention Distribution for Obergefell v. Hodges (2015)
```text
Query Case: Obergefell v. Hodges — Same-sex marriage constitutional right

Precedent Attention Weights (normalized):
  Loving v. Virginia (1967)         ████████████████████████████████  0.31
  United States v. Windsor (2013)   ██████████████████████████████    0.28
  Lawrence v. Texas (2003)          ██████████████████████            0.21
  Romer v. Evans (1996)             █████████████                     0.12
  Griswold v. Connecticut (1965)    ████████                          0.08

Key Attended Phrases:
  • "fundamental right to marry" (Loving) → 0.42 local attention
  • "equal dignity in the eyes of the law" (Windsor) → 0.38 local attention
  • "liberty protects the person" (Lawrence) → 0.35 local attention

Model Prediction: PETITIONER (confidence: 0.87)
Actual Outcome: PETITIONER ✓
```
Observed Attention Patterns
| Pattern | Effect |
|---|---|
| Recency bias | 1.4× for recent |
| Outcome alignment | 1.8× for matching |
| Citation distance | 2.1× for direct |
| Issue area overlap | 1.6× for same |
| Relevance score | r = 0.72 correlation |
Faithfulness Evaluation
| Metric | Value | Interpretation |
|---|---|---|
| Attention-Prediction r | 0.67 | Strong |
| LOO Consistency | 83% | High |
| ROAR@10% | -12% acc | Meaningful |
| Human Agreement κ | 0.54 | Moderate |
11 Model Calibration
Well-calibrated probability estimates are essential for legal applications where practitioners need to assess prediction reliability. We apply temperature scaling (Guo et al., 2017) for post-hoc calibration.
Expected Calibration Error (ECE)
ECE (Naeini et al., 2015):
ECE = ∑_{m=1}^{M} (|B_m|/n) · |acc(B_m) - conf(B_m)|
Where:
M = 15 equal-width bins
B_m = samples in confidence bin m
acc(B_m) = accuracy in bin m
conf(B_m) = mean confidence in bin m
Interpretation:
ECE = 0.00: perfect calibration
ECE < 0.05: well-calibrated
ECE > 0.15: poorly calibrated
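A sketch of the ECE computation with 15 equal-width bins; inputs are assumed to be NumPy arrays of max-softmax confidences, predicted labels, and true labels:

```python
import numpy as np


def expected_calibration_error(confidences, predictions, labels, n_bins=15):
    """ECE over equal-width confidence bins (Naeini et al., 2015)."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.sum() == 0:
            continue
        acc = (predictions[mask] == labels[mask]).mean()
        conf = confidences[mask].mean()
        ece += (mask.sum() / len(confidences)) * abs(acc - conf)
    return ece
```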
Temperature Scaling
Post-hoc calibration (Guo et al., 2017):
p_calibrated = softmax(z / T)
Where:
z = logits from model
T = temperature parameter
Optimization:
T* = argmin_T NLL(y, softmax(z/T))
Effects:
T > 1: soften predictions (less confident)
T < 1: sharpen predictions (more confident)
T = 1: no change
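A sketch of fitting the temperature on held-out validation logits by minimizing NLL with LBFGS (optimizer settings are illustrative):

```python
import torch
import torch.nn as nn


def fit_temperature(val_logits, val_labels, max_iter=50):
    """Return the scalar T* minimizing NLL of softmax(logits / T) on validation data."""
    log_t = torch.zeros(1, requires_grad=True)      # optimize log T so T stays positive
    nll = nn.CrossEntropyLoss()
    optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=max_iter)

    def closure():
        optimizer.zero_grad()
        loss = nll(val_logits / log_t.exp(), val_labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_t.exp().item()


# Usage sketch: calibrated_probs = torch.softmax(test_logits / T_star, dim=-1)
```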
Reliability Diagram
Accuracy vs Confidence (15 bins):
1.0 │ ╭── Perfect
│ ● ╱ calibration
0.8 │ ● ● ╱
│ ● ● ╱ ● After T-scaling
│ ○ ● ╱ ○ Before
0.6 │ ○ ● ○ ╱
│ ○ ● ○ ╱
0.4 │○ ● ╱
│ ╱
0.2 │ ╱
│ ╱
0.0 ├──────┼──────┼──────┼──────┤
0.0 0.25 0.5 0.75 1.0
Confidence
Calibration Metrics
| Metric | Value |
|---|---|
| ECE (before) | 0.127 |
| ECE (after T-scaling) | 0.034 |
| Optimal Temperature | T* = 1.42 |
| MCE (Max Calibration Error) | 0.089 |
| Brier Score | 0.168 |
| NLL (calibrated) | 0.412 |
Confidence Distribution
Practical Implications
Well-Calibrated for Legal Use
After temperature scaling, an ECE of 0.034 means that when the model predicts 70% confidence, it is correct approximately 70% of the time. This reliability is crucial for:
- Risk assessment by legal practitioners
- Identifying cases that need human review
- Setting decision thresholds for different use cases
12 Statistical Significance Testing
Rigorous statistical methods for comparing model performance and establishing significance of improvements over baselines.
McNemar's Test
McNemar's Test (Dietterich, 1998):
Contingency table for paired predictions:
Model B
Correct Wrong
Model A ┌─────────┬─────────┐
Correct │ a │ b │
Wrong │ c │ d │
└─────────┴─────────┘
Test statistic:
χ² = (|b - c| - 1)² / (b + c)
H₀: P(b) = P(c) (models equivalent)
H₁: P(b) ≠ P(c) (models differ)
Reject H₀ if χ² > 3.84 (α = 0.05)
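A sketch of this test using `statsmodels`; `correct_a` and `correct_b` are assumed to be boolean NumPy arrays of per-case correctness for the two models:

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar


def compare_models(correct_a, correct_b):
    """McNemar's test on paired per-case correctness indicators."""
    a = int(np.sum(correct_a & correct_b))    # both correct
    b = int(np.sum(correct_a & ~correct_b))   # A right, B wrong
    c = int(np.sum(~correct_a & correct_b))   # A wrong, B right
    d = int(np.sum(~correct_a & ~correct_b))  # both wrong
    table = [[a, b], [c, d]]
    # Chi-square variant with continuity correction, matching (|b - c| - 1)² / (b + c).
    result = mcnemar(table, exact=False, correction=True)
    return result.statistic, result.pvalue
```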
Bootstrap Confidence Intervals
BCa Bootstrap (Efron & Tibshirani, 1993):
for b in 1..B: # B = 10,000
D_b = sample(D_test, n, replace=True)
θ_b = metric(model, D_b)
Percentile CI:
CI_95% = [θ_{(0.025·B)}, θ_{(0.975·B)}]
BCa Correction:
z₀ = Φ⁻¹(#{θ_b < θ̂} / B)
a = ∑(θ̄ - θ_i)³ / (6·(∑(θ̄ - θ_i)²)^1.5)
Adjusted percentiles account for bias
and skewness in bootstrap distribution.
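A sketch of the percentile bootstrap for AUROC (the BCa bias/skewness correction above is omitted here for brevity); inputs are assumed to be NumPy arrays of labels and scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def bootstrap_auroc_ci(y_true, y_score, n_boot=10_000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for AUROC by resampling test cases with replacement."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        if len(np.unique(y_true[idx])) < 2:   # skip degenerate resamples
            continue
        stats.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi
```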
Statistical Comparison Results
| Comparison | AUROC Δ | 95% CI | McNemar χ² | p-value | Significant? |
|---|---|---|---|---|---|
| Ours vs. Mistral (no retrieval) | +0.06 | [0.04, 0.08] | 18.7 | <0.001 | Yes *** |
| Ours vs. BM25 Retrieval | +0.03 | [0.01, 0.05] | 8.4 | 0.004 | Yes ** |
| Ours vs. Longformer | +0.07 | [0.05, 0.09] | 24.1 | <0.001 | Yes *** |
| Ours vs. Legal-BERT | +0.09 | [0.06, 0.12] | 31.5 | <0.001 | Yes *** |
*** p < 0.001, ** p < 0.01, * p < 0.05 (Bonferroni-corrected for 4 comparisons, α = 0.0125)
Effect Size (Cohen's h)
Cohen's h for proportions:
h = 2·arcsin(√p₁) - 2·arcsin(√p₂)
Interpretation:
|h| < 0.2: small effect
|h| ≈ 0.5: medium effect
|h| > 0.8: large effect
Our improvement vs baseline:
h = 0.62 (medium-large)
Multiple Testing Correction
| Setting | Value |
|---|---|
| Method | Bonferroni |
| # Comparisons | 4 |
| α (original) | 0.05 |
| α (corrected) | 0.0125 |
| All significant? | Yes ✓ |
Variance Analysis
| Metric | Value |
|---|---|
| Cross-val folds | 5 |
| AUROC mean | 0.798 |
| AUROC std | ±0.012 |
| Seeds tested | 3 |
| Seed variance | ±0.008 |
13 Computational Requirements
| Component | Time | Hardware | Memory | Est. Cost |
|---|---|---|---|---|
| Data preprocessing | ~30 min | CPU (8 cores) | 8GB RAM | $0 |
| Citation extraction | ~10 min | CPU | 4GB RAM | $0 |
| GraphSAGE training | ~30 min | T4 GPU | 12GB VRAM | ~$1 |
| QLoRA fine-tuning | ~4 hours | A100 80GB | 35GB VRAM | ~$15 |
| Evaluation | ~15 min | A100 | 20GB VRAM | ~$1 |
| Total | ~5.5 hours | — | — | ~$17-30 |
Cloud Platform
All experiments run on Modal Labs with A100 GPUs at ~$3.50/hour. Code designed for serverless execution with automatic scaling.
Reproducibility Cost
Total compute cost under $30 makes this research accessible to academic labs and independent researchers without institutional GPU clusters.
14 Reproducibility Checklist
Following EMNLP reproducibility guidelines, we provide comprehensive documentation for result replication.
Code & Data Availability
| Resource | Location |
|---|---|
| Source Code | github.com/[repo] |
| License | MIT |
| SCDB Data | scdb.wustl.edu |
| Case Text API | courtlistener.com |
| Trained Models | HuggingFace Hub |
| Processed Data | Zenodo archive |
Environment Specification
```text
# Key dependencies (requirements.txt)
torch==2.1.0
transformers==4.36.0
peft==0.7.0
bitsandbytes==0.41.3
torch-geometric==2.4.0
sentence-transformers==2.2.2
neo4j==5.14.0
scikit-learn==1.3.2
modal==0.56.0

# Python version
python==3.10.12

# CUDA version
cuda==12.1
```
Random Seeds & Determinism
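The variance analysis above reports results over 3 seeds; a typical determinism setup for this stack is sketched below (the seed value and helper name are illustrative, not the repository's exact utility):

```python
import os
import random

import numpy as np
import torch


def set_seed(seed=42):
    """Seed Python, NumPy, and PyTorch for (mostly) deterministic runs."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    # Note: some GPU kernels (and 4-bit quantized ops) remain non-deterministic.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```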
EMNLP Checklist Items
- ✓ Hyperparameters documented
- ✓ Training/evaluation code provided
- ✓ Data preprocessing scripts included
- ✓ Model checkpoints available
- ✓ Statistical significance tests
- ✓ Compute requirements stated
- ✓ Variance across runs reported
- ✓ License specified
Running Experiments
```bash
# Clone repository
git clone https://github.com/[repo]
cd caselaw-graph-ring

# Install dependencies
pip install -r requirements.txt

# Download data
python scripts/download_data.py

# Train GraphSAGE
python -m src.graph.train

# Fine-tune LLM (Modal)
modal run src/model/train.py

# Evaluate
python -m src.model.evaluate
```