Data & Dataset Analysis

Comprehensive analysis of the SCDB dataset, CourtListener integration, and citation graph construction

1 Data Pipeline Overview

    DATA PIPELINE
    ═════════════════════════════════════════════════════════════════════════════

    ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
    │     SCDB        │    │  CourtListener  │    │   Processed     │
    │  (9,144 cases)  │───▶│      API        │───▶│   Dataset       │
    │   1946-2023     │    │  (Full Text)    │    │  (163 cases)    │
    └─────────────────┘    └─────────────────┘    └────────┬────────┘
                                                           │
                           ┌───────────────────────────────┼───────────────────────────────┐
                           │                               │                               │
                           ▼                               ▼                               ▼
                  ┌─────────────────┐          ┌─────────────────┐          ┌─────────────────┐
                  │   Training      │          │   Validation    │          │     Test        │
                  │   (113 cases)   │          │   (25 cases)    │          │   (25 cases)    │
                  │      69%        │          │      15%        │          │      15%        │
                  └─────────────────┘          └─────────────────┘          └─────────────────┘
        
Data Sources

SCDB: The Washington University Supreme Court Database provides case metadata, voting patterns, and outcomes.

CourtListener: The Free Law Project's API provides full case text (the Caselaw Access Project (CAP) API was deprecated in 2024).

API Integration

Full text is retrieved through the CourtListener v4 API. SCDB cases are matched to CourtListener records by docket number, with fuzzy matching on case name. Authenticated users are rate limited to 5,000 requests/day.
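The matching step can be sketched as follows. This is a minimal illustration, not the project's actual matcher: the field names (`docket`, `case_name`), the 0.85 similarity threshold, and the use of `difflib` are all assumptions, and no CourtListener request is made here.

```python
from __future__ import annotations

import re
from difflib import SequenceMatcher


def normalize_name(name: str) -> str:
    """Lowercase, unify 'v.' / 'vs.' / 'versus', and drop punctuation."""
    name = name.lower()
    name = re.sub(r"\b(vs?\.?|versus)\b", "v", name)
    name = re.sub(r"[^a-z0-9 ]+", " ", name)
    return " ".join(name.split())


def name_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two normalized case names."""
    return SequenceMatcher(None, normalize_name(a), normalize_name(b)).ratio()


def match_case(scdb_case: dict, candidates: list[dict], threshold: float = 0.85) -> dict | None:
    """Prefer an exact docket match; otherwise take the best fuzzy name match.

    The threshold is a hypothetical value chosen for illustration.
    """
    docket = scdb_case.get("docket")
    if docket:
        for cand in candidates:
            if cand.get("docket") == docket:
                return cand

    best, best_score = None, 0.0
    for cand in candidates:
        score = name_similarity(scdb_case["case_name"], cand["case_name"])
        if score > best_score:
            best, best_score = cand, score
    return best if best_score >= threshold else None


# Illustrative records only; these are not rows from the processed dataset.
candidates = [
    {"case_name": "Brown vs. Board of Education", "docket": None},
    {"case_name": "Miranda v. Arizona", "docket": "759"},
]
print(match_case({"case_name": "BROWN v. BOARD OF EDUCATION", "docket": None}, candidates))
```

Checking the docket number first keeps the fuzzy comparison as a fallback, which limits false positives between similarly named cases.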

2 Supreme Court Database (SCDB)

Dataset Characteristics

Attribute            Value
Source               Washington University Law
Original scope       9,144 cases
Date range           1946-2023
Matched with text    163 cases
Match rate           1.8% (163/9,144)
Matched date range   1947-2019

Variables Used

case_id: Unique SCDB identifier
date_decision: Date of Supreme Court decision
winning_party: Binary outcome (petitioner/respondent)
case_name: Full case name for matching

Outcome Distribution

Petitioner wins: 93 (57% of cases). The petitioner is the party who brings the case to the Supreme Court.
Respondent wins: 70 (43% of cases). The respondent is the party responding to the petition.

Note: The slight class imbalance (57/43) is addressed via stratified sampling in the train/val/test splits.

3 Case Text Analysis

Text Statistics

Metric                  Value
Average length          41,547 characters
Minimum length          611 characters
Maximum length          220,773 characters
Estimated avg tokens    ~10,000
Median length           ~35,000 characters

Text Preprocessing

1. HTML Removal: Strip HTML tags from CourtListener responses
2. Unicode Normalization: NFKC normalization for consistent encoding
3. Whitespace Cleanup: Normalize spacing and line breaks
4. Truncation: Limit to 4096 tokens for model context
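The four steps above can be sketched with the Python standard library. This is a rough illustration under stated assumptions: the regex-based tag stripping is deliberately crude, and because the model tokenizer is not named here, truncation counts whitespace-delimited words as a stand-in for tokens. All function names are hypothetical.

```python
import html
import re
import unicodedata

MAX_TOKENS = 4096  # truncation limit from step 4


def strip_html(raw: str) -> str:
    """Step 1: remove tags, then decode entities such as &amp; or &nbsp;."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    return html.unescape(no_tags)


def clean_whitespace(text: str) -> str:
    """Step 3: collapse runs of spaces/tabs and cap consecutive blank lines."""
    text = re.sub(r"[ \t\r\f\v]+", " ", text)
    text = re.sub(r" ?\n ?", "\n", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def truncate(text: str, max_tokens: int = MAX_TOKENS) -> str:
    """Step 4: keep the first max_tokens whitespace-delimited words.

    Assumption: a real pipeline would count tokens with the model tokenizer.
    """
    tokens = text.split()
    if len(tokens) <= max_tokens:
        return text
    return " ".join(tokens[:max_tokens])


def preprocess(raw: str) -> str:
    """Apply steps 1-4 in order."""
    text = strip_html(raw)
    text = unicodedata.normalize("NFKC", text)  # step 2
    text = clean_whitespace(text)
    return truncate(text)


print(preprocess("<p>Marbury&nbsp;v.  Madison</p>"))  # prints: Marbury v. Madison
```

Running NFKC before the whitespace cleanup matters in this sketch: it converts non-breaking spaces into ordinary spaces, so the cleanup step can collapse them.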

Temporal Distribution

Distribution of matched cases by decade (1947-2019)
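A per-decade count like the one summarized above can be derived from decision years. The sketch below is an assumption about how such a tally might be computed; the years in the usage line are illustrative and are not taken from the dataset.

```python
from collections import Counter


def cases_per_decade(years):
    """Count cases per decade; for example, 1947 falls in the 1940 bucket."""
    counts = Counter((year // 10) * 10 for year in years)
    return dict(sorted(counts.items()))


# Illustrative years only.
print(cases_per_decade([1947, 1954, 1966, 1966, 2019]))
# prints: {1940: 1, 1950: 1, 1960: 2, 2010: 1}
```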

4 Citation Graph Analysis

Citation Network Construction

Cases connect through citations, forming the legal precedent network.

Total edges: 226
Source cases: 24
Cited cases: 176
Avg out-degree: 9.4

Detailed Graph Metrics

Metric                 Value
Total edges            226
Unique source cases    24
Unique cited cases     176
Average out-degree     9.4
Maximum out-degree     44 citations
Average in-degree      1.28
Maximum in-degree      9 citations
Graph density          0.0054
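The metrics above can be computed from a plain edge list, as in the sketch below. Two points are assumptions rather than facts from the source: averages are taken over cases that have at least one outgoing (or incoming) citation, which is consistent with 226 / 24 ≈ 9.4 and 226 / 176 ≈ 1.28, and density uses the standard directed-graph definition |E| / (|V| (|V| - 1)), since the node count behind the reported density is not stated.

```python
from collections import Counter


def graph_metrics(edges):
    """edges: iterable of (source_case, cited_case) pairs, one per citation.

    Assumes at least one edge and at least two distinct nodes.
    """
    edges = list(edges)
    out_deg = Counter(src for src, _ in edges)
    in_deg = Counter(dst for _, dst in edges)
    n_nodes = len(set(out_deg) | set(in_deg))
    return {
        "total_edges": len(edges),
        "unique_sources": len(out_deg),
        "unique_cited": len(in_deg),
        "avg_out_degree": len(edges) / len(out_deg),
        "max_out_degree": max(out_deg.values()),
        "avg_in_degree": len(edges) / len(in_deg),
        "max_in_degree": max(in_deg.values()),
        "density": len(edges) / (n_nodes * (n_nodes - 1)),
    }


# Toy graph for illustration: A cites B and C, D cites B.
print(graph_metrics([("A", "B"), ("A", "C"), ("D", "B")]))

# The reported averages follow from the totals in the table.
print(round(226 / 24, 1), round(226 / 176, 2))  # prints: 9.4 1.28
```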

Citation Types

Supreme Court (US Reports): 147 (65%)
Federal Appeals: 34 (15%)
State/Regional: 37 (16%)
Other: 8 (4%)

Citation Extraction Methodology

Citations extracted using regex patterns for standard legal citation formats:

    # US Reports (Supreme Court)
    r'\d+\s+U\.?\s?S\.?\s+\d+'
    # Supreme Court Reporter
    r'\d+\s+S\.?\s?Ct\.?\s+\d+'
    # Lawyers' Edition
    r'\d+\s+L\.?\s?Ed\.?\s*(2d)?\s+\d+'
    # Federal Reporter
    r'\d+\s+F\.\s?(2d|3d)?\s+\d+'
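A short sketch of how these patterns might be applied to opinion text. The four patterns are copied from the list above; the dictionary keys, the function name, and the sample sentence are illustrative assumptions, and the project's actual extraction code may differ.

```python
import re

# Patterns as listed above; the labels are hypothetical names for illustration.
CITATION_PATTERNS = {
    "us_reports": r"\d+\s+U\.?\s?S\.?\s+\d+",
    "supreme_court_reporter": r"\d+\s+S\.?\s?Ct\.?\s+\d+",
    "lawyers_edition": r"\d+\s+L\.?\s?Ed\.?\s*(2d)?\s+\d+",
    "federal_reporter": r"\d+\s+F\.\s?(2d|3d)?\s+\d+",
}


def extract_citations(text):
    """Return (label, matched_text) pairs in order of appearance."""
    found = []
    for label, pattern in CITATION_PATTERNS.items():
        for match in re.finditer(pattern, text):
            found.append((match.start(), label, match.group(0)))
    return [(label, cite) for _, label, cite in sorted(found)]


sample = (
    "See Brown v. Board of Education, 347 U.S. 483, 74 S. Ct. 686, "
    "98 L. Ed. 873 (1954); cf. Smith v. Jones, 123 F.2d 456."
)
for label, cite in extract_citations(sample):
    print(label, "->", cite)
```

`re.finditer` with `group(0)` is used instead of `re.findall` because two of the patterns contain capture groups, and `findall` would return only the group contents for those.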

5 Data Splits

Training: 113 cases (69% of data), 65 petitioner / 48 respondent
Validation: 25 cases (15% of data), 14 petitioner / 11 respondent
Test: 25 cases (15% of data), 14 petitioner / 11 respondent

Stratified Sampling

All splits maintain an approximately 57/43 class balance through stratified random sampling with a fixed seed (42) for reproducibility.
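A minimal standard-library sketch of a stratified split with a fixed seed. The original tooling is not stated, so the function name and the round-half-up rule are assumptions. With that rounding rule, the split sizes and per-class counts come out equal to the figures reported above, although the specific cases assigned to each split would depend on the real implementation.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_pct=15, test_pct=15, seed=42):
    """Return (train, val, test) index lists, splitting each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        # Integer arithmetic with round-half-up avoids float rounding surprises.
        n_val = (len(idxs) * val_pct + 50) // 100
        n_test = (len(idxs) * test_pct + 50) // 100
        val.extend(idxs[:n_val])
        test.extend(idxs[n_val:n_val + n_test])
        train.extend(idxs[n_val + n_test:])
    return train, val, test


# Class counts from the outcome distribution: 93 petitioner, 70 respondent.
labels = ["petitioner"] * 93 + ["respondent"] * 70
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))  # prints: 113 25 25
```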

Temporal Considerations

Random splits were used rather than temporal splits to maximize training data. Future work may explore temporal holdout evaluation.
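For comparison, a temporal holdout could look like the sketch below: cases are sorted by decision date, the oldest go to training and the newest to test. This is a hypothetical illustration of the future-work idea, not something the current pipeline does, and it assumes `date_decision` is stored as an ISO `YYYY-MM-DD` string so that string order equals date order.

```python
def temporal_split(cases, val_pct=15, test_pct=15):
    """Split cases chronologically: oldest -> train, then val, newest -> test."""
    ordered = sorted(cases, key=lambda c: c["date_decision"])
    n = len(ordered)
    n_val = (n * val_pct + 50) // 100
    n_test = (n * test_pct + 50) // 100
    n_train = n - n_val - n_test
    return (
        ordered[:n_train],
        ordered[n_train:n_train + n_val],
        ordered[n_train + n_val:],
    )


# Illustrative records only: one case per year, 1950-1969.
cases = [{"date_decision": f"{1950 + i}-01-01"} for i in range(20)]
train, val, test = temporal_split(cases)
print(len(train), len(val), len(test))  # prints: 14 3 3
```

Unlike the stratified random split, this variant does not control class balance within each split, so the outcome ratio would need to be checked separately.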

6 Data Quality & Limitations

Match Rate Analysis

1.8% of cases matched (163 / 9,144)

Reasons for Low Match Rate

  1. Name variations: SCDB and CourtListener use different naming conventions
  2. Missing full text: Many older cases lack digitized opinions
  3. Per curiam decisions: Short procedural rulings excluded
  4. API limitations: CourtListener coverage varies by era

Implications & Future Work

Generalization Caveat

Results may not generalize to the full SCDB population. The matched subset may have selection bias toward cases with more complete records.

Expansion Plans

  • Integrate additional legal databases (Westlaw, LexisNexis APIs)
  • Improve fuzzy matching with case citation linking
  • OCR historical case documents for broader coverage
  • Expand to lower federal courts (Circuit Courts)

7 Citation Network Visualization

Interactive citation graph visualization

Nodes represent cases, edges represent citations. Node color indicates outcome.

python -m src.graph.visualize --output static/graph.html