Data & Dataset Analysis
Comprehensive analysis of the SCDB dataset, CourtListener integration, and citation graph construction
1 Data Pipeline Overview
DATA PIPELINE
═════════════════════════════════════════════════════════════════════════════
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│      SCDB       │    │  CourtListener  │    │    Processed    │
│  (9,144 cases)  │───▶│       API       │───▶│     Dataset     │
│    1946-2023    │    │   (Full Text)   │    │   (163 cases)   │
└─────────────────┘    └─────────────────┘    └────────┬────────┘
                                                       │
                       ┌───────────────────────────────┼───────────────────────────────┐
                       │                               │                               │
                       ▼                               ▼                               ▼
              ┌─────────────────┐             ┌─────────────────┐             ┌─────────────────┐
              │    Training     │             │   Validation    │             │      Test       │
              │   (113 cases)   │             │   (25 cases)    │             │   (25 cases)    │
              │       69%       │             │       15%       │             │       15%       │
              └─────────────────┘             └─────────────────┘             └─────────────────┘
SCDB: The Washington University Supreme Court Database provides case metadata, voting patterns, and outcomes.
CourtListener: The Free Law Project API provides full case text (the CAP API was deprecated in 2024).
Full text is retrieved through the CourtListener v4 API, with cases matched by docket number and by fuzzy matching on case name. Authenticated users are rate limited to 5,000 requests/day.
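The matching step described above (docket number first, then fuzzy case-name matching) could look roughly like the sketch below. It is illustrative only: the record fields (`docket`, `name`), the helper names, and the 0.85 similarity threshold are assumptions, and `difflib.SequenceMatcher` stands in for whatever fuzzy matcher the pipeline actually uses.

```python
import re
from difflib import SequenceMatcher


def normalize_name(name):
    """Lowercase, unify 'v.'/'vs.' and drop punctuation so names compare fairly."""
    name = name.lower()
    name = re.sub(r"\bvs?\.?\s", "v ", name)
    name = re.sub(r"[^a-z0-9 ]", " ", name)
    return " ".join(name.split())


def name_similarity(a, b):
    """Similarity in [0, 1] between two case names after normalization."""
    return SequenceMatcher(None, normalize_name(a), normalize_name(b)).ratio()


def match_case(scdb_case, candidates, threshold=0.85):
    """Return the matching candidate record, or None if nothing is close enough.

    An exact docket-number match wins; otherwise the most similar case name
    is accepted only if it reaches the (assumed) threshold.
    """
    for cand in candidates:
        if scdb_case["docket"] and cand["docket"] == scdb_case["docket"]:
            return cand
    best = max(
        candidates,
        key=lambda c: name_similarity(scdb_case["name"], c["name"]),
        default=None,
    )
    if best is not None and name_similarity(scdb_case["name"], best["name"]) >= threshold:
        return best
    return None
```

A stricter threshold lowers the risk of false matches at the cost of a lower match rate, so the value is worth tuning against a hand-checked sample.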
2 Supreme Court Database (SCDB)
Dataset Characteristics
| Attribute | Value |
|---|---|
| Source | Washington University Law |
| Original scope | 9,144 cases |
| Date range | 1946-2023 |
| Matched with text | 163 cases |
| Match rate | 1.8% (163/9,144) |
| Matched date range | 1947-2019 |
Variables Used
Outcome Distribution
3 Case Text Analysis
Text Statistics
| Metric | Value |
|---|---|
| Average length | 41,547 characters |
| Minimum length | 611 characters |
| Maximum length | 220,773 characters |
| Estimated avg tokens | ~10,000 tokens |
| Median length | ~35,000 characters |
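The token estimate in the table is consistent with the common rule of thumb of roughly four characters per token, although the ratio actually used is not stated. A minimal sketch under that assumption (`CHARS_PER_TOKEN` and `estimate_tokens` are illustrative names):

```python
# Assumed heuristic: about 4 characters per token. This is not a measured
# tokenizer rate; a real tokenizer would give exact counts.
CHARS_PER_TOKEN = 4


def estimate_tokens(num_chars, chars_per_token=CHARS_PER_TOKEN):
    """Rough token count for a text of num_chars characters."""
    return round(num_chars / chars_per_token)


average_tokens = estimate_tokens(41_547)    # average case length from the table
longest_tokens = estimate_tokens(220_773)   # longest case from the table
```

Under this heuristic the longest opinion exceeds 55,000 tokens, so long cases would need truncation or chunking for models with shorter context windows.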
Text Preprocessing
Temporal Distribution
Distribution of matched cases by decade (1947-2019)
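A per-decade distribution like this one can be derived from decision years with integer division. The sketch below uses made-up years, not the actual dataset:

```python
from collections import Counter


def cases_per_decade(years):
    """Count cases per decade, keyed by the decade's first year (e.g. 1950)."""
    counts = Counter((year // 10) * 10 for year in years)
    return dict(sorted(counts.items()))


# Hypothetical decision years, for illustration only.
example_years = [1947, 1954, 1958, 1966, 2019]
example_counts = cases_per_decade(example_years)
```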
4 Citation Graph Analysis
Citation Network Construction
Cases are linked by the citations extracted from their opinions, forming a directed legal precedent network.
Detailed Graph Metrics
| Metric | Value |
|---|---|
| Total edges | 226 |
| Unique source cases | 24 |
| Unique cited cases | 176 |
| Average out-degree | 9.4 |
| Maximum out-degree | 44 citations |
| Average in-degree | 1.28 |
| Maximum in-degree | 9 citations |
| Graph density | 0.0054 |
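The averages in the table are consistent with dividing the edge count by the number of unique source cases (226 / 24 ≈ 9.4) and by the number of unique cited cases (226 / 176 ≈ 1.28). The sketch below computes the same metrics from a directed edge list. The density formula E / (N(N - 1)) is the standard one for directed graphs and is an assumption here, since the node count behind the reported density is not given; the toy edges are invented.

```python
from collections import Counter


def graph_metrics(edges):
    """Summary metrics for a directed citation graph of (citing, cited) pairs."""
    edges = set(edges)  # ignore duplicate citations between the same pair
    if not edges:
        return {}
    out_deg = Counter(src for src, _ in edges)
    in_deg = Counter(dst for _, dst in edges)
    nodes = set(out_deg) | set(in_deg)
    n, e = len(nodes), len(edges)
    return {
        "edges": e,
        "nodes": n,
        "unique_sources": len(out_deg),
        "unique_cited": len(in_deg),
        "avg_out_degree": e / len(out_deg),  # averaged over citing cases only
        "max_out_degree": max(out_deg.values()),
        "avg_in_degree": e / len(in_deg),    # averaged over cited cases only
        "max_in_degree": max(in_deg.values()),
        "density": e / (n * (n - 1)) if n > 1 else 0.0,
    }


# Toy graph: A cites X, Y, Z; B cites X and A.
toy = graph_metrics([("A", "X"), ("A", "Y"), ("A", "Z"), ("B", "X"), ("B", "A")])
```

Note that both averages exclude zero-degree nodes, which is why they differ from each other even though every edge has exactly one source and one target.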
Citation Types
Citation Extraction Methodology
Citations are extracted using regex patterns for standard legal citation formats.
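The exact patterns are not listed here, so the following is only a sketch of what regex-based extraction can look like. It assumes three common Supreme Court reporters (U.S., S. Ct., L. Ed.) in the usual "volume reporter page" form; the names `CITATION_RE`, `CANONICAL`, and `extract_citations` are illustrative.

```python
import re

# Assumed reporter set; the pipeline's real patterns may cover more formats.
CITATION_RE = re.compile(
    r"\b(?P<volume>\d{1,3})\s+"
    r"(?P<reporter>U\.\s?S\.|S\.\s?Ct\.|L\.\s?Ed\.(?:\s?2d)?)\s+"
    r"(?P<page>\d{1,4})\b"
)

# Map whitespace-stripped reporter strings to one canonical spelling.
CANONICAL = {
    "U.S.": "U.S.",
    "S.Ct.": "S. Ct.",
    "L.Ed.": "L. Ed.",
    "L.Ed.2d": "L. Ed. 2d",
}


def extract_citations(text):
    """Return (volume, reporter, page) tuples in order of appearance."""
    citations = []
    for match in CITATION_RE.finditer(text):
        key = re.sub(r"\s+", "", match.group("reporter"))
        citations.append(
            (int(match.group("volume")), CANONICAL[key], int(match.group("page")))
        )
    return citations


sample = (
    "See Brown v. Board of Education, 347 U.S. 483 (1954); "
    "Roe v. Wade, 410 U. S. 113, 93 S. Ct. 705, 35 L. Ed. 2d 147 (1973)."
)
found = extract_citations(sample)
```

Normalizing the reporter spelling matters because parallel citations and spacing variants ("U.S." vs "U. S.") would otherwise create duplicate nodes for the same case.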
5 Data Splits
Stratified Sampling
All splits maintain an approximately 57/43 class balance through stratified random sampling with a fixed seed (42) for reproducibility.
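A stratified split with a fixed seed can be sketched in plain Python as follows. This is not the project's actual splitting code: the function name, the 15%/15% defaults, and the per-class rounding are assumptions, so the resulting sizes can differ by a case or two from the 113/25/25 split reported above.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_frac=0.15, test_frac=0.15, seed=42):
    """Split indices into train/val/test, preserving class proportions.

    Each class is shuffled and sliced separately, so every split keeps
    roughly the same class balance as the full dataset.
    """
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for label in sorted(by_class):  # sorted for a deterministic class order
        idxs = by_class[label]
        rng.shuffle(idxs)
        n_val = round(len(idxs) * val_frac)
        n_test = round(len(idxs) * test_frac)
        val += idxs[:n_val]
        test += idxs[n_val:n_val + n_test]
        train += idxs[n_val + n_test:]
    return train, val, test


# Toy labels with a 57/43 balance (100 items, for illustration only).
toy_labels = [1] * 57 + [0] * 43
train_idx, val_idx, test_idx = stratified_split(toy_labels)
```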
Temporal Considerations
Random splits were used rather than temporal splits to maximize training data. Future work may explore temporal holdout evaluation.
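For comparison, a temporal holdout trains on earlier cases and evaluates on the most recent ones, so the model never sees decisions that postdate its test cases. A minimal sketch, assuming each case record carries a hypothetical `year` field:

```python
def temporal_split(cases, test_frac=0.15):
    """Hold out the most recent cases as the test set."""
    ordered = sorted(cases, key=lambda case: case["year"])
    n_test = max(1, round(len(ordered) * test_frac))
    return ordered[:-n_test], ordered[-n_test:]


# Hypothetical records, for illustration only.
toy_cases = [{"year": y} for y in
             (2019, 1947, 1985, 1999, 1962, 2005, 1973, 2011, 1954, 1990)]
earlier, held_out = temporal_split(toy_cases, test_frac=0.2)
```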
6 Data Quality & Limitations
Match Rate Analysis
Reasons for Low Match Rate
1. Name variations: SCDB and CourtListener use different naming conventions
2. Missing full text: Many older cases lack digitized opinions
3. Per curiam decisions: Short procedural rulings excluded
4. API limitations: CourtListener coverage varies by era
Implications & Future Work
Generalization Caveat
Results may not generalize to the full SCDB population. The matched subset may have selection bias toward cases with more complete records.
Expansion Plans
- Integrate additional legal databases (Westlaw, LexisNexis APIs)
- Improve fuzzy matching with case citation linking
- OCR historical case documents for broader coverage
- Expand to lower federal courts (Circuit Courts)
7 Citation Network Visualization
Interactive citation graph visualization
Nodes represent cases, edges represent citations. Node color indicates outcome.