# Codebook: The Half-Life of History Dataset

## Dataset Overview

- **Periods**: 33 time periods, from the 3rd millennium BC to the 21st century CE
- **Temporal resolution**: 2 millennia (aggregated), 31 centuries (individual)
- **Collection date**: 2026-04-26
- **License**: CC-BY-4.0
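
The period structure above (2 millennia plus 31 centuries) can be enumerated as a quick sanity check. This is a minimal sketch: the label format and the helper names `ordinal` and `build_periods` are assumptions, and the century range (10th century BC through 21st century CE) is inferred from the counts and value ranges in this codebook rather than stated explicitly.

```python
def ordinal(n: int) -> str:
    """Return the English ordinal for a positive integer, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th, ...
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


def build_periods() -> list[str]:
    """List all periods in chronological order (label format is assumed)."""
    millennia = [f"{ordinal(n)} millennium BC" for n in (3, 2)]
    centuries_bc = [f"{ordinal(n)} century BC" for n in range(10, 0, -1)]
    centuries_ce = [f"{ordinal(n)} century CE" for n in range(1, 22)]
    return millennia + centuries_bc + centuries_ce


periods = build_periods()
# 2 millennia + 10 BC centuries + 21 CE centuries = 33 periods
assert len(periods) == 33
```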

## Metric Definitions

### openalex_work_count
- **Human label**: Academic Works (OpenAlex)
- **Unit**: Count of scholarly works
- **Data type**: Integer
- **Source**: OpenAlex API (`https://api.openalex.org/works`)
- **Query parameters**: `search="[Nth century BC/CE]"&filter=concepts.id:C95457728` (History concept)
- **Collection date**: 2026-04-26
- **Tier**: T1 (API, reproducible)
- **Known biases**: Concept tagging may not perfectly map to centuries; search term matching is imprecise for overlapping terms; English-language bias in OpenAlex coverage
- **Value range**: 526 (10th century BC) to 156,016 (20th century)
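
The query parameters above could be assembled into a request URL roughly as follows. This is a sketch only: the function name `openalex_query_url` and the exact period label passed in are assumptions, and the search phrase is wrapped in double quotes because that is how the parameter is written in this codebook. No request is sent; in OpenAlex list responses the total number of matching works is reported under `meta.count`.

```python
from urllib.parse import urlencode

OPENALEX_WORKS_URL = "https://api.openalex.org/works"
HISTORY_CONCEPT_ID = "C95457728"  # History concept, as listed in the codebook


def openalex_query_url(period_label: str) -> str:
    """Build the works query URL for one period label (hypothetical helper)."""
    params = {
        "search": f'"{period_label}"',  # quoted phrase, as in the codebook
        "filter": f"concepts.id:{HISTORY_CONCEPT_ID}",
    }
    return f"{OPENALEX_WORKS_URL}?{urlencode(params)}"


url = openalex_query_url("10th century BC")
```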

### named_individuals
- **Human label**: Named Individuals (Wikipedia)
- **Unit**: Count of biographical Wikipedia articles
- **Data type**: Integer
- **Source**: PetScan (`https://petscan.wmflabs.org/`)
- **Query parameters**: Recursive category count of `[Nth-century]_births` with depth=10
- **Collection date**: 2026-04-26
- **Tier**: T1 (API, reproducible)
- **Known biases**: Wikipedia's systemic bias toward Western, male, and notable individuals; category completeness varies by era; ancient periods extremely sparse
- **Value range**: 11 (10th century BC) to 1,367,010 (20th century)
- **NOTE**: `wikipedia_people_count` is identical to this metric for all 33 periods (same PetScan query). `wikipedia_people_count` is excluded from analysis to avoid double-counting.

### wikipedia_people_count
- **Human label**: Wikipedia People Count
- **Unit**: Count of biographical Wikipedia articles
- **Data type**: Integer
- **Source**: PetScan (`https://petscan.wmflabs.org/`)
- **EXCLUDED FROM ANALYSIS**: Identical to `named_individuals` (same PetScan source query). Including both would double-weight Wikipedia biography data in the composite index.
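
The duplication between `wikipedia_people_count` and `named_individuals` can be confirmed with a simple per-period comparison. The sketch below assumes the data is loaded as a list of dictionaries, one per period; that layout and the `period` key are hypothetical. The two sample rows use the minimum and maximum values reported for `named_individuals`.

```python
def columns_identical(rows: list[dict], col_a: str, col_b: str) -> bool:
    """Return True if col_a and col_b hold the same value in every row."""
    return all(row[col_a] == row[col_b] for row in rows)


# Two illustrative rows; the full dataset has 33.
rows = [
    {"period": "10th century BC", "named_individuals": 11,
     "wikipedia_people_count": 11},
    {"period": "20th century", "named_individuals": 1_367_010,
     "wikipedia_people_count": 1_367_010},
]

is_duplicate = columns_identical(rows, "named_individuals", "wikipedia_people_count")
```

If the check passes for all periods, dropping one of the two columns loses no information.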

### wikipedia_events_count
- **Human label**: Wikipedia Events Count
- **Unit**: Count (intended)
- **Data type**: Null for all periods
- **EXCLUDED FROM ANALYSIS**: All values are null. Wikipedia does not organize historical events into categories in the format the collection query expected.

### source_proxy
- **Human label**: Source Proxy (era-specific)
- **Unit**: Count-estimate of surviving primary sources
- **Data type**: Integer
- **Source**: Multiple sources depending on era
- **Collection date**: 2026-04-26
- **Tier**: T3 (antiquity estimates), T2 (medieval/modern published data)
- **Measurement regime**:
  - **Antiquity (before 6th c. CE)**: Inscription/tablet counts from EDH, PHI, CDLI databases
  - **Medieval (6th-15th c.)**: Manuscript production estimates from Buringh & Van Zanden 2009 via Our World in Data
  - **Early Modern (16th-17th c.)**: Print title counts from USTC
  - **Modern (18th c. onward)**: Publication volume estimates from Britannica, UNESCO, Google
- **Known biases**: Cross-era comparisons are approximate because the measurement method changes between eras; archaeological survival rates vary by material (clay vs. papyrus vs. paper); geographic coverage shifts from the Near East to the Mediterranean to Europe
- **Value range**: 500 (10th century BC) to 200,000,000 (21st century)
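
Because `source_proxy` switches measurement method by era, it can help to tag each century with its regime before comparing values. The following is a minimal sketch of the era boundaries listed above; the function name, the regime labels, and the signed-century encoding (negative for BC, positive for CE) are assumptions. The two millennium periods both fall in antiquity and are not handled by this helper.

```python
def measurement_regime(century: int) -> str:
    """Map a signed century number to its source_proxy measurement regime.

    Negative values are centuries BC (e.g. -10 for the 10th century BC),
    positive values are centuries CE. There is no century 0.
    """
    if century == 0:
        raise ValueError("century must be non-zero")
    if century < 6:
        return "antiquity"      # before 6th c. CE: EDH, PHI, CDLI counts
    if century <= 15:
        return "medieval"       # 6th-15th c.: manuscript production estimates
    if century <= 17:
        return "early_modern"   # 16th-17th c.: USTC print title counts
    return "modern"             # 18th c. onward: publication volume estimates
```

Comparing values only within a regime avoids attributing a change in method to a change in the historical record.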

### ngram_discourse
- **Human label**: Ngram Discourse Frequency
- **Unit**: Frequency (proportion of n-grams in Google Books English corpus)
- **Data type**: Float
- **Source**: Google Books Ngram Viewer JSON API
- **Query parameters**: `content=[ordinal]+century+[BC]&year_start=2000&year_end=2019&corpus=en-2019&smoothing=3`
- **Collection date**: 2026-04-26
- **Tier**: T1 (API, reproducible)
- **Known biases**:
  - 1st century CE value INFLATED by non-historical usage ("first century of..." in modern contexts)
  - Measures contemporary discourse about a period, not historical knowledge per se
  - English-language corpus only; Anglophone perspective
  - Each value combines the frequencies of the word form and the numeric form of the ordinal
- **Value range**: 1.34e-08 (10th century BC) to 2.08e-05 (19th century)
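
The query string above can be reproduced with the standard library. This sketch builds only the query string, since the endpoint URL is not given in this codebook; the function name `ngram_query_string` and the example phrase are assumptions. How the returned yearly series is reduced to a single value per period is also not specified here, so that step is left out.

```python
from urllib.parse import urlencode


def ngram_query_string(phrase: str) -> str:
    """Encode one search phrase with the fixed parameters from the codebook."""
    params = {
        "content": phrase,       # e.g. "10th century BC"
        "year_start": 2000,
        "year_end": 2019,
        "corpus": "en-2019",
        "smoothing": 3,
    }
    # urlencode replaces spaces with '+', matching the documented pattern
    return urlencode(params)


query = ngram_query_string("10th century BC")
```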

### films_set_in_period
- **Human label**: Films Set in Period
- **Unit**: Count of Wikipedia-listed films
- **Data type**: Integer (or null for millennia)
- **Source**: Wikipedia Films categories API
- **Collection date**: 2026-04-26
- **Tier**: T1 (API, reproducible)
- **EXCLUDED FROM COMPOSITE**: Supplementary metric only. Not available for all periods (null for millennia). Included in raw data display but not in the composite index.
- **Known biases**: Wikipedia category coverage is incomplete; 20th-century films are categorized by decade or conflict rather than by century, so the 20th-century count is an undercount
- **Value range**: 1 (9th century BC) to 905 (19th century); null for millennia

## Tier Definitions

| Tier | Definition |
|------|-----------|
| T1 | Direct API query, fully reproducible with the same parameters |
| T2 | Published research data, independently verifiable from cited sources |
| T3 | Manual estimate based on archaeological/bibliographic literature, approximate |

## Excluded Metrics Summary

| Metric | Reason for Exclusion |
|--------|---------------------|
| `wikipedia_people_count` | Identical to `named_individuals` (same PetScan source) |
| `wikipedia_events_count` | All null (Wikipedia category structure not as expected) |
| `films_set_in_period` | Supplementary only; not available for all periods |

## Active Metrics in Composite Index

The composite index uses exactly **four** metrics:
1. `openalex_work_count`
2. `named_individuals`
3. `source_proxy`
4. `ngram_discourse`
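
When preparing inputs for the composite index, the active metrics can be selected explicitly so that excluded columns never enter the calculation. The sketch below assumes each period is stored as a dictionary keyed by metric name, which is a hypothetical layout, and it stops at selection because this codebook does not define how the index itself is computed. The example record uses the 10th century BC values reported above, except for `films_set_in_period`, which is a placeholder.

```python
ACTIVE_METRICS = (
    "openalex_work_count",
    "named_individuals",
    "source_proxy",
    "ngram_discourse",
)


def composite_inputs(record: dict) -> dict:
    """Keep only the four metrics that feed the composite index."""
    return {name: record[name] for name in ACTIVE_METRICS}


example = {
    "openalex_work_count": 526,
    "named_individuals": 11,
    "wikipedia_people_count": 11,    # excluded: duplicate of named_individuals
    "wikipedia_events_count": None,  # excluded: null for all periods
    "source_proxy": 500,
    "ngram_discourse": 1.34e-08,
    "films_set_in_period": None,     # placeholder; excluded from composite
}

inputs = composite_inputs(example)
```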
