The LGBTQ+ identification methodology has two important limitations:
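TSE race categories use the Brazilian census classification: Branca (White), Preta (Black), Parda (Brown), Amarela (Asian), and Indigena (Indigenous). We simplify these to the `race_simple` levels listed in the variable dictionary: "White", "Black", "Brown", and "Other".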
The female variable is based on the TSE’s DS_GENERO field. For trans candidates, this may reflect their legal gender rather than their gender assigned at birth. We use the TSE-recorded gender throughout, which aligns with Brazilian legal identity norms post-2018 (when the Supremo Tribunal Federal allowed name/gender changes without surgery).
The unit of analysis is the candidacy, not the individual person. In rare cases, the same person may appear as a candidate in more than one municipality (e.g., if they withdrew and re-registered). We do not de-duplicate at the person level, as the candidacy is the relevant unit for campaign finance and electoral outcome analysis.
Analysis data path:
/Users/aloport/Library/CloudStorage/Dropbox/Research/Prep/brazil_lgbtq_descriptives/data/derived/analysis_full.rds
Exists: TRUE
R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS 26.2
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] C.UTF-8/C.UTF-8/C.UTF-8/C/C.UTF-8/C.UTF-8
time zone: Europe/London
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] geobr_1.9.1 sf_1.0-20 kableExtra_1.4.0 knitr_1.48
[5] patchwork_1.3.0 gt_0.11.0 scales_1.4.0 lubridate_1.9.3
[9] forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4 purrr_1.2.0
[13] readr_2.1.5 tidyr_1.3.1 tibble_3.2.1 ggplot2_4.0.0
[17] tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] utf8_1.2.4 generics_0.1.3 class_7.3-22 xml2_1.3.6
[5] KernSmooth_2.23-24 stringi_1.8.4 hms_1.1.3 digest_0.6.37
[9] magrittr_2.0.4 evaluate_1.0.3 grid_4.4.1 timechange_0.3.0
[13] RColorBrewer_1.1-3 fastmap_1.2.0 rprojroot_2.0.4 jsonlite_2.0.0
[17] e1071_1.7-16 DBI_1.2.3 fansi_1.0.6 viridisLite_0.4.2
[21] cli_3.6.5 rlang_1.1.6 units_0.8-7 withr_3.0.2
[25] yaml_2.3.10 tools_4.4.1 tzdb_0.5.0 here_1.0.1
[29] curl_7.0.0 vctrs_0.6.5 R6_2.6.1 proxy_0.4-27
[33] classInt_0.4-10 lifecycle_1.0.4 htmlwidgets_1.6.4 pkgconfig_2.0.3
[37] pillar_1.9.0 gtable_0.3.6 data.table_1.17.0 Rcpp_1.1.0
[41] glue_1.8.0 systemfonts_1.1.0 xfun_0.47 tidyselect_1.2.1
[45] rstudioapi_0.17.1 farver_2.1.2 htmltools_0.5.8.1 rmarkdown_2.28
[49] svglite_2.1.3 compiler_4.4.1 S7_0.2.0
The data preparation scripts must be run in order before rendering the QMD chapters:
# 1. code/01_load_candidates.R -> data/derived/candidates_analysis.rds
# 2. code/02_load_finance_raw.R -> data/derived/finance_by_candidate.rds
# data/derived/finance_transactions.rds
# output/tables/finance_revenue_source_categories.csv
# 3. code/03_prepare_geography.R -> data/geo_cache/municipalities.rds
# data/geo_cache/states.rds
# 4. code/05_process_manifestos.R -> data/derived/manifestos_text.rds
# 5. code/04_build_analysis_data.R -> data/derived/analysis_full.rds
#
# After all five scripts have run:
# 6. quarto render docs/ -> docs/_site/ (rendered website)

The parent project data is stored on Dropbox and is not publicly available. Researchers wishing to replicate this analysis should:
All code in this project is designed to be portable given the correct paths configuration in code/00_setup.R.
---
title: "Data Codebook"
subtitle: "Source Files, Variable Dictionary, and Methodology"
---
# Overview
This codebook documents all data sources, variables, and methodological decisions underlying the descriptive analysis of LGBTQ+ candidates in Brazil's 2024 municipal elections. It is intended as a reference for reproducibility and for researchers who wish to extend or replicate this work.
Code chunks in this document illustrate the logic of data construction. Most have `eval: false` and do not execute; the session info section at the end evaluates to capture the software environment.
# Source Files
The analysis draws on data from four sources: (1) the TSE (*Tribunal Superior Eleitoral*) official candidate registration files, (2) TSE campaign finance records, (3) TSE candidate manifesto PDFs (*propostas de governo*), and (4) the VOTE LGBT candidate identification project. Geographic boundaries come from the IBGE via the `geobr` R package.
```{r source-files}
#| label: source-files
#| eval: false
# All paths are defined in code/00_setup.R
# Parent project path is set via QPP_DATA_DIR environment variable
# This project uses here::here() for relative paths
```
| File | Path | Format | Encoding | Description |
|------|------|--------|----------|-------------|
| `candidates_2024_full_integrated.csv` | `Data/Processed/` (parent) | CSV | UTF-8 | Full integrated candidate dataset from TSE 2024, with LGBTQ+ flags from VOTE LGBT matching. ~464K rows, one per candidacy. |
| `lgbt_matched_details.csv` | `Data/Processed/` (parent) | CSV | UTF-8 | Detailed LGBTQ+ matching results: candidate ID, match method (exact/fuzzy), sexual orientation, gender identity, source (TSE disclosure vs VOTE LGBT list). |
| `party_ideology_reference.csv` | `Data/Processed/` (parent) | CSV | UTF-8 | Party-level ideology scores from Bolognesi et al. expert survey. Columns: party abbreviation, ideology score (0-10), ideology category (Left/Center/Right). |
| `receitas_candidatos_2024_BRASIL.csv` | `Data/Brazil/Electoral data/` (parent) | CSV (`;` delimited) | Latin1 | Raw TSE campaign finance file. ~5.5M transactions, 1.3 GB. Comma decimal separator, semicolon field separator. |
| `campaign_finance_aggregated.csv` | `Data/Processed/intermediate/` (parent) | CSV | UTF-8 | Pre-existing candidate-level finance aggregation from parent project. Used for validation against our independent aggregation. |
| `crosswalk_tse_to_geobr.csv` | `Data/Processed/intermediate/` (parent) | CSV | UTF-8 | Crosswalk between TSE municipality codes and IBGE/geobr municipality codes. |
| `municipality_crosswalk_master.csv` | `Data/Processed/intermediate/` (parent) | CSV | UTF-8 | Master municipality crosswalk with TSE code, IBGE code, municipality name, state, and population estimates. |
| Candidate manifesto ZIPs (27 files) | `Data/Brazil/Candidate manifestos/` (parent) | ZIP/PDF | Latin1 | TSE *propostas de governo* for mayoral candidates. ~15,800 PDFs across 27 state-level ZIP archives (~11 GB). PDF filenames embed `SQ_CANDIDATO`. |
### Derived Data Files
These files are created by this project's data preparation scripts and stored in `data/derived/`.
| File | Script | Format | Description |
|------|--------|--------|-------------|
| `candidates_analysis.rds` | `01_load_candidates.R` | RDS | Cleaned candidate dataset with recoded variables, LGBTQ+ flags, ideology scores, and simplified demographics. |
| `finance_by_candidate.rds` | `02_load_finance_raw.R` | RDS | Candidate-level campaign finance summary: total revenue, funding source amounts and shares, donor counts. |
| `finance_transactions.rds` | `02_load_finance_raw.R` | RDS | Transaction-level finance data with classified funding types. Large file (~200 MB). |
| `manifestos_text.rds` | `05_process_manifestos.R` | RDS | Extracted manifesto text from ~15,800 mayoral candidate PDFs. Includes text, page/word counts, and extraction success flag. |
| `analysis_full.rds` | `04_build_analysis_data.R` | RDS | Final merged dataset: candidates + finance + geography + manifestos. This is the primary analysis file used by all QMD chapters. |
| `municipalities.rds` | `03_prepare_geography.R` | RDS | Cached municipality boundary shapefiles from `geobr::read_municipality(year = 2022)`. |
| `states.rds` | `03_prepare_geography.R` | RDS | Cached state boundary shapefiles from `geobr::read_state(year = 2020)`. |
### Output Files
| Directory | Contents |
|-----------|----------|
| `output/tables/` | CSV exports of key summary tables, including `finance_revenue_source_categories.csv`. |
| `output/figures/` | PNG and PDF exports of all figures, generated by `save_figure()`. |
# Variable Dictionary
## Core Candidate Variables
| Variable | Type | Description | Source | Example Values | Missing |
|----------|------|-------------|--------|----------------|---------|
| `candidate_id` | character | TSE sequential candidate identifier (`SQ_CANDIDATO`) | TSE registration | `"280000000001"` | 0% |
| `lgbtq_candidate` | logical | Whether candidate is identified as LGBTQ+ (TSE self-disclosure OR VOTE LGBT match) | TSE + VOTE LGBT | `TRUE`, `FALSE` | 0% |
| `trans_candidate` | logical | Whether candidate is identified as transgender (subset of LGBTQ+) | TSE + VOTE LGBT | `TRUE`, `FALSE` | 0% |
| `lgbtq_label` | factor | Binary label for plotting | Derived | `"LGBTQ+"`, `"Non-LGBTQ+"` | 0% |
| `lgbt_category` | factor | Disaggregated identity category. Trans is prioritized over sexual orientation. | Derived | `"Non-LGBTQ+"`, `"Gay"`, `"Lesbian"`, `"Bisexual+"`, `"Trans"`, `"Asexual"`, `"Other LGBTQ+"` | 0% |
| `female` | logical | Whether candidate's registered gender is female | TSE | `TRUE`, `FALSE` | <1% |
| `nonwhite` | logical | Whether candidate's race is not "Branca" (White) | TSE | `TRUE`, `FALSE` | ~1% |
| `race_simple` | factor | Simplified race category | Derived from TSE `DS_COR_RACA` | `"White"`, `"Black"`, `"Brown"`, `"Other"` | ~1% |
| `age` | numeric | Candidate age at election date | TSE | `18` to `100+` | <1% |
| `age_group` | factor | Binned age categories | Derived | `"18-29"`, `"30-39"`, `"40-49"`, `"50-59"`, `"60+"` | <1% |
| `education_simple` | factor | Simplified education level (3 categories) | Derived from TSE `DS_GRAU_INSTRUCAO` | `"Less than HS"`, `"High School"`, `"College+"` | <1% |
| `party_abbrev` | character | Party abbreviation | TSE | `"PT"`, `"PL"`, `"MDB"` | 0% |
| `ideology_score` | numeric | Party ideology score (0 = far left, 10 = far right) | Bolognesi et al. | `1.5` to `9.2` | ~2% (minor parties) |
| `ideology_category` | factor | Left/Center/Right classification | Derived from `ideology_score` | `"Left"`, `"Center"`, `"Right"` | ~2% |
| `position_simple` | factor | Simplified position type | TSE | `"City Councilor"`, `"Mayor"`, `"Vice Mayor"` | 0% |
| `elected` | logical | Whether candidate was elected | TSE | `TRUE`, `FALSE` | ~5.4% |
| `state_abbrev` | character | Two-letter state abbreviation | TSE | `"SP"`, `"RJ"`, `"BA"` | 0% |
| `region` | factor | Brazilian macro-region | Derived from `state_abbrev` | `"North"`, `"Northeast"`, `"Center-West"`, `"Southeast"`, `"South"` | 0% |
| `geobr_code` | numeric | IBGE municipality code (7 digits) compatible with `geobr` | Crosswalk | `3550308` (Sao Paulo) | <1% |
## Campaign Finance Variables
These variables are merged from `finance_by_candidate.rds` onto the candidate dataset. Candidates with no finance records have all finance variables set to 0.
| Variable | Type | Description | Source |
|----------|------|-------------|--------|
| `total_revenue` | numeric | Total campaign revenue in R$ | TSE receitas |
| `n_transactions` | integer | Number of revenue transactions | TSE receitas |
| `n_unique_donors` | integer | Number of distinct donor IDs | TSE receitas |
| `self_funding_amt` | numeric | Revenue from candidate's own resources (R$) | Classified from `DS_ORIGEM_RECEITA` |
| `party_funding_amt` | numeric | Revenue from party transfers (R$) | Classified |
| `individual_funding_amt` | numeric | Revenue from individual donors (R$) | Classified |
| `crowdfunding_amt` | numeric | Revenue from crowdfunding platforms (R$) | Classified |
| `pct_self` | numeric | Self-funding as % of total revenue (0-100 scale) | Derived |
| `pct_party` | numeric | Party funding as % of total (0-100) | Derived |
| `pct_individual` | numeric | Individual funding as % of total (0-100) | Derived |
| `pct_crowdfunding` | numeric | Crowdfunding as % of total (0-100) | Derived |
| `financial_amt` | numeric | Total financial (cash) contributions (R$) | From `DS_NATUREZA_RECEITA == "FINANCEIRO"` |
| `inkind_amt` | numeric | Total in-kind/estimated contributions (R$) | From `DS_NATUREZA_RECEITA != "FINANCEIRO"` |
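
The derivation of the share variables can be illustrated with a minimal base R sketch on toy data. The column names follow the tables in this codebook, but the helper `summarise_finance()` and all values are hypothetical; the project's actual aggregation lives in `02_load_finance_raw.R` and may differ.

```r
# Toy transaction-level data (hypothetical values; column names follow the codebook).
tx <- data.frame(
  SQ_CANDIDATO = c("A", "A", "A", "B"),
  funding_type = c("self_funding", "party_funding", "party_funding",
                   "individual_funding"),
  VR_RECEITA   = c(100, 250, 150, 80),
  stringsAsFactors = FALSE
)

# Hypothetical helper: sum revenue per candidate and funding type, then
# express each funding type as a share of total revenue.
summarise_finance <- function(tx) {
  amt <- tapply(tx$VR_RECEITA, list(tx$SQ_CANDIDATO, tx$funding_type), sum)
  amt[is.na(amt)] <- 0                  # no transactions of that type -> 0
  total <- rowSums(amt)
  out <- data.frame(candidate_id = rownames(amt), total_revenue = total,
                    stringsAsFactors = FALSE)
  types <- c(self = "self_funding", party = "party_funding",
             individual = "individual_funding", crowdfunding = "crowdfunding")
  for (nm in names(types)) {
    col <- if (types[[nm]] %in% colnames(amt)) amt[, types[[nm]]] else rep(0, nrow(amt))
    # Shares on a 0-100 scale; zero-revenue candidates get 0 rather than NaN.
    out[[paste0("pct_", nm)]] <- ifelse(total > 0, 100 * col / total, 0)
  }
  rownames(out) <- NULL
  out
}

fin <- summarise_finance(tx)
```

Guarding the division with `total > 0` keeps the shares consistent with the rule that candidates without finance records carry zeros rather than missing values.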
## Manifesto Variables
These variables are merged from `manifestos_text.rds` onto the candidate dataset. Only mayoral (PREFEITO) candidates have manifestos; all other candidates have `has_manifesto = FALSE` and `NA` for text fields.
| Variable | Type | Description | Source |
|----------|------|-------------|--------|
| `has_manifesto` | logical | Whether a manifesto PDF was found and text successfully extracted | `05_process_manifestos.R` |
| `manifesto_text` | character | Full extracted text of the candidate's *proposta de governo* | `pdftotext` extraction |
| `manifesto_n_pages` | integer | Page count of the manifesto PDF | `pdfinfo` |
| `manifesto_n_words` | integer | Word count of the extracted manifesto text | Derived |
## Transaction-Level Finance Variables
Available in `finance_transactions.rds` for detailed analysis.
| Variable | Type | Description |
|----------|------|-------------|
| `SQ_CANDIDATO` | character | Candidate sequential ID (join key) |
| `DS_ORIGEM_RECEITA` | character | TSE revenue source category (Portuguese) |
| `DS_NATUREZA_RECEITA` | character | Financial vs in-kind classification |
| `DS_ESPECIE_RECEITA` | character | Payment method (PIX, bank transfer, check, etc.) |
| `DS_NATUREZA_RECURSO_ESTIMAVEL` | character | Type of in-kind contribution (when applicable) |
| `VR_RECEITA` | numeric | Transaction amount in R$ |
| `funding_type` | character | Our classified funding type: `self_funding`, `party_funding`, `individual_funding`, `crowdfunding`, `other_candidates`, `other` |
| `NR_CPF_CNPJ_DOADOR` | character | Donor CPF (individual) or CNPJ (entity) |
| `NM_DOADOR` | character | Donor name |
| `DS_GENERO` | character | Donor gender (TSE categories) |
| `DS_COR_RACA` | character | Donor race/color (TSE categories) |
| `DT_RECEITA` | character | Transaction date |
# LGBTQ+ Identification Methodology
## Two Sources of Identification
LGBTQ+ candidates are identified through two complementary mechanisms:
### 1. TSE Self-Disclosure (New in 2024)
For the first time, the TSE's 2024 candidate registration form included optional fields for:
- **Sexual orientation**: Heterossexual, Gay, Lesbica, Bissexual, Pansexual, Assexual, or open text
- **Gender identity**: Cisgenero, Transgenero/Travesti, Nao-binario, or open text
Candidates who reported any non-heterosexual orientation or non-cisgender identity are flagged as LGBTQ+. This is the primary identification source and captures the majority of identified LGBTQ+ candidates.
```{r tse-disclosure}
#| label: tse-disclosure
#| eval: false
# TSE self-disclosure logic (simplified)
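# Both fields are optional. This sketch assumes undisclosed answers are coded
# as NA, and treats them as not flagged rather than propagating NA.
tse_lgbtq <- (!is.na(sexual_orientation) & sexual_orientation != "Heterossexual") |
  (!is.na(gender_identity) & gender_identity != "Cisgenero")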
```
### 2. VOTE LGBT Candidate List
VOTE LGBT (*Voto com Orgulho*) is a civil society organization that identifies and supports LGBTQ+ candidates in Brazilian elections. Their list includes candidates who:
- Self-identify to the organization as LGBTQ+
- Are publicly known LGBTQ+ community members
- Are nominated by LGBTQ+ organizations
The matching between VOTE LGBT names and TSE candidate records uses a two-step process:
```{r matching-logic}
#| label: matching-logic
#| eval: false
# Step 1: Exact match on candidate name (after normalization)
# - Remove accents, standardize spacing, uppercase
# - Match on full name, with state as secondary filter
# Step 2: Fuzzy match for remaining unmatched
# - String distance (Jaro-Winkler) with threshold
# - Manual review of ambiguous matches
# - State + party used as disambiguation
```
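
The two steps above can be sketched in base R on toy data. This is an illustration under stated assumptions, not the project's matching code: `normalize_name()`, the toy names, and the 0.15 cutoff are hypothetical, and base R's `adist()` (Levenshtein edit distance) stands in for the Jaro-Winkler distance used in the actual pipeline.

```r
# Hypothetical helper: strip accents, uppercase, and collapse whitespace.
# Accented characters are written as \u escapes so the code is locale-independent.
normalize_name <- function(x) {
  lower <- "\u00e1\u00e0\u00e2\u00e3\u00e4\u00e9\u00e8\u00ea\u00eb\u00ed\u00ec\u00ee\u00ef\u00f3\u00f2\u00f4\u00f5\u00f6\u00fa\u00f9\u00fb\u00fc\u00e7\u00f1"
  upper <- "\u00c1\u00c0\u00c2\u00c3\u00c4\u00c9\u00c8\u00ca\u00cb\u00cd\u00cc\u00ce\u00cf\u00d3\u00d2\u00d4\u00d5\u00d6\u00da\u00d9\u00db\u00dc\u00c7\u00d1"
  plain <- "aaaaaeeeeiiiiooooouuuucn"
  x <- chartr(paste0(lower, upper), paste0(plain, toupper(plain)), x)
  gsub("\\s+", " ", trimws(toupper(x)))
}

# Toy inputs (invented names).
tse <- data.frame(candidate_id = c("1", "2", "3"),
                  name  = c("JOSE DA SILVA", "MARIA OLIVEIRA", "ANA SOUSA"),
                  state = c("SP", "RJ", "BA"), stringsAsFactors = FALSE)
vote_list <- data.frame(name  = c("Jos\u00e9 da  Silva", "Ana Souza"),
                        state = c("SP", "BA"), stringsAsFactors = FALSE)

key_tse  <- paste(normalize_name(tse$name), tse$state)
key_list <- paste(normalize_name(vote_list$name), vote_list$state)

# Step 1: exact match on normalized name + state (NA where no match).
matched <- match(key_list, key_tse)

# Step 2: fuzzy match for the remainder, using length-normalized edit distance.
unmatched <- which(is.na(matched))
d      <- adist(key_list[unmatched], key_tse)
d_norm <- d / outer(nchar(key_list[unmatched]), nchar(key_tse), pmax)
best   <- apply(d_norm, 1, which.min)
# Accept only close matches; anything ambiguous would go to manual review.
matched[unmatched] <- ifelse(d_norm[cbind(seq_along(best), best)] <= 0.15, best, NA)

matched_ids <- tse$candidate_id[matched]
```

Including the state in the match key is one simple way to apply the state filter. A production version would keep the distance scores so that borderline pairs can be reviewed by hand.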
### Priority Rules for Identity Categories
When a candidate is identified through both sources, or when multiple identity labels apply, the following priority rules determine the `lgbt_category` assignment:
```{r priority-rules}
#| label: priority-rules
#| eval: false
# Priority rules (implemented in make_lgbt_category()):
# 1. Non-LGBTQ+ candidates: lgbtq_candidate == FALSE -> "Non-LGBTQ+"
# 2. Trans identity is prioritized: trans_candidate == TRUE -> "Trans"
# (regardless of sexual orientation, since trans is a gender identity
# that cross-cuts sexual orientation)
# 3. Sexual orientation categories in order:
# - "Gay" (male homosexual)
# - "Lesbian" (female homosexual, "Lesbica" in Portuguese)
# - "Bisexual+" (bisexual or pansexual, collapsed)
# - "Asexual"
# - "Other LGBTQ+" (any remaining identified candidate)
```
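
As a rough illustration, the priority rules could be written as a nested `ifelse()` chain. The function below is a hypothetical re-implementation, not the project's `make_lgbt_category()`; it assumes the orientation labels match the TSE options listed earlier and that undisclosed orientations are `NA`.

```r
# Hypothetical sketch of the priority rules; the real make_lgbt_category() may differ.
make_lgbt_category_sketch <- function(lgbtq_candidate, trans_candidate,
                                      sexual_orientation) {
  # Treat a missing orientation as an empty label so comparisons never return NA.
  so <- ifelse(is.na(sexual_orientation), "", sexual_orientation)
  category <-
    ifelse(!lgbtq_candidate, "Non-LGBTQ+",                # rule 1
    ifelse(trans_candidate, "Trans",                      # rule 2: trans first
    ifelse(so == "Gay", "Gay",                            # rule 3, in order
    ifelse(so == "Lesbica", "Lesbian",
    ifelse(so %in% c("Bissexual", "Pansexual"), "Bisexual+",
    ifelse(so == "Assexual", "Asexual", "Other LGBTQ+"))))))
  factor(category, levels = c("Non-LGBTQ+", "Gay", "Lesbian", "Bisexual+",
                              "Trans", "Asexual", "Other LGBTQ+"))
}

# Toy usage: the second candidate is trans and lesbian, and is categorized as "Trans".
cats <- make_lgbt_category_sketch(
  lgbtq_candidate    = c(FALSE, TRUE, TRUE, TRUE, TRUE),
  trans_candidate    = c(FALSE, TRUE, FALSE, FALSE, FALSE),
  sexual_orientation = c("Heterossexual", "Lesbica", "Pansexual", "Gay", NA)
)
```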
::: {.callout-note}
## Why Prioritize Trans?
Transgender identity is a gender identity, not a sexual orientation. A trans candidate may also be gay, lesbian, or bisexual, but for the purpose of this analysis, their gender identity takes precedence in the primary categorization. This follows conventions in both the LGBTQ+ studies literature and in Brazilian activist nomenclature. The "Bisexual+" category collapses bisexual and pansexual identities, which share the key feature of attraction to more than one gender.
:::
# Party Ideology Scores
## Source
Party ideology scores come from:
> Bolognesi, B., Ribeiro, E., & Codato, A. (2023). "Ideologia dos partidos politicos brasileiros." *Expert survey of Brazilian party ideology.*
## Scale
- **Range**: 0 (far left) to 10 (far right)
- **Method**: Expert survey with multiple respondents per party
- **Coverage**: All major and most minor parties in the Brazilian system
## Classification Thresholds
```{r ideology-thresholds}
#| label: ideology-thresholds
#| eval: false
# Thresholds used in this analysis:
# Left: ideology_score < 4.0
# Center: ideology_score >= 4.0 and < 7.1
# Right: ideology_score >= 7.1
# These thresholds are based on:
# 1. Natural breaks in the expert survey distribution
# 2. Conventional placement of anchor parties:
# - PT (Workers' Party) ~ 2.0 (clearly Left)
# - MDB (centrist catch-all) ~ 5.5 (Center)
# - PL (Bolsonaro's party) ~ 8.5 (clearly Right)
```
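
The thresholds translate directly into a call to `cut()` with left-closed intervals. This is a minimal sketch: the helper name `classify_ideology()` is hypothetical, and the example scores are the approximate anchor values quoted above plus the two boundary values.

```r
# Hypothetical helper: map ideology scores to Left/Center/Right.
# right = FALSE makes intervals left-closed: [-Inf, 4.0), [4.0, 7.1), [7.1, Inf).
classify_ideology <- function(score) {
  cut(score,
      breaks = c(-Inf, 4.0, 7.1, Inf),
      labels = c("Left", "Center", "Right"),
      right  = FALSE)
}

# Boundary scores 4.0 and 7.1 fall in the higher category; NA stays NA.
ideology_example <- classify_ideology(c(2.0, 4.0, 5.5, 7.1, 8.5, NA))
```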
## Unscored Parties
Approximately 2% of candidates belong to minor parties not covered by the expert survey. These candidates have `ideology_score = NA` and `ideology_category = NA`. They are excluded from ideology-stratified analyses but included in all other analyses.
# Municipality Crosswalk
## The TSE-IBGE Code Problem
The TSE uses its own municipality coding system, which differs from the IBGE's official 7-digit municipality codes used by statistical agencies and the `geobr` package. The crosswalk resolves this mismatch.
```{r crosswalk-logic}
#| label: crosswalk-logic
#| eval: false
# Crosswalk construction:
#   1. Start with TSE municipality code + name + state
#   2. Match to IBGE municipality code via:
#      a. Exact name match within state (handles ~95% of cases)
#      b. Fuzzy name match for municipalities with accent/spelling differences
#      c. Manual resolution for remaining cases (mergers, name changes)
#   3. IBGE 7-digit code (geobr_code) enables spatial joins with geobr shapefiles
#
# The geobr package uses code_muni, which is the IBGE 7-digit code.
# Join: candidates$geobr_code == muni_sf$code_muni
```
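The first two matching steps can be illustrated on toy data. This sketch is not the project's crosswalk code: the TSE codes and column names are made up, `strip_accents()` is a hypothetical helper, and step (b) is simplified to accent and case normalization rather than true fuzzy matching, which would need a string-distance measure.

```{r}
#| label: crosswalk-sketch
#| eval: false
# Toy inputs; TSE codes and column names are illustrative only.
tse <- data.frame(
  tse_code = c("00001", "00002", "00003"),
  name     = c("SAO PAULO", "BELEM", "RIO DE JANEIRO"),
  uf       = c("SP", "PA", "RJ")
)
ibge <- data.frame(
  code_muni = c(3550308, 1501402, 3304557),
  name      = c("S\u00e3o Paulo", "Bel\u00e9m", "Rio de Janeiro"),
  uf        = c("SP", "PA", "RJ")
)

# Hypothetical helper: replace common Portuguese accented letters with ASCII
strip_accents <- function(x) {
  accented <- paste0(
    "\u00e1\u00e0\u00e2\u00e3\u00e9\u00ea\u00ed\u00f3\u00f4\u00f5\u00fa\u00e7",
    "\u00c1\u00c0\u00c2\u00c3\u00c9\u00ca\u00cd\u00d3\u00d4\u00d5\u00da\u00c7"
  )
  chartr(accented, "aaaaeeioooucAAAAEEIOOOUC", x)
}

# Step a: exact (case-insensitive) name match within state
key_tse  <- paste(tse$uf, toupper(tse$name))
key_ibge <- paste(ibge$uf, toupper(ibge$name))
tse$geobr_code <- ibge$code_muni[match(key_tse, key_ibge)]

# Step b: accent-insensitive match for rows still unmatched
unmatched <- is.na(tse$geobr_code)
norm_tse  <- paste(tse$uf, toupper(strip_accents(tse$name)))
norm_ibge <- paste(ibge$uf, toupper(strip_accents(ibge$name)))
tse$geobr_code[unmatched] <- ibge$code_muni[match(norm_tse[unmatched], norm_ibge)]

# Rows still NA here would go to manual resolution (step c)
tse$needs_manual_review <- is.na(tse$geobr_code)
```

Matching on a state-plus-name key rather than name alone avoids collisions between municipalities that share a name across states.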
## Coverage
The crosswalk covers all 5,570 Brazilian municipalities as of 2022. A small number of candidates (<1%) have missing `geobr_code` due to:
- TSE municipality codes for overseas voting sections (not mappable to domestic municipalities)
- Rare municipality code changes between TSE and IBGE systems
# Data Quality Notes
## Known Issues
### 1. Missing Election Outcomes (~5.4%)
Approximately 5.4% of candidates have `elected = NA`. This occurs because:
- Some candidacies were annulled or withdrawn after registration but before results were finalized
- A small number of substitute candidates (*suplentes*) have ambiguous outcome status
- The `DS_SIT_TOT_TURNO` field in the TSE data has values that do not map cleanly to elected/not-elected
**Handling**: These candidates are excluded from election rate calculations but included in all other analyses (demographics, finance, geography).
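A minimal sketch of this handling on made-up data follows. The grouping column `lgbtq` is a hypothetical name; `elected` is the variable described above.

```{r}
#| label: election-rate-sketch
#| eval: false
# Toy data: one candidate has an unresolved outcome (NA)
candidates <- data.frame(
  lgbtq   = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE),
  elected = c(TRUE, FALSE, NA, TRUE, FALSE, FALSE)
)

# Election rate per group, dropping candidates with elected = NA
election_rate <- tapply(candidates$elected, candidates$lgbtq, mean, na.rm = TRUE)

# The NA rows stay in the data frame for every other analysis
n_missing_outcome <- sum(is.na(candidates$elected))
```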
### 2. Finance File Parsing
The raw TSE finance file (`receitas_candidatos_2024_BRASIL.csv`) is 1.3 GB with semicolon delimiters and Latin1 encoding. Known issues:
- **Decimal separator**: Commas are used as decimal separators (Brazilian convention). We specify `locale(decimal_mark = ",")` in `read_delim()`.
- **Parsing warnings**: A small number of rows (~0.01%) produce parsing warnings due to malformed fields. These rows are dropped silently.
- **Duplicate transactions**: We do not de-duplicate transactions, as the TSE may legitimately record corrections and adjustments as separate entries.
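The parsing settings can be sketched with `readr` on a two-row inline example. The column names `SQ_CANDIDATO` and `VR_RECEITA` and the values are placeholders for illustration; with the real file, the `I(raw_text)` argument would be replaced by the file path.

```{r}
#| label: finance-parsing-sketch
#| eval: false
library(readr)

# Two made-up transactions in the raw file's layout:
# ";" between fields and "," as the decimal mark
raw_text <- "SQ_CANDIDATO;VR_RECEITA\n1001;1500,50\n1002;250,00\n"

finance <- read_delim(
  I(raw_text),  # literal data; use the file path for the real file
  delim     = ";",
  locale    = locale(decimal_mark = ",", encoding = "latin1"),
  col_types = cols(
    SQ_CANDIDATO = col_character(),  # keep IDs as text to avoid precision loss
    VR_RECEITA   = col_double()
  )
)

# Parsing failures are recorded, not raised; review them before dropping rows
problems(finance)
```

Declaring the column types up front avoids relying on type guessing, which can misread long numeric IDs or comma-decimal amounts in a file this large.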
### 3. LGBTQ+ Identification Coverage
The LGBTQ+ identification methodology has three important limitations:
- **False negatives**: Candidates who are LGBTQ+ but did not disclose to the TSE and were not identified by VOTE LGBT are classified as Non-LGBTQ+. The true LGBTQ+ count is almost certainly higher than our identified count.
- **False positives**: Very rare, but possible through fuzzy name matching. All fuzzy matches were manually reviewed.
- **Regional variation in disclosure**: Self-disclosure rates likely vary by region, urbanization, and local political culture. The Southeast and South may have higher disclosure rates than the North and Northeast. This means geographic comparisons of LGBTQ+ candidate *rates* confound true prevalence with disclosure propensity.
### 4. Race and Gender Categories
TSE race categories use the Brazilian census classification: Branca (White), Preta (Black), Parda (Brown), Amarela (Asian), and Indigena (Indigenous). We simplify to:
- **White**: Branca
- **Nonwhite**: All other categories (for binary analyses)
- **Detailed**: White, Black, Brown, Other (collapsing Amarela, Indigena, and missing)
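The two recodings can be sketched with a lookup vector. The raw values below are illustrative uppercase strings, not confirmed TSE field values. Leaving missing race as `NA` in the binary version is an assumption of this sketch, since the list above only states that missing values fall under "Other" in the detailed version.

```{r}
#| label: race-recode-sketch
#| eval: false
# Illustrative raw values, including one missing entry
race_raw <- c("BRANCA", "PRETA", "PARDA", "AMARELA", "INDIGENA", NA)

# Detailed version: anything not in the lookup (including NA) becomes "Other"
race_lookup <- c(BRANCA = "White", PRETA = "Black", PARDA = "Brown")
race_detailed <- unname(race_lookup[race_raw])
race_detailed[is.na(race_detailed)] <- "Other"

# Binary version: missing race stays NA (assumption of this sketch)
race_binary <- ifelse(race_raw == "BRANCA", "White", "Nonwhite")
```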
The `female` variable is based on the TSE's `DS_GENERO` field. For trans candidates, this may reflect their legal gender rather than their sex assigned at birth. We use the TSE-recorded gender throughout, which aligns with Brazilian legal identity norms post-2018 (when the *Supremo Tribunal Federal* allowed name/gender changes without surgery).
### 5. Candidate vs Person
The unit of analysis is the **candidacy**, not the individual person. In rare cases, the same person may appear as a candidate in more than one municipality (e.g., if they withdrew and re-registered). We do not de-duplicate at the person level, as the candidacy is the relevant unit for campaign finance and electoral outcome analysis.
# Reproducibility
## Software Environment
```{r session-info}
#| label: session-info
#| eval: true
source(here::here("code", "00_setup.R"))
library(sf)
library(geobr)
cat("Analysis data path:\n")
cat(" ", paths$analysis_full_rds, "\n")
cat(" Exists:", file.exists(paths$analysis_full_rds), "\n\n")
sessionInfo()
```
## Execution Order
The data preparation scripts must be run in the order listed below before rendering the QMD chapters. The order is not strictly numeric: script 05 runs before script 04.
```{r execution-order}
#| label: execution-order
#| eval: false
# 1. code/01_load_candidates.R     -> data/derived/candidates_analysis.rds
# 2. code/02_load_finance_raw.R    -> data/derived/finance_by_candidate.rds
#                                     data/derived/finance_transactions.rds
#                                     output/tables/finance_revenue_source_categories.csv
# 3. code/03_prepare_geography.R   -> data/geo_cache/municipalities.rds
#                                     data/geo_cache/states.rds
# 4. code/05_process_manifestos.R  -> data/derived/manifestos_text.rds
# 5. code/04_build_analysis_data.R -> data/derived/analysis_full.rds
#
# After all five scripts have run:
# 6. quarto render docs/ -> docs/_site/ (rendered website)
```
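One way to enforce this order is a small runner script. The sketch below is hypothetical and not part of the project: `run_prep_scripts()` is an invented name, and it assumes each script can be sourced on its own from the project root.

```{r}
#| label: run-prep-scripts-sketch
#| eval: false
# Script order from the list above; note that 05 runs before 04
prep_scripts <- c(
  "01_load_candidates.R",
  "02_load_finance_raw.R",
  "03_prepare_geography.R",
  "05_process_manifestos.R",
  "04_build_analysis_data.R"
)

# Hypothetical runner: sources each script in turn, stopping at the first missing file
run_prep_scripts <- function(code_dir = "code") {
  for (script in prep_scripts) {
    path <- file.path(code_dir, script)
    if (!file.exists(path)) stop("Missing script: ", path)
    message("Running ", path)
    source(path)
  }
  invisible(prep_scripts)
}

# run_prep_scripts()  # then: quarto render docs/
```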
## Data Access
The parent project data is stored on Dropbox and is not publicly available. Researchers wishing to replicate this analysis should:
1. Download the raw TSE candidate registration files from [https://dadosabertos.tse.jus.br/](https://dadosabertos.tse.jus.br/)
2. Download the raw TSE campaign finance files from the same portal
3. Contact VOTE LGBT for their candidate identification list
4. Run the parent project's integration scripts to produce the processed files
5. Run this project's data preparation scripts in order (01, 02, 03, 05, then 04)
All code in this project is designed to be portable given the correct `paths` configuration in `code/00_setup.R`.