Data Codebook

Source Files, Variable Dictionary, and Methodology

1 Overview

This codebook documents all data sources, variables, and methodological decisions underlying the descriptive analysis of LGBTQ+ candidates in Brazil’s 2024 municipal elections. It is intended as a reference for reproducibility and for researchers who wish to extend or replicate this work.

Code chunks in this document illustrate the logic of data construction. Most have eval: false and do not execute; the session info section at the end evaluates to capture the software environment.

2 Source Files

The analysis draws on data from four sources: (1) the TSE (Tribunal Superior Eleitoral) official candidate registration files, (2) TSE campaign finance records, (3) TSE candidate manifesto PDFs (propostas de governo), and (4) the VOTE LGBT candidate identification project. Geographic boundaries come from the IBGE via the geobr R package.

# All paths are defined in code/00_setup.R
# Parent project path is set via QPP_DATA_DIR environment variable
# This project uses here::here() for relative paths
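
As a rough illustration of the comments above, path resolution in a setup script might look like the following. This is a sketch, not the contents of code/00_setup.R: the `resolve_paths()` helper and the list element names other than `analysis_full_rds` are assumptions, and `file.path()` stands in for `here::here()` so the example has no package dependencies.

```r
# Hypothetical helper: builds a path list from the parent-project directory
# (read from the QPP_DATA_DIR environment variable) and this project's root.
resolve_paths <- function(project_root = ".",
                          parent_dir = Sys.getenv("QPP_DATA_DIR", unset = NA)) {
  if (is.na(parent_dir) || !nzchar(parent_dir)) {
    stop("Set the QPP_DATA_DIR environment variable to the parent project folder.")
  }
  list(
    # Inputs that live in the parent project
    candidates_csv = file.path(parent_dir, "Data", "Processed",
                               "candidates_2024_full_integrated.csv"),
    finance_raw_csv = file.path(parent_dir, "Data", "Brazil", "Electoral data",
                                "receitas_candidatos_2024_BRASIL.csv"),
    # Outputs that live in this project
    analysis_full_rds = file.path(project_root, "data", "derived",
                                  "analysis_full.rds")
  )
}

# "/path/to/parent" is a placeholder, not a real location.
paths <- resolve_paths(project_root = ".", parent_dir = "/path/to/parent")
paths$analysis_full_rds
```

Failing early when the environment variable is unset keeps later scripts from silently reading from the wrong directory.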

File	Path	Format	Encoding	Description
`candidates_2024_full_integrated.csv`	`Data/Processed/` (parent)	CSV	UTF-8	Full integrated candidate dataset from TSE 2024, with LGBTQ+ flags from VOTE LGBT matching. ~464K rows, one per candidacy.
`lgbt_matched_details.csv`	`Data/Processed/` (parent)	CSV	UTF-8	Detailed LGBTQ+ matching results: candidate ID, match method (exact/fuzzy), sexual orientation, gender identity, source (TSE disclosure vs VOTE LGBT list).
`party_ideology_reference.csv`	`Data/Processed/` (parent)	CSV	UTF-8	Party-level ideology scores from Bolognesi et al. expert survey. Columns: party abbreviation, ideology score (0-10), ideology category (Left/Center/Right).
`receitas_candidatos_2024_BRASIL.csv`	`Data/Brazil/Electoral data/` (parent)	CSV (`;` delimited)	Latin1	Raw TSE campaign finance file. ~5.5M transactions, 1.3 GB. Comma decimal separator, semicolon field separator.
`campaign_finance_aggregated.csv`	`Data/Processed/intermediate/` (parent)	CSV	UTF-8	Pre-existing candidate-level finance aggregation from parent project. Used for validation against our independent aggregation.
`crosswalk_tse_to_geobr.csv`	`Data/Processed/intermediate/` (parent)	CSV	UTF-8	Crosswalk between TSE municipality codes and IBGE/geobr municipality codes.
`municipality_crosswalk_master.csv`	`Data/Processed/intermediate/` (parent)	CSV	UTF-8	Master municipality crosswalk with TSE code, IBGE code, municipality name, state, and population estimates.
Candidate manifesto ZIPs (27 files)	`Data/Brazil/Candidate manifestos/` (parent)	ZIP/PDF	Latin1	TSE propostas de governo for mayoral candidates. ~15,800 PDFs across 27 state-level ZIP archives (~11 GB). PDF filenames embed `SQ_CANDIDATO`.

2.0.1 Derived Data Files

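These files are created by this project’s data preparation scripts and stored in data/derived/, except for the cached boundary files (municipalities.rds and states.rds), which the execution order in Section 8.2 lists under data/geo_cache/.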

File	Script	Format	Description
`candidates_analysis.rds`	`01_load_candidates.R`	RDS	Cleaned candidate dataset with recoded variables, LGBTQ+ flags, ideology scores, and simplified demographics.
`finance_by_candidate.rds`	`02_load_finance_raw.R`	RDS	Candidate-level campaign finance summary: total revenue, funding source amounts and shares, donor counts.
`finance_transactions.rds`	`02_load_finance_raw.R`	RDS	Transaction-level finance data with classified funding types. Large file (~200MB).
`manifestos_text.rds`	`05_process_manifestos.R`	RDS	Extracted manifesto text from ~15,800 mayoral candidate PDFs. Includes text, page/word counts, and extraction success flag.
`analysis_full.rds`	`04_build_analysis_data.R`	RDS	Final merged dataset: candidates + finance + geography + manifestos. This is the primary analysis file used by all QMD chapters.
`municipalities.rds`	`03_prepare_geography.R`	RDS	Cached municipality boundary shapefiles from `geobr::read_municipality(year = 2022)`.
`states.rds`	`03_prepare_geography.R`	RDS	Cached state boundary shapefiles from `geobr::read_state(year = 2020)`.

2.0.2 Output Files

Directory	Contents
`output/tables/`	CSV exports of key summary tables, including `finance_revenue_source_categories.csv`.
`output/figures/`	PNG and PDF exports of all figures, generated by `save_figure()`.

3 Variable Dictionary

3.1 Core Candidate Variables

Variable	Type	Description	Source	Example Values	Missing
`candidate_id`	character	TSE sequential candidate identifier (`SQ_CANDIDATO`)	TSE registration	`"280000000001"`	0%
`lgbtq_candidate`	logical	Whether candidate is identified as LGBTQ+ (TSE self-disclosure OR VOTE LGBT match)	TSE + VOTE LGBT	`TRUE`, `FALSE`	0%
`trans_candidate`	logical	Whether candidate is identified as transgender (subset of LGBTQ+)	TSE + VOTE LGBT	`TRUE`, `FALSE`	0%
`lgbtq_label`	factor	Binary label for plotting	Derived	`"LGBTQ+"`, `"Non-LGBTQ+"`	0%
`lgbt_category`	factor	Disaggregated identity category. Trans is prioritized over sexual orientation.	Derived	`"Non-LGBTQ+"`, `"Gay"`, `"Lesbian"`, `"Bisexual+"`, `"Trans"`, `"Asexual"`, `"Other LGBTQ+"`	0%
`female`	logical	Whether candidate’s registered gender is female	TSE	`TRUE`, `FALSE`	<1%
`nonwhite`	logical	Whether candidate’s race is not “Branca” (White)	TSE	`TRUE`, `FALSE`	~1%
`race_simple`	factor	Simplified race category	Derived from TSE `DS_COR_RACA`	`"White"`, `"Black"`, `"Brown"`, `"Other"`	~1%
`age`	numeric	Candidate age at election date	TSE	`18` to `100+`	<1%
`age_group`	factor	Binned age categories	Derived	`"18-29"`, `"30-39"`, `"40-49"`, `"50-59"`, `"60+"`	<1%
`education_simple`	factor	Simplified education level (3 categories)	Derived from TSE `DS_GRAU_INSTRUCAO`	`"Less than HS"`, `"High School"`, `"College+"`	<1%
`party_abbrev`	character	Party abbreviation	TSE	`"PT"`, `"PL"`, `"MDB"`	0%
`ideology_score`	numeric	Party ideology score (0 = far left, 10 = far right)	Bolognesi et al.	`1.5` to `9.2`	~2% (minor parties)
`ideology_category`	factor	Left/Center/Right classification	Derived from `ideology_score`	`"Left"`, `"Center"`, `"Right"`	~2%
`position_simple`	factor	Simplified position type	TSE	`"City Councilor"`, `"Mayor"`, `"Vice Mayor"`	0%
`elected`	logical	Whether candidate was elected	TSE	`TRUE`, `FALSE`	~5.4%
`state_abbrev`	character	Two-letter state abbreviation	TSE	`"SP"`, `"RJ"`, `"BA"`	0%
`region`	factor	Brazilian macro-region	Derived from `state_abbrev`	`"North"`, `"Northeast"`, `"Center-West"`, `"Southeast"`, `"South"`	0%
`geobr_code`	numeric	IBGE municipality code (7 digits) compatible with `geobr`	Crosswalk	`3550308` (Sao Paulo)	<1%

3.2 Campaign Finance Variables

These variables are merged from finance_by_candidate.rds onto the candidate dataset. Candidates with no finance records have all finance variables set to 0.

Variable	Type	Description	Source
`total_revenue`	numeric	Total campaign revenue in R$	TSE receitas
`n_transactions`	integer	Number of revenue transactions	TSE receitas
`n_unique_donors`	integer	Number of distinct donor IDs	TSE receitas
`self_funding_amt`	numeric	Revenue from candidate’s own resources (R$)	Classified from `DS_ORIGEM_RECEITA`
`party_funding_amt`	numeric	Revenue from party transfers (R$)	Classified
`individual_funding_amt`	numeric	Revenue from individual donors (R$)	Classified
`crowdfunding_amt`	numeric	Revenue from crowdfunding platforms (R$)	Classified
`pct_self`	numeric	Self-funding as % of total revenue (0-100 scale)	Derived
`pct_party`	numeric	Party funding as % of total (0-100)	Derived
`pct_individual`	numeric	Individual funding as % of total (0-100)	Derived
`pct_crowdfunding`	numeric	Crowdfunding as % of total (0-100)	Derived
`financial_amt`	numeric	Total financial (cash) contributions (R$)	From `DS_NATUREZA_RECEITA == "FINANCEIRO"`
`inkind_amt`	numeric	Total in-kind/estimated contributions (R$)	From `DS_NATUREZA_RECEITA != "FINANCEIRO"`
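
The amount and share variables above can be derived from transaction-level records roughly as follows. This is a minimal base-R sketch on toy data, not the code in 02_load_finance_raw.R; the zero-revenue guard is an assumption that mirrors the rule that candidates without finance records get 0 for all finance variables.

```r
# Toy transactions using the column names from the transaction-level table.
tx <- data.frame(
  SQ_CANDIDATO = c("A", "A", "A", "B"),
  funding_type = c("self_funding", "party_funding", "party_funding",
                   "individual_funding"),
  VR_RECEITA   = c(100, 300, 600, 50),
  stringsAsFactors = FALSE
)

# Sum VR_RECEITA within candidate x funding type; absent combinations become 0.
amounts <- tapply(tx$VR_RECEITA, list(tx$SQ_CANDIDATO, tx$funding_type), sum)
amounts[is.na(amounts)] <- 0

total_revenue <- rowSums(amounts)

# Shares on a 0-100 scale; each row is divided by that candidate's total.
shares <- 100 * amounts / total_revenue
shares[total_revenue == 0, ] <- 0  # assumed guard against division by zero

finance_by_candidate <- data.frame(
  SQ_CANDIDATO     = rownames(amounts),
  total_revenue    = as.numeric(total_revenue),
  self_funding_amt = amounts[, "self_funding"],
  pct_self         = shares[, "self_funding"],
  pct_party        = shares[, "party_funding"],
  row.names = NULL
)
finance_by_candidate
```

In this toy example candidate A has R$ 1,000 in total revenue, so `pct_self` is 10 and `pct_party` is 90.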

3.3 Manifesto Variables

These variables are merged from manifestos_text.rds onto the candidate dataset. Only mayoral (PREFEITO) candidates have manifestos; all other candidates have has_manifesto = FALSE and NA for text fields.

Variable	Type	Description	Source
`has_manifesto`	logical	Whether a manifesto PDF was found and text successfully extracted	`05_process_manifestos.R`
`manifesto_text`	character	Full extracted text of the candidate’s proposta de governo	`pdftotext` extraction
`manifesto_n_pages`	integer	Page count of the manifesto PDF	`pdfinfo`
`manifesto_n_words`	integer	Word count of the extracted manifesto text	Derived
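
The codebook does not state how `manifesto_n_words` is tokenized. One plausible definition, shown here only as an assumption, counts whitespace-separated tokens in the extracted text; `count_words()` is a hypothetical helper name.

```r
# Hypothetical word counter: whitespace-delimited tokens, NA for missing text.
count_words <- function(text) {
  tokens <- strsplit(trimws(text), "\\s+")
  n <- lengths(tokens)
  n[is.na(text)] <- NA_integer_  # keep NA for candidates without extracted text
  n
}

count_words(c("Plano de governo 2025-2028", "", NA))
```

Under this definition an empty extraction counts as zero words, while a missing text stays `NA`, which keeps failed extractions distinguishable from candidates who have no manifesto.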

3.4 Transaction-Level Finance Variables

Available in finance_transactions.rds for detailed analysis.

Variable	Type	Description
`SQ_CANDIDATO`	character	Candidate sequential ID (join key)
`DS_ORIGEM_RECEITA`	character	TSE revenue source category (Portuguese)
`DS_NATUREZA_RECEITA`	character	Financial vs in-kind classification
`DS_ESPECIE_RECEITA`	character	Payment method (PIX, bank transfer, check, etc.)
`DS_NATUREZA_RECURSO_ESTIMAVEL`	character	Type of in-kind contribution (when applicable)
`VR_RECEITA`	numeric	Transaction amount in R$
`funding_type`	character	Our classified funding type: `self_funding`, `party_funding`, `individual_funding`, `crowdfunding`, `other_candidates`, `other`
`NR_CPF_CNPJ_DOADOR`	character	Donor CPF (individual) or CNPJ (entity)
`NM_DOADOR`	character	Donor name
`DS_GENERO`	character	Donor gender (TSE categories)
`DS_COR_RACA`	character	Donor race/color (TSE categories)
`DT_RECEITA`	character	Transaction date

4 LGBTQ+ Identification Methodology

4.1 Two Sources of Identification

LGBTQ+ candidates are identified through two complementary mechanisms:

4.1.1 1. TSE Self-Disclosure (New in 2024)

For the first time, the TSE’s 2024 candidate registration form included optional fields for:

Sexual orientation: Heterossexual, Gay, Lesbica, Bissexual, Pansexual, Assexual, or open text
Gender identity: Cisgenero, Transgenero/Travesti, Nao-binario, or open text

Candidates who reported any non-heterosexual orientation or non-cisgender identity are flagged as LGBTQ+. This is the primary identification source and captures the majority of identified LGBTQ+ candidates.

# TSE self-disclosure logic (simplified)
tse_lgbtq <- sexual_orientation != "Heterossexual" |
             gender_identity != "Cisgenero"
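
Because both fields are optional, many candidates will have no response, and in R a comparison against `NA` yields `NA` rather than `FALSE`. The sketch below shows one NA-safe way to write the same rule. It assumes non-responses are stored as `NA` and uses the category spellings from this codebook; the actual TSE file may encode non-disclosure differently, and `flag_tse_lgbtq()` is a hypothetical name.

```r
# Hypothetical NA-safe version of the simplified rule: a candidate is flagged
# only when a non-heterosexual or non-cisgender response was actually given.
flag_tse_lgbtq <- function(sexual_orientation, gender_identity) {
  non_hetero <- !is.na(sexual_orientation) & sexual_orientation != "Heterossexual"
  non_cis    <- !is.na(gender_identity) & gender_identity != "Cisgenero"
  non_hetero | non_cis
}

# Toy responses: the third candidate left sexual orientation blank.
sexual_orientation <- c("Heterossexual", "Gay", NA, "Heterossexual")
gender_identity    <- c("Cisgenero", "Cisgenero", "Cisgenero", "Transgenero/Travesti")

tse_lgbtq <- flag_tse_lgbtq(sexual_orientation, gender_identity)
tse_lgbtq
```

With this formulation a blank response never produces a flag on its own, so the result is always `TRUE` or `FALSE`, consistent with the 0% missingness reported for `lgbtq_candidate`.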

4.1.2 2. VOTE LGBT Candidate List

VOTE LGBT (Voto com Orgulho) is a civil society organization that identifies and supports LGBTQ+ candidates in Brazilian elections. Their list includes candidates who:

Self-identify to the organization as LGBTQ+
Are publicly known LGBTQ+ community members
Are nominated by LGBTQ+ organizations

The matching between VOTE LGBT names and TSE candidate records uses a two-step process:

# Step 1: Exact match on candidate name (after normalization)
# - Remove accents, standardize spacing, uppercase
# - Match on full name, with state as secondary filter

# Step 2: Fuzzy match for remaining unmatched
# - String distance (Jaro-Winkler) with threshold
# - Manual review of ambiguous matches
# - State + party used as disambiguation
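
Step 1 above can be sketched in base R as follows. The `normalize_name()` helper, the toy records, and the choice to combine name and state into a single match key are assumptions for illustration. The fuzzy Jaro-Winkler step and the manual review are not reproduced here.

```r
# Hypothetical normalizer: strip common Portuguese accents, uppercase,
# and collapse repeated whitespace.
normalize_name <- function(x) {
  accented <- "áàâãäéèêëíìîïóòôõöúùûüçñÁÀÂÃÄÉÈÊËÍÌÎÏÓÒÔÕÖÚÙÛÜÇÑ"
  plain    <- "aaaaaeeeeiiiiooooouuuucnAAAAAEEEEIIIIOOOOOUUUUCN"
  x <- chartr(accented, plain, x)
  x <- toupper(x)
  x <- gsub("\\s+", " ", x)
  trimws(x)
}

# Toy records (not real candidates).
vote_lgbt <- data.frame(
  name  = c("João  da Silva", "Maria Conceição", "Ana Souza"),
  state = c("SP", "BA", "RS"),
  stringsAsFactors = FALSE
)
tse <- data.frame(
  candidate_id = c("1", "2", "3"),
  name  = c("JOAO DA SILVA", "JOAO DA SILVA", "MARIA CONCEICAO"),
  state = c("SP", "RJ", "BA"),
  stringsAsFactors = FALSE
)

# Exact match on normalized name within state.
make_key <- function(name, state) paste(normalize_name(name), state, sep = "|")
idx <- match(make_key(vote_lgbt$name, vote_lgbt$state),
             make_key(tse$name, tse$state))

vote_lgbt$candidate_id <- tse$candidate_id[idx]
unmatched <- vote_lgbt[is.na(idx), ]  # would go on to the fuzzy step
```

Including the state in the key keeps two candidates who share a name in different states from being conflated, which is the role the comments assign to state as a secondary filter.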

4.1.3 Priority Rules for Identity Categories

When a candidate is identified through both sources, or when multiple identity labels apply, the following priority rules determine the lgbt_category assignment:

# Priority rules (implemented in make_lgbt_category()):
# 1. Non-LGBTQ+ candidates: lgbtq_candidate == FALSE -> "Non-LGBTQ+"
# 2. Trans identity is prioritized: trans_candidate == TRUE -> "Trans"
#    (regardless of sexual orientation, since trans is a gender identity
#     that cross-cuts sexual orientation)
# 3. Sexual orientation categories in order:
#    - "Gay" (male homosexual)
#    - "Lesbian" (female homosexual, "Lesbica" in Portuguese)
#    - "Bisexual+" (bisexual or pansexual, collapsed)
#    - "Asexual"
#    - "Other LGBTQ+" (any remaining identified candidate)
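
The priority rules can be expressed as a sequence of overwrites, applied from lowest to highest priority. The function below is an illustrative reconstruction, not the project's `make_lgbt_category()`; its name, its arguments, and the orientation spellings (taken from the TSE field list in this codebook) are assumptions.

```r
# Hypothetical reconstruction of the priority rules. Later assignments
# overwrite earlier ones, so the highest-priority rule is applied last.
make_lgbt_category_sketch <- function(lgbtq_candidate, trans_candidate,
                                      sexual_orientation) {
  category_levels <- c("Non-LGBTQ+", "Gay", "Lesbian", "Bisexual+",
                       "Trans", "Asexual", "Other LGBTQ+")
  so <- ifelse(is.na(sexual_orientation), "", sexual_orientation)

  out <- rep("Other LGBTQ+", length(lgbtq_candidate))  # fallback category
  out[so == "Assexual"] <- "Asexual"
  out[so %in% c("Bissexual", "Pansexual")] <- "Bisexual+"
  out[so == "Lesbica"] <- "Lesbian"
  out[so == "Gay"] <- "Gay"
  out[trans_candidate] <- "Trans"          # rule 2: trans takes precedence
  out[!lgbtq_candidate] <- "Non-LGBTQ+"    # rule 1: applied last, wins overall

  factor(out, levels = category_levels)
}

# Toy inputs covering each branch.
lgbtq <- c(FALSE, TRUE, TRUE, TRUE, TRUE, TRUE)
trans <- c(FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)
orientation <- c("Heterossexual", "Lesbica", "Gay", "Pansexual", "Assexual", NA)

lgbt_category <- make_lgbt_category_sketch(lgbtq, trans, orientation)
as.character(lgbt_category)
```

The second toy candidate reports both a trans identity and a lesbian orientation and is assigned "Trans", which is the behavior rule 2 describes.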

Why Prioritize Trans?

Transgender identity is a gender identity, not a sexual orientation. A trans candidate may also be gay, lesbian, or bisexual, but for the purpose of this analysis, their gender identity takes precedence in the primary categorization. This follows conventions in both the LGBTQ+ studies literature and in Brazilian activist nomenclature. The “Bisexual+” category collapses bisexual and pansexual identities, which share the key feature of attraction to more than one gender.

5 Party Ideology Scores

5.1 Source

Party ideology scores come from:

Bolognesi, B., Ribeiro, E., & Codato, A. (2023). “Ideologia dos partidos politicos brasileiros.” Expert survey of Brazilian party ideology.

5.2 Scale

Range: 0 (far left) to 10 (far right)
Method: Expert survey with multiple respondents per party
Coverage: All major and most minor parties in the Brazilian system

5.3 Classification Thresholds

# Thresholds used in this analysis:
# Left:   ideology_score < 4.0
# Center: ideology_score >= 4.0 and < 7.1
# Right:  ideology_score >= 7.1

# These thresholds are based on:
# 1. Natural breaks in the expert survey distribution
# 2. Conventional placement of anchor parties:
#    - PT (Workers' Party) ~ 2.0 (clearly Left)
#    - MDB (centrist catch-all) ~ 5.5 (Center)
#    - PL (Bolsonaro's party) ~ 8.5 (clearly Right)
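
One way to apply these thresholds is with base R's `cut()`, using left-closed intervals so that a score of exactly 4.0 falls in Center and exactly 7.1 falls in Right. This is a sketch of the rule as stated, not necessarily how the project implements it; the scores are the approximate anchor values from the comments above.

```r
# Approximate anchor scores from the comments above, plus the two boundary
# values and one unscored party (NA).
ideology_score <- c(PT = 2.0, boundary_low = 4.0, MDB = 5.5,
                    boundary_high = 7.1, PL = 8.5, unscored = NA)

# right = FALSE makes intervals left-closed: [-Inf, 4), [4, 7.1), [7.1, Inf)
ideology_category <- cut(
  ideology_score,
  breaks = c(-Inf, 4.0, 7.1, Inf),
  labels = c("Left", "Center", "Right"),
  right = FALSE
)

data.frame(score = ideology_score, category = ideology_category)
```

Unscored parties remain `NA` after `cut()`, so they drop out of ideology-stratified summaries without any extra filtering.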

5.4 Unscored Parties

Approximately 2% of candidates belong to minor parties not covered by the expert survey. These candidates have ideology_score = NA and ideology_category = NA. They are excluded from ideology-stratified analyses but included in all other analyses.

6 Municipality Crosswalk

6.1 The TSE-IBGE Code Problem

The TSE uses its own municipality coding system, which differs from the IBGE’s official 7-digit municipality codes used by statistical agencies and the geobr package. The crosswalk resolves this mismatch.

# Crosswalk construction:
# 1. Start with TSE municipality code + name + state
# 2. Match to IBGE municipality code via:
#    a. Exact name match within state (handles ~95% of cases)
#    b. Fuzzy name match for municipalities with accent/spelling differences
#    c. Manual resolution for remaining cases (mergers, name changes)
# 3. IBGE 7-digit code (geobr_code) enables spatial joins with geobr shapefiles

# The geobr package uses code_muni, which is the IBGE 7-digit code.
# Join: candidates$geobr_code == muni_sf$code_muni
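
The two joins described above can be sketched with base R's `merge()`. All data here are toy values: the TSE codes are placeholders, `tse_muni_code` is a hypothetical column name, and a plain data frame stands in for the `sf` object that `geobr` would return. Only `geobr_code` and `code_muni` come from this codebook.

```r
# Toy candidates keyed by a placeholder TSE municipality code.
candidates <- data.frame(
  candidate_id  = c("1", "2", "3"),
  tse_muni_code = c(10001, 10002, 10003),
  stringsAsFactors = FALSE
)

# Toy crosswalk: the third TSE code has no IBGE counterpart.
crosswalk <- data.frame(
  tse_muni_code = c(10001, 10002),
  geobr_code    = c(3550308, 3304557)
)

# Stand-in for the geobr municipality layer (geometry column omitted).
muni <- data.frame(
  code_muni = c(3550308, 3304557),
  name_muni = c("Sao Paulo", "Rio de Janeiro"),
  stringsAsFactors = FALSE
)

# Left joins keep every candidacy, including those without a usable code.
with_code <- merge(candidates, crosswalk, by = "tse_muni_code", all.x = TRUE)
with_geo  <- merge(with_code, muni, by.x = "geobr_code", by.y = "code_muni",
                   all.x = TRUE)

# Share of candidacies that cannot be mapped.
share_unmapped <- mean(is.na(with_geo$geobr_code))
share_unmapped
```

Using left joins at both steps means unmapped candidacies stay in the data with `NA` geography instead of being dropped.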

6.2 Coverage

The crosswalk covers all 5,570 Brazilian municipalities as of 2022. A small number of candidates (<1%) have missing geobr_code due to:

TSE municipality codes for overseas voting sections (not mappable to domestic municipalities)
Rare municipality code changes between TSE and IBGE systems

7 Data Quality Notes

7.1 Known Issues

7.1.1 1. Missing Election Outcomes (~5.4%)

Approximately 5.4% of candidates have elected = NA. This occurs because:

Some candidacies were annulled or withdrawn after registration but before results were finalized
A small number of substitute candidates (suplentes) have ambiguous outcome status
The DS_SIT_TOT_TURNO field in the TSE data has values that do not map cleanly to elected/not-elected

Handling: These candidates are excluded from election rate calculations but included in all other analyses (demographics, finance, geography).
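
A minimal sketch of this handling on toy data: rows with `elected = NA` are removed only for the rate calculation, while counts that do not depend on the outcome still use every candidacy.

```r
# Toy data: one candidacy per group has an unknown outcome.
d <- data.frame(
  lgbtq_label = c("LGBTQ+", "LGBTQ+", "LGBTQ+",
                  "Non-LGBTQ+", "Non-LGBTQ+", "Non-LGBTQ+", "Non-LGBTQ+"),
  elected = c(TRUE, FALSE, NA, TRUE, FALSE, FALSE, NA),
  stringsAsFactors = FALSE
)

# Election rates: candidacies with a known outcome only.
known <- d[!is.na(d$elected), ]
election_rate <- tapply(known$elected, known$lgbtq_label, mean)

# Other summaries: all candidacies, including those with elected = NA.
n_candidacies <- table(d$lgbtq_label)

election_rate
n_candidacies
```

Reporting the denominator alongside each rate makes it clear how many candidacies were excluded for missing outcomes.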

7.1.2 2. Finance File Parsing

The raw TSE finance file (receitas_candidatos_2024_BRASIL.csv) is 1.3 GB with semicolon delimiters and Latin1 encoding. Known issues:

Decimal separator: Commas are used as decimal separators (Brazilian convention). We specify locale(decimal_mark = ",") in read_delim().
Parsing warnings: A small number of rows (~0.01%) produce parsing warnings due to malformed fields. These are dropped silently.
Duplicate transactions: We do not de-duplicate transactions, as the TSE may legitimately record corrections and adjustments as separate entries.
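
The parsing settings described above can be sketched with `readr` on a few inline rows. The sample values are illustrative, the column selection is an assumption, and the inline text takes the place of the 1.3 GB file. For the real Latin1 file, `encoding = "latin1"` would be added to `locale()`.

```r
library(readr)

# Three illustrative lines in the TSE layout: ";" separates fields and
# "," marks decimals.
raw_lines <- c(
  'SQ_CANDIDATO;DS_ORIGEM_RECEITA;VR_RECEITA',
  '"280000000001";"Recursos proprios";"1500,50"',
  '"280000000002";"Recursos de partido politico";"20000,00"'
)

finance <- read_delim(
  I(paste(raw_lines, collapse = "\n")),  # I() marks the string as literal data
  delim = ";",
  locale = locale(decimal_mark = ",", grouping_mark = "."),
  col_types = cols(
    SQ_CANDIDATO      = col_character(),  # keep IDs as text to avoid precision loss
    DS_ORIGEM_RECEITA = col_character(),
    VR_RECEITA        = col_double()
  )
)

# Inspect parsing failures explicitly instead of relying on warnings.
parse_issues <- problems(finance)
nrow(parse_issues)
```

Checking `problems()` after the read gives a record of exactly which rows failed to parse, which is useful when a small share of rows is dropped.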

7.1.3 3. LGBTQ+ Identification Coverage

The LGBTQ+ identification methodology has three important limitations:

False negatives: Candidates who are LGBTQ+ but did not disclose to the TSE and were not identified by VOTE LGBT are classified as Non-LGBTQ+. The true LGBTQ+ count is almost certainly higher than our identified count.
False positives: Very rare, but possible through fuzzy name matching. All fuzzy matches were manually reviewed.
Regional variation in disclosure: Self-disclosure rates likely vary by region, urbanization, and local political culture. The Southeast and South may have higher disclosure rates than the North and Northeast. This means geographic comparisons of LGBTQ+ candidate rates confound true prevalence with disclosure propensity.

7.1.4 4. Race and Gender Categories

TSE race categories use the Brazilian census classification: Branca (White), Preta (Black), Parda (Brown), Amarela (Asian), and Indigena (Indigenous). We simplify to:

White: Branca
Nonwhite: All other categories (for binary analyses)
Detailed: White, Black, Brown, Other (collapsing Amarela, Indigena, and missing)

The female variable is based on the TSE’s DS_GENERO field. For trans candidates, this may reflect their legal gender rather than their gender assigned at birth. We use the TSE-recorded gender throughout, which aligns with Brazilian legal identity norms post-2018 (when the Supremo Tribunal Federal allowed name/gender changes without surgery).

7.1.5 5. Candidate vs Person

The unit of analysis is the candidacy, not the individual person. In rare cases, the same person may appear as a candidate in more than one municipality (e.g., if they withdrew and re-registered). We do not de-duplicate at the person level, as the candidacy is the relevant unit for campaign finance and electoral outcome analysis.

8 Reproducibility

8.1 Software Environment

source(here::here("code", "00_setup.R"))
library(sf)
library(geobr)

cat("Analysis data path:\n")

Analysis data path:

cat("  ", paths$analysis_full_rds, "\n")

   /Users/aloport/Library/CloudStorage/Dropbox/Research/Prep/brazil_lgbtq_descriptives/data/derived/analysis_full.rds

cat("  Exists:", file.exists(paths$analysis_full_rds), "\n\n")

  Exists: TRUE

sessionInfo()

R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS 26.2

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] C.UTF-8/C.UTF-8/C.UTF-8/C/C.UTF-8/C.UTF-8

time zone: Europe/London
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] geobr_1.9.1      sf_1.0-20        kableExtra_1.4.0 knitr_1.48      
 [5] patchwork_1.3.0  gt_0.11.0        scales_1.4.0     lubridate_1.9.3 
 [9] forcats_1.0.0    stringr_1.5.1    dplyr_1.1.4      purrr_1.2.0     
[13] readr_2.1.5      tidyr_1.3.1      tibble_3.2.1     ggplot2_4.0.0   
[17] tidyverse_2.0.0 

loaded via a namespace (and not attached):
 [1] utf8_1.2.4         generics_0.1.3     class_7.3-22       xml2_1.3.6        
 [5] KernSmooth_2.23-24 stringi_1.8.4      hms_1.1.3          digest_0.6.37     
 [9] magrittr_2.0.4     evaluate_1.0.3     grid_4.4.1         timechange_0.3.0  
[13] RColorBrewer_1.1-3 fastmap_1.2.0      rprojroot_2.0.4    jsonlite_2.0.0    
[17] e1071_1.7-16       DBI_1.2.3          fansi_1.0.6        viridisLite_0.4.2 
[21] cli_3.6.5          rlang_1.1.6        units_0.8-7        withr_3.0.2       
[25] yaml_2.3.10        tools_4.4.1        tzdb_0.5.0         here_1.0.1        
[29] curl_7.0.0         vctrs_0.6.5        R6_2.6.1           proxy_0.4-27      
[33] classInt_0.4-10    lifecycle_1.0.4    htmlwidgets_1.6.4  pkgconfig_2.0.3   
[37] pillar_1.9.0       gtable_0.3.6       data.table_1.17.0  Rcpp_1.1.0        
[41] glue_1.8.0         systemfonts_1.1.0  xfun_0.47          tidyselect_1.2.1  
[45] rstudioapi_0.17.1  farver_2.1.2       htmltools_0.5.8.1  rmarkdown_2.28    
[49] svglite_2.1.3      compiler_4.4.1     S7_0.2.0

8.2 Execution Order

The data preparation scripts must be run in order before rendering the QMD chapters:

# 1. code/01_load_candidates.R       -> data/derived/candidates_analysis.rds
# 2. code/02_load_finance_raw.R      -> data/derived/finance_by_candidate.rds
#                                       data/derived/finance_transactions.rds
#                                       output/tables/finance_revenue_source_categories.csv
# 3. code/03_prepare_geography.R     -> data/geo_cache/municipalities.rds
#                                       data/geo_cache/states.rds
# 4. code/05_process_manifestos.R    -> data/derived/manifestos_text.rds
# 5. code/04_build_analysis_data.R   -> data/derived/analysis_full.rds
#
# After all five scripts have run:
# 6. quarto render docs/             -> docs/_site/ (rendered website)
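
The order above could be enforced with a small runner such as the one below. `run_pipeline()` is a hypothetical helper, not part of the project. It sources each script in the listed order and stops before running anything if a script is missing.

```r
# Script order from the list above: 05 runs before 04 because the final
# merge needs the manifesto text.
pipeline_scripts <- c(
  "01_load_candidates.R",
  "02_load_finance_raw.R",
  "03_prepare_geography.R",
  "05_process_manifestos.R",
  "04_build_analysis_data.R"
)

# Hypothetical runner: checks that every script exists, then sources in order.
run_pipeline <- function(code_dir = "code", scripts = pipeline_scripts) {
  script_paths <- file.path(code_dir, scripts)
  missing_scripts <- script_paths[!file.exists(script_paths)]
  if (length(missing_scripts) > 0) {
    stop("Missing scripts: ", paste(missing_scripts, collapse = ", "))
  }
  for (script in script_paths) {
    message("Running ", script)
    source(script)
  }
  invisible(script_paths)
}

# Usage, from the project root:
# run_pipeline("code")
```

Keeping the order in a single vector avoids relying on alphabetical file order, which would run 04 before 05.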

8.3 Data Access

The parent project data is stored on Dropbox and is not publicly available. Researchers wishing to replicate this analysis should:

Download the raw TSE candidate registration files from https://dadosabertos.tse.jus.br/
Download the raw TSE campaign finance files from the same portal
Contact VOTE LGBT for their candidate identification list
Run the parent project’s integration scripts to produce the processed files
Run this project’s data preparation scripts in order (01, 02, 03, 05, then 04)

All code in this project is designed to be portable given the correct paths configuration in code/00_setup.R.

--- title: "Data Codebook" subtitle: "Source Files, Variable Dictionary, and Methodology" --- # Overview This codebook documents all data sources, variables, and methodological decisions underlying the descriptive analysis of LGBTQ+ candidates in Brazil's 2024 municipal elections. It is intended as a reference for reproducibility and for researchers who wish to extend or replicate this work. Code chunks in this document illustrate the logic of data construction. Most have `eval: false` and do not execute; the session info section at the end evaluates to capture the software environment. # Source Files The analysis draws on data from four sources: (1) the TSE (*Tribunal Superior Eleitoral*) official candidate registration files, (2) TSE campaign finance records, (3) TSE candidate manifesto PDFs (*propostas de governo*), and (4) the VOTE LGBT candidate identification project. Geographic boundaries come from the IBGE via the `geobr` R package. ```{r source-files} #| label: source-files #| eval: false # All paths are defined in code/00_setup.R # Parent project path is set via QPP_DATA_DIR environment variable # This project uses here::here() for relative paths ``` | File | Path | Format | Encoding | Description | |------|------|--------|----------|-------------| | `candidates_2024_full_integrated.csv` | `Data/Processed/` (parent) | CSV | UTF-8 | Full integrated candidate dataset from TSE 2024, with LGBTQ+ flags from VOTE LGBT matching. ~464K rows, one per candidacy. | | `lgbt_matched_details.csv` | `Data/Processed/` (parent) | CSV | UTF-8 | Detailed LGBTQ+ matching results: candidate ID, match method (exact/fuzzy), sexual orientation, gender identity, source (TSE disclosure vs VOTE LGBT list). | | `party_ideology_reference.csv` | `Data/Processed/` (parent) | CSV | UTF-8 | Party-level ideology scores from Bolognesi et al. expert survey. Columns: party abbreviation, ideology score (0-10), ideology category (Left/Center/Right). 
| | `receitas_candidatos_2024_BRASIL.csv` | `Data/Brazil/Electoral data/` (parent) | CSV (`;` delimited) | Latin1 | Raw TSE campaign finance file. ~5.5M transactions, 1.3 GB. Comma decimal separator, semicolon field separator. | | `campaign_finance_aggregated.csv` | `Data/Processed/intermediate/` (parent) | CSV | UTF-8 | Pre-existing candidate-level finance aggregation from parent project. Used for validation against our independent aggregation. | | `crosswalk_tse_to_geobr.csv` | `Data/Processed/intermediate/` (parent) | CSV | UTF-8 | Crosswalk between TSE municipality codes and IBGE/geobr municipality codes. | | `municipality_crosswalk_master.csv` | `Data/Processed/intermediate/` (parent) | CSV | UTF-8 | Master municipality crosswalk with TSE code, IBGE code, municipality name, state, and population estimates. | | Candidate manifesto ZIPs (27 files) | `Data/Brazil/Candidate manifestos/` (parent) | ZIP/PDF | Latin1 | TSE *propostas de governo* for mayoral candidates. ~15,800 PDFs across 27 state-level ZIP archives (~11 GB). PDF filenames embed `SQ_CANDIDATO`. | ### Derived Data Files These files are created by this project's data preparation scripts and stored in `data/derived/`. | File | Script | Format | Description | |------|--------|--------|-------------| | `candidates_analysis.rds` | `01_load_candidates.R` | RDS | Cleaned candidate dataset with recoded variables, LGBTQ+ flags, ideology scores, and simplified demographics. | | `finance_by_candidate.rds` | `02_load_finance_raw.R` | RDS | Candidate-level campaign finance summary: total revenue, funding source amounts and shares, donor counts. | | `finance_transactions.rds` | `02_load_finance_raw.R` | RDS | Transaction-level finance data with classified funding types. Large file (~200MB). | | `manifestos_text.rds` | `05_process_manifestos.R` | RDS | Extracted manifesto text from ~15,800 mayoral candidate PDFs. Includes text, page/word counts, and extraction success flag. 
| | `analysis_full.rds` | `04_build_analysis_data.R` | RDS | Final merged dataset: candidates + finance + geography + manifestos. This is the primary analysis file used by all QMD chapters. | | `municipalities.rds` | `03_prepare_geography.R` | RDS | Cached municipality boundary shapefiles from `geobr::read_municipality(year = 2022)`. | | `states.rds` | `03_prepare_geography.R` | RDS | Cached state boundary shapefiles from `geobr::read_state(year = 2020)`. | ### Output Files | Directory | Contents | |-----------|----------| | `output/tables/` | CSV exports of key summary tables, including `finance_revenue_source_categories.csv`. | | `output/figures/` | PNG and PDF exports of all figures, generated by `save_figure()`. | # Variable Dictionary ## Core Candidate Variables | Variable | Type | Description | Source | Example Values | Missing | |----------|------|-------------|--------|----------------|---------| | `candidate_id` | character | TSE sequential candidate identifier (`SQ_CANDIDATO`) | TSE registration | `"280000000001"` | 0% | | `lgbtq_candidate` | logical | Whether candidate is identified as LGBTQ+ (TSE self-disclosure OR VOTE LGBT match) | TSE + VOTE LGBT | `TRUE`, `FALSE` | 0% | | `trans_candidate` | logical | Whether candidate is identified as transgender (subset of LGBTQ+) | TSE + VOTE LGBT | `TRUE`, `FALSE` | 0% | | `lgbtq_label` | factor | Binary label for plotting | Derived | `"LGBTQ+"`, `"Non-LGBTQ+"` | 0% | | `lgbt_category` | factor | Disaggregated identity category. Trans is prioritized over sexual orientation. 
| Derived | `"Non-LGBTQ+"`, `"Gay"`, `"Lesbian"`, `"Bisexual+"`, `"Trans"`, `"Asexual"`, `"Other LGBTQ+"` | 0% | | `female` | logical | Whether candidate's registered gender is female | TSE | `TRUE`, `FALSE` | <1% | | `nonwhite` | logical | Whether candidate's race is not "Branca" (White) | TSE | `TRUE`, `FALSE` | ~1% | | `race_simple` | factor | Simplified race category | Derived from TSE `DS_COR_RACA` | `"White"`, `"Black"`, `"Brown"`, `"Other"` | ~1% | | `age` | numeric | Candidate age at election date | TSE | `18` to `100+` | <1% | | `age_group` | factor | Binned age categories | Derived | `"18-29"`, `"30-39"`, `"40-49"`, `"50-59"`, `"60+"` | <1% | | `education_simple` | factor | Simplified education level (3 categories) | Derived from TSE `DS_GRAU_INSTRUCAO` | `"Less than HS"`, `"High School"`, `"College+"` | <1% | | `party_abbrev` | character | Party abbreviation | TSE | `"PT"`, `"PL"`, `"MDB"` | 0% | | `ideology_score` | numeric | Party ideology score (0 = far left, 10 = far right) | Bolognesi et al. | `1.5` to `9.2` | ~2% (minor parties) | | `ideology_category` | factor | Left/Center/Right classification | Derived from `ideology_score` | `"Left"`, `"Center"`, `"Right"` | ~2% | | `position_simple` | factor | Simplified position type | TSE | `"City Councilor"`, `"Mayor"`, `"Vice Mayor"` | 0% | | `elected` | logical | Whether candidate was elected | TSE | `TRUE`, `FALSE` | ~5.4% | | `state_abbrev` | character | Two-letter state abbreviation | TSE | `"SP"`, `"RJ"`, `"BA"` | 0% | | `region` | factor | Brazilian macro-region | Derived from `state_abbrev` | `"North"`, `"Northeast"`, `"Center-West"`, `"Southeast"`, `"South"` | 0% | | `geobr_code` | numeric | IBGE municipality code (7 digits) compatible with `geobr` | Crosswalk | `3550308` (Sao Paulo) | <1% | ## Campaign Finance Variables These variables are merged from `finance_by_candidate.rds` onto the candidate dataset. Candidates with no finance records have all finance variables set to 0. 
| Variable | Type | Description | Source | |----------|------|-------------|--------| | `total_revenue` | numeric | Total campaign revenue in R$ | TSE receitas | | `n_transactions` | integer | Number of revenue transactions | TSE receitas | | `n_unique_donors` | integer | Number of distinct donor IDs | TSE receitas | | `self_funding_amt` | numeric | Revenue from candidate's own resources (R$) | Classified from `DS_ORIGEM_RECEITA` | | `party_funding_amt` | numeric | Revenue from party transfers (R$) | Classified | | `individual_funding_amt` | numeric | Revenue from individual donors (R$) | Classified | | `crowdfunding_amt` | numeric | Revenue from crowdfunding platforms (R$) | Classified | | `pct_self` | numeric | Self-funding as % of total revenue (0-100 scale) | Derived | | `pct_party` | numeric | Party funding as % of total (0-100) | Derived | | `pct_individual` | numeric | Individual funding as % of total (0-100) | Derived | | `pct_crowdfunding` | numeric | Crowdfunding as % of total (0-100) | Derived | | `financial_amt` | numeric | Total financial (cash) contributions (R$) | From `DS_NATUREZA_RECEITA == "FINANCEIRO"` | | `inkind_amt` | numeric | Total in-kind/estimated contributions (R$) | From `DS_NATUREZA_RECEITA != "FINANCEIRO"` | ## Manifesto Variables These variables are merged from `manifestos_text.rds` onto the candidate dataset. Only mayoral (PREFEITO) candidates have manifestos; all other candidates have `has_manifesto = FALSE` and `NA` for text fields. 

| Variable | Type | Description | Source |
|----------|------|-------------|--------|
| `has_manifesto` | logical | Whether a manifesto PDF was found and text successfully extracted | `05_process_manifestos.R` |
| `manifesto_text` | character | Full extracted text of the candidate's *proposta de governo* | `pdftotext` extraction |
| `manifesto_n_pages` | integer | Page count of the manifesto PDF | `pdfinfo` |
| `manifesto_n_words` | integer | Word count of the extracted manifesto text | Derived |

## Transaction-Level Finance Variables

Available in `finance_transactions.rds` for detailed analysis.

| Variable | Type | Description |
|----------|------|-------------|
| `SQ_CANDIDATO` | character | Candidate sequential ID (join key) |
| `DS_ORIGEM_RECEITA` | character | TSE revenue source category (Portuguese) |
| `DS_NATUREZA_RECEITA` | character | Financial vs in-kind classification |
| `DS_ESPECIE_RECEITA` | character | Payment method (PIX, bank transfer, check, etc.) |
| `DS_NATUREZA_RECURSO_ESTIMAVEL` | character | Type of in-kind contribution (when applicable) |
| `VR_RECEITA` | numeric | Transaction amount in R$ |
| `funding_type` | character | Our classified funding type: `self_funding`, `party_funding`, `individual_funding`, `crowdfunding`, `other_candidates`, `other` |
| `NR_CPF_CNPJ_DOADOR` | character | Donor CPF (individual) or CNPJ (entity) |
| `NM_DOADOR` | character | Donor name |
| `DS_GENERO` | character | Donor gender (TSE categories) |
| `DS_COR_RACA` | character | Donor race/color (TSE categories) |
| `DT_RECEITA` | character | Transaction date |

# LGBTQ+ Identification Methodology

## Two Sources of Identification

LGBTQ+ candidates are identified through two complementary mechanisms:

### 1. TSE Self-Disclosure (New in 2024)

For the first time, the TSE's 2024 candidate registration form included optional fields for:

- **Sexual orientation**: Heterossexual, Gay, Lesbica, Bissexual, Pansexual, Assexual, or open text
- **Gender identity**: Cisgenero, Transgenero/Travesti, Nao-binario, or open text

Candidates who reported any non-heterosexual orientation or non-cisgender identity are flagged as LGBTQ+. This is the primary identification source and captures the majority of identified LGBTQ+ candidates.

```{r tse-disclosure}
#| label: tse-disclosure
#| eval: false
# TSE self-disclosure logic (simplified)
tse_lgbtq <- sexual_orientation != "Heterossexual" |
             gender_identity != "Cisgenero"
```

### 2. VOTE LGBT Candidate List

VOTE LGBT (*Voto com Orgulho*) is a civil society organization that identifies and supports LGBTQ+ candidates in Brazilian elections. Their list includes candidates who:

- Self-identify to the organization as LGBTQ+
- Are publicly known LGBTQ+ community members
- Are nominated by LGBTQ+ organizations

The matching between VOTE LGBT names and TSE candidate records uses a two-step process:

```{r matching-logic}
#| label: matching-logic
#| eval: false
# Step 1: Exact match on candidate name (after normalization)
#   - Remove accents, standardize spacing, uppercase
#   - Match on full name, with state as secondary filter

# Step 2: Fuzzy match for remaining unmatched
#   - String distance (Jaro-Winkler) with threshold
#   - Manual review of ambiguous matches
#   - State + party used as disambiguation
```

### Priority Rules for Identity Categories

When a candidate is identified through both sources, or when multiple identity labels apply, the following priority rules determine the `lgbt_category` assignment:

```{r priority-rules}
#| label: priority-rules
#| eval: false
# Priority rules (implemented in make_lgbt_category()):
# 1. Non-LGBTQ+ candidates: lgbtq_candidate == FALSE -> "Non-LGBTQ+"
# 2. Trans identity is prioritized: trans_candidate == TRUE -> "Trans"
#    (regardless of sexual orientation, since trans is a gender identity
#    that cross-cuts sexual orientation)
# 3. Sexual orientation categories in order:
#    - "Gay" (male homosexual)
#    - "Lesbian" (female homosexual, "Lesbica" in Portuguese)
#    - "Bisexual+" (bisexual or pansexual, collapsed)
#    - "Asexual"
#    - "Other LGBTQ+" (any remaining identified candidate)
```

::: {.callout-note}
## Why Prioritize Trans?

Transgender identity is a gender identity, not a sexual orientation. A trans candidate may also be gay, lesbian, or bisexual, but for the purpose of this analysis, their gender identity takes precedence in the primary categorization. This follows conventions in both the LGBTQ+ studies literature and in Brazilian activist nomenclature.

The "Bisexual+" category collapses bisexual and pansexual identities, which share the key feature of attraction to more than one gender.
:::

# Party Ideology Scores

## Source

Party ideology scores come from:

> Bolognesi, B., Ribeiro, E., & Codato, A. (2023). "Ideologia dos partidos politicos brasileiros." *Expert survey of Brazilian party ideology.*

## Scale

- **Range**: 0 (far left) to 10 (far right)
- **Method**: Expert survey with multiple respondents per party
- **Coverage**: All major and most minor parties in the Brazilian system

## Classification Thresholds

```{r ideology-thresholds}
#| label: ideology-thresholds
#| eval: false
# Thresholds used in this analysis:
#   Left:   ideology_score < 4.0
#   Center: ideology_score >= 4.0 and < 7.1
#   Right:  ideology_score >= 7.1

# These thresholds are based on:
# 1. Natural breaks in the expert survey distribution
# 2. Conventional placement of anchor parties:
#    - PT (Workers' Party) ~ 2.0 (clearly Left)
#    - MDB (centrist catch-all) ~ 5.5 (Center)
#    - PL (Bolsonaro's party) ~ 8.5 (clearly Right)
```

## Unscored Parties

Approximately 2% of candidates belong to minor parties not covered by the expert survey.
These candidates have `ideology_score = NA` and `ideology_category = NA`. They are excluded from ideology-stratified analyses but included in all other analyses.

# Municipality Crosswalk

## The TSE-IBGE Code Problem

The TSE uses its own municipality coding system, which differs from the IBGE's official 7-digit municipality codes used by statistical agencies and the `geobr` package. The crosswalk resolves this mismatch.

```{r crosswalk-logic}
#| label: crosswalk-logic
#| eval: false
# Crosswalk construction:
# 1. Start with TSE municipality code + name + state
# 2. Match to IBGE municipality code via:
#    a. Exact name match within state (handles ~95% of cases)
#    b. Fuzzy name match for municipalities with accent/spelling differences
#    c. Manual resolution for remaining cases (mergers, name changes)
# 3. IBGE 7-digit code (geobr_code) enables spatial joins with geobr shapefiles

# The geobr package uses code_muni, which is the IBGE 7-digit code.
# Join: candidates$geobr_code == muni_sf$code_muni
```

## Coverage

The crosswalk covers all 5,570 Brazilian municipalities as of 2022. A small number of candidates (<1%) have missing `geobr_code` due to:

- TSE municipality codes for overseas voting sections (not mappable to domestic municipalities)
- Rare municipality code changes between TSE and IBGE systems

# Data Quality Notes

## Known Issues

### 1. Missing Election Outcomes (~5.4%)

Approximately 5.4% of candidates have `elected = NA`. This occurs because:

- Some candidacies were annulled or withdrawn after registration but before results were finalized
- A small number of substitute candidates (*suplentes*) have ambiguous outcome status
- The `DS_SIT_TOT_TURNO` field in the TSE data has values that do not map cleanly to elected/not-elected

**Handling**: These candidates are excluded from election rate calculations but included in all other analyses (demographics, finance, geography).

### 2. Finance File Parsing

The raw TSE finance file (`receitas_candidatos_2024_BRASIL.csv`) is 1.3 GB with semicolon delimiters and Latin1 encoding. Known issues:

- **Decimal separator**: Commas are used as decimal separators (Brazilian convention). We specify `locale(decimal_mark = ",")` in `read_delim()`.
- **Parsing warnings**: A small number of rows (~0.01%) produce parsing warnings due to malformed fields. These are dropped silently.
- **Duplicate transactions**: We do not de-duplicate transactions, as the TSE may legitimately record corrections and adjustments as separate entries.

### 3. LGBTQ+ Identification Coverage

The LGBTQ+ identification methodology has three important limitations:

- **False negatives**: Candidates who are LGBTQ+ but did not disclose to the TSE and were not identified by VOTE LGBT are classified as Non-LGBTQ+. The true LGBTQ+ count is almost certainly higher than our identified count.
- **False positives**: Very rare, but possible through fuzzy name matching. All fuzzy matches were manually reviewed.
- **Regional variation in disclosure**: Self-disclosure rates likely vary by region, urbanization, and local political culture. The Southeast and South may have higher disclosure rates than the North and Northeast. This means geographic comparisons of LGBTQ+ candidate *rates* confound true prevalence with disclosure propensity.

### 4. Race and Gender Categories

TSE race categories use the Brazilian census classification: Branca (White), Preta (Black), Parda (Brown), Amarela (Asian), and Indigena (Indigenous). We simplify to:

- **White**: Branca
- **Nonwhite**: All other categories (for binary analyses)
- **Detailed**: White, Black, Brown, Other (collapsing Amarela, Indigena, and missing)

The `female` variable is based on the TSE's `DS_GENERO` field. For trans candidates, this may reflect their legal gender rather than their gender assigned at birth.
We use the TSE-recorded gender throughout, which aligns with Brazilian legal identity norms post-2018 (when the *Supremo Tribunal Federal* allowed name/gender changes without surgery).

### 5. Candidate vs Person

The unit of analysis is the **candidacy**, not the individual person. In rare cases, the same person may appear as a candidate in more than one municipality (e.g., if they withdrew and re-registered). We do not de-duplicate at the person level, as the candidacy is the relevant unit for campaign finance and electoral outcome analysis.

# Reproducibility

## Software Environment

```{r session-info}
#| label: session-info
#| eval: true
source(here::here("code", "00_setup.R"))
library(sf)
library(geobr)

cat("Analysis data path:\n")
cat(" ", paths$analysis_full_rds, "\n")
cat(" Exists:", file.exists(paths$analysis_full_rds), "\n\n")

sessionInfo()
```

## Execution Order

The data preparation scripts must be run in the order listed below before rendering the QMD chapters. Note that `05_process_manifestos.R` runs before `04_build_analysis_data.R`.

```{r execution-order}
#| label: execution-order
#| eval: false
# 1. code/01_load_candidates.R      -> data/derived/candidates_analysis.rds
# 2. code/02_load_finance_raw.R     -> data/derived/finance_by_candidate.rds
#                                      data/derived/finance_transactions.rds
#                                      output/tables/finance_revenue_source_categories.csv
# 3. code/03_prepare_geography.R    -> data/geo_cache/municipalities.rds
#                                      data/geo_cache/states.rds
# 4. code/05_process_manifestos.R   -> data/derived/manifestos_text.rds
# 5. code/04_build_analysis_data.R  -> data/derived/analysis_full.rds
#
# After all five scripts have run:
# 6. quarto render docs/            -> docs/_site/ (rendered website)
```

## Data Access

The parent project data is stored on Dropbox and is not publicly available. Researchers wishing to replicate this analysis should:

1. Download the raw TSE candidate registration files from [https://dadosabertos.tse.jus.br/](https://dadosabertos.tse.jus.br/)
2. Download the raw TSE campaign finance files from the same portal
3. Contact VOTE LGBT for their candidate identification list
4. Run the parent project's integration scripts to produce the processed files
5. Run this project's data preparation scripts in order (01, 02, 03, 05, then 04)

All code in this project is designed to be portable given the correct `paths` configuration in `code/00_setup.R`.
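
As a rough illustration of what such a `paths` configuration might look like, the sketch below builds a small `paths` list from the `QPP_DATA_DIR` environment variable. Only `paths$analysis_full_rds` and `QPP_DATA_DIR` appear in this codebook; the `candidates_csv` element, the `project_path()` helper, and the use of `file.path()` in place of `here::here()` are assumptions made to keep the example dependency-free.

```r
# Parent project root, read from the environment (empty string if unset).
parent_dir <- Sys.getenv("QPP_DATA_DIR", unset = "")

# Hypothetical helper: build a path inside this project.
# The real setup script is described as using here::here(); file.path() from
# the working directory is used here only so the sketch needs no packages.
project_path <- function(...) file.path(getwd(), ...)

paths <- list(
  # Parent-project input (element name is an assumption)
  candidates_csv = file.path(
    parent_dir, "Data", "Processed", "candidates_2024_full_integrated.csv"
  ),
  # Final analysis dataset, the element read in the session-info chunk
  analysis_full_rds = project_path("data", "derived", "analysis_full.rds")
)

# Warn early if the parent project location has not been configured.
if (!nzchar(parent_dir)) {
  message("QPP_DATA_DIR is not set; parent-project paths will not resolve.")
}

str(paths)
```

Keeping every location in one list means that moving the project to a new machine only requires setting `QPP_DATA_DIR`, not editing individual scripts.
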