Source Files, Variable Dictionary, and Methodology

1 Overview

This codebook documents all data sources, variables, and methodological decisions underlying the descriptive analysis of LGBTQ+ candidates in Brazil’s 2024 municipal elections. It is intended as a reference for reproducibility and for researchers who wish to extend or replicate this work.

Code chunks in this document illustrate the logic of data construction. Most are set to eval: false and do not execute; only the session info chunks at the end are evaluated, to capture the software environment.

2 Source Files

The analysis draws on four primary data sources: (1) the TSE (Tribunal Superior Eleitoral) official candidate registration files, (2) TSE campaign finance records, (3) TSE candidate manifesto PDFs (propostas de governo), and (4) the VOTE LGBT candidate identification project. Two auxiliary sources supplement them: party ideology scores from the Bolognesi et al. expert survey, and geographic boundaries from the IBGE via the geobr R package.

# All paths are defined in code/00_setup.R
# Parent project path is set via QPP_DATA_DIR environment variable
# This project uses here::here() for relative paths

| File | Path | Format | Encoding | Description |
|---|---|---|---|---|
| candidates_2024_full_integrated.csv | Data/Processed/ (parent) | CSV | UTF-8 | Full integrated candidate dataset from TSE 2024, with LGBTQ+ flags from VOTE LGBT matching. ~464K rows, one per candidacy. |
| lgbt_matched_details.csv | Data/Processed/ (parent) | CSV | UTF-8 | Detailed LGBTQ+ matching results: candidate ID, match method (exact/fuzzy), sexual orientation, gender identity, source (TSE disclosure vs VOTE LGBT list). |
| party_ideology_reference.csv | Data/Processed/ (parent) | CSV | UTF-8 | Party-level ideology scores from Bolognesi et al. expert survey. Columns: party abbreviation, ideology score (0-10), ideology category (Left/Center/Right). |
| receitas_candidatos_2024_BRASIL.csv | Data/Brazil/Electoral data/ (parent) | CSV (; delimited) | Latin1 | Raw TSE campaign finance file. ~5.5M transactions, 1.3 GB. Comma decimal separator, semicolon field separator. |
| campaign_finance_aggregated.csv | Data/Processed/intermediate/ (parent) | CSV | UTF-8 | Pre-existing candidate-level finance aggregation from parent project. Used for validation against our independent aggregation. |
| crosswalk_tse_to_geobr.csv | Data/Processed/intermediate/ (parent) | CSV | UTF-8 | Crosswalk between TSE municipality codes and IBGE/geobr municipality codes. |
| municipality_crosswalk_master.csv | Data/Processed/intermediate/ (parent) | CSV | UTF-8 | Master municipality crosswalk with TSE code, IBGE code, municipality name, state, and population estimates. |
| Candidate manifesto ZIPs (27 files) | Data/Brazil/Candidate manifestos/ (parent) | ZIP/PDF | Latin1 | TSE propostas de governo for mayoral candidates. ~15,800 PDFs across 27 state-level ZIP archives (~11 GB). PDF filenames embed SQ_CANDIDATO. |

2.0.1 Derived Data Files

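These files are created by this project’s data preparation scripts. Most are stored in data/derived/; the cached boundary files (municipalities.rds and states.rds) are written to data/geo_cache/, as listed in the execution order in Section 8.2.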

| File | Script | Format | Description |
|---|---|---|---|
| candidates_analysis.rds | 01_load_candidates.R | RDS | Cleaned candidate dataset with recoded variables, LGBTQ+ flags, ideology scores, and simplified demographics. |
| finance_by_candidate.rds | 02_load_finance_raw.R | RDS | Candidate-level campaign finance summary: total revenue, funding source amounts and shares, donor counts. |
| finance_transactions.rds | 02_load_finance_raw.R | RDS | Transaction-level finance data with classified funding types. Large file (~200 MB). |
| manifestos_text.rds | 05_process_manifestos.R | RDS | Extracted manifesto text from ~15,800 mayoral candidate PDFs. Includes text, page/word counts, and extraction success flag. |
| analysis_full.rds | 04_build_analysis_data.R | RDS | Final merged dataset: candidates + finance + geography + manifestos. This is the primary analysis file used by all QMD chapters. |
| municipalities.rds | 03_prepare_geography.R | RDS | Cached municipality boundary shapefiles from geobr::read_municipality(year = 2022). |
| states.rds | 03_prepare_geography.R | RDS | Cached state boundary shapefiles from geobr::read_state(year = 2020). |

2.0.2 Output Files

| Directory | Contents |
|---|---|
| output/tables/ | CSV exports of key summary tables, including finance_revenue_source_categories.csv. |
| output/figures/ | PNG and PDF exports of all figures, generated by save_figure(). |

3 Variable Dictionary

3.1 Core Candidate Variables

| Variable | Type | Description | Source | Example Values | Missing |
|---|---|---|---|---|---|
| `candidate_id` | character | TSE sequential candidate identifier (`SQ_CANDIDATO`) | TSE registration | "280000000001" | 0% |
| `lgbtq_candidate` | logical | Whether candidate is identified as LGBTQ+ (TSE self-disclosure OR VOTE LGBT match) | TSE + VOTE LGBT | TRUE, FALSE | 0% |
| `trans_candidate` | logical | Whether candidate is identified as transgender (subset of LGBTQ+) | TSE + VOTE LGBT | TRUE, FALSE | 0% |
| `lgbtq_label` | factor | Binary label for plotting | Derived | "LGBTQ+", "Non-LGBTQ+" | 0% |
| `lgbt_category` | factor | Disaggregated identity category. Trans is prioritized over sexual orientation. | Derived | "Non-LGBTQ+", "Gay", "Lesbian", "Bisexual+", "Trans", "Asexual", "Other LGBTQ+" | 0% |
| `female` | logical | Whether candidate’s registered gender is female | TSE | TRUE, FALSE | <1% |
| `nonwhite` | logical | Whether candidate’s race is not “Branca” (White) | TSE | TRUE, FALSE | ~1% |
| `race_simple` | factor | Simplified race category | Derived from TSE `DS_COR_RACA` | "White", "Black", "Brown", "Other" | ~1% |
| `age` | numeric | Candidate age at election date | TSE | 18 to 100+ | <1% |
| `age_group` | factor | Binned age categories | Derived | "18-29", "30-39", "40-49", "50-59", "60+" | <1% |
| `education_simple` | factor | Simplified education level (3 categories) | Derived from TSE `DS_GRAU_INSTRUCAO` | "Less than HS", "High School", "College+" | <1% |
| `party_abbrev` | character | Party abbreviation | TSE | "PT", "PL", "MDB" | 0% |
| `ideology_score` | numeric | Party ideology score (0 = far left, 10 = far right) | Bolognesi et al. | 1.5 to 9.2 | ~2% (minor parties) |
| `ideology_category` | factor | Left/Center/Right classification | Derived from `ideology_score` | "Left", "Center", "Right" | ~2% |
| `position_simple` | factor | Simplified position type | TSE | "City Councilor", "Mayor", "Vice Mayor" | 0% |
| `elected` | logical | Whether candidate was elected | TSE | TRUE, FALSE | ~5.4% |
| `state_abbrev` | character | Two-letter state abbreviation | TSE | "SP", "RJ", "BA" | 0% |
| `region` | factor | Brazilian macro-region | Derived from `state_abbrev` | "North", "Northeast", "Center-West", "Southeast", "South" | 0% |
| `geobr_code` | numeric | IBGE municipality code (7 digits) compatible with geobr | Crosswalk | 3550308 (Sao Paulo) | <1% |
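
Two of the derived columns in this table, age_group and region, are plain recodes. The sketch below shows one way they could be built in base R. It is not necessarily how 01_load_candidates.R does it: the helper names make_age_group() and make_region() are hypothetical, and the bin edges are inferred from the labels listed above.

```r
# Bin ages into the five labelled groups; ages below 18 or missing stay NA.
make_age_group <- function(age) {
  cut(age, breaks = c(18, 30, 40, 50, 60, Inf), right = FALSE,
      labels = c("18-29", "30-39", "40-49", "50-59", "60+"))
}

# Brazil's 26 states plus the Federal District, grouped by macro-region.
states_by_region <- list(
  "North"       = c("AC", "AP", "AM", "PA", "RO", "RR", "TO"),
  "Northeast"   = c("AL", "BA", "CE", "MA", "PB", "PE", "PI", "RN", "SE"),
  "Center-West" = c("DF", "GO", "MT", "MS"),
  "Southeast"   = c("ES", "MG", "RJ", "SP"),
  "South"       = c("PR", "RS", "SC")
)

# Named lookup vector: state abbreviation -> region name.
region_lookup <- setNames(
  rep(names(states_by_region), lengths(states_by_region)),
  unlist(states_by_region, use.names = FALSE)
)

make_region <- function(state_abbrev) {
  factor(unname(region_lookup[state_abbrev]), levels = names(states_by_region))
}

example <- data.frame(age = c(18, 29, 30, 59, 60, NA),
                      state_abbrev = c("SP", "BA", "DF", "AM", "RS", "SP"))
example$age_group <- make_age_group(example$age)
example$region    <- make_region(example$state_abbrev)
print(example)
```

Using right = FALSE makes each bin closed on the left, so a 30-year-old falls in "30-39" rather than "18-29".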

3.2 Campaign Finance Variables

These variables are merged from finance_by_candidate.rds onto the candidate dataset. Candidates with no finance records have all finance variables set to 0.

| Variable | Type | Description | Source |
|---|---|---|---|
| `total_revenue` | numeric | Total campaign revenue in R$ | TSE receitas |
| `n_transactions` | integer | Number of revenue transactions | TSE receitas |
| `n_unique_donors` | integer | Number of distinct donor IDs | TSE receitas |
| `self_funding_amt` | numeric | Revenue from candidate’s own resources (R$) | Classified from `DS_ORIGEM_RECEITA` |
| `party_funding_amt` | numeric | Revenue from party transfers (R$) | Classified |
| `individual_funding_amt` | numeric | Revenue from individual donors (R$) | Classified |
| `crowdfunding_amt` | numeric | Revenue from crowdfunding platforms (R$) | Classified |
| `pct_self` | numeric | Self-funding as % of total revenue (0-100 scale) | Derived |
| `pct_party` | numeric | Party funding as % of total (0-100) | Derived |
| `pct_individual` | numeric | Individual funding as % of total (0-100) | Derived |
| `pct_crowdfunding` | numeric | Crowdfunding as % of total (0-100) | Derived |
| `financial_amt` | numeric | Total financial (cash) contributions (R$) | From `DS_NATUREZA_RECEITA == "FINANCEIRO"` |
| `inkind_amt` | numeric | Total in-kind/estimated contributions (R$) | From `DS_NATUREZA_RECEITA != "FINANCEIRO"` |
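
As an illustration of how the amount and share columns relate, the sketch below aggregates a toy transaction table to the candidate level. It is a minimal example, not the code in 02_load_finance_raw.R. The toy data and the share() helper are hypothetical, and setting shares to 0 when total revenue is 0 is an assumption that follows the "set to 0" rule stated above.

```r
# Toy transactions using the funding_type labels documented in Section 3.4.
tx <- data.frame(
  SQ_CANDIDATO = c("A", "A", "A", "B"),
  funding_type = c("self_funding", "party_funding", "individual_funding", "party_funding"),
  VR_RECEITA   = c(100, 300, 100, 50),
  stringsAsFactors = FALSE
)
all_candidates <- c("A", "B", "C")  # candidate "C" has no finance records
funding_levels <- c("self_funding", "party_funding", "individual_funding",
                    "crowdfunding", "other_candidates", "other")

# Factors with explicit levels keep empty candidate/type cells in the result.
tx$SQ_CANDIDATO <- factor(tx$SQ_CANDIDATO, levels = all_candidates)
tx$funding_type <- factor(tx$funding_type, levels = funding_levels)

# Candidate x funding-type matrix of summed revenue; empty cells become 0.
amt <- tapply(tx$VR_RECEITA, list(tx$SQ_CANDIDATO, tx$funding_type), sum, default = 0)

fin <- data.frame(
  candidate_id           = rownames(amt),
  total_revenue          = rowSums(amt),
  self_funding_amt       = amt[, "self_funding"],
  party_funding_amt      = amt[, "party_funding"],
  individual_funding_amt = amt[, "individual_funding"],
  crowdfunding_amt       = amt[, "crowdfunding"],
  row.names = NULL, stringsAsFactors = FALSE
)

# Share on a 0-100 scale; candidates with no revenue get 0 instead of NaN.
share <- function(part, total) ifelse(total > 0, 100 * part / total, 0)
fin$pct_self         <- share(fin$self_funding_amt, fin$total_revenue)
fin$pct_party        <- share(fin$party_funding_amt, fin$total_revenue)
fin$pct_individual   <- share(fin$individual_funding_amt, fin$total_revenue)
fin$pct_crowdfunding <- share(fin$crowdfunding_amt, fin$total_revenue)
print(fin)
```

Because total_revenue here sums all six funding types, the four pct_* columns need not add up to 100 for a candidate with "other" revenue.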

3.3 Manifesto Variables

These variables are merged from manifestos_text.rds onto the candidate dataset. Only mayoral (PREFEITO) candidates have manifestos; all other candidates have has_manifesto = FALSE and NA for text fields.

| Variable | Type | Description | Source |
|---|---|---|---|
| `has_manifesto` | logical | Whether a manifesto PDF was found and text successfully extracted | 05_process_manifestos.R |
| `manifesto_text` | character | Full extracted text of the candidate’s proposta de governo | pdftotext extraction |
| `manifesto_n_pages` | integer | Page count of the manifesto PDF | pdfinfo |
| `manifesto_n_words` | integer | Word count of the extracted manifesto text | Derived |
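
The merge behaviour described above (non-mayoral candidates get has_manifesto = FALSE and NA text fields) can be sketched as a left join followed by a word count. This is a simplified illustration: count_words() and the toy data are hypothetical, and here has_manifesto is reduced to "text is present", whereas the real flag also reflects whether extraction succeeded.

```r
# Count whitespace-separated tokens; NA text yields NA, empty text yields 0.
count_words <- function(text) {
  text <- trimws(text, whitespace = "[[:space:]]")
  n <- lengths(strsplit(text, "[[:space:]]+"))
  n[is.na(text)] <- NA_integer_
  n
}

candidates <- data.frame(
  candidate_id = c("1", "2", "3"),
  position     = c("PREFEITO", "PREFEITO", "VEREADOR"),
  stringsAsFactors = FALSE
)
# Extracted text may contain newlines and form feeds (page breaks).
manifestos <- data.frame(
  candidate_id   = "1",
  manifesto_text = "Plano de governo:\n saude, educacao\fe transporte",
  stringsAsFactors = FALSE
)

# Left join: every candidate is kept, unmatched rows get NA text.
merged <- merge(candidates, manifestos, by = "candidate_id", all.x = TRUE)
merged$has_manifesto     <- !is.na(merged$manifesto_text)
merged$manifesto_n_words <- count_words(merged$manifesto_text)
print(merged[, c("candidate_id", "position", "has_manifesto", "manifesto_n_words")])
```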

3.4 Transaction-Level Finance Variables

Available in finance_transactions.rds for detailed analysis.

| Variable | Type | Description |
|---|---|---|
| `SQ_CANDIDATO` | character | Candidate sequential ID (join key) |
| `DS_ORIGEM_RECEITA` | character | TSE revenue source category (Portuguese) |
| `DS_NATUREZA_RECEITA` | character | Financial vs in-kind classification |
| `DS_ESPECIE_RECEITA` | character | Payment method (PIX, bank transfer, check, etc.) |
| `DS_NATUREZA_RECURSO_ESTIMAVEL` | character | Type of in-kind contribution (when applicable) |
| `VR_RECEITA` | numeric | Transaction amount in R$ |
| `funding_type` | character | Our classified funding type: self_funding, party_funding, individual_funding, crowdfunding, other_candidates, other |
| `NR_CPF_CNPJ_DOADOR` | character | Donor CPF (individual) or CNPJ (entity) |
| `NM_DOADOR` | character | Donor name |
| `DS_GENERO` | character | Donor gender (TSE categories) |
| `DS_COR_RACA` | character | Donor race/color (TSE categories) |
| `DT_RECEITA` | character | Transaction date |

4 LGBTQ+ Identification Methodology

4.1 Two Sources of Identification

LGBTQ+ candidates are identified through two complementary mechanisms:

4.1.1 1. TSE Self-Disclosure (New in 2024)

For the first time, the TSE’s 2024 candidate registration form included optional fields for:

  • Sexual orientation: Heterossexual, Gay, Lesbica, Bissexual, Pansexual, Assexual, or open text
  • Gender identity: Cisgenero, Transgenero/Travesti, Nao-binario, or open text

Candidates who reported any non-heterosexual orientation or non-cisgender identity are flagged as LGBTQ+. This is the primary identification source and captures the majority of identified LGBTQ+ candidates.

# TSE self-disclosure logic (simplified)
tse_lgbtq <- sexual_orientation != "Heterossexual" |
             gender_identity != "Cisgenero"
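
Because both fields are optional, a direct comparison such as the one above returns NA for candidates who left them blank. The runnable sketch below makes the missing-value handling explicit. Treating blank answers as "not flagged" is an assumption made here so that the flag has no missing values; the data frame and the reported_other() helper are hypothetical, and the category labels are the ones listed in this section.

```r
# Hypothetical registration records; NA means the optional field was left blank.
reg <- data.frame(
  sexual_orientation = c("Heterossexual", "Gay", NA, "Heterossexual", "Bissexual"),
  gender_identity    = c("Cisgenero", "Cisgenero", NA, "Transgenero/Travesti", NA),
  stringsAsFactors   = FALSE
)

# TRUE only when a value was reported AND it differs from the reference category.
reported_other <- function(x, reference) !is.na(x) & x != reference

reg$tse_lgbtq <- reported_other(reg$sexual_orientation, "Heterossexual") |
                 reported_other(reg$gender_identity, "Cisgenero")
print(reg)
```

The third row (both fields blank) comes out FALSE rather than NA, and the fifth row is flagged from sexual orientation alone.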

4.1.2 2. VOTE LGBT Candidate List

VOTE LGBT (Voto com Orgulho) is a civil society organization that identifies and supports LGBTQ+ candidates in Brazilian elections. Their list includes candidates who:

  • Self-identify to the organization as LGBTQ+
  • Are publicly known LGBTQ+ community members
  • Are nominated by LGBTQ+ organizations

The matching between VOTE LGBT names and TSE candidate records uses a two-step process:

# Step 1: Exact match on candidate name (after normalization)
# - Remove accents, standardize spacing, uppercase
# - Match on full name, with state as secondary filter

# Step 2: Fuzzy match for remaining unmatched
# - String distance (Jaro-Winkler) with threshold
# - Manual review of ambiguous matches
# - State + party used as disambiguation
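
The two steps above can be sketched in base R as follows. This is an illustrative sketch, not the parent project's matching code: the function names, the column names (name, state, candidate_id), the toy records, and the 0.92 threshold are all assumptions. Jaro-Winkler is implemented by hand so the example runs without extra packages, and party-based disambiguation is omitted.

```r
# Strip Portuguese accents with an explicit map (written as \u escapes for
# portability), collapse whitespace, then uppercase.
normalize_name <- function(x) {
  accented <- paste0("\u00e1\u00e0\u00e2\u00e3\u00e9\u00ea\u00ed\u00f3\u00f4\u00f5\u00fa\u00fc\u00e7",
                     "\u00c1\u00c0\u00c2\u00c3\u00c9\u00ca\u00cd\u00d3\u00d4\u00d5\u00da\u00dc\u00c7")
  plain <- "aaaaeeiooouucAAAAEEIOOOUUC"
  toupper(gsub("[[:space:]]+", " ", trimws(chartr(accented, plain, x))))
}

# Jaro-Winkler similarity in [0, 1]; 1 means identical strings.
jaro_winkler <- function(a, b, p = 0.1) {
  s1 <- strsplit(a, "")[[1]]; s2 <- strsplit(b, "")[[1]]
  n1 <- length(s1); n2 <- length(s2)
  if (n1 == 0 || n2 == 0) return(as.numeric(n1 == n2))
  w <- max(floor(max(n1, n2) / 2) - 1, 0)          # match window
  m1 <- rep(FALSE, n1); m2 <- rep(FALSE, n2)
  for (i in seq_len(n1)) {
    lo <- max(1, i - w); hi <- min(n2, i + w)
    if (lo > hi) next
    for (j in lo:hi) {
      if (!m2[j] && s1[i] == s2[j]) { m1[i] <- TRUE; m2[j] <- TRUE; break }
    }
  }
  m <- sum(m1)
  if (m == 0) return(0)
  t <- sum(s1[m1] != s2[m2]) / 2                   # transpositions
  jaro <- (m / n1 + m / n2 + (m - t) / m) / 3
  k <- min(4, n1, n2)                              # common prefix, capped at 4
  same <- s1[seq_len(k)] == s2[seq_len(k)]
  l <- if (all(same)) k else which(!same)[1] - 1
  jaro + l * p * (1 - jaro)
}

match_candidates <- function(vote_list, tse, threshold = 0.92) {
  vote_list$key <- normalize_name(vote_list$name)
  tse$key <- normalize_name(tse$name)
  out <- data.frame(name = vote_list$name, candidate_id = NA_character_,
                    method = NA_character_, stringsAsFactors = FALSE)
  for (i in seq_len(nrow(vote_list))) {
    pool <- tse[tse$state == vote_list$state[i], ]  # state as secondary filter
    if (nrow(pool) == 0) next
    exact <- which(pool$key == vote_list$key[i])
    if (length(exact) == 1) {                       # Step 1: unique exact match
      out$candidate_id[i] <- pool$candidate_id[exact]
      out$method[i] <- "exact"
    } else if (length(exact) == 0) {                # Step 2: fuzzy match
      sims <- vapply(pool$key, jaro_winkler, numeric(1), b = vote_list$key[i])
      best <- which.max(sims)
      if (sims[best] >= threshold) {
        out$candidate_id[i] <- pool$candidate_id[best]
        out$method[i] <- "fuzzy"
      }
    } else {
      out$method[i] <- "review"                     # homonyms: manual review
    }
  }
  out
}

vote_list <- data.frame(name = c("Jo\u00e3o da Silva", "Maria Souza", "Ana Lima"),
                        state = c("SP", "RJ", "BA"), stringsAsFactors = FALSE)
tse <- data.frame(candidate_id = c("280000000001", "190000000002", "050000000003"),
                  name = c("JOAO DA SILVA", "MARIA SOUSA", "PEDRO ALVES"),
                  state = c("SP", "RJ", "BA"), stringsAsFactors = FALSE)
matches <- match_candidates(vote_list, tse)
print(matches)
```

In this toy run the first name matches exactly after normalization, the second is a fuzzy match (Souza vs Sousa), and the third stays unmatched. Rows labelled "fuzzy" or "review" are the ones that would go to manual review.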

4.1.3 Priority Rules for Identity Categories

When a candidate is identified through both sources, or when multiple identity labels apply, the following priority rules determine the lgbt_category assignment:

# Priority rules (implemented in make_lgbt_category()):
# 1. Non-LGBTQ+ candidates: lgbtq_candidate == FALSE -> "Non-LGBTQ+"
# 2. Trans identity is prioritized: trans_candidate == TRUE -> "Trans"
#    (regardless of sexual orientation, since trans is a gender identity
#     that cross-cuts sexual orientation)
# 3. Sexual orientation categories in order:
#    - "Gay" (male homosexual)
#    - "Lesbian" (female homosexual, "Lesbica" in Portuguese)
#    - "Bisexual+" (bisexual or pansexual, collapsed)
#    - "Asexual"
#    - "Other LGBTQ+" (any remaining identified candidate)

Why Prioritize Trans?

Transgender identity is a gender identity, not a sexual orientation. A trans candidate may also be gay, lesbian, or bisexual, but for the purpose of this analysis, their gender identity takes precedence in the primary categorization. This follows conventions in both the LGBTQ+ studies literature and Brazilian activist nomenclature. The “Bisexual+” category collapses bisexual and pansexual identities, which share the key feature of attraction to more than one gender.
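
The priority rules can be sketched as a sequence of assignments in which higher-priority rules overwrite lower-priority ones. The function below is a hypothetical stand-in for make_lgbt_category(), whose actual inputs and implementation are not shown in this codebook. The three-argument signature is an assumption, and the orientation labels are the ones listed in Section 4.1.1.

```r
make_lgbt_category_sketch <- function(lgbtq_candidate, trans_candidate, sexual_orientation) {
  category_levels <- c("Non-LGBTQ+", "Gay", "Lesbian", "Bisexual+",
                       "Trans", "Asexual", "Other LGBTQ+")
  # Fallback for any identified candidate without a more specific label.
  out <- rep("Other LGBTQ+", length(lgbtq_candidate))
  # Sexual orientation categories (mutually exclusive; %in% treats NA as no match).
  out[sexual_orientation %in% "Assexual"] <- "Asexual"
  out[sexual_orientation %in% c("Bissexual", "Pansexual")] <- "Bisexual+"
  out[sexual_orientation %in% "Lesbica"] <- "Lesbian"
  out[sexual_orientation %in% "Gay"] <- "Gay"
  # Trans identity overrides any sexual orientation label.
  out[trans_candidate %in% TRUE] <- "Trans"
  # Candidates not identified as LGBTQ+ override everything else.
  out[!(lgbtq_candidate %in% TRUE)] <- "Non-LGBTQ+"
  factor(out, levels = category_levels)
}

category <- make_lgbt_category_sketch(
  lgbtq_candidate    = c(FALSE, TRUE, TRUE, TRUE, TRUE, TRUE),
  trans_candidate    = c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE),
  sexual_orientation = c("Heterossexual", "Gay", "Lesbica", "Pansexual", NA, "Assexual")
)
print(category)
```

The third candidate is both trans and lesbian and is categorized as "Trans". The fifth has no recorded orientation (for example, identified only through the VOTE LGBT list) and falls into "Other LGBTQ+".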

5 Party Ideology Scores

5.1 Source

Party ideology scores come from:

Bolognesi, B., Ribeiro, E., & Codato, A. (2023). “Ideologia dos partidos politicos brasileiros.” Expert survey of Brazilian party ideology.

5.2 Scale

  • Range: 0 (far left) to 10 (far right)
  • Method: Expert survey with multiple respondents per party
  • Coverage: All major and most minor parties in the Brazilian system

5.3 Classification Thresholds

# Thresholds used in this analysis:
# Left:   ideology_score < 4.0
# Center: ideology_score >= 4.0 and < 7.1
# Right:  ideology_score >= 7.1

# These thresholds are based on:
# 1. Natural breaks in the expert survey distribution
# 2. Conventional placement of anchor parties:
#    - PT (Workers' Party) ~ 2.0 (clearly Left)
#    - MDB (centrist catch-all) ~ 5.5 (Center)
#    - PL (Bolsonaro's party) ~ 8.5 (clearly Right)
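
The thresholds translate directly into a call to cut() with left-closed intervals. The sketch below is one way to write it; classify_ideology() is a hypothetical helper name, and the anchor scores are the approximate values quoted above.

```r
# Left: score < 4.0; Center: 4.0 <= score < 7.1; Right: score >= 7.1.
classify_ideology <- function(score) {
  cut(score, breaks = c(-Inf, 4.0, 7.1, Inf),
      labels = c("Left", "Center", "Right"), right = FALSE)
}

parties <- data.frame(
  party_abbrev   = c("PT", "MDB", "PL", "edge_low", "edge_high", "unscored"),
  ideology_score = c(2.0, 5.5, 8.5, 4.0, 7.1, NA)
)
parties$ideology_category <- classify_ideology(parties$ideology_score)
print(parties)
```

Setting right = FALSE matters at the boundaries: a score of exactly 4.0 is classified as Center and exactly 7.1 as Right, while unscored parties stay NA.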

5.4 Unscored Parties

Approximately 2% of candidates belong to minor parties not covered by the expert survey. These candidates have ideology_score = NA and ideology_category = NA. They are excluded from ideology-stratified analyses but included in all other analyses.

6 Municipality Crosswalk

6.1 The TSE-IBGE Code Problem

The TSE uses its own municipality coding system, which differs from the IBGE’s official 7-digit municipality codes used by statistical agencies and the geobr package. The crosswalk resolves this mismatch.

# Crosswalk construction:
# 1. Start with TSE municipality code + name + state
# 2. Match to IBGE municipality code via:
#    a. Exact name match within state (handles ~95% of cases)
#    b. Fuzzy name match for municipalities with accent/spelling differences
#    c. Manual resolution for remaining cases (mergers, name changes)
# 3. IBGE 7-digit code (geobr_code) enables spatial joins with geobr shapefiles

# The geobr package uses code_muni, which is the IBGE 7-digit code.
# Join: candidates$geobr_code == muni_sf$code_muni
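
A minimal sketch of the two joins, using plain data frames in place of the geobr sf objects. The column name tse_code, the toy rows, and the TSE codes shown are illustrative assumptions. In the real pipeline the second table would be the sf object returned by geobr, joined on code_muni in the same way.

```r
candidates <- data.frame(
  candidate_id = c("c1", "c2", "c3"),
  tse_code     = c("71072", "60011", "99999"),  # "99999" has no crosswalk entry
  stringsAsFactors = FALSE
)
crosswalk <- data.frame(
  tse_code   = c("71072", "60011"),
  geobr_code = c(3550308, 3304557),
  stringsAsFactors = FALSE
)
# Stand-in for the geobr municipality table (geometry column omitted).
muni <- data.frame(
  code_muni = c(3550308, 3304557),
  name_muni = c("Sao Paulo", "Rio de Janeiro"),
  stringsAsFactors = FALSE
)

# Step 1: attach the IBGE code; all.x = TRUE keeps candidates without a match.
with_code <- merge(candidates, crosswalk, by = "tse_code", all.x = TRUE)

# Step 2: join to municipality attributes on the IBGE 7-digit code.
mapped <- merge(with_code, muni, by.x = "geobr_code", by.y = "code_muni", all.x = TRUE)
print(mapped)

# Candidates that could not be mapped (missing geobr_code).
print(mapped$candidate_id[is.na(mapped$geobr_code)])
```

Note that merge() may reorder rows, so downstream code should rely on candidate_id rather than row position.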

6.2 Coverage

The crosswalk covers all 5,570 Brazilian municipalities as of 2022. A small number of candidates (<1%) have missing geobr_code due to:

  • TSE municipality codes for overseas voting sections (not mappable to domestic municipalities)
  • Rare municipality code changes between TSE and IBGE systems

7 Data Quality Notes

7.1 Known Issues

7.1.1 1. Missing Election Outcomes (~5.4%)

Approximately 5.4% of candidates have elected = NA. This occurs because:

  • Some candidacies were annulled or withdrawn after registration but before results were finalized
  • A small number of substitute candidates (suplentes) have ambiguous outcome status
  • The DS_SIT_TOT_TURNO field in the TSE data has values that do not map cleanly to elected/not-elected

Handling: These candidates are excluded from election rate calculations but included in all other analyses (demographics, finance, geography).
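
In practice, this handling amounts to computing rates over known outcomes only while still counting every candidacy elsewhere. A small sketch with hypothetical toy data:

```r
# Toy data: one candidacy per group has an unknown outcome (NA).
df <- data.frame(
  lgbtq_label = c("LGBTQ+", "LGBTQ+", "LGBTQ+",
                  "Non-LGBTQ+", "Non-LGBTQ+", "Non-LGBTQ+", "Non-LGBTQ+"),
  elected     = c(TRUE, FALSE, NA, TRUE, FALSE, FALSE, NA)
)

# Election rate among candidacies with a known outcome.
election_rate <- tapply(df$elected, df$lgbtq_label, mean, na.rm = TRUE)
# Denominators: known outcomes vs all candidacies.
n_known <- tapply(!is.na(df$elected), df$lgbtq_label, sum)
n_total <- tapply(df$elected, df$lgbtq_label, length)

print(data.frame(group = names(election_rate),
                 election_rate = as.numeric(election_rate),
                 n_known = as.integer(n_known),
                 n_total = as.integer(n_total)))
```

Reporting n_known alongside n_total makes it visible how many candidacies were dropped from each rate.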

7.1.2 2. Finance File Parsing

The raw TSE finance file (receitas_candidatos_2024_BRASIL.csv) is 1.3 GB with semicolon delimiters and Latin1 encoding. Known issues:

  • Decimal separator: Commas are used as decimal separators (Brazilian convention). We specify locale(decimal_mark = ",") in read_delim().
  • Parsing warnings: A small number of rows (~0.01%) produce parsing warnings due to malformed fields. These are dropped silently.
  • Duplicate transactions: We do not de-duplicate transactions, as the TSE may legitimately record corrections and adjustments as separate entries.
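
The parsing conventions above can be tried out on a small file. The project itself reads the data with readr's read_delim() and locale(decimal_mark = ","). The sketch below swaps in base R's read.csv2(), which defaults to the same semicolon/comma conventions, so that it runs without extra packages. The sample rows are made up and kept ASCII-only for portability.

```r
# Write a tiny semicolon-delimited sample with comma decimal separators.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  '"SQ_CANDIDATO";"DS_ORIGEM_RECEITA";"VR_RECEITA"',
  '"280000000001";"Recursos proprios";"1500,50"',
  '"280000000001";"Recursos de pessoas fisicas";"200,00"',
  '"280000000002";"Recursos de partido politico";"10000,75"'
), tmp)

# read.csv2() defaults to sep = ";" and dec = ",".
# SQ_CANDIDATO is forced to character so long IDs are not read as numbers.
rec <- read.csv2(tmp, fileEncoding = "latin1",
                 colClasses = c(SQ_CANDIDATO = "character"),
                 stringsAsFactors = FALSE)
str(rec)

# Candidate-level totals from the parsed amounts.
totals <- aggregate(VR_RECEITA ~ SQ_CANDIDATO, data = rec, FUN = sum)
print(totals)
unlink(tmp)
```

Reading SQ_CANDIDATO as character avoids precision loss and keeps it usable as a join key.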

7.1.3 3. LGBTQ+ Identification Coverage

The LGBTQ+ identification methodology has three important limitations:

  • False negatives: Candidates who are LGBTQ+ but did not disclose to the TSE and were not identified by VOTE LGBT are classified as Non-LGBTQ+. The true LGBTQ+ count is almost certainly higher than our identified count.
  • False positives: Very rare, but possible through fuzzy name matching. All fuzzy matches were manually reviewed.
  • Regional variation in disclosure: Self-disclosure rates likely vary by region, urbanization, and local political culture. The Southeast and South may have higher disclosure rates than the North and Northeast. This means geographic comparisons of LGBTQ+ candidate rates confound true prevalence with disclosure propensity.

7.1.4 4. Race and Gender Categories

TSE race categories use the Brazilian census classification: Branca (White), Preta (Black), Parda (Brown), Amarela (Asian), and Indigena (Indigenous). We simplify to:

  • White: Branca
  • Nonwhite: All other categories (for binary analyses)
  • Detailed: White, Black, Brown, Other (collapsing Amarela, Indigena, and missing)
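
The simplification can be sketched as a lookup. The recode_race() helper is hypothetical, and the category spellings follow this codebook rather than the raw TSE strings. Following the list above, missing values are collapsed into "Other" for the detailed variable, while keeping nonwhite as NA for missing race is an assumption made here.

```r
recode_race <- function(ds_cor_raca) {
  map <- c(Branca = "White", Preta = "Black", Parda = "Brown")
  simple <- unname(map[ds_cor_raca])
  simple[is.na(simple)] <- "Other"  # Amarela, Indigena, and missing
  data.frame(
    race_simple = factor(simple, levels = c("White", "Black", "Brown", "Other")),
    # Binary indicator: anything other than Branca; unknown race stays NA.
    nonwhite = ifelse(is.na(ds_cor_raca), NA, ds_cor_raca != "Branca")
  )
}

raw <- c("Branca", "Preta", "Parda", "Amarela", "Indigena", NA)
print(cbind(raw = raw, recode_race(raw)))
```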

The female variable is based on the TSE’s DS_GENERO field. For trans candidates, this may reflect their legal gender rather than their gender assigned at birth. We use the TSE-recorded gender throughout, which aligns with Brazilian legal identity norms post-2018 (when the Supremo Tribunal Federal allowed name/gender changes without surgery).

7.1.5 5. Candidate vs Person

The unit of analysis is the candidacy, not the individual person. In rare cases, the same person may appear as a candidate in more than one municipality (e.g., if they withdrew and re-registered). We do not de-duplicate at the person level, as the candidacy is the relevant unit for campaign finance and electoral outcome analysis.

8 Reproducibility

8.1 Software Environment

source(here::here("code", "00_setup.R"))
library(sf)
library(geobr)

cat("Analysis data path:\n")
Analysis data path:
cat("  ", paths$analysis_full_rds, "\n")
   /Users/aloport/Library/CloudStorage/Dropbox/Research/Prep/brazil_lgbtq_descriptives/data/derived/analysis_full.rds 
cat("  Exists:", file.exists(paths$analysis_full_rds), "\n\n")
  Exists: TRUE 
sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS 26.2

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] C.UTF-8/C.UTF-8/C.UTF-8/C/C.UTF-8/C.UTF-8

time zone: Europe/London
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] geobr_1.9.1      sf_1.0-20        kableExtra_1.4.0 knitr_1.48      
 [5] patchwork_1.3.0  gt_0.11.0        scales_1.4.0     lubridate_1.9.3 
 [9] forcats_1.0.0    stringr_1.5.1    dplyr_1.1.4      purrr_1.2.0     
[13] readr_2.1.5      tidyr_1.3.1      tibble_3.2.1     ggplot2_4.0.0   
[17] tidyverse_2.0.0 

loaded via a namespace (and not attached):
 [1] utf8_1.2.4         generics_0.1.3     class_7.3-22       xml2_1.3.6        
 [5] KernSmooth_2.23-24 stringi_1.8.4      hms_1.1.3          digest_0.6.37     
 [9] magrittr_2.0.4     evaluate_1.0.3     grid_4.4.1         timechange_0.3.0  
[13] RColorBrewer_1.1-3 fastmap_1.2.0      rprojroot_2.0.4    jsonlite_2.0.0    
[17] e1071_1.7-16       DBI_1.2.3          fansi_1.0.6        viridisLite_0.4.2 
[21] cli_3.6.5          rlang_1.1.6        units_0.8-7        withr_3.0.2       
[25] yaml_2.3.10        tools_4.4.1        tzdb_0.5.0         here_1.0.1        
[29] curl_7.0.0         vctrs_0.6.5        R6_2.6.1           proxy_0.4-27      
[33] classInt_0.4-10    lifecycle_1.0.4    htmlwidgets_1.6.4  pkgconfig_2.0.3   
[37] pillar_1.9.0       gtable_0.3.6       data.table_1.17.0  Rcpp_1.1.0        
[41] glue_1.8.0         systemfonts_1.1.0  xfun_0.47          tidyselect_1.2.1  
[45] rstudioapi_0.17.1  farver_2.1.2       htmltools_0.5.8.1  rmarkdown_2.28    
[49] svglite_2.1.3      compiler_4.4.1     S7_0.2.0          

8.2 Execution Order

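The data preparation scripts must be run in the order listed here before rendering the QMD chapters. Note that 05_process_manifestos.R runs before 04_build_analysis_data.R, because the final merge needs the extracted manifesto text: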

# 1. code/01_load_candidates.R       -> data/derived/candidates_analysis.rds
# 2. code/02_load_finance_raw.R      -> data/derived/finance_by_candidate.rds
#                                       data/derived/finance_transactions.rds
#                                       output/tables/finance_revenue_source_categories.csv
# 3. code/03_prepare_geography.R     -> data/geo_cache/municipalities.rds
#                                       data/geo_cache/states.rds
# 4. code/05_process_manifestos.R    -> data/derived/manifestos_text.rds
# 5. code/04_build_analysis_data.R   -> data/derived/analysis_full.rds
#
# After all five scripts have run:
# 6. quarto render docs/             -> docs/_site/ (rendered website)

8.3 Data Access

The parent project data is stored on Dropbox and is not publicly available. Researchers wishing to replicate this analysis should:

  1. Download the raw TSE candidate registration files from https://dadosabertos.tse.jus.br/
  2. Download the raw TSE campaign finance files from the same portal
  3. Contact VOTE LGBT for their candidate identification list
  4. Run the parent project’s integration scripts to produce the processed files
  5. Run this project’s data preparation scripts in order (01, 02, 03, 05, then 04)

All code in this project is designed to be portable, provided the paths in code/00_setup.R are configured correctly.