source(here::here("code", "00_setup.R"))
df <- readRDS(paths$analysis_full_rds)
From Raw TSE Files to Analysis-Ready Data
This chapter documents the complete data pipeline that transforms raw electoral records into the analysis-ready dataset used throughout Chapters 1–6. Transparency and reproducibility are core objectives: every classification decision, matching algorithm, and variable derivation is described here so that other researchers can evaluate, replicate, or extend this work.
The pipeline consists of five R scripts, executed sequentially, each reading specific inputs and producing well-defined outputs. An upstream step in the parent research project handles the initial VOTE LGBT matching; all subsequent processing occurs within this descriptives project.
flowchart TD
A["TSE Candidate Registration\n~464K rows, 59 cols"] --> B["01_load_candidates.R"]
C["VOTE LGBT List\n~1,300 candidates"] --> D["07_merge_lgbtq_list.R\n(Parent Project)"]
D --> A
E["TSE Finance Raw\n~5.5M transactions, 1.3GB"] --> F["02_load_finance_raw.R"]
G["IBGE Shapefiles\ngeobr package"] --> H["03_prepare_geography.R"]
M["Municipality Characteristics\nIBGE Census, 5,570 rows"] --> I
N["2022 Presidential Voting\nTSE section-level data"] --> I
O["Candidate Manifestos\n27 state ZIPs, ~15.8K PDFs"] --> P["05_process_manifestos.R"]
B --> I["04_build_analysis_data.R"]
F --> I
H --> I
P --> I
I --> J["analysis_full.rds\n~464K rows, 112 cols"]
J --> K["Quarto Website\nChapters 1-8"]
01_load_candidates.R reads the integrated candidate CSV from the parent project (~210 MB, ~464K rows x 59 columns). It creates derived variables—simplified demographics, age groups, regional assignments, ideology categories, and the disaggregated LGBTQ+ identity classification—and saves the result as candidates_analysis.rds.
02_load_finance_raw.R reads the raw TSE campaign finance file (receitas_candidatos_2024_BRASIL.csv, ~1.3 GB, semicolon-delimited, Latin1 encoding). It documents every revenue source category in the raw data, classifies each transaction into funding types (self, party, individual, crowdfunding, other candidates, other), aggregates to the candidate level, validates the aggregation against a pre-existing aggregated file, and saves both candidate-level summaries (finance_by_candidate.rds) and the full transaction-level data (finance_transactions.rds).
03_prepare_geography.R downloads and caches municipality and state boundary shapefiles from the IBGE via the geobr R package. These are stored locally in data/geo_cache/ so that subsequent renders do not re-download. It also reads the municipality crosswalk that bridges TSE and IBGE coding systems.
05_process_manifestos.R extracts text from candidate manifesto PDFs (propostas de governo), which are mandatory filings for mayoral candidates. The TSE distributes these as 27 state-level ZIP archives (~11 GB total, ~15,800 PDFs). The script processes each state sequentially: unzipping to a temporary directory, parsing the SQ_CANDIDATO (candidate ID) from each filename, extracting text with pdftotext and page counts with pdfinfo (poppler), and cleaning up. Corrupt or image-only PDFs (~3.5%) are flagged as extraction_ok = FALSE with NA text. The output is manifestos_text.rds.
04_build_analysis_data.R merges the four preceding outputs—candidates, finance, geography, and manifestos—into a single analysis-ready file (analysis_full.rds, ~464K rows, 112 columns). In addition to filling financial variables with zeros for unmatched candidates, this script:
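- Reads municipality characteristics from the IBGE Census (population, GDP per capita, racial composition, poverty indicators, state capital status) via the parent project's intermediate file
- Computes n_parties (number of distinct parties) and n_seats (city council seats) per municipality from the candidate data
- Computes Bolsonaro 2022 vote share per municipality from TSE section-level voting data (cached to avoid re-processing the large file)
- Creates derived variables: pop_bracket (5 population categories), bolsonaro_quartile (conservatism quartiles), and pct_nonwhite_muni (municipality racial composition)
- Merges manifesto text from manifestos_text.rds (deduplicated to one manifesto per candidate, keeping the longest text when multiple PDFs exist). Non-mayor candidates and mayors without a manifesto get has_manifesto = FALSE.
Identifying LGBTQ+ candidates in a population of nearly half a million candidacies is the central methodological challenge of this project. We draw on two complementary sources, then apply a three-stage matching algorithm to link the external VOTE LGBT list to official TSE records.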
For the first time in Brazilian electoral history, the Tribunal Superior Eleitoral (TSE) allowed candidates to voluntarily declare their sexual orientation and gender identity on the official registration form for the 2024 municipal elections. This was a landmark step in institutional recognition of LGBTQ+ political participation.
The TSE registration data provides two relevant fields:
- DS_ORIENTACAO_SEXUAL: sexual orientation (e.g., Heterossexual, Gay, Lesbica, Bissexual, Pansexual, Assexual, or open text)
- DS_IDENTIDADE_GENERO: gender identity (e.g., Cisgenero, Transgenero/Travesti, Nao-binario, or open text)
Candidates who reported any non-heterosexual orientation or non-cisgender identity are flagged as LGBTQ+.
Self-disclosure was entirely voluntary. Many LGBTQ+ candidates—particularly those in conservative regions or parties—may have chosen not to declare, meaning the TSE data alone substantially undercounts the true LGBTQ+ candidate population. This is why the VOTE LGBT list is an essential complement.
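VOTE LGBT (Voto com Orgulho) is a Brazilian civil society project that independently identifies LGBTQ+ candidates running in elections. Their identification methods include:
- Direct self-identification to the organization
- Public campaigning on LGBTQ+ issues or identity
- Social media analysis and community referrals
- Nominations from LGBTQ+ advocacy organizations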
For the 2024 municipal elections, VOTE LGBT identified approximately 1,300 candidates. Their list provides: ballot name, state, municipality, position sought, party, gender identity, sexual orientation, and (for some candidates) vote counts.
The two sources have partially overlapping but distinct coverage. The TSE captures candidates who chose to declare on the official form regardless of public visibility, while VOTE LGBT captures publicly visible LGBTQ+ candidates regardless of whether they filed a TSE declaration. The union of both sources provides the most comprehensive identification achievable with available data.
Linking the VOTE LGBT candidate list to TSE registration records requires reconciling informal names, varying municipality spellings, and approximate string matches. The matching proceeds in three stages, each applied only to candidates not yet matched by a prior stage.
The first and most restrictive stage joins on three standardized keys: ballot name, state, and municipality. Both datasets are preprocessed by removing accents, collapsing whitespace, and converting to uppercase.
# Stage 1: Exact match (name + state + municipality)
lgbt_matched_exact <- lgbt_clean %>%
  inner_join(
    candidates_std %>%
      select(candidate_id, ballot_name_std, state_std, municipality_std),
    by = c("ballot_name_lgbt" = "ballot_name_std",
           "state_lgbt" = "state_std",
           "municipality_lgbt" = "municipality_std")
  ) %>%
  mutate(match_type = "exact_full", match_quality = 1.0)
- Join keys: ballot_name_lgbt = ballot_name_std, state_lgbt = state_std, municipality_lgbt = municipality_std
- Quality score: 1.0 (perfect match on all three dimensions)
- Expectation: Captures the majority of matches, as most VOTE LGBT entries use the official ballot name and standard municipality spelling
For candidates unmatched after Stage 1, we relax the municipality requirement and match on ballot name and state alone.
# Stage 2: Exact match (name + state, relaxing municipality)
lgbt_unmatched_s1 <- lgbt_clean %>%
  anti_join(lgbt_matched_exact, by = "row_id")
lgbt_matched_state <- lgbt_unmatched_s1 %>%
  inner_join(
    candidates_std %>%
      select(candidate_id, ballot_name_std, state_std),
    by = c("ballot_name_lgbt" = "ballot_name_std",
           "state_lgbt" = "state_std")
  ) %>%
  mutate(match_type = "exact_state", match_quality = 0.9)
- Rationale: The VOTE LGBT list sometimes records informal municipality names, abbreviated forms, or the name of a metropolitan area rather than the specific municipality where the candidate is registered. Matching on name + state resolves these cases while still requiring an exact name match as a safeguard.
- Quality score: 0.9
For remaining unmatched candidates, we apply fuzzy string matching within the same state using the Jaro-Winkler distance metric, which is well-suited for name matching because it gives extra weight to matching prefixes, a useful property when ballot names share common first elements.
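# Stage 3: Fuzzy match using Jaro-Winkler distance within state
library(dplyr)
library(stringdist)
# Candidates still unmatched after Stage 2 (assumed to mirror the Stage 2 anti-join)
lgbt_unmatched_s2 <- lgbt_unmatched_s1 %>%
  anti_join(lgbt_matched_state, by = "row_id")
fuzzy_matches <- vector("list", nrow(lgbt_unmatched_s2))
for (i in seq_len(nrow(lgbt_unmatched_s2))) {
  state_candidates <- candidates_std %>%
    filter(state_std == lgbt_unmatched_s2$state_lgbt[i])
  # Note: stringdist's prefix weight p defaults to 0 (plain Jaro distance);
  # pass p > 0 (e.g., p = 0.1) to apply the Winkler prefix bonus
  distances <- stringdist(
    lgbt_unmatched_s2$ballot_name_lgbt[i],
    state_candidates$ballot_name_std,
    method = "jw"
  )
  best_idx <- which.min(distances)
  if (length(best_idx) == 0) next  # no candidates (or only NA names) in this state
  min_dist <- distances[best_idx]
  # Accept if distance < 0.15 (high similarity)
  if (min_dist < 0.15) {
    # Record the match with quality = 1 - distance
    fuzzy_matches[[i]] <- tibble(
      row_id        = lgbt_unmatched_s2$row_id[i],
      candidate_id  = state_candidates$candidate_id[best_idx],
      match_type    = "fuzzy_jw",
      match_quality = 1 - min_dist
    )
  }
}
# Hypothetical result name; bind_rows() drops the NULL entries left by unmatched rows
lgbt_matched_fuzzy <- bind_rows(fuzzy_matches)
- Threshold: Matches are accepted only when the Jaro-Winkler distance is less than 0.15, corresponding to a similarity of at least 0.85. This conservative threshold minimizes false positives.
- Quality score: 1 - distance (ranges from 0.85 to ~0.99)
- Manual review: All fuzzy matches were manually inspected. State and party information served as additional disambiguation criteria in ambiguous cases.
The Jaro-Winkler distance is preferred over alternatives (Levenshtein, cosine similarity) for person-name matching because it (1) is normalized to [0, 1], (2) penalizes transpositions less harshly than insertions/deletions, and (3) applies a prefix bonus that rewards names sharing the same initial characters, which is common in Brazilian ballot names that begin with a first name or nickname.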
Once a candidate is identified as LGBTQ+ (from either source), they are assigned to a disaggregated identity category. The key design decision is that trans identity is prioritized over sexual orientation, because gender identity is a distinct dimension that cross-cuts orientation. A trans lesbian, for example, is categorized as “Trans” rather than “Lesbian” in our primary classification.
The resulting categories are:
| Category | Definition |
|---|---|
| Trans | Any candidate with a transgender or travesti gender identity, regardless of sexual orientation |
| Gay | Cisgender male candidates with gay sexual orientation |
| Lesbian | Cisgender female candidates with lesbian (Lesbica) sexual orientation |
| Bisexual+ | Candidates with bisexual or pansexual orientation (collapsed because both describe attraction to more than one gender) |
| Asexual | Candidates with asexual orientation |
| Other LGBTQ+ | Identified LGBTQ+ candidates whose specific identity does not map to the above categories |
This follows conventions in both the LGBTQ+ studies literature and Brazilian activist nomenclature. Transgender identity is a gender identity, not a sexual orientation. A trans person may simultaneously be gay, lesbian, bisexual, or asexual. Collapsing these into a single “Trans” category for the primary classification preserves the analytically distinct dimension of gender identity, while the underlying data retains full detail for secondary analyses.
The following tables summarize the results of the matching process across the identified LGBTQ+ population.
| Disclosure Source | N | % |
|---|---|---|
| TSE | 2215 | 70.7% |
| VOTE + TSE | 628 | 20.0% |
| VOTE | 291 | 9.3% |
| Match Type | TSE | VOTE | VOTE + TSE |
|---|---|---|---|
| exact_full | 2215 | 289 | 628 |
| fuzzy | 0 | 2 | 0 |
| N | Mean | Median | Min | Max |
|---|---|---|---|---|
| 3,134 | 1.000 | 1.000 | 0.974 | 1.000 |
Chapter 5 presents the results of the campaign finance analysis; this section documents the classification methodology that underpins it. The goal is to transform the raw TSE revenue file—which contains Portuguese-language category labels and complex institutional source codes—into a clean, analytically useful typology of funding sources.
The TSE classifies each revenue transaction by DS_ORIGEM_RECEITA (revenue source). The following table maps the raw Portuguese categories to our English-language classification:
| Portuguese Category | English Translation | Our Classification |
|---|---|---|
| Recursos proprios | Own resources | self_funding |
| Fundo Partidario / Fundo Especial (partido politico) | Party fund / Special fund | party_funding |
| Doacoes de pessoas fisicas | Individual donations | individual_funding |
| Financiamento coletivo | Crowdfunding | crowdfunding |
| Doacoes de outros candidatos/comites | Other candidates/committees | other_candidates |
| Everything else | Various | other |
Self-funding is identified through two complementary methods, capturing cases that either source alone would miss:
- The DS_ORIGEM_RECEITA field contains “recursos pr” (own resources), matched case-insensitively.
- The donor ID (NR_CPF_CNPJ_DOADOR) matches the candidate’s own CPF (NR_CPF_CANDIDATO). This catches cases where a candidate donates to their own campaign but the transaction is categorized under a different revenue origin.
A transaction satisfying either condition is classified as self_funding.
funding_type = case_when(
  str_detect(DS_ORIGEM_RECEITA, "(?i)recursos pr") |
    (NR_CPF_CNPJ_DOADOR == NR_CPF_CANDIDATO) ~ "self_funding",
  str_detect(DS_ORIGEM_RECEITA, "(?i)partido pol") ~ "party_funding",
  str_detect(DS_ORIGEM_RECEITA, "(?i)pessoas f") ~ "individual_funding",
  str_detect(DS_ORIGEM_RECEITA, "(?i)financiamento coletivo") ~ "crowdfunding",
  str_detect(DS_ORIGEM_RECEITA, "(?i)outros candidatos") ~ "other_candidates",
  TRUE ~ "other"
)
The case_when() function evaluates conditions in order and assigns the first matching category. Self-funding is tested first so that a candidate’s own donation is always classified as self-funding, even if the TSE recorded the revenue origin under a different label. This is a deliberate design choice that prioritizes economic substance (who provided the money) over institutional labeling.
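The effect of testing self-funding first can be seen on a few toy rows. This is a minimal sketch, not the project's code: the transaction values are invented, and only the column names and patterns come from the snippet above.

```r
library(dplyr)
library(stringr)

# Toy transactions (invented values); row 2 is a candidate's own donation
# that carries an "individual donation" style label
tx <- tibble(
  DS_ORIGEM_RECEITA  = c("Recursos proprios",
                         "Recursos de pessoas fisicas",
                         "Recursos de pessoas fisicas",
                         "Recursos de partido politico"),
  NR_CPF_CNPJ_DOADOR = c("111", "111", "222", "999"),
  NR_CPF_CANDIDATO   = c("111", "111", "111", "111")
)

tx <- tx %>%
  mutate(
    funding_type = case_when(
      # Tested first: the label OR a donor/candidate ID match marks self-funding
      str_detect(DS_ORIGEM_RECEITA, "(?i)recursos pr") |
        (NR_CPF_CNPJ_DOADOR == NR_CPF_CANDIDATO)       ~ "self_funding",
      str_detect(DS_ORIGEM_RECEITA, "(?i)partido pol") ~ "party_funding",
      str_detect(DS_ORIGEM_RECEITA, "(?i)pessoas f")   ~ "individual_funding",
      TRUE                                             ~ "other"
    )
  )

tx$funding_type
```

If the `individual_funding` condition were listed first, row 2 would be counted as an individual donation even though the donor is the candidate.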
The TSE field DS_NATUREZA_RECEITA distinguishes between cash contributions ("FINANCEIRO") and estimated or in-kind contributions (services, materials, event spaces, etc.). Our dataset preserves both the financial amount (financial_amt) and in-kind amount (inkind_amt) at the candidate level, along with their percentage shares of total revenue.
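A candidate-level split of cash and in-kind revenue might look like the sketch below. The amount column `VR_RECEITA`, the `"ESTIMAVEL"` label, the `pct_*` names, and the toy values are assumptions for illustration; only `DS_NATUREZA_RECEITA`, `"FINANCEIRO"`, `financial_amt`, and `inkind_amt` appear in the text above.

```r
library(dplyr)

# Toy transactions; VR_RECEITA and "ESTIMAVEL" are assumed names, values are invented
tx <- tibble(
  SQ_CANDIDATO        = c(1, 1, 2),
  DS_NATUREZA_RECEITA = c("FINANCEIRO", "ESTIMAVEL", "FINANCEIRO"),
  VR_RECEITA          = c(800, 200, 500)
)

by_candidate <- tx %>%
  group_by(SQ_CANDIDATO) %>%
  summarise(
    # Cash contributions vs. everything else (treated here as in-kind)
    financial_amt = sum(VR_RECEITA[DS_NATUREZA_RECEITA == "FINANCEIRO"]),
    inkind_amt    = sum(VR_RECEITA[DS_NATUREZA_RECEITA != "FINANCEIRO"]),
    .groups = "drop"
  ) %>%
  mutate(
    total_revenue = financial_amt + inkind_amt,
    pct_financial = 100 * financial_amt / total_revenue,
    pct_inkind    = 100 * inkind_amt / total_revenue
  )
```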
Our raw aggregation from the ~5.5 million transaction-level records was validated against a pre-existing candidate-level aggregated file produced independently by the parent project. The two aggregations show near-perfect correlation, confirming that the raw processing pipeline faithfully reproduces the intended totals.
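A validation of this kind could be sketched as follows. The table and column names, the toy totals, and the tolerance of 1 BRL are all hypothetical; the actual check in `02_load_finance_raw.R` may differ.

```r
library(dplyr)

# Toy stand-ins for the two candidate-level aggregations (hypothetical names and values)
raw_agg    <- tibble(candidate_id = 1:4, total_raw    = c(100, 250, 0, 900))
parent_agg <- tibble(candidate_id = 1:4, total_parent = c(100, 250, 0, 900.5))

check <- inner_join(raw_agg, parent_agg, by = "candidate_id") %>%
  summarise(
    correlation  = cor(total_raw, total_parent),
    max_abs_diff = max(abs(total_raw - total_parent)),
    n_mismatch   = sum(abs(total_raw - total_parent) > 1)  # tolerance is an assumption
  )
```

Reporting the largest absolute difference alongside the correlation is useful because a correlation near 1 can still hide a constant offset or a few large discrepancies.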
The TSE uses its own internal municipality coding system, which differs from the IBGE’s official 7-digit municipality codes used by statistical agencies, the census, and the geobr R package. To produce choropleth maps and spatial analyses, we need to bridge these two systems.
The crosswalk file (crosswalk_tse_to_geobr.csv) resolves this mismatch through name-based matching within state.
The resulting geobr_code (IBGE 7-digit code) enables spatial joins with geobr shapefiles via code_muni.
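Name-based matching within state can be sketched as a join on a standardized municipality name plus the state abbreviation. The table layouts, the `tse_code`, `state`, and `muni_name_std` columns, and the toy rows are assumptions for illustration; only `geobr_code` and `code_muni` come from the text.

```r
library(dplyr)

# Toy TSE-side and IBGE-side municipality tables (illustrative values)
tse_munis <- tibble(
  tse_code      = c("71072", "60011", "99999"),
  state         = c("SP", "RJ", "ZZ"),
  muni_name_std = c("SAO PAULO", "RIO DE JANEIRO", "EXTERIOR")
)
ibge_munis <- tibble(
  code_muni     = c(3550308, 3304557),
  state         = c("SP", "RJ"),
  muni_name_std = c("SAO PAULO", "RIO DE JANEIRO")
)

# Match on name within state; unmatched TSE codes keep geobr_code = NA
crosswalk <- tse_munis %>%
  left_join(ibge_munis, by = c("state", "muni_name_std")) %>%
  rename(geobr_code = code_muni)

match_rate <- mean(!is.na(crosswalk$geobr_code))
```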
Municipality match rate: 99.9%
Candidates with geobr_code: 463,338 of 463,601
The unmatched candidates (263 of 463,601, about 0.06%) are primarily from overseas voting sections or reflect rare discrepancies between TSE and IBGE municipality code systems.
Municipality-level characteristics from the IBGE 2022 Census are merged via the parent project’s municipality_characteristics.csv intermediate file. Variables include population, GDP per capita (deflated), racial composition (% White, Black, Brown, Indigenous), poverty indicators, and state capital status. The merge key is geobr_code (IBGE 7-digit municipality code), matching 99.9% of candidates.
Local electorate conservatism is proxied by Bolsonaro’s first-round vote share in the 2022 presidential election. This is computed from TSE section-level voting data (votacao_secao_2022_BR.csv):
- Keep first-round records (NR_TURNO == 1)
- Aggregate votes by municipality (CD_MUNICIPIO)
- Compute the share as 100 * (votes for candidate 22) / (total valid votes, excluding codes 95 and 96)
The result is cached as bolsonaro_share.rds to avoid re-processing the large file. Coverage is 88.8% of candidates; unmatched cases are primarily overseas voters or municipalities with TSE code mismatches.
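The vote-share computation can be sketched on toy data. The ballot-number column `NR_VOTAVEL`, the vote-count column `QT_VOTOS`, and the rows are assumptions for illustration; `NR_TURNO`, `CD_MUNICIPIO`, candidate number 22, and the excluded codes 95 and 96 come from the description above.

```r
library(dplyr)

# Toy section-level rows; NR_VOTAVEL and QT_VOTOS are assumed column names
votes <- tibble(
  NR_TURNO     = c(1, 1, 1, 1, 2),
  CD_MUNICIPIO = c(10, 10, 10, 10, 10),
  NR_VOTAVEL   = c(22, 13, 95, 96, 22),
  QT_VOTOS     = c(40, 60, 5, 5, 70)
)

bolsonaro_share <- votes %>%
  # First round only; drop codes 95 and 96 so the denominator is valid votes
  filter(NR_TURNO == 1, !NR_VOTAVEL %in% c(95, 96)) %>%
  group_by(CD_MUNICIPIO) %>%
  summarise(
    bolsonaro_share = 100 * sum(QT_VOTOS[NR_VOTAVEL == 22]) / sum(QT_VOTOS),
    .groups = "drop"
  )
```

In this toy municipality, 40 of 100 valid first-round votes go to candidate 22, and the second-round row is ignored.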
Two additional variables are derived from the candidate data itself:
- n_parties: number of distinct party abbreviations running candidates in each municipality
- n_seats: number of elected city councilors per municipality (defaulting to 9, Brazil’s statutory minimum, when no elected councilors are found)
The following table documents all derived variables in the analysis dataset, including the transformation logic and rationale for each.
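Both municipality-level counts can be derived in a single grouped summary. The sketch below assumes hypothetical columns `municipality` and `elected_councilor` and uses invented rows; `party_abbrev`, `n_parties`, `n_seats`, and the default of 9 come from the text.

```r
library(dplyr)

# Toy candidate rows; municipality and elected_councilor are hypothetical column names
cands <- tibble(
  municipality      = c("A", "A", "A", "B", "B"),
  party_abbrev      = c("PT", "PL", "PT", "MDB", "PL"),
  elected_councilor = c(TRUE, TRUE, FALSE, FALSE, FALSE)
)

muni_context <- cands %>%
  group_by(municipality) %>%
  summarise(
    n_parties = n_distinct(party_abbrev),
    n_elected = sum(elected_councilor),
    .groups = "drop"
  ) %>%
  # Fall back to the statutory minimum of 9 when no elected councilors are found
  mutate(n_seats = if_else(n_elected == 0L, 9L, n_elected))
```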
| Variable | Formula / Logic | Rationale |
|---|---|---|
| lgbt_category | Trans > Sexual Orientation (via make_lgbt_category()) | Gender identity cross-cuts orientation; trans identity is analytically distinct |
| education_simple | 8 TSE codes collapsed to 3 levels: Less than HS, High School, College+ | Reduces sparse cells for meaningful group comparisons |
| race_simple | 6 TSE codes collapsed to 4 levels: White, Brown, Black, Other | Aligns with census conventions; preserves key distinctions |
| age_group | 5 bins: 18–29, 30–39, 40–49, 50–59, 60+ | Standard demographic grouping for age-stratified analysis |
| region | 27 states mapped to 5 macro-regions | Standard Brazilian geographic division (North, Northeast, Center-West, Southeast, South) |
| ideology_category | <4.0 Left, 4.0–7.1 Center, >=7.1 Right | Bolognesi et al. (2023) expert survey thresholds, anchored on PT, MDB, PL |
| female | gender == "FEMININO" | Binary indicator for two-group gender comparisons |
| nonwhite | race != "BRANCA" (NA-safe) | Binary indicator for two-group racial comparisons |
| lgbtq_label | "LGBTQ+" / "Non-LGBTQ+" factor | Clean label for plots and tables; ordered for consistent display |
| position_simple | TSE position codes translated to English: City Councilor, Mayor, Vice-Mayor | Readable labels for analysis and visualization |
| populacao_2022 | IBGE 2022 Census population | Municipality-level context for geographic analyses |
| pop_bracket | Population cut into 5 brackets: <10K, 10K-50K, 50K-200K, 200K-500K, 500K+ | Meaningful population categories for cross-tabulation |
| capital_uf | Binary: 1 = state capital, 0 = otherwise | Distinguishes capitals (typically larger, more progressive) |
| pib_per_capita_defla | GDP per capita, deflated (IBGE) | Economic development proxy |
| bolsonaro_share | Bolsonaro 2022 first-round vote share (%) per municipality | Proxy for local electorate conservatism |
| bolsonaro_quartile | Quartile of bolsonaro_share: Q1 (least conservative) through Q4 | Categorical conservatism measure for cross-tabulation |
| n_parties | n_distinct(party_abbrev) per municipality | Political fragmentation / competitiveness |
| n_seats | Elected councilors per municipality (min 9) | Council size / institutional context |
| pct_nonwhite_muni | perc_preta + perc_parda from Census | Municipality racial composition |
| has_manifesto | Logical: whether a manifesto PDF was found and text extracted | Only available for mayoral candidates; ~96.5% extraction rate |
| manifesto_text | Full extracted text of the candidate’s manifesto (proposta de governo) | NA for non-mayors and image-only PDFs |
| manifesto_n_pages | Page count of the manifesto PDF (from pdfinfo) | Typically 6–13 pages |
| manifesto_n_words | Word count of extracted manifesto text | Typically 1,500–2,000 words |
| threeway_group | Trans / LGB / Non-LGBTQ+ (trans_candidate > lgb_candidate) | Three-way comparison in Ch06 intersectional analysis |
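As one example of the transformations in the table, `pop_bracket` could be built with base R's `cut()`. This sketch assumes left-closed intervals (a municipality with exactly 10,000 residents falls in "10K-50K"); the table does not say which side of each boundary is inclusive, and the population values are invented.

```r
# Toy municipality populations (invented values)
pop <- c(4500, 10000, 75000, 320000, 1200000)

# right = FALSE makes each interval [lower, upper); this closure is an assumption
pop_bracket <- cut(
  pop,
  breaks = c(0, 1e4, 5e4, 2e5, 5e5, Inf),
  labels = c("<10K", "10K-50K", "50K-200K", "200K-500K", "500K+"),
  right  = FALSE
)
```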
The Left/Center/Right cutpoints are not arbitrary. They are derived from natural breaks in the Bolognesi et al. (2023) expert survey distribution and anchored on three parties that serve as ideological reference points: PT (Partido dos Trabalhadores, score ~2.0, clearly Left), MDB (Movimento Democratico Brasileiro, score ~5.5, centrist catch-all), and PL (Partido Liberal, score ~8.5, clearly Right under Bolsonaro). Approximately 2% of candidates belong to minor parties not covered by the survey and have NA ideology scores.
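Applied to the three anchor parties, the cutpoints behave as in the sketch below. The `party` and `ideology_score` column names and the `"Minor"` placeholder row are hypothetical; the scores are the approximate values quoted above, and Center is treated as 4.0 up to but not including 7.1.

```r
library(dplyr)

# Approximate anchor scores from the text, plus one party without a survey score
party_scores <- tibble(
  party          = c("PT", "MDB", "PL", "Minor"),
  ideology_score = c(2.0, 5.5, 8.5, NA)
)

party_scores <- party_scores %>%
  mutate(
    ideology_category = case_when(
      ideology_score < 4.0  ~ "Left",
      ideology_score < 7.1  ~ "Center",   # reached only when score >= 4.0
      ideology_score >= 7.1 ~ "Right",
      TRUE                  ~ NA_character_  # parties not covered by the survey
    )
  )
```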
The data pipeline depends on source files from the parent research project. If running on a machine other than the original development environment, set the QPP_DATA_DIR environment variable to point to the parent project directory.
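One way to set the variable for the current R session is shown below. The path is a placeholder, and how `00_setup.R` actually reads `QPP_DATA_DIR` is not shown in this chapter, so treat this as a sketch; a line in `.Renviron` would make the setting persistent.

```r
# Placeholder path: replace with the parent project's location on your machine
Sys.setenv(QPP_DATA_DIR = "/path/to/parent_project")

# Read it back and report if the directory is missing
data_dir <- Sys.getenv("QPP_DATA_DIR")
if (!dir.exists(data_dir)) {
  message("QPP_DATA_DIR does not point to an existing directory: ", data_dir)
}
```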
Run the five data preparation scripts in sequence before rendering the Quarto site:
# Step 1: Load and prepare candidate data
source(here::here("code", "01_load_candidates.R"))
# -> data/derived/candidates_analysis.rds
# Step 2: Process raw campaign finance data
source(here::here("code", "02_load_finance_raw.R"))
# -> data/derived/finance_by_candidate.rds
# -> data/derived/finance_transactions.rds
# -> output/tables/finance_revenue_source_categories.csv
# Step 3: Download and cache geographic boundaries
source(here::here("code", "03_prepare_geography.R"))
# -> data/geo_cache/municipalities.rds
# -> data/geo_cache/states.rds
# Step 4: Extract text from candidate manifesto PDFs (~15,800 PDFs)
source(here::here("code", "05_process_manifestos.R"))
# -> data/derived/manifestos_text.rds
# Step 5: Merge all components into analysis dataset
source(here::here("code", "04_build_analysis_data.R"))
# -> data/derived/analysis_full.rds
# -> data/derived/bolsonaro_share.rds (cached)
# Step 6: Render the Quarto website
# quarto render docs/
Once analysis_full.rds exists, the Quarto site can be rendered with:
quarto render docs/
All chapters source code/00_setup.R for shared paths, palettes, themes, and helper functions. The freeze: auto setting in _quarto.yml means that chapters whose source code has not changed will not be re-executed on subsequent renders.
The parent project data is stored on Dropbox and is not publicly available. Researchers wishing to replicate this analysis should:
- Set QPP_DATA_DIR and run this project’s scripts 01–05 in order
- Render the site with quarto render docs/
---
title: "7. Data Pipeline & Methodology"
subtitle: "From Raw TSE Files to Analysis-Ready Data"
---
```{r setup}
source(here::here("code", "00_setup.R"))
df <- readRDS(paths$analysis_full_rds)
```
# Pipeline Overview
This chapter documents the complete data pipeline that transforms raw electoral records into the analysis-ready dataset used throughout Chapters 1--6. Transparency and reproducibility are core objectives: every classification decision, matching algorithm, and variable derivation is described here so that other researchers can evaluate, replicate, or extend this work.
The pipeline consists of five R scripts, executed sequentially, each reading specific inputs and producing well-defined outputs. An upstream step in the parent research project handles the initial VOTE LGBT matching; all subsequent processing occurs within this descriptives project.
## Data Flow
```{mermaid}
flowchart TD
A["TSE Candidate Registration\n~464K rows, 59 cols"] --> B["01_load_candidates.R"]
C["VOTE LGBT List\n~1,300 candidates"] --> D["07_merge_lgbtq_list.R\n(Parent Project)"]
D --> A
E["TSE Finance Raw\n~5.5M transactions, 1.3GB"] --> F["02_load_finance_raw.R"]
G["IBGE Shapefiles\ngeobr package"] --> H["03_prepare_geography.R"]
M["Municipality Characteristics\nIBGE Census, 5,570 rows"] --> I
N["2022 Presidential Voting\nTSE section-level data"] --> I
O["Candidate Manifestos\n27 state ZIPs, ~15.8K PDFs"] --> P["05_process_manifestos.R"]
B --> I["04_build_analysis_data.R"]
F --> I
H --> I
P --> I
I --> J["analysis_full.rds\n~464K rows, 112 cols"]
J --> K["Quarto Website\nChapters 1-8"]
```
## Script Descriptions
**`01_load_candidates.R`** reads the integrated candidate CSV from the parent project (~210 MB, ~464K rows x 59 columns). It creates derived variables---simplified demographics, age groups, regional assignments, ideology categories, and the disaggregated LGBTQ+ identity classification---and saves the result as `candidates_analysis.rds`.
**`02_load_finance_raw.R`** reads the raw TSE campaign finance file (`receitas_candidatos_2024_BRASIL.csv`, ~1.3 GB, semicolon-delimited, Latin1 encoding). It documents every revenue source category in the raw data, classifies each transaction into funding types (self, party, individual, crowdfunding, other candidates, other), aggregates to the candidate level, validates the aggregation against a pre-existing aggregated file, and saves both candidate-level summaries (`finance_by_candidate.rds`) and the full transaction-level data (`finance_transactions.rds`).
**`03_prepare_geography.R`** downloads and caches municipality and state boundary shapefiles from the IBGE via the `geobr` R package. These are stored locally in `data/geo_cache/` so that subsequent renders do not re-download. It also reads the municipality crosswalk that bridges TSE and IBGE coding systems.
**`05_process_manifestos.R`** extracts text from candidate manifesto PDFs (*propostas de governo*), which are mandatory filings for mayoral candidates. The TSE distributes these as 27 state-level ZIP archives (~11 GB total, ~15,800 PDFs). The script processes each state sequentially: unzipping to a temporary directory, parsing the `SQ_CANDIDATO` (candidate ID) from each filename, extracting text with `pdftotext` and page counts with `pdfinfo` (poppler), and cleaning up. Corrupt or image-only PDFs (~3.5%) are flagged as `extraction_ok = FALSE` with `NA` text. The output is `manifestos_text.rds`.
**`04_build_analysis_data.R`** merges the four preceding outputs---candidates, finance, geography, and manifestos---into a single analysis-ready file (`analysis_full.rds`, ~464K rows, 112 columns). In addition to filling financial variables with zeros for unmatched candidates, this script:
- Reads **municipality characteristics** from the IBGE Census (population, GDP per capita, racial composition, poverty indicators, state capital status) via the parent project's intermediate file
- Computes **n_parties** (number of distinct parties) and **n_seats** (city council seats) per municipality from the candidate data
- Computes **Bolsonaro 2022 vote share** per municipality from TSE section-level voting data (cached to avoid re-processing the large file)
- Creates derived variables: `pop_bracket` (5 population categories), `bolsonaro_quartile` (conservatism quartiles), and `pct_nonwhite_muni` (municipality racial composition)
- Merges **manifesto text** from `manifestos_text.rds` (deduplicated to one manifesto per candidate, keeping the longest text when multiple PDFs exist). Non-mayor candidates and mayors without a manifesto get `has_manifesto = FALSE`.
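The deduplication rule in the last bullet (keep the longest text per candidate) can be sketched as follows. This is an illustration rather than the project's code: the `candidate_id` and `text` column names and the toy rows are assumptions.

```r
library(dplyr)

# Toy stand-in for manifestos_text.rds; candidate 1 has two PDFs
manifestos <- tibble(
  candidate_id = c(1, 1, 2),
  text         = c("short plan", "a much longer government plan", "only plan")
)

# Keep one row per candidate: the manifesto with the most characters
manifestos_dedup <- manifestos %>%
  group_by(candidate_id) %>%
  slice_max(nchar(text), n = 1, with_ties = FALSE) %>%
  ungroup()
```

Setting `with_ties = FALSE` guarantees exactly one row per candidate even when two extracted texts have the same length.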
# LGBTQ+ Identification Methodology
Identifying LGBTQ+ candidates in a population of nearly half a million candidacies is the central methodological challenge of this project. We draw on two complementary sources, then apply a three-stage matching algorithm to link the external VOTE LGBT list to official TSE records.
## Two Sources of Identification
### TSE Self-Disclosure (New in 2024)
For the first time in Brazilian electoral history, the *Tribunal Superior Eleitoral* (TSE) allowed candidates to voluntarily declare their sexual orientation and gender identity on the official registration form for the 2024 municipal elections. This was a landmark step in institutional recognition of LGBTQ+ political participation.
The TSE registration data provides two relevant fields:
- **`DS_ORIENTACAO_SEXUAL`** --- sexual orientation (e.g., Heterossexual, Gay, Lesbica, Bissexual, Pansexual, Assexual, or open text)
- **`DS_IDENTIDADE_GENERO`** --- gender identity (e.g., Cisgenero, Transgenero/Travesti, Nao-binario, or open text)
Candidates who reported any non-heterosexual orientation or non-cisgender identity are flagged as LGBTQ+.
::: {.callout-important}
## Voluntary Disclosure
Self-disclosure was entirely voluntary. Many LGBTQ+ candidates---particularly those in conservative regions or parties---may have chosen not to declare, meaning the TSE data alone substantially undercounts the true LGBTQ+ candidate population. This is why the VOTE LGBT list is an essential complement.
:::
### VOTE LGBT Candidate List
VOTE LGBT (*Voto com Orgulho*) is a Brazilian civil society project that independently identifies LGBTQ+ candidates running in elections. Their identification methods include:
- Direct self-identification to the organization
- Public campaigning on LGBTQ+ issues or identity
- Social media analysis and community referrals
- Nominations from LGBTQ+ advocacy organizations
For the 2024 municipal elections, VOTE LGBT identified approximately 1,300 candidates. Their list provides: ballot name, state, municipality, position sought, party, gender identity, sexual orientation, and (for some candidates) vote counts.
::: {.callout-note}
## Complementary Coverage
The two sources have partially overlapping but distinct coverage. The TSE captures candidates who chose to declare on the official form regardless of public visibility, while VOTE LGBT captures publicly visible LGBTQ+ candidates regardless of whether they filed a TSE declaration. The union of both sources provides the most comprehensive identification achievable with available data.
:::
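Taking the union of the two sources amounts to combining two flags per candidate. In this sketch the `in_tse` and `in_vote` flags, the `lgbtq` and `disclosure_source` column names, and the toy rows are hypothetical.

```r
library(dplyr)

# Hypothetical flags: in_tse = declared on the TSE form, in_vote = on the VOTE LGBT list
candidates <- tibble(
  candidate_id = 1:4,
  in_tse       = c(TRUE, TRUE, FALSE, FALSE),
  in_vote      = c(FALSE, TRUE, TRUE, FALSE)
)

candidates <- candidates %>%
  mutate(
    lgbtq = in_tse | in_vote,  # identified by either source
    disclosure_source = case_when(
      in_tse & in_vote ~ "VOTE + TSE",
      in_tse           ~ "TSE",
      in_vote          ~ "VOTE",
      TRUE             ~ NA_character_  # not identified by either source
    )
  )
```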
## Matching Algorithm (3 Stages)
Linking the VOTE LGBT candidate list to TSE registration records requires reconciling informal names, varying municipality spellings, and approximate string matches. The matching proceeds in three stages, each applied only to candidates not yet matched by a prior stage.
### Stage 1: Exact Match (Name + State + Municipality)
The first and most restrictive stage joins on three standardized keys: ballot name, state, and municipality. Both datasets are preprocessed by removing accents, collapsing whitespace, and converting to uppercase.
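The preprocessing step could be written as a small helper like the one below. The function name `standardize_name()` is hypothetical, and the parent project's actual implementation may differ; the sketch uses `stringi` for accent removal and `stringr` for whitespace handling.

```r
library(stringi)
library(stringr)

# Hypothetical helper: strip accents, collapse whitespace, convert to uppercase
standardize_name <- function(x) {
  x <- stri_trans_general(x, "Latin-ASCII")  # transliterate accented letters
  x <- str_squish(x)                         # trim ends, collapse internal whitespace
  toupper(x)
}

standardized <- standardize_name(c("  José  da Silva ", "Conceição"))
```

Applying the same helper to the name, state, and municipality columns of both datasets keeps the join keys comparable.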
```{r stage-1-match}
#| eval: false
# Stage 1: Exact match (name + state + municipality)
lgbt_matched_exact <- lgbt_clean %>%
  inner_join(
    candidates_std %>%
      select(candidate_id, ballot_name_std, state_std, municipality_std),
    by = c("ballot_name_lgbt" = "ballot_name_std",
           "state_lgbt" = "state_std",
           "municipality_lgbt" = "municipality_std")
  ) %>%
  mutate(match_type = "exact_full", match_quality = 1.0)
```
- **Join keys**: `ballot_name_lgbt = ballot_name_std`, `state_lgbt = state_std`, `municipality_lgbt = municipality_std`
- **Quality score**: 1.0 (perfect match on all three dimensions)
- **Expectation**: Captures the majority of matches, as most VOTE LGBT entries use the official ballot name and standard municipality spelling
### Stage 2: Exact Match (Name + State Only)
For candidates unmatched after Stage 1, we relax the municipality requirement and match on ballot name and state alone.
```{r stage-2-match}
#| eval: false
# Stage 2: Exact match (name + state, relaxing municipality)
lgbt_unmatched_s1 <- lgbt_clean %>%
  anti_join(lgbt_matched_exact, by = "row_id")
lgbt_matched_state <- lgbt_unmatched_s1 %>%
  inner_join(
    candidates_std %>%
      select(candidate_id, ballot_name_std, state_std),
    by = c("ballot_name_lgbt" = "ballot_name_std",
           "state_lgbt" = "state_std")
  ) %>%
  mutate(match_type = "exact_state", match_quality = 0.9)
```
- **Rationale**: The VOTE LGBT list sometimes records informal municipality names, abbreviated forms, or the name of a metropolitan area rather than the specific municipality where the candidate is registered. Matching on name + state resolves these cases while still requiring an exact name match as a safeguard.
- **Quality score**: 0.9
### Stage 3: Fuzzy Matching (Jaro-Winkler Distance)
For remaining unmatched candidates, we apply fuzzy string matching within the same state using the Jaro-Winkler distance metric, which is well-suited for name matching because it gives extra weight to matching prefixes---a useful property when ballot names share common first elements.
```{r stage-3-match}
#| eval: false
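# Stage 3: Fuzzy match using Jaro-Winkler distance within state
library(dplyr)
library(stringdist)
# Candidates still unmatched after Stage 2 (assumed to mirror the Stage 2 anti-join)
lgbt_unmatched_s2 <- lgbt_unmatched_s1 %>%
  anti_join(lgbt_matched_state, by = "row_id")
fuzzy_matches <- vector("list", nrow(lgbt_unmatched_s2))
for (i in seq_len(nrow(lgbt_unmatched_s2))) {
  state_candidates <- candidates_std %>%
    filter(state_std == lgbt_unmatched_s2$state_lgbt[i])
  # Note: stringdist's prefix weight p defaults to 0 (plain Jaro distance);
  # pass p > 0 (e.g., p = 0.1) to apply the Winkler prefix bonus
  distances <- stringdist(
    lgbt_unmatched_s2$ballot_name_lgbt[i],
    state_candidates$ballot_name_std,
    method = "jw"
  )
  best_idx <- which.min(distances)
  if (length(best_idx) == 0) next  # no candidates (or only NA names) in this state
  min_dist <- distances[best_idx]
  # Accept if distance < 0.15 (high similarity)
  if (min_dist < 0.15) {
    # Record the match with quality = 1 - distance
    fuzzy_matches[[i]] <- tibble(
      row_id        = lgbt_unmatched_s2$row_id[i],
      candidate_id  = state_candidates$candidate_id[best_idx],
      match_type    = "fuzzy_jw",
      match_quality = 1 - min_dist
    )
  }
}
# Hypothetical result name; bind_rows() drops the NULL entries left by unmatched rows
lgbt_matched_fuzzy <- bind_rows(fuzzy_matches)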
```
- **Threshold**: Matches are accepted only when the Jaro-Winkler distance is less than 0.15, corresponding to a similarity greater than 0.85. This conservative threshold keeps false positives low.
- **Quality score**: `1 - distance` (ranges from just above 0.85 to ~0.99)
- **Manual review**: All fuzzy matches were manually inspected. State and party information served as additional disambiguation criteria in ambiguous cases.
::: {.callout-tip}
## Why Jaro-Winkler?
The Jaro-Winkler distance is preferred over alternatives (Levenshtein, cosine similarity) for person-name matching because it (1) is normalized to [0, 1], (2) penalizes transpositions less harshly than insertions/deletions, and (3) applies a prefix bonus that rewards names sharing the same initial characters---common in Brazilian ballot names that begin with a first name or nickname.
:::
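
To make the prefix bonus concrete, here is a small worked example on a textbook string pair rather than on project data. One caveat about the `stringdist` package is worth noting: with `method = "jw"` the prefix bonus is applied only when the `p` argument is greater than zero, and the default `p = 0` returns the plain Jaro distance. The value `p = 0.1` used here is the conventional Winkler scaling factor and is an assumption about a sensible setting.

```r
library(stringdist)

a <- "MARTHA"
b <- "MARHTA"  # same letters, one adjacent transposition

d_jaro <- stringdist(a, b, method = "jw")           # p = 0: plain Jaro distance
d_jw   <- stringdist(a, b, method = "jw", p = 0.1)  # with Winkler prefix scaling

sim_jaro <- 1 - d_jaro  # about 0.944
sim_jw   <- 1 - d_jw    # about 0.961, higher because "MAR" is a shared prefix
```

Both strings would pass the 0.15 distance threshold under either setting, but the prefix scaling moves names with a shared beginning further away from the cutoff.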
## Identity Categorization
Once a candidate is identified as LGBTQ+ (from either source), they are assigned to a disaggregated identity category. The key design decision is that **trans identity is prioritized over sexual orientation**, because gender identity is a distinct dimension that cross-cuts orientation. A trans lesbian, for example, is categorized as "Trans" rather than "Lesbian" in our primary classification.
```{r identity-categorization}
#| eval: false
# Trans identity is tested first, so it takes priority over orientation
lgbt_category = case_when(
  trans_candidate ~ "Trans",
  sexual_orientation == "Gay" ~ "Gay",
  sexual_orientation == "Lésbica" ~ "Lesbian",
  sexual_orientation %in% c("Bissexual", "Pansexual") ~ "Bisexual+",
  sexual_orientation == "Assexual" ~ "Asexual",
  TRUE ~ "Other LGBTQ+"
)
```
The resulting categories are:
| Category | Definition |
|----------|-----------|
| **Trans** | Any candidate with a transgender or travesti gender identity, regardless of sexual orientation |
| **Gay** | Cisgender male candidates with gay sexual orientation |
| **Lesbian** | Cisgender female candidates with lesbian (*Lésbica*) sexual orientation |
| **Bisexual+** | Candidates with bisexual or pansexual orientation (collapsed because both describe attraction to more than one gender) |
| **Asexual** | Candidates with asexual orientation |
| **Other LGBTQ+** | Identified LGBTQ+ candidates whose specific identity does not map to the above categories |
::: {.callout-note}
## Why Prioritize Trans?
This follows conventions in both the LGBTQ+ studies literature and Brazilian activist nomenclature. Transgender identity is a gender identity, not a sexual orientation. A trans person may simultaneously be gay, lesbian, bisexual, or asexual. Collapsing these into a single "Trans" category for the primary classification preserves the analytically distinct dimension of gender identity, while the underlying data retains full detail for secondary analyses.
:::
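
The effect of the priority rule can be checked on a few toy rows. This is a sketch under the assumption that `trans_candidate` is a logical column and `sexual_orientation` holds the Portuguese labels used in the classification code; the toy values are invented for illustration.

```r
library(dplyr)

toy <- tibble(
  trans_candidate = c(TRUE, FALSE, FALSE, FALSE),
  sexual_orientation = c("Lésbica", "Lésbica", "Pansexual", NA)
)

toy <- toy %>%
  mutate(
    lgbt_category = case_when(
      trans_candidate ~ "Trans",  # checked first, so it wins
      sexual_orientation == "Gay" ~ "Gay",
      sexual_orientation == "Lésbica" ~ "Lesbian",
      sexual_orientation %in% c("Bissexual", "Pansexual") ~ "Bisexual+",
      sexual_orientation == "Assexual" ~ "Asexual",
      TRUE ~ "Other LGBTQ+"  # includes missing orientation
    )
  )

# The original columns are untouched, so secondary analyses can still
# cross gender identity with orientation.
detail <- toy %>% count(trans_candidate, sexual_orientation)
```

The first row (a trans lesbian) is classified as "Trans", the second as "Lesbian", and the row with missing orientation falls through to "Other LGBTQ+".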
## Match Quality Statistics
The following tables summarize the results of the matching process across the identified LGBTQ+ population.
```{r match-quality-data}
lgbtq <- df %>% filter(lgbtq_candidate)
```
### Match Type Distribution
```{r tbl-match-type}
#| label: tbl-match-type
#| tbl-cap: "LGBTQ+ Candidates by Match Type"
lgbtq %>%
  count(match_type, sort = TRUE) %>%
  mutate(pct = format_pct(n / sum(n))) %>%
  rename(`Match Type` = match_type, N = n, `%` = pct) %>%
  kable(align = c("l", "r", "r"))
```
### Disclosure Source Distribution
```{r tbl-disclosure-source}
#| label: tbl-disclosure-source
#| tbl-cap: "LGBTQ+ Candidates by Disclosure Source"
lgbtq %>%
  count(disclosure_source, sort = TRUE) %>%
  mutate(pct = format_pct(n / sum(n))) %>%
  rename(`Disclosure Source` = disclosure_source, N = n, `%` = pct) %>%
  kable(align = c("l", "r", "r"))
```
### Cross-Tabulation: Match Type x Disclosure Source
```{r tbl-match-source-cross}
#| label: tbl-match-source-cross
#| tbl-cap: "Match Type by Disclosure Source"
lgbtq %>%
  count(match_type, disclosure_source) %>%
  pivot_wider(names_from = disclosure_source, values_from = n, values_fill = 0) %>%
  rename(`Match Type` = match_type) %>%
  kable(align = c("l", rep("r", ncol(.) - 1)))
```
### Match Quality Summary
```{r tbl-match-quality-stats}
#| label: tbl-match-quality-stats
#| tbl-cap: "Match Quality Score Summary (for Matched Candidates)"
lgbtq %>%
  filter(!is.na(match_quality)) %>%
  summarise(
    N = format_n(n()),
    Mean = sprintf("%.3f", mean(match_quality)),
    Median = sprintf("%.3f", median(match_quality)),
    Min = sprintf("%.3f", min(match_quality)),
    Max = sprintf("%.3f", max(match_quality))
  ) %>%
  kable(align = rep("r", 5))
```
# Campaign Finance Classification
Chapter 5 presents the results of the campaign finance analysis; this section documents the classification methodology that underpins it. The goal is to transform the raw TSE revenue file---which contains Portuguese-language category labels and complex institutional source codes---into a clean, analytically useful typology of funding sources.
## Raw TSE Categories
The TSE classifies each revenue transaction by `DS_ORIGEM_RECEITA` (revenue source). The following table maps the raw Portuguese categories to our English-language classification:
| Portuguese Category | English Translation | Our Classification |
|---|---|---|
| Recursos próprios | Own resources | `self_funding` |
| Fundo Partidário / Fundo Especial (partido político) | Party fund / Special fund | `party_funding` |
| Doações de pessoas físicas | Individual donations | `individual_funding` |
| Financiamento coletivo | Crowdfunding | `crowdfunding` |
| Doações de outros candidatos/comitês | Other candidates/committees | `other_candidates` |
| Everything else | Various | `other` |
## Self-Funding Detection
Self-funding is identified through two complementary methods, capturing cases that either source alone would miss:
1. **Revenue origin**: The `DS_ORIGEM_RECEITA` field contains "recursos pr", the accent-free prefix of *Recursos próprios* (own resources), matched case-insensitively.
2. **CPF match**: The donor's CPF (`NR_CPF_CNPJ_DOADOR`) matches the candidate's own CPF (`NR_CPF_CANDIDATO`). This catches cases where a candidate donates to their own campaign but the transaction is categorized under a different revenue origin.
A transaction satisfying either condition is classified as `self_funding`.
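
The two detection rules can be kept as separate flags, which makes it easy to see how many self-funded transactions each rule contributes. The sketch below uses toy transactions; the revenue-origin strings and CPF values are invented, and the flag names `by_origin`, `by_cpf`, and `self_funded` are hypothetical. Missing values are treated as "no evidence of self-funding", which is an assumption about the desired behaviour.

```r
library(dplyr)
library(stringr)

tx <- tibble(
  DS_ORIGEM_RECEITA = c("Recursos próprios", "Recursos de pessoas físicas",
                        "Recursos de pessoas físicas", NA),
  NR_CPF_CNPJ_DOADOR = c("111", "111", "222", "111"),
  NR_CPF_CANDIDATO   = c("111", "111", "111", NA)
)

tx <- tx %>%
  mutate(
    # Rule 1: revenue origin labelled as own resources
    by_origin = coalesce(str_detect(DS_ORIGEM_RECEITA, "(?i)recursos pr"), FALSE),
    # Rule 2: donor CPF equals the candidate's CPF
    by_cpf = coalesce(NR_CPF_CNPJ_DOADOR == NR_CPF_CANDIDATO, FALSE),
    self_funded = by_origin | by_cpf
  )
```

The second toy row is the case the CPF rule exists for: the origin label says individual donation, but the donor is the candidate.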
## Classification Code
```{r finance-classification}
#| eval: false
funding_type = case_when(
  str_detect(DS_ORIGEM_RECEITA, "(?i)recursos pr") |
    (NR_CPF_CNPJ_DOADOR == NR_CPF_CANDIDATO) ~ "self_funding",
  str_detect(DS_ORIGEM_RECEITA, "(?i)partido pol") ~ "party_funding",
  str_detect(DS_ORIGEM_RECEITA, "(?i)pessoas f") ~ "individual_funding",
  str_detect(DS_ORIGEM_RECEITA, "(?i)financiamento coletivo") ~ "crowdfunding",
  str_detect(DS_ORIGEM_RECEITA, "(?i)outros candidatos") ~ "other_candidates",
  TRUE ~ "other"
)
```
::: {.callout-note}
## Order Matters in `case_when()`
The `case_when()` function evaluates conditions in order and assigns the first matching category. Self-funding is tested first so that a candidate's own donation is always classified as self-funding, even if the TSE recorded the revenue origin under a different label. This is a deliberate design choice that prioritizes economic substance (who provided the money) over institutional labeling.
:::
## Financial vs. In-Kind Contributions
The TSE field `DS_NATUREZA_RECEITA` distinguishes between cash contributions (`"FINANCEIRO"`) and estimated or in-kind contributions (services, materials, event spaces, etc.). Our dataset preserves both the financial amount (`financial_amt`) and in-kind amount (`inkind_amt`) at the candidate level, along with their percentage shares of total revenue.
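
As an illustration of the candidate-level aggregation, the sketch below splits toy transactions into financial and in-kind amounts and computes their shares. Only `DS_NATUREZA_RECEITA`, `"FINANCEIRO"`, `financial_amt`, and `inkind_amt` come from the text above; the amount column `VR_RECEITA`, the `"ESTIMAVEL"` label, `candidate_id`, and the share column names are assumptions made for the example.

```r
library(dplyr)

tx <- tibble(
  candidate_id = c(1, 1, 1, 2),
  DS_NATUREZA_RECEITA = c("FINANCEIRO", "ESTIMAVEL", "FINANCEIRO", "ESTIMAVEL"),
  VR_RECEITA = c(600, 300, 100, 50)
)

by_candidate <- tx %>%
  mutate(is_financial = DS_NATUREZA_RECEITA == "FINANCEIRO") %>%
  group_by(candidate_id) %>%
  summarise(
    financial_amt = sum(VR_RECEITA[is_financial]),
    inkind_amt = sum(VR_RECEITA[!is_financial]),
    .groups = "drop"
  ) %>%
  mutate(
    total_amt = financial_amt + inkind_amt,
    financial_pct = 100 * financial_amt / total_amt,
    inkind_pct = 100 * inkind_amt / total_amt
  )
```

Candidate 1 ends up with 700 in cash and 300 in kind (70% and 30%), while candidate 2 has only in-kind revenue.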
## Validation
Our raw aggregation from the ~5.5 million transaction-level records was validated against a pre-existing candidate-level aggregated file produced independently by the parent project. The two aggregations show near-perfect correlation, confirming that the raw processing pipeline faithfully reproduces the intended totals.
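
A validation of this kind might look like the following sketch. Because a correlation coefficient is insensitive to a constant scaling or offset, it is paired here with the share of candidates whose totals agree exactly and with a listing of the largest gap. All values and column names (`total_raw`, `total_ref`, `abs_diff`) are hypothetical and do not reproduce the project's actual check.

```r
# Toy candidate-level totals from the two aggregations (hypothetical values)
raw_totals <- data.frame(candidate_id = 1:4, total_raw = c(1000, 250, 0, 4200))
ref_totals <- data.frame(candidate_id = 1:4, total_ref = c(1000, 250, 0, 4150))

cmp <- merge(raw_totals, ref_totals, by = "candidate_id")
cmp$abs_diff <- abs(cmp$total_raw - cmp$total_ref)

correlation  <- cor(cmp$total_raw, cmp$total_ref)
share_exact  <- mean(cmp$abs_diff < 0.01)          # agreement to the centavo
largest_gap  <- cmp[order(-cmp$abs_diff), ][1, ]   # worst case, for inspection
```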
# Municipality Crosswalk
## The TSE-IBGE Code Problem
The TSE uses its own internal municipality coding system, which differs from the IBGE's official 7-digit municipality codes used by statistical agencies, the census, and the `geobr` R package. To produce choropleth maps and spatial analyses, we need to bridge these two systems.
## Solution
The crosswalk file (`crosswalk_tse_to_geobr.csv`) resolves this mismatch through name-based matching within state. The procedure is:
1. Start with the TSE municipality code, name, and state
2. Match to IBGE municipality codes via exact name match within state (~95% of cases)
3. Apply fuzzy name matching for municipalities with accent or spelling differences
4. Manually resolve remaining cases (municipality mergers, name changes)
The resulting `geobr_code` (IBGE 7-digit code) enables spatial joins with `geobr` shapefiles via `code_muni`.
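
The exact-match step of this procedure can be sketched as follows. Names are normalized (accents stripped, upper-cased, whitespace collapsed) before joining within state, and anything left unmatched moves on to the fuzzy and manual steps. The helper `norm_name()`, the toy TSE codes, and the column names on the TSE side are illustrative assumptions; `code_muni`, `name_muni`, and `abbrev_state` follow the `geobr` municipality table.

```r
library(dplyr)
library(stringi)

# Hypothetical helper: accent-free, upper-case, single-spaced names
norm_name <- function(x) {
  x <- stri_trans_general(x, "Latin-ASCII")
  gsub("\\s+", " ", trimws(toupper(x)))
}

tse <- tibble(
  tse_code = c("71072", "60011"),  # illustrative TSE codes
  name = c("SAO PAULO", "RIO DE JANEIRO"),
  state = c("SP", "RJ")
)
ibge <- tibble(
  code_muni = c(3550308, 3304557),
  name_muni = c("São Paulo", "Rio De Janeiro"),
  abbrev_state = c("SP", "RJ")
)

crosswalk <- tse %>%
  mutate(key = norm_name(name)) %>%
  left_join(
    ibge %>%
      mutate(key = norm_name(name_muni)) %>%
      select(key, abbrev_state, geobr_code = code_muni),
    by = c("key", "state" = "abbrev_state")
  )

unmatched <- crosswalk %>% filter(is.na(geobr_code))  # fuzzy / manual steps
```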
## Match Rate
```{r muni-match-rate}
#| label: muni-match-rate
cat("Municipality match rate:",
    format_pct(mean(!is.na(df$geobr_code))), "\n")
cat("Candidates with geobr_code:",
    format_n(sum(!is.na(df$geobr_code))), "of",
    format_n(nrow(df)), "\n")
```
The small number of unmatched candidates (<1%) are primarily from overseas voting sections or reflect rare discrepancies between TSE and IBGE municipality code systems.
# Municipality Context Variables
## IBGE Census Data
Municipality-level characteristics from the IBGE 2022 Census are merged via the parent project's `municipality_characteristics.csv` intermediate file. Variables include population, GDP per capita (deflated), racial composition (% White, Black, Brown, Indigenous), poverty indicators, and state capital status. The merge key is `geobr_code` (IBGE 7-digit municipality code), matching 99.9% of candidates.
## Bolsonaro 2022 Vote Share
Local electorate conservatism is proxied by Bolsonaro's first-round vote share in the 2022 presidential election. This is computed from TSE section-level voting data (`votacao_secao_2022_BR.csv`):
1. Filter to first round (`NR_TURNO == 1`)
2. Aggregate by municipality (`CD_MUNICIPIO`)
3. Compute: `100 * (votes for candidate 22) / (total valid votes, excluding codes 95 and 96)`
The result is cached as `bolsonaro_share.rds` to avoid re-processing the large file. Coverage is 88.8% of candidates; unmatched cases are primarily overseas voters or municipalities with TSE code mismatches.
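
The three steps above can be sketched on toy section-level rows. `NR_TURNO` and `CD_MUNICIPIO` are the fields named in the steps; the candidate-number and vote-count columns are assumed here to be `NR_VOTAVEL` and `QT_VOTOS`, and the vote counts are invented.

```r
library(dplyr)

sections <- tibble(
  NR_TURNO     = c(1, 1, 1, 1, 1, 2, 1, 1),
  CD_MUNICIPIO = c(10, 10, 10, 10, 10, 10, 20, 20),
  NR_VOTAVEL   = c(22, 13, 15, 95, 96, 22, 22, 13),
  QT_VOTOS     = c(40, 50, 10, 5, 5, 60, 30, 10)
)

share_by_muni <- sections %>%
  filter(NR_TURNO == 1, !NR_VOTAVEL %in% c(95, 96)) %>%  # first round, valid votes
  group_by(CD_MUNICIPIO) %>%
  summarise(
    bolsonaro_share = 100 * sum(QT_VOTOS[NR_VOTAVEL == 22]) / sum(QT_VOTOS),
    .groups = "drop"
  )
```

In the toy data, municipality 10 has 100 valid first-round votes of which 40 went to candidate 22, giving a share of 40%; the second-round row and the rows with codes 95 and 96 do not enter the calculation.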
## Municipality Political Context
Two additional variables are derived from the candidate data itself:
- **`n_parties`**: number of distinct party abbreviations running candidates in each municipality
- **`n_seats`**: number of elected city councilors per municipality (defaulting to 9, Brazil's statutory minimum, when no elected councilors are found)
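
A minimal sketch of how these two variables could be derived from candidate-level rows is shown below. The columns `muni` and `elected`, and the intermediate `n_elected`, are hypothetical names; `party_abbrev` and the `"City Councilor"` label follow the variable reference table.

```r
library(dplyr)

cands <- tibble(
  muni = c(rep("A", 12), rep("B", 2)),
  party_abbrev = c(rep(c("PT", "PL", "MDB"), length.out = 12), "PT", "PL"),
  position_simple = "City Councilor",
  elected = c(rep(TRUE, 11), FALSE, FALSE, FALSE)
)

muni_context <- cands %>%
  group_by(muni) %>%
  summarise(
    n_parties = n_distinct(party_abbrev),
    n_elected = sum(position_simple == "City Councilor" & elected),
    .groups = "drop"
  ) %>%
  # Fall back to 9 seats when no elected councilor is recorded
  mutate(n_seats = if_else(n_elected == 0L, 9L, n_elected))
```

Municipality A has 11 elected councilors and keeps that count, while municipality B has none recorded and receives the default of 9.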
# Variable Construction Reference
The following table documents all derived variables in the analysis dataset, including the transformation logic and rationale for each.
| Variable | Formula / Logic | Rationale |
|---|---|---|
| `lgbt_category` | Trans > Sexual Orientation (via `make_lgbt_category()`) | Gender identity cross-cuts orientation; trans identity is analytically distinct |
| `education_simple` | 8 TSE codes collapsed to 3 levels: Less than HS, High School, College+ | Reduces sparse cells for meaningful group comparisons |
| `race_simple` | 6 TSE codes collapsed to 4 levels: White, Brown, Black, Other | Aligns with census conventions; preserves key distinctions |
| `age_group` | 5 bins: 18--29, 30--39, 40--49, 50--59, 60+ | Standard demographic grouping for age-stratified analysis |
| `region` | 27 states mapped to 5 macro-regions | Standard Brazilian geographic division (North, Northeast, Center-West, Southeast, South) |
| `ideology_category` | <4.0 Left; 4.0 to <7.1 Center; >=7.1 Right | Bolognesi et al. (2023) expert survey thresholds, anchored on PT, MDB, PL |
| `female` | `gender == "FEMININO"` | Binary indicator for two-group gender comparisons |
| `nonwhite` | `race != "BRANCA"` (NA-safe) | Binary indicator for two-group racial comparisons |
| `lgbtq_label` | `"LGBTQ+"` / `"Non-LGBTQ+"` factor | Clean label for plots and tables; ordered for consistent display |
| `position_simple` | TSE position codes translated to English: City Councilor, Mayor, Vice-Mayor | Readable labels for analysis and visualization |
| `populacao_2022` | IBGE 2022 Census population | Municipality-level context for geographic analyses |
| `pop_bracket` | Population cut into 5 brackets: <10K, 10K-50K, 50K-200K, 200K-500K, 500K+ | Meaningful population categories for cross-tabulation |
| `capital_uf` | Binary: 1 = state capital, 0 = otherwise | Distinguishes capitals (typically larger, more progressive) |
| `pib_per_capita_defla` | GDP per capita, deflated (IBGE) | Economic development proxy |
| `bolsonaro_share` | Bolsonaro 2022 first-round vote share (%) per municipality | Proxy for local electorate conservatism |
| `bolsonaro_quartile` | Quartile of `bolsonaro_share`: Q1 (least conservative) through Q4 | Categorical conservatism measure for cross-tabulation |
| `n_parties` | `n_distinct(party_abbrev)` per municipality | Political fragmentation / competitiveness |
| `n_seats` | Elected councilors per municipality (min 9) | Council size / institutional context |
| `pct_nonwhite_muni` | `perc_preta + perc_parda` from Census | Municipality racial composition (Black plus Brown share of residents) |
| `has_manifesto` | Logical: whether a manifesto PDF was found and text extracted | Only available for mayoral candidates; ~96.5% extraction rate |
| `manifesto_text` | Full extracted text of the candidate's manifesto (*proposta de governo*) | `NA` for non-mayors and image-only PDFs |
| `manifesto_n_pages` | Page count of the manifesto PDF (from `pdfinfo`) | Typically 6--13 pages |
| `manifesto_n_words` | Word count of extracted manifesto text | Typically 1,500--2,000 words |
| `threeway_group` | Trans / LGB / Non-LGBTQ+ (trans_candidate > lgb_candidate) | Three-way comparison in Ch06 intersectional analysis |
::: {.callout-note}
## On Ideology Thresholds
The Left/Center/Right cutpoints are not arbitrary. They are derived from natural breaks in the Bolognesi et al. (2023) expert survey distribution and anchored on three parties that serve as ideological reference points: PT (*Partido dos Trabalhadores*, score ~2.0, clearly Left), MDB (*Movimento Democrático Brasileiro*, score ~5.5, centrist catch-all), and PL (*Partido Liberal*, score ~8.5, clearly Right under Bolsonaro). Approximately 2% of candidates belong to minor parties not covered by the survey and have `NA` ideology scores.
:::
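
The thresholds can be applied with left-closed intervals, so that a score of exactly 4.0 is Center and exactly 7.1 is Right. The sketch below uses a hypothetical helper, `classify_ideology()`, and the approximate anchor scores quoted in the note above; it is not the project's own implementation.

```r
# Hypothetical helper applying the Left / Center / Right cutpoints
classify_ideology <- function(score) {
  cut(
    score,
    breaks = c(-Inf, 4.0, 7.1, Inf),
    labels = c("Left", "Center", "Right"),
    right = FALSE  # intervals are [lower, upper)
  )
}

# Approximate anchor scores from the note; NA stands for an unscored party
scores <- c(PT = 2.0, MDB = 5.5, PL = 8.5, unscored = NA)
ideology <- classify_ideology(scores)
```

Parties without a survey score stay `NA` rather than being forced into a category.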
# Reproducibility
## Prerequisites
The data pipeline depends on source files from the parent research project. If running on a machine other than the original development environment, set the `QPP_DATA_DIR` environment variable to point to the parent project directory:
```r
Sys.setenv(QPP_DATA_DIR = "/path/to/Queer_politicians_project/")
```
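
For illustration, a setup script could resolve this variable and fail early with a clear message when the directory is missing. The helper below, `resolve_data_dir()`, is hypothetical; how `code/00_setup.R` actually reads `QPP_DATA_DIR` is not shown in this chapter.

```r
# Hypothetical helper: locate the parent project directory or stop early
resolve_data_dir <- function(fallback = NULL) {
  dir <- Sys.getenv("QPP_DATA_DIR", unset = "")
  if (!nzchar(dir)) dir <- fallback  # variable not set: use fallback, if any
  if (is.null(dir) || !dir.exists(dir)) {
    stop("Parent project data not found; set QPP_DATA_DIR to its directory.")
  }
  normalizePath(dir)
}

# Usage (commented out because it requires the directory to exist):
# data_dir <- resolve_data_dir()
```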
## Execution Order
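Run the five data preparation scripts in the order shown below before rendering the Quarto site. Note that `05_process_manifestos.R` runs before `04_build_analysis_data.R`, because the build step merges the extracted manifesto text: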
```{r execution-order}
#| eval: false
# Step 1: Load and prepare candidate data
source(here::here("code", "01_load_candidates.R"))
# -> data/derived/candidates_analysis.rds
# Step 2: Process raw campaign finance data
source(here::here("code", "02_load_finance_raw.R"))
# -> data/derived/finance_by_candidate.rds
# -> data/derived/finance_transactions.rds
# -> output/tables/finance_revenue_source_categories.csv
# Step 3: Download and cache geographic boundaries
source(here::here("code", "03_prepare_geography.R"))
# -> data/geo_cache/municipalities.rds
# -> data/geo_cache/states.rds
# Step 4: Extract text from candidate manifesto PDFs (~15,800 PDFs)
source(here::here("code", "05_process_manifestos.R"))
# -> data/derived/manifestos_text.rds
# Step 5: Merge all components into analysis dataset
source(here::here("code", "04_build_analysis_data.R"))
# -> data/derived/analysis_full.rds
# -> data/derived/bolsonaro_share.rds (cached)
# Step 6: Render the Quarto website
# quarto render docs/
```
## Rendering
Once `analysis_full.rds` exists, the Quarto site can be rendered with:
```bash
quarto render docs/
```
All chapters source `code/00_setup.R` for shared paths, palettes, themes, and helper functions. The `freeze: auto` setting in `_quarto.yml` means that chapters whose source code has not changed will not be re-executed on subsequent renders.
## Data Access
The parent project data is stored on Dropbox and is not publicly available. Researchers wishing to replicate this analysis should:
1. Download raw TSE candidate registration files from [https://dadosabertos.tse.jus.br/](https://dadosabertos.tse.jus.br/)
2. Download raw TSE campaign finance files from the same portal
3. Download candidate manifesto PDFs (*propostas de governo*) from the same portal
4. Contact VOTE LGBT for their candidate identification list
5. Run the parent project's integration scripts to produce the processed files
6. Set `QPP_DATA_DIR` and run this project's scripts in the sequence given under Execution Order (01, 02, 03, 05, then 04)
7. Render with `quarto render docs/`