7. Data Pipeline & Methodology

From Raw TSE Files to Analysis-Ready Data

source(here::here("code", "00_setup.R"))
df <- readRDS(paths$analysis_full_rds)

1 Pipeline Overview

This chapter documents the complete data pipeline that transforms raw electoral records into the analysis-ready dataset used throughout Chapters 1–6. Transparency and reproducibility are core objectives: every classification decision, matching algorithm, and variable derivation is described here so that other researchers can evaluate, replicate, or extend this work.

The pipeline consists of five R scripts, executed sequentially, each reading specific inputs and producing well-defined outputs. An upstream step in the parent research project handles the initial VOTE LGBT matching; all subsequent processing occurs within this descriptives project.

1.1 Data Flow

flowchart TD
    A["TSE Candidate Registration\n~464K rows, 59 cols"] --> B["01_load_candidates.R"]
    C["VOTE LGBT List\n~1,300 candidates"] --> D["07_merge_lgbtq_list.R\n(Parent Project)"]
    D --> A
    E["TSE Finance Raw\n~5.5M transactions, 1.3GB"] --> F["02_load_finance_raw.R"]
    G["IBGE Shapefiles\ngeobr package"] --> H["03_prepare_geography.R"]
    M["Municipality Characteristics\nIBGE Census, 5,570 rows"] --> I
    N["2022 Presidential Voting\nTSE section-level data"] --> I
    O["Candidate Manifestos\n27 state ZIPs, ~15.8K PDFs"] --> P["05_process_manifestos.R"]
    B --> I["04_build_analysis_data.R"]
    F --> I
    H --> I
    P --> I
    I --> J["analysis_full.rds\n~464K rows, 112 cols"]
    J --> K["Quarto Website\nChapters 1-8"]

1.2 Script Descriptions

01_load_candidates.R reads the integrated candidate CSV from the parent project (~210 MB, ~464K rows x 59 columns). It creates derived variables—simplified demographics, age groups, regional assignments, ideology categories, and the disaggregated LGBTQ+ identity classification—and saves the result as candidates_analysis.rds.

02_load_finance_raw.R reads the raw TSE campaign finance file (receitas_candidatos_2024_BRASIL.csv, ~1.3 GB, semicolon-delimited, Latin1 encoding). It documents every revenue source category in the raw data, classifies each transaction into funding types (self, party, individual, crowdfunding, other candidates, other), aggregates to the candidate level, validates the aggregation against a pre-existing aggregated file, and saves both candidate-level summaries (finance_by_candidate.rds) and the full transaction-level data (finance_transactions.rds).

03_prepare_geography.R downloads and caches municipality and state boundary shapefiles from the IBGE via the geobr R package. These are stored locally in data/geo_cache/ so that subsequent renders do not re-download. It also reads the municipality crosswalk that bridges TSE and IBGE coding systems.
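
The download-once, read-many pattern described here can be sketched generically. This is an illustration rather than the script's actual code: `cached_rds()` and its `fetch` argument are hypothetical names, and the commented usage assumes that `geobr::read_municipality()` is the download call.

```r
# Hypothetical helper: return the cached object if the file exists;
# otherwise call fetch(), save the result under `path`, and return it.
cached_rds <- function(path, fetch) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  obj <- fetch()
  dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
  saveRDS(obj, path)
  obj
}

# Usage sketch (not run here; needs the geobr package and network access):
# munis <- cached_rds("data/geo_cache/municipalities.rds",
#                     function() geobr::read_municipality())
```

Because the cache is keyed only on the file path, deleting the file is the way to force a fresh download in this sketch.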

05_process_manifestos.R extracts text from candidate manifesto PDFs (propostas de governo), which are mandatory filings for mayoral candidates. The TSE distributes these as 27 state-level ZIP archives (~11 GB total, ~15,800 PDFs). The script processes each state sequentially: unzipping to a temporary directory, parsing the SQ_CANDIDATO (candidate ID) from each filename, extracting text with pdftotext and page counts with pdfinfo (poppler), and cleaning up. Corrupt or image-only PDFs (~3.5%) are flagged as extraction_ok = FALSE with NA text. The output is manifestos_text.rds.

04_build_analysis_data.R merges the four preceding outputs—candidates, finance, geography, and manifestos—into a single analysis-ready file (analysis_full.rds, ~464K rows, 112 columns). In addition to filling financial variables with zeros for unmatched candidates, this script:

Reads municipality characteristics from the IBGE Census (population, GDP per capita, racial composition, poverty indicators, state capital status) via the parent project’s intermediate file
Computes n_parties (number of distinct parties) and n_seats (city council seats) per municipality from the candidate data
Computes Bolsonaro 2022 vote share per municipality from TSE section-level voting data (cached to avoid re-processing the large file)
Creates derived variables: pop_bracket (5 population categories), bolsonaro_quartile (conservatism quartiles), and pct_nonwhite_muni (municipality racial composition)
Merges manifesto text from manifestos_text.rds (deduplicated to one manifesto per candidate, keeping the longest text when multiple PDFs exist). Non-mayor candidates and mayors without a manifesto get has_manifesto = FALSE.
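
The "keep the longest text per candidate" rule in the last item can be illustrated on toy data. This is a minimal base-R sketch under stated assumptions: the data frame is invented, and the real script may implement the deduplication differently.

```r
# Toy stand-in for manifestos_text.rds (values are invented).
manifestos <- data.frame(
  SQ_CANDIDATO   = c(101, 101, 102, 103),
  manifesto_text = c("short text", "a much longer manifesto text",
                     "only one file", NA),
  extraction_ok  = c(TRUE, TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Text length per row; failed extractions (NA text) count as zero.
n_chars <- ifelse(is.na(manifestos$manifesto_text), 0L,
                  nchar(manifestos$manifesto_text))

# Sort longest-first within candidate, then keep the first row per candidate.
sorted <- manifestos[order(manifestos$SQ_CANDIDATO, -n_chars), ]
one_per_candidate <- sorted[!duplicated(sorted$SQ_CANDIDATO), ]
```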

2 LGBTQ+ Identification Methodology

Identifying LGBTQ+ candidates in a population of nearly half a million candidacies is the central methodological challenge of this project. We draw on two complementary sources, then apply a three-stage matching algorithm to link the external VOTE LGBT list to official TSE records.

2.1 Two Sources of Identification

2.1.1 TSE Self-Disclosure (New in 2024)

For the first time in Brazilian electoral history, the Tribunal Superior Eleitoral (TSE) allowed candidates to voluntarily declare their sexual orientation and gender identity on the official registration form for the 2024 municipal elections. This was a landmark step in institutional recognition of LGBTQ+ political participation.

The TSE registration data provides two relevant fields:

DS_ORIENTACAO_SEXUAL — sexual orientation (e.g., Heterossexual, Gay, Lesbica, Bissexual, Pansexual, Assexual, or open text)
DS_IDENTIDADE_GENERO — gender identity (e.g., Cisgenero, Transgenero/Travesti, Nao-binario, or open text)

Candidates who reported any non-heterosexual orientation or non-cisgender identity are flagged as LGBTQ+.
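
A minimal sketch of this flagging rule follows. The two column names come from the TSE file; the value labels (in particular the "undisclosed" codes) and the output name `lgbtq_tse` are assumptions made for illustration and should be checked against the raw data.

```r
# Toy registration records; labels are assumed, not copied from the TSE file.
tse <- data.frame(
  DS_ORIENTACAO_SEXUAL = c("HETEROSSEXUAL", "GAY", "NAO DIVULGAVEL",
                           "HETEROSSEXUAL", NA),
  DS_IDENTIDADE_GENERO = c("CISGENERO", "CISGENERO", "NAO DIVULGAVEL",
                           "TRANSGENERO", "CISGENERO"),
  stringsAsFactors = FALSE
)

# Assumed labels meaning "no answer given"; these must not trigger the flag.
undisclosed <- c("NAO DIVULGAVEL", "NAO INFORMADO")
reported <- function(x) !is.na(x) & !(x %in% undisclosed)

orient <- tse$DS_ORIENTACAO_SEXUAL
ident  <- tse$DS_IDENTIDADE_GENERO

# Flag only answers that were actually reported and are
# non-heterosexual or non-cisgender.
tse$lgbtq_tse <- (reported(orient) & orient != "HETEROSSEXUAL") |
                 (reported(ident)  & ident  != "CISGENERO")
```

The point of the `reported()` guard is that a missing or withheld answer is treated as "unknown", not as a non-heterosexual or non-cisgender declaration.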

Voluntary Disclosure

Self-disclosure was entirely voluntary. Many LGBTQ+ candidates, particularly those in conservative regions or parties, may have chosen not to declare, so the TSE data alone likely undercounts the true LGBTQ+ candidate population. Table 2 bears this out: 291 identified candidates (9.3%) appear only on the VOTE LGBT list. This is why the VOTE LGBT list is an essential complement.

2.1.2 VOTE LGBT Candidate List

VOTE LGBT (Voto com Orgulho) is a Brazilian civil society project that independently identifies LGBTQ+ candidates running in elections. Their identification methods include:

Direct self-identification to the organization
Public campaigning on LGBTQ+ issues or identity
Social media analysis and community referrals
Nominations from LGBTQ+ advocacy organizations

For the 2024 municipal elections, VOTE LGBT identified approximately 1,300 candidates. Their list provides: ballot name, state, municipality, position sought, party, gender identity, sexual orientation, and (for some candidates) vote counts.

Complementary Coverage

The two sources have partially overlapping but distinct coverage. The TSE captures candidates who chose to declare on the official form regardless of public visibility, while VOTE LGBT captures publicly visible LGBTQ+ candidates regardless of whether they filed a TSE declaration. The union of both sources provides the most comprehensive identification achievable with available data.
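
Taking the union of the two sources, and recording which source identified each candidate, might look like the following sketch. The flag names `lgbtq_tse` and `lgbtq_vote` are hypothetical; the three labels match those reported in Table 2.

```r
# Toy flags for four candidates (hypothetical column names).
lgbtq_tse  <- c(TRUE, TRUE, FALSE, FALSE)
lgbtq_vote <- c(FALSE, TRUE, TRUE, FALSE)

# A candidate is LGBTQ+ if either source identifies them.
lgbtq_candidate <- lgbtq_tse | lgbtq_vote

# Record the provenance of the identification.
disclosure_source <- ifelse(lgbtq_tse & lgbtq_vote, "VOTE + TSE",
                     ifelse(lgbtq_tse, "TSE",
                     ifelse(lgbtq_vote, "VOTE", NA_character_)))
```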

2.2 Matching Algorithm (3 Stages)

Linking the VOTE LGBT candidate list to TSE registration records requires reconciling informal names, varying municipality spellings, and approximate string matches. The matching proceeds in three stages, each applied only to candidates not yet matched by a prior stage.

2.2.1 Stage 1: Exact Match (Name + State + Municipality)

The first and most restrictive stage joins on three standardized keys: ballot name, state, and municipality. Both datasets are preprocessed by removing accents, collapsing whitespace, and converting to uppercase.

# Stage 1: Exact match (name + state + municipality)
lgbt_matched_exact <- lgbt_clean %>%
  inner_join(
    candidates_std %>%
      select(candidate_id, ballot_name_std, state_std, municipality_std),
    by = c("ballot_name_lgbt" = "ballot_name_std",
           "state_lgbt" = "state_std",
           "municipality_lgbt" = "municipality_std")
  ) %>%
  mutate(match_type = "exact_full", match_quality = 1.0)

Join keys: ballot_name_lgbt = ballot_name_std, state_lgbt = state_std, municipality_lgbt = municipality_std
Quality score: 1.0 (perfect match on all three dimensions)
Expectation: Captures the majority of matches, as most VOTE LGBT entries use the official ballot name and standard municipality spelling
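
The preprocessing applied to both datasets before the Stage 1 join (accent removal, whitespace collapsing, uppercasing) could be sketched as below. `standardize_name()` is a hypothetical helper, and the explicit character map is an assumption chosen to keep the example portable; the actual scripts may use a transliteration function instead.

```r
# Hypothetical helper: strip common Portuguese accents, collapse whitespace,
# and convert to uppercase so that names compare as plain ASCII strings.
standardize_name <- function(x) {
  accented <- paste0(
    "\u00e1\u00e0\u00e2\u00e3\u00e4", "\u00e9\u00e8\u00ea\u00eb",
    "\u00ed\u00ec\u00ee\u00ef", "\u00f3\u00f2\u00f4\u00f5\u00f6",
    "\u00fa\u00f9\u00fb\u00fc", "\u00e7",
    "\u00c1\u00c0\u00c2\u00c3\u00c4", "\u00c9\u00c8\u00ca\u00cb",
    "\u00cd\u00cc\u00ce\u00cf", "\u00d3\u00d2\u00d4\u00d5\u00d6",
    "\u00da\u00d9\u00db\u00dc", "\u00c7"
  )
  plain <- paste0("aaaaa", "eeee", "iiii", "ooooo", "uuuu", "c",
                  "AAAAA", "EEEE", "IIII", "OOOOO", "UUUU", "C")
  x <- chartr(accented, plain, x)        # remove accents
  x <- gsub("\\s+", " ", trimws(x))      # trim and collapse whitespace
  toupper(x)                             # uppercase
}

standardize_name("  Jos\u00e9   da  Silva ")
```

Applying the same function to both the VOTE LGBT list and the TSE records is what makes the exact join meaningful: any asymmetry in cleaning would silently push true matches into the later, lower-confidence stages.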

2.2.2 Stage 2: Exact Match (Name + State Only)

For candidates unmatched after Stage 1, we relax the municipality requirement and match on ballot name and state alone.

# Stage 2: Exact match (name + state, relaxing municipality)
lgbt_unmatched_s1 <- lgbt_clean %>%
  anti_join(lgbt_matched_exact, by = "row_id")

lgbt_matched_state <- lgbt_unmatched_s1 %>%
  inner_join(
    candidates_std %>%
      select(candidate_id, ballot_name_std, state_std),
    by = c("ballot_name_lgbt" = "ballot_name_std",
           "state_lgbt" = "state_std")
  ) %>%
  mutate(match_type = "exact_state", match_quality = 0.9)

Rationale: The VOTE LGBT list sometimes records informal municipality names, abbreviated forms, or the name of a metropolitan area rather than the specific municipality where the candidate is registered. Matching on name + state resolves these cases while still requiring an exact name match as a safeguard.
Quality score: 0.9

2.2.3 Stage 3: Fuzzy Matching (Jaro-Winkler Distance)

For remaining unmatched candidates, we apply fuzzy string matching within the same state using the Jaro-Winkler distance metric, which is well-suited for name matching because it gives extra weight to matching prefixes—a useful property when ballot names share common first elements.

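# Stage 3: Fuzzy match using Jaro-Winkler distance within state
library(dplyr)
library(stringdist)

# Candidates still unmatched after Stages 1 and 2
# (assumed definition, mirroring lgbt_unmatched_s1 in Stage 2)
lgbt_unmatched_s2 <- lgbt_unmatched_s1 %>%
  anti_join(lgbt_matched_state, by = "row_id")

fuzzy_matches <- vector("list", nrow(lgbt_unmatched_s2))

for (i in seq_len(nrow(lgbt_unmatched_s2))) {
  # Restrict the search to TSE candidates in the same state
  state_candidates <- candidates_std %>%
    filter(state_std == lgbt_unmatched_s2$state_lgbt[i])

  # Note: with stringdist's default p = 0, method = "jw" returns the plain
  # Jaro distance; pass p > 0 (e.g. p = 0.1) to apply the Winkler prefix bonus.
  distances <- stringdist(
    lgbt_unmatched_s2$ballot_name_lgbt[i],
    state_candidates$ballot_name_std,
    method = "jw"
  )

  best_idx <- which.min(distances)   # integer(0) if the state has no candidates
  min_dist <- distances[best_idx]

  # Accept if distance < 0.15 (high similarity); quality = 1 - distance
  if (length(best_idx) == 1 && min_dist < 0.15) {
    fuzzy_matches[[i]] <- tibble(
      row_id        = lgbt_unmatched_s2$row_id[i],
      candidate_id  = state_candidates$candidate_id[best_idx],
      match_type    = "fuzzy_jw",
      match_quality = 1 - min_dist
    )
  }
}

# One row per accepted fuzzy match
lgbt_matched_fuzzy <- bind_rows(fuzzy_matches)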

Threshold: Matches are accepted only when the Jaro-Winkler distance is less than 0.15, that is, when the similarity is above 0.85. This conservative threshold minimizes false positives.
Quality score: 1 - distance (above 0.85 and below 1.0, since identical names are already resolved in the exact-match stages)
Manual review: All fuzzy matches were manually inspected. State and party information served as additional disambiguation criteria in ambiguous cases.

Why Jaro-Winkler?

The Jaro-Winkler distance is preferred over alternatives (Levenshtein, cosine similarity) for person-name matching because it (1) is normalized to [0, 1], (2) penalizes transpositions less harshly than insertions/deletions, and (3) applies a prefix bonus that rewards names sharing the same initial characters—common in Brazilian ballot names that begin with a first name or nickname.
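
To make the prefix bonus concrete, here is a self-contained base-R sketch of the textbook Jaro-Winkler similarity. It is for illustration only: the pipeline itself relies on `stringdist()`, and the function name `jaro_winkler_sim()` and the scaling factor `p = 0.1` (the conventional value) are assumptions, not taken from the project code.

```r
# Textbook Jaro-Winkler similarity between two strings (illustrative only).
jaro_winkler_sim <- function(a, b, p = 0.1) {
  s1 <- strsplit(a, "")[[1]]
  s2 <- strsplit(b, "")[[1]]
  n1 <- length(s1)
  n2 <- length(s2)
  if (n1 == 0 || n2 == 0) return(0)

  # Characters match only if they sit within this window of each other.
  window <- max(floor(max(n1, n2) / 2) - 1, 0)
  hit1 <- logical(n1)
  hit2 <- logical(n2)
  for (i in seq_len(n1)) {
    lo <- max(1, i - window)
    hi <- min(n2, i + window)
    if (lo > hi) next
    for (j in lo:hi) {
      if (!hit2[j] && s1[i] == s2[j]) {
        hit1[i] <- TRUE
        hit2[j] <- TRUE
        break
      }
    }
  }

  m <- sum(hit1)
  if (m == 0) return(0)

  # Transpositions: matched characters that appear in a different order.
  t <- sum(s1[hit1] != s2[hit2]) / 2
  jaro <- (m / n1 + m / n2 + (m - t) / m) / 3

  # Winkler bonus: reward a shared prefix of up to four characters.
  k <- min(4, n1, n2)
  prefix_len <- sum(cumprod(s1[seq_len(k)] == s2[seq_len(k)]))
  jaro + prefix_len * p * (1 - jaro)
}

sim <- jaro_winkler_sim("MARTHA", "MARHTA")
dist <- 1 - sim
```

For the classic pair MARTHA/MARHTA the Jaro similarity is 17/18 (about 0.944); the three-character shared prefix lifts it to about 0.961, a distance of roughly 0.039, well inside the 0.15 acceptance threshold.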

2.3 Identity Categorization

Once a candidate is identified as LGBTQ+ (from either source), they are assigned to a disaggregated identity category. The key design decision is that trans identity is prioritized over sexual orientation, because gender identity is a distinct dimension that cross-cuts orientation. A trans lesbian, for example, is categorized as “Trans” rather than “Lesbian” in our primary classification.

lgbt_category = case_when(
  trans_candidate ~ "Trans",
  sexual_orientation == "Gay" ~ "Gay",
  sexual_orientation == "Lésbica" ~ "Lesbian",
  sexual_orientation %in% c("Bissexual", "Pansexual") ~ "Bisexual+",
  sexual_orientation == "Assexual" ~ "Asexual",
  TRUE ~ "Other LGBTQ+"
)

The resulting categories are:

Category	Definition
Trans	Any candidate with a transgender or travesti gender identity, regardless of sexual orientation
Gay	Cisgender male candidates with gay sexual orientation
Lesbian	Cisgender female candidates with lesbian (Lesbica) sexual orientation
Bisexual+	Candidates with bisexual or pansexual orientation (collapsed because both describe attraction to more than one gender)
Asexual	Candidates with asexual orientation
Other LGBTQ+	Identified LGBTQ+ candidates whose specific identity does not map to the above categories

Why Prioritize Trans?

This follows conventions in both the LGBTQ+ studies literature and Brazilian activist nomenclature. Transgender identity is a gender identity, not a sexual orientation. A trans person may simultaneously be gay, lesbian, bisexual, or asexual. Collapsing these into a single “Trans” category for the primary classification preserves the analytically distinct dimension of gender identity, while the underlying data retains full detail for secondary analyses.

2.4 Match Quality Statistics

The following tables summarize the results of the matching process across the identified LGBTQ+ population.

lgbtq <- df %>% filter(lgbtq_candidate)

2.4.1 Match Type Distribution

lgbtq %>%
  count(match_type, sort = TRUE) %>%
  mutate(pct = format_pct(n / sum(n))) %>%
  rename(`Match Type` = match_type, N = n, `%` = pct) %>%
  kable(align = c("l", "r", "r"))

Table 1: LGBTQ+ Candidates by Match Type

Match Type	N	%
exact_full	3132	99.9%
fuzzy	2	0.1%

2.4.2 Disclosure Source Distribution

lgbtq %>%
  count(disclosure_source, sort = TRUE) %>%
  mutate(pct = format_pct(n / sum(n))) %>%
  rename(`Disclosure Source` = disclosure_source, N = n, `%` = pct) %>%
  kable(align = c("l", "r", "r"))

Table 2: LGBTQ+ Candidates by Disclosure Source

Disclosure Source	N	%
TSE	2215	70.7%
VOTE + TSE	628	20.0%
VOTE	291	9.3%

2.4.3 Cross-Tabulation: Match Type x Disclosure Source

lgbtq %>%
  count(match_type, disclosure_source) %>%
  pivot_wider(names_from = disclosure_source, values_from = n, values_fill = 0) %>%
  rename(`Match Type` = match_type) %>%
  kable(align = c("l", rep("r", ncol(.) - 1)))

Table 3: Match Type by Disclosure Source

Match Type	TSE	VOTE	VOTE + TSE
exact_full	2215	289	628
fuzzy	0	2	0

2.4.4 Match Quality Summary

lgbtq %>%
  filter(!is.na(match_quality)) %>%
  summarise(
    N = format_n(n()),
    Mean = sprintf("%.3f", mean(match_quality)),
    Median = sprintf("%.3f", median(match_quality)),
    Min = sprintf("%.3f", min(match_quality)),
    Max = sprintf("%.3f", max(match_quality))
  ) %>%
  kable(align = rep("r", 5))

Table 4: Match Quality Score Summary (for Matched Candidates)

N	Mean	Median	Min	Max
3,134	1.000	1.000	0.974	1.000

3 Campaign Finance Classification

Chapter 5 presents the results of the campaign finance analysis; this section documents the classification methodology that underpins it. The goal is to transform the raw TSE revenue file—which contains Portuguese-language category labels and complex institutional source codes—into a clean, analytically useful typology of funding sources.

3.1 Raw TSE Categories

The TSE classifies each revenue transaction by DS_ORIGEM_RECEITA (revenue source). The following table maps the raw Portuguese categories to our English-language classification:

Portuguese Category	English Translation	Our Classification
Recursos proprios	Own resources	`self_funding`
Fundo Partidario / Fundo Especial (partido politico)	Party fund / Special fund	`party_funding`
Doacoes de pessoas fisicas	Individual donations	`individual_funding`
Financiamento coletivo	Crowdfunding	`crowdfunding`
Doacoes de outros candidatos/comites	Other candidates/committees	`other_candidates`
Everything else	Various	`other`

3.2 Self-Funding Detection

Self-funding is identified through two complementary methods, capturing cases that either method alone would miss:

Revenue origin: The DS_ORIGEM_RECEITA field contains “recursos pr” (own resources), matched case-insensitively.
CPF match: The donor’s CPF (NR_CPF_CNPJ_DOADOR) matches the candidate’s own CPF (NR_CPF_CANDIDATO). This catches cases where a candidate donates to their own campaign but the transaction is categorized under a different revenue origin.

A transaction satisfying either condition is classified as self_funding.

3.3 Classification Code

funding_type = case_when(
  str_detect(DS_ORIGEM_RECEITA, "(?i)recursos pr") |
    (NR_CPF_CNPJ_DOADOR == NR_CPF_CANDIDATO) ~ "self_funding",
  str_detect(DS_ORIGEM_RECEITA, "(?i)partido pol") ~ "party_funding",
  str_detect(DS_ORIGEM_RECEITA, "(?i)pessoas f") ~ "individual_funding",
  str_detect(DS_ORIGEM_RECEITA, "(?i)financiamento coletivo") ~ "crowdfunding",
  str_detect(DS_ORIGEM_RECEITA, "(?i)outros candidatos") ~ "other_candidates",
  TRUE ~ "other"
)

Order Matters in case_when()

The case_when() function evaluates conditions in order and assigns the first matching category. Self-funding is tested first so that a candidate’s own donation is always classified as self-funding, even if the TSE recorded the revenue origin under a different label. This is a deliberate design choice that prioritizes economic substance (who provided the money) over institutional labeling.
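
The effect of this ordering can be checked on a toy example. The base-R sketch below mirrors the rules of the classification code without `dplyr`; `classify_funding()` is a hypothetical helper written for this illustration, and the sample transactions are invented.

```r
# Hypothetical base-R equivalent of the case_when() rules above.
classify_funding <- function(origin, donor_id, candidate_id) {
  has <- function(pattern) grepl(pattern, origin, ignore.case = TRUE)

  # NA-safe comparison: a missing ID never counts as a self-donation.
  own_donor <- !is.na(donor_id) & !is.na(candidate_id) &
    donor_id == candidate_id

  out <- rep("other", length(origin))
  # Assign in reverse priority so higher-priority rules overwrite lower ones.
  out[has("outros candidatos")]        <- "other_candidates"
  out[has("financiamento coletivo")]   <- "crowdfunding"
  out[has("pessoas f")]                <- "individual_funding"
  out[has("partido pol")]              <- "party_funding"
  out[has("recursos pr") | own_donor]  <- "self_funding"
  out
}

# Same revenue label, different donors: only the first is a self-donation.
classify_funding(
  origin       = c("Recursos de pessoas fisicas", "Recursos de pessoas fisicas"),
  donor_id     = c("111", "222"),
  candidate_id = c("111", "999")
)
```

The first transaction carries an "individual" label but comes from the candidate's own CPF, so it lands in `self_funding`; the second, with an unrelated donor, stays in `individual_funding`.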

3.4 Financial vs. In-Kind Contributions

The TSE field DS_NATUREZA_RECEITA distinguishes between cash contributions ("FINANCEIRO") and estimated or in-kind contributions (services, materials, event spaces, etc.). Our dataset preserves both the financial amount (financial_amt) and in-kind amount (inkind_amt) at the candidate level, along with their percentage shares of total revenue.
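
A candidate-level split into financial and in-kind amounts could be computed along these lines. This is a sketch on invented data: `VR_RECEITA` as the amount column, the `"ESTIMAVEL"` label, and the share names `pct_financial` / `pct_inkind` are assumptions; only `"FINANCEIRO"`, `financial_amt`, and `inkind_amt` come from the text above.

```r
# Toy transactions (invented values).
tx <- data.frame(
  SQ_CANDIDATO        = c(1, 1, 2, 2, 3),
  DS_NATUREZA_RECEITA = c("FINANCEIRO", "ESTIMAVEL", "FINANCEIRO",
                          "FINANCEIRO", "ESTIMAVEL"),
  VR_RECEITA          = c(100, 50, 200, 300, 80),
  stringsAsFactors = FALSE
)

# Anything not marked FINANCEIRO is treated as in-kind in this sketch.
is_financial <- tx$DS_NATUREZA_RECEITA == "FINANCEIRO"

financial_amt <- tapply(ifelse(is_financial, tx$VR_RECEITA, 0),
                        tx$SQ_CANDIDATO, sum)
inkind_amt    <- tapply(ifelse(is_financial, 0, tx$VR_RECEITA),
                        tx$SQ_CANDIDATO, sum)

finance_by_candidate <- data.frame(
  SQ_CANDIDATO  = names(financial_amt),
  financial_amt = as.numeric(financial_amt),
  inkind_amt    = as.numeric(inkind_amt)
)

# Percentage shares of total revenue; undefined when a candidate has none.
total <- finance_by_candidate$financial_amt + finance_by_candidate$inkind_amt
finance_by_candidate$pct_financial <-
  ifelse(total > 0, 100 * finance_by_candidate$financial_amt / total, NA_real_)
finance_by_candidate$pct_inkind <-
  ifelse(total > 0, 100 * finance_by_candidate$inkind_amt / total, NA_real_)
```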

3.5 Validation

Our raw aggregation from the ~5.5 million transaction-level records was validated against a pre-existing candidate-level aggregated file produced independently by the parent project. The two aggregations show near-perfect correlation, confirming that the raw processing pipeline faithfully reproduces the intended totals.
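
A validation of this kind can be sketched as a merge of the two candidate-level totals followed by a few summary checks. The data and the column names `total_raw` and `total_ref` are invented for illustration; the actual comparison in `02_load_finance_raw.R` may report different diagnostics.

```r
# Toy totals from the raw aggregation and from the reference file.
raw_totals <- data.frame(SQ_CANDIDATO = c(1, 2, 3, 4),
                         total_raw    = c(150, 500, 80, 0))
reference  <- data.frame(SQ_CANDIDATO = c(1, 2, 3, 5),
                         total_ref    = c(150, 500, 80.5, 20))

# Full outer join so that candidates missing from either side stay visible.
both <- merge(raw_totals, reference, by = "SQ_CANDIDATO", all = TRUE)
in_both <- !is.na(both$total_raw) & !is.na(both$total_ref)

validation <- list(
  n_only_raw   = sum(is.na(both$total_ref)),
  n_only_ref   = sum(is.na(both$total_raw)),
  correlation  = cor(both$total_raw[in_both], both$total_ref[in_both]),
  max_abs_diff = max(abs(both$total_raw[in_both] - both$total_ref[in_both]))
)
```

Reporting the largest absolute difference alongside the correlation is useful because a correlation near 1 can still hide a handful of large candidate-level discrepancies.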

4 Municipality Crosswalk

4.1 The TSE-IBGE Code Problem

The TSE uses its own internal municipality coding system, which differs from the IBGE’s official 7-digit municipality codes used by statistical agencies, the census, and the geobr R package. To produce choropleth maps and spatial analyses, we need to bridge these two systems.

4.2 Solution

The crosswalk file (crosswalk_tse_to_geobr.csv) resolves this mismatch through name-based matching within state. The procedure is:

Start with the TSE municipality code, name, and state
Match to IBGE municipality codes via exact name match within state (~95% of cases)
Apply fuzzy name matching for municipalities with accent or spelling differences
Manually resolve remaining cases (municipality mergers, name changes)

The resulting geobr_code (IBGE 7-digit code) enables spatial joins with geobr shapefiles via code_muni.
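
The first step of the procedure, exact name matching within state, can be sketched as a keyed merge that leaves unmatched municipalities visible for the later fuzzy and manual steps. All codes below are placeholders, and the column names `uf` and `name_std` are assumptions; only `geobr_code` is taken from the crosswalk described above.

```r
# Toy inputs with standardized names (codes are placeholders, not real codes).
tse_munis <- data.frame(
  tse_code = c("T001", "T002", "T003"),
  uf       = c("SP", "SP", "BA"),
  name_std = c("SAO PAULO", "MOJI MIRIM", "SALVADOR"),
  stringsAsFactors = FALSE
)
ibge_munis <- data.frame(
  geobr_code = c(9000001, 9000002, 9000003),
  uf         = c("SP", "SP", "BA"),
  name_std   = c("SAO PAULO", "MOGI MIRIM", "SALVADOR"),
  stringsAsFactors = FALSE
)

# Exact match on state + standardized name; keep every TSE municipality.
crosswalk <- merge(tse_munis, ibge_munis,
                   by = c("uf", "name_std"), all.x = TRUE)

# Rows still lacking a code go on to fuzzy matching and manual review.
needs_review <- crosswalk[is.na(crosswalk$geobr_code), ]
exact_rate   <- mean(!is.na(crosswalk$geobr_code))
```

Here the spelling variant "MOJI MIRIM" fails the exact join and is passed on, which is the kind of case the fuzzy and manual steps are meant to resolve.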

4.3 Match Rate

cat("Municipality match rate:",
    format_pct(mean(!is.na(df$geobr_code))), "\n")

Municipality match rate: 99.9%

cat("Candidates with geobr_code:",
    format_n(sum(!is.na(df$geobr_code))), "of",
    format_n(nrow(df)), "\n")

Candidates with geobr_code: 463,338 of 463,601

The 263 unmatched candidates (about 0.06% of the total) are primarily from overseas voting sections or reflect rare discrepancies between the TSE and IBGE municipality code systems.

5 Municipality Context Variables

5.1 IBGE Census Data

Municipality-level characteristics from the IBGE 2022 Census are merged via the parent project’s municipality_characteristics.csv intermediate file. Variables include population, GDP per capita (deflated), racial composition (% White, Black, Brown, Indigenous), poverty indicators, and state capital status. The merge key is geobr_code (IBGE 7-digit municipality code), matching 99.9% of candidates.

5.3 Municipality Political Context

Two additional variables are derived from the candidate data itself:

n_parties: number of distinct party abbreviations running candidates in each municipality
n_seats: number of elected city councilors per municipality (defaulting to 9, Brazil’s statutory minimum, when no elected councilors are found)
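
Both variables can be derived with grouped counts over the candidate table. The sketch below uses invented data; `party_abbrev` and the position label follow the variable reference table, while `muni` and `elected` are hypothetical column names.

```r
# Toy candidate records for two municipalities (invented).
cands <- data.frame(
  muni         = c("A", "A", "A", "A", "B", "B"),
  party_abbrev = c("PT", "PL", "PT", "MDB", "PL", "PL"),
  position     = c("City Councilor", "City Councilor", "Mayor",
                   "City Councilor", "City Councilor", "Mayor"),
  elected      = c(TRUE, TRUE, TRUE, FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Distinct parties fielding candidates in each municipality.
n_parties <- tapply(cands$party_abbrev, cands$muni,
                    function(x) length(unique(x)))

# Elected councilors per municipality; fall back to 9 when none are found.
is_elected_councilor <- cands$position == "City Councilor" & cands$elected
n_elected <- tapply(is_elected_councilor, cands$muni, sum)
n_seats   <- ifelse(n_elected == 0, 9L, n_elected)

muni_context <- data.frame(
  muni      = names(n_parties),
  n_parties = as.integer(n_parties),
  n_seats   = as.integer(n_seats)
)
```

Municipality B has no elected councilor in the toy data, so it receives the fallback value of 9.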

6 Variable Construction Reference

The following table documents all derived variables in the analysis dataset, including the transformation logic and rationale for each.

Variable	Formula / Logic	Rationale
`lgbt_category`	Trans > Sexual Orientation (via `make_lgbt_category()`)	Gender identity cross-cuts orientation; trans identity is analytically distinct
`education_simple`	8 TSE codes collapsed to 3 levels: Less than HS, High School, College+	Reduces sparse cells for meaningful group comparisons
`race_simple`	6 TSE codes collapsed to 4 levels: White, Brown, Black, Other	Aligns with census conventions; preserves key distinctions
`age_group`	5 bins: 18–29, 30–39, 40–49, 50–59, 60+	Standard demographic grouping for age-stratified analysis
`region`	27 states mapped to 5 macro-regions	Standard Brazilian geographic division (North, Northeast, Center-West, Southeast, South)
`ideology_category`	<4.0 Left; 4.0 to <7.1 Center; >=7.1 Right	Bolognesi et al. (2023) expert survey thresholds, anchored on PT, MDB, PL
`female`	`gender == "FEMININO"`	Binary indicator for two-group gender comparisons
`nonwhite`	`race != "BRANCA"` (NA-safe)	Binary indicator for two-group racial comparisons
`lgbtq_label`	`"LGBTQ+"` / `"Non-LGBTQ+"` factor	Clean label for plots and tables; ordered for consistent display
`position_simple`	TSE position codes translated to English: City Councilor, Mayor, Vice-Mayor	Readable labels for analysis and visualization
`populacao_2022`	IBGE 2022 Census population	Municipality-level context for geographic analyses
`pop_bracket`	Population cut into 5 brackets: <10K, 10K-50K, 50K-200K, 200K-500K, 500K+	Meaningful population categories for cross-tabulation
`capital_uf`	Binary: 1 = state capital, 0 = otherwise	Distinguishes capitals (typically larger, more progressive)
`pib_per_capita_defla`	GDP per capita, deflated (IBGE)	Economic development proxy
`bolsonaro_share`	Bolsonaro 2022 first-round vote share (%) per municipality	Proxy for local electorate conservatism
`bolsonaro_quartile`	Quartile of `bolsonaro_share`: Q1 (least conservative) through Q4	Categorical conservatism measure for cross-tabulation
`n_parties`	`n_distinct(party_abbrev)` per municipality	Political fragmentation / competitiveness
`n_seats`	Elected councilors per municipality (min 9)	Council size / institutional context
`pct_nonwhite_muni`	`perc_preta + perc_parda` from Census	Municipality racial composition
`has_manifesto`	Logical: whether a manifesto PDF was found and text extracted	Only available for mayoral candidates; ~96.5% extraction rate
`manifesto_text`	Full extracted text of the candidate’s manifesto (proposta de governo)	`NA` for non-mayors and image-only PDFs
`manifesto_n_pages`	Page count of the manifesto PDF (from `pdfinfo`)	Typically 6–13 pages
`manifesto_n_words`	Word count of extracted manifesto text	Typically 1,500–2,000 words
`threeway_group`	Trans / LGB / Non-LGBTQ+ (trans_candidate > lgb_candidate)	Three-way comparison in Ch06 intersectional analysis
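
Two of the binned variables in this table, `pop_bracket` and `bolsonaro_quartile`, can be sketched with `cut()`. The data are invented, and two details are assumptions not stated in the table: that a municipality with exactly 10,000 inhabitants falls in the 10K-50K bracket, and that quartile breaks are computed over municipalities rather than candidates.

```r
# Toy municipality-level data (invented values).
munis <- data.frame(
  populacao_2022  = c(4500, 10000, 75000, 350000, 1200000),
  bolsonaro_share = c(22.5, 38.0, 47.3, 55.1, 63.8)
)

# Five population brackets; intervals are closed on the left (assumption).
munis$pop_bracket <- cut(
  munis$populacao_2022,
  breaks = c(0, 10e3, 50e3, 200e3, 500e3, Inf),
  labels = c("<10K", "10K-50K", "50K-200K", "200K-500K", "500K+"),
  right  = FALSE
)

# Quartiles of Bolsonaro vote share: Q1 = least conservative.
q <- quantile(munis$bolsonaro_share, probs = seq(0, 1, 0.25), na.rm = TRUE)
munis$bolsonaro_quartile <- cut(
  munis$bolsonaro_share,
  breaks = q,
  labels = c("Q1", "Q2", "Q3", "Q4"),
  include.lowest = TRUE
)
```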

On Ideology Thresholds

The Left/Center/Right cutpoints are not arbitrary. They are derived from natural breaks in the Bolognesi et al. (2023) expert survey distribution and anchored on three parties that serve as ideological reference points: PT (Partido dos Trabalhadores, score ~2.0, clearly Left), MDB (Movimento Democratico Brasileiro, score ~5.5, centrist catch-all), and PL (Partido Liberal, score ~8.5, clearly Right under Bolsonaro). Approximately 2% of candidates belong to minor parties not covered by the survey and have NA ideology scores.
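
Applied to party scores, the cutpoints amount to a left-closed binning, sketched here with `cut()`. The three party scores are the approximate values quoted above; the two boundary rows and the `NA` row are added only to show how edge cases behave.

```r
# Approximate anchor scores from the text, plus boundary and missing cases.
parties <- data.frame(
  party = c("PT", "MDB", "PL", "edge_4.0", "edge_7.1", "uncovered"),
  score = c(2.0, 5.5, 8.5, 4.0, 7.1, NA)
)

# right = FALSE gives [-Inf, 4.0), [4.0, 7.1), [7.1, Inf):
# a score of exactly 4.0 is Center and exactly 7.1 is Right.
parties$ideology_category <- cut(
  parties$score,
  breaks = c(-Inf, 4.0, 7.1, Inf),
  labels = c("Left", "Center", "Right"),
  right  = FALSE
)
```

Parties without a survey score keep `NA` as their category, which matches the roughly 2% of candidates noted above.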

7 Reproducibility

7.1 Prerequisites

The data pipeline depends on source files from the parent research project. If running on a machine other than the original development environment, set the QPP_DATA_DIR environment variable to point to the parent project directory:

Sys.setenv(QPP_DATA_DIR = "/path/to/Queer_politicians_project/")
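
How the setup code consumes this variable is not shown in this chapter. One plausible pattern, given purely as an assumption about what `00_setup.R` might do, is to read the variable and fail early with a clear message when it is missing; `resolve_parent_dir()` is a hypothetical name.

```r
# Hypothetical helper: read QPP_DATA_DIR and stop early if it is unset.
resolve_parent_dir <- function() {
  dir <- Sys.getenv("QPP_DATA_DIR", unset = "")
  if (!nzchar(dir)) {
    stop("QPP_DATA_DIR is not set; point it at the parent project directory.")
  }
  dir
}
```

Failing at setup time is preferable to a later "file not found" error deep inside one of the loading scripts.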

7.2 Execution Order

Run the five data preparation scripts in sequence before rendering the Quarto site:

# Step 1: Load and prepare candidate data
source(here::here("code", "01_load_candidates.R"))
#   -> data/derived/candidates_analysis.rds

# Step 2: Process raw campaign finance data
source(here::here("code", "02_load_finance_raw.R"))
#   -> data/derived/finance_by_candidate.rds
#   -> data/derived/finance_transactions.rds
#   -> output/tables/finance_revenue_source_categories.csv

# Step 3: Download and cache geographic boundaries
source(here::here("code", "03_prepare_geography.R"))
#   -> data/geo_cache/municipalities.rds
#   -> data/geo_cache/states.rds

# Step 4: Extract text from candidate manifesto PDFs (~15,800 PDFs)
source(here::here("code", "05_process_manifestos.R"))
#   -> data/derived/manifestos_text.rds

# Step 5: Merge all components into analysis dataset
source(here::here("code", "04_build_analysis_data.R"))
#   -> data/derived/analysis_full.rds
#   -> data/derived/bolsonaro_share.rds  (cached)

# Step 6: Render the Quarto website
# quarto render docs/

7.3 Rendering

Once analysis_full.rds exists, the Quarto site can be rendered with:

quarto render docs/

All chapters source code/00_setup.R for shared paths, palettes, themes, and helper functions. The freeze: auto setting in _quarto.yml means that chapters whose source code has not changed will not be re-executed on subsequent renders.

7.4 Data Access

The parent project data is stored on Dropbox and is not publicly available. Researchers wishing to replicate this analysis should:

Download raw TSE candidate registration files from https://dadosabertos.tse.jus.br/
Download raw TSE campaign finance files from the same portal
Download candidate manifesto PDFs (propostas de governo) from the same portal
Contact VOTE LGBT for their candidate identification list
Run the parent project’s integration scripts to produce the processed files
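Set QPP_DATA_DIR and run this project’s five scripts in the sequence given under Execution Order (01, 02, 03, 05, 04; the manifesto script 05 must finish before 04 builds the merged dataset)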
Render with quarto render docs/

--- title: "7. Data Pipeline & Methodology" subtitle: "From Raw TSE Files to Analysis-Ready Data" --- ```{r setup} source(here::here("code", "00_setup.R")) df <- readRDS(paths$analysis_full_rds) ``` # Pipeline Overview This chapter documents the complete data pipeline that transforms raw electoral records into the analysis-ready dataset used throughout Chapters 1--6. Transparency and reproducibility are core objectives: every classification decision, matching algorithm, and variable derivation is described here so that other researchers can evaluate, replicate, or extend this work. The pipeline consists of five R scripts, executed sequentially, each reading specific inputs and producing well-defined outputs. An upstream step in the parent research project handles the initial VOTE LGBT matching; all subsequent processing occurs within this descriptives project. ## Data Flow ```{mermaid} flowchart TD A["TSE Candidate Registration\n~464K rows, 59 cols"] --> B["01_load_candidates.R"] C["VOTE LGBT List\n~1,300 candidates"] --> D["07_merge_lgbtq_list.R\n(Parent Project)"] D --> A E["TSE Finance Raw\n~5.5M transactions, 1.3GB"] --> F["02_load_finance_raw.R"] G["IBGE Shapefiles\ngeobr package"] --> H["03_prepare_geography.R"] M["Municipality Characteristics\nIBGE Census, 5,570 rows"] --> I N["2022 Presidential Voting\nTSE section-level data"] --> I O["Candidate Manifestos\n27 state ZIPs, ~15.8K PDFs"] --> P["05_process_manifestos.R"] B --> I["04_build_analysis_data.R"] F --> I H --> I P --> I I --> J["analysis_full.rds\n~464K rows, 112 cols"] J --> K["Quarto Website\nChapters 1-8"] ``` ## Script Descriptions **`01_load_candidates.R`** reads the integrated candidate CSV from the parent project (~210 MB, ~464K rows x 59 columns). It creates derived variables---simplified demographics, age groups, regional assignments, ideology categories, and the disaggregated LGBTQ+ identity classification---and saves the result as `candidates_analysis.rds`. 
**`02_load_finance_raw.R`** reads the raw TSE campaign finance file (`receitas_candidatos_2024_BRASIL.csv`, ~1.3 GB, semicolon-delimited, Latin1 encoding). It documents every revenue source category in the raw data, classifies each transaction into funding types (self, party, individual, crowdfunding, other candidates, other), aggregates to the candidate level, validates the aggregation against a pre-existing aggregated file, and saves both candidate-level summaries (`finance_by_candidate.rds`) and the full transaction-level data (`finance_transactions.rds`). **`03_prepare_geography.R`** downloads and caches municipality and state boundary shapefiles from the IBGE via the `geobr` R package. These are stored locally in `data/geo_cache/` so that subsequent renders do not re-download. It also reads the municipality crosswalk that bridges TSE and IBGE coding systems. **`05_process_manifestos.R`** extracts text from candidate manifesto PDFs (*propostas de governo*), which are mandatory filings for mayoral candidates. The TSE distributes these as 27 state-level ZIP archives (~11 GB total, ~15,800 PDFs). The script processes each state sequentially: unzipping to a temporary directory, parsing the `SQ_CANDIDATO` (candidate ID) from each filename, extracting text with `pdftotext` and page counts with `pdfinfo` (poppler), and cleaning up. Corrupt or image-only PDFs (~3.5%) are flagged as `extraction_ok = FALSE` with `NA` text. The output is `manifestos_text.rds`. **`04_build_analysis_data.R`** merges the four preceding outputs---candidates, finance, geography, and manifestos---into a single analysis-ready file (`analysis_full.rds`, ~464K rows, 112 columns). 
In addition to filling financial variables with zeros for unmatched candidates, this script: - Reads **municipality characteristics** from the IBGE Census (population, GDP per capita, racial composition, poverty indicators, state capital status) via the parent project's intermediate file - Computes **n_parties** (number of distinct parties) and **n_seats** (city council seats) per municipality from the candidate data - Computes **Bolsonaro 2022 vote share** per municipality from TSE section-level voting data (cached to avoid re-processing the large file) - Creates derived variables: `pop_bracket` (5 population categories), `bolsonaro_quartile` (conservatism quartiles), and `pct_nonwhite_muni` (municipality racial composition) - Merges **manifesto text** from `manifestos_text.rds` (deduplicated to one manifesto per candidate, keeping the longest text when multiple PDFs exist). Non-mayor candidates and mayors without a manifesto get `has_manifesto = FALSE`. # LGBTQ+ Identification Methodology Identifying LGBTQ+ candidates in a population of nearly half a million candidacies is the central methodological challenge of this project. We draw on two complementary sources, then apply a three-stage matching algorithm to link the external VOTE LGBT list to official TSE records. ## Two Sources of Identification ### TSE Self-Disclosure (New in 2024) For the first time in Brazilian electoral history, the *Tribunal Superior Eleitoral* (TSE) allowed candidates to voluntarily declare their sexual orientation and gender identity on the official registration form for the 2024 municipal elections. This was a landmark step in institutional recognition of LGBTQ+ political participation. 
The TSE registration data provides two relevant fields:

- **`DS_ORIENTACAO_SEXUAL`**: sexual orientation (e.g., Heterossexual, Gay, Lesbica, Bissexual, Pansexual, Assexual, or open text)
- **`DS_IDENTIDADE_GENERO`**: gender identity (e.g., Cisgenero, Transgenero/Travesti, Nao-binario, or open text)

Candidates who reported any non-heterosexual orientation or non-cisgender identity are flagged as LGBTQ+.

::: {.callout-important}
## Voluntary Disclosure

Self-disclosure was entirely voluntary. Many LGBTQ+ candidates, particularly those in conservative regions or parties, may have chosen not to declare, meaning the TSE data alone substantially undercounts the true LGBTQ+ candidate population. This is why the VOTE LGBT list is an essential complement.
:::

### VOTE LGBT Candidate List

VOTE LGBT (*Voto com Orgulho*) is a Brazilian civil society project that independently identifies LGBTQ+ candidates running in elections. Their identification methods include:

- Direct self-identification to the organization
- Public campaigning on LGBTQ+ issues or identity
- Social media analysis and community referrals
- Nominations from LGBTQ+ advocacy organizations

For the 2024 municipal elections, VOTE LGBT identified approximately 1,300 candidates. Their list provides: ballot name, state, municipality, position sought, party, gender identity, sexual orientation, and (for some candidates) vote counts.

::: {.callout-note}
## Complementary Coverage

The two sources have partially overlapping but distinct coverage. The TSE captures candidates who chose to declare on the official form regardless of public visibility, while VOTE LGBT captures publicly visible LGBTQ+ candidates regardless of whether they filed a TSE declaration. The union of both sources provides the most comprehensive identification achievable with available data.
:::

## Matching Algorithm (3 Stages)

Linking the VOTE LGBT candidate list to TSE registration records requires reconciling informal names, varying municipality spellings, and approximate string matches. The matching proceeds in three stages, each applied only to candidates not yet matched by a prior stage.

### Stage 1: Exact Match (Name + State + Municipality)

The first and most restrictive stage joins on three standardized keys: ballot name, state, and municipality. Both datasets are preprocessed by removing accents, collapsing whitespace, and converting to uppercase.

```{r stage-1-match}
#| eval: false
# Stage 1: Exact match (name + state + municipality)
lgbt_matched_exact <- lgbt_clean %>%
  inner_join(
    candidates_std %>%
      select(candidate_id, ballot_name_std, state_std, municipality_std),
    by = c("ballot_name_lgbt" = "ballot_name_std",
           "state_lgbt" = "state_std",
           "municipality_lgbt" = "municipality_std")
  ) %>%
  mutate(match_type = "exact_full", match_quality = 1.0)
```

- **Join keys**: `ballot_name_lgbt = ballot_name_std`, `state_lgbt = state_std`, `municipality_lgbt = municipality_std`
- **Quality score**: 1.0 (perfect match on all three dimensions)
- **Expectation**: Captures the majority of matches, as most VOTE LGBT entries use the official ballot name and standard municipality spelling

### Stage 2: Exact Match (Name + State Only)

For candidates unmatched after Stage 1, we relax the municipality requirement and match on ballot name and state alone.
```{r stage-2-match}
#| eval: false
# Stage 2: Exact match (name + state, relaxing municipality)
lgbt_unmatched_s1 <- lgbt_clean %>%
  anti_join(lgbt_matched_exact, by = "row_id")

lgbt_matched_state <- lgbt_unmatched_s1 %>%
  inner_join(
    candidates_std %>% select(candidate_id, ballot_name_std, state_std),
    by = c("ballot_name_lgbt" = "ballot_name_std",
           "state_lgbt" = "state_std")
  ) %>%
  mutate(match_type = "exact_state", match_quality = 0.9)
```

- **Rationale**: The VOTE LGBT list sometimes records informal municipality names, abbreviated forms, or the name of a metropolitan area rather than the specific municipality where the candidate is registered. Matching on name + state resolves these cases while still requiring an exact name match as a safeguard.
- **Quality score**: 0.9

### Stage 3: Fuzzy Matching (Jaro-Winkler Distance)

For remaining unmatched candidates, we apply fuzzy string matching within the same state using the Jaro-Winkler distance metric, which is well-suited for name matching because it gives extra weight to matching prefixes, a useful property when ballot names share common first elements.

```{r stage-3-match}
#| eval: false
# Stage 3: Fuzzy match using Jaro-Winkler distance within state
library(stringdist)

# Candidates still unmatched after Stage 2
lgbt_unmatched_s2 <- lgbt_unmatched_s1 %>%
  anti_join(lgbt_matched_state, by = "row_id")

# One slot per unmatched candidate (result names here are illustrative)
fuzzy_matches <- vector("list", nrow(lgbt_unmatched_s2))

for (i in seq_len(nrow(lgbt_unmatched_s2))) {
  state_candidates <- candidates_std %>%
    filter(state_std == lgbt_unmatched_s2$state_lgbt[i])
  if (nrow(state_candidates) == 0) next  # nothing to compare against

  distances <- stringdist(
    lgbt_unmatched_s2$ballot_name_lgbt[i],
    state_candidates$ballot_name_std,
    method = "jw",
    p = 0.1  # enables the Winkler prefix bonus; the default p = 0 is plain Jaro
  )
  best_idx <- which.min(distances)
  min_dist <- distances[best_idx]

  # Accept if distance < 0.15 (high similarity)
  if (min_dist < 0.15) {
    # Record the match with quality = 1 - distance
    fuzzy_matches[[i]] <- tibble(
      row_id        = lgbt_unmatched_s2$row_id[i],
      candidate_id  = state_candidates$candidate_id[best_idx],
      match_type    = "fuzzy_jw",
      match_quality = 1 - min_dist
    )
  }
}

lgbt_matched_fuzzy <- bind_rows(fuzzy_matches)
```

- **Threshold**: Matches are accepted only when the Jaro-Winkler distance is less than 0.15, corresponding to a similarity above 0.85. This conservative threshold minimizes false positives.
- **Quality score**: `1 - distance` (ranges from just above 0.85 to ~0.99)
- **Manual review**: All fuzzy matches were manually inspected. State and party information served as additional disambiguation criteria in ambiguous cases.

::: {.callout-tip}
## Why Jaro-Winkler?

The Jaro-Winkler distance is preferred over alternatives (Levenshtein, cosine similarity) for person-name matching because it (1) is normalized to [0, 1], (2) penalizes transpositions less harshly than insertions/deletions, and (3) applies a prefix bonus that rewards names sharing the same initial characters, which is common in Brazilian ballot names that begin with a first name or nickname.
:::

## Identity Categorization

Once a candidate is identified as LGBTQ+ (from either source), they are assigned to a disaggregated identity category. The key design decision is that **trans identity is prioritized over sexual orientation**, because gender identity is a distinct dimension that cross-cuts orientation. A trans lesbian, for example, is categorized as "Trans" rather than "Lesbian" in our primary classification.
```{r identity-categorization}
#| eval: false
lcd_category = case_when(
  trans_candidate ~ "Trans",
  sexual_orientation == "Gay" ~ "Gay",
  sexual_orientation == "Lésbica" ~ "Lesbian",
  sexual_orientation %in% c("Bissexual", "Pansexual") ~ "Bisexual+",
  sexual_orientation == "Assexual" ~ "Asexual",
  TRUE ~ "Other LGBTQ+"
)
```

The resulting categories are:

| Category | Definition |
|----------|-----------|
| **Trans** | Any candidate with a transgender or travesti gender identity, regardless of sexual orientation |
| **Gay** | Cisgender male candidates with gay sexual orientation |
| **Lesbian** | Cisgender female candidates with lesbian (*Lesbica*) sexual orientation |
| **Bisexual+** | Candidates with bisexual or pansexual orientation (collapsed because both describe attraction to more than one gender) |
| **Asexual** | Candidates with asexual orientation |
| **Other LGBTQ+** | Identified LGBTQ+ candidates whose specific identity does not map to the above categories |

::: {.callout-note}
## Why Prioritize Trans?

This follows conventions in both the LGBTQ+ studies literature and Brazilian activist nomenclature. Transgender identity is a gender identity, not a sexual orientation. A trans person may simultaneously be gay, lesbian, bisexual, or asexual. Collapsing these into a single "Trans" category for the primary classification preserves the analytically distinct dimension of gender identity, while the underlying data retains full detail for secondary analyses.
:::

## Match Quality Statistics

The following tables summarize the results of the matching process across the identified LGBTQ+ population.
```{r match-quality-data}
lgbtq <- df %>% filter(lgbtq_candidate)
```

### Match Type Distribution

```{r tbl-match-type}
#| label: tbl-match-type
#| tbl-cap: "LGBTQ+ Candidates by Match Type"
lgbtq %>%
  count(match_type, sort = TRUE) %>%
  mutate(pct = format_pct(n / sum(n))) %>%
  rename(`Match Type` = match_type, N = n, `%` = pct) %>%
  kable(align = c("l", "r", "r"))
```

### Disclosure Source Distribution

```{r tbl-disclosure-source}
#| label: tbl-disclosure-source
#| tbl-cap: "LGBTQ+ Candidates by Disclosure Source"
lgbtq %>%
  count(disclosure_source, sort = TRUE) %>%
  mutate(pct = format_pct(n / sum(n))) %>%
  rename(`Disclosure Source` = disclosure_source, N = n, `%` = pct) %>%
  kable(align = c("l", "r", "r"))
```

### Cross-Tabulation: Match Type x Disclosure Source

```{r tbl-match-source-cross}
#| label: tbl-match-source-cross
#| tbl-cap: "Match Type by Disclosure Source"
lgbtq %>%
  count(match_type, disclosure_source) %>%
  pivot_wider(names_from = disclosure_source, values_from = n, values_fill = 0) %>%
  rename(`Match Type` = match_type) %>%
  kable(align = c("l", rep("r", ncol(.) - 1)))
```

### Match Quality Summary

```{r tbl-match-quality-stats}
#| label: tbl-match-quality-stats
#| tbl-cap: "Match Quality Score Summary (for Matched Candidates)"
lgbtq %>%
  filter(!is.na(match_quality)) %>%
  summarise(
    N = format_n(n()),
    Mean = sprintf("%.3f", mean(match_quality)),
    Median = sprintf("%.3f", median(match_quality)),
    Min = sprintf("%.3f", min(match_quality)),
    Max = sprintf("%.3f", max(match_quality))
  ) %>%
  kable(align = rep("r", 5))
```

# Campaign Finance Classification

Chapter 5 presents the results of the campaign finance analysis; this section documents the classification methodology that underpins it. The goal is to transform the raw TSE revenue file, which contains Portuguese-language category labels and complex institutional source codes, into a clean, analytically useful typology of funding sources.
## Raw TSE Categories

The TSE classifies each revenue transaction by `DS_ORIGEM_RECEITA` (revenue source). The following table maps the raw Portuguese categories to our English-language classification:

| Portuguese Category | English Translation | Our Classification |
|---|---|---|
| Recursos proprios | Own resources | `self_funding` |
| Fundo Partidario / Fundo Especial (partido politico) | Party fund / Special fund | `party_funding` |
| Doacoes de pessoas fisicas | Individual donations | `individual_funding` |
| Financiamento coletivo | Crowdfunding | `crowdfunding` |
| Doacoes de outros candidatos/comites | Other candidates/committees | `other_candidates` |
| Everything else | Various | `other` |

## Self-Funding Detection

Self-funding is identified through two complementary methods, capturing cases that either method alone would miss:

1. **Revenue origin**: The `DS_ORIGEM_RECEITA` field contains "recursos pr" (own resources), matched case-insensitively.
2. **CPF match**: The donor's CPF (`NR_CPF_CNPJ_DOADOR`) matches the candidate's own CPF (`NR_CPF_CANDIDATO`). This catches cases where a candidate donates to their own campaign but the transaction is categorized under a different revenue origin.

A transaction satisfying either condition is classified as `self_funding`.

## Classification Code

```{r finance-classification}
#| eval: false
funding_type = case_when(
  str_detect(DS_ORIGEM_RECEITA, "(?i)recursos pr") |
    (NR_CPF_CNPJ_DOADOR == NR_CPF_CANDIDATO) ~ "self_funding",
  str_detect(DS_ORIGEM_RECEITA, "(?i)partido pol") ~ "party_funding",
  str_detect(DS_ORIGEM_RECEITA, "(?i)pessoas f") ~ "individual_funding",
  str_detect(DS_ORIGEM_RECEITA, "(?i)financiamento coletivo") ~ "crowdfunding",
  str_detect(DS_ORIGEM_RECEITA, "(?i)outros candidatos") ~ "other_candidates",
  TRUE ~ "other"
)
```

::: {.callout-note}
## Order Matters in `case_when()`

The `case_when()` function evaluates conditions in order and assigns the first matching category. Self-funding is tested first so that a candidate's own donation is always classified as self-funding, even if the TSE recorded the revenue origin under a different label. This is a deliberate design choice that prioritizes economic substance (who provided the money) over institutional labeling.
:::

## Financial vs. In-Kind Contributions

The TSE field `DS_NATUREZA_RECEITA` distinguishes between cash contributions (`"FINANCEIRO"`) and estimated or in-kind contributions (services, materials, event spaces, etc.). Our dataset preserves both the financial amount (`financial_amt`) and in-kind amount (`inkind_amt`) at the candidate level, along with their percentage shares of total revenue.

## Validation

Our raw aggregation from the ~5.5 million transaction-level records was validated against a pre-existing candidate-level aggregated file produced independently by the parent project. The two aggregations show near-perfect correlation, confirming that the raw processing pipeline faithfully reproduces the intended totals.

# Municipality Crosswalk

## The TSE-IBGE Code Problem

The TSE uses its own internal municipality coding system, which differs from the IBGE's official 7-digit municipality codes used by statistical agencies, the census, and the `geobr` R package. To produce choropleth maps and spatial analyses, we need to bridge these two systems.

## Solution

The crosswalk file (`crosswalk_tse_to_geobr.csv`) resolves this mismatch through name-based matching within state. The procedure is:

1. Start with the TSE municipality code, name, and state
2. Match to IBGE municipality codes via exact name match within state (~95% of cases)
3. Apply fuzzy name matching for municipalities with accent or spelling differences
4. Manually resolve remaining cases (municipality mergers, name changes)

The resulting `geobr_code` (IBGE 7-digit code) enables spatial joins with `geobr` shapefiles via `code_muni`.
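
The exact-match step of the crosswalk procedure can be sketched as follows. This is a minimal illustration on toy data, not the parent project's actual script: the helper `std_name()`, the placeholder TSE codes, and the input column names are assumptions, and the IBGE columns (`code_muni`, `name_muni`, `abbrev_state`) are assumed to follow the `geobr` naming.

```r
library(dplyr)

# Hypothetical helper: strip accents, collapse whitespace, uppercase.
std_name <- function(x) {
  x <- stringi::stri_trans_general(x, "Latin-ASCII")
  toupper(trimws(gsub("\\s+", " ", x)))
}

# Toy inputs. The TSE codes are placeholders, not real TSE codes.
tse <- tibble(
  tse_code = c("T001", "T002", "T003"),
  tse_name = c("SAO PAULO", "BELEM", "VILA  NOVA"),
  state    = c("SP", "PA", "GO")
)
ibge <- tibble(
  code_muni    = c(3550308, 1501402),
  name_muni    = c("S\u00e3o Paulo", "Bel\u00e9m"),  # accented IBGE names
  abbrev_state = c("SP", "PA")
)

# Exact match on standardized name within state
crosswalk <- tse %>%
  mutate(name_std = std_name(tse_name)) %>%
  left_join(
    ibge %>%
      transmute(name_std   = std_name(name_muni),
                state      = abbrev_state,
                geobr_code = code_muni),
    by = c("name_std", "state")
  )

# Rows still NA would go on to fuzzy matching and manual review
needs_review <- crosswalk %>% filter(is.na(geobr_code))
```

Standardizing both sides before the join means accent-only differences are already resolved at the exact-match step, so only genuine spelling differences, mergers, and renames are left for the later steps.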
## Match Rate

```{r muni-match-rate}
#| label: muni-match-rate
cat("Municipality match rate:", format_pct(mean(!is.na(df$geobr_code))), "\n")
cat("Candidates with geobr_code:", format_n(sum(!is.na(df$geobr_code))), "of", format_n(nrow(df)), "\n")
```

The few unmatched candidates (<1%) are primarily from overseas voting sections or reflect rare discrepancies between TSE and IBGE municipality code systems.

# Municipality Context Variables

## IBGE Census Data

Municipality-level characteristics from the IBGE 2022 Census are merged via the parent project's `municipality_characteristics.csv` intermediate file. Variables include population, GDP per capita (deflated), racial composition (% White, Black, Brown, Indigenous), poverty indicators, and state capital status. The merge key is `geobr_code` (IBGE 7-digit municipality code), matching 99.9% of candidates.

## Bolsonaro 2022 Vote Share

Local electorate conservatism is proxied by Bolsonaro's first-round vote share in the 2022 presidential election. This is computed from TSE section-level voting data (`votacao_secao_2022_BR.csv`):

1. Filter to first round (`NR_TURNO == 1`)
2. Aggregate by municipality (`CD_MUNICIPIO`)
3. Compute: `100 * (votes for candidate 22) / (total valid votes, excluding codes 95 and 96)`

The result is cached as `bolsonaro_share.rds` to avoid re-processing the large file. Coverage is 88.8% of candidates; unmatched cases are primarily overseas voters or municipalities with TSE code mismatches.
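
The three steps above can be sketched on toy data as follows. This is an illustration under assumptions rather than the project's script: `NR_TURNO` and `CD_MUNICIPIO` come from the description above, while `NR_VOTAVEL` (the number voted for) and `QT_VOTOS` (the vote count) are assumed column names for the section-level file.

```r
library(dplyr)

# Toy stand-in for votacao_secao_2022_BR.csv.
# NR_VOTAVEL and QT_VOTOS are assumed column names.
votes <- tibble::tribble(
  ~NR_TURNO, ~CD_MUNICIPIO, ~NR_VOTAVEL, ~QT_VOTOS,
          1,           100,          22,        60,
          1,           100,          13,        30,
          1,           100,          95,         5,  # excluded code
          1,           100,          96,         5,  # excluded code
          1,           200,          22,        20,
          1,           200,          13,        80,
          2,           100,          22,       999   # second round, dropped
)

bolsonaro_share <- votes %>%
  # Step 1: first round only; drop codes 95 and 96 so the denominator
  # contains valid votes only
  filter(NR_TURNO == 1, !NR_VOTAVEL %in% c(95, 96)) %>%
  # Step 2: aggregate by municipality
  group_by(CD_MUNICIPIO) %>%
  # Step 3: share of valid votes cast for candidate 22
  summarise(
    bolsonaro_share = 100 * sum(QT_VOTOS[NR_VOTAVEL == 22]) / sum(QT_VOTOS),
    .groups = "drop"
  )
```

In the toy data, municipality 100 has 90 valid votes of which 60 are for candidate 22, and municipality 200 has 100 valid votes of which 20 are, so the shares are 66.7% and 20%.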
## Municipality Political Context

Two additional variables are derived from the candidate data itself:

- **`n_parties`**: number of distinct party abbreviations running candidates in each municipality
- **`n_seats`**: number of elected city councilors per municipality (defaulting to 9, Brazil's statutory minimum, when no elected councilors are found)

# Variable Construction Reference

The following table documents all derived variables in the analysis dataset, including the transformation logic and rationale for each.

| Variable | Formula / Logic | Rationale |
|---|---|---|
| `lgbt_category` | Trans > Sexual Orientation (via `make_lgbt_category()`) | Gender identity cross-cuts orientation; trans identity is analytically distinct |
| `education_simple` | 8 TSE codes collapsed to 3 levels: Less than HS, High School, College+ | Reduces sparse cells for meaningful group comparisons |
| `race_simple` | 6 TSE codes collapsed to 4 levels: White, Brown, Black, Other | Aligns with census conventions; preserves key distinctions |
| `age_group` | 5 bins: 18--29, 30--39, 40--49, 50--59, 60+ | Standard demographic grouping for age-stratified analysis |
| `region` | 27 states mapped to 5 macro-regions | Standard Brazilian geographic division (North, Northeast, Center-West, Southeast, South) |
| `ideology_category` | <4.0 Left, 4.0 to <7.1 Center, >=7.1 Right | Bolognesi et al. (2023) expert survey thresholds, anchored on PT, MDB, PL |
| `female` | `gender == "FEMININO"` | Binary indicator for two-group gender comparisons |
| `nonwhite` | `race != "BRANCA"` (NA-safe) | Binary indicator for two-group racial comparisons |
| `lgbtq_label` | `"LGBTQ+"` / `"Non-LGBTQ+"` factor | Clean label for plots and tables; ordered for consistent display |
| `position_simple` | TSE position codes translated to English: City Councilor, Mayor, Vice-Mayor | Readable labels for analysis and visualization |
| `populacao_2022` | IBGE 2022 Census population | Municipality-level context for geographic analyses |
| `pop_bracket` | Population cut into 5 brackets: <10K, 10K-50K, 50K-200K, 200K-500K, 500K+ | Meaningful population categories for cross-tabulation |
| `capital_uf` | Binary: 1 = state capital, 0 = otherwise | Distinguishes capitals (typically larger, more progressive) |
| `pib_per_capita_defla` | GDP per capita, deflated (IBGE) | Economic development proxy |
| `bolsonaro_share` | Bolsonaro 2022 first-round vote share (%) per municipality | Proxy for local electorate conservatism |
| `bolsonaro_quartile` | Quartile of `bolsonaro_share`: Q1 (least conservative) through Q4 | Categorical conservatism measure for cross-tabulation |
| `n_parties` | `n_distinct(party_abbrev)` per municipality | Political fragmentation / competitiveness |
| `n_seats` | Elected councilors per municipality (min 9) | Council size / institutional context |
| `pct_nonwhite_muni` | `perc_preta + perc_parda` from Census | Municipality racial composition |
| `has_manifesto` | Logical: whether a manifesto PDF was found and text extracted | Only available for mayoral candidates; ~96.5% extraction rate |
| `manifesto_text` | Full extracted text of the candidate's manifesto (*proposta de governo*) | `NA` for non-mayors and image-only PDFs |
| `manifesto_n_pages` | Page count of the manifesto PDF (from `pdfinfo`) | Typically 6--13 pages |
| `manifesto_n_words` | Word count of extracted manifesto text | Typically 1,500--2,000 words |
| `threeway_group` | Trans / LGB / Non-LGBTQ+ (trans_candidate > lgb_candidate) | Three-way comparison in Ch06 intersectional analysis |

::: {.callout-note}
## On Ideology Thresholds

The Left/Center/Right cutpoints are not arbitrary. They are derived from natural breaks in the Bolognesi et al. (2023) expert survey distribution and anchored on three parties that serve as ideological reference points: PT (*Partido dos Trabalhadores*, score ~2.0, clearly Left), MDB (*Movimento Democratico Brasileiro*, score ~5.5, centrist catch-all), and PL (*Partido Liberal*, score ~8.5, clearly Right under Bolsonaro). Approximately 2% of candidates belong to minor parties not covered by the survey and have `NA` ideology scores.
:::

# Reproducibility

## Prerequisites

The data pipeline depends on source files from the parent research project. If running on a machine other than the original development environment, set the `QPP_DATA_DIR` environment variable to point to the parent project directory:

```r
Sys.setenv(QPP_DATA_DIR = "/path/to/Queer_politicians_project/")
```

## Execution Order

Run the five data preparation scripts in sequence before rendering the Quarto site. Note that `05_process_manifestos.R` runs before `04_build_analysis_data.R`, because the build step merges the manifesto text:

```{r execution-order}
#| eval: false
# Step 1: Load and prepare candidate data
source(here::here("code", "01_load_candidates.R"))
# -> data/derived/candidates_analysis.rds

# Step 2: Process raw campaign finance data
source(here::here("code", "02_load_finance_raw.R"))
# -> data/derived/finance_by_candidate.rds
# -> data/derived/finance_transactions.rds
# -> output/tables/finance_revenue_source_categories.csv

# Step 3: Download and cache geographic boundaries
source(here::here("code", "03_prepare_geography.R"))
# -> data/geo_cache/municipalities.rds
# -> data/geo_cache/states.rds

# Step 4: Extract text from candidate manifesto PDFs (~15,800 PDFs)
source(here::here("code", "05_process_manifestos.R"))
# -> data/derived/manifestos_text.rds

# Step 5: Merge all components into analysis dataset
source(here::here("code", "04_build_analysis_data.R"))
# -> data/derived/analysis_full.rds
# -> data/derived/bolsonaro_share.rds (cached)

# Step 6: Render the Quarto website
# quarto render docs/
```

## Rendering

Once `analysis_full.rds` exists, the Quarto site can be rendered with:

```bash
quarto render docs/
```

All chapters source `code/00_setup.R` for shared paths, palettes, themes, and helper functions. The `freeze: auto` setting in `_quarto.yml` means that chapters whose source code has not changed will not be re-executed on subsequent renders.

## Data Access

The parent project data is stored on Dropbox and is not publicly available. Researchers wishing to replicate this analysis should:

1. Download raw TSE candidate registration files from [https://dadosabertos.tse.jus.br/](https://dadosabertos.tse.jus.br/)
2. Download raw TSE campaign finance files from the same portal
3. Download candidate manifesto PDFs (*propostas de governo*) from the same portal
4. Contact VOTE LGBT for their candidate identification list
5. Run the parent project's integration scripts to produce the processed files
6. Set `QPP_DATA_DIR` and run this project's scripts in the sequence given under Execution Order (01, 02, 03, 05, then 04)
7. Render with `quarto render docs/`
