7. Data Pipeline & Methodology

From Raw TSE Files to Analysis-Ready Data

Show code
source(here::here("code", "00_setup.R"))
df <- readRDS(paths$analysis_full_rds)

1 Pipeline Overview

This chapter documents the complete data pipeline that transforms raw electoral records into the analysis-ready dataset used throughout Chapters 1–6. Transparency and reproducibility are core objectives: every classification decision, matching algorithm, and variable derivation is described here so that other researchers can evaluate, replicate, or extend this work.

The pipeline consists of five R scripts, executed sequentially, each reading specific inputs and producing well-defined outputs. An upstream step in the parent research project handles the initial VOTE LGBT matching; all subsequent processing occurs within this descriptives project.

1.1 Data Flow

Show code
flowchart TD
    A["TSE Candidate Registration\n~464K rows, 59 cols"] --> B["01_load_candidates.R"]
    C["VOTE LGBT List\n~1,300 candidates"] --> D["07_merge_lgbtq_list.R\n(Parent Project)"]
    D --> A
    E["TSE Finance Raw\n~5.5M transactions, 1.3GB"] --> F["02_load_finance_raw.R"]
    G["IBGE Shapefiles\ngeobr package"] --> H["03_prepare_geography.R"]
    M["Municipality Characteristics\nIBGE Census, 5,570 rows"] --> I
    N["2022 Presidential Voting\nTSE section-level data"] --> I
    O["Candidate Manifestos\n27 state ZIPs, ~15.8K PDFs"] --> P["05_process_manifestos.R"]
    B --> I["04_build_analysis_data.R"]
    F --> I
    H --> I
    P --> I
    I --> J["analysis_full.rds\n~464K rows, 112 cols"]
    J --> K["Quarto Website\nChapters 1-8"]

1.2 Script Descriptions

01_load_candidates.R reads the integrated candidate CSV from the parent project (~210 MB, ~464K rows x 59 columns). It creates derived variables—simplified demographics, age groups, regional assignments, ideology categories, and the disaggregated LGBTQ+ identity classification—and saves the result as candidates_analysis.rds.

02_load_finance_raw.R reads the raw TSE campaign finance file (receitas_candidatos_2024_BRASIL.csv, ~1.3 GB, semicolon-delimited, Latin1 encoding). It documents every revenue source category in the raw data, classifies each transaction into funding types (self, party, individual, crowdfunding, other candidates, other), aggregates to the candidate level, validates the aggregation against a pre-existing aggregated file, and saves both candidate-level summaries (finance_by_candidate.rds) and the full transaction-level data (finance_transactions.rds).

03_prepare_geography.R downloads and caches municipality and state boundary shapefiles from the IBGE via the geobr R package. These are stored locally in data/geo_cache/ so that subsequent renders do not re-download. It also reads the municipality crosswalk that bridges TSE and IBGE coding systems.

05_process_manifestos.R extracts text from candidate manifesto PDFs (propostas de governo), which are mandatory filings for mayoral candidates. The TSE distributes these as 27 state-level ZIP archives (~11 GB total, ~15,800 PDFs). The script processes each state sequentially: unzipping to a temporary directory, parsing the SQ_CANDIDATO (candidate ID) from each filename, extracting text with pdftotext and page counts with pdfinfo (poppler), and cleaning up. Corrupt or image-only PDFs (~3.5%) are flagged as extraction_ok = FALSE with NA text. The output is manifestos_text.rds.

04_build_analysis_data.R merges the four preceding outputs—candidates, finance, geography, and manifestos—into a single analysis-ready file (analysis_full.rds, ~464K rows, 112 columns). In addition to filling financial variables with zeros for unmatched candidates, this script:

  • Reads municipality characteristics from the IBGE Census (population, GDP per capita, racial composition, poverty indicators, state capital status) via the parent project’s intermediate file
  • Computes n_parties (number of distinct parties) and n_seats (city council seats) per municipality from the candidate data
  • Computes Bolsonaro 2022 vote share per municipality from TSE section-level voting data (cached to avoid re-processing the large file)
  • Creates derived variables: pop_bracket (5 population categories), bolsonaro_quartile (conservatism quartiles), and pct_nonwhite_muni (municipality racial composition)
  • Merges manifesto text from manifestos_text.rds (deduplicated to one manifesto per candidate, keeping the longest text when multiple PDFs exist). Non-mayor candidates and mayors without a manifesto get has_manifesto = FALSE.
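The derived municipality variables listed above could be built roughly as follows. This is a minimal sketch rather than the project's code: the toy `muni` data frame and the choice of left-closed brackets are assumptions, while the column names follow the variable reference in Section 6.

```r
library(dplyr)

# Toy municipality table (made-up values, for illustration only)
muni <- tibble(
  geobr_code      = c(1, 2, 3, 4),
  populacao_2022  = c(8000, 50000, 250000, 1200000),
  bolsonaro_share = c(30, 45, 55, 70),
  perc_preta      = c(10, 8, 12, 9),
  perc_parda      = c(50, 40, 30, 35)
)

muni <- muni %>%
  mutate(
    # Five population brackets; right = FALSE makes each interval [lower, upper)
    pop_bracket = cut(
      populacao_2022,
      breaks = c(0, 10e3, 50e3, 200e3, 500e3, Inf),
      labels = c("<10K", "10K-50K", "50K-200K", "200K-500K", "500K+"),
      right  = FALSE
    ),
    # Quartiles of Bolsonaro share: Q1 = least conservative; NA stays NA
    bolsonaro_quartile = factor(
      ntile(bolsonaro_share, 4),
      levels = 1:4, labels = paste0("Q", 1:4)
    ),
    # Share of Black plus Brown residents
    pct_nonwhite_muni = perc_preta + perc_parda
  )
```

Whether a municipality of exactly 50,000 inhabitants falls in the lower or upper bracket depends on the `right` argument; the sketch places it in the upper one.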

2 LGBTQ+ Identification Methodology

Identifying LGBTQ+ candidates in a population of nearly half a million candidacies is the central methodological challenge of this project. We draw on two complementary sources, then apply a three-stage matching algorithm to link the external VOTE LGBT list to official TSE records.

2.1 Two Sources of Identification

2.1.1 TSE Self-Disclosure (New in 2024)

For the first time in Brazilian electoral history, the Tribunal Superior Eleitoral (TSE) allowed candidates to voluntarily declare their sexual orientation and gender identity on the official registration form for the 2024 municipal elections. This was a landmark step in institutional recognition of LGBTQ+ political participation.

The TSE registration data provides two relevant fields:

  • DS_ORIENTACAO_SEXUAL — sexual orientation (e.g., Heterossexual, Gay, Lesbica, Bissexual, Pansexual, Assexual, or open text)
  • DS_IDENTIDADE_GENERO — gender identity (e.g., Cisgenero, Transgenero/Travesti, Nao-binario, or open text)

Candidates who reported any non-heterosexual orientation or non-cisgender identity are flagged as LGBTQ+.
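As a rough illustration of this flagging rule, the sketch below marks a candidate as LGBTQ+ when either field holds a declared value outside a list of non-LGBTQ+ or non-disclosure labels. The function name `flag_tse_lgbtq()` and every label string are assumptions; the exact values in the TSE file should be checked before using anything like this.

```r
# Values that do NOT indicate an LGBTQ+ declaration. These strings are
# assumptions for illustration; check them against the actual TSE file.
non_lgbtq_orientation <- c("HETEROSSEXUAL", "NAO INFORMADO", "NAO DIVULGAVEL")
non_lgbtq_identity    <- c("CISGENERO", "NAO INFORMADO", "NAO DIVULGAVEL")

# TRUE when either field holds a declared value outside the lists above
flag_tse_lgbtq <- function(orientation, identity) {
  norm     <- function(x) toupper(trimws(x))
  declared <- function(x) !is.na(x) & nzchar(trimws(x))
  (declared(orientation) & !(norm(orientation) %in% non_lgbtq_orientation)) |
    (declared(identity) & !(norm(identity) %in% non_lgbtq_identity))
}

flags <- flag_tse_lgbtq(
  orientation = c("Gay", "Heterossexual", NA, "Heterossexual", ""),
  identity    = c("Cisgenero", "Cisgenero", NA, "Transgenero/Travesti", "Nao divulgavel")
)
```

Missing, empty, and non-disclosure values are treated as "not declared" here, so they never trigger the flag on their own.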

Voluntary Disclosure

Self-disclosure was entirely voluntary. Many LGBTQ+ candidates—particularly those in conservative regions or parties—may have chosen not to declare, meaning the TSE data alone substantially undercounts the true LGBTQ+ candidate population. This is why the VOTE LGBT list is an essential complement.

2.1.2 VOTE LGBT Candidate List

VOTE LGBT (Voto com Orgulho) is a Brazilian civil society project that independently identifies LGBTQ+ candidates running in elections. Their identification methods include:

  • Direct self-identification to the organization
  • Public campaigning on LGBTQ+ issues or identity
  • Social media analysis and community referrals
  • Nominations from LGBTQ+ advocacy organizations

For the 2024 municipal elections, VOTE LGBT identified approximately 1,300 candidates. Their list provides: ballot name, state, municipality, position sought, party, gender identity, sexual orientation, and (for some candidates) vote counts.

Complementary Coverage

The two sources have partially overlapping but distinct coverage. The TSE captures candidates who chose to declare on the official form regardless of public visibility, while VOTE LGBT captures publicly visible LGBTQ+ candidates regardless of whether they filed a TSE declaration. The union of both sources provides the most comprehensive identification achievable with available data.
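Taking the union of the two sources amounts to a simple labelling step. The sketch below assumes two hypothetical logical columns, `tse_lgbtq` and `vote_lgbtq`, and reproduces the three `disclosure_source` labels reported later in this chapter; it is not the project's actual code.

```r
library(dplyr)

# Toy flags (hypothetical column names): one row per candidate
cand <- tibble(
  candidate_id = 1:4,
  tse_lgbtq    = c(TRUE, TRUE, FALSE, FALSE),
  vote_lgbtq   = c(FALSE, TRUE, TRUE, FALSE)
)

cand <- cand %>%
  mutate(
    # Union of both sources
    lgbtq_candidate = tse_lgbtq | vote_lgbtq,
    # Which source(s) identified the candidate
    disclosure_source = case_when(
      tse_lgbtq & vote_lgbtq ~ "VOTE + TSE",
      tse_lgbtq              ~ "TSE",
      vote_lgbtq             ~ "VOTE",
      TRUE                   ~ NA_character_
    )
  )
```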

2.2 Matching Algorithm (3 Stages)

Linking the VOTE LGBT candidate list to TSE registration records requires reconciling informal names, varying municipality spellings, and approximate string matches. The matching proceeds in three stages, each applied only to candidates not yet matched by a prior stage.

2.2.1 Stage 1: Exact Match (Name + State + Municipality)

The first and most restrictive stage joins on three standardized keys: ballot name, state, and municipality. Both datasets are preprocessed by removing accents, collapsing whitespace, and converting to uppercase.
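A standardization step of this kind might look like the sketch below. The helper name `standardize_key()` is hypothetical, and using the stringi package for accent removal is an assumption; the project's own preprocessing function is not shown in this chapter.

```r
library(stringi)

# Hypothetical helper: normalize a name or municipality string for joining
standardize_key <- function(x) {
  x <- stri_trans_general(x, "Latin-ASCII")  # remove accents
  x <- gsub("\\s+", " ", trimws(x))          # trim ends, collapse inner whitespace
  toupper(x)                                 # convert to uppercase
}

standardize_key("  Jos\u00e9   da  Silva ")
# returns "JOSE DA SILVA"
```

Applying the same function to both datasets before joining is what makes the exact match in this stage meaningful.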

Show code
# Stage 1: Exact match (name + state + municipality)
lgbt_matched_exact <- lgbt_clean %>%
  inner_join(
    candidates_std %>%
      select(candidate_id, ballot_name_std, state_std, municipality_std),
    by = c("ballot_name_lgbt" = "ballot_name_std",
           "state_lgbt" = "state_std",
           "municipality_lgbt" = "municipality_std")
  ) %>%
  mutate(match_type = "exact_full", match_quality = 1.0)
  • Join keys: ballot_name_lgbt = ballot_name_std, state_lgbt = state_std, municipality_lgbt = municipality_std
  • Quality score: 1.0 (perfect match on all three dimensions)
  • Expectation: Captures the majority of matches, as most VOTE LGBT entries use the official ballot name and standard municipality spelling

2.2.2 Stage 2: Exact Match (Name + State Only)

For candidates unmatched after Stage 1, we relax the municipality requirement and match on ballot name and state alone.

Show code
# Stage 2: Exact match (name + state, relaxing municipality)
lgbt_unmatched_s1 <- lgbt_clean %>%
  anti_join(lgbt_matched_exact, by = "row_id")

lgbt_matched_state <- lgbt_unmatched_s1 %>%
  inner_join(
    candidates_std %>%
      select(candidate_id, ballot_name_std, state_std),
    by = c("ballot_name_lgbt" = "ballot_name_std",
           "state_lgbt" = "state_std")
  ) %>%
  mutate(match_type = "exact_state", match_quality = 0.9)
  • Rationale: The VOTE LGBT list sometimes records informal municipality names, abbreviated forms, or the name of a metropolitan area rather than the specific municipality where the candidate is registered. Matching on name + state resolves these cases while still requiring an exact name match as a safeguard.
  • Quality score: 0.9
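Because municipality is no longer part of the join, two TSE candidates in the same state who share a ballot name would both match one VOTE LGBT entry. One way to surface such cases is sketched below on toy data; the split into `unambiguous` and `needs_review` is an illustrative assumption, not a documented step of the pipeline.

```r
library(dplyr)

# Toy Stage 2 output: row_id 2 matched two TSE candidates with the same name
lgbt_matched_state <- tibble(
  row_id       = c(1, 2, 2, 3),
  candidate_id = c(101, 202, 203, 304)
)

# Count how many TSE candidates each VOTE LGBT row matched
matched_counts <- lgbt_matched_state %>%
  add_count(row_id, name = "n_hits")

unambiguous  <- matched_counts %>% filter(n_hits == 1) %>% select(-n_hits)
needs_review <- matched_counts %>% filter(n_hits > 1)
```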

2.2.3 Stage 3: Fuzzy Matching (Jaro-Winkler Distance)

For remaining unmatched candidates, we apply fuzzy string matching within the same state using the Jaro-Winkler distance metric, which is well-suited for name matching because it gives extra weight to matching prefixes—a useful property when ballot names share common first elements.

Show code
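# Stage 3: Fuzzy match using Jaro-Winkler distance within state
library(dplyr)
library(stringdist)

# Candidates still unmatched after Stages 1 and 2
lgbt_unmatched_s2 <- lgbt_unmatched_s1 %>%
  anti_join(lgbt_matched_state, by = "row_id")

fuzzy_matches <- vector("list", nrow(lgbt_unmatched_s2))

for (i in seq_len(nrow(lgbt_unmatched_s2))) {
  # Restrict the comparison set to TSE candidates in the same state
  state_candidates <- candidates_std %>%
    filter(state_std == lgbt_unmatched_s2$state_lgbt[i])

  # Note: stringdist() uses p = 0 by default, which yields the plain Jaro
  # distance; pass p (for example 0.1) to apply the Winkler prefix bonus
  distances <- stringdist(
    lgbt_unmatched_s2$ballot_name_lgbt[i],
    state_candidates$ballot_name_std,
    method = "jw"
  )

  best_idx <- which.min(distances)
  min_dist <- distances[best_idx]

  # Accept if distance < 0.15 (high similarity); skip if no candidate exists
  if (length(min_dist) == 1 && min_dist < 0.15) {
    # Record the match with quality = 1 - distance
    fuzzy_matches[[i]] <- tibble(
      row_id        = lgbt_unmatched_s2$row_id[i],
      candidate_id  = state_candidates$candidate_id[best_idx],
      match_type    = "fuzzy_jw",
      match_quality = 1 - min_dist
    )
  }
}

# One row per accepted fuzzy match
lgbt_matched_fuzzy <- bind_rows(fuzzy_matches)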

  • Threshold: Matches are accepted only when the Jaro-Winkler distance is below 0.15, that is, when the similarity is above 0.85. This conservative threshold minimizes false positives.
  • Quality score: 1 - distance (ranges from just above 0.85 to ~0.99)
  • Manual review: All fuzzy matches were manually inspected. State and party information served as additional disambiguation criteria in ambiguous cases.

Why Jaro-Winkler?

The Jaro-Winkler distance is preferred over alternatives (Levenshtein, cosine similarity) for person-name matching because it (1) is normalized to [0, 1], (2) penalizes transpositions less harshly than insertions/deletions, and (3) applies a prefix bonus that rewards names sharing the same initial characters—common in Brazilian ballot names that begin with a first name or nickname.
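The effect of the prefix bonus can be seen on a textbook pair of strings. The sketch below uses the stringdist package, where the `p` argument controls the prefix weight; the example strings and the value `p = 0.1` are illustrative choices, not settings taken from the project.

```r
library(stringdist)

a <- "MARTHA"
b <- "MARHTA"

# p = 0 (the stringdist default) gives the plain Jaro distance
d_jaro <- stringdist(a, b, method = "jw", p = 0)

# p = 0.1 adds the Winkler bonus for the shared prefix "MAR"
d_jw <- stringdist(a, b, method = "jw", p = 0.1)

# The prefix bonus can only shrink the distance, never increase it
d_jw <= d_jaro
```

With `p` left at its default the function returns the plain Jaro distance, so the prefix weight has to be set explicitly for the Winkler variant.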

2.3 Identity Categorization

Once a candidate is identified as LGBTQ+ (from either source), they are assigned to a disaggregated identity category. The key design decision is that trans identity is prioritized over sexual orientation, because gender identity is a distinct dimension that cross-cuts orientation. A trans lesbian, for example, is categorized as “Trans” rather than “Lesbian” in our primary classification.

Show code
lgbt_category = case_when(
  trans_candidate ~ "Trans",
  sexual_orientation == "Gay" ~ "Gay",
  sexual_orientation == "Lésbica" ~ "Lesbian",
  sexual_orientation %in% c("Bissexual", "Pansexual") ~ "Bisexual+",
  sexual_orientation == "Assexual" ~ "Asexual",
  TRUE ~ "Other LGBTQ+"
)

The resulting categories are:

| Category | Definition |
|---|---|
| Trans | Any candidate with a transgender or travesti gender identity, regardless of sexual orientation |
| Gay | Cisgender male candidates with gay sexual orientation |
| Lesbian | Cisgender female candidates with lesbian (Lesbica) sexual orientation |
| Bisexual+ | Candidates with bisexual or pansexual orientation (collapsed because both describe attraction to more than one gender) |
| Asexual | Candidates with asexual orientation |
| Other LGBTQ+ | Identified LGBTQ+ candidates whose specific identity does not map to the above categories |

Why Prioritize Trans?

This follows conventions in both the LGBTQ+ studies literature and Brazilian activist nomenclature. Transgender identity is a gender identity, not a sexual orientation. A trans person may simultaneously be gay, lesbian, bisexual, or asexual. Collapsing these into a single “Trans” category for the primary classification preserves the analytically distinct dimension of gender identity, while the underlying data retains full detail for secondary analyses.

2.4 Match Quality Statistics

The following tables summarize the results of the matching process across the identified LGBTQ+ population.

Show code
lgbtq <- df %>% filter(lgbtq_candidate)

2.4.1 Match Type Distribution

Show code
lgbtq %>%
  count(match_type, sort = TRUE) %>%
  mutate(pct = format_pct(n / sum(n))) %>%
  rename(`Match Type` = match_type, N = n, `%` = pct) %>%
  kable(align = c("l", "r", "r"))

Table 1: LGBTQ+ Candidates by Match Type

| Match Type | N | % |
|---|---|---|
| exact_full | 3132 | 99.9% |
| fuzzy | 2 | 0.1% |

2.4.2 Disclosure Source Distribution

Show code
lgbtq %>%
  count(disclosure_source, sort = TRUE) %>%
  mutate(pct = format_pct(n / sum(n))) %>%
  rename(`Disclosure Source` = disclosure_source, N = n, `%` = pct) %>%
  kable(align = c("l", "r", "r"))

Table 2: LGBTQ+ Candidates by Disclosure Source

| Disclosure Source | N | % |
|---|---|---|
| TSE | 2215 | 70.7% |
| VOTE + TSE | 628 | 20.0% |
| VOTE | 291 | 9.3% |

2.4.3 Cross-Tabulation: Match Type x Disclosure Source

Show code
lgbtq %>%
  count(match_type, disclosure_source) %>%
  pivot_wider(names_from = disclosure_source, values_from = n, values_fill = 0) %>%
  rename(`Match Type` = match_type) %>%
  kable(align = c("l", rep("r", ncol(.) - 1)))

Table 3: Match Type by Disclosure Source

| Match Type | TSE | VOTE | VOTE + TSE |
|---|---|---|---|
| exact_full | 2215 | 289 | 628 |
| fuzzy | 0 | 2 | 0 |

2.4.4 Match Quality Summary

Show code
lgbtq %>%
  filter(!is.na(match_quality)) %>%
  summarise(
    N = format_n(n()),
    Mean = sprintf("%.3f", mean(match_quality)),
    Median = sprintf("%.3f", median(match_quality)),
    Min = sprintf("%.3f", min(match_quality)),
    Max = sprintf("%.3f", max(match_quality))
  ) %>%
  kable(align = rep("r", 5))

Table 4: Match Quality Score Summary (for Matched Candidates)

| N | Mean | Median | Min | Max |
|---|---|---|---|---|
| 3,134 | 1.000 | 1.000 | 0.974 | 1.000 |

3 Campaign Finance Classification

Chapter 5 presents the results of the campaign finance analysis; this section documents the classification methodology that underpins it. The goal is to transform the raw TSE revenue file—which contains Portuguese-language category labels and complex institutional source codes—into a clean, analytically useful typology of funding sources.

3.1 Raw TSE Categories

The TSE classifies each revenue transaction by DS_ORIGEM_RECEITA (revenue source). The following table maps the raw Portuguese categories to our English-language classification:

| Portuguese Category | English Translation | Our Classification |
|---|---|---|
| Recursos proprios | Own resources | self_funding |
| Fundo Partidario / Fundo Especial (partido politico) | Party fund / Special fund | party_funding |
| Doacoes de pessoas fisicas | Individual donations | individual_funding |
| Financiamento coletivo | Crowdfunding | crowdfunding |
| Doacoes de outros candidatos/comites | Other candidates/committees | other_candidates |
| Everything else | Various | other |

3.2 Self-Funding Detection

Self-funding is identified through two complementary methods, capturing cases that either source alone would miss:

  1. Revenue origin: The DS_ORIGEM_RECEITA field contains “recursos pr” (own resources), matched case-insensitively.
  2. CPF match: The donor’s CPF (NR_CPF_CNPJ_DOADOR) matches the candidate’s own CPF (NR_CPF_CANDIDATO). This catches cases where a candidate donates to their own campaign but the transaction is categorized under a different revenue origin.

A transaction satisfying either condition is classified as self_funding.

3.3 Classification Code

Show code
funding_type = case_when(
  str_detect(DS_ORIGEM_RECEITA, "(?i)recursos pr") |
    (NR_CPF_CNPJ_DOADOR == NR_CPF_CANDIDATO) ~ "self_funding",
  str_detect(DS_ORIGEM_RECEITA, "(?i)partido pol") ~ "party_funding",
  str_detect(DS_ORIGEM_RECEITA, "(?i)pessoas f") ~ "individual_funding",
  str_detect(DS_ORIGEM_RECEITA, "(?i)financiamento coletivo") ~ "crowdfunding",
  str_detect(DS_ORIGEM_RECEITA, "(?i)outros candidatos") ~ "other_candidates",
  TRUE ~ "other"
)
Order Matters in case_when()

The case_when() function evaluates conditions in order and assigns the first matching category. Self-funding is tested first so that a candidate’s own donation is always classified as self-funding, even if the TSE recorded the revenue origin under a different label. This is a deliberate design choice that prioritizes economic substance (who provided the money) over institutional labeling.

3.4 Financial vs. In-Kind Contributions

The TSE field DS_NATUREZA_RECEITA distinguishes between cash contributions ("FINANCEIRO") and estimated or in-kind contributions (services, materials, event spaces, etc.). Our dataset preserves both the financial amount (financial_amt) and in-kind amount (inkind_amt) at the candidate level, along with their percentage shares of total revenue.
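The candidate-level split could be computed along the lines of the sketch below. The amount column `VR_RECEITA`, the non-cash label `"ESTIMAVEL"`, and the share columns `pct_financial` and `pct_inkind` are assumptions for illustration; only `financial_amt`, `inkind_amt`, and the `"FINANCEIRO"` label come from the description above.

```r
library(dplyr)

# Toy transactions; VR_RECEITA and the "ESTIMAVEL" label are assumed names
tx <- tibble(
  SQ_CANDIDATO        = c(1, 1, 1, 2),
  DS_NATUREZA_RECEITA = c("FINANCEIRO", "ESTIMAVEL", "FINANCEIRO", "ESTIMAVEL"),
  VR_RECEITA          = c(100, 50, 50, 80)
)

by_candidate <- tx %>%
  group_by(SQ_CANDIDATO) %>%
  summarise(
    # Cash contributions vs. everything else (estimated / in-kind)
    financial_amt = sum(VR_RECEITA[DS_NATUREZA_RECEITA == "FINANCEIRO"]),
    inkind_amt    = sum(VR_RECEITA[DS_NATUREZA_RECEITA != "FINANCEIRO"]),
    .groups = "drop"
  ) %>%
  mutate(
    total_amt     = financial_amt + inkind_amt,
    # Shares of total revenue; NA when a candidate has no revenue at all
    pct_financial = if_else(total_amt > 0, 100 * financial_amt / total_amt, NA_real_),
    pct_inkind    = if_else(total_amt > 0, 100 * inkind_amt / total_amt, NA_real_)
  )
```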

3.5 Validation

Our raw aggregation from the ~5.5 million transaction-level records was validated against a pre-existing candidate-level aggregated file produced independently by the parent project. The two aggregations show near-perfect correlation, confirming that the raw processing pipeline faithfully reproduces the intended totals.
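A validation of this kind can be expressed as a join followed by a few summary checks. The sketch below uses toy data and hypothetical column names (`total_raw`, `total_parent`), and the 0.01 tolerance is an arbitrary illustrative choice.

```r
library(dplyr)

# Toy candidate-level totals from the two independent aggregations
raw_agg    <- tibble(candidate_id = 1:4, total_raw    = c(100, 250, 0, 75))
parent_agg <- tibble(candidate_id = 1:4, total_parent = c(100, 250, 0, 75.5))

check <- raw_agg %>%
  inner_join(parent_agg, by = "candidate_id") %>%
  mutate(abs_diff = abs(total_raw - total_parent))

validation <- check %>%
  summarise(
    n_compared  = n(),
    correlation = cor(total_raw, total_parent),
    max_diff    = max(abs_diff),
    n_off       = sum(abs_diff > 0.01)   # candidates whose totals disagree
  )
```

Reporting the number of disagreeing candidates alongside the correlation helps, because a correlation close to 1 can hide a handful of large discrepancies.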

4 Municipality Crosswalk

4.1 The TSE-IBGE Code Problem

The TSE uses its own internal municipality coding system, which differs from the IBGE’s official 7-digit municipality codes used by statistical agencies, the census, and the geobr R package. To produce choropleth maps and spatial analyses, we need to bridge these two systems.

4.2 Solution

The crosswalk file (crosswalk_tse_to_geobr.csv) resolves this mismatch through name-based matching within state. The procedure is:

  1. Start with the TSE municipality code, name, and state
  2. Match to IBGE municipality codes via exact name match within state (~95% of cases)
  3. Apply fuzzy name matching for municipalities with accent or spelling differences
  4. Manually resolve remaining cases (municipality mergers, name changes)
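The first three steps above can be sketched as follows. The toy data frames, the made-up codes, and the 0.1 distance cutoff are all assumptions; `amatch()` from the stringdist package stands in for whatever fuzzy matcher was actually used.

```r
library(stringdist)

# Toy inputs with already-standardized names (made-up codes, for illustration)
tse <- data.frame(
  tse_code = c("00001", "00002", "00003"),
  state    = c("SP", "SP", "BA"),
  name     = c("SAO PAULO", "MOGI MIRIM", "SALVADOR"),
  stringsAsFactors = FALSE
)
ibge <- data.frame(
  geobr_code = c(1000001, 1000002, 1000003),
  state      = c("SP", "SP", "BA"),
  name       = c("SAO PAULO", "MOJI MIRIM", "SALVADOR"),
  stringsAsFactors = FALSE
)

match_one <- function(i) {
  pool <- ibge[ibge$state == tse$state[i], ]   # only compare within the state
  hit  <- match(tse$name[i], pool$name)        # step 2: exact name match
  if (is.na(hit)) {
    # step 3: closest name within a small Jaro-Winkler distance
    hit <- amatch(tse$name[i], pool$name, method = "jw", p = 0.1, maxDist = 0.1)
  }
  # NA means the case is left for manual resolution (step 4)
  if (is.na(hit)) NA_real_ else pool$geobr_code[hit]
}

tse$geobr_code <- vapply(seq_len(nrow(tse)), match_one, numeric(1))
```

Restricting the candidate pool to the same state keeps fuzzy matches from pairing identically named municipalities in different states.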

The resulting geobr_code (IBGE 7-digit code) enables spatial joins with geobr shapefiles via code_muni.

4.3 Match Rate

Show code
cat("Municipality match rate:",
    format_pct(mean(!is.na(df$geobr_code))), "\n")
Municipality match rate: 99.9% 
Show code
cat("Candidates with geobr_code:",
    format_n(sum(!is.na(df$geobr_code))), "of",
    format_n(nrow(df)), "\n")
Candidates with geobr_code: 463,338 of 463,601 

The small number of unmatched candidates (263 of 463,601, under 0.1%) are primarily from overseas voting sections or reflect rare discrepancies between TSE and IBGE municipality code systems.

5 Municipality Context Variables

5.1 IBGE Census Data

Municipality-level characteristics from the IBGE 2022 Census are merged via the parent project’s municipality_characteristics.csv intermediate file. Variables include population, GDP per capita (deflated), racial composition (% White, Black, Brown, Indigenous), poverty indicators, and state capital status. The merge key is geobr_code (IBGE 7-digit municipality code), matching 99.9% of candidates.

5.2 Bolsonaro 2022 Vote Share

Local electorate conservatism is proxied by Bolsonaro’s first-round vote share in the 2022 presidential election. This is computed from TSE section-level voting data (votacao_secao_2022_BR.csv):

  1. Filter to first round (NR_TURNO == 1)
  2. Aggregate by municipality (CD_MUNICIPIO)
  3. Compute: 100 * (votes for candidate 22) / (total valid votes, excluding codes 95 and 96)
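The three steps above translate into a short dplyr pipeline. The sketch below runs on toy data; `NR_VOTAVEL` (the number voted for) and `QT_VOTOS` (the vote count) are assumed column names, while `NR_TURNO` and `CD_MUNICIPIO` come from the steps themselves.

```r
library(dplyr)

# Toy section-level rows; NR_VOTAVEL and QT_VOTOS are assumed column names
votes <- tibble(
  NR_TURNO     = c(1, 1, 1, 1, 2, 1, 1),
  CD_MUNICIPIO = c(10, 10, 10, 10, 10, 20, 20),
  NR_VOTAVEL   = c(22, 13, 95, 96, 22, 22, 13),
  QT_VOTOS     = c(40, 60, 5, 5, 99, 30, 10)
)

bolsonaro_share <- votes %>%
  # First round only; drop codes 95 and 96 so that only valid votes remain
  filter(NR_TURNO == 1, !NR_VOTAVEL %in% c(95, 96)) %>%
  group_by(CD_MUNICIPIO) %>%
  summarise(
    bolsonaro_share = 100 * sum(QT_VOTOS[NR_VOTAVEL == 22]) / sum(QT_VOTOS),
    .groups = "drop"
  )
```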

The result is cached as bolsonaro_share.rds to avoid re-processing the large file. Coverage is 88.8% of candidates; unmatched cases are primarily overseas voters or municipalities with TSE code mismatches.

5.3 Municipality Political Context

Two additional variables are derived from the candidate data itself:

  • n_parties: number of distinct party abbreviations running candidates in each municipality
  • n_seats: number of elected city councilors per municipality (defaulting to 9, Brazil’s statutory minimum, when no elected councilors are found)
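Both variables are simple per-municipality aggregates. The sketch below uses toy data; the `muni_code` and `elected` columns are hypothetical, while `party_abbrev` and `position_simple` follow the variable reference in Section 6.

```r
library(dplyr)

# Toy candidate rows (hypothetical muni_code and elected columns)
cands <- tibble(
  muni_code       = c(1, 1, 1, 1, 2, 2),
  party_abbrev    = c("PT", "PL", "PL", "MDB", "PT", "PT"),
  position_simple = c("City Councilor", "City Councilor", "Mayor",
                      "City Councilor", "City Councilor", "Mayor"),
  elected         = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)
)

muni_context <- cands %>%
  group_by(muni_code) %>%
  summarise(
    n_parties = n_distinct(party_abbrev),
    n_elected = sum(elected & position_simple == "City Councilor"),
    .groups = "drop"
  ) %>%
  # Fall back to 9 seats when no elected councilors are found
  mutate(n_seats = if_else(n_elected == 0L, 9L, n_elected)) %>%
  select(-n_elected)
```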

6 Variable Construction Reference

The following table documents all derived variables in the analysis dataset, including the transformation logic and rationale for each.

| Variable | Formula / Logic | Rationale |
|---|---|---|
| lgbt_category | Trans > Sexual Orientation (via make_lgbt_category()) | Gender identity cross-cuts orientation; trans identity is analytically distinct |
| education_simple | 8 TSE codes collapsed to 3 levels: Less than HS, High School, College+ | Reduces sparse cells for meaningful group comparisons |
| race_simple | 6 TSE codes collapsed to 4 levels: White, Brown, Black, Other | Aligns with census conventions; preserves key distinctions |
| age_group | 5 bins: 18–29, 30–39, 40–49, 50–59, 60+ | Standard demographic grouping for age-stratified analysis |
| region | 27 states mapped to 5 macro-regions | Standard Brazilian geographic division (North, Northeast, Center-West, Southeast, South) |
| ideology_category | <4.0 Left, 4.0 to <7.1 Center, >=7.1 Right | Bolognesi et al. (2023) expert survey thresholds, anchored on PT, MDB, PL |
| female | gender == "FEMININO" | Binary indicator for two-group gender comparisons |
| nonwhite | race != "BRANCA" (NA-safe) | Binary indicator for two-group racial comparisons |
| lgbtq_label | "LGBTQ+" / "Non-LGBTQ+" factor | Clean label for plots and tables; ordered for consistent display |
| position_simple | TSE position codes translated to English: City Councilor, Mayor, Vice-Mayor | Readable labels for analysis and visualization |
| populacao_2022 | IBGE 2022 Census population | Municipality-level context for geographic analyses |
| pop_bracket | Population cut into 5 brackets: <10K, 10K-50K, 50K-200K, 200K-500K, 500K+ | Meaningful population categories for cross-tabulation |
| capital_uf | Binary: 1 = state capital, 0 = otherwise | Distinguishes capitals (typically larger, more progressive) |
| pib_per_capita_defla | GDP per capita, deflated (IBGE) | Economic development proxy |
| bolsonaro_share | Bolsonaro 2022 first-round vote share (%) per municipality | Proxy for local electorate conservatism |
| bolsonaro_quartile | Quartile of bolsonaro_share: Q1 (least conservative) through Q4 | Categorical conservatism measure for cross-tabulation |
| n_parties | n_distinct(party_abbrev) per municipality | Political fragmentation / competitiveness |
| n_seats | Elected councilors per municipality (defaults to 9 when none are found) | Council size / institutional context |
| pct_nonwhite_muni | perc_preta + perc_parda from Census | Municipality racial composition |
| has_manifesto | Logical: whether a manifesto PDF was found and text extracted | Only available for mayoral candidates; ~96.5% extraction rate |
| manifesto_text | Full extracted text of the candidate’s manifesto (proposta de governo) | NA for non-mayors and image-only PDFs |
| manifesto_n_pages | Page count of the manifesto PDF (from pdfinfo) | Typically 6–13 pages |
| manifesto_n_words | Word count of extracted manifesto text | Typically 1,500–2,000 words |
| threeway_group | Trans / LGB / Non-LGBTQ+ (trans_candidate > lgb_candidate) | Three-way comparison in Ch06 intersectional analysis |

On Ideology Thresholds

The Left/Center/Right cutpoints are not arbitrary. They are derived from natural breaks in the Bolognesi et al. (2023) expert survey distribution and anchored on three parties that serve as ideological reference points: PT (Partido dos Trabalhadores, score ~2.0, clearly Left), MDB (Movimento Democratico Brasileiro, score ~5.5, centrist catch-all), and PL (Partido Liberal, score ~8.5, clearly Right under Bolsonaro). Approximately 2% of candidates belong to minor parties not covered by the survey and have NA ideology scores.
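Applied to a numeric score, these cutpoints reduce to a single `cut()` call. The sketch below assumes the boundaries are left-closed (a score of exactly 4.0 is Center and exactly 7.1 is Right) and uses a hypothetical helper name; the three test scores are the approximate reference values quoted above.

```r
# Hypothetical helper: map an expert-survey score to Left / Center / Right
ideology_category <- function(score) {
  cut(
    score,
    breaks = c(-Inf, 4.0, 7.1, Inf),
    labels = c("Left", "Center", "Right"),
    right  = FALSE   # intervals are [lower, upper)
  )
}

# Approximate PT, MDB, and PL scores, the two boundary values, and a missing score
categories <- as.character(ideology_category(c(2.0, 5.5, 8.5, 4.0, 7.1, NA)))
```

Parties without a survey score stay `NA` instead of being forced into a category.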

7 Reproducibility

7.1 Prerequisites

The data pipeline depends on source files from the parent research project. If running on a machine other than the original development environment, set the QPP_DATA_DIR environment variable to point to the parent project directory:

Sys.setenv(QPP_DATA_DIR = "/path/to/Queer_politicians_project/")

7.2 Execution Order

Run the five data preparation scripts in sequence before rendering the Quarto site:

Show code
# Step 1: Load and prepare candidate data
source(here::here("code", "01_load_candidates.R"))
#   -> data/derived/candidates_analysis.rds

# Step 2: Process raw campaign finance data
source(here::here("code", "02_load_finance_raw.R"))
#   -> data/derived/finance_by_candidate.rds
#   -> data/derived/finance_transactions.rds
#   -> output/tables/finance_revenue_source_categories.csv

# Step 3: Download and cache geographic boundaries
source(here::here("code", "03_prepare_geography.R"))
#   -> data/geo_cache/municipalities.rds
#   -> data/geo_cache/states.rds

# Step 4: Extract text from candidate manifesto PDFs (~15,800 PDFs)
source(here::here("code", "05_process_manifestos.R"))
#   -> data/derived/manifestos_text.rds

# Step 5: Merge all components into analysis dataset
source(here::here("code", "04_build_analysis_data.R"))
#   -> data/derived/analysis_full.rds
#   -> data/derived/bolsonaro_share.rds  (cached)

# Step 6: Render the Quarto website
# quarto render docs/

7.3 Rendering

Once analysis_full.rds exists, the Quarto site can be rendered with:

quarto render docs/

All chapters source code/00_setup.R for shared paths, palettes, themes, and helper functions. The freeze: auto setting in _quarto.yml means that chapters whose source code has not changed will not be re-executed on subsequent renders.

7.4 Data Access

The parent project data is stored on Dropbox and is not publicly available. Researchers wishing to replicate this analysis should:

  1. Download raw TSE candidate registration files from https://dadosabertos.tse.jus.br/
  2. Download raw TSE campaign finance files from the same portal
  3. Download candidate manifesto PDFs (propostas de governo) from the same portal
  4. Contact VOTE LGBT for their candidate identification list
  5. Run the parent project’s integration scripts to produce the processed files
  6. Set QPP_DATA_DIR and run this project’s scripts in the execution order given in Section 7.2 (01, 02, 03, 05, then 04)
  7. Render with quarto render docs/