7. Candidate Manifestos

What Do LGBTQ+ Candidates Promise? A Quantitative Text Analysis

source(here::here("code", "00_setup.R"))

# Text analysis libraries
library(quanteda)
library(quanteda.textstats)
library(quanteda.textplots)
library(tidytext)
library(stm)
library(SnowballC)

# Load data
df <- readRDS(paths$analysis_full_rds)

# Working subset: mayors with usable manifesto text
# Deduplicate by candidate_id (keep first row; manifesto text is the same across dupes)
mayors <- df %>%
  filter(position_simple == "Mayor", has_manifesto, !is.na(manifesto_text),
         nchar(manifesto_text) > 50) %>%
  distinct(candidate_id, .keep_all = TRUE) %>%
  mutate(
    lgbtq_label = factor(
      if_else(lgbtq_candidate, "LGBTQ+", "Non-LGBTQ+"),
      levels = c("LGBTQ+", "Non-LGBTQ+")
    ),
    ideology_category = factor(ideology_category, levels = ideology_levels)
  )

n_mayors       <- nrow(mayors)
n_lgbtq        <- sum(mayors$lgbtq_candidate)
n_nonlgbtq     <- n_mayors - n_lgbtq

# --- Pre-compute values for inline narrative text ---

# Word count comparisons
lgbtq_words   <- mayors$manifesto_n_words[mayors$lgbtq_candidate]
nonlgbtq_words <- mayors$manifesto_n_words[!mayors$lgbtq_candidate]

median_words_lgbtq    <- median(lgbtq_words, na.rm = TRUE)
median_words_nonlgbtq <- median(nonlgbtq_words, na.rm = TRUE)
mean_words_lgbtq      <- mean(lgbtq_words, na.rm = TRUE)
mean_words_nonlgbtq   <- mean(nonlgbtq_words, na.rm = TRUE)

# Page count comparisons
median_pages_lgbtq    <- median(mayors$manifesto_n_pages[mayors$lgbtq_candidate], na.rm = TRUE)
median_pages_nonlgbtq <- median(mayors$manifesto_n_pages[!mayors$lgbtq_candidate], na.rm = TRUE)

# Direction labels
word_direction <- if (median_words_lgbtq > median_words_nonlgbtq) "longer" else "shorter"

1 Overview

In Brazil, all mayoral candidates must submit a proposta de governo (government proposal/manifesto) to the Tribunal Superior Eleitoral (TSE) as part of their candidacy registration. These mandatory filings offer a unique window into what candidates promise voters — and whether LGBTQ+ candidates differ in what they emphasize.

This chapter analyzes 14,771 mayoral manifestos using quantitative text analysis, of which 37 belong to LGBTQ+ candidates. We proceed in five stages: (1) describing the corpus and its basic properties, (2) comparing text complexity, (3) examining word frequencies and statistical keyness, (4) measuring policy salience using a custom Portuguese-language dictionary, and (5) estimating a Structural Topic Model to identify latent themes associated with LGBTQ+ candidacy.

Small LGBTQ+ Sample (N = 37)

Only 37 LGBTQ+ mayoral candidates have usable manifesto text. Although 39 LGBTQ+ mayoral candidates submitted manifesto PDFs, two filings (one in Bahia, one in São Paulo) produced no readable content upon extraction: their PDFs contained only blank pages or form-feed characters, likely because they were image-only scans that OCR could not recover. With a group this small, all comparative statistics should be read as descriptive patterns rather than as well-powered tests. We report medians and non-parametric tests where feasible.

2 Corpus Overview

2.1 Manifesto Coverage

coverage <- mayors %>%
  group_by(lgbtq_label) %>%
  summarise(
    N                = n(),
    `Median Words`   = format_n(median(manifesto_n_words, na.rm = TRUE)),
    `Mean Words`     = format_n(round(mean(manifesto_n_words, na.rm = TRUE))),
    `Median Pages`   = round(median(manifesto_n_pages, na.rm = TRUE), 1),
    `Mean Pages`     = round(mean(manifesto_n_pages, na.rm = TRUE), 1),
    .groups = "drop"
  ) %>%
  rename(Group = lgbtq_label)

coverage %>% kable(align = c("l", "r", "r", "r", "r", "r"))
save_table(coverage, "07_manifesto_coverage.csv")

Table 1: Manifesto Coverage and Length by LGBTQ+ Status

Group	N	Median Words	Mean Words	Median Pages	Mean Pages
LGBTQ+	37	4,517	6,902	19	24.0
Non-LGBTQ+	14,734	2,819	3,775	12	15.7

LGBTQ+ candidates write manifestos that are longer than those of non-LGBTQ+ candidates (median 4,517 vs 2,819 words). The median page count is 19 for LGBTQ+ candidates and 12 for non-LGBTQ+ candidates.

2.2 Word Count Distribution

ggplot(mayors, aes(x = manifesto_n_words, fill = lgbtq_label, color = lgbtq_label)) +
  geom_density(alpha = 0.3, linewidth = 0.8) +
  geom_vline(
    data = mayors %>% group_by(lgbtq_label) %>%
      summarise(med = median(manifesto_n_words, na.rm = TRUE), .groups = "drop"),
    aes(xintercept = med, color = lgbtq_label),
    linetype = "dashed", linewidth = 0.7
  ) +
  scale_x_log10(labels = label_comma()) +
  scale_fill_manual(values = pal_lgbtq) +
  scale_color_manual(values = pal_lgbtq) +
  labs(
    x        = "Word Count (log scale)",
    y        = "Density",
    fill     = NULL,
    color    = NULL,
    title    = "Manifesto Length Distribution",
    subtitle = "Dashed lines indicate group medians"
  )

save_figure(last_plot(), "07_word_count_density")

Figure 1: Distribution of Manifesto Word Counts by LGBTQ+ Status (Log Scale)

2.3 Page Count Distribution

mayors %>%
  mutate(page_bin = cut(
    manifesto_n_pages,
    breaks = c(0, 5, 10, 20, 50, Inf),
    labels = c("1-5", "6-10", "11-20", "21-50", "50+"),
    right = TRUE
  )) %>%
  count(lgbtq_label, page_bin) %>%
  group_by(lgbtq_label) %>%
  mutate(pct = n / sum(n)) %>%
  ungroup() %>%
  ggplot(aes(x = page_bin, y = pct, fill = lgbtq_label)) +
  geom_col(position = "dodge", alpha = 0.85) +
  scale_y_continuous(labels = label_percent()) +
  scale_fill_manual(values = pal_lgbtq) +
  labs(
    x        = "Number of Pages",
    y        = "Proportion of Candidates",
    fill     = NULL,
    title    = "Manifesto Page Count Distribution",
    subtitle = "Grouped by LGBTQ+ status"
  )

save_figure(last_plot(), "07_page_count_dist")

Figure 2: Page Count Distribution by LGBTQ+ Status
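
The bin edges in Figure 2 depend on how `cut()` treats values that fall exactly on a break. The sketch below reuses the same breaks and labels on a handful of hypothetical page counts (invented for illustration) to show where boundary values land.

```r
page_breaks <- c(0, 5, 10, 20, 50, Inf)
page_labels <- c("1-5", "6-10", "11-20", "21-50", "50+")

# Hypothetical page counts chosen to sit on the bin edges
pages <- c(1, 5, 6, 10, 11, 20, 21, 50, 51)

# right = TRUE makes each interval closed on the right: (0,5], (5,10], ...
bins <- cut(pages, breaks = page_breaks, labels = page_labels, right = TRUE)
data.frame(pages, bin = as.character(bins))
```

With `right = TRUE`, a 50-page manifesto falls in "21-50", so the "50+" bin holds manifestos of 51 pages or more. A page count of 0 would fall outside every interval and become `NA`.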

2.4 Word Count by Identity Category

identity_words <- mayors %>%
  filter(lgbtq_candidate, lgbt_category != "Other LGBTQ+") %>%
  group_by(lgbt_category) %>%
  summarise(
    N              = n(),
    `Median Words` = format_n(median(manifesto_n_words, na.rm = TRUE)),
    `Mean Words`   = format_n(round(mean(manifesto_n_words, na.rm = TRUE))),
    `Median Pages` = round(median(manifesto_n_pages, na.rm = TRUE), 1),
    .groups = "drop"
  ) %>%
  rename(Category = lgbt_category) %>%
  arrange(desc(N))

# Add Non-LGBTQ+ reference row
nonlgbtq_ref <- tibble(
  Category       = "Non-LGBTQ+ (ref.)",
  N              = n_nonlgbtq,
  `Median Words` = format_n(median_words_nonlgbtq),
  `Mean Words`   = format_n(round(mean_words_nonlgbtq)),
  `Median Pages` = round(median_pages_nonlgbtq, 1)
)

bind_rows(identity_words, nonlgbtq_ref) %>%
  kable(align = c("l", "r", "r", "r", "r"))
save_table(bind_rows(identity_words, nonlgbtq_ref), "07_wordcount_identity.csv")

Table 2: Manifesto Length by LGBTQ+ Identity Category

Category	N	Median Words	Mean Words	Median Pages
Gay	22	4,396	6,703	16.5
Bisexual+	10	4,612	6,486	21.5
Trans	2	18,100	18,100	60.5
Asexual	2	2,788	2,788	13.5
Lesbian	1	1,260	1,260	5.0
Non-LGBTQ+ (ref.)	14,734	2,819	3,775	12.0

3 Text Complexity

We examine two dimensions of text complexity: readability (how easy the text is to read) and lexical diversity (how varied the vocabulary is).

Readability Caveat

The Flesch Reading Ease formula was calibrated on English. Its inputs (words per sentence and syllables per word) can still be computed for Portuguese text, but the resulting scores cannot be read against the English-calibrated scale or its grade-level bands; most scores in this corpus are in fact negative. We use these scores only for relative comparisons between groups, not for absolute readability claims.
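
To make the caveat concrete, the sketch below evaluates the standard Flesch Reading Ease formula for two hypothetical texts. The input values are assumptions chosen for illustration, not statistics from this corpus; they show how longer sentences and more syllables per word can push the score below zero.

```r
# Flesch Reading Ease, as defined for English text
flesch_reading_ease <- function(words_per_sentence, syllables_per_word) {
  206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
}

# Hypothetical inputs (assumptions for illustration, not measured values)
english_like    <- flesch_reading_ease(words_per_sentence = 15, syllables_per_word = 1.5)
portuguese_like <- flesch_reading_ease(words_per_sentence = 25, syllables_per_word = 2.3)

round(c(english_like = english_like, portuguese_like = portuguese_like), 2)
# english_like = 64.71, portuguese_like = -13.12
```

Because the syllable term carries a weight of 84.6, a modest rise in syllables per word moves the score far more than a change in sentence length does.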

# Build quanteda corpus
corp <- corpus(mayors, text_field = "manifesto_text",
               docid_field = "candidate_id")

# Readability (Flesch)
readability <- textstat_readability(corp, measure = "Flesch") %>%
  rename(candidate_id = document, flesch = Flesch) %>%
  mutate(candidate_id = as.numeric(candidate_id))

# Lexical diversity (CTTR — Corrected Type-Token Ratio)
# Note: MATTR crashes on quanteda.textstats 0.97.2 (window reset bug).
# CTTR = types / sqrt(2 * tokens), which dampens (but does not fully remove) the dependence on document length.
toks_lexdiv <- tokens(corp, remove_punct = TRUE, remove_numbers = TRUE)
lexdiv <- textstat_lexdiv(toks_lexdiv, measure = "CTTR") %>%
  rename(candidate_id = document, cttr = CTTR) %>%
  mutate(candidate_id = as.numeric(candidate_id))

# Join back to mayors
mayors <- mayors %>%
  left_join(readability, by = "candidate_id") %>%
  left_join(lexdiv, by = "candidate_id")

3.1 Readability Comparison

ggplot(mayors, aes(x = lgbtq_label, y = flesch, fill = lgbtq_label)) +
  geom_boxplot(alpha = 0.7, outlier.alpha = 0.3) +
  scale_fill_manual(values = pal_lgbtq) +
  labs(
    x        = NULL,
    y        = "Flesch Reading Ease",
    fill     = NULL,
    title    = "Manifesto Readability",
    subtitle = "Higher scores indicate easier-to-read text",
    caption  = "Note: Flesch formula calibrated for English; used here for relative comparison only."
  ) +
  guides(fill = "none")

save_figure(last_plot(), "07_readability_comparison")

Figure 3: Flesch Reading Ease Score by LGBTQ+ Status

3.2 Lexical Diversity

ggplot(mayors, aes(x = cttr, fill = lgbtq_label, color = lgbtq_label)) +
  geom_density(alpha = 0.3, linewidth = 0.8) +
  scale_fill_manual(values = pal_lgbtq) +
  scale_color_manual(values = pal_lgbtq) +
  labs(
    x        = "Corrected Type-Token Ratio (CTTR)",
    y        = "Density",
    fill     = NULL,
    color    = NULL,
    title    = "Lexical Diversity of Manifestos",
    subtitle = "CTTR = types / sqrt(2 * tokens); higher values indicate more varied vocabulary"
  )

save_figure(last_plot(), "07_lexdiv_comparison")

Figure 4: Lexical Diversity (CTTR) by LGBTQ+ Status
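
The CTTR definition in the figure subtitle can be reproduced by hand. The following is a minimal base-R sketch on two invented token vectors of equal length; the helper name `cttr()` is hypothetical and is not the function used in the chapter, which relies on `textstat_lexdiv()`.

```r
# Corrected Type-Token Ratio: distinct words over sqrt(2 * total words)
cttr <- function(tokens) {
  n_tokens <- length(tokens)
  n_types  <- length(unique(tokens))
  n_types / sqrt(2 * n_tokens)
}

# Toy token vectors (invented): same length, different vocabulary variety
repetitive <- c("saude", "saude", "escola", "saude", "escola", "saude", "escola", "saude")
varied     <- c("saude", "escola", "cultura", "renda", "lazer", "moradia", "esgoto", "saude")

c(repetitive = cttr(repetitive), varied = cttr(varied))
# repetitive = 0.50, varied = 1.75
```

Both vectors contain eight tokens, so the difference in CTTR here reflects vocabulary variety alone.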

3.3 Complexity Summary

# Wilcoxon tests
flesch_test <- wilcox.test(flesch ~ lgbtq_label, data = mayors)
cttr_test   <- wilcox.test(cttr ~ lgbtq_label, data = mayors)

format_p <- function(p) {
  if (p < 0.001) "< 0.001" else sprintf("%.3f", p)
}

complexity_summary <- mayors %>%
  group_by(lgbtq_label) %>%
  summarise(
    N               = n(),
    `Median Flesch`  = round(median(flesch, na.rm = TRUE), 1),
    `Mean Flesch`    = round(mean(flesch, na.rm = TRUE), 1),
    `Median CTTR`    = round(median(cttr, na.rm = TRUE), 3),
    `Mean CTTR`      = round(mean(cttr, na.rm = TRUE), 3),
    .groups = "drop"
  ) %>%
  rename(Group = lgbtq_label)

complexity_summary %>% kable(align = c("l", "r", "r", "r", "r", "r"))
save_table(complexity_summary, "07_text_complexity.csv")

Table 3: Text Complexity Summary by LGBTQ+ Status

Group	N	Median Flesch	Mean Flesch	Median CTTR	Mean CTTR
LGBTQ+	37	-20.9	-23.7	14.178	14.274
Non-LGBTQ+	14734	-27.8	-50.9	12.645	12.466

Wilcoxon rank-sum test: Flesch p = 0.025; CTTR p < 0.001.
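
The rank-sum test compares ranks rather than means, so it is less sensitive to the long lower tail visible in Table 3 (a non-LGBTQ+ mean Flesch of -50.9 against a median of -27.8). A minimal sketch on simulated data follows; the group sizes, means, and seed are arbitrary assumptions, not values from the corpus.

```r
set.seed(1)

# Simulated (not real) scores: one small group and one large group
small_group <- rnorm(37,   mean = 0.3, sd = 1)
large_group <- rnorm(1000, mean = 0.0, sd = 1)

scores <- data.frame(
  value = c(small_group, large_group),
  group = factor(rep(c("small", "large"), times = c(37, 1000)))
)

# Same formula interface as the calls above; the test works on ranks of `value`
result <- wilcox.test(value ~ group, data = scores)
result$p.value
```

The p-value describes how surprising the observed rank ordering would be if both groups came from the same distribution; with only 37 observations in one group it is best read alongside the medians rather than on its own.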

4 Word Frequency and Keyness

# --- Build the NLP pipeline (shared across sections 4-6) ---

# Portuguese stopwords: combine two sources for broad coverage
pt_stopwords <- unique(c(
  stopwords("pt", source = "snowball"),
  stopwords("pt", source = "stopwords-iso")
))

# Domain-specific stopwords common in Brazilian political manifestos
manifesto_stopwords <- c(
  "município", "municipio", "cidade", "prefeito", "prefeita",
  "vice", "candidato", "candidata", "governo", "plano",
  "proposta", "propostas", "gestão", "gestao", "administração",
  "administracao", "público", "publico", "pública", "publica",
  "municipal", "prefeitura", "câmara", "camara", "vereador",
  "secretaria", "artigo", "lei", "parágrafo", "inciso",
  "nº", "art", "cf", "cpf", "cnpj", "ainda", "além", "cada",
  "forma", "bem", "toda", "todo", "todos", "todas", "será",
  "deve", "deverá", "podem", "sendo", "através",
  "assim", "sobre", "entre", "onde", "acordo", "partir"
)

all_stopwords <- unique(c(pt_stopwords, manifesto_stopwords))

# Tokenize
toks <- tokens(corp, remove_punct = TRUE, remove_numbers = TRUE,
               remove_symbols = TRUE) %>%
  tokens_tolower() %>%
  tokens_remove(pattern = all_stopwords) %>%
  tokens_remove(pattern = "^.{1,2}$", valuetype = "regex")

# Stemmed version for frequency and STM
toks_stem <- tokens_wordstem(toks, language = "portuguese")

# DFMs
dfm_raw  <- dfm(toks)        # unstemmed (for dictionary lookup)
dfm_stem <- dfm(toks_stem)   # stemmed

# Trimmed version for keyness and STM
# Convert count threshold to proportion (compatible with all quanteda versions)
dfm_trimmed <- dfm_stem %>%
  dfm_trim(min_docfreq = 10 / ndoc(dfm_stem), max_docfreq = 0.95,
           docfreq_type = "prop")
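
The trimming step expresses "at least 10 documents" as a proportion of the corpus (about 0.0007 if all 14,771 manifestos are in the DFM) and drops terms found in more than 95% of documents. The base-R sketch below applies the same logic to a tiny invented document-term matrix, with the lower threshold reduced to 2 documents so that the toy example has something to keep.

```r
# Toy document-term matrix (rows = documents, columns = terms); counts are invented
dtm <- matrix(
  c(1, 0, 2,
    1, 0, 0,
    1, 3, 1,
    1, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("everywhere", "rare", "sometimes"))
)

n_docs   <- nrow(dtm)
doc_freq <- colSums(dtm > 0) / n_docs   # share of documents containing each term

min_prop <- 2 / n_docs   # "at least 2 documents", expressed as a proportion
max_prop <- 0.95         # drop terms present in more than 95% of documents

keep <- doc_freq >= min_prop & doc_freq <= max_prop
names(keep)[keep]
# "sometimes"
```

The term "everywhere" is dropped because it appears in every document, and "rare" is dropped because it appears in only one.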

4.1 Top Words by Group

# Group DFM by LGBTQ+ label
dfm_grouped <- dfm_group(dfm_stem, groups = lgbtq_label)

# Get top words per group
top_words <- textstat_frequency(dfm_stem, n = 20, groups = lgbtq_label) %>%
  mutate(group = factor(group, levels = c("LGBTQ+", "Non-LGBTQ+")))

ggplot(top_words, aes(x = reorder_within(feature, frequency, group),
                      y = frequency, fill = group)) +
  geom_col(alpha = 0.85, show.legend = FALSE) +
  scale_x_reordered() +
  scale_fill_manual(values = pal_lgbtq) +
  coord_flip() +
  facet_wrap(~ group, scales = "free") +
  labs(
    x     = NULL,
    y     = "Frequency",
    title = "Most Frequent Words in Manifestos",
    subtitle = "After stopword removal and Portuguese stemming"
  )

save_figure(last_plot(), "07_top_words_group", height = 8)

Figure 5: Top 20 Most Frequent Stemmed Words by LGBTQ+ Status

4.2 Keyness Analysis

Keyness statistics identify words that appear disproportionately in one group relative to the other. We use the chi-squared measure, with LGBTQ+ manifestos as the target group.

# Group and compute keyness
dfm_key <- dfm_group(dfm_trimmed, groups = lgbtq_label)
keyness <- textstat_keyness(dfm_key, target = "LGBTQ+", measure = "chi2")

# Top 20 in each direction
top_pos <- keyness %>% slice_max(chi2, n = 20)
top_neg <- keyness %>% slice_min(chi2, n = 20)
key_plot_data <- bind_rows(top_pos, top_neg) %>%
  mutate(
    direction = if_else(chi2 > 0, "LGBTQ+", "Non-LGBTQ+"),
    feature   = reorder(feature, chi2)
  )

ggplot(key_plot_data, aes(x = chi2, y = feature, fill = direction)) +
  geom_col(alpha = 0.85) +
  geom_vline(xintercept = 0, linewidth = 0.5) +
  scale_fill_manual(values = pal_lgbtq) +
  labs(
    x        = expression(chi^2 ~ "statistic"),
    y        = NULL,
    fill     = "Over-represented in",
    title    = "Keyness: Distinctive Words by Group",
    subtitle = "Top 20 words over-represented in each group (chi-squared measure)",
    caption  = "Positive values = over-represented in LGBTQ+ manifestos"
  )

save_figure(last_plot(), "07_keyness_chi2", height = 9)

Figure 6: Statistical Keyness: Words Over-Represented in LGBTQ+ vs Non-LGBTQ+ Manifestos

Interpreting Keyness

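The chi-squared statistic itself is never negative; the values plotted here carry a sign that indicates direction. A large positive value means the word is over-represented in LGBTQ+ manifestos relative to the total amount of text in that group; a large negative value means it is over-represented in non-LGBTQ+ manifestos. Because the statistic is computed on pooled group totals, a single long manifesto can drive a word's score. With only 37 LGBTQ+ documents, individual keyness scores should be treated as exploratory indicators rather than definitive findings.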
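
A signed chi-squared score can be reconstructed from a 2×2 table of word counts. The sketch below is a simplified base-R illustration with hypothetical counts; the helper `signed_chi2()` is not part of quanteda, and it omits any continuity correction that `textstat_keyness()` may apply.

```r
# Pearson chi-squared for one word, signed by whether the target group
# uses the word more (+) or less (-) than expected under independence
signed_chi2 <- function(target_count, target_total, ref_count, ref_total) {
  observed <- matrix(
    c(target_count, target_total - target_count,
      ref_count,    ref_total - ref_count),
    nrow = 2, byrow = TRUE
  )
  test <- suppressWarnings(chisq.test(observed, correct = FALSE))
  direction <- if (observed[1, 1] >= test$expected[1, 1]) 1 else -1
  unname(direction * test$statistic)
}

# Hypothetical counts: a 10,000-token target corpus and a 1,000,000-token reference
over  <- signed_chi2(30, 10000, 300, 1000000)  # word is ten times as frequent in target
under <- signed_chi2(1,  10000, 300, 1000000)  # word is rarer in target than in reference
c(over = over, under = under)
```

The first call returns a positive score and the second a negative one, which is how the bars in Figure 6 end up on opposite sides of zero.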

5 Policy Dictionary Analysis

We define a custom Portuguese-language policy dictionary with 10 domains covering the main areas of Brazilian municipal governance. Each manifesto’s salience profile is the proportion of its (classified) words falling in each domain.
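
The salience calculation can be illustrated on a toy case. The sketch below uses an invented two-domain dictionary and an invented token vector, and matches glob patterns with base R rather than quanteda; the helper `count_domain()` is hypothetical.

```r
# Two-domain toy dictionary in the same glob style as the chapter's dictionary
toy_dict <- list(
  education = c("escola*", "ensino"),
  health    = c("saude", "hospital*")
)

# Invented, already-tokenised manifesto
toy_tokens <- c("escolas", "ensino", "hospital", "rua", "escola", "praca")

# Count the tokens that match at least one glob pattern in a domain
count_domain <- function(patterns, tokens) {
  regexes <- utils::glob2rx(patterns)   # "escola*" becomes "^escola"
  matched <- vapply(
    tokens,
    function(tok) any(sapply(regexes, grepl, x = tok)),
    logical(1)
  )
  sum(matched)
}

counts   <- vapply(toy_dict, count_domain, numeric(1), tokens = toy_tokens)
salience <- counts / sum(counts)   # share of classified words per domain
salience
# education = 0.75, health = 0.25
```

The denominator counts only tokens that matched some domain, so "rua" and "praca" do not dilute the shares. A manifesto with few dictionary hits can therefore show large salience swings from a handful of words.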

policy_dict <- dictionary(list(
  education = c(
    "educação", "educacao", "escola*", "escolar*", "ensino", "professor*",
    "aluno*", "estudant*", "creche*", "infantil", "fundamental",
    "pedagóg*", "pedagogic*", "aprendizagem", "alfabetização",
    "alfabetizacao", "universidade", "faculdade", "merenda",
    "biblioteca*", "letiv*", "curricul*"
  ),
  health = c(
    "saúde", "saude", "hospital*", "ubs", "upa", "médic*", "medic*",
    "enferm*", "vacina*", "atendimento", "emergência", "emergencia",
    "ambulância", "ambulancia", "sus", "farmácia", "farmacia",
    "odontológ*", "odontolog*", "mental", "psicológ*", "psicolog*",
    "terapêut*", "terapia", "maternidade"
  ),
  security = c(
    "segurança", "seguranca", "polícia*", "policia*", "guarda",
    "vigilância", "vigilancia", "câmera*", "camera*", "iluminação",
    "iluminacao", "violência", "violencia", "crime*", "criminal*",
    "tráfico", "trafico", "droga*", "patrulha*", "ronda*"
  ),
  economy = c(
    "emprego*", "trabalho", "econom*", "renda", "empreend*",
    "empresa*", "comércio", "comercio", "indústria", "industria",
    "turismo", "agricultur*", "cooperativ*", "microcrédito",
    "microcredito", "capacitação", "capacitacao", "desenvolviment*",
    "investiment*", "fiscal", "orçament*", "orcament*"
  ),
  social_policy = c(
    "assistência social", "assistencia social", "vulnerab*",
    "pobreza", "benefício", "beneficio", "cras", "creas",
    "idoso*", "criança*", "crianca*", "adolescente*", "juventude",
    "igualdade", "inclusão", "inclusao", "acessibilidade",
    "deficien*", "habitação", "habitacao", "moradia"
  ),
  environment = c(
    "ambiental*", "sustentab*", "ecológ*", "ecolog*",
    "reciclagem", "resíduo*", "residuo*", "lixo", "saneamento",
    "esgoto", "desmatamento", "reflorestamento",
    "poluição", "poluicao", "clima*", "energia solar",
    "preservação", "preservacao", "parque*"
  ),
  infrastructure = c(
    "infraestrutura", "pavimentação", "pavimentacao", "asfalto",
    "transporte", "ônibus", "onibus", "mobilidade", "trânsito",
    "transito", "estrada*", "ponte*", "construção",
    "construcao", "urbaniz*", "drenagem", "calçada*", "calcada*",
    "ciclovia*"
  ),
  lgbtq_rights = c(
    "lgbt*", "diversidade sexual", "orientação sexual",
    "orientacao sexual", "identidade de gênero", "identidade de genero",
    "homofob*", "transfob*", "travesti*", "transgêner*", "transgenero*",
    "lésbica*", "lesbica*", "bisexual*", "queer",
    # Note: the usual Portuguese spelling "bissexual*" is not matched by "bisexual*"
    "nome social"
  ),
  culture = c(
    "cultur*", "arte*", "artístic*", "artistic*", "museu*",
    "teatro*", "cinema", "festival*", "patrimônio", "patrimonio",
    "esporte*", "esportiv*", "lazer", "recreação", "recreacao"
  ),
  transparency = c(
    "transparência", "transparencia", "participação popular",
    "participacao popular", "conselho*", "audiência pública",
    "audiencia publica", "orçamento participativo",
    "orcamento participativo", "fiscalização", "fiscalizacao",
    "prestação de contas", "prestacao de contas",
    "ouvidoria", "dados abertos"
  )
))

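# Apply dictionary to the unstemmed DFM (dictionary entries use glob patterns).
# Caveat: dfm_lookup() matches single-token features only, so multi-word entries
# (e.g. "nome social") cannot match here; capturing them would require
# tokens_lookup() on the tokens object before building the DFM.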
dfm_dict <- dfm_lookup(dfm_raw, dictionary = policy_dict)

# Convert to proportions (salience = share of classified words per domain)
dfm_dict_prop <- dfm_weight(dfm_dict, scheme = "prop")

# Convert to tidy data frame
dict_results <- convert(dfm_dict_prop, to = "data.frame") %>%
  pivot_longer(-doc_id, names_to = "policy_domain", values_to = "salience") %>%
  mutate(doc_id = as.numeric(doc_id)) %>%
  left_join(
    mayors %>% select(candidate_id, lgbtq_candidate, lgbtq_label,
                      lgbt_category, ideology_category, region, state_abbrev),
    by = c("doc_id" = "candidate_id")
  )

# Clean domain labels for display
domain_labels <- c(
  education      = "Education",
  health         = "Health",
  security       = "Security",
  economy        = "Economy",
  social_policy  = "Social Policy",
  environment    = "Environment",
  infrastructure = "Infrastructure",
  lgbtq_rights   = "LGBTQ+ Rights",
  culture        = "Culture",
  transparency   = "Transparency"
)

dict_results <- dict_results %>%
  mutate(domain_label = domain_labels[policy_domain])

5.1 Salience Comparison

salience_by_group <- dict_results %>%
  group_by(domain_label, lgbtq_label) %>%
  summarise(mean_sal = mean(salience, na.rm = TRUE), .groups = "drop") %>%
  pivot_wider(names_from = lgbtq_label, values_from = mean_sal) %>%
  mutate(Difference = `LGBTQ+` - `Non-LGBTQ+`)

# Wilcoxon tests per domain
wilcox_p <- dict_results %>%
  group_by(domain_label) %>%
  summarise(
    p_value = wilcox.test(salience ~ lgbtq_label)$p.value,
    .groups = "drop"
  ) %>%
  mutate(p_fmt = sapply(p_value, format_p))

salience_table <- salience_by_group %>%
  left_join(wilcox_p %>% select(domain_label, p_fmt), by = "domain_label") %>%
  arrange(desc(abs(Difference))) %>%
  mutate(
    `LGBTQ+`      = format_pct(`LGBTQ+`),
    `Non-LGBTQ+`  = format_pct(`Non-LGBTQ+`),
    Difference     = sprintf("%+.1f pp", Difference * 100)
  ) %>%
  rename(Domain = domain_label, `Wilcoxon p` = p_fmt)

salience_table %>% kable(align = c("l", "r", "r", "r", "r"))
save_table(salience_table, "07_salience_comparison.csv")

Table 4: Mean Policy Salience by Domain and LGBTQ+ Status

Domain	LGBTQ+	Non-LGBTQ+	Difference	Wilcoxon p
Economy	16.0%	18.3%	-2.3 pp	0.082
Health	14.5%	16.4%	-1.9 pp	0.040
LGBTQ+ Rights	1.3%	0.1%	+1.2 pp	< 0.001
Environment	6.0%	5.5%	+0.5 pp	0.196
Transparency	2.0%	1.6%	+0.4 pp	0.014
Education	18.4%	18.6%	-0.2 pp	0.776
Social Policy	8.2%	8.0%	+0.2 pp	0.973
Infrastructure	10.7%	10.9%	-0.2 pp	0.720
Culture	13.6%	13.4%	+0.1 pp	0.514
Security	6.6%	6.5%	+0.1 pp	0.447

5.2 Salience Profile

salience_plot_data <- dict_results %>%
  group_by(domain_label, lgbtq_label) %>%
  summarise(mean_sal = mean(salience, na.rm = TRUE), .groups = "drop")

# Order domains by difference
domain_order <- salience_plot_data %>%
  pivot_wider(names_from = lgbtq_label, values_from = mean_sal) %>%
  mutate(diff = `LGBTQ+` - `Non-LGBTQ+`) %>%
  arrange(diff) %>%
  pull(domain_label)

salience_plot_data <- salience_plot_data %>%
  mutate(domain_label = factor(domain_label, levels = domain_order))

ggplot(salience_plot_data, aes(x = mean_sal, y = domain_label,
                               color = lgbtq_label)) +
  geom_line(aes(group = domain_label), color = "grey70", linewidth = 0.8) +
  geom_point(size = 3.5) +
  scale_x_continuous(labels = label_percent()) +
  scale_color_manual(values = pal_lgbtq) +
  labs(
    x        = "Mean Salience (proportion of classified words)",
    y        = NULL,
    color    = NULL,
    title    = "Policy Salience Profiles",
    subtitle = "Ordered by difference (LGBTQ+ minus Non-LGBTQ+)"
  )

save_figure(last_plot(), "07_salience_profile", height = 8)

Figure 7: Policy Salience Profiles: LGBTQ+ vs Non-LGBTQ+ Candidates

5.3 Salience by Identity Category

identity_salience <- dict_results %>%
  filter(lgbtq_candidate, !is.na(lgbt_category),
         lgbt_category != "Other LGBTQ+") %>%
  group_by(lgbt_category, domain_label) %>%
  summarise(mean_sal = mean(salience, na.rm = TRUE), .groups = "drop")

ggplot(identity_salience, aes(x = domain_label, y = lgbt_category,
                               fill = mean_sal)) +
  geom_tile(color = "white", linewidth = 0.5) +
  geom_text(aes(label = sprintf("%.1f%%", mean_sal * 100)),
            size = 3, color = "white") +
  scale_fill_gradient(low = "#3498DB", high = "#E74C3C",
                      labels = label_percent()) +
  labs(
    x     = NULL,
    y     = NULL,
    fill  = "Salience",
    title = "Policy Emphasis by Identity Category",
    subtitle = "Mean proportion of dictionary words per domain"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

save_figure(last_plot(), "07_salience_identity_heatmap")

Figure 8: Policy Salience Heatmap by LGBTQ+ Identity Category

5.4 Salience by Ideology

ideology_salience <- dict_results %>%
  filter(lgbtq_candidate, !is.na(ideology_category)) %>%
  group_by(ideology_category, domain_label) %>%
  summarise(mean_sal = mean(salience, na.rm = TRUE), .groups = "drop")

ggplot(ideology_salience, aes(x = mean_sal, y = domain_label,
                               fill = ideology_category)) +
  geom_col(position = "dodge", alpha = 0.85) +
  scale_x_continuous(labels = label_percent()) +
  scale_fill_manual(values = pal_ideology) +
  labs(
    x        = "Mean Salience",
    y        = NULL,
    fill     = "Ideology",
    title    = "LGBTQ+ Candidates: Policy Emphasis by Ideology",
    subtitle = "How left-right positioning shapes manifesto content"
  )

save_figure(last_plot(), "07_salience_ideology", height = 8)

Figure 9: Policy Salience by Ideology Among LGBTQ+ Candidates

Dictionary Limitations

Dictionary methods capture only explicit mentions of policy keywords. A candidate may address a topic using indirect language, metaphors, or euphemisms that the dictionary does not capture. Additionally, some words belong to multiple domains (e.g., “comunidade” could be social policy or infrastructure). These results indicate explicit policy emphasis, not comprehensive measures of issue attention.

6 Structural Topic Model

The Structural Topic Model (STM) identifies latent topics in the corpus and estimates whether LGBTQ+ status predicts different topic emphasis, controlling for ideology and region.

# Prepare DFM for STM: remove empty documents after trimming
dfm_stm <- dfm_subset(dfm_trimmed, ntoken(dfm_trimmed) > 0)

# Align metadata with DFM documents
stm_meta <- mayors[match(docnames(dfm_stm), as.character(mayors$candidate_id)), ] %>%
  mutate(
    lgbtq = as.integer(lgbtq_candidate),
    ideology_num = case_when(
      ideology_category == "Left"   ~ -1,
      ideology_category == "Center" ~  0,
      ideology_category == "Right"  ~  1,
      TRUE ~ NA_real_
    )
  )

# Drop rows with NA ideology or region (STM cannot handle NAs in prevalence covariates)
complete_idx <- !is.na(stm_meta$ideology_num) & !is.na(stm_meta$region)
dfm_stm <- dfm_subset(dfm_stm, complete_idx)
stm_meta <- stm_meta[complete_idx, ]

# Convert to STM format
stm_converted <- convert(dfm_stm, to = "stm")
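
The alignment step matters because the model assumes that the i-th metadata row describes the i-th document, and trimming can drop or reorder documents. The sketch below shows the same `match()` idiom on an invented metadata table.

```r
# Invented metadata table and a document order that differs from it
meta <- data.frame(
  candidate_id = c(101, 102, 103, 104),
  region       = c("North", "South", "Southeast", "Northeast")
)
doc_ids <- c("103", "101", "104")   # document names are character strings

# match() returns, for each document, the row of `meta` with the same id
row_index    <- match(doc_ids, as.character(meta$candidate_id))
aligned_meta <- meta[row_index, ]
aligned_meta$region
# "Southeast" "North" "Northeast"
```

Candidate 102 has no document and is dropped, and the remaining rows follow the document order. An id missing from the metadata would yield `NA` in `row_index`, which is worth checking before fitting.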

6.1 Selecting the Number of Topics

set.seed(2024)
k_search <- searchK(
  documents = stm_converted$documents,
  vocab     = stm_converted$vocab,
  K         = c(10, 15, 20, 25),
  prevalence = ~ lgbtq + ideology_num + region,
  data      = stm_meta,
  init.type = "Spectral",
  cores     = 1
)

# Extract searchK results into a data frame
sk_results <- data.frame(
  K            = unlist(k_search$results$K),
  exclus       = unlist(k_search$results$exclus),
  semcoh       = unlist(k_search$results$semcoh),
  heldout      = unlist(k_search$results$heldout),
  residual     = unlist(k_search$results$residual)
)

p1 <- ggplot(sk_results, aes(x = K, y = semcoh)) +
  geom_line(linewidth = 1) + geom_point(size = 3) +
  labs(x = "K", y = "Semantic Coherence", title = "Semantic Coherence")

p2 <- ggplot(sk_results, aes(x = K, y = exclus)) +
  geom_line(linewidth = 1) + geom_point(size = 3) +
  labs(x = "K", y = "Exclusivity", title = "Exclusivity")

p3 <- ggplot(sk_results, aes(x = K, y = heldout)) +
  geom_line(linewidth = 1) + geom_point(size = 3) +
  labs(x = "K", y = "Held-Out Likelihood", title = "Held-Out Likelihood")

p4 <- ggplot(sk_results, aes(x = K, y = residual)) +
  geom_line(linewidth = 1) + geom_point(size = 3) +
  labs(x = "K", y = "Residuals", title = "Residual Dispersion")

(p1 + p2) / (p3 + p4) +
  plot_annotation(
    title    = "STM Model Selection",
    subtitle = "Diagnostics across different numbers of topics (K)"
  )

save_figure(last_plot(), "07_stm_searchk", height = 8)

Figure 10: STM Model Selection Diagnostics Across K Values

Selecting K

We evaluate models with K = 10, 15, 20, and 25 topics using four diagnostics: semantic coherence (do top words co-occur?), exclusivity (are top words unique to topics?), held-out likelihood (predictive fit), and residual dispersion. We select K = 15 as a balance between coherence and exclusivity, favoring interpretability for a descriptive analysis.

6.2 Fit Selected Model

K_selected <- 15

set.seed(2024)
stm_fit <- stm(
  documents = stm_converted$documents,
  vocab     = stm_converted$vocab,
  K         = K_selected,
  prevalence = ~ lgbtq + ideology_num + region,
  data      = stm_meta,
  init.type = "Spectral",
  max.em.its = 150,
  verbose    = FALSE
)

6.3 Topic Labels

topic_labels <- labelTopics(stm_fit, n = 7)

# Interpretive labels based on top FREX and probability words
topic_names <- c(
  "Public Incentives & Guarantees",
  "Women's Centres & Education",
  "Rural Development & Traditions",
  "OCR Artefacts / Formatting",
  "Social Assistance & Services",
  "Party Coalitions & Governance",
  "Diversity, Rights & Inclusion",
  "Sustainable Development",
  "Infrastructure & Road Works",
  "Fiscal Criticism & Budgets",
  "Health & Social Service Expansion",
  "Urban Infrastructure & Transport",
  "Transparency & Public Admin",
  "Tourism, Agriculture & Culture",
  "Governance Principles & Policy"
)

topic_descriptions <- c(
  "Tax incentives, public guarantees, and economic development programmes",
  "Community centres for women, educational programmes, and gender-focused services (includes city-specific OCR terms)",
  "Rural zones, traditional cultural events (e.g., vaquejada), agricultural slaughterhouses, and school uniforms",
  "Residual topic capturing formatting artefacts from PDF extraction (stop-words, punctuation with control characters)",
  "Social assistance networks, socio-educational programmes, and public service delivery",
  "Electoral coalitions, party alliances (PSD, MDB, PDT, PSB), and coalition governance language",
  "Racial equality, anti-racism, indigenous rights, LGBTQI+ inclusion, and left-party (PSOL) social justice framing",
  "Well-being, sustainability, citizen engagement, and long-term development challenges",
  "Road construction materials (gravel, limestone), truck logistics, and rural infrastructure (Paraná-area terms)",
  "Budgetary criticism, fiscal figures (millions/billions), and opposition-style problem framing",
  "Expansion and strengthening of multi-professional health teams and social assistance networks",
  "Urban projects: neighbourhoods, overpasses, airports, waterfront/nautical facilities, and road networks",
  "Auditing, evaluation reports, accountability criteria, and public resource management",
  "Tourism promotion, agricultural incentives, environmental preservation, and cultural events",
  "Legal principles, policy alignment across government spheres, and normative governance language"
)

topic_table <- tibble(
  Topic       = 1:K_selected,
  Label       = topic_names,
  Description = topic_descriptions,
  `Top FREX Words`  = apply(topic_labels$frex, 1, paste, collapse = ", "),
  `Top Prob. Words` = apply(topic_labels$prob, 1, paste, collapse = ", ")
)

topic_table %>%
  select(Topic, Label, Description, `Top FREX Words`) %>%
  kable(align = c("r", "l", "l", "l"))
save_table(topic_table, "07_stm_topic_labels.csv")

Table 5: STM Topics: Interpretive Labels, Top Words, and Descriptions

Topic	Label	Description	Top FREX Words
1	Public Incentives & Guarantees	Tax incentives, public guarantees, and economic development programmes	ent, garan, ant, vas, incen, nci, vidad
2	Women's Centres & Education	Community centres for women, educational programmes, and gender-focused services (includes city-specific OCR terms)	pág, aracaju, identiﬁc, olind, eletron, una, educaça
3	Rural Development & Traditions	Rural zones, traditional cultural events (e.g., vaquejada), agricultural slaughterhouses, and school uniforms	povo, festej, piau, molh, matadour, vaquej, fardament
4	OCR Artefacts / Formatting	Residual topic capturing formatting artefacts from PDF extraction (stop-words, punctuation with control characters)	de‬, ,‬‭, para‬, da‬, com‬, do‬, que‬
5	Social Assistance & Services	Social assistance networks, socio-educational programmes, and public service delivery	potencializ, relacion, paut, fornecedor, insum, socioeduc, assistenc
6	Party Coalitions & Governance	Electoral coalitions, party alliances (PSD, MDB, PDT, PSB), and coalition governance language	psd, prd, pdt, mdb, psb, págin, colig
7	Diversity, Rights & Inclusion	Racial equality, anti-racism, indigenous rights, LGBTQI+ inclusion, and left-party (PSOL) social justice framing	negr, racial, racism, indígen, solidár, psol, lgbtqi
8	Sustainable Development	Well-being, sustainability, citizen engagement, and long-term development challenges	essencial, crucial, bem-est, desafi, essenc, signific, diss
9	Infrastructure & Road Works	Road construction materials (gravel, limestone), truck logistics, and rural infrastructure (Paraná-area terms)	rodei, paran, cascalh, calcári, boch, caminhã, plantã
10	Fiscal Criticism & Budgets	Budgetary criticism, fiscal figures (millions/billions), and opposition-style problem framing	pior, milhõ, contrári, bilhõ, figur, exist, fic
11	Health & Social Service Expansion	Expansion and strengthening of multi-professional health teams and social assistance networks	mant, continu, multiprofissional, ofert, intensific, socioassistenc, oportuniz
12	Urban Infrastructure & Transport	Urban projects: neighbourhoods, overpasses, airports, waterfront/nautical facilities, and road networks	jardim, náutic, aeroport, andré, viári, estânc, viadut
13	Transparency & Public Admin	Auditing, evaluation reports, accountability criteria, and public resource management	avali, auditor, relatóri, identific, critéri, verific, descriçã
14	Tourism, Agriculture & Culture	Tourism promotion, agricultural incentives, environmental preservation, and cultural events	oferec, loc, turíst, agrícol, conscientiz, preserv, event
15	Governance Principles & Policy	Legal principles, policy alignment across government spheres, and normative governance language	princípi, conson, atos, esfer, ser, munícip, trac

Reading the Topic Labels

Labels were assigned by the authors based on the highest-probability and highest-FREX (frequent and exclusive) Portuguese stems for each topic. Topic 4 is a residual topic that captures OCR formatting artefacts from PDF extraction and should be disregarded in substantive interpretation. Some other topics (for example Topics 1 and 2) also show truncated stems or OCR fragments among their top FREX words, so their labels are more tentative. Topic 7 (“Diversity, Rights & Inclusion”) is the topic most directly related to LGBTQ+ and minority-rights discourse.
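
To make the FREX ranking concrete, the following minimal sketch computes a simplified FREX score on a toy topic-word matrix. The matrix `beta`, the words `a` to `d`, and the helper `frex_score()` are hypothetical. The weight `w = 0.5` is assumed to mirror the `frexweight` default of `stm::labelTopics()`, and the calculation is a simplification rather than the package's internal code.

```r
# Toy topic-word probabilities (rows = topics, columns = words); hypothetical values
beta <- rbind(
  topic1 = c(a = 0.4, b = 0.3, c = 0.2, d = 0.1),
  topic2 = c(a = 0.4, b = 0.1, c = 0.1, d = 0.4)
)

# Hypothetical helper: simplified FREX = weighted harmonic mean of the
# within-topic empirical CDF values of exclusivity and frequency
frex_score <- function(beta, w = 0.5) {
  # Exclusivity: share of each word's probability mass that belongs to each topic
  excl <- sweep(beta, 2, colSums(beta), "/")
  # Within each topic, convert a quantity to empirical CDF values
  ecdf_rows <- function(m) t(apply(m, 1, function(x) ecdf(x)(x)))
  out <- 1 / (w / ecdf_rows(excl) + (1 - w) / ecdf_rows(beta))
  dimnames(out) <- dimnames(beta)
  out
}

frex <- frex_score(beta)
top_prob <- colnames(beta)[apply(beta, 1, which.max)]  # most probable word per topic
top_frex <- colnames(frex)[apply(frex, 1, which.max)]  # top FREX word per topic
```

In this toy case, word `a` is the most probable word in both topics but is not the top FREX word in either, because its probability mass is shared across topics. This is why the FREX column in Table 5 tends to surface more topic-specific stems than a probability ranking would.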

6.4 Topic Proportions

Show code

topic_props <- tibble(
  Topic      = 1:K_selected,
  Label      = paste0(1:K_selected, ". ", topic_names),
  Proportion = colMeans(stm_fit$theta)
)

ggplot(topic_props, aes(x = reorder(Label, Proportion), y = Proportion)) +
  geom_col(fill = "#3498DB", alpha = 0.8) +
  coord_flip() +
  scale_y_continuous(labels = label_percent()) +
  labs(
    x        = NULL,
    y        = "Expected Proportion",
    title    = "Overall Topic Prevalence",
    subtitle = paste0("STM with K = ", K_selected, " topics")
  ) +
  theme(axis.text.y = element_text(size = 9))

save_figure(last_plot(), "07_stm_proportions", height = 8)

Figure 11: Expected Topic Proportions Across All Manifestos

6.5 LGBTQ+ Effect on Topic Prevalence

Show code

stm_effect <- estimateEffect(
  formula    = 1:K_selected ~ lgbtq + ideology_num + region,
  stmobj     = stm_fit,
  metadata   = stm_meta,
  uncertainty = "Global"
)

Show code

# Extract LGBTQ+ coefficient for each topic
effect_summary <- map_dfr(1:K_selected, function(k) {
  s <- summary(stm_effect, topics = k)
  coef_table <- s$tables[[1]]
  lgbtq_row <- coef_table["lgbtq", ]
  tibble(
    Topic    = k,
    Label    = topic_names[k],
    Estimate = lgbtq_row["Estimate"],
    SE       = lgbtq_row["Std. Error"],
    p_value  = lgbtq_row["Pr(>|t|)"]
  )
}) %>%
  arrange(desc(abs(Estimate)))

effect_display <- effect_summary %>%
  mutate(
    Estimate = sprintf("%+.4f", Estimate),
    SE       = sprintf("%.4f", SE),
    `p-value` = sapply(p_value, format_p)
  ) %>%
  select(Topic, Label, Estimate, SE, `p-value`)

effect_display %>% kable(align = c("r", "l", "r", "r", "r"))
save_table(effect_display, "07_stm_lgbtq_effect.csv")

Table 6: Effect of LGBTQ+ Status on Topic Prevalence

Topic	Label	Estimate	SE	p-value
7	Diversity, Rights & Inclusion	+0.1148	0.0249	< 0.001
10	Fiscal Criticism & Budgets	+0.0921	0.0247	< 0.001
9	Infrastructure & Road Works	-0.0765	0.0268	0.004
11	Health & Social Service Expansion	-0.0580	0.0211	0.006
14	Tourism, Agriculture & Culture	-0.0329	0.0223	0.140
5	Social Assistance & Services	-0.0195	0.0245	0.426
2	Women's Centres & Education	+0.0151	0.0197	0.443
3	Rural Development & Traditions	-0.0135	0.0227	0.552
8	Sustainable Development	+0.0086	0.0237	0.717
13	Transparency & Public Admin	-0.0063	0.0177	0.720
1	Public Incentives & Guarantees	-0.0061	0.0117	0.601
6	Party Coalitions & Governance	-0.0059	0.0175	0.737
15	Governance Principles & Policy	-0.0058	0.0197	0.769
4	OCR Artefacts / Formatting	-0.0026	0.0068	0.700
12	Urban Infrastructure & Transport	-0.0024	0.0207	0.907

Show code

effect_plot_data <- effect_summary %>%
  mutate(
    Topic_Label = paste0(Topic, ". ", Label),
    direction   = if_else(Estimate > 0, "LGBTQ+", "Non-LGBTQ+"),
    lower       = Estimate - 1.96 * SE,
    upper       = Estimate + 1.96 * SE
  )

ggplot(effect_plot_data, aes(x = Estimate, y = reorder(Topic_Label, Estimate),
                              color = direction)) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "grey50") +
  geom_errorbarh(aes(xmin = lower, xmax = upper), height = 0.3, linewidth = 0.7) +
  geom_point(size = 3) +
  scale_color_manual(values = pal_lgbtq) +
  labs(
    x        = "LGBTQ+ Coefficient (change in topic proportion)",
    y        = NULL,
    color    = "Direction",
    title    = "LGBTQ+ Effect on Topic Prevalence",
    subtitle = "Point estimates with 95% confidence intervals",
    caption  = "Controlling for ideology and region. Positive = more prevalent among LGBTQ+ candidates."
  ) +
  theme(axis.text.y = element_text(size = 9))

save_figure(last_plot(), "07_stm_lgbtq_effect", height = 8)

Figure 12: LGBTQ+ Effect on Topic Prevalence (Coefficient Plot)

6.6 Topic Correlation Heatmap

Show code

# Compute topic correlations from theta (document-topic proportions)
theta <- stm_fit$theta
colnames(theta) <- paste0(1:ncol(theta), ". ", topic_names)
cor_mat <- cor(theta)

# Melt to long format, keeping the upper triangle only.
# cor_df is kept for inspection; the heatmap below plots the full symmetric matrix.
cor_df <- as.data.frame(as.table(cor_mat)) %>%
  rename(Topic1 = Var1, Topic2 = Var2, Correlation = Freq) %>%
  filter(as.integer(Topic1) < as.integer(Topic2))

ggplot(as.data.frame(as.table(cor_mat)),
       aes(x = Var1, y = Var2, fill = Freq)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "#2166AC", mid = "white", high = "#B2182B",
                       midpoint = 0, limits = c(-1, 1), name = "Correlation") +
  theme_descriptive +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
        axis.text.y = element_text(size = 8)) +
  labs(x = NULL, y = NULL,
       title = "Topic Correlation Heatmap",
       subtitle = "Pearson correlations between document-topic proportions")

save_figure(last_plot(), "07_stm_topic_correlation")

Figure 13: Pairwise Topic Correlations (Pearson)

STM Interpretation Caveats

The structural topic model identifies latent themes and estimates how LGBTQ+ status correlates with topic emphasis, controlling for ideology and region. However, with only 37 LGBTQ+ documents in a corpus of 14,771, the LGBTQ+ prevalence coefficients are imprecise: the standard errors in Table 6 range from about 0.007 to 0.027 on the topic-proportion scale, which is large relative to most of the estimates. These results are best understood as suggestive patterns for future research with larger LGBTQ+ samples, not as definitive evidence of differential policy emphasis. In addition, no multiple-comparison correction is applied across the 15 topics, so the four coefficients with unadjusted p-values below 0.01 (Topics 7, 10, 9, and 11) should be read with that in mind.
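
As a rough illustration of what a multiple-comparison correction would do here, the sketch below rebuilds approximate p-values from the rounded estimates and standard errors in Table 6 and adjusts them with base R's `p.adjust()`. The normal approximation and the data frame `tab6` are assumptions made for illustration; the original model output would give slightly different values.

```r
# Estimates and standard errors copied from Table 6 (rounded to 4 decimals)
tab6 <- data.frame(
  topic    = c(7, 10, 9, 11, 14, 5, 2, 3, 8, 13, 1, 6, 15, 4, 12),
  estimate = c(0.1148, 0.0921, -0.0765, -0.0580, -0.0329, -0.0195, 0.0151,
               -0.0135, 0.0086, -0.0063, -0.0061, -0.0059, -0.0058,
               -0.0026, -0.0024),
  se       = c(0.0249, 0.0247, 0.0268, 0.0211, 0.0223, 0.0245, 0.0197,
               0.0227, 0.0237, 0.0177, 0.0117, 0.0175, 0.0197,
               0.0068, 0.0207)
)

# Approximate two-sided p-values from a normal reference distribution (assumption)
tab6$p_raw  <- 2 * pnorm(-abs(tab6$estimate / tab6$se))

# Holm controls the family-wise error rate; BH controls the false discovery rate
tab6$p_holm <- p.adjust(tab6$p_raw, method = "holm")
tab6$p_bh   <- p.adjust(tab6$p_raw, method = "BH")

tab6[order(tab6$p_raw), c("topic", "p_raw", "p_holm", "p_bh")]
```

Under this approximation, Topics 7 and 10 stay below 0.05 after the Holm adjustment, while Topics 9 and 11 do so only under the less conservative Benjamini-Hochberg adjustment. Which correction is appropriate depends on whether the analysis is framed as confirmatory or exploratory.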

7 LGBTQ+ Rights Keywords

Do LGBTQ+ candidates explicitly mention LGBTQ+ rights issues in their manifestos? We search for a targeted set of keywords.

Show code

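# Define LGBTQ+ keywords for targeted search
# Note: quanteda matches a pattern containing a space against single tokens
# unless it is wrapped in phrase(), so the multi-word terms below may not match
# as written. The Portuguese spelling "bissexual" is also not covered by "bisexual*".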
lgbtq_terms <- c(
  "lgbt*", "lgbtq*", "diversidade sexual", "orientação sexual",
  "orientacao sexual", "identidade de gênero", "identidade de genero",
  "homofob*", "transfob*", "travesti*", "transgêner*", "transgenero*",
  "lésbica*", "lesbica*", "bisexual*", "queer", "nome social",
  "orgulho", "arco-íris", "arco-iris"
)

# Search in unstemmed tokens
lgbtq_kwic <- tokens_select(toks, pattern = lgbtq_terms, valuetype = "glob",
                             selection = "keep")

# Count per document
lgbtq_counts <- ntoken(lgbtq_kwic) %>%
  tibble(candidate_id = as.numeric(names(.)), lgbtq_mentions = .) %>%
  mutate(has_lgbtq_mention = lgbtq_mentions > 0)

# Join to mayors
mayors <- mayors %>%
  left_join(lgbtq_counts, by = "candidate_id") %>%
  mutate(
    lgbtq_mentions = replace_na(lgbtq_mentions, 0L),
    has_lgbtq_mention = replace_na(has_lgbtq_mention, FALSE)
  )

Show code

keyword_summary <- mayors %>%
  group_by(lgbtq_label) %>%
  summarise(
    N                  = n(),
    `N with mentions`  = sum(has_lgbtq_mention),
    `% with mentions`  = format_pct(mean(has_lgbtq_mention)),
    `Mean mentions`    = round(mean(lgbtq_mentions), 2),
    `Median mentions`  = median(lgbtq_mentions),
    .groups = "drop"
  ) %>%
  rename(Group = lgbtq_label)

keyword_summary %>% kable(align = c("l", "r", "r", "r", "r", "r"))
save_table(keyword_summary, "07_lgbtq_keywords.csv")

Table 7: LGBTQ+ Keyword Mentions in Manifestos

Group	N	N with mentions	% with mentions	Mean mentions	Median mentions
LGBTQ+	37	25	67.6%	7.19	2
Non-LGBTQ+	14734	3601	24.4%	0.67	0

Show code

# Fisher's exact test
fisher_tbl <- table(mayors$lgbtq_label, mayors$has_lgbtq_mention)
fisher_result <- fisher.test(fisher_tbl)
fisher_p <- format_p(fisher_result$p.value)

keyword_plot <- mayors %>%
  group_by(lgbtq_label) %>%
  summarise(pct = mean(has_lgbtq_mention), .groups = "drop")

ggplot(keyword_plot, aes(x = lgbtq_label, y = pct, fill = lgbtq_label)) +
  geom_col(alpha = 0.85, width = 0.6) +
  scale_y_continuous(labels = label_percent()) +
  scale_fill_manual(values = pal_lgbtq) +
  labs(
    x        = NULL,
    y        = "Proportion mentioning LGBTQ+ terms",
    fill     = NULL,
    title    = "LGBTQ+ Rights Keywords in Manifestos",
    subtitle = paste0("Fisher's exact test p ", fisher_p),
    caption  = "Keywords: lgbt*, diversidade sexual, homofob*, transfob*, nome social, etc."
  ) +
  guides(fill = "none")

save_figure(last_plot(), "07_lgbtq_keyword_prevalence")

Figure 14: Proportion of Manifestos Mentioning LGBTQ+ Keywords
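
The group comparison behind Figure 14 can be checked directly from the counts in Table 7. The sketch below rebuilds the 2 x 2 table and runs `fisher.test()` on it; the object names `kw`, `ft`, and `sample_or` are introduced here for illustration only.

```r
# 2 x 2 table built from the counts in Table 7
kw <- matrix(
  c(25,   37 - 25,        # LGBTQ+: with mentions, without mentions
    3601, 14734 - 3601),  # Non-LGBTQ+: with mentions, without mentions
  nrow = 2, byrow = TRUE,
  dimnames = list(group   = c("LGBTQ+", "Non-LGBTQ+"),
                  mention = c("yes", "no"))
)

ft <- fisher.test(kw)

# Simple sample odds ratio; fisher.test() reports a conditional estimate instead
sample_or <- (kw[1, 1] / kw[1, 2]) / (kw[2, 1] / kw[2, 2])
```

The sample odds ratio is about 6.4, meaning the odds of at least one keyword mention are roughly six times higher in LGBTQ+ candidates' manifestos. With 37 documents in the smaller group, the confidence interval in `ft$conf.int` is the more informative quantity to report.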

8 Comparative Word Clouds

Show code

library(wordcloud)

dfm_lgbtq <- dfm_subset(dfm_stem, lgbtq_label == "LGBTQ+")
freq_lgbtq <- textstat_frequency(dfm_lgbtq, n = 100)

set.seed(2024)
wordcloud(
  words     = freq_lgbtq$feature,
  freq      = freq_lgbtq$frequency,
  max.words = 100,
  random.order = FALSE,
  rot.per   = 0.15,
  colors    = c("#F5B7B1", "#E74C3C", "#922B21"),
  scale     = c(3, 0.5)
)

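# Caution: wordcloud() draws with base graphics, so last_plot() returns the most
# recent ggplot object rather than this word cloud
save_figure(last_plot(), "07_wordcloud_lgbtq", width = 8, height = 8)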

Figure 15: Word Cloud: LGBTQ+ Candidate Manifestos

Show code

dfm_nonlgbtq <- dfm_subset(dfm_stem, lgbtq_label == "Non-LGBTQ+")
freq_nonlgbtq <- textstat_frequency(dfm_nonlgbtq, n = 100)

set.seed(2024)
wordcloud(
  words     = freq_nonlgbtq$feature,
  freq      = freq_nonlgbtq$frequency,
  max.words = 100,
  random.order = FALSE,
  rot.per   = 0.15,
  colors    = c("#AED6F1", "#3498DB", "#1A5276"),
  scale     = c(3, 0.5)
)

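# Caution: wordcloud() draws with base graphics, so last_plot() returns the most
# recent ggplot object rather than this word cloud
save_figure(last_plot(), "07_wordcloud_nonlgbtq", width = 8, height = 8)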

Figure 16: Word Cloud: Non-LGBTQ+ Candidate Manifestos
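
Because `wordcloud()` draws with base graphics instead of returning a ggplot object, one option for saving such figures is to open a graphics device explicitly. The sketch below shows the pattern with a hypothetical helper, `save_base_plot()`, and a temporary file; it is not the project's `save_figure()` and makes no assumption about that function's arguments or output paths.

```r
# Hypothetical helper: run a base-graphics drawing function inside a PDF device
save_base_plot <- function(draw, file, width = 8, height = 8) {
  pdf(file, width = width, height = height)
  on.exit(dev.off(), add = TRUE)  # always close the device, even on error
  draw()
  invisible(file)
}

out <- save_base_plot(
  draw = function() {
    plot.new()
    text(0.5, 0.5, "manifesto", cex = 3)  # stand-in for a wordcloud() call
  },
  file = tempfile(fileext = ".pdf")
)
```

In practice the `draw` function would wrap the `wordcloud()` call, with `set.seed()` placed inside it so that the saved layout is reproducible.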

9 Bigram Analysis

Single-word counts miss multi-word expressions that carry important meaning in Portuguese political discourse. We examine the most frequent bigrams (two-word sequences) and compare distinctive bigrams across groups.

Show code

# Create bigrams from the unstemmed, stopword-removed tokens
toks_bigram <- tokens_ngrams(toks, n = 2, concatenator = " ")
dfm_bigram  <- dfm(toks_bigram)
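
One consequence of forming bigrams after stopword removal is worth keeping in mind when reading the figures. The base R sketch below uses a hypothetical token vector and assumes that removed stopwords leave no padding behind, so that the remaining tokens become adjacent.

```r
# Hypothetical phrase "plano de governo municipal" after removing the stopword "de"
tokens <- c("plano", "governo", "municipal")

# Pair each token with its successor to form bigrams
bigrams <- paste(head(tokens, -1), tail(tokens, -1))
```

Here `bigrams` contains "plano governo" and "governo municipal". The first pair was never adjacent in the original text, so some frequent bigrams represent longer expressions with the function words dropped.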

Show code

top_bigrams <- textstat_frequency(dfm_bigram, n = 20, groups = lgbtq_label) %>%
  mutate(group = factor(group, levels = c("LGBTQ+", "Non-LGBTQ+")))

ggplot(top_bigrams, aes(x = reorder_within(feature, frequency, group),
                         y = frequency, fill = group)) +
  geom_col(alpha = 0.85, show.legend = FALSE) +
  scale_x_reordered() +
  scale_fill_manual(values = pal_lgbtq) +
  coord_flip() +
  facet_wrap(~ group, scales = "free") +
  labs(
    x     = NULL,
    y     = "Frequency",
    title = "Most Frequent Bigrams in Manifestos",
    subtitle = "Two-word expressions after stopword removal"
  )

save_figure(last_plot(), "07_bigrams_group", height = 9)

Figure 17: Top 20 Bigrams by LGBTQ+ Status

Show code

# Trim rare bigrams and compute keyness
dfm_bigram_trimmed <- dfm_bigram %>%
  dfm_trim(min_docfreq = 5, docfreq_type = "count")

dfm_bigram_grouped <- dfm_group(dfm_bigram_trimmed, groups = lgbtq_label)
bigram_keyness <- textstat_keyness(dfm_bigram_grouped, target = "LGBTQ+",
                                    measure = "chi2")

top_bi_pos <- bigram_keyness %>% slice_max(chi2, n = 15)
top_bi_neg <- bigram_keyness %>% slice_min(chi2, n = 15)
bi_key_data <- bind_rows(top_bi_pos, top_bi_neg) %>%
  mutate(
    direction = if_else(chi2 > 0, "LGBTQ+", "Non-LGBTQ+"),
    feature   = reorder(feature, chi2)
  )

ggplot(bi_key_data, aes(x = chi2, y = feature, fill = direction)) +
  geom_col(alpha = 0.85) +
  geom_vline(xintercept = 0, linewidth = 0.5) +
  scale_fill_manual(values = pal_lgbtq) +
  labs(
    x        = expression(chi^2 ~ "statistic"),
    y        = NULL,
    fill     = "Over-represented in",
    title    = "Distinctive Bigrams by Group",
    subtitle = "Top 15 two-word expressions over-represented in each group"
  )

save_figure(last_plot(), "07_bigram_keyness", height = 9)

Figure 18: Bigram Keyness: Distinctive Two-Word Expressions
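
The signed chi-squared values plotted above can be understood through a single 2 x 2 comparison. The following sketch is a simplified illustration with hypothetical counts and a hypothetical helper, `signed_chi2()`. It omits the continuity correction, so it should not be expected to reproduce `textstat_keyness()` output exactly.

```r
# Hypothetical helper: signed chi-squared keyness for one feature
# a, b:            occurrences of the feature in the target and reference groups
# n_target, n_ref: total feature counts in each group
signed_chi2 <- function(a, b, n_target, n_ref) {
  # Rows = target, reference; columns = this feature, all other features
  obs  <- matrix(c(a, b, n_target - a, n_ref - b), nrow = 2)
  test <- suppressWarnings(chisq.test(obs, correct = FALSE))
  # Positive sign when the feature is over-represented in the target group
  sign_dir <- if (obs[1, 1] > test$expected[1, 1]) 1 else -1
  unname(sign_dir * test$statistic)
}

# Hypothetical counts: one over-represented and one under-represented bigram
over  <- signed_chi2(a = 30, b = 200,  n_target = 10000, n_ref = 1000000)
under <- signed_chi2(a = 1,  b = 2000, n_target = 10000, n_ref = 1000000)
```

Because the LGBTQ+ group contributes far fewer tokens than the reference group, a handful of occurrences in one or two manifestos can produce a large positive statistic. Distinctive bigrams are therefore best checked against the number of documents in which they appear.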

10 Identity-Specific Language

Do Gay, Lesbian, and Trans candidates write different manifestos? We compare keyness within the LGBTQ+ subgroup, using each identity category as the target against all other LGBTQ+ candidates.

Show code

# Subset to LGBTQ+ candidates only
dfm_lgbtq_only <- dfm_subset(dfm_trimmed, lgbtq_label == "LGBTQ+")

# Get identity category for each document
identity_var <- mayors$lgbt_category[match(docnames(dfm_lgbtq_only),
                                           as.character(mayors$candidate_id))]

# Only categories with enough documents (>= 5)
cat_counts <- table(identity_var)
valid_cats <- names(cat_counts[cat_counts >= 5])

Show code

# Compute keyness for each identity category vs. all others
identity_key_list <- map_dfr(valid_cats, function(cat) {
  dfm_grouped_id <- dfm_group(dfm_lgbtq_only,
                                groups = if_else(identity_var == cat, cat, "Other"))
  tryCatch({
    key <- textstat_keyness(dfm_grouped_id, target = cat, measure = "chi2")
    key %>%
      slice_max(chi2, n = 10) %>%
      mutate(category = cat)
  }, error = function(e) tibble())
})

if (nrow(identity_key_list) > 0) {
  identity_key_list <- identity_key_list %>%
    mutate(category = factor(category, levels = names(pal_identity)))

  ggplot(identity_key_list, aes(x = chi2,
                                 y = reorder_within(feature, chi2, category),
                                 fill = category)) +
    geom_col(alpha = 0.85, show.legend = FALSE) +
    scale_y_reordered() +
    scale_fill_manual(values = pal_identity) +
    facet_wrap(~ category, scales = "free", ncol = 2) +
    labs(
      x     = expression(chi^2),
      y     = NULL,
      title = "Distinctive Vocabulary by Identity Category",
      subtitle = "Top 10 words over-represented vs. all other LGBTQ+ candidates"
    )

  save_figure(last_plot(), "07_identity_keyness", height = 10)
}

11 Electoral Outcomes and Manifestos

Do manifesto characteristics differ between winners and losers? We compare elected and non-elected candidates on manifesto length, complexity, and policy emphasis.

Show code

# Add elected status
mayors_elected <- mayors %>%
  filter(!is.na(elected)) %>%
  mutate(outcome = if_else(elected, "Elected", "Not Elected"))

Show code

outcome_stats <- mayors_elected %>%
  group_by(outcome) %>%
  summarise(
    N                = n(),
    `Median Words`   = format_n(median(manifesto_n_words, na.rm = TRUE)),
    `Mean Words`     = format_n(round(mean(manifesto_n_words, na.rm = TRUE))),
    `Median Flesch`  = round(median(flesch, na.rm = TRUE), 1),
    `Median CTTR`    = round(median(cttr, na.rm = TRUE), 3),
    .groups = "drop"
  ) %>%
  rename(Outcome = outcome)

outcome_stats %>% kable(align = c("l", "r", "r", "r", "r", "r"))
save_table(outcome_stats, "07_manifesto_outcome.csv")

Table 8: Manifesto Characteristics by Electoral Outcome

Outcome	N	Median Words	Mean Words	Median Flesch	Median CTTR
Elected	5183	2,814	3,537	-32.0	12.549
Not Elected	9111	2,819	3,877	-26.3	12.694
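
The magnitudes in Table 8 are easier to read with the underlying formulas in view. The sketch below assumes that `flesch` holds the standard Flesch Reading Ease score and that `cttr` holds Carroll's corrected type-token ratio; the source does not show how these columns were computed, and the document counts used here are hypothetical.

```r
# Flesch Reading Ease (formula calibrated on English text)
flesch_reading_ease <- function(n_words, n_sentences, n_syllables) {
  206.835 - 1.015 * (n_words / n_sentences) - 84.6 * (n_syllables / n_words)
}

# Carroll's corrected type-token ratio: types / sqrt(2 * tokens)
cttr <- function(n_types, n_tokens) n_types / sqrt(2 * n_tokens)

# Hypothetical manifesto: 40 words per sentence, 2.4 syllables per word
fre      <- flesch_reading_ease(n_words = 2800, n_sentences = 70,
                                n_syllables = 6720)
cttr_val <- cttr(n_types = 940, n_tokens = 2800)
```

With these inputs the Flesch score is about -36.8 and the CTTR about 12.56. Negative Flesch values are therefore plausible for long-sentence, polysyllabic Portuguese text scored with the English formula, and they are better read as relative comparisons than as absolute readability grades.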

Show code

ggplot(mayors_elected, aes(x = outcome, y = manifesto_n_words,
                            fill = lgbtq_label)) +
  geom_boxplot(alpha = 0.7, outlier.alpha = 0.2) +
  scale_y_log10(labels = label_comma()) +
  scale_fill_manual(values = pal_lgbtq) +
  labs(
    x        = NULL,
    y        = "Word Count (log scale)",
    fill     = NULL,
    title    = "Manifesto Length and Electoral Success",
    subtitle = "Do winners write longer manifestos?"
  )

save_figure(last_plot(), "07_words_outcome")

Figure 19: Manifesto Length by Electoral Outcome and LGBTQ+ Status

Show code

# Compute salience by outcome
dict_with_outcome <- dict_results %>%
  left_join(
    mayors_elected %>% select(candidate_id, outcome),
    by = c("doc_id" = "candidate_id")
  ) %>%
  filter(!is.na(outcome))

salience_outcome <- dict_with_outcome %>%
  group_by(outcome, domain_label) %>%
  summarise(mean_sal = mean(salience, na.rm = TRUE), .groups = "drop")

ggplot(salience_outcome, aes(x = mean_sal, y = domain_label,
                              color = outcome)) +
  geom_line(aes(group = domain_label), color = "grey70", linewidth = 0.8) +
  geom_point(size = 3.5) +
  scale_x_continuous(labels = label_percent()) +
  scale_color_manual(values = c("Elected" = "#2ECC71", "Not Elected" = "#E74C3C")) +
  labs(
    x        = "Mean Salience",
    y        = NULL,
    color    = NULL,
    title    = "Policy Emphasis and Electoral Outcomes",
    subtitle = "Do winning candidates emphasize different policy domains?"
  )

save_figure(last_plot(), "07_salience_outcome", height = 8)

Figure 20: Policy Salience: Elected vs Non-Elected Candidates

Show code

lgbtq_outcome <- mayors_elected %>%
  filter(lgbtq_candidate)

if (nrow(lgbtq_outcome) >= 5) {
  outcome_mention <- lgbtq_outcome %>%
    group_by(outcome) %>%
    summarise(
      n         = n(),
      pct_mention = mean(has_lgbtq_mention),
      mean_mentions = mean(lgbtq_mentions),
      .groups = "drop"
    )

  ggplot(outcome_mention, aes(x = outcome, y = pct_mention, fill = outcome)) +
    geom_col(alpha = 0.85, width = 0.6) +
    geom_text(aes(label = paste0("n=", n)), vjust = -0.5, size = 4) +
    scale_y_continuous(labels = label_percent(), limits = c(0, NA)) +
    scale_fill_manual(values = c("Elected" = "#2ECC71", "Not Elected" = "#E74C3C")) +
    labs(
      x        = NULL,
      y        = "% mentioning LGBTQ+ terms",
      fill     = NULL,
      title    = "Do LGBTQ+ Winners Mention LGBTQ+ Rights?",
      subtitle = "Among LGBTQ+ mayoral candidates only"
    ) +
    guides(fill = "none")

  save_figure(last_plot(), "07_lgbtq_mention_outcome")
}

Electoral Outcome Caveats

Electoral success depends on many factors beyond manifesto content (incumbency, party strength, campaign resources, name recognition). These comparisons do not imply that manifesto characteristics cause electoral outcomes. They simply describe whether winners and losers tend to produce different types of documents.

12 Regional Variation

Brazilian regions vary enormously in political culture, economic development, and LGBTQ+ acceptance. We examine how manifesto content varies across regions, particularly for LGBTQ+ candidates.

Show code

salience_region <- dict_results %>%
  filter(!is.na(region)) %>%
  group_by(region, domain_label) %>%
  summarise(mean_sal = mean(salience, na.rm = TRUE), .groups = "drop")

ggplot(salience_region, aes(x = mean_sal, y = domain_label, color = region)) +
  geom_point(size = 3, alpha = 0.8) +
  scale_x_continuous(labels = label_percent()) +
  scale_color_manual(values = pal_region) +
  labs(
    x        = "Mean Salience",
    y        = NULL,
    color    = "Region",
    title    = "Policy Emphasis by Region",
    subtitle = "Do regions prioritize different policy domains?"
  )

save_figure(last_plot(), "07_salience_region", height = 9)

Figure 21: Policy Salience by Region (All Candidates)

Show code

ggplot(mayors %>% filter(!is.na(region)),
       aes(x = region, y = manifesto_n_words, fill = lgbtq_label)) +
  geom_boxplot(alpha = 0.7, outlier.alpha = 0.2) +
  scale_y_log10(labels = label_comma()) +
  scale_fill_manual(values = pal_lgbtq) +
  labs(
    x        = NULL,
    y        = "Word Count (log scale)",
    fill     = NULL,
    title    = "Manifesto Length by Region",
    subtitle = "LGBTQ+ vs Non-LGBTQ+ candidates across regions"
  ) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

save_figure(last_plot(), "07_words_region")

Figure 22: Manifesto Length by Region and LGBTQ+ Status

Show code

region_keywords <- mayors %>%
  filter(!is.na(region)) %>%
  group_by(region) %>%
  summarise(
    N               = n(),
    `% mentioning`  = format_pct(mean(has_lgbtq_mention)),
    `Mean mentions` = round(mean(lgbtq_mentions), 2),
    .groups = "drop"
  ) %>%
  rename(Region = region)

region_keywords %>% kable(align = c("l", "r", "r", "r"))
save_table(region_keywords, "07_lgbtq_keywords_region.csv")

Table 9: LGBTQ+ Keyword Mentions by Region (All Candidates)

Region	N	% mentioning	Mean mentions
North	1267	23.0%	0.49
Northeast	4395	33.7%	0.90
Center-West	1154	18.9%	0.46
Southeast	4871	23.8%	0.76
South	3084	15.5%	0.43

13 Summary

This chapter examined 14,771 mayoral candidate manifestos through the lens of quantitative text analysis, comparing the 37 LGBTQ+ candidates against the broader population of 14,734 non-LGBTQ+ candidates.

Corpus characteristics: LGBTQ+ candidate manifestos are longer than their non-LGBTQ+ counterparts (median 4,517 vs 2,819 words).

Text complexity: Readability and lexical diversity comparisons provide insight into whether LGBTQ+ candidates write differently at the stylistic level, though interpretation requires caution given the small sample.

Policy emphasis: The custom Portuguese-language dictionary reveals whether LGBTQ+ candidates place differential emphasis on specific policy domains: education, health, security, economy, social policy, environment, infrastructure, LGBTQ+ rights, culture, and transparency.

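Structural topics: The STM identifies latent themes and estimates the marginal effect of LGBTQ+ status on topic prevalence, controlling for ideology and region. LGBTQ+ status is associated with higher prevalence of Topic 7, Diversity, Rights & Inclusion (+0.1148), and Topic 10, Fiscal Criticism & Budgets (+0.0921), and with lower prevalence of Topic 9, Infrastructure & Road Works (-0.0765), and Topic 11, Health & Social Service Expansion (-0.0580). With only 37 LGBTQ+ documents and no multiple-comparison correction, these coefficients carry substantial uncertainty and should be treated as suggestive rather than definitive.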

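LGBTQ+ rights keywords: The targeted keyword search shows that LGBTQ+ candidates are more likely to reference LGBTQ+ rights issues explicitly in their official platform documents: 67.6% of their manifestos contain at least one keyword, against 24.4% of non-LGBTQ+ manifestos (mean mentions 7.19 vs 0.67).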

Bigrams and identity-specific language: Multi-word expression analysis uncovers distinctive two-word phrases, while within-group keyness reveals how Gay, Lesbian, and Trans candidates differ in their policy vocabulary.

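Electoral outcomes: Elected and non-elected candidates produce manifestos of nearly identical median length (2,814 vs 2,819 words), with modest differences in median Flesch score (-32.0 vs -26.3) and median CTTR (12.549 vs 12.694). These comparisons describe how manifesto characteristics correlate with electoral success and do not imply causation.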

Regional variation: LGBTQ+ keyword prevalence varies across Brazil’s five macro-regions, from 15.5% of manifestos in the South to 33.7% in the Northeast.

All findings in this chapter are descriptive. The small LGBTQ+ sample precludes strong inferential claims but reveals patterns that merit investigation as LGBTQ+ political representation grows in future elections.