Political discourse shapes policy, and policy shapes lives. But how do politicians actually speak? What topics dominate their agendas? And do their words reflect the concerns of everyday citizens? Using NLP, we can answer these questions with data, not assumptions.

Introduction

Natural Language Processing (NLP) has become a central pillar of AI, especially with the rise of Large Language Models (LLMs). At its core, an LLM is NLP applied at scale: a model trained on massive amounts of text to generate human-like responses.

In this project, I applied NLP techniques to transcripts from the 14th convocation of the National Assembly of the Republic of Serbia. The goal was to explore political discourse in a time of heightened governmental instability (November 2024 – May 2025). The full dataset and results are available in the GitHub repository.

Project Goals

Analyze who speaks most frequently and how they speak.

Detect recurring topics and rhetorical patterns.

Enrich political discourse with metadata and sentiment trends (ongoing work).

First step - Data collection and cleaning

If you’ve worked with real-world data before, you know the rule: 80% of the time goes to cleaning, 20% to analysis. This project was no exception.

I sourced transcripts directly from the Serbian Parliament’s official website, despite its outdated infrastructure (non-secure HTTP, expired SSL certificates). Downloads had to be manually approved due to browser warnings.

Each parliamentary session is split into several daily transcripts, so I organized them into directories by session. The files were originally in poorly structured .docx format. I converted them to .txt, then used regular expressions (regex) to clean the text and extract individual speeches.

The final cleaned data was stored in JSON format:

  {
    "file": "2024-03-18_konstitutivna_sednica_dan2.txt",
    "speaker": "MARINIKA TEPIĆ",
    "speech": "Morate svima da date reč, a ne samo Aleksiću."
  },
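The extraction step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes each speech in the .txt file starts on a new line with the speaker's name in capital letters followed by a colon, and the first line of the sample transcript is invented for the example. The function name `extract_speeches` is hypothetical.

```python
import json
import re

# Assumed layout (hypothetical): "SPEAKER NAME: text of the speech ..."
# The real transcripts may mark speakers differently.
SPEAKER_RE = re.compile(r"^([A-ZČĆĐŠŽ][A-ZČĆĐŠŽ .\-]+):\s*", re.MULTILINE)


def extract_speeches(text, file_name):
    """Split one transcript into {"file", "speaker", "speech"} records."""
    records = []
    matches = list(SPEAKER_RE.finditer(text))
    for i, match in enumerate(matches):
        # A speech runs from the end of its speaker tag to the next tag.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        speech = " ".join(text[match.end():end].split())  # collapse whitespace
        if speech:
            records.append(
                {
                    "file": file_name,
                    "speaker": match.group(1).strip(),
                    "speech": speech,
                }
            )
    return records


# Invented sample text; only the second speech comes from the record above.
sample = (
    "PREDSEDAVAJUĆI: Reč ima Marinika Tepić.\n"
    "MARINIKA TEPIĆ: Morate svima da date reč,\n"
    "a ne samo Aleksiću.\n"
)
records = extract_speeches(sample, "2024-03-18_konstitutivna_sednica_dan2.txt")
print(json.dumps(records, ensure_ascii=False, indent=2))
```

Passing `ensure_ascii=False` keeps Serbian letters readable in the JSON output instead of escaping them.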

After all of that, I was ready to start the analysis.

Second step - Analysis

Once the data was ready, I lemmatized and tokenized the text, saved the preprocessed corpus, and began with some visual exploration, such as word clouds and speech length analysis.

I also removed stopwords using a custom Serbian stopword list (more on that later).
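As a rough illustration of the tokenization and stopword-removal part of this step, here is a small sketch using only the standard library. The stopword set is a tiny hypothetical stand-in for the custom list, and lemmatization is left out because it needs a Serbian-capable lemmatizer that is not shown here.

```python
import re

# Hypothetical mini stopword list; the project's custom list is much larger.
STOPWORDS = {"da", "a", "ne", "samo", "i", "je", "se", "u", "na"}

# \w is Unicode-aware in Python 3, so letters such as č, ć, đ, š, ž are kept.
TOKEN_RE = re.compile(r"\w+")


def preprocess(speech, stopwords=STOPWORDS):
    """Lowercase, tokenize, and drop stopwords and purely numeric tokens."""
    tokens = TOKEN_RE.findall(speech.lower())
    return [t for t in tokens if t not in stopwords and not t.isdigit()]


tokens = preprocess("Morate svima da date reč, a ne samo Aleksiću.")
print(tokens)
```

Without lemmatization, inflected forms such as "reč" and "reči" stay separate tokens, which is one reason the lemmatization step matters for a language like Serbian.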

Who Spoke the Most?

Note: Many top speakers are session chairpersons who regularly introduce other speakers or close sessions. Their dominance in the rankings might skew the data. A future improvement could be weighting speeches differently, or filtering chairpersons out entirely.

Top Speakers

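| Rank | Speaker | Number of speeches | Party | Note |
|---|---|---|---|---|
| 1 | ANA BRNABIĆ | 1762 | SNS | Chairman |
| 2 | SNEŽANA PAUNOVIĆ | 623 | SPS | Chairman |
| 3 | MARINA RAGUŠ | 454 | SNS | Chairman |
| 4 | MILENKO JOVANOV | 328 | SNS | |
| 5 | ALEKSANDAR JOVANOVIĆ | 268 | Narodni pokret Srbije | |
| 6 | STOJAN RADENOVIĆ | 258 | SNS | |
| 7 | RADOMIR LAZOVIĆ | 125 | ZLF | |
| 8 | SRĐAN MILIVOJEVIĆ | 106 | DS | |
| 9 | MILOŠ PARANDILOVIĆ | 95 | NLS | |
| 10 | ZORAN LUTOVAC | 90 | DS | |
| 11 | MARINIKA TEPIĆ | 86 | SSP | |
| 12 | MILOŠ VUČEVIĆ | 79 | SNS | Prime minister |
| 13 | BORISLAV NOVAKOVIĆ | 77 | Narodni pokret Srbije | |
| 14 | DANIJELA NESTOROVIĆ | 74 | Narodni pokret Srbije | |
| 15 | ELVIRA KOVAČ | 57 | SVM-VMSZ | Chairman |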

Speech length (number of characters)

| Rank | Speaker | Speech length (characters) | Party | Note |
|---|---|---|---|---|
| 1 | ANA BRNABIĆ | 799335 | SNS | Chairman |
| 2 | MILENKO JOVANOV | 696929 | SNS | |
| 3 | MILOŠ VUČEVIĆ | 482097 | SNS | Prime minister |
| 4 | SNEŽANA PAUNOVIĆ | 470949 | SPS | Chairman |
| 5 | DANIJELA NESTOROVIĆ | 283581 | Narodni pokret Srbije | |
| 6 | DUBRAVKA ĐEDOVIĆ HANDANOVIĆ | 262915 | SNS | Minister |
| 7 | ALEKSANDAR MARTINOVIĆ | 218080 | SNS | Minister |
| 8 | SINIŠA MALI | 213075 | SNS | |
| 9 | ŽIVOTA STARČEVIĆ | 151432 | JS | |
| 10 | MARINA RAGUŠ | 146521 | SNS | Chairman |
| 11 | ALEKSANDAR JOVANOVIĆ | 144590 | Narodni pokret Srbije | |
| 12 | MIROSLAV ALEKSIĆ | 141545 | Narodni pokret Srbije | |
| 13 | RADOMIR LAZOVIĆ | 138195 | ZLF | |
| 14 | MARIJAN RISTIČEVIĆ | 133678 | SNS | |
| 15 | ALEKSANDAR VULIN | 133602 | PS | Minister |
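Rankings like the two above can be computed from the JSON records with a pair of counters. The sketch below uses placeholder speakers and invented speeches rather than real data, and the `exclude` parameter is a hypothetical way to try the "filter chairpersons out" idea mentioned in the note.

```python
from collections import Counter

# Placeholder records in the same shape as the cleaned JSON (invented text).
speeches = [
    {"speaker": "SPEAKER A", "speech": "Reč ima narodni poslanik."},
    {"speaker": "SPEAKER B", "speech": "Zahvaljujem. Govoriću o budžetu i o zakonu."},
    {"speaker": "SPEAKER A", "speech": "Vreme je isteklo."},
]

# Assumption: the set of chairpersons is maintained by hand.
CHAIRPERSONS = {"SPEAKER A"}


def speaker_stats(records, exclude=frozenset()):
    """Return (number of speeches, total characters) per speaker."""
    counts, chars = Counter(), Counter()
    for record in records:
        speaker = record["speaker"]
        if speaker in exclude:
            continue
        counts[speaker] += 1
        chars[speaker] += len(record["speech"])
    return counts, chars


counts, chars = speaker_stats(speeches)
print(counts.most_common(15))  # ranking by number of speeches
print(chars.most_common(15))   # ranking by total speech length

# Same ranking with chairpersons removed.
filtered_counts, filtered_chars = speaker_stats(speeches, exclude=CHAIRPERSONS)
print(filtered_counts.most_common(15))
```

Comparing the two rankings side by side is a quick way to see how much procedural speech inflates the chairpersons' numbers.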

Word cloud

[Image: Serbia word cloud]

Using a standard LDA (Latent Dirichlet Allocation) model, I extracted 10 dominant topics, each with 10 keywords. This approach helped identify core themes in parliamentary debates, from family policy to budget discussions and procedural language.

Topics

| Topic # | Top Words | Probable Theme | Interpretation |
|---|---|---|---|
| 0 | zakon, godina, srbija, imati, hteti, ministarstvo, obrazovanje, građanin | Public Policy & Legislation | General discussion of laws, government ministries, and citizens’ rights, likely around national policy or legal reforms. |
| 1 | litijum, rudnik, tinto, sredina, zakon, projekt | Environmental Concerns & Lithium Mining | Focused on Rio Tinto, lithium mining, and its impact on the environment and legislation. |
| 2 | srbija, republika, razvoj, država, zemlja | National Development & Strategy | Talk of Serbia’s development, government, and national identity. High-level political or economic strategy themes. |
| 3 | hteti, čovek, imati, kazati, vlada, govoriti | Political Rhetoric & Governance | Speeches reflecting personal appeals, leadership intentions, and critiques of governance. |
| 4 | porođaj, porodiljski, bolovanje, trudan | Maternal & Family Policy | Centered around maternity, childbirth, and family leave policies. Likely from social/health policy discussions. |
| 5 | imati, hvati, izvoliti, narodni, poslanik | Parliamentary Procedure & Formalities | Language typical in formal sessions: calling on MPs, presence, and formal phrases. Possibly procedural or ironic. |
| 6 | privoditi, kraj, isteklo, reklamiranje, potpredsednica | Session Closure & Miscellaneous | A mixed set likely from session endings, odd terms, and possibly sarcastic or humorous remarks. |
| 7 | godina, evro, milion, milijarda | Budget & Economy | Financial discussion: budget, expenditures, and national economic outlook. |
| 8 | narodni, poslanik, amandman, glasanje, zakon | Legislative Process & Amendments | Focused on voting, proposing laws, and parliamentary decision-making. Core legislative procedures. |
| 9 | poslovnik, replika, vreme, kazati | Debate Rules & Floor Management | Deals with speaking time, requests, and adherence to parliamentary rulebook. Classic floor debate content. |
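To make the idea behind LDA more concrete, here is a toy collapsed Gibbs sampler written with the standard library only. It is a teaching sketch, not the model used in this project: the post does not say which implementation produced the topics above, and in practice a library implementation (for example scikit-learn's `LatentDirichletAllocation`) would normally be used. The function names, hyperparameter values, and the tiny corpus (built from keywords in the table) are assumptions made for the example.

```python
import random
from collections import Counter


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA; returns per-topic word counts."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]         # tokens of doc d in topic k
    topic_word = [Counter() for _ in range(n_topics)]  # count of word w in topic k
    topic_total = [0] * n_topics                       # tokens assigned to topic k
    assignments = []

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(zs)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Remove the token, then resample its topic.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return topic_word


def top_words(topic_word, n=10):
    """Most frequent words per topic, skipping words with zero count."""
    return [[w for w, c in counts.most_common(n) if c > 0] for counts in topic_word]


# Tiny illustrative corpus of already preprocessed documents.
docs = [
    ["litijum", "rudnik", "projekt", "litijum", "rudnik"],
    ["evro", "milion", "milijarda", "evro"],
    ["rudnik", "litijum", "projekt"],
    ["milion", "evro", "milijarda", "milion"],
]
model = lda_gibbs(docs, n_topics=2)
for topic_id, words in enumerate(top_words(model, n=3)):
    print(topic_id, words)
```

The keyword lists are only the raw output. The "Probable Theme" and "Interpretation" columns in the table are a manual reading of those keywords, which no topic model provides on its own.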

Sentiment analysis

Sentiment analysis in Serbian is a real challenge. Most tools are designed for English, Chinese, or Spanish, leaving low-resource languages behind. So I created a custom sentiment lexicon for Serbian based on the NRC Emotion Lexicon by Saif Mohammad.

I mapped each English word to a Serbian translation, removed the English column, and built separate datasets for Latin and Cyrillic scripts.

Here’s an example output from one party’s emotional tone:

[Image: Emotions SNS]
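A lexicon-based emotion profile of this kind can be computed roughly as in the sketch below. The two lexicon entries are made up for illustration and are not copied from the NRC lexicon or from the project's translated version. Only the label names (such as fear, joy, positive, negative) follow the NRC categories.

```python
from collections import Counter

# Made-up excerpt: Serbian lemma -> emotion labels (not real NRC entries).
LEXICON = {
    "strah": ("fear", "negative"),
    "nada": ("joy", "positive"),
}


def emotion_profile(tokens, lexicon=LEXICON):
    """Count emotion labels over tokens and normalize them to shares."""
    counts = Counter()
    for token in tokens:
        counts.update(lexicon.get(token, ()))  # unknown words add nothing
    total = sum(counts.values())
    if total == 0:
        return {}
    return {emotion: n / total for emotion, n in counts.items()}


profile = emotion_profile(["strah", "nada", "zakon", "strah"])
print(profile)
```

Running this over all lemmatized speeches of one party, instead of a single token list, would give a per-party distribution that can be plotted as a bar chart.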

The Stopwords Problem

No high-quality Serbian stopword lists? No problem. I built one manually. My method:

  • Extract the most frequent words.
  • Filter out low-information words (often 3 letters or fewer).
  • Iterate until the list makes sense.
  • Create separate lists for Latinica and Ćirilica.

For sentiment analysis, I relied on Saif Mohammad's NRC Word-Emotion Association Lexicon. If you are interested, the dataframes with stopwords and sentiment scores are here.
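The stopword method above can be sketched as a frequency count followed by a length filter, plus a simple transliteration to produce the Cyrillic list. This is an illustrative approximation: the thresholds and function names are assumptions, the candidates still need the manual review described in the list, and the transliteration is naive (it treats every "lj", "nj", and "dž" as a digraph, which is wrong for a few words).

```python
from collections import Counter

# Serbian Latin digraphs must be checked before single letters.
DIGRAPHS = {"lj": "љ", "nj": "њ", "dž": "џ"}
SINGLE = dict(zip("abvgdđežzijklmnoprstćufhcčš", "абвгдђежзијклмнопрстћуфхцчш"))


def to_cyrillic(word):
    """Naive lowercase Latin -> Cyrillic transliteration."""
    out, i = [], 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:
            out.append(DIGRAPHS[pair])
            i += 2
        else:
            out.append(SINGLE.get(word[i], word[i]))  # keep unknown characters
            i += 1
    return "".join(out)


def stopword_candidates(token_lists, top_n=50, max_len=3):
    """Frequent short words, as candidates for manual review."""
    freq = Counter(t for tokens in token_lists for t in tokens)
    return [w for w, _ in freq.most_common(top_n) if len(w) <= max_len]


# Invented mini corpus of tokenized speeches.
corpus = [["da", "je", "zakon", "da"], ["i", "da", "je", "budžet"]]
latin_list = stopword_candidates(corpus)
cyrillic_list = [to_cyrillic(w) for w in latin_list]
print(latin_list)
print(cyrillic_list)
```

The length filter is only a heuristic. Short but meaningful words can slip into the candidate list, so reviewing and repeating the process is still needed.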

Conclusion & Future Work

This project is just a beginning. I plan to:

  • Improve sentiment classification using Serbian-specific transformers.
  • Incorporate speaker embeddings and unsupervised clustering (e.g., on UMAP-reduced embeddings).
  • Build interactive dashboards for public access.

If you’re a researcher, developer, or just curious about NLP and politics, feel free to fork the repo, suggest improvements, or collaborate!