Skip to article frontmatterSkip to article content

Part 4: Choose your own advecnture! (7 Points; Optional for Extra Credit)

Authors
Affiliations
UC Berkeley

Part 4: Choose your own advecnture! (7 Points; Optional for Extra Credit)

This section is open ended and your chance to explare any advanced analysis. Please perform any additional analysis you find interesting! Suggested analyses (only do one max):

import spacy 
spacy.cli.download("en_core_web_sm")
Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 77.5 MB/s eta 0:00:0000:01
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
# make a path for the outputs
from pathlib import Path

OUTPUT_DIR = Path("outputs")
OUTPUT_DIR.mkdir(exist_ok=True)
import pandas as pd

sou = pd.read_csv("data/SOTU.csv")

sou_21cen = sou[sou["Year"] >= 2000].reset_index(drop=True)
import spacy
from tqdm import tqdm

nlp = spacy.load("en_core_web_sm")

# process speeches
processed_speech = [nlp(text) for text in tqdm(sou_21cen["Text"])]
100%|██████████| 25/25 [00:24<00:00,  1.01it/s]
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words="english")
tfidf_matrix = tfidf.fit_transform(sou_21cen["Text"])


tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf.get_feature_names_out())

# add president label
tfidf_df["President"] = sou_21cen["President"].values
tfidf_df.head()
Loading...
# group by president and obtain means
president_vectors = tfidf_df.groupby("President").mean()
president_vectors
Loading...
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import mpld3

# plot pca analysis -- which presidents had most similar speeches
pca = PCA(n_components=2)
coords = pca.fit_transform(president_vectors)

pca_df = pd.DataFrame(coords, columns=["PC1", "PC2"],
                      index=president_vectors.index)

fig, ax = plt.subplots(figsize=(8,6))

for name, row in pca_df.iterrows():
    plt.scatter(row["PC1"], row["PC2"])
    plt.text(row["PC1"] + 0.01, row["PC2"] + 0.01, name, fontsize=9)

plt.title("Presidents' Speeches (PCA on TF–IDF)")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.axhline(0, color="gray", linewidth=0.5)
plt.axvline(0, color="gray", linewidth=0.5)
plt.show()

# save as html
mpld3.save_html(fig, str(OUTPUT_DIR / "president_speeches_PCA.html"))
<Figure size 800x600 with 1 Axes>
/srv/conda/envs/notebook/lib/python3.12/site-packages/mpld3/mplexporter/exporter.py:84: UserWarning: Blended transforms not yet supported. Zoom behavior may not work as expected.
  warnings.warn("Blended transforms not yet supported. "

The PCA plot shows Biden and Obama close together, reflecting similar modern Democratic language and overlapping vocabulary (of course, they of course also shared two terms together). Trump and Bush cluster nearer each other, suggesting overlap in Republican themes. Interestingly, Clinton appears farther from both groups, likely because his speeches predate the 2000s shift in political language (i.e. pre-9/11), making his rhetorical style more distinct.