8  Homework 2

Submission Instructions

Submit the R Markdown (.Rmd) file with code and answers on Brightspace. Due date: March 11, 2025 by 17:00 CET.

8.1 Question 1: Large Language Model (LLM) for Text Analysis

Kaggle hosts an archive of Donald Trump’s tweets from his first presidency, up to 8 January 2021. The data include the number of retweets and favorites, whether a tweet was deleted or flagged, the device it was sent from, and more. For this question, you will work with this data set.

DTtweets=read.csv("https://raw.githubusercontent.com/babakrezaee/MethodsCourses/refs/heads/master/DataSets/trump_tweets.csv")

# Let's first check the dimensions of the dataset
dim(DTtweets)
[1] 56571    10

You will use the ChatGPT API to analyze the sentiment of Donald Trump’s tweets related to NATO and the European Union (EU).

library(dplyr)

# Convert text to lowercase for case-insensitive filtering
DTtweets <- DTtweets %>%
  mutate(text = tolower(text))

# Filter tweets mentioning NATO and the EU
nato_tweets <- filter(DTtweets, grepl("nato", text))
eu_tweets <- filter(DTtweets, grepl("eu|european union|europeanunion",text))

# Display counts
nrow(nato_tweets)
[1] 656
nrow(eu_tweets)
[1] 907
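
Note that grepl() matches substrings, so the pattern "nato" also matches words such as "senator", and "eu" matches "reuters". The following is a minimal sketch of a stricter filter, assuming whole-word matches are what you want. It uses the \\b word-boundary anchor; the toy vector is made up for illustration, and the counts reported above would change if you applied these patterns to the data.

```r
# Toy examples (hypothetical, not taken from the data set)
toy <- c("nato must pay its fair share",
         "thank you senator!",
         "the eu treats us badly",
         "reuters is fake news")

# Substring matching: "senator" and "reuters" are false positives
grepl("nato", toy)
grepl("eu", toy)

# Whole-word matching with \\b (word boundary)
nato_hits <- grepl("\\bnato\\b", toy, perl = TRUE)
eu_hits <- grepl("\\beu\\b|european union|europeanunion", toy, perl = TRUE)
```

Whether to use the looser or the stricter pattern is a judgment call; either way, inspect a sample of the matched tweets before sending them to the API.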

Your main tasks are:

  1. Use the ChatGPT API to analyze the sentiment of each tweet, both as a categorical variable (Positive, Neutral, or Negative) and as a continuous variable (from -1 to +1).

  2. Compare and interpret the sentiment trends across topics over time.

  3. Analyze whether the popularity of these tweets is associated with their sentiment, using a regression analysis. Tip: first think about your unit of analysis.
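
One possible way to approach the first task is sketched below; it is not the required solution. The helper names (build_prompt, parse_sentiment, classify_tweet), the LABEL;SCORE reply format, the model name, and the OPENAI_API_KEY environment variable are all assumptions made for illustration. The request goes through the httr2 package to the OpenAI chat completions endpoint.

```r
# Hypothetical helper: ask for a fixed reply format that is easy to parse
build_prompt <- function(tweet) {
  paste0(
    "Classify the sentiment of the tweet below. ",
    "Reply with exactly one line in the form LABEL;SCORE, where LABEL is ",
    "Positive, Neutral, or Negative and SCORE is a number from -1 to 1.\n\n",
    "Tweet: ", tweet
  )
}

# Hypothetical helper: turn a reply such as "Negative;-0.6" into one data row.
# Replies that do not follow the format become NA instead of raising an error.
parse_sentiment <- function(reply) {
  parts <- trimws(strsplit(reply, ";", fixed = TRUE)[[1]])
  label <- parts[1]
  score <- suppressWarnings(as.numeric(parts[2]))
  valid <- length(parts) == 2 &&
    label %in% c("Positive", "Neutral", "Negative") &&
    !is.na(score) && abs(score) <= 1
  if (!valid) {
    return(data.frame(label = NA_character_, score = NA_real_))
  }
  data.frame(label = label, score = score)
}

# Hypothetical helper: send one tweet to the API (needs httr2 and an API key)
classify_tweet <- function(tweet, model = "gpt-4o-mini") {
  req <- httr2::request("https://api.openai.com/v1/chat/completions")
  req <- httr2::req_auth_bearer_token(req, Sys.getenv("OPENAI_API_KEY"))
  req <- httr2::req_body_json(req, list(
    model = model,        # assumed model name; use the one you have access to
    temperature = 0,      # reduces run-to-run variation in the replies
    messages = list(list(role = "user", content = build_prompt(tweet)))
  ))
  body <- httr2::resp_body_json(httr2::req_perform(req))
  parse_sentiment(body$choices[[1]]$message$content)
}
```

Test the loop on a handful of tweets first, and save the results to disk as you go, since each call costs money and may fail partway through.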

8.2 Question 2: Random Forest Regression Using the caret Package

In class, you learned to use the randomForest package to fit a random forest. In this assignment, you will develop a 5-fold cross-validated random forest model using the caret package, which you learned about in the Naive Bayes session.

Load the dataset:

library(readr)

# Load dataset
CHdata <- read_csv("https://raw.githubusercontent.com/babakrezaee/MethodsCourses/master/DataSets/calhouse.csv")

# Check structure of dataset
head(CHdata)
# A tibble: 6 × 10
  longitude latitude housingMedianAge population households medianIncome
      <dbl>    <dbl>            <dbl>      <dbl>      <dbl>        <dbl>
1     -122.     37.9               41        322        126         8.33
2     -122.     37.9               21       2401       1138         8.30
3     -122.     37.8               52        496        177         7.26
4     -122.     37.8               52        558        219         5.64
5     -122.     37.8               52        565        259         3.85
6     -122.     37.8               52        413        193         4.04
# ℹ 4 more variables: AveBedrms <dbl>, AveRooms <dbl>, AveOccupancy <dbl>,
#   logMedVal <dbl>
colnames(CHdata)
 [1] "longitude"        "latitude"         "housingMedianAge" "population"      
 [5] "households"       "medianIncome"     "AveBedrms"        "AveRooms"        
 [9] "AveOccupancy"     "logMedVal"       
dim(CHdata)
[1] 20640    10

Now, partition the data into train (80%) and test (20%).

library(caret)

# Set seed for reproducibility
set.seed(7)

# Shuffle the data
CHdata <- CHdata[sample(nrow(CHdata)), ]

# Create a partition (80% train, 20% test)
trainIndex <- createDataPartition(CHdata$logMedVal, p = 0.8, list = FALSE)

# Split the dataset
CAtrain <- CHdata[trainIndex, ]
CAtest <- CHdata[-trainIndex, ]

# Check dimensions
dim(CAtrain)
[1] 16513    10
dim(CAtest)
[1] 4127   10

With the caret package, you first call trainControl(), where you specify the resampling method (here cross-validation) and the number of folds.

# Train Random Forest model using `caret`
control <- trainControl(method = "cv", number = 5, verboseIter = TRUE)  # 5-fold cross-validation

RF_model <- train(
  logMedVal ~ ., 
  data = CAtrain, 
  method = "rf", 
  trControl = control, 
  tuneGrid = expand.grid(mtry = 3),  # Using mtry=3
  ntree = 100  # Number of trees
)
+ Fold1: mtry=3 
- Fold1: mtry=3 
+ Fold2: mtry=3 
- Fold2: mtry=3 
+ Fold3: mtry=3 
- Fold3: mtry=3 
+ Fold4: mtry=3 
- Fold4: mtry=3 
+ Fold5: mtry=3 
- Fold5: mtry=3 
Aggregating results
Fitting final model on full training set
# Print model summary
print(RF_model)
Random Forest 

16513 samples
    9 predictor

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 13210, 13211, 13210, 13211, 13210 
Resampling results:

  RMSE       Rsquared   MAE      
  0.2351345  0.8325762  0.1655492

Tuning parameter 'mtry' was held constant at a value of 3
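
To make explicit what trainControl(method = "cv", number = 5) asks caret to do, here is a minimal base-R sketch of 5-fold cross-validation done by hand. To keep it fast and dependency-free it swaps in a linear regression on the built-in mtcars data rather than a random forest on the housing data, so the numbers it produces are unrelated to the output above.

```r
set.seed(7)

k <- 5
# Assign each row of mtcars to one of k folds at random
fold_id <- sample(rep(seq_len(k), length.out = nrow(mtcars)))

fold_rmse <- numeric(k)
for (i in seq_len(k)) {
  train_rows <- mtcars[fold_id != i, ]  # k - 1 folds used for fitting
  held_out <- mtcars[fold_id == i, ]    # remaining fold used for evaluation
  # Stand-in model: linear regression instead of a random forest
  fit <- lm(mpg ~ wt + hp, data = train_rows)
  pred <- predict(fit, newdata = held_out)
  fold_rmse[i] <- sqrt(mean((held_out$mpg - pred)^2))
}

# Cross-validated RMSE: the average over the k held-out folds
cv_rmse <- mean(fold_rmse)
```

The RMSE that caret prints under "Resampling results" is the same kind of quantity: an average over the held-out folds, not the error on the rows used to fit the model.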

Now that you have trained your model, you can make predictions and evaluate its performance on the test set.

# Predict on the test set
RFpred <- predict(RF_model, newdata = CAtest)

# Compute RMSE
RMSE <- sqrt(mean((CAtest$logMedVal - RFpred)^2))

cat("RMSE on the test data for Random Forest using caret is:", RMSE, "\n")
RMSE on the test data for Random Forest using caret is: 0.2327727 
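
To compare the test-set performance with all three cross-validated metrics printed by caret, the same calculation can be extended to MAE and R-squared. The helper below is a hypothetical convenience function shown on made-up numbers; it computes R-squared as the squared correlation between observed and predicted values. caret's postResample(pred, obs) reports the same three quantities.

```r
# Hypothetical helper: RMSE, R-squared, and MAE for numeric predictions
regression_metrics <- function(obs, pred) {
  c(
    RMSE = sqrt(mean((obs - pred)^2)),
    Rsquared = cor(obs, pred)^2,
    MAE = mean(abs(obs - pred))
  )
}

# Made-up observed and predicted values, for illustration only
obs <- c(1, 2, 3, 4)
pred <- c(1.5, 1.5, 3.5, 3.5)
m <- regression_metrics(obs, pred)
m
```

With the objects from this question, the call would be regression_metrics(CAtest$logMedVal, RFpred).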

Here are your tasks for this question:

  1. What is the purpose of verboseIter = TRUE in the above code?
  2. What do mtry=3 and ntree=100 mean in the Random Forest model?
  3. How does increasing ntree (number of trees) impact model performance?
  4. How can we find the best mtry here? Discuss your tuning strategy. Tip: you can set tuneGrid = expand.grid(mtry = 3:5). IMPORTANT: this is a computationally intensive process!
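
As a starting point for the last task, the tip can be sketched as follows. The wrapper name tune_mtry and its defaults are assumptions for illustration; it needs the caret and randomForest packages and the CAtrain data created above, and it is only one of several reasonable tuning strategies.

```r
# Candidate values for mtry, as in the tip above
mtry_grid <- expand.grid(mtry = 3:5)

# Hypothetical wrapper around caret::train() for a grid search over mtry
tune_mtry <- function(train_data, grid = mtry_grid, folds = 5, trees = 100) {
  ctrl <- caret::trainControl(method = "cv", number = folds)
  caret::train(
    logMedVal ~ .,
    data = train_data,
    method = "rf",
    trControl = ctrl,
    tuneGrid = grid,
    ntree = trees
  )
}

# Usage (slow: fits folds * nrow(grid) forests plus one final model):
# tuned <- tune_mtry(CAtrain)
# tuned$results   # cross-validated RMSE, Rsquared, and MAE for each mtry
# tuned$bestTune  # the mtry with the lowest cross-validated RMSE
```

With three candidate values and five folds, this fits 15 forests before the final model, which is why the run time grows quickly as the grid widens.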