Submission Instructions
Submit the R Markdown (.Rmd) file with code and answers on Brightspace. Due date: March 11, 2025 by 17:00 CET.
Kaggle hosts an archive of Donald Trump’s tweets from his first presidency, running until 8 January 2021. The data include the number of retweets, whether a tweet was deleted, the device used to tweet, flagged tweets, favorites, etc. For this question, you will work with this data set.
DTtweets=read.csv("https://raw.githubusercontent.com/babakrezaee/MethodsCourses/refs/heads/master/DataSets/trump_tweets.csv")
# Let's first see the dimension of the dataset
dim(DTtweets)
[1] 56571 10
You will use the ChatGPT API to analyze the sentiment of Donald Trump’s tweets related to NATO and the European Union (EU).
library(dplyr)
# Convert text to lowercase for case-insensitive filtering
DTtweets <- DTtweets %>%
  mutate(text = tolower(text))
# Filter tweets mentioning NATO and the EU
# Note: these patterns also match inside longer words
# (e.g., "nato" in "senator", "eu" in "reuters")
nato_tweets <- filter(DTtweets, grepl("nato", text))
eu_tweets <- filter(DTtweets, grepl("eu|european union|europeanunion", text))
# Display counts
nrow(nato_tweets)
[1] 656
nrow(eu_tweets)
[1] 907
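Because `grepl()` matches substrings, the counts above may include tweets that never mention NATO or the EU. The sketch below compares the patterns used above with a stricter word-boundary version. The example sentences are made up for illustration and are not taken from the dataset; whether to adopt the stricter patterns is a judgment call for your analysis.

```r
# Hypothetical example tweets (not from the dataset), already lowercased
examples <- c(
  "the eu treats us very unfairly on trade",
  "reuters is reporting fake news again",
  "great meeting with european union leaders",
  "nato must pay more",
  "thank you senator, great job"
)

# Patterns in the style used above: they match anywhere, even inside other words
loose_eu   <- grepl("eu|european union|europeanunion", examples)
loose_nato <- grepl("nato", examples)

# Stricter alternative: \\b requires a word boundary on both sides
strict_eu   <- grepl("\\beu\\b|european union|europeanunion", examples, perl = TRUE)
strict_nato <- grepl("\\bnato\\b", examples, perl = TRUE)

# Side-by-side comparison of what each pattern flags
data.frame(text = examples, loose_eu, strict_eu, loose_nato, strict_nato)
```

In this toy example the loose patterns flag the "reuters" and "senator" sentences, while the word-boundary patterns do not.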
Your main tasks are:
Use the ChatGPT API to analyze the sentiment of each tweet, both as a categorical variable (Positive, Neutral, or Negative) and as a continuous variable (from -1 to +1).
Compare and interpret the sentiment trends across topics over time.
Analyze whether the popularity of these tweets is associated with their sentiment. Use a regression analysis. Tip: first think about your unit of analysis.
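Once each tweet has a sentiment score, the trend comparison and the regression could be set up roughly as follows. This is a minimal sketch on simulated data: the data frame `scored` and its columns (`date`, `topic`, `sentiment`, `retweets`) are hypothetical stand-ins, not the actual column names of the tweet dataset, and the tweet-level unit of analysis is only one of several possible choices.

```r
set.seed(1)

# Simulated stand-in for scored tweets; all column names here are hypothetical
n <- 200
scored <- data.frame(
  date      = as.Date("2017-01-20") + sample(0:1400, n, replace = TRUE),
  topic     = sample(c("NATO", "EU"), n, replace = TRUE),
  sentiment = round(runif(n, min = -1, max = 1), 2),  # continuous score in [-1, 1]
  retweets  = rpois(n, lambda = 5000)                  # popularity proxy
)

# Trend over time: mean sentiment per topic and year
scored$year <- format(scored$date, "%Y")
trend <- aggregate(sentiment ~ topic + year, data = scored, FUN = mean)
print(trend)

# Tweet-level regression: log retweets on sentiment, controlling for topic
fit <- lm(log1p(retweets) ~ sentiment + topic, data = scored)
summary(fit)$coefficients
```

Aggregating to topic-by-month averages before running the regression would be an alternative unit of analysis; the choice changes both the sample size and the interpretation of the coefficients.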
In class, you learned to use the randomForest package to build a random forest model. In this assignment, you are asked to develop a 5-fold cross-validated random forest model using the caret package, which you learned about in the Naive Bayes session.
Load the dataset:
library(readr)
# Load dataset
CHdata <- read_csv("https://raw.githubusercontent.com/babakrezaee/MethodsCourses/master/DataSets/calhouse.csv")
# Check structure of dataset
head(CHdata)
# A tibble: 6 × 10
  longitude latitude housingMedianAge population households medianIncome
      <dbl>    <dbl>            <dbl>      <dbl>      <dbl>        <dbl>
1     -122.     37.9               41        322        126         8.33
2     -122.     37.9               21       2401       1138         8.30
3     -122.     37.8               52        496        177         7.26
4     -122.     37.8               52        558        219         5.64
5     -122.     37.8               52        565        259         3.85
6     -122.     37.8               52        413        193         4.04
# ℹ 4 more variables: AveBedrms <dbl>, AveRooms <dbl>, AveOccupancy <dbl>,
#   logMedVal <dbl>
colnames(CHdata)
 [1] "longitude"        "latitude"         "housingMedianAge" "population"
 [5] "households"       "medianIncome"     "AveBedrms"        "AveRooms"
 [9] "AveOccupancy"     "logMedVal"
dim(CHdata)
[1] 20640 10
Now, partition the data into train (80%) and test (20%).
library(caret)
# Set seed for reproducibility
set.seed(7)
# Shuffle the data
CHdata <- CHdata[sample(nrow(CHdata)), ]
# Create a partition (80% train, 20% test)
trainIndex <- createDataPartition(CHdata$logMedVal, p = 0.8, list = FALSE)
# Split the dataset
CAtrain <- CHdata[trainIndex, ]
CAtest <- CHdata[-trainIndex, ]
# Check dimensions
dim(CAtrain)
[1] 16513 10
dim(CAtest)
[1] 4127 10
With the caret package, you first need to set up the resampling scheme with the trainControl() function, where you specify cross-validation and the number of folds.
# Train Random Forest model using `caret`
control <- trainControl(method = "cv", number = 5, verboseIter = TRUE) # 5-fold cross-validation
RF_model <- train(
  logMedVal ~ .,
  data = CAtrain,
  method = "rf",
  trControl = control,
  tuneGrid = expand.grid(mtry = 3), # Using mtry=3
  ntree = 100 # Number of trees
)
+ Fold1: mtry=3
- Fold1: mtry=3
+ Fold2: mtry=3
- Fold2: mtry=3
+ Fold3: mtry=3
- Fold3: mtry=3
+ Fold4: mtry=3
- Fold4: mtry=3
+ Fold5: mtry=3
- Fold5: mtry=3
Aggregating results
Fitting final model on full training set
# Print model summary
print(RF_model)
Random Forest
16513 samples
9 predictor
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 13210, 13211, 13210, 13211, 13210
Resampling results:
RMSE Rsquared MAE
0.2351345 0.8325762 0.1655492
Tuning parameter 'mtry' was held constant at a value of 3
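To make the resampling results above more concrete, the sketch below carries out 5-fold cross-validation by hand in base R. It uses simulated data and a linear model rather than the housing data and a random forest, purely to keep it fast and dependency-free; the variable names are hypothetical, and caret's internal fold assignment may differ in detail.

```r
set.seed(7)

# Simulated regression data; variable names are hypothetical
n <- 500
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 1 + 2 * sim$x1 - sim$x2 + rnorm(n, sd = 0.5)

# Assign each row to one of k folds at random
k <- 5
fold_id <- sample(rep(seq_len(k), length.out = n))

# For each fold: fit on the other k - 1 folds, score on the held-out fold
fold_rmse <- sapply(seq_len(k), function(i) {
  fit  <- lm(y ~ x1 + x2, data = sim[fold_id != i, ])
  pred <- predict(fit, newdata = sim[fold_id == i, ])
  sqrt(mean((sim$y[fold_id == i] - pred)^2))
})

fold_rmse        # one RMSE per held-out fold
mean(fold_rmse)  # cross-validated RMSE, averaged across folds
```

Each observation is held out exactly once, so every fold's error is computed on data the corresponding model never saw during fitting.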
Now that you have trained your model, you can make predictions and evaluate its performance.
# Predict on the test set
RFpred <- predict(RF_model, newdata = CAtest)
# Compute RMSE
RMSE <- sqrt(mean((CAtest$logMedVal - RFpred)^2))
cat("RMSE on the test data for Random Forest using caret is:", RMSE, "\n")
RMSE on the test data for Random Forest using caret is: 0.2327727
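Beyond RMSE, it can help to report the same three metrics on the test set that the cross-validation summary shows. The helper below, `eval_metrics()`, is a hypothetical name introduced here for illustration; it computes R-squared as the squared correlation between observed and predicted values, which is one common definition. caret also provides `postResample()` for a similar purpose.

```r
# Hypothetical helper: regression metrics from observed and predicted values
eval_metrics <- function(obs, pred) {
  c(
    RMSE     = sqrt(mean((obs - pred)^2)),
    MAE      = mean(abs(obs - pred)),
    Rsquared = cor(obs, pred)^2  # squared correlation of observed and predicted
  )
}

# Toy check with made-up numbers
obs  <- c(1, 2, 3, 4)
pred <- c(1.5, 2, 2.5, 4)
eval_metrics(obs, pred)
```

With the objects created above, the call would be `eval_metrics(CAtest$logMedVal, RFpred)`.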
Here are your tasks for this question: