Submission Instructions
Submit the R Markdown (.Rmd) file with code and answers on Brightspace. Due date: March 11, 2025 by 17:00 CET.
The Kaggle website has an archive of Donald Trump’s tweets from his first presidency, up to 8-Jan-2021. The data contain information on the number of retweets, whether a tweet was deleted, the device used to tweet, flagged tweets, favorites, etc. For this question, you will work with this data set.
# Let's first see the dimension of the dataset
dim(DTtweets)
[1] 56571 10
You will use the ChatGPT API to analyze the sentiment of Donald Trump’s tweets related to NATO and the European Union (EU).
# Load dplyr for %>%, mutate(), and filter()
library(dplyr)
# Convert text to lowercase for case-insensitive filtering
DTtweets <- DTtweets %>%
  mutate(text = tolower(text))
# Filter tweets mentioning NATO and the EU
nato_tweets <- filter(DTtweets, grepl("nato", text))
eu_tweets <- filter(DTtweets, grepl("eu|european union|europeanunion", text))
# Display counts
nrow(nato_tweets)
[1] 656
nrow(eu_tweets)
[1] 907
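Note that the pattern "eu" matches those two letters anywhere in a tweet, so words such as "reuters" or "neutral" also count as hits. The following is a minimal base-R sketch of how word boundaries narrow the match. The example strings are invented and `eu_pattern` is a hypothetical name; whether the stricter pattern suits your analysis is a judgment call.

```r
# Toy strings (invented for illustration, not taken from the dataset)
texts <- c("the eu is tough on trade",
           "reuters reported it",
           "european union leaders met",
           "stay neutral")

# Loose pattern: matches "eu" anywhere, including inside other words
loose <- grepl("eu|european union|europeanunion", texts)

# Stricter pattern (hypothetical): \\b marks a word boundary
eu_pattern <- "\\beu\\b|european union|europeanunion"
strict <- grepl(eu_pattern, texts, perl = TRUE)

# Compare which strings each pattern flags
print(data.frame(texts, loose, strict))
```

Inspecting a sample of the filtered tweets by eye is a quick way to see how many false matches the loose pattern lets through.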
Your main tasks are:
Use the ChatGPT API to analyze the sentiment of each tweet, both as a categorical variable (Positive, Neutral, and Negative) and as a continuous variable (from -1 to +1).
Compare and interpret the sentiment trends across topics over time.
Analyze whether the popularity of these tweets is associated with their sentiment. Use a regression analysis. Tip: first think about your unit of analysis.
In class, you learned to use the randomForest package to develop a random forest algorithm. In this assignment, you are asked to develop a 5-fold cross-validated random forest model using the caret package, which you learned about in the Naive Bayes session.
Load the dataset:
# Load packages: readr for read_csv(), caret for partitioning and training
library(readr)
library(caret)
# Load dataset
CHdata <- read_csv("https://raw.githubusercontent.com/babakrezaee/MethodsCourses/master/DataSets/calhouse.csv")
# Check structure of dataset
head(CHdata)
# A tibble: 6 × 10
  longitude latitude housingMedianAge population households medianIncome
      <dbl>    <dbl>            <dbl>      <dbl>      <dbl>        <dbl>
1     -122.     37.9               41        322        126         8.33
2     -122.     37.9               21       2401       1138         8.30
3     -122.     37.8               52        496        177         7.26
4     -122.     37.8               52        558        219         5.64
5     -122.     37.8               52        565        259         3.85
6     -122.     37.8               52        413        193         4.04
# ℹ 4 more variables: AveBedrms <dbl>, AveRooms <dbl>, AveOccupancy <dbl>,
#   logMedVal <dbl>
names(CHdata)
 [1] "longitude"        "latitude"         "housingMedianAge" "population"
 [5] "households"       "medianIncome"     "AveBedrms"        "AveRooms"
 [9] "AveOccupancy"     "logMedVal"
dim(CHdata)
[1] 20640 10
Now, partition the data into train (80%) and test (20%).
# Set seed for reproducibility
# Shuffle the data
CHdata <- CHdata[sample(nrow(CHdata)), ]
# Create a partition (80% train, 20% test)
trainIndex <- createDataPartition(CHdata$logMedVal, p = 0.8, list = FALSE)
# Split the dataset
CAtrain <- CHdata[trainIndex, ]
CAtest <- CHdata[-trainIndex, ]
# Check dimensions
dim(CAtrain)
[1] 16513 10
dim(CAtest)
[1] 4127 10
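For comparison, the same 80/20 idea can be sketched in base R with sample(). This is a simplification: createDataPartition() also tries to keep the distribution of the outcome similar in both parts, which a plain random draw does not do. The data frame `toy` is simulated and stands in for the housing data.

```r
set.seed(1)  # arbitrary seed, chosen only for this sketch

# Simulated stand-in for the housing data (not the real dataset)
toy <- data.frame(x = rnorm(1000), y = rnorm(1000))

# Draw 80% of the row numbers at random, without replacement
n_train   <- floor(0.8 * nrow(toy))
train_idx <- sample(nrow(toy), size = n_train)

# Rows in train_idx form the training set; all other rows form the test set
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

dim(toy_train)
dim(toy_test)
```

Negative indexing with `-train_idx` ensures that no row ends up in both parts.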
For the caret package, you first need to set up the trainControl() function, where you specify cross-validation and the number of folds.
# Train Random Forest model using `caret`
control <- trainControl(method = "cv", number = 5, verboseIter = TRUE) # 5-fold cross-validation
RF_model <- train(
  logMedVal ~ .,
  data = CAtrain,
  method = "rf",
  trControl = control,
  tuneGrid = expand.grid(mtry = 3), # Using mtry=3
  ntree = 100 # Number of trees
)
+ Fold1: mtry=3
- Fold1: mtry=3
+ Fold2: mtry=3
- Fold2: mtry=3
+ Fold3: mtry=3
- Fold3: mtry=3
+ Fold4: mtry=3
- Fold4: mtry=3
+ Fold5: mtry=3
- Fold5: mtry=3
Aggregating results
Fitting final model on full training set
# Print model summary
print(RF_model)
Random Forest
16513 samples
9 predictor
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 13210, 13211, 13210, 13211, 13210
Resampling results:
RMSE Rsquared MAE
0.2351345 0.8325762 0.1655492
Tuning parameter 'mtry' was held constant at a value of 3
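To make concrete what 5-fold cross-validation does, here is a minimal base-R sketch of the procedure. It is written under stated assumptions: the data are simulated, and a linear model (lm) replaces the random forest so that the example runs without extra packages. caret's internals differ in detail, for instance in how the folds are created.

```r
set.seed(1)  # arbitrary seed for this sketch

# Simulated data (assumption: a simple linear relationship plus noise)
n   <- 500
toy <- data.frame(x = rnorm(n))
toy$y <- 2 * toy$x + rnorm(n, sd = 0.5)

# Assign each row to one of 5 folds at random
k     <- 5
folds <- sample(rep(1:k, length.out = n))

# For each fold: fit on the other 4 folds, score on the held-out fold
fold_rmse <- sapply(1:k, function(i) {
  fit  <- lm(y ~ x, data = toy[folds != i, ])
  pred <- predict(fit, newdata = toy[folds == i, ])
  sqrt(mean((toy$y[folds == i] - pred)^2))
})

fold_rmse        # one RMSE per fold
mean(fold_rmse)  # cross-validated RMSE, averaged over folds
```

Every observation is used for testing exactly once, which is why the averaged RMSE is a less noisy estimate of out-of-sample error than a single train/validation split.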
Now that you have trained your model, you can make predictions and evaluate performance.
# Predict on the test set
RFpred <- predict(RF_model, newdata = CAtest)
# Compute RMSE
RMSE <- sqrt(mean((CAtest$logMedVal - RFpred)^2))
cat("RMSE on the test data for Random Forest using caret is:", RMSE, "\n")
RMSE on the test data for Random Forest using caret is: 0.2327727
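The caret output reports three metrics: RMSE, Rsquared, and MAE. As a hedged illustration of how such metrics can be computed by hand, the sketch below uses invented toy vectors; the squared correlation is used for R-squared here, which to my understanding matches caret's default but is stated as an assumption.

```r
# Toy observed and predicted values (invented for illustration)
obs  <- c(1.0, 2.0, 3.0, 4.0)
pred <- c(1.5, 2.0, 2.5, 4.0)

err  <- obs - pred
rmse <- sqrt(mean(err^2))   # root mean squared error
mae  <- mean(abs(err))      # mean absolute error
r2   <- cor(obs, pred)^2    # squared correlation (assumed R-squared definition)

c(RMSE = rmse, MAE = mae, Rsquared = r2)
```

In the output above, the cross-validated RMSE (0.2351) and the test RMSE (0.2328) are close, which suggests the cross-validation estimate was a reasonable guide to out-of-sample error.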
Here are your tasks for this question: