Sentiment Disagreement vs Bitcoin Volatility Analysis

2025-12-22

Does sentiment disagreement predict next‑day Bitcoin volatility?#

Most sentiment‑based market analysis focuses on the average mood. But averages can hide conflict: a day where half the posts are very positive and half are very negative can average out to “neutral,” even though the market conversation is highly polarized. In this analysis, I focus on sentiment disagreement, i.e. how divided sentiment is, rather than just the mean.

Research question
Does daily sentiment disagreement (Twitter + Reddit) help predict Bitcoin price instability on the next day?

Hypothesis
If public opinion is more divided today, Bitcoin should be more unstable tomorrow (higher next‑day volatility).

This analysis is my own contribution to the project Predicting Bitcoin Price Movements Using Public Sentiment, for the course Scientific Data Analysis at the University of Amsterdam.

GitHub repository: arkdong/SDA25_Group11

Data Collection and Preparation#

All the essential datasets for sentiment and Bitcoin price data were downloaded from Kaggle. To keep the analysis achievable within the project, we limit the timeframe of all datasets to 1 Jan 2018 - 1 Jan 2019. The datasets are:

Twitter: Bitcoin tweets - 16M tweets
Reddit: Reddit Comments Containing “Bitcoin” 2009 to 2019
Bitcoin: Bitcoin Historical Data

Before doing sentiment analysis, we need to translate each Twitter and Reddit post into English and assign it a sentiment score.

Translate content with the model Helsinki-NLP/opus-mt-mul-en (Hugging Face), a multilingual translation model trained on the Tatoeba Translation Challenge dataset from OPUS.
Evaluate the sentiment score with the model cardiffnlp/twitter-roberta-base-sentiment (Hugging Face), which is trained on 58M tweets and fine-tuned for sentiment analysis with the TweetEval benchmark.

The translation and evaluation tasks ran in parallel on an A100 GPU in Google Colab, taking 25 hours in total for 1.2M tweets and 890K Reddit posts.

translate_sentiment.ipynb

Parallelized notebook that runs on an A100 GPU to translate and perform large-scale sentiment analysis on 890K Reddit posts and 1.2M tweets.

Defining Sentiment Disagreement#

To measure whether the crowd is “aligned” or “split” in its opinion, I engineered daily sentiment disagreement features. Let $s_{i,t}$ be the sentiment score of post $i$ on day $t$. I used multiple disagreement measures:

Median Absolute Deviation (MAD) is the dispersion of the continuous scores around the daily median. Defined as:

D^{\text{MAD}}_t = \text{median} \Big(|s_{i, t}-\text{median}(s_{i, t})|\Big)

Mean Gap for the cross-platform disagreement between Twitter and Reddit. First compute the mean sentiment per platform, $\mu_{t}^{tw}$ and $\mu_{t}^{rd}$. Then the mean gap between the platforms is defined as:

D^{\Delta \mu}_t= \bigg|\mu_{t}^{tw} - \mu_{t}^{rd}\bigg|

Variance defined as:

D_t^{\text{var}}=\text{Var}(s_{i, t})

Standard Deviation defined as:

D_t^{\text{std}}=\sqrt{\text{Var}(s_{i, t})}

Interquartile Range is a robust dispersion measure, capturing the typical variation while ignoring extremes. Defined as:

D_t^{\text{IQR}}=Q_{0.75}(s_{i,t})-Q_{0.25}(s_{i,t})

In addition to daily values, I computed 7-day rolling versions to capture more stable disagreement regimes and reduce day-to-day noise. Defined as:

D_t^{(7\text{d})}=\frac{1}{7}\sum^t_{r=t-6}D_r
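As a rough illustration of the definitions above, the following pandas sketch computes the daily measures and their 7-day rolling means. The column names (`date`, `platform`, `score`), the use of population variance (`ddof=0`), and the toy data are assumptions for the example, not the project's actual schema or code.

```python
import pandas as pd


def daily_disagreement(posts: pd.DataFrame) -> pd.DataFrame:
    """Daily disagreement features from post-level sentiment scores.

    Expects hypothetical columns: date, platform ("twitter"/"reddit"), score.
    """
    by_day = posts.groupby("date")["score"]
    daily_median = by_day.transform("median")

    out = pd.DataFrame({
        # median absolute deviation around the daily median
        "D_MAD": (posts["score"] - daily_median).abs().groupby(posts["date"]).median(),
        # population variance and standard deviation (ddof=0 is an assumption)
        "D_VAR": by_day.var(ddof=0),
        "D_STD": by_day.std(ddof=0),
        "D_IQR": by_day.quantile(0.75) - by_day.quantile(0.25),
    })

    # absolute gap between the per-platform daily means
    means = posts.pivot_table(index="date", columns="platform",
                              values="score", aggfunc="mean")
    out["D_GAP"] = (means["twitter"] - means["reddit"]).abs()

    # trailing 7-row mean; equals a 7-day mean only if there is one row per day
    rolling = out.rolling(window=7, min_periods=7).mean().add_suffix("_7D")
    return out.join(rolling)


# Toy data: 7 days, each with two polarized tweets and two Reddit posts
days = pd.date_range("2018-01-01", periods=7)
posts = pd.DataFrame({
    "date": days.repeat(4),
    "platform": ["twitter", "twitter", "reddit", "reddit"] * 7,
    "score": [1.0, 0.0, -1.0, 0.0] * 7,
})
feats = daily_disagreement(posts)
print(feats[["D_MAD", "D_GAP", "D_MAD_7D"]])
```

In the toy data the overall daily mean is zero on every day, yet every disagreement measure is positive, which is the situation the disagreement features are meant to detect.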

Overall sentiment disagreement shows a clear downward trend, indicating reduced sentiment polarization over time. The decline in MAD and IQR suggests that this convergence is structural rather than driven by outliers, since both are robust measures. However, sentiment on Twitter and Reddit occasionally diverges, producing sharp spikes in the GAP measure.

Figures: Daily sentiment disagreement plot; Normalized daily sentiment plot; 7 Day rolling sentiment disagreement; Normalized 7 day rolling sentiment disagreement.

Compute Price Instability#

To capture “instability”, I derived daily volatility measures from 1-minute Bitcoin data. First I define the close-to-close log return, which measures the relative price change between consecutive closes on a logarithmic scale. Using the close price $C$:

r_t=\log\left(\frac{C_t}{C_{t-1}}\right)

Then I computed three daily instability targets:

Realized Volatility (RV): A volatility measure computed from intraday data that captures the total price variation that occurred within the day. Summing the squared 1-minute log returns $r_i$ that fall within day $t$ gives the realized volatility:

RV_t=\sqrt{\sum_{i\in t}r_i^2}

Parkinson Volatility: A range-based daily volatility estimator that uses only the high and low prices, reflecting how wide the price range was during the day. Using the daily high $H_t$ and low $L_t$, we define:

\sigma_t^{\text{Parkinson}}=\sqrt{\frac{1}{4\ln 2}\left(\ln \frac{H_t}{L_t}\right)^2}

Absolute Daily Log Return: The absolute value of the close-to-close daily log return, which measures the magnitude of the day’s price change regardless of direction. Defined as:

\text{absret\_daily}=|\ln (C_t)-\ln (C_{t-1})|

I chose to focus on Realized Volatility (RV) because it uses all intraday information from the 1-minute bars. As a result, RV provides a more stable and informative volatility estimate that captures the full path of price fluctuations.
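The three targets can be sketched in a few lines of pandas. This is a minimal example under stated assumptions: the bars carry a `DatetimeIndex` and hypothetical columns `high`, `low`, and `close`, and the first return of each day is measured against the last close of the previous day (reasonable for a market that trades around the clock, but an assumption nonetheless).

```python
import numpy as np
import pandas as pd


def daily_instability(bars: pd.DataFrame) -> pd.DataFrame:
    """Daily volatility targets from 1-minute bars (hypothetical column names)."""
    day = bars.index.normalize()             # calendar day of each bar

    # 1-minute close-to-close log returns
    r = np.log(bars["close"]).diff()

    # realized volatility: square root of the sum of squared intraday returns
    rv = np.sqrt((r ** 2).groupby(day).sum())

    # Parkinson estimator from the daily high/low range
    high = bars["high"].groupby(day).max()
    low = bars["low"].groupby(day).min()
    parkinson = np.sqrt(np.log(high / low) ** 2 / (4 * np.log(2)))

    # absolute close-to-close daily log return
    daily_close = bars["close"].groupby(day).last()
    absret = np.log(daily_close).diff().abs()

    return pd.DataFrame({"rv": rv, "parkinson": parkinson,
                         "absret_daily": absret})


# Toy data: two days with two 1-minute bars each
idx = pd.to_datetime(["2018-01-01 00:00", "2018-01-01 00:01",
                      "2018-01-02 00:00", "2018-01-02 00:01"])
bars = pd.DataFrame({"high": [101.0, 111.0, 112.0, 122.0],
                     "low": [99.0, 100.0, 109.0, 110.0],
                     "close": [100.0, 110.0, 110.0, 121.0]}, index=idx)
vol = daily_instability(bars)
print(vol)
```

The first day has no previous close, so its `absret_daily` is missing and that row would be dropped before modelling.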

Figure: Daily instability plot

Combine into a Training Dataset#

The training dataset is built to predict next-day realized volatility from lagged volatility and lagged sentiment disagreement. With this dataset, the model's task is: “Given yesterday’s sentiment disagreement and volatility, predict today’s realized volatility”.

Exploratory data analysis shows high correlation between several of the features in the matrix. This means that some features capture the same underlying signal, so only some disagreement measures remain significant in multivariate models.
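One way to build such a dataset is to shift every predictor by one day so that each row pairs today's target with yesterday's information. The sketch below assumes one row per consecutive calendar day; the function name and the target column name `rv` are hypothetical, while `rv_lag1` follows the naming used later in this post.

```python
import pandas as pd


def build_training_frame(features: pd.DataFrame, rv: pd.Series) -> pd.DataFrame:
    """Pair today's RV with yesterday's RV and disagreement features."""
    lagged = features.shift(1)                 # values known at the end of day t-1
    frame = lagged.assign(rv_lag1=rv.shift(1), rv=rv)
    return frame.dropna()                      # drop days without a full lag history


# Toy data: three consecutive days
days = pd.date_range("2018-01-01", periods=3)
features = pd.DataFrame({"D_MAD": [0.1, 0.2, 0.3]}, index=days)
rv = pd.Series([1.0, 2.0, 3.0], index=days)
train = build_training_frame(features, rv)
print(train)
```

Shifting before fitting is what keeps the setup predictive: no row contains information from the day whose volatility it is trying to forecast.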

Dataset correlation matrix and scatter plot

Training the Model#

We now have all the features and data we need, and we train the model in the following steps:

Define Model Space:
- Generate an exhaustive subset selection over all possible disagreement features, plus autoregressive AR(1) variants, for a total of 2048 model combinations.
- The space contains models with the following names or prefixes:
  - Baseline_const → Intercept only
  - D_only_... → single disagreement feature
  - D_... → all combinations of 2..N disagreement features
  - Baseline_AR1 → baseline autoregression based on lagged RV rv_lag1
  - AR1_plus_... → AR1 + any combo of disagreement features
- For example, the space includes models such as:
  - AR1_plus_D_GAP_D_MAD_7D_D_GAP_7D
  - AR1_plus_D_STD_D_IQR_D_MAD_7D_D_GAP_7D
Build Design matrices:
- Construct the matrix $X$ for training and validation from the selected predictors, since OLS is solved via the matrix form $\hat{\beta}=(X^{\top} X)^{-1} X^{\top} y$
Fit OLS with HAC standard errors:
- Model fit:
```
import statsmodels.api as sm

# OLS point estimates with Newey-West (HAC) standard errors
res = sm.OLS(y_training, X_training).fit(
    cov_type="HAC",
    cov_kwds={"maxlags": hac_lags}  # default 7
)
```
- HAC (Newey-West) keeps the coefficients the same but corrects the standard errors for autocorrelation and heteroskedasticity
  - Standard OLS assumes the regression errors are independent and homoskedastic.
  - Under positive autocorrelation, standard OLS tends to produce standard errors that are too small, which implies false significance.
  - HAC changes the covariance matrix estimate so that it allows residuals to be correlated up to maxlags lags
Predict on the validation set.
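The steps above can be sketched end to end without statsmodels. The example below is a from-scratch illustration, not the project's code: the ten-feature list is an assumption (five measures in daily and 7-day versions, which is consistent with 2 × 2^10 = 2048 models), the exact `D_only_` labels are hypothetical, and `ols_hac` is a hand-written Newey-West estimator with Bartlett weights and no small-sample correction.

```python
from itertools import combinations

import numpy as np

# Assumed feature list: five measures, each as a daily and a 7-day version
MEASURES = ["MAD", "GAP", "VAR", "STD", "IQR"]
FEATURES = [f"D_{m}" for m in MEASURES] + [f"D_{m}_7D" for m in MEASURES]


def model_space(features):
    """Every feature subset, with and without the AR(1) term rv_lag1."""
    specs = {"Baseline_const": [], "Baseline_AR1": ["rv_lag1"]}
    for k in range(1, len(features) + 1):
        for combo in combinations(features, k):
            label = "_".join(combo)
            specs[("D_only_" if k == 1 else "") + label] = list(combo)
            specs["AR1_plus_" + label] = ["rv_lag1"] + list(combo)
    return specs


def ols_hac(X, y, maxlags=7):
    """OLS coefficients with Newey-West (Bartlett kernel) standard errors."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y                  # same point estimates as plain OLS
    u = X * (y - X @ beta)[:, None]           # per-observation scores x_t * e_t
    S = u.T @ u                               # lag-0 (heteroskedasticity) term
    for lag in range(1, maxlags + 1):
        w = 1.0 - lag / (maxlags + 1.0)       # Bartlett weight
        G = u[lag:].T @ u[:-lag]              # lag-l cross products
        S += w * (G + G.T)
    cov = xtx_inv @ S @ xtx_inv               # sandwich covariance
    return beta, np.sqrt(np.diag(cov))


specs = model_space(FEATURES)
print(len(specs))  # 2048

# Synthetic check with the Baseline_AR1 design: intercept plus rv_lag1
rng = np.random.default_rng(0)
n = 200
rv_lag1 = rng.normal(size=n)
y = 0.5 + 0.3 * rv_lag1 + rng.normal(scale=0.1, size=n)
X = np.column_stack([np.ones(n), rv_lag1])
beta, se = ols_hac(X, y, maxlags=7)
y_pred = X @ beta                             # predictions on the same toy data
```

Only the covariance matrix changes relative to plain OLS, so the ranking of models by validation error is unaffected by the HAC choice; it matters for the significance tests later on.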

Model Validation#

Using two simple error functions, we can evaluate the trained models on both the raw and the log scale:

Root Mean Squared Error (RMSE): Computes the square root of the average squared prediction error. It penalizes large errors more heavily and is good for measuring overall prediction accuracy. Defined as:

\text{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^n (y_i-\hat{y}_i)^2}

Mean Absolute Error (MAE): Computes the average absolute prediction error. It measures the typical prediction error and is less sensitive to outliers than RMSE. Defined as:

\text{MAE}=\frac{1}{n}\sum_{i=1}^n|y_i-\hat{y}_i|

The best model on the validation set under each metric:
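Both metrics are short to write in NumPy. In the sketch below, the function names are hypothetical, and the log scale is assumed to mean comparing $\log y$ with $\log \hat{y}$, which requires strictly positive values; the project may define its log-scale evaluation differently.

```python
import numpy as np


def rmse(y, y_hat):
    """Square root of the mean squared prediction error."""
    err = np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)
    return float(np.sqrt(np.mean(err ** 2)))


def mae(y, y_hat):
    """Mean absolute prediction error."""
    err = np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)
    return float(np.mean(np.abs(err)))


def evaluate(y, y_hat):
    """Both metrics on the raw scale and on the (assumed) log scale."""
    log_y, log_y_hat = np.log(y), np.log(y_hat)   # needs strictly positive inputs
    return {
        "rmse_raw": rmse(y, y_hat),
        "mae_raw": mae(y, y_hat),
        "rmse_log": rmse(log_y, log_y_hat),
        "mae_log": mae(log_y, log_y_hat),
    }


scores = evaluate([1.0, 2.0, 4.0], [1.0, 4.0, 4.0])
print(scores)
```

On the log scale an error is judged relative to the level of volatility, so a miss on a calm day can weigh as much as a larger absolute miss on a turbulent day. This is one reason the raw and log rankings can pick different models.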

RMSE (raw scale) → AR1_plus_D_GAP_D_MAD_7D_D_GAP_7D (value = 0.010210)
RMSE (log scale) → AR1_plus_D_STD_D_IQR_D_MAD_7D_D_GAP_7D (value = 0.444736)
MAE (raw scale) → AR1_plus_D_VAR_D_STD_D_IQR_D_MAD_7D_D_GAP_7D (value = 0.008572)
MAE (log scale) → AR1_plus_D_VAR_D_STD_D_IQR_D_MAD_7D_D_GAP_7D (value = 0.363641)

Figures: Plot of all models' performance on the validation set; Plot of the best models' performance on the validation set.

Hypothesis Testing#

We can now perform a hypothesis test on a single regression coefficient from our OLS model, with the following hypotheses:

Null Hypothesis: Sentiment disagreement is not associated with, and has no effect on, Bitcoin price instability.

H_0 : \beta = 0

Alternative Hypothesis: Higher sentiment disagreement predicts higher Bitcoin price instability.

H_1:\beta>0
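Because the alternative is one-sided, the p-value comes from the upper tail only. The following is a minimal sketch assuming a normal approximation for the test statistic, with the coefficient estimate and its HAC standard error taken from the fitted model; the function name and the input numbers are illustrative, not results from this analysis.

```python
from statistics import NormalDist


def one_sided_p_value(beta_hat: float, se: float) -> float:
    """P-value for H0: beta = 0 against H1: beta > 0 (normal approximation)."""
    z = beta_hat / se                      # test statistic
    return 1.0 - NormalDist().cdf(z)       # upper-tail probability


# Illustrative numbers only
p = one_sided_p_value(beta_hat=0.02, se=0.01)
print(round(p, 4))
```

When the estimate is positive, this one-sided value is half of the usual two-sided p-value; when the estimate is negative, the one-sided test cannot reject $H_0$ in favour of $\beta>0$.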

Hypothesis test for each individual coefficient and plots

Based on these results, we reject $H_0: \beta=0$ for key sentiment disagreement features such as 7-day MAD, 7-day GAP, and lagged IQR. In our regression models, higher sentiment disagreement is associated with, and predicts, higher next-day Bitcoin price instability. This is strong predictive evidence, but it is not definitive proof of causality.

Future Work#

Stronger volatility baselines: Replace AR(1) with HAR-RV to test whether sentiment disagreement adds predictive power beyond standard volatility models.
Richer polarization measures: Go beyond dispersion (MAD, IQR) by modeling sentiment bimodality, entropy, and extreme-sentiment mass.
Risk decomposition: Study whether disagreement predicts downside volatility, jumps, or crash-risk rather than smooth price variation.
Economic relevance: Integrate volatility forecasts into volatility-targeting or risk-management strategies to assess real trading value.
Robustness & generalization: Extend to longer time periods and other crypto assets (e.g., ETH) to test stability across regimes.