Learning Data Science: Why a High R^2 Can Be Misleading


A high R^2 can make a regression model look incredibly accurate – but this figure can be misleading. If you want to understand why a high R^2 is not always the sign of a good model, keep reading!

In the article Learning Data Science: Modeling Basics, we built a simple model to predict income based on age. R printed a model summary containing something called R-squared, but we haven’t yet discussed what this value actually means.

At first glance, a high R^2 seems very reassuring. In our example, the linear model achieved an R^2 of almost 90%. This looks impressive.

However, just as a high classification accuracy can be deceptive (as discussed in ZeroR: The Simplest Classifier Possible, or Why High Accuracy Can Be Deceptive), R^2 can also create a false sense of confidence.

To understand why, it helps to look at the formula itself, then revisit the three models from the previous post: the average model, the linear model, and the polynomial model.


The meaning of R^2

The coefficient of determination is defined as:

R^2 = 1 - \frac{\sum (y_i-\hat y_i)^2}{\sum (y_i-\bar y)^2}

At first glance, the formula seems intimidating, but its basic idea is relatively simple.

The denominator

SS_{tot} = \sum (y_i-\bar y)^2

measures the total variation of the target variable. It quantifies how much the observed values differ from their mean.

The numerator

SS_{res} = \sum (y_i-\hat y_i)^2

measures the unexplained error remaining after the model has been fitted.

So, R^2 measures the proportion of variation explained by the model.

An R^2 of:

  • 0 means that the model explains none of the variation,
  • 1 means that the model perfectly explains all of the variation.

It seems simple enough. The difficulty is that perfectly explaining observed data is not necessarily the same thing as building a useful predictive model.
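To make the formula concrete, here is a short sketch in Python (the original post used R; the age/income numbers below are made up purely for illustration) that computes R^2 exactly as defined above:

```python
import numpy as np

# Made-up age/income data for illustration (not the original post's dataset)
age = np.array([22, 30, 38, 45, 53, 60], dtype=float)
income = np.array([25000, 38000, 46000, 55000, 61000, 64000], dtype=float)

def r_squared(y, y_hat):
    """R^2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_hat) ** 2)      # unexplained error
    ss_tot = np.sum((y - y.mean()) ** 2)   # total variation
    return 1 - ss_res / ss_tot

# Fit a straight line income ~ age and score it
slope, intercept = np.polyfit(age, income, deg=1)
predictions = slope * age + intercept
print(round(r_squared(income, predictions), 3))
```

Perfect predictions drive SS_res to zero (R^2 = 1), while predicting the mean makes SS_res equal to SS_tot (R^2 = 0), exactly as the bullet points above describe.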


The average model

Let’s start with the simplest possible regression model.

Suppose we ignore age completely and simply predict the average income of each individual:

\hat y_i = \bar y

This is effectively the regression equivalent of ZeroR. The model does not learn any relationships.

In this case:

y_i - \hat y_i = y_i - \bar y

Thus, the residual sum of squares becomes identical to the total sum of squares:

\sum (y_i-\hat y_i)^2 = \sum (y_i-\bar y)^2

Substituting this into the formula, we get:

R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{SS_{tot}}{SS_{tot}} = 0

The model explains none of the variation in the data.

This corresponds to the underfitting case mentioned previously: the model is too simple to capture the underlying structure.
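The same result can be verified numerically; a minimal sketch with made-up income values:

```python
import numpy as np

# Made-up income values for illustration
income = np.array([25000, 38000, 46000, 55000, 61000, 64000], dtype=float)

# The "average model": predict the mean income for every individual
baseline = np.full_like(income, income.mean())

ss_res = np.sum((income - baseline) ** 2)  # identical to SS_tot here
ss_tot = np.sum((income - income.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(r2)  # 0.0
```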


The polynomial model

Now consider the opposite extreme.

Instead of fitting a straight line, suppose we fit a polynomial of sufficiently high degree. In fact, if we have n observations with distinct age values, a polynomial of degree n-1 can pass exactly through all observed data points:

y = a_0 + a_1x + a_2x^2 + \dots + a_{n-1}x^{n-1}

In this case:

y_i = \hat y_i

for all observations, which implies:

\sum (y_i-\hat y_i)^2 = 0

and therefore:

R^2 = 1

The model achieves a perfect fit.

At first glance, this seems ideal. In practice, however, such a model often performs poorly on unseen data because it has adapted not only to the underlying relationship, but also to random fluctuations and noise within the training data.

It’s the classic overfitting issue.

A perfect R^2 may therefore indicate not a particularly good model, but a model that has become too flexible.
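A quick numerical check illustrates this (made-up, roughly linear data with noise; NumPy's scaled `Polynomial.fit` is used for numerical stability): the interpolating polynomial reaches R^2 = 1 on the training points.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

# Six made-up training points: roughly linear income vs. age, plus noise
age = np.array([22, 30, 38, 45, 53, 60], dtype=float)
income = 1000 * age + rng.normal(0, 5000, size=age.size)

def r_squared(y, y_hat):
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

# Degree n-1 = 5 polynomial through n = 6 points: exact interpolation
model = Polynomial.fit(age, income, deg=age.size - 1)
train_r2 = r_squared(income, model(age))
print(round(train_r2, 6))  # 1.0 on the training data

# The perfect fit says nothing about unseen ages:
print(model(26.0))  # prediction between training points, never validated
```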


The linear model

The linear model from the previous post falls between these two extremes.

It’s simple enough to avoid memorizing every random fluctuation, but flexible enough to capture a meaningful trend in the data.

This balance between simplicity and flexibility is one of the central themes of statistical learning.

The idea was summarized in the previous post with a plot comparing the three models, and by the famous observation attributed to George Box:

“All models are wrong, but some are useful.”

The objective of modeling is therefore not to maximize complexity or to maximize R^2, but to find a model that generalizes well beyond the observed sample.


Why R^2 alone is insufficient

The main limitation of R^2 is that it evaluates the fit only on the observed data.

It does not directly measure:

  • predictive performance on unseen data,
  • robustness,
  • causal validity, or
  • ability to generalize.

As the complexity of the model increases, R^2 almost always increases as well. A sufficiently flexible model can often achieve values very close to 1 even when its predictions on new data are poor.

For this reason, practical data science relies on additional evaluation methods such as:

  • train/test splits,
  • cross-validation,
  • regularization,
  • adjusted R^2, and
  • out-of-sample testing.
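Among these, adjusted R^2 is the simplest correction: it penalizes each additional predictor. With n observations and p predictors it is defined as:

\bar R^2 = 1 - (1-R^2)\frac{n-1}{n-p-1}

so a new predictor that barely raises R^2 can still lower the adjusted value.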

The goal is not to perfectly reproduce historical observations, but to build models that remain useful in the face of new data.
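A minimal train/test sketch (synthetic data; the degrees are chosen arbitrarily) makes this concrete: training R^2 can only rise with model flexibility, so it has to be confronted with held-out data.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(42)

def r_squared(y, y_hat):
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

# Synthetic linear-plus-noise data standing in for age/income
x = rng.uniform(20, 65, size=40)
y = 1000 * x + rng.normal(0, 8000, size=40)

# Hold out the last 10 points as a test set
x_tr, y_tr = x[:30], y[:30]
x_te, y_te = x[30:], y[30:]

for degree in (1, 5, 15):
    model = Polynomial.fit(x_tr, y_tr, deg=degree)  # scaled basis for stability
    train_r2 = r_squared(y_tr, model(x_tr))
    test_r2 = r_squared(y_te, model(x_te))
    print(f"degree {degree:2d}: train R^2 = {train_r2:.3f}, test R^2 = {test_r2:.3f}")
```

Training R^2 never decreases as the degree grows; whenever the test R^2 falls behind it, the extra flexibility is fitting noise rather than structure.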

A high R^2 can therefore mean two very different things:

  • the model has identified real structure,
  • or the model has simply overfitted the training data.

Distinguishing between these possibilities is one of the central challenges of machine learning and statistical modeling.

