Correlation Between Categorical and Continuous Variables
Published on: 04/08/2025
This article explores methods for assessing the relationship between categorical and continuous variables, a common task in machine learning and data analysis. We will focus on a practical implementation using ANOVA (Analysis of Variance) to determine if there's a statistically significant difference in the mean of the continuous variable across different categories of the categorical variable.
Fundamental Concepts / Prerequisites
Before diving into the implementation, a basic understanding of the following concepts is beneficial:
- Categorical Variable: A variable that can take on one of a limited, and usually fixed, number of possible values (e.g., color, gender, country).
- Continuous Variable: A variable that can take on any value within a given range (e.g., height, temperature, salary).
- Mean: The average value of a set of numbers.
- Variance: A measure of how spread out a set of numbers is.
- ANOVA (Analysis of Variance): A statistical test that determines if the means of two or more groups are significantly different. The underlying principle is partitioning the total variance in the data into components attributable to different sources.
- F-statistic: A value calculated in ANOVA that reflects the ratio of variance between groups to variance within groups. A higher F-statistic suggests a stronger relationship.
- P-value: The probability of observing a test statistic as extreme as, or more extreme than, the observed value, assuming the null hypothesis (no difference in means) is true. A low p-value (typically below 0.05) indicates strong evidence against the null hypothesis.
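To make the variance-partitioning idea concrete, the F-statistic can be computed by hand for a tiny dataset (a minimal sketch with illustrative numbers; the groups here are not from any real measurement):

```python
import numpy as np

# Illustrative continuous values, one array per category.
groups = [np.array([10.0, 12.0]),   # category A
          np.array([15.0, 18.0]),   # category B
          np.array([20.0, 22.0])]   # category C

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Between-group sum of squares: spread of group means around the grand mean.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: spread of each point around its own group mean.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = len(groups) - 1               # k - 1
df_within = len(all_values) - len(groups)  # n - k

# F = (between-group variance) / (within-group variance)
f_statistic = (ss_between / df_between) / (ss_within / df_within)
print(f"F = {f_statistic:.2f}")  # ≈ 17.71 for these numbers
```

A large F means the group means differ by much more than the scatter inside each group would explain, which is exactly what the test formalizes.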
Implementation using ANOVA with Python (pandas & statsmodels)
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

def analyze_correlation(categorical_variable, continuous_variable):
    """
    Analyzes the correlation between a categorical and a continuous variable using ANOVA.

    Args:
        categorical_variable: A pandas Series representing the categorical variable.
        continuous_variable: A pandas Series representing the continuous variable.

    Returns:
        A dictionary containing the F-statistic and p-value from the ANOVA test.
    """
    # Create a pandas DataFrame for easier handling.
    data = pd.DataFrame({'categorical': categorical_variable,
                         'continuous': continuous_variable})

    # Perform ANOVA using statsmodels: fit an OLS model treating the
    # categorical variable as a factor, then compute the ANOVA table.
    model = ols('continuous ~ C(categorical)', data=data).fit()
    anova_table = anova_lm(model)

    # Extract the F-statistic and p-value (the first row of the table
    # corresponds to the categorical factor).
    f_statistic = anova_table['F'].iloc[0]
    p_value = anova_table['PR(>F)'].iloc[0]
    return {'f_statistic': f_statistic, 'p_value': p_value}

# Example Usage:
# Create sample data (replace with your actual data)
data = {'category': ['A', 'A', 'B', 'B', 'C', 'C'],
        'value': [10, 12, 15, 18, 20, 22]}
df = pd.DataFrame(data)
categorical_column = df['category']
continuous_column = df['value']

# Analyze the correlation
results = analyze_correlation(categorical_column, continuous_column)
print(f"F-statistic: {results['f_statistic']:.2f}")
print(f"P-value: {results['p_value']:.3f}")

# Interpret the results:
alpha = 0.05  # Significance level
if results['p_value'] < alpha:
    print("There is a statistically significant correlation between the categorical and continuous variables.")
else:
    print("There is no statistically significant correlation between the categorical and continuous variables.")
Code Explanation
The Python code uses the `statsmodels` library (together with pandas) to perform ANOVA. The `analyze_correlation` function takes two pandas Series as input, the categorical and the continuous variable, and combines them into a pandas DataFrame. An ordinary least squares (OLS) linear model is then fitted, with the formula `continuous ~ C(categorical)` modeling the continuous variable as a function of the categorical one; the `C()` wrapper tells statsmodels to treat the column as a categorical factor. The `anova_lm` function then computes the ANOVA table from this fitted model.
The F-statistic and p-value are extracted from the ANOVA table. The p-value indicates the probability of observing the data (or more extreme data) if there were no actual relationship between the categorical and continuous variables. Finally, the code checks whether the p-value is below a predefined significance level (alpha = 0.05). If it is, the null hypothesis (no relationship) is rejected, suggesting a statistically significant correlation.
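As a sanity check (a sketch, not part of the function above), the same F-statistic and p-value can be reproduced with SciPy's `stats.f_oneway`, which computes a one-way ANOVA directly from one sample per group:

```python
import pandas as pd
from scipy import stats

df = pd.DataFrame({'category': ['A', 'A', 'B', 'B', 'C', 'C'],
                   'value': [10, 12, 15, 18, 20, 22]})

# Split the continuous values into one sample per category.
samples = [group['value'].values for _, group in df.groupby('category')]

# One-way ANOVA: same F-statistic and p-value as the OLS/anova_lm route.
f_statistic, p_value = stats.f_oneway(*samples)
print(f"F-statistic: {f_statistic:.2f}")
print(f"P-value: {p_value:.3f}")
```

Agreement between the two routes is a quick way to confirm the formula and column extraction in the statsmodels version are correct.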
Complexity Analysis
The time complexity of the ANOVA test using `statsmodels` is primarily determined by the fitting of the OLS model and the subsequent ANOVA calculation. The model fitting typically has a time complexity of O(n*k^2), where 'n' is the number of data points and 'k' is the number of parameters in the model (which is related to the number of categories in the categorical variable). The ANOVA calculation itself generally takes O(k) time. Therefore, the overall time complexity is dominated by O(n*k^2). The space complexity is O(n+k) due to storing the data and the model parameters.
Alternative Approaches
One alternative approach is the Kruskal-Wallis H-test. This is a non-parametric test that can be used when the assumptions of ANOVA (e.g., normality of residuals) are not met. The Kruskal-Wallis test ranks all data points and compares the sum of ranks across categories. It is less sensitive to outliers than ANOVA but may be less powerful when the ANOVA assumptions are satisfied. Another option is the point-biserial correlation, which is simply the Pearson correlation between a continuous variable and a dichotomous variable. This approach, however, is not suitable for categorical variables with more than two categories.
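Both alternatives are available in SciPy. The sketch below applies `stats.kruskal` to the article's sample data and `stats.pointbiserialr` to a separate made-up two-group dataset (the binary/continuous values are illustrative only):

```python
import pandas as pd
from scipy import stats

df = pd.DataFrame({'category': ['A', 'A', 'B', 'B', 'C', 'C'],
                   'value': [10, 12, 15, 18, 20, 22]})
samples = [group['value'].values for _, group in df.groupby('category')]

# Kruskal-Wallis H-test: compares rank sums across categories,
# with no normality assumption on the residuals.
h_statistic, p_value = stats.kruskal(*samples)
print(f"H-statistic: {h_statistic:.2f}")
print(f"P-value: {p_value:.3f}")

# Point-biserial correlation applies only when the categorical
# variable is dichotomous (illustrative data).
binary = pd.Series([0, 0, 0, 1, 1, 1])          # e.g., group membership
values = pd.Series([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])
r, r_pvalue = stats.pointbiserialr(binary, values)
print(f"point-biserial r = {r:.2f}")
```

Note that on a sample this small the Kruskal-Wallis p-value relies on a rough chi-square approximation and may fail to reach significance even where ANOVA does, illustrating the power trade-off mentioned above.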
Conclusion
This article demonstrated a practical method for assessing the correlation between categorical and continuous variables using ANOVA. By calculating the F-statistic and p-value, we can determine the statistical significance of the relationship. Understanding these relationships is crucial for various machine learning tasks, including feature selection, data exploration, and model building. While ANOVA is a powerful tool, it's important to consider its assumptions and explore alternative approaches when appropriate.