Why Seaborn pairplot hue='C' Isn't Hiding Column 'C'? Troubleshooting Hue Parameter Behavior
Seaborn’s pairplot is a staple in exploratory data analysis (EDA), offering a concise way to visualize relationships between multiple numeric variables in a dataset. By creating a grid of scatterplots (for pairwise relationships) and histograms/kernel density estimates (KDEs, for univariate distributions), it helps analysts quickly identify patterns, correlations, and outliers.
A common feature of pairplot is the hue parameter, which allows users to group data points by a categorical or numeric variable, encoding these groups with distinct colors. However, a frequent source of confusion arises when users set hue='C' expecting the column C to be excluded from the pair grid—only to find C still plotted as one of the variables.
In this blog, we’ll demystify this behavior, explain why hue='C' doesn’t automatically hide column C, and provide step-by-step troubleshooting to resolve the issue.
Table of Contents#
- Understanding Seaborn’s pairplot and the hue Parameter
- Why Doesn’t hue='C' Hide Column C?
- Step-by-Step Troubleshooting
- Advanced: Explicitly Excluding Columns from pairplot
- Conclusion
- References
1. Understanding Seaborn’s pairplot and the hue Parameter#
Before diving into the troubleshooting, let’s clarify how pairplot works under the hood and the role of the hue parameter.
What is pairplot?#
Seaborn’s pairplot (short for “pairwise plot”) generates a matrix of plots where each cell displays the relationship between two variables. By default:
- The x-axis and y-axis of each cell correspond to numeric columns in the input DataFrame.
- The diagonal cells show univariate distributions (histograms or KDEs) of individual variables.
- Off-diagonal cells show scatterplots (or regression lines) of pairs of variables.
Role of the hue Parameter#
The hue parameter adds a layer of grouping to the plot by color-encoding data points based on a specified column. For example:
import seaborn as sns
import pandas as pd
# Load sample dataset (iris has numeric features and a categorical 'species' column)
iris = sns.load_dataset("iris")
sns.pairplot(iris, hue="species")
Here, hue="species" colors data points by the species category, but species does not appear as an axis variable in the pair grid. Why? Because species is a categorical column (non-numeric), and pairplot only includes numeric columns in the pair grid by default.
Key Takeaway#
pairplot automatically excludes non-numeric columns from the pair grid, even if they are used as hue. However, this behavior changes if the hue column is numeric.
2. Why Doesn’t hue='C' Hide Column C?#
The confusion often stems from a misunderstanding of hue’s purpose: hue controls color encoding, not variable inclusion/exclusion in the pair grid.
The Root Cause: Numeric vs. Non-Numeric hue Columns#
pairplot builds the pair grid from the numeric columns in the DataFrame by default. What happens to C therefore depends on its dtype:
Case 1: C is Numeric#
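If C is a numeric column (e.g., integer, float), pairplot treats it as a candidate axis variable. On seaborn versions where this issue shows up, setting hue='C' does not remove it, so C appears as an axis variable, leading to plots like C vs. A, C vs. B, etc. Newer seaborn releases drop the hue column from the default variable list, so there C should only appear on the axes if vars, x_vars, or y_vars names it explicitly. If your output differs from what is described here, check your seaborn version first.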
Case 2: C is Non-Numeric (Categorical)#
If C is non-numeric (e.g., string labels, pd.Categorical, or object dtype), pairplot will exclude it from the pair grid (since only numeric columns are plotted). In this case, hue='C' will color points without C appearing as an axis variable.
Example: Numeric C is Included#
Let’s create a DataFrame where C is numeric and use it as hue:
import pandas as pd
import seaborn as sns
# Create a DataFrame with numeric columns A, B, C
data = pd.DataFrame({
"A": [1, 2, 3, 4, 5],
"B": [5, 4, 3, 2, 1],
"C": [10, 20, 30, 40, 50] # Numeric column
})
# Use C as hue
sns.pairplot(data, hue="C")
Output: The pair grid will include A, B, and C as axis variables. C will appear on both x and y axes, with points colored by C values.
Example: Non-Numeric C is Excluded#
Now, convert C to a categorical column and re-run:
# Convert C to categorical (non-numeric)
data["C"] = pd.Categorical(["Low", "Low", "Medium", "High", "High"])
# Use C as hue
sns.pairplot(data, hue="C")
Output: The pair grid now only includes A and B as axis variables. C (categorical) is excluded from the grid, even though it’s used as hue.
Why This Happens#
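Internally, pairplot builds its default variable list from the columns it judges to be numeric, which is roughly what data.select_dtypes(include="number") would return. Numeric C passes this check and is a candidate for the grid; non-numeric C fails and is excluded. Whether hue then removes C from that list depends on the seaborn version, which is why listing the variables explicitly is the most reliable fix.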
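The selection logic can be approximated with plain pandas. The helper below, default_grid_columns, is a hypothetical name and is not seaborn's actual code. Its drop_hue flag is an assumption used to model the fact that some seaborn versions remove the hue column from the default list while others keep it.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def default_grid_columns(df, hue=None, drop_hue=False):
    """Approximate pairplot's default variable list (illustrative helper, not seaborn code)."""
    numeric = [col for col in df.columns if is_numeric_dtype(df[col])]
    if drop_hue and hue in numeric:
        numeric.remove(hue)
    return numeric


frame = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [5, 4, 3, 2, 1],
    "C": [10, 20, 30, 40, 50],
})

# Numeric C, hue column kept in the list
print(default_grid_columns(frame, hue="C"))                 # ['A', 'B', 'C']
# Numeric C, hue column removed from the list
print(default_grid_columns(frame, hue="C", drop_hue=True))  # ['A', 'B']

# Categorical C never passes the numeric check
frame["C"] = pd.Categorical(["Low", "Low", "Medium", "High", "High"])
print(default_grid_columns(frame, hue="C"))                 # ['A', 'B']
```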
3. Step-by-Step Troubleshooting#
If hue='C' isn’t hiding C, follow these steps to diagnose the issue:
Step 1: Check the Data Type of C#
First, confirm if C is numeric. Use df.dtypes to check:
print(data.dtypes)
# Output (if C is numeric):
# A int64
# B int64
# C int64 # Numeric! Will be included in pairplot
# dtype: object
# Output (if C is non-numeric):
# A int64
# B int64
# C category # Non-numeric! Will be excluded
# dtype: object
Step 2: Verify pairplot’s Default Behavior#
pairplot builds the grid from the numeric columns by default. To confirm which columns qualify, list the numeric columns in your DataFrame:
print(data.select_dtypes(include="number").columns.tolist())
# If this prints ['A', 'B', 'C'], all three columns are candidates for the pair grid.
Step 3: Confirm hue Doesn’t Control Exclusion#
The hue parameter has no logic to exclude columns from the pair grid. To test this, run pairplot without hue and compare:
# Without hue: includes all numeric columns
sns.pairplot(data)
# With hue='C' (numeric): still includes all numeric columns
sns.pairplot(data, hue="C")
Both plots will include C if it’s numeric.
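The checks in the steps above can be gathered into one small report. The function name describe_hue_column and the keys of the returned dictionary are made up for this sketch; it uses only pandas and does not call seaborn.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def describe_hue_column(df, hue):
    """Hypothetical helper: summarize the facts that decide how `hue` is treated."""
    column = df[hue]
    return {
        "dtype": str(column.dtype),
        "is_numeric": bool(is_numeric_dtype(column)),
        "n_unique": int(column.nunique()),
        # Columns other than hue that would qualify as axis variables
        "other_numeric_columns": [
            col for col in df.columns
            if col != hue and is_numeric_dtype(df[col])
        ],
    }


data = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [5, 4, 3, 2, 1],
    "C": [10, 20, 30, 40, 50],
})

report = describe_hue_column(data, "C")
print(report)
```

A numeric hue column with only a handful of unique values usually represents groups, which is a hint that converting it to a categorical dtype may be appropriate.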
4. Advanced: Explicitly Excluding Columns#
To hide C (or any column) from the pair grid, explicitly specify which columns to include using the vars parameter.
Solution 1: Use vars to Select Columns#
The vars parameter lets you list the numeric columns to include in the pair grid. Exclude C by omitting it:
# Include only A and B (exclude C)
sns.pairplot(data, hue="C", vars=["A", "B"])
Solution 2: Drop C from the Variable List#
Dropping C from the DataFrame itself does not work here: pairplot expects hue to be the name of a column in the data it receives, so C has to stay in the DataFrame. Instead, build the vars list from the remaining columns:
# Keep C in the data for hue, but leave it out of the axis variables
sns.pairplot(data, hue="C", vars=data.columns.drop("C").tolist())
Solution 3: Use x_vars and y_vars#
For more control, use x_vars and y_vars to specify axes separately:
# Plot A and B on x-axis; A and B on y-axis (exclude C)
sns.pairplot(data, hue="C", x_vars=["A", "B"], y_vars=["A", "B"])
5. Conclusion#
The key takeaway is that hue in pairplot controls color encoding, not variable exclusion. If C isn’t hidden, it’s likely because C is a numeric column, and pairplot includes all numeric columns by default.
To fix this:
- Check if C is numeric (use df.dtypes).
- Explicitly exclude C using vars, x_vars, or y_vars.
- Convert C to a categorical dtype if it represents groups (this auto-excludes it from the pair grid).
By understanding pairplot’s default behavior and using explicit parameters, you can tailor the pair grid to your needs.