How to Select Rows from a Pandas DataFrame with Same Values in One Column but Different in Another

In data analysis, a common task is to identify groups of rows where values in one column are identical (e.g., the same user ID or product code) but values in another column differ (e.g., different purchase dates or product categories). This scenario arises in diverse use cases:

  • Finding customers who bought the same product on different dates.
  • Identifying duplicate entries with conflicting data (e.g., same user ID but different emails).
  • Validating data consistency (e.g., ensuring a sensor ID always records the same unit).

Pandas, the go-to Python library for data manipulation, offers several efficient ways to solve this problem. In this blog, we’ll explore three methods to select such rows, with step-by-step examples, explanations, and comparisons to help you choose the best approach for your data.

Table of Contents

  1. Introduction
  2. Setting Up the Example DataFrame
  3. Methods to Select Rows with Same Values in One Column and Different in Another
  4. Comparing the Methods: When to Use Which?
  5. Handling Edge Cases
  6. Conclusion

Setting Up the Example DataFrame

To illustrate the methods, we’ll use a sample sales dataset with columns: CustomerID (unique customer identifier), Product (product purchased), and PurchaseDate (date of purchase). Let’s create the DataFrame first:

import pandas as pd  
 
# Sample data  
data = {  
    'CustomerID': [101, 101, 101, 102, 102, 103, 103, 104, 104, 104],  
    'Product': ['A', 'A', 'B', 'C', 'C', 'D', 'E', 'F', 'F', 'F'],  
    'PurchaseDate': ['2023-01-01', '2023-01-02', '2023-01-03',  # Customer 101  
                     '2023-01-01', '2023-01-01',  # Customer 102  
                     '2023-01-02', '2023-01-02',  # Customer 103  
                     '2023-01-03', '2023-01-04', '2023-01-05']  # Customer 104  
}  
 
df = pd.DataFrame(data)  
print("Original DataFrame:")  
print(df)  

Output:

   CustomerID Product PurchaseDate  
0         101       A   2023-01-01  
1         101       A   2023-01-02  
2         101       B   2023-01-03  
3         102       C   2023-01-01  
4         102       C   2023-01-01  
5         103       D   2023-01-02  
6         103       E   2023-01-02  
7         104       F   2023-01-03  
8         104       F   2023-01-04  
9         104       F   2023-01-05  

We’ll use this DataFrame to demonstrate selecting rows where:

  • CustomerID is the same, but Product differs (e.g., Customer 101 bought both A and B).
  • Product is the same, but PurchaseDate differs (e.g., Product F was bought on 2023-01-03, 04, and 05).
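Before diving into the methods, a quick per-group count shows which groups will qualify. This is a minimal sketch that rebuilds just the two relevant columns of the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'CustomerID': [101, 101, 101, 102, 102, 103, 103, 104, 104, 104],
    'Product': ['A', 'A', 'B', 'C', 'C', 'D', 'E', 'F', 'F', 'F'],
})

# Unique Products per CustomerID: groups with a count > 1 will be selected
counts = df.groupby('CustomerID')['Product'].nunique()
print(counts)
# CustomerID
# 101    2
# 102    1
# 103    2
# 104    1
# Name: Product, dtype: int64
```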

Methods to Select Rows with Same Values in One Column and Different in Another

3.1 Method 1: Using groupby() and filter()

The simplest approach is to group rows by the column with identical values (e.g., CustomerID) and filter groups where the “different” column (e.g., Product) contains more than one unique value.

Step-by-Step Explanation:

  1. Group the DataFrame by the column with identical values (let’s call this group_col).
  2. Filter groups where the “different” column (let’s call this diff_col) has more than one unique value using nunique() > 1.

Example 1: Same CustomerID, Different Product

Let’s find all rows where CustomerID is the same, but Product varies:

group_col = "CustomerID"  # Column with identical values  
diff_col = "Product"      # Column with differing values  
 
# Group by `group_col` and filter groups with >1 unique `diff_col` values  
result = df.groupby(group_col).filter(lambda x: x[diff_col].nunique() > 1)  
 
print(f"Rows where {group_col} is same but {diff_col} differs:")  
print(result)  

Output:

   CustomerID Product PurchaseDate  
0         101       A   2023-01-01  
1         101       A   2023-01-02  
2         101       B   2023-01-03  
5         103       D   2023-01-02  
6         103       E   2023-01-02  

Why this works:

  • groupby("CustomerID") groups rows by customer.
  • filter(lambda x: x["Product"].nunique() > 1) keeps only groups where Product has more than one unique value.
  • Customer 101 has Products A and B (2 unique values), and Customer 103 has D and E (2 unique values), so their rows are retained.
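Since the same one-liner works for any pair of columns, it generalizes into a small reusable helper; `rows_with_varying` is a name chosen here for illustration, not a pandas API:

```python
import pandas as pd

def rows_with_varying(df, group_col, diff_col):
    """Return rows whose `group_col` group has more than one unique `diff_col` value."""
    return df.groupby(group_col).filter(lambda g: g[diff_col].nunique() > 1)

df = pd.DataFrame({
    'CustomerID': [101, 101, 101, 102, 102, 103, 103, 104, 104, 104],
    'Product': ['A', 'A', 'B', 'C', 'C', 'D', 'E', 'F', 'F', 'F'],
})

result = rows_with_varying(df, 'CustomerID', 'Product')
print(result.index.tolist())  # [0, 1, 2, 5, 6] (customers 101 and 103)
```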

Example 2: Same Product, Different PurchaseDate

To find rows where Product is the same but PurchaseDate differs:

group_col = "Product"  
diff_col = "PurchaseDate"  
 
result = df.groupby(group_col).filter(lambda x: x[diff_col].nunique() > 1)  
 
print(f"Rows where {group_col} is same but {diff_col} differs:")  
print(result)  

Output:

   CustomerID Product PurchaseDate  
0         101       A   2023-01-01  
1         101       A   2023-01-02  
7         104       F   2023-01-03  
8         104       F   2023-01-04  
9         104       F   2023-01-05  

Why this works:

  • Product A has 2 unique PurchaseDates (2023-01-01 and 2023-01-02).
  • Product F has 3 unique PurchaseDates (2023-01-03, 04, 05).
  • These groups are filtered and returned.

3.2 Method 2: Using Self-Merge with merge()

If you need to analyze pairwise relationships (e.g., “Which two rows in the same group have different values?”), use a self-merge. This method merges the DataFrame with itself on group_col and identifies rows where diff_col values differ.

Example: Pairwise Comparisons for CustomerID and Product

Find all pairs of rows with the same CustomerID but different Products:

group_col = "CustomerID"  
diff_col = "Product"  

# Keep the original row labels as a column so each pair can be ordered later  
left = df.reset_index()  

# Merge the DataFrame with itself on `group_col`  
merged = left.merge(left, on=group_col, suffixes=("_x", "_y"))  

# Keep pairs whose `diff_col` values differ; `index_x < index_y` keeps each  
# unordered pair only once (e.g., row1-row2 but not row2-row1)  
result = merged[  
    (merged[f"{diff_col}_x"] != merged[f"{diff_col}_y"]) &  
    (merged["index_x"] < merged["index_y"])  
]  

print(f"Pairwise rows where {group_col} is same but {diff_col} differs:")  
print(result[[group_col, f"{diff_col}_x", f"{diff_col}_y", "PurchaseDate_x", "PurchaseDate_y"]])  

Output:

    CustomerID Product_x Product_y PurchaseDate_x PurchaseDate_y  
2          101         A         B     2023-01-01     2023-01-03  
5          101         A         B     2023-01-02     2023-01-03  
14         103         D         E     2023-01-02     2023-01-02  

Why this works:

  • df.reset_index() turns the original row labels into an index column so they survive the merge.
  • left.merge(left, on="CustomerID") creates all possible pairs of rows for each CustomerID.
  • merged["Product_x"] != merged["Product_y"] selects pairs with different Products.
  • merged["index_x"] < merged["index_y"] keeps each unordered pair only once (e.g., (row0, row2) but not (row2, row0)).
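Because two rows can form the same conflicting value pair more than once, it is often useful to collapse the pairwise output to the distinct conflicts per group. A sketch using the same self-merge idea on a trimmed version of the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'CustomerID': [101, 101, 101, 103, 103],
    'Product': ['A', 'A', 'B', 'D', 'E'],
})

# reset_index() keeps the original row labels so each pair can be ordered
left = df.reset_index()
merged = left.merge(left, on='CustomerID', suffixes=('_x', '_y'))

pairs = merged[
    (merged['Product_x'] != merged['Product_y']) &
    (merged['index_x'] < merged['index_y'])
]

# One row per distinct (CustomerID, Product_x, Product_y) conflict
conflicts = pairs[['CustomerID', 'Product_x', 'Product_y']].drop_duplicates()
print(conflicts.values.tolist())  # [[101, 'A', 'B'], [103, 'D', 'E']]
```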

3.3 Method 3: Using transform() with Boolean Indexing

For more control (e.g., flagging rows in the original DataFrame), use transform() to compute the number of unique diff_col values per group, then filter rows where this count exceeds 1.

Example: Flagging Rows with transform()

Compute a temporary unique_diff_count column on the original DataFrame indicating how many distinct diff_col values each row's group contains, then keep the rows where that count exceeds 1:

group_col = "CustomerID"  
diff_col = "Product"  
 
# Compute number of unique `diff_col` values per group and add as a column  
df["unique_diff_count"] = df.groupby(group_col)[diff_col].transform("nunique")  
 
# Filter rows where unique count > 1  
result = df[df["unique_diff_count"] > 1].drop(columns="unique_diff_count")  
 
print(f"Rows where {group_col} is same but {diff_col} differs:")  
print(result)  

Output:

   CustomerID Product PurchaseDate  
0         101       A   2023-01-01  
1         101       A   2023-01-02  
2         101       B   2023-01-03  
5         103       D   2023-01-02  
6         103       E   2023-01-02  

Why this works:

  • df.groupby(group_col)[diff_col].transform("nunique") adds a column unique_diff_count with the number of unique diff_col values per group.
  • df[df["unique_diff_count"] > 1] selects rows where the group has >1 unique diff_col values.
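If you want to keep the flag for later analysis instead of dropping it, the count can be stored directly as a boolean column; has_diff is a column name chosen here for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'CustomerID': [101, 101, 101, 102, 102],
    'Product': ['A', 'A', 'B', 'C', 'C'],
})

# True for rows whose CustomerID group has more than one unique Product
df['has_diff'] = df.groupby('CustomerID')['Product'].transform('nunique') > 1
result = df[df['has_diff']]
print(df['has_diff'].tolist())  # [True, True, True, False, False]
```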

Comparing the Methods: When to Use Which?

| Method | Use Case | Pros | Cons |
| --- | --- | --- | --- |
| groupby() + filter() | Most cases (simple group-level filtering) | Concise, retains original row order | Returns every row of each qualifying group |
| Self-merge | Pairwise comparisons (e.g., conflict analysis) | Shows explicit row pairs | Generates many rows; memory-heavy for large DataFrames |
| transform() + boolean indexing | Adding flags for further analysis | Integrates with the original DataFrame | Requires a temporary column (easily dropped) |

Handling Edge Cases

  • diff_col contains NaN values: by default, nunique() ignores NaN (dropna=True), so a group with [1, NaN] has nunique() = 1 and is excluded. Pass dropna=False to count NaN as a distinct value and select such groups.
  • group_col has single-row groups: Groups with only one row will have nunique() = 1 and are excluded.
  • All diff_col values are identical: Groups with nunique() = 1 (e.g., Customer 102 with Product C only) are excluded.
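The NaN behavior is easy to get wrong, so here is a quick demonstration of the dropna parameter of nunique():

```python
import pandas as pd

df = pd.DataFrame({'group': ['g1', 'g1'], 'value': [1.0, None]})

# Default dropna=True: the NaN is ignored and the group looks uniform
default_count = df.groupby('group')['value'].nunique().loc['g1']
# dropna=False counts NaN as its own value, so the group would be selected
nan_count = df.groupby('group')['value'].nunique(dropna=False).loc['g1']

print(default_count, nan_count)  # 1 2
```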

Conclusion

Selecting rows with identical values in one column and differing values in another is a critical data cleaning and analysis task. Pandas provides flexible methods to achieve this:

  • Use groupby() + filter() for simplicity and group-level filtering.
  • Use self-merge for pairwise comparisons.
  • Use transform() + boolean indexing to flag rows in the original DataFrame.

Choose the method based on your use case, and refer to the examples above to adapt the code to your dataset.
