How to Select Rows from a Pandas DataFrame with Same Values in One Column but Different in Another
In data analysis, a common task is to identify groups of rows where values in one column are identical (e.g., the same user ID or product code) but values in another column differ (e.g., different purchase dates or product categories). This scenario arises in diverse use cases:
- Finding customers who bought the same product on different dates.
- Identifying duplicate entries with conflicting data (e.g., same user ID but different emails).
- Validating data consistency (e.g., ensuring a sensor ID always records the same unit).
Pandas, the go-to Python library for data manipulation, offers several efficient ways to solve this problem. In this blog, we’ll explore three methods to select such rows, with step-by-step examples, explanations, and comparisons to help you choose the best approach for your data.
Table of Contents
- Introduction
- Setting Up the Example DataFrame
- Methods to Select Rows with Same Values in One Column and Different in Another
- Comparing the Methods: When to Use Which?
- Handling Edge Cases
- Conclusion
- References
Setting Up the Example DataFrame
To illustrate the methods, we’ll use a sample sales dataset with columns: CustomerID (unique customer identifier), Product (product purchased), and PurchaseDate (date of purchase). Let’s create the DataFrame first:
import pandas as pd
# Sample data
data = {
'CustomerID': [101, 101, 101, 102, 102, 103, 103, 104, 104, 104],
'Product': ['A', 'A', 'B', 'C', 'C', 'D', 'E', 'F', 'F', 'F'],
'PurchaseDate': ['2023-01-01', '2023-01-02', '2023-01-03', # Customer 101
'2023-01-01', '2023-01-01', # Customer 102
'2023-01-02', '2023-01-02', # Customer 103
'2023-01-03', '2023-01-04', '2023-01-05'] # Customer 104
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
Output:
CustomerID Product PurchaseDate
0 101 A 2023-01-01
1 101 A 2023-01-02
2 101 B 2023-01-03
3 102 C 2023-01-01
4 102 C 2023-01-01
5 103 D 2023-01-02
6 103 E 2023-01-02
7 104 F 2023-01-03
8 104 F 2023-01-04
9 104 F 2023-01-05
We’ll use this DataFrame to demonstrate selecting rows where:
- `CustomerID` is the same, but `Product` differs (e.g., Customer 101 bought both A and B).
- `Product` is the same, but `PurchaseDate` differs (e.g., Product F was bought on 2023-01-03, 04, and 05).
Methods to Select Rows with Same Values in One Column and Different in Another
3.1 Method 1: Using groupby() and filter()
The simplest approach is to group rows by the column with identical values (e.g., CustomerID) and filter groups where the “different” column (e.g., Product) contains more than one unique value.
Step-by-Step Explanation:
- Group the DataFrame by the column with identical values (call this `group_col`).
- Filter groups where the "different" column (call this `diff_col`) has more than one unique value, using `nunique() > 1`.
Example 1: Same CustomerID, Different Product
Let’s find all rows where CustomerID is the same, but Product varies:
group_col = "CustomerID" # Column with identical values
diff_col = "Product" # Column with differing values
# Group by `group_col` and filter groups with >1 unique `diff_col` values
result = df.groupby(group_col).filter(lambda x: x[diff_col].nunique() > 1)
print(f"Rows where {group_col} is same but {diff_col} differs:")
print(result)
Output:
CustomerID Product PurchaseDate
0 101 A 2023-01-01
1 101 A 2023-01-02
2 101 B 2023-01-03
5 103 D 2023-01-02
6 103 E 2023-01-02
Why this works:
- `groupby("CustomerID")` groups rows by customer.
- `filter(lambda x: x["Product"].nunique() > 1)` keeps only groups where `Product` has more than one unique value.
- Customer 101 has Products A and B (2 unique values), and Customer 103 has D and E (2 unique values), so their rows are retained.
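Before filtering, it can help to inspect the per-group unique counts directly; groups with a count greater than 1 are exactly the ones `filter()` will keep. A minimal, self-contained sketch (rebuilding the sample data so it runs standalone):

```python
import pandas as pd

# Rebuild the sample data so the snippet runs standalone
df = pd.DataFrame({
    "CustomerID": [101, 101, 101, 102, 102, 103, 103, 104, 104, 104],
    "Product": ["A", "A", "B", "C", "C", "D", "E", "F", "F", "F"],
})

# Per-group unique counts: groups with a count > 1 are the ones filter() keeps
counts = df.groupby("CustomerID")["Product"].nunique()
print(counts)  # 101 -> 2, 102 -> 1, 103 -> 2, 104 -> 1
```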
Example 2: Same Product, Different PurchaseDate
To find rows where Product is the same but PurchaseDate differs:
group_col = "Product"
diff_col = "PurchaseDate"
result = df.groupby(group_col).filter(lambda x: x[diff_col].nunique() > 1)
print(f"Rows where {group_col} is same but {diff_col} differs:")
print(result)
Output:
CustomerID Product PurchaseDate
0 101 A 2023-01-01
1 101 A 2023-01-02
7 104 F 2023-01-03
8 104 F 2023-01-04
9 104 F 2023-01-05
Why this works:
- Product A has 2 unique `PurchaseDate`s (2023-01-01 and 2023-01-02).
- Product F has 3 unique `PurchaseDate`s (2023-01-03, 04, and 05).
- These groups pass the filter and are returned.
3.2 Method 2: Using Self-Merge with merge()#
If you need to analyze pairwise relationships (e.g., “Which two rows in the same group have different values?”), use a self-merge. This method merges the DataFrame with itself on group_col and identifies rows where diff_col values differ.
Example: Pairwise Comparisons for CustomerID and Product
Find all pairs of rows with the same CustomerID but different Products:
group_col = "CustomerID"
diff_col = "Product"
# Merge the DataFrame with itself on `group_col`; reset the index first so
# each row's original position survives as an `index` column
merged = df.reset_index().merge(df.reset_index(), on=group_col, suffixes=("_x", "_y"))
# Keep pairs where `diff_col` values differ; index_x < index_y drops
# self-pairs and mirrored duplicates (e.g., row1-row2 and row2-row1)
result = merged[
    (merged[f"{diff_col}_x"] != merged[f"{diff_col}_y"]) &
    (merged["index_x"] < merged["index_y"])
]
print(f"Pairwise rows where {group_col} is same but {diff_col} differs:")
print(result[[group_col, f"{diff_col}_x", f"{diff_col}_y", "PurchaseDate_x", "PurchaseDate_y"]])
Output:
    CustomerID Product_x Product_y PurchaseDate_x PurchaseDate_y
2          101         A         B     2023-01-01     2023-01-03
5          101         A         B     2023-01-02     2023-01-03
14         103         D         E     2023-01-02     2023-01-02
Why this works:
- `df.reset_index().merge(df.reset_index(), on="CustomerID")` creates every possible pair of rows within each `CustomerID`, carrying each row's original position along as `index_x` and `index_y`.
- `merged["Product_x"] != merged["Product_y"]` selects pairs with different `Product`s.
- `merged["index_x"] < merged["index_y"]` removes self-pairs and mirrored duplicates (e.g., keeping (row0, row2) but not (row2, row0)).
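Because the same conflicting value pair can show up across several row pairs (Customer 101's A-vs-B conflict appears twice above), it is often useful to collapse the pairwise result down to the distinct conflicting combinations. A minimal, self-contained sketch using `drop_duplicates()`:

```python
import pandas as pd

# Rebuild the sample data so the snippet runs standalone
df = pd.DataFrame({
    "CustomerID": [101, 101, 101, 102, 102, 103, 103, 104, 104, 104],
    "Product": ["A", "A", "B", "C", "C", "D", "E", "F", "F", "F"],
})

# Self-merge as in the method above, keeping original row positions
merged = df.reset_index().merge(df.reset_index(), on="CustomerID", suffixes=("_x", "_y"))
pairs = merged[
    (merged["Product_x"] != merged["Product_y"])
    & (merged["index_x"] < merged["index_y"])
]

# Collapse repeated row pairs down to the distinct conflicting value pairs
conflicts = pairs[["CustomerID", "Product_x", "Product_y"]].drop_duplicates()
print(conflicts)  # (101, A, B) and (103, D, E)
```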
3.3 Method 3: Using transform() with Boolean Indexing
For more control (e.g., flagging rows in the original DataFrame), use transform() to compute the number of unique diff_col values per group, then filter rows where this count exceeds 1.
Example: Flagging Rows with transform()
Add a helper column unique_diff_count to the original DataFrame holding the number of unique diff_col values in each row's group, then keep rows where the count exceeds 1:
group_col = "CustomerID"
diff_col = "Product"
# Compute number of unique `diff_col` values per group and add as a column
df["unique_diff_count"] = df.groupby(group_col)[diff_col].transform("nunique")
# Filter rows where unique count > 1
result = df[df["unique_diff_count"] > 1].drop(columns="unique_diff_count")
print(f"Rows where {group_col} is same but {diff_col} differs:")
print(result)
Output:
CustomerID Product PurchaseDate
0 101 A 2023-01-01
1 101 A 2023-01-02
2 101 B 2023-01-03
5 103 D 2023-01-02
6 103 E 2023-01-02
Why this works:
- `df.groupby(group_col)[diff_col].transform("nunique")` computes the number of unique `diff_col` values per group, aligned to the original rows, and stores it in `unique_diff_count`.
- `df[df["unique_diff_count"] > 1]` selects rows whose group has more than one unique `diff_col` value.
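If you want to keep the flag around for later analysis rather than filtering immediately, the same `transform()` result can be stored as a boolean column instead of a count (the column name `has_diff` here is just an illustrative choice):

```python
import pandas as pd

# Rebuild the sample data so the snippet runs standalone
df = pd.DataFrame({
    "CustomerID": [101, 101, 101, 102, 102, 103, 103, 104, 104, 104],
    "Product": ["A", "A", "B", "C", "C", "D", "E", "F", "F", "F"],
})

# Store a per-row boolean flag instead of filtering right away
df["has_diff"] = df.groupby("CustomerID")["Product"].transform("nunique") > 1
print(df)  # True for customers 101 and 103, False otherwise
```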
Comparing the Methods: When to Use Which?
| Method | Use Case | Pros | Cons |
|---|---|---|---|
| `groupby()` + `filter()` | Most cases (simple group-level filtering) | Concise, retains original row order | Returns all group rows (even duplicates) |
| Self-merge | Pairwise comparisons (e.g., conflict analysis) | Shows explicit row pairs | Generates many rows; memory-heavy for large DataFrames |
| `transform()` + boolean indexing | Adding flags for further analysis | Integrates with original DataFrame | Requires a temporary column (easily dropped) |
Handling Edge Cases
- `diff_col` contains NaN values: by default, `nunique()` ignores NaN, so a group with values `[1, NaN]` has `nunique() = 1` and is excluded. Pass `nunique(dropna=False)` if you want NaN to count as a distinct value.
- `group_col` has single-row groups: groups with only one row have `nunique() = 1` and are excluded.
- All `diff_col` values are identical: groups with `nunique() = 1` (e.g., Customer 102, who only bought Product C) are excluded.
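The NaN behavior is easy to verify on a toy frame; the two calls below differ only in the `dropna` argument to `nunique()`:

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({"g": ["a", "a", "b", "b"], "v": [1.0, np.nan, 2.0, 2.0]})

# Default: NaN is ignored, so group "a" counts only one unique value
default_counts = toy.groupby("g")["v"].nunique()
# dropna=False: NaN counts as a distinct value, so group "a" now counts two
with_nan = toy.groupby("g")["v"].nunique(dropna=False)
print(default_counts.tolist(), with_nan.tolist())  # [1, 1] [2, 1]
```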
Conclusion
Selecting rows with identical values in one column and differing values in another is a critical data cleaning and analysis task. Pandas provides flexible methods to achieve this:
- Use `groupby()` + `filter()` for simplicity and group-level filtering.
- Use a self-merge for pairwise comparisons.
- Use `transform()` + boolean indexing to flag rows in the original DataFrame.
Choose the method based on your use case, and refer to the examples above to adapt the code to your dataset.