Bias Detection with Amazon SageMaker Clarify
Let's explore bias detection using a banking marketing dataset 📊
What is SageMaker Clarify?
Amazon SageMaker Clarify is a powerful tool that helps:
- Detect bias in data and models
- Explain model predictions
- Promote transparency in machine learning
Key Capabilities:
Bias Detection:
- Pre-training data bias analysis
- Post-training model bias evaluation
- Multiple bias metrics calculation
Model Explainability:
- Feature importance
- SHAP values for explainability
- Global and local explanations
Integration:
- Seamless integration with SageMaker
- Works with various data types
- Supports multiple model types
When to Use Clarify:
- Before training: Detect data bias
- After training: Evaluate model fairness
- During deployment: Monitor for bias
- For compliance: Document fairness metrics
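In a full SageMaker workflow, these checks run as Clarify processing jobs driven by an analysis configuration file. The fragment below is an illustrative sketch of such a config for a pre-training bias check on a dataset like ours (the `label` and `facet` values are assumptions based on this dataset's `y` label and `age` column, not taken from an actual job):

```json
{
  "dataset_type": "text/csv",
  "label": "y",
  "facet": [{ "name_or_index": "age" }],
  "methods": {
    "pre_training_bias": { "methods": "all" }
  }
}
```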
Example usage of SageMaker Clarify for data bias detection 🎯
We'll analyze a banking marketing campaign dataset stored in the /tmp/ directory (remember, this is temporary storage just for our analysis session!):
# Download and prepare data
!curl -o bank-additional.zip https://sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com/autopilot/direct_marketing/bank-additional.zip
!unzip -o bank-additional.zip -d /tmp/

import pandas as pd

local_data_path = '/tmp/bank-additional/bank-additional-full.csv'
df = pd.read_csv(local_data_path)
df.columns
df.head()
📊 Initial Data Exploration
Let's visualize our data relationships:
# Create pairplot for key variables
import seaborn as sns

sns.pairplot(df[['age', 'campaign', 'pdays']])
Remember:
- Diagonal shows distributions
- Off-diagonal shows relationships between variables
- 'pdays' represents days since last contact (999 = never contacted)
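Because 999 is a sentinel value rather than a real duration, it can badly distort means and distance-based plots. One way to handle it (a sketch with made-up values, not part of the original notebook) is to mask the sentinel before computing statistics:

```python
import numpy as np
import pandas as pd

# Example frame standing in for the bank dataset (hypothetical values)
df = pd.DataFrame({'pdays': [3, 6, 999, 999, 12]})

# Replace the 999 sentinel ("never contacted") with NaN so that
# summary statistics only reflect clients who were actually contacted
df['pdays_clean'] = df['pdays'].replace(999, np.nan)

print(df['pdays'].mean())        # distorted by the sentinel
print(df['pdays_clean'].mean())  # mean over contacted clients only: 7.0
```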
📊 Campaign Success Visualization
# Visualize campaign outcomes
sns.countplot(data=df, x='y')
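The countplot typically shows far more 'no' than 'yes' outcomes. The same imbalance can be read off numerically with pandas (a quick sketch; the labels below are illustrative, not the real dataset's counts):

```python
import pandas as pd

# Illustrative labels; in the notebook this would be df['y']
y = pd.Series(['no'] * 8 + ['yes'] * 2)

# Proportion of each outcome: the positive class is clearly the minority
print(y.value_counts(normalize=True))  # 'yes' rate is 0.2 in this toy sample
```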
📊 Detailed Bias Analysis
Let's set up our bias detection:
from smclarify.bias import report

# Create age buckets
df['age_disc'] = pd.cut(df.age, bins=3, labels=['young', 'middle', 'old'])

facet_column = report.FacetColumn('age_disc')
label_column = report.LabelColumn(name='y', series=df['y'], positive_label_values=['yes'])

# Run bias report with education as confounding variable
report.bias_report(df, facet_column, label_column,
                   stage_type=report.StageType.PRE_TRAINING,
                   group_variable=df['education'])
📝 Notes: While we used other tools for visualization (seaborn, matplotlib), smclarify generates the bias report itself (from smclarify.bias import report).
🎯 Key Bias Metrics Explained:
- CDDL (Conditional Demographic Disparity in Labels)
- Middle age: 0.007 (minimal bias)
- Young: 0.028 (slight bias)
- Old: -0.035 (significant disadvantage)
- CI (Class Imbalance)
- Middle: 0.390
- Young: -0.372 (underrepresented)
- Old: 0.982 (severe imbalance!)
- DPL (Difference in Positive Proportions)
- Middle: 0.005
- Young: 0.010
- Old: -0.381 (much lower success rate)
💡 AWS ML Engineer exam tip: Know Your Metrics
CDDL (Conditional Demographic Disparity in Labels)
- Measures bias while accounting for confounding variables (like education)
- Shows if disparities persist after controlling for other factors
- Range: -1 to +1, where 0 indicates no conditional bias
- Positive values indicate favorable bias towards the facet group
- Negative values indicate unfavorable bias against the facet group
CI (Class Imbalance)
- Measures representation imbalance between groups
- Range: -1 to +1 (0 = perfect balance)
- Positive values indicate overrepresentation, negative values indicate underrepresentation
DPL (Difference in Positive Proportions in Labels)
- Compares the rate of positive outcomes between groups
- Shows if certain groups are more/less likely to get positive outcomes
JS (Jensen-Shannon Divergence)
- Measures similarity between probability distributions
- Range: 0 to 1 (0 = identical distributions)
KL (Kullback-Leibler Divergence)
- Measures how one probability distribution differs from another
- Larger values indicate greater differences
KS (Kolmogorov-Smirnov Distance)
- Maximum difference between cumulative distributions
- Range: 0 to 1 (0 = no difference)
TVD (Total Variation Distance)
- Measures maximum difference in probabilities between groups
- Range: 0 to 1 (0 = identical distributions)
LP (L-p Norm)
- Measures the magnitude of differences between distributions
- Larger values indicate greater disparity
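To build intuition for these metrics, CI, DPL, and KL can be computed by hand with pandas and numpy. The sketch below uses toy data (group sizes and outcomes are made up); note that sign conventions vary between tools, and here DPL is computed facet-minus-rest, so a negative value means the facet group succeeds less often, matching the report above:

```python
import numpy as np
import pandas as pd

# Toy data: a small facet group 'old' vs. everyone else (illustrative)
df = pd.DataFrame({
    'group':   ['old'] * 10 + ['rest'] * 90,
    'outcome': [1] * 2 + [0] * 8 + [1] * 30 + [0] * 60,
})

is_d = df['group'] == 'old'                 # facet group d
n_d, n_a = is_d.sum(), (~is_d).sum()

# CI: representation imbalance, (n_a - n_d) / (n_a + n_d)
ci = (n_a - n_d) / (n_a + n_d)              # 0.8 -> strong imbalance

# DPL: difference in positive-outcome rates (facet minus rest)
q_d = df.loc[is_d, 'outcome'].mean()        # 0.20
q_a = df.loc[~is_d, 'outcome'].mean()       # ~0.33
dpl = q_d - q_a                             # negative -> facet succeeds less

# KL divergence between the two groups' label distributions
p = np.array([q_d, 1 - q_d])
q = np.array([q_a, 1 - q_a])
kl = np.sum(p * np.log(p / q))

print(round(ci, 3), round(dpl, 3), round(kl, 3))
```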
📊 Campaign Contact Analysis
# Bucket ages into decades (e.g. 50-60, 70-80) for contact analysis
df['age_group'] = pd.cut(df.age, bins=range(10, 101, 10))

# Analyze contact patterns per age group
age_group_stats = df.groupby('age_group').agg({
    'campaign': ['count', 'mean', 'sum']
}).round(2)
age_group_stats
Key findings:
- Ages 50-60: highest average contacts (2.69)
- Ages 70-80: lowest average contacts (1.94)
- Clear age-based contact strategy differences
🛠️ Mitigation Strategies:
- Data Collection & Sampling:
- Balance age group representation
- Implement stratified sampling
- Increase older age group data
- Marketing Campaign Adjustments:
- Age-appropriate strategies
- Standardize contact frequencies
- Monitor contact rates
- Process Modifications:
- Regular bias monitoring
- Clear decision criteria
- Document justifications
- Training & Awareness:
- Staff training on age bias
- Policy development
- Regular reviews
Key Takeaways:
- Understanding Confounding Variables:
- Why we used education as a group variable
- How it helps isolate true age bias
- Impact on bias interpretation
- Practical Application:
- How to use SageMaker Clarify
- Interpreting bias reports
- Implementing mitigation strategies
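The stratified-sampling idea above can be sketched directly in pandas: draw the same fraction from every age bucket so that minority groups keep their share of the sample. This is a minimal illustration on toy data (group sizes are made up):

```python
import pandas as pd

# Toy frame with an imbalanced age mix (illustrative values)
df = pd.DataFrame({
    'age_disc': ['young'] * 60 + ['middle'] * 30 + ['old'] * 10,
    'y':        ['no', 'yes'] * 50,
})

# Plain random sampling can under-select the small 'old' group;
# sampling within each bucket keeps the 60/30/10 mix intact
sample = df.groupby('age_disc', group_keys=False).sample(
    frac=0.2, random_state=42)

print(sample['age_disc'].value_counts())
# young 12, middle 6, old 2 -> proportions preserved
```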
📝 Exam Tip: The AWS ML Engineer exam specifically tests:
- Bias identification using AWS tools
- Pre-training metric understanding
- Mitigation strategy knowledge
- Real-world application
Remember: Understanding bias isn't just about passing the exam - it's about building fair, ethical ML solutions that work for everyone!
🔬 Try It Yourself! Want to get hands-on experience with this example? Check out this notebook: Bias_metrics_usage_marketing