
Let's explore bias detection using a banking marketing dataset 📊

What is SageMaker Clarify?

Amazon SageMaker Clarify is a powerful tool that helps:

  • Detect bias in data and models
  • Explain model predictions
  • Promote transparency in machine learning

Key Capabilities:

Bias Detection:

  • Pre-training data bias analysis
  • Post-training model bias evaluation
  • Multiple bias metrics calculation

Model Explainability:

  • Feature importance
  • SHAP values for explainability
  • Global and local explanations

Integration:

  • Seamless integration with SageMaker
  • Works with various data types
  • Supports multiple model types

When to Use Clarify:

  • Before training: Detect data bias
  • After training: Evaluate model fairness
  • During deployment: Monitor for bias
  • For compliance: Document fairness metrics

Example usage of SageMaker Clarify for data bias detection 🎯

We'll analyze a banking marketing campaign dataset stored in the /tmp/ directory (remember, this is temporary storage just for our analysis session!):

# Download and prepare data
import pandas as pd

!curl -o bank-additional.zip https://sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com/autopilot/direct_marketing/bank-additional.zip
!unzip -o bank-additional.zip -d /tmp/
local_data_path = '/tmp/bank-additional/bank-additional-full.csv'
df = pd.read_csv(local_data_path)
df.columns


df.head()


📊 Initial Data Exploration

Let's visualize our data relationships:

# Create pairplot for key variables
import seaborn as sns

sns.pairplot(df[['age', 'campaign', 'pdays']])


Remember:

  • Diagonal shows distributions
  • Off-diagonal shows relationships between variables
  • 'pdays' represents days since last contact (999 = never contacted)
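Because 999 is a sentinel value rather than a real gap, it can dominate the 'pdays' axis in plots and skew summary statistics. A minimal preprocessing sketch (not part of the original notebook) masks it as missing first:

```python
import pandas as pd
import numpy as np

# Toy frame standing in for the marketing data (999 = never contacted)
df = pd.DataFrame({'pdays': [999, 3, 6, 999, 12, 999]})

# Replace the sentinel with NaN so plots and stats reflect real gaps only
df['pdays_clean'] = df['pdays'].replace(999, np.nan)

print(df['pdays_clean'].mean())  # mean over actually-contacted rows: 7.0
```

With the sentinel masked, pairplots and means describe only customers who were actually contacted before.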

📈 Campaign Success Visualization

# Visualize campaign outcomes
sns.countplot(data=df, x='y')


๐Ÿ” Detailed Bias Analysis

Letโ€™s set up our bias detection:

# Import the Clarify bias-report module
from smclarify.bias import report

# Create age buckets
df['age_disc'] = pd.cut(df.age, bins=3, labels=['young', 'middle', 'old'])
facet_column = report.FacetColumn('age_disc')
label_column = report.LabelColumn(name='y', series=df['y'], positive_label_values=['yes'])

# Run bias report with education as confounding variable
report.bias_report(df, facet_column, label_column,
                   stage_type=report.StageType.PRE_TRAINING,
                   group_variable=df['education'])

๐Ÿ” Notes: While we used other tools for visualization (seaborn, matplotlib), clarify is used for the report (from smclarify.bias import report)


🎯 Key Bias Metrics Explained:

  1. CDDL (Conditional Demographic Disparity in Labels)
    • Middle age: 0.007 (minimal bias)
    • Young: 0.028 (slight bias)
    • Old: -0.035 (significant disadvantage)
  2. CI (Class Imbalance)
    • Middle: 0.390
    • Young: -0.372 (overrepresented)
    • Old: 0.982 (severe imbalance!)
  3. DPL (Difference in Positive Proportions)
    • Middle: 0.005
    • Young: 0.010
    • Old: -0.381 (much lower success rate)

💡 AWS ML Engineer exam tip: Know Your Metrics

CDDL (Conditional Demographic Disparity in Labels)

  • Measures bias while accounting for confounding variables (like education)
  • Shows if disparities persist after controlling for other factors
  • Range: -1 to +1, where 0 indicates no conditional bias
  • Positive values indicate favorable bias towards the facet group
  • Negative values indicate unfavorable bias against the facet group

CI (Class Imbalance)

  • Measures representation imbalance between groups
  • Range: -1 to +1 (0 = perfect balance)
  • Positive values mean the facet group is underrepresented; negative values mean it is overrepresented

DPL (Difference in Positive Proportions in Labels)

  • Compares the rate of positive outcomes between groups
  • Shows if certain groups are more/less likely to get positive outcomes
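CI and DPL are simple enough to compute by hand, which helps when sanity-checking a Clarify report. A sketch with toy data (the function names and the toy series are my own, not from the notebook):

```python
import pandas as pd

def class_imbalance(facet: pd.Series, group) -> float:
    """CI = (n_other - n_group) / n_total; values near +1 mean the group is rare."""
    n_group = (facet == group).sum()
    n_other = len(facet) - n_group
    return (n_other - n_group) / len(facet)

def dpl(facet: pd.Series, label: pd.Series, group, positive='yes') -> float:
    """DPL = P(positive | other facets) - P(positive | this facet)."""
    in_group = facet == group
    q_other = (label[~in_group] == positive).mean()
    q_group = (label[in_group] == positive).mean()
    return q_other - q_group

facet = pd.Series(['old'] * 2 + ['young'] * 8)
label = pd.Series(['yes', 'no'] + ['yes'] * 4 + ['no'] * 4)

print(class_imbalance(facet, 'old'))  # 0.6 -> 'old' is underrepresented
print(dpl(facet, label, 'old'))       # 0.0 -> equal positive-outcome rates
```

Here 'old' makes up only 2 of 10 rows (high CI), yet both groups convert at the same rate (DPL of 0), which shows why the two metrics answer different questions.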

JS (Jensen-Shannon Divergence)

  • Measures similarity between probability distributions
  • Range: 0 to 1 (0 = identical distributions)

KL (Kullback-Leibler Divergence)

  • Measures how one probability distribution differs from another
  • Larger values indicate greater differences

KS (Kolmogorov-Smirnov Distance)

  • Maximum difference between cumulative distributions
  • Range: 0 to 1 (0 = no difference)

TVD (Total Variation Distance)

  • Measures maximum difference in probabilities between groups
  • Range: 0 to 1 (0 = identical distributions)

LP (L-p Norm)

  • Measures the magnitude of differences between distributions
  • Larger values indicate greater disparity
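The distribution-distance metrics above (KL, JS, TVD, KS) can be reproduced in a few lines of NumPy/SciPy, which is a good way to build intuition for what the report numbers mean. A sketch with made-up label distributions for two facet groups:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import entropy

# Label distributions P(y | facet) for two groups (toy numbers)
p = np.array([0.60, 0.40])  # e.g. middle-aged: P(no), P(yes)
q = np.array([0.90, 0.10])  # e.g. older group: P(no), P(yes)

kl = entropy(p, q)                              # Kullback-Leibler divergence
js = jensenshannon(p, q) ** 2                   # scipy returns sqrt(JS); square it
tvd = 0.5 * np.abs(p - q).sum()                 # Total Variation Distance
ks = np.abs(np.cumsum(p) - np.cumsum(q)).max()  # KS distance on the CDFs

print(round(tvd, 2), round(ks, 2))  # 0.3 0.3
```

Note that KL is asymmetric (entropy(p, q) differs from entropy(q, p)), which is exactly why the symmetric JS divergence is often preferred for comparing groups.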

📊 Campaign Contact Analysis

# Analyze contact patterns per age bucket
# (assumes decade-wide 'age_group' buckets, e.g. 40-50, 50-60, built with pd.cut)
df['age_group'] = pd.cut(df.age, bins=range(10, 101, 10))
age_group_stats = df.groupby('age_group').agg({
    'campaign': ['count', 'mean', 'sum']
}).round(2)
age_group_stats


Key findings:

  • Ages 50-60: highest average contacts (2.69)
  • Ages 70-80: lowest average contacts (1.94)
  • Clear age-based contact strategy differences

๐Ÿ› ๏ธ Mitigation Strategies:

  1. Data Collection & Sampling:
    • Balance age group representation
    • Implement stratified sampling
    • Increase older age group data
  2. Marketing Campaign Adjustments:
    • Age-appropriate strategies
    • Standardize contact frequencies
    • Monitor contact rates
  3. Process Modifications:
    • Regular bias monitoring
    • Clear decision criteria
    • Document justifications
  4. Training & Awareness:
    • Staff training on age bias
    • Policy development
    • Regular reviews
  5. Understanding Confounding Variables:
    • Why we used education as a group variable
    • How it helps isolate true age bias
    • Impact on bias interpretation
  6. Practical Application:
    • How to use SageMaker Clarify
    • Interpreting bias reports
    • Implementing mitigation strategies

🎓 Exam Tip: The AWS ML Engineer exam specifically tests:

  • Bias identification using AWS tools
  • Pre-training metric understanding
  • Mitigation strategy knowledge
  • Real-world application

Remember: Understanding bias isn't just about passing the exam - it's about building fair, ethical ML solutions that work for everyone! 🌟

🔬 Try It Yourself! Want to get hands-on experience with this example? Check out this notebook: Bias_metrics_usage_marketing

