Last Updated: 07/24/2020
Machine Learning, AI, and Data Science based predictive tools are increasingly used in problems that can have a drastic impact on people's lives, in policy areas such as criminal justice, education, public health, workforce development, and social services. Recent work has raised concerns about the risk of unintended bias in these models unfairly affecting individuals from certain groups. While many bias metrics and fairness definitions have been proposed, there is no consensus on which definitions and metrics should be used in practice to evaluate and audit these systems. Further, there has been very little empirical work on using and evaluating these measures on real-world problems, especially in public policy.
Aequitas is an open-source bias audit toolkit for data scientists, machine learning researchers, and policymakers to audit machine learning models for discrimination and bias, and to make informed and equitable decisions around developing and deploying predictive tools.
Before we talk about the datasets, it is important to know what can be considered a protected variable.
Protected variable: An attribute that partitions a population into groups whose outcomes should have parity. Examples include race, gender, caste, and religion. Protected attributes are not universal; they are application-specific.
We provide two sample datasets: the COMPAS dataset and the Lending Club dataset.
The COMPAS Recidivism Risk Score Data and Analysis: the data contains the variables used by the COMPAS algorithm in scoring defendants, along with their outcomes within two years of the decision, for over 10,000 criminal defendants in Broward County, Florida. Below you can see some (not all) of the columns within the dataset.
Looking at the Lending Club dataset, the files contain complete loan data for all loans issued from 2007 through 2015, including the current loan status (Current, Late, Fully Paid, etc.) and the latest payment information. Additional features include credit scores, number of finance inquiries, address information (zip code and state), and collections, among others. There are 145 columns representing individual loan accounts. Each row corresponds to an individual loan id and member id; in the interest of privacy, the member ids have been removed from the dataset. Below you can see some (not all) of the columns within the dataset.
Note: The sample .csv files already have 'score' and 'label_value' columns. They just need to be downloaded and used for report generation.
The aequitas package can generate the report both through a web app and from the command line. It creates not only a PDF document but also a .csv file with all the values. In the simplest terms, the .csv/PDF can be thought of as a large pivot table over the data.
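For orientation, here is a minimal sketch of driving the same audit from Python rather than the web app. The filename is hypothetical, and `Group().get_crosstabs` is the documented Aequitas entry point, though exact signatures may vary slightly between versions.

```python
import pandas as pd
from aequitas.group import Group

# Load a prepared sample file (hypothetical filename) that already has
# 'score' and 'label_value' columns plus the protected attribute columns.
df = pd.read_csv("lending_club_for_aequitas.csv")

# Cross-tabulate confusion-matrix counts and group metrics for every attribute value.
group = Group()
crosstab, attribute_columns = group.get_crosstabs(df)

# Persist the raw numbers; this is essentially the large pivot table the report is built from.
crosstab.to_csv("aequitas_group_metrics.csv", index=False)
```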
The report has a fixed input format. It requires a 'score' column, a 'label_value' column, and one column for each protected attribute to be audited.
The following is a sample of this format:
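As a hedged illustration (the attribute columns below are hypothetical; only 'score' and 'label_value' are fixed names), an input table might look like this:

```python
import pandas as pd

# Hypothetical input rows: a binary 'score' (the model's decision), a binary
# 'label_value' (the observed outcome), and one column per protected attribute.
sample_input = pd.DataFrame({
    "score":          [1, 0, 1, 0],
    "label_value":    [1, 1, 0, 0],
    "home_ownership": ["MORTGAGE", "RENT", "OWN", "MORTGAGE"],
    "grade":          ["A", "C", "B", "A"],
})
```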
A sample report can be found here. The report can be divided into three main parts.
This tutorial depends heavily on metrics like accuracy, parity difference, equal odds difference, disparate impact, the Theil index, and others, so it is better to have a good understanding of them. Let's look at them now. Refer to this Wikipedia page for more information.
Ensure all protected groups have equal representation in the selected set. This criterion considers an attribute to have equal parity if every group is equally represented in the selected set. In simple terms, it tells you how the data is distributed among the categories.
For example, consider the column 'homeownership' from our Lending Club data. The available categories are ['Any', 'MORTGAGE', 'RENT', 'OTHER']. We see that they are unevenly distributed in the data.
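To see that distribution directly, a quick check along these lines can be run on the dataframe loaded earlier (the column name 'home_ownership' is an assumption; the sample file may spell it differently):

```python
# Share of each home-ownership category in the data as a whole, and within
# the selected set (score == 1). Equal parity asks the selected-set shares
# to be (roughly) equal across groups.
overall_share = df["home_ownership"].value_counts(normalize=True)
selected_share = df.loc[df["score"] == 1, "home_ownership"].value_counts(normalize=True)
print(overall_share)
print(selected_share)
```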
Ensure all protected groups are selected proportionally to their percentage of the population. This criterion considers an attribute to have proportional parity if every group is represented proportionally to its share of the population. If your desired outcome is to intervene proportionally on people from all categories, then you care about this criterion.
For example, the report shows the disparity for each 'Grade' category.
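Concretely, the disparity shown there is a ratio: each group's predicted-positive prevalence divided by that of a reference group. A hand computation might look like this (the 'grade' column name and the choice of 'A' as reference are assumptions):

```python
# Predicted-positive prevalence: the share of each grade that the model selects.
pprev = df.groupby("grade")["score"].mean()

# Disparity ratio against an assumed reference group; 1.0 means parity.
reference = "A"
print(pprev / pprev[reference])
```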
Ensure all protected groups have the same false positive rate (as the reference group). This criterion considers an attribute to have False Positive parity if every group has the same False Positive Error Rate.
For example, consider the FPR reported for the 'homeownership' column.
This means that the 'Mortgage' category has 0.7 times the false positive rate of 'Own', i.e., among true negatives the model predicts the positive class for 'Own' more often than for 'Mortgage'.
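A minimal sketch of that comparison, reusing the dataframe from above (the 'OWN' reference label and exact category spellings are assumptions):

```python
# False positive rate per group: among truly negative rows (label_value == 0),
# the share that the model nevertheless scored positive.
negatives = df[df["label_value"] == 0]
fpr = negatives.groupby("home_ownership")["score"].mean()

# Disparity relative to the reference group; a value near 0.7 for MORTGAGE
# would reproduce the reading described above.
print(fpr / fpr["OWN"])
```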
Ensure all protected groups have equally proportional false positives within the selected set (compared to the reference group). This criterion considers an attribute to have False Discovery Rate parity if every group has the same False Discovery Error Rate.
For example, again for 'homeownership', the report shows the FDR disparity.
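Following the same pattern, a sketch of the FDR computation (same assumed column name and reference group):

```python
# False discovery rate per group: among rows the model selected (score == 1),
# the share that were actually negative (label_value == 0).
selected = df[df["score"] == 1]
fdr = selected.groupby("home_ownership")["label_value"].apply(lambda y: (y == 0).mean())
print(fdr / fdr["OWN"])
```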
Ensure all protected groups have the same false negative rate (as the reference group). This criterion considers an attribute to have False Negative parity if every group has the same False Negative Error Rate.
For example, consider the FNR disparity reported for the 'term period' column.
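And the FNR counterpart (the column name 'term' is an assumption for the 'term period' attribute):

```python
# False negative rate per group: among truly positive rows (label_value == 1),
# the share that the model failed to select (score == 0).
positives = df[df["label_value"] == 1]
fnr = positives.groupby("term")["score"].apply(lambda s: (s == 0).mean())
print(fnr / fnr.iloc[0])  # disparity against an arbitrary reference group
```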
At the end of the report, both the relative values and the absolute values of the metrics are listed.
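Programmatically, those two tables correspond to the absolute group metrics from `get_crosstabs` and the disparity ratios layered on top of them. A sketch using the documented Aequitas classes (exact signatures may differ by version):

```python
from aequitas.bias import Bias
from aequitas.fairness import Fairness

# Relative values: disparity ratios computed against the largest group of each attribute.
bias = Bias()
disparity_df = bias.get_disparity_major_group(crosstab, original_df=df)

# Pass/fail parity determinations derived from those disparities.
fairness = Fairness()
fairness_df = fairness.get_group_value_fairness(disparity_df)

# The absolute values (fpr, fdr, fnr, etc.) are already present in 'crosstab'
# from get_crosstabs and are carried through into 'disparity_df' as well.
```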
In this module, we showcase how different people can read and analyze the report.
Let's start with a Data Scientist.
After reading the report, the data scientist would infer that the model is biased in several ways. Some of the possible scenarios are as follows:
Now, let's consider someone from finance. They likely do not have knowledge of the underlying model and cannot effectively change it. Some possible scenarios for them would be as follows: