Datasets

Collection of common benchmark datasets from fairness research.

Each dataset object holds the actual data in a pandas.DataFrame, available as its df attribute. The dataset object takes care of loading, preprocessing, and validating the data. Preprocessing follows the standard practices associated with each dataset, taken either from its manual (e.g., README) or from how others have handled it in the literature.

See ethically.dataset.Dataset for additional attributes and complete documentation.

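The following datasets are currently available:

  • ProPublica Recidivism/COMPAS dataset, see COMPASDataset
  • Adult dataset, see AdultDataset
  • German Credit dataset, see GermanDataset
  • FICO credit score dataset, see build_FICO_dataset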

Usage

>>> from ethically.dataset import COMPASDataset
>>> compas_ds = COMPASDataset()
>>> print(compas_ds)
<ProPublica Recidivism/COMPAS Dataset. 6172 rows, 56 columns in
which {race, sex} are sensitive attributes>
>>> type(compas_ds.df)
<class 'pandas.core.frame.DataFrame'>
>>> compas_ds.df['race'].value_counts()
African-American    3175
Caucasian           2103
Hispanic             509
Other                343
Asian                 31
Native American       11
Name: race, dtype: int64

General Datasets

class ethically.dataset.Dataset(target, sensitive_attributes, prediction=None)[source]

Bases: abc.ABC

Base class for datasets.

Attributes
  • df - pandas.DataFrame that holds the actual data.
  • target - Column name of the variable to predict (ground truth).
  • sensitive_attributes - Column names of the sensitive attributes.
  • prediction - Column name(s) of the prediction (optional).
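
To illustrate how these attributes fit together with the load, preprocess, and validate steps, here is a minimal sketch of the pattern. It does not reproduce the ethically source: the class names SketchDataset and ToyDataset, the hook methods _load_data, _preprocess, and _validate, the choice to treat prediction as a list of column names, and the toy values are all assumptions made for illustration.

```python
import abc

import pandas as pd


class SketchDataset(abc.ABC):
    """Hypothetical stand-in for a dataset base class (not the ethically source)."""

    def __init__(self, target, sensitive_attributes, prediction=None):
        self.target = target
        self.sensitive_attributes = sensitive_attributes
        self.prediction = prediction  # assumed here: a list of column names

        # Load, preprocess, then validate, in that order.
        self.df = self._preprocess(self._load_data())
        self._validate()

    @abc.abstractmethod
    def _load_data(self):
        """Return the raw data as a pandas.DataFrame."""

    def _preprocess(self, df):
        """Default: no preprocessing. Subclasses may override."""
        return df

    def _validate(self):
        # Every declared column name must exist in the data.
        required = [self.target, *self.sensitive_attributes]
        if self.prediction is not None:
            required.extend(self.prediction)
        missing = [col for col in required if col not in self.df.columns]
        if missing:
            raise ValueError(f"Missing columns: {missing}")


class ToyDataset(SketchDataset):
    """Tiny in-memory dataset with made-up values."""

    def __init__(self):
        super().__init__(target="label", sensitive_attributes=["sex"])

    def _load_data(self):
        return pd.DataFrame({
            "sex": ["F", "M", "F", "M"],
            "age": [31, 45, 27, 52],
            "label": [1, 0, 1, 1],
        })


toy_ds = ToyDataset()
print(toy_ds.target, toy_ds.sensitive_attributes, toy_ds.df.shape)
```

Running validation in the constructor means a misnamed target or sensitive attribute fails at construction time rather than later during analysis.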

Available Datasets

class ethically.dataset.COMPASDataset[source]

Bases: ethically.dataset.core.Dataset

ProPublica Recidivism/COMPAS Dataset.

See Dataset for a description of the arguments and attributes.

References:
https://github.com/propublica/compas-analysis

class ethically.dataset.AdultDataset[source]

Bases: ethically.dataset.core.Dataset

Adult Dataset.

See Dataset for a description of the arguments and attributes.

References:
https://archive.ics.uci.edu/ml/datasets/adult

class ethically.dataset.GermanDataset[source]

Bases: ethically.dataset.core.Dataset

German Credit Dataset.

See Dataset for a description of the arguments and attributes.

References:
Extra

This dataset requires the use of a cost matrix (see below):

              Predicted
              1    2
Actual   1    0    1
         2    5    0

(1 = Good, 2 = Bad)

The rows represent the actual classification and the columns the predicted classification. It is worse to class a customer as good when they are bad (5), than it is to class a customer as bad when they are good (1).
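
As a rough illustration of how the cost matrix can be applied, the sketch below sums the misclassification cost over a set of predictions. The function name total_cost and the label lists are made up for this example and are not part of ethically.

```python
# Cost matrix from the text: COST[actual][predicted], with 1 = Good, 2 = Bad.
GOOD, BAD = 1, 2
COST = {
    GOOD: {GOOD: 0, BAD: 1},
    BAD: {GOOD: 5, BAD: 0},
}


def total_cost(actual, predicted):
    """Sum the misclassification cost over paired actual/predicted labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(COST[a][p] for a, p in zip(actual, predicted))


# Made-up labels for illustration only.
actual = [GOOD, GOOD, BAD, BAD, GOOD]
predicted = [GOOD, BAD, GOOD, BAD, GOOD]

print(total_cost(actual, predicted))  # 0 + 1 + 5 + 0 + 0 = 6
```

Under this matrix, a single bad customer classed as good costs as much as five good customers classed as bad, so a classifier evaluated this way is pushed toward caution.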

FICO Dataset

ethically.dataset.build_FICO_dataset()[source]

Build the FICO dataset.

Dataset of TransUnion credit scores (called TransRisk). The TransRisk score is in turn based on a proprietary model created by FICO, so these scores are often referred to as FICO scores.

The data is aggregated, i.e., there is no outcome or prediction information per individual, only summary statistics for each FICO score and race/ethnicity group.

FICO key      Meaning
totals        Number of individuals per group
cdf           Cumulative distribution function of score per group
pdf           Probability distribution function of score per group
performance   Fraction of non-defaulters per score and group
base_rates    Base rate of non-defaulters per group
base_rate     Overall base rate of non-defaulters
proportions   Fraction of individuals per group
fpr           False Positive Rate by score as threshold per group
tpr           True Positive Rate by score as threshold per group
rocs          ROC per group
aucs          ROC AUC per group
Returns: Dictionary of various aggregated statistics of the FICO credit score.
Return type: dict
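
To make the relationship between these aggregated quantities concrete, the sketch below derives a base rate and threshold-based true and false positive rates for a single group from a score distribution and per-score performance. It is not the library's implementation: the function name rates_by_threshold, the toy numbers, and the convention that a score at or above the threshold counts as a positive prediction ("non-defaulter") are all assumptions, and the library may scale or index its values differently.

```python
def rates_by_threshold(scores, pdf, performance):
    """Return (base_rate, tpr, fpr) for one group.

    scores must be sorted in ascending order.
    pdf[i] is the fraction of the group with score scores[i].
    performance[i] is the fraction of non-defaulters at that score.
    tpr[i] and fpr[i] use scores[i] as the threshold: predict
    "non-defaulter" when score >= scores[i] (an assumed convention).
    """
    # Share of the group that is a non-defaulter (positive) at each score,
    # and the share that is a defaulter (negative) at each score.
    pos_mass = [p * perf for p, perf in zip(pdf, performance)]
    neg_mass = [p * (1 - perf) for p, perf in zip(pdf, performance)]

    base_rate = sum(pos_mass)
    total_neg = sum(neg_mass)

    tpr, fpr = [], []
    for i in range(len(scores)):
        # Everything at or above the threshold is predicted positive.
        tpr.append(sum(pos_mass[i:]) / base_rate)
        fpr.append(sum(neg_mass[i:]) / total_neg)
    return base_rate, tpr, fpr


# Made-up numbers for a single hypothetical group, not real FICO data.
scores = [300, 500, 700]
pdf = [0.25, 0.5, 0.25]
performance = [0.2, 0.5, 0.8]

base_rate, tpr, fpr = rates_by_threshold(scores, pdf, performance)
print(round(base_rate, 3))
print([round(t, 3) for t in tpr])
print([round(f, 3) for f in fpr])
```

Plotting fpr against tpr across all thresholds gives the ROC curve for the group, and the area under that curve is its ROC AUC.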
References: