Datasets

Collection of common benchmark datasets from fairness research.

Each dataset object holds the actual data as a pandas.DataFrame in its df attribute. The dataset object takes care of loading, preprocessing and validating the data. The preprocessing follows the standard practices associated with each dataset, taken either from its manual (e.g., README) or from how others handled it in the literature.

See ethically.dataset.Dataset for additional attributes and complete documentation.

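Currently these are the available datasets:

  • ProPublica Recidivism/COMPAS (COMPASDataset)

  • Adult (AdultDataset)

  • German Credit (GermanDataset)

  • FICO (build_FICO_dataset)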

Usage

>>> from ethically.dataset import COMPASDataset
>>> compas_ds = COMPASDataset()
>>> print(compas_ds)
<ProPublica Recidivism/COMPAS Dataset. 6172 rows, 56 columns in
which {race, sex} are sensitive attributes>
>>> type(compas_ds.df)
<class 'pandas.core.frame.DataFrame'>
>>> compas_ds.df['race'].value_counts()
African-American    3175
Caucasian           2103
Hispanic             509
Other                343
Asian                 31
Native American       11
Name: race, dtype: int64

General Dataset

class ethically.dataset.Dataset(target, sensitive_attributes, prediction=None)[source]

Bases: abc.ABC

Base class for datasets.

Attributes
  • df - pandas.DataFrame that holds the actual data.

  • target - Column name of the variable to predict (ground truth).

  • sensitive_attributes - Column names of the sensitive attributes.

  • prediction - Column name(s) of the prediction (optional).

Available Datasets

class ethically.dataset.COMPASDataset[source]

Bases: ethically.dataset.core.Dataset

ProPublica Recidivism/COMPAS Dataset.

See Dataset for a description of the arguments and attributes.

References:

https://github.com/propublica/compas-analysis

class ethically.dataset.AdultDataset[source]

Bases: ethically.dataset.core.Dataset

Adult Dataset.

See Dataset for a description of the arguments and attributes.

References:

https://archive.ics.uci.edu/ml/datasets/adult

class ethically.dataset.GermanDataset[source]

Bases: ethically.dataset.core.Dataset

German Credit Dataset.

See Dataset for a description of the arguments and attributes.

References:

Extra

This dataset requires use of a cost matrix (see below).

              Predicted
                1   2
Actual    1 |   0   1
          2 |   5   0

(1 = Good, 2 = Bad)

The rows represent the actual classification and the columns the predicted classification. Classifying a customer as good when they are bad costs 5, which is worse than classifying a customer as bad when they are good, which costs 1.
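As an illustration of how the cost matrix can be applied, the following minimal sketch sums the cost over paired actual and predicted labels. The function name total_cost and the example labels are hypothetical and are not part of ethically.dataset; only the matrix values come from the description above.

```python
# Cost matrix from the description above: the key is (actual, predicted),
# with 1 = Good and 2 = Bad.
COST = {
    (1, 1): 0,  # good customer predicted good
    (1, 2): 1,  # good customer predicted bad
    (2, 1): 5,  # bad customer predicted good (the costly mistake)
    (2, 2): 0,  # bad customer predicted bad
}


def total_cost(actual, predicted):
    """Sum the misclassification cost over paired labels (hypothetical helper)."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(COST[(a, p)] for a, p in zip(actual, predicted))


# Hypothetical labels, for illustration only.
actual = [1, 2, 2, 1]
predicted = [1, 1, 2, 2]
print(total_cost(actual, predicted))
```

A classifier evaluated this way is penalized five times more for accepting a bad customer than for rejecting a good one, so plain accuracy can be a misleading metric on this dataset.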

FICO Dataset

ethically.dataset.build_FICO_dataset()[source]

Build the FICO dataset.

Dataset of TransUnion credit scores (called TransRisk). The TransRisk score is in turn based on a proprietary model created by FICO, so these scores are often referred to as FICO scores.

The data is aggregated, i.e., there is no outcome and prediction information per individual, only summary statistics for each FICO score and race/ethnicity group.

FICO key       Meaning
-----------    ---------------------------------------------------
totals         Number of individuals per group
cdf            Cumulative distribution function of score per group
pdf            Probability distribution function of score per group
performance    Fraction of non-defaulters per score and group
base_rates     Base rate of non-defaulters per group
base_rate      Overall base rate of non-defaulters
proportions    Fraction of individuals per group
fpr            False Positive Rate by score as threshold per group
tpr            True Positive Rate by score as threshold per group
rocs           ROC per group
aucs           ROC AUC per group

Returns

Dictionary of various aggregated statistics of the FICO credit score.

Return type

dict
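To make the relationship between the aggregated keys more concrete, the sketch below derives a base rate and threshold-based TPR/FPR values for a single group from a pdf and a performance table. It is an illustration only, not the code used by build_FICO_dataset: the function rates_by_threshold, the toy numbers, and the decision rule (everyone scoring at or above the threshold is predicted to be a non-defaulter) are all assumptions.

```python
def rates_by_threshold(pdf, performance):
    """Return (base_rate, tpr, fpr) for one group (hypothetical helper).

    pdf maps score -> fraction of the group with that score.
    performance maps score -> fraction of non-defaulters at that score.
    tpr and fpr map each score, used as a threshold, to the rate obtained
    when everyone scoring at or above it is predicted to be a non-defaulter.
    Assumes the group contains both defaulters and non-defaulters.
    """
    scores = sorted(pdf)
    # Mass of non-defaulters (positives) and defaulters (negatives) per score.
    pos = {s: pdf[s] * performance[s] for s in scores}
    neg = {s: pdf[s] * (1 - performance[s]) for s in scores}
    total_pos = sum(pos.values())  # also the group's base rate
    total_neg = sum(neg.values())
    tpr, fpr = {}, {}
    for t in scores:
        tpr[t] = sum(pos[s] for s in scores if s >= t) / total_pos
        fpr[t] = sum(neg[s] for s in scores if s >= t) / total_neg
    return total_pos, tpr, fpr


# Hypothetical toy group with three score levels.
pdf = {10: 0.25, 50: 0.5, 90: 0.25}
performance = {10: 0.2, 50: 0.5, 90: 0.8}
base_rate, tpr, fpr = rates_by_threshold(pdf, performance)
for score in sorted(pdf):
    print(score, round(tpr[score], 3), round(fpr[score], 3))
```

Plotting the fpr values against the tpr values for one group traces that group's ROC curve, which is the kind of information the rocs and aucs entries summarize.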

References: