Biases in Predicting Recidivism: How COMPAS Failed and Alternate Models – Team 2
Abstract
Northpointe created a recidivism measure aimed at classifying individuals as low or high risk. However, its models and measures are biased against Black defendants. Our group aimed to eliminate this bias by analyzing 8 different classification models. We found that Quadratic Discriminant Analysis (QDA) yielded the best misclassification rates, but was still biased. QDA captured the trends in the data well; however, the justice department's dataset may be skewed in a way that leads data scientists to conclude that models are biased.
Introduction and Description
We decided to look at the ProPublica COMPAS data set. COMPAS is a scoring model developed by Northpointe that is used to predict the likelihood that a criminal will recidivate. Such a model is typically used to set bail amounts and to inform decisions about the placement and treatment of offenders (Angwin et al. 2016). In 2016, ProPublica's analysis uncovered that, despite being marketed as fair across all gender and race groups, the COMPAS algorithm incorrectly flagged certain groups as recidivating more often. Specifically, they found that Black defendants who did not recidivate were flagged 45% of the time, versus only 23% of White defendants (Angwin et al. 2016). Further, 48% of White defendants who did recidivate were marked as low risk, compared to 28% of Black defendants. Even after controlling for other variables, they found that Black defendants were 77% more likely to be flagged for a violent crime than White defendants (Angwin et al. 2016). It is clear that the COMPAS algorithm is biased against Black defendants, with a number of negative implications that White defendants are less likely to experience. Our group's goal is to find a classification algorithm that correctly predicts recidivism while having higher accuracy, and less bias across racial groups, than the COMPAS algorithm.
We chose to look at classification accuracy because the issue with COMPAS was its high bias against Black defendants in particular. Higher flagged rates of recidivism and violent recidivism for Black defendants compared to White defendants led us to treat this as a misclassification problem: Black defendants were being misclassified at higher rates than White defendants. We also calculated the concordance rate of our chosen model. Concordance is defined as the "probability that a randomly selected subject who experienced the outcome will have a higher predicted probability of having the outcome occur than a randomly selected subject who did not experience the outcome" (Logan 2013). Northpointe used concordance as the accuracy measure for its Cox model, so we believed concordance was the only way to compare our model's predictive accuracy directly with COMPAS.
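To make the concordance definition concrete, here is a minimal sketch (not Northpointe's implementation): for every pair consisting of one subject who experienced the outcome and one who did not, count how often the first received the higher predicted probability.
concordance <- function(probs, outcome) {
  p1 <- probs[outcome == 1]  # predicted probabilities for subjects with the outcome
  p0 <- probs[outcome == 0]  # predicted probabilities for subjects without it
  comp <- outer(p1, p0, ">") + 0.5 * outer(p1, p0, "==")  # ties count as half
  mean(comp)  # proportion of concordant pairs
}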
We begin our report by discussing Exploratory Data Analysis.
Exploratory Data Analysis
We produced a number of plots and statistical tests to determine the distributions of the ProPublica data set. To remain consistent across both analyses, we used ProPublica's provided database and eliminated the same observations that ProPublica did. We also deleted several other columns (e.g., first and last name) because our group determined they would not be useful predictors of recidivism. We also deleted the COMPAS columns, as we are not interested in using COMPAS data; we are trying to create a better model, separate from COMPAS. In this exploratory analysis, we found the following:
#Packages used throughout this report
library(dplyr)     # mutate, %>%, filter, group_by
library(ggplot2)   # plots
library(gridExtra) # grid.arrange
library(glmnet)    # lasso / cv.glmnet
library(Boruta)    # Boruta feature selection
library(caret)     # train, trainControl
library(klaR)      # NaiveBayes
library(MASS)      # lda, qda
library(ROCR)      # prediction, performance
#Read in Recidivism Data
data = read.csv("recid_data.csv")
#Data Preparation
data <- mutate(data, sex = as.factor(sex)) %>%
within(sex <- relevel(sex, ref = 2)) %>%
mutate(race = as.factor(race)) %>%
within(race <- relevel(race, ref = 3)) %>%
mutate(age_cat = as.factor(age_cat)) %>%
mutate(c_charge_degree = as.factor(c_charge_degree)) %>%
mutate(is_recid = as.factor(is_recid)) %>%
mutate(is_violent_recid = as.factor(is_violent_recid))
#Removing Extraneous Info
data = subset(data, select = -c(X, person_id, age_cat, dob, name, first, last,
c_arrest_date, c_offense_date, r_case_number,
compas_screening_date, r_offense_date, r_jail_in, r_jail_out, vr_case_number,
vr_offense_date, decile_score, num_r_cases, c_days_from_compas,
r_charge_degree, r_days_from_arrest, vr_charge_degree))
#Remove problem c_charge_degree values (levels too rare to survive a random split)
bad.degrees <- c("(CO3)", "(CT)", "(TCX)", "(X)", "(F5)", "(F6)", "(NI0)")
data = data[!(data$c_charge_degree %in% bad.degrees), ]
rm(list=setdiff(ls(), "data"))
Correlation between Variables
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...) {
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y, use = "complete.obs"))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r)
}
pairs(data[,c("age", "juv_fel_count","juv_misd_count","juv_other_count", "priors_count","is_recid","is_violent_recid","count.prior.jail","count.prior.prison","timeinjail","timeinprison" )], lower.panel = panel.cor)
We did not believe there were any strong collinearity concerns in these data. We defined strong collinearity as a pairwise correlation of .70 or greater. Our pairs plot revealed no correlations at that level, so we left all 14 of our variables in the data set.
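The same .70 screen can also be run programmatically; a small sketch on the numeric columns from the pairs plot above (the variable list is illustrative):
num.vars <- c("age", "juv_fel_count", "juv_misd_count", "juv_other_count",
              "priors_count", "count.prior.jail", "count.prior.prison",
              "timeinjail", "timeinprison")
cors <- cor(data[, num.vars], use = "complete.obs")
which(abs(cors) >= .70 & upper.tri(cors), arr.ind = TRUE)  # flags any pair at or above .70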
EDA – Two Year Nonviolent Recidivism
#Sex
ggplot(data, aes(x = sex, fill = as.factor(is_recid))) + geom_bar(position = 'dodge') +
scale_fill_discrete(name = "Recidivated", labels = c("No", "Yes")) +
xlab('Sex') +
ylab('Number of People in Bin')
data$is_recid <- as.numeric(data$is_recid)
data$is_recid[data$is_recid == 1] <- 0
data$is_recid[data$is_recid == 2] <- 1
t.test(is_recid ~ sex, data)
##
## Welch Two Sample t-test
##
## data: is_recid by sex
## t = 10.295, df = 3451.2, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.09116776 0.13406283
## sample estimates:
## mean in group Male mean in group Female
## 0.3529714 0.2403561
A greater proportion of males (M = .353) recidivate than females (M = .240), and this is statistically significant at the .05 level.
#Race
ggplot(data, aes(x = race, fill = as.factor(is_recid))) + geom_bar(position = 'dodge') +
scale_fill_discrete(name = "Recidivated", labels = c("No", "Yes")) +
xlab('Race') +
ylab('Number of People in Bin')
## Df Sum Sq Mean Sq F value Pr(>F)
## race 5 39.8 7.952 36.63 <2e-16 ***
## Residuals 9807 2129.1 0.217
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
race | is_recid |
---|---|
Caucasian | 0.28 |
African-American | 0.39 |
Asian | 0.21 |
Hispanic | 0.25 |
Native American | 0.28 |
Other | 0.24 |
## Df Sum Sq Mean Sq F value Pr(>F)
## race 4 1.4 0.3461 1.768 0.132
## Residuals 4904 960.0 0.1958
African-Americans have the greatest proportion of individuals recidivating (M = .390) of any racial group, and the race effect is statistically significant at the .05 level. If we remove African-Americans from the data set, the effect is no longer statistically significant.
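The code behind these two ANOVA tables is not echoed above; a hedged sketch of what likely produced them (the same pattern applies to the violent-recidivism tests below):
summary(aov(is_recid ~ race, data))  # all races
summary(aov(is_recid ~ race, subset(data, race != "African-American")))  # African-Americans removed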
#Age
ggplot(data, aes(x = age, fill = as.factor(is_recid))) + geom_histogram(position = 'fill') +
scale_fill_discrete(name = "Recidivated", labels = c("No", "Yes")) +
xlab('Age') +
ylab('Probability of Recidivating')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 8 rows containing missing values (geom_bar).
Younger people appear more likely to recidivate than older people.
#Priors Count
ggplot(data, aes(x = priors_count, fill = as.factor(is_recid))) + geom_histogram(position = 'fill') +
scale_fill_discrete(name = "Recidivated", labels = c("No", "Yes")) +
xlab('Number of Priors') +
ylab('Probability of Recidivating')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing missing values (geom_bar).
#Juvenile Counts
juv.fel.plot <- data %>% dplyr::group_by(juv_fel_count) %>% dplyr::summarize(mean = mean(is_recid)) %>%
ggplot(aes(x = as.factor(juv_fel_count), y = mean)) +
geom_bar(stat = 'identity') +
ylab('Proportion Recidivated') +
xlab('Number of Juvenile Felonies')
juv.misd.plot <- data %>% dplyr::group_by(juv_misd_count) %>% dplyr::summarize(mean = mean(is_recid)) %>%
ggplot(aes(x = as.factor(juv_misd_count), y = mean)) + geom_bar(stat = 'identity') +
ylab('Proportion Recidivated') +
xlab('Number of Juvenile Misdemeanors')
juv.other.plot <- data %>% dplyr::group_by(juv_other_count) %>% dplyr::summarize(mean = mean(is_recid)) %>%
ggplot(aes(x = as.factor(juv_other_count), y = mean)) + geom_bar(stat = 'identity') +
ylab('Proportion Recidivated') +
xlab('Number of Juvenile Others')
grid.arrange(juv.fel.plot, juv.misd.plot, juv.other.plot, ncol = 2)
It appears that the higher the number of priors, the more likely a person is to recidivate. There appears to be no clear relationship between the number of juvenile priors and the probability of recidivating.
#Time in Jail / Prison
jail.plot <- ggplot(data, aes(x = timeinjail, fill = as.factor(is_recid))) + geom_histogram(position = 'fill') +
xlab('Median Days in Jail') +
ylab('Probability of Recidivating') +
scale_fill_discrete(name = "Recidivated", labels = c("No", "Yes"))
prison.plot <- ggplot(data, aes(x = timeinprison, fill = as.factor(is_recid))) + geom_histogram(position = 'fill') +
xlab('Median Days in Prison') +
ylab('Probability of Recidivating') +
scale_fill_discrete(name = "Recidivated", labels = c("No", "Yes"))
grid.arrange(jail.plot, prison.plot)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 10 rows containing missing values (geom_bar).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 16 rows containing missing values (geom_bar).
Similarly, there appears to be no convincing relationship between the number of days spent in prison or jail and the likelihood of recidivating.
EDA – Violent Recidivism
data$is_violent_recid <- as.numeric(data$is_violent_recid)
data$is_violent_recid[data$is_violent_recid == 1] <- 0
data$is_violent_recid[data$is_violent_recid == 2] <- 1
#Sex
ggplot(data, aes(x = sex, fill = as.factor(is_violent_recid))) + geom_bar(position = 'dodge') +
scale_fill_discrete(name = "Violent Recidivated", labels = c("No", "Yes")) +
xlab('Sex') +
ylab('Number of People in Bin')
##
## Welch Two Sample t-test
##
## data: is_violent_recid by sex
## t = 6.7221, df = 4042.2, p-value = 2.041e-11
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.02699855 0.04923146
## sample estimates:
## mean in group Male mean in group Female
## 0.08509819 0.04698318
Males (M = .085) violently recidivate more often than females (M = .047) and this is statistically significant at the .05 level.
#Race
ggplot(data, aes(x = race, fill = as.factor(is_violent_recid))) + geom_bar(position = 'dodge') +
scale_fill_discrete(name = "Violent Recidivated", labels = c("No", "Yes")) +
xlab('Race') +
ylab('Number of People in Bin')
## Df Sum Sq Mean Sq F value Pr(>F)
## race 5 4.1 0.8273 11.67 2.89e-11 ***
## Residuals 9807 695.3 0.0709
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
race | is_violent_recid |
---|---|
Caucasian | 0.06 |
African-American | 0.10 |
Asian | 0.06 |
Hispanic | 0.05 |
Native American | 0.14 |
Other | 0.07 |
## Df Sum Sq Mean Sq F value Pr(>F)
## race 4 0.36 0.09113 1.678 0.152
## Residuals 4904 266.32 0.05431
African Americans (M = .10) and Native Americans (M = .14) violently recidivate more often than other races, and this is statistically significant at the .05 level. If we remove African-Americans from the data set, these differences are no longer statistically significant.
#Age Category
ggplot(data, aes(x = age, fill = as.factor(is_violent_recid))) + geom_histogram(position = 'fill') +
scale_fill_discrete(name = "Violent Recidivated", labels = c("No", "Yes")) +
xlab('Age') +
ylab('Probability of Violently Recidivating')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 8 rows containing missing values (geom_bar).
Younger individuals appear more likely to violently recidivate than older individuals.
#Priors Count
ggplot(data, aes(x = priors_count, fill = as.factor(is_violent_recid))) + geom_histogram(position = 'fill') +
scale_fill_discrete(name = "Violent Recidivated", labels = c("No", "Yes")) +
xlab('Number of Priors') +
ylab('Probability of Recidivating')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing missing values (geom_bar).
#Juvenile Counts
juv.fel.plot <- data %>% dplyr::group_by(juv_fel_count) %>% dplyr::summarize(mean = mean(is_violent_recid)) %>%
ggplot(aes(x = as.factor(juv_fel_count), y = mean)) + geom_bar(stat = 'identity') +
ylab('Proportion Recidivated') +
xlab('Number of Juvenile Felonies')
juv.misd.plot <- data %>% dplyr::group_by(juv_misd_count) %>% dplyr::summarize(mean = mean(is_violent_recid)) %>%
ggplot(aes(x = as.factor(juv_misd_count), y = mean)) + geom_bar(stat = 'identity') +
ylab('Proportion Recidivated') +
xlab('Number of Juvenile Misdemeanors')
juv.other.plot <- data %>% dplyr::group_by(juv_other_count) %>% dplyr::summarize(mean = mean(is_violent_recid)) %>%
ggplot(aes(x = as.factor(juv_other_count), y = mean)) + geom_bar(stat = 'identity') +
ylab('Proportion Recidivated') +
xlab('Number of Juvenile Others')
grid.arrange(juv.fel.plot, juv.misd.plot, juv.other.plot, ncol = 2)
The higher the number of priors, the more likely it appears that a person is to violently recidivate; the same pattern holds for the number of juvenile priors.
#Time in Jail / Prison
jail.plot <- ggplot(data, aes(x = timeinjail, fill = as.factor(is_violent_recid))) + geom_histogram(position = 'fill') +
xlab('Median Days in Jail') +
ylab('Probability of Recidivating') +
scale_fill_discrete(name = "Violent Recidivated", labels = c("No", "Yes"))
prison.plot <- ggplot(data, aes(x = timeinprison, fill = as.factor(is_violent_recid))) + geom_histogram(position = 'fill') +
xlab('Median Days in Prison') +
ylab('Probability of Recidivating') +
scale_fill_discrete(name = "Violent Recidivated", labels = c("No", "Yes"))
grid.arrange(jail.plot, prison.plot)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 10 rows containing missing values (geom_bar).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 16 rows containing missing values (geom_bar).
There is no strong observable relationship between the number of days spent in prison or jail and the likelihood of violently recidivating.
Methodology
For our model selection, we analyzed the use of 8 different models (for both Two Year and Violent Recidivism), resulting in a total of 16 different models that we explored. We used the following models in our exploration of the data:
- Logistic Regression
- Naive Bayes Classifier
- Linear Discriminant Analysis (LDA)
- Quadratic Discriminant Analysis (QDA)
- Random Forest
- Neural Network
- K-Nearest Neighbors
- Linear Regression
Summary
Each group member was assigned two models. We looked at each pair of models for both Two-Year Recidivism and Violent Recidivism, so each person created 4 different models, comparing them within their own analysis and across each group member's analysis. Each group member also used their corresponding model's feature selection, or another feature selection method they thought fit their model best. In the end, we compared 16 models, with the QDA model having the lowest misclassification rate. This QDA model is the one we base our predictions and bias findings on.
We start with Logistic Regression and Naive Bayes.
JOHN – Logistic Regression and Naive Bayes Classifier
Before I started, I removed problematic values from the 'c_charge_degree' variable. When splitting into training and test sets for the Validation Set approach, the trained models would encounter levels of this variable they had never seen before. This was always due to only a handful of people having a specific c_charge_degree, so we thought it best to remove those levels; a sketch of how they can be spotted follows.
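A hedged sketch of flagging such levels in the raw data, before the removal above (the count threshold of 6 is an illustrative cutoff):
counts <- table(data$c_charge_degree)  # run on the raw data, before removal
names(counts[counts <= 6])  # levels rare enough to land entirely in one split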
We settled on the use of 14 variables in our data set. For my section, I did not want to assume that using all 14 variables would give the best model. In my research, I found that best subsets, forwards selection, and backwards selection are not considered appropriate today. Flom and Castell (2013) suggest that these subset-selection methods produce standard errors that are too small and confidence intervals that are too narrow, and that they can exacerbate any collinearity problems. For something as important as predicting recidivism, I believed LASSO modeling and feature selection would be appropriate, and this is indeed what Flom and Castell recommend for model selection. I also utilized the R package boruta, which finds the most important predictors for the model via Random Forest. Use of these predictors can result in overfitting, but they are included for completeness. As such, I generally fit a full model, a model with lasso-selected features, and a model with boruta-selected features, and compared all three.
In the analysis, I took advantage of the validation set approach, but also fit models with cross validation using the caret package. Accuracy measures were better (i.e., higher) in the CV models. However, both the Validation Set Approach and Cross Validation selected the same models for both Two Year and Violent Recidivism.
Two Year Recidivism
Preparing Lasso Feature Selection
Feature Selection Two Year Recidivism
# Split data into test and train
# (x, y, and grid are built in the "Predicting Two Year Recidivism" chunk below)
train <- sample(1:nrow(x), nrow(x)/2)
test <- (-train)
y.test <- y[test]
cv.logit <- cv.glmnet(x[train,], y[train], alpha = 1, family = "binomial")
plot(cv.logit)
#Logistic Lasso Model
lasso.model <- glmnet(x[train, ], y[train], alpha = 1, family = "binomial", lambda = grid)
lasso.coef <- predict(lasso.model, type = 'coefficients', s = cv.logit$lambda.min)
lasso.coef[lasso.coef != 0] #Includes Sex, Race, age, juv_misd_count, priors_count, is_violent_recid, count.prior.jail
## <sparse>[ <logic> ] : .M.sub.i.logical() maybe inefficient
## [1] -2.11657335 -0.04844816 0.08518812 -0.01484361 0.03407172 3.80717910
## [7] 0.78552549
lasso.preds <- predict(lasso.model, s = cv.logit$lambda.min, newx = x[test, ], type = "response")
#Boruta Feature Selection
y <- data$is_recid
x <- subset(data, select = -c(is_recid))
boruta_output <- Boruta(x, y)
boruta_output$finalDecision #All Variables, except c_charge_degree are important
## sex race age juv_fel_count
## Confirmed Confirmed Confirmed Confirmed
## juv_misd_count juv_other_count priors_count c_charge_degree
## Confirmed Confirmed Confirmed Rejected
## is_violent_recid count.prior.jail count.prior.prison timeinjail
## Confirmed Confirmed Confirmed Confirmed
## timeinprison
## Confirmed
## Levels: Tentative Confirmed Rejected
I used both a Logistic Lasso and the Boruta feature-selection package to determine the important predictors to include in a model. The Boruta package utilizes Random Forests to select the most important predictors. The Logistic Lasso kept only Sex, Race, Age, juv_misd_count, priors_count, is_violent_recid, and count.prior.jail. The Boruta method selected all variables except c_charge_degree; however, caution with this model is warranted due to the risk of overfitting.
Predicting Two Year Recidivism – Logistic Regression
#Two Year Recidivism -- Logistic and Naive Bayes
# Define x matrix and y vector for use with glmnet
x <- model.matrix(is_recid~., data)[,-1]
y <- data$is_recid
grid=10^seq(10,-2, length =100) #Need a preselected grid of lambdas -- glmnet won't work with a single lambda value
#Create Sample Sets
train <- sample(1:nrow(data), nrow(data)/2)
data.train <- data[train, ]
data.test <- data[-train, ]
#Train Logistic Regression, with features selected by Boruta via Importance
logit.all <- glm(is_recid~. - c_charge_degree, data = data.train, family = "binomial")
logit.all.probs <- logit.all %>% predict(data.test, type = "response")
log.all.pred <- rep("No Recid", nrow(data.test))
log.all.pred[logit.all.probs > .5] <- "Yes Recid"
log.all.table <- table(log.all.pred, data.test$is_recid)
log.all.error <- (log.all.table[1, 2] + log.all.table[2, 1])/nrow(data.test)
#Train Normal Logistic Regression, with features selected by Logistic Lasso
#Predict Logistic Probabilities
logit <- glm(is_recid~sex + race + age + juv_misd_count + priors_count +
is_violent_recid + count.prior.jail, data = data.train, family = "binomial")
logit.probs <- logit %>% predict(data.test, type = "response")
#Create Tables for Logistic and Lasso-Logistic Models
log.pred <- rep("No Recid", nrow(data.test))
lass.pred <- rep("No Recid", nrow(data.test))
log.pred[logit.probs > .5] <- "Yes Recid"
lass.pred[lasso.preds > .5] <- "Yes Recid"
#Find Error Rates
log.table <- table(log.pred, data.test$is_recid)
log.error <- (log.table[1, 2] + log.table[2, 1])/nrow(data.test)
lass.table <- table(lass.pred, data.test$is_recid)
lass.error <- (lass.table[1, 2] + lass.table[2, 1])/nrow(data.test)
#Compare Logistic Models
log.error < log.all.error
## [1] TRUE
## [1] TRUE
Comparing the misclassification errors, I find that the logistic regression with lasso-selected features has the lowest misclassification rate.
Predicting Two Year Recidivism – Naive Bayes Classifier
#Preparing Naive Bayes
#Naive Bayes -- Features Selected by Lasso
nb.model <- NaiveBayes(data = data.train, is_recid~sex + race + age + juv_misd_count + priors_count +
is_violent_recid + count.prior.jail, usekernel = TRUE)
nb.pred <- nb.model %>% predict(data.test)
#Naive Bayes -- Features selected by Boruta
nb.all.model <- NaiveBayes(data = data.train, is_recid~. - c_charge_degree, usekernel = TRUE)
nb.pred.all <- nb.all.model %>% predict(data.test)
#Calculate Naive Bayes Error
nb.table <- table(nb.pred$class, data.test$is_recid)
nb.error <- (nb.table[1, 2] + nb.table[2, 1])/nrow(data.test)
nb.table.all <- table(nb.pred.all$class, data.test$is_recid)
nb.error.all <- (nb.table.all[1, 2] + nb.table.all[2, 1])/nrow(data.test)
nb.error < nb.error.all #Will keep the less complicated model
## [1] TRUE
#We will keep logistic regression rather than Naive Bayes, although they are very close!
log.error < nb.error
## [1] TRUE
I found that the Naive Bayes model with features selected by the logistic lasso had the lowest error of the two Naive Bayes models, but the logistic model with lasso-selected features had the lowest misclassification rate of all the models in this section.
10-Fold CV Error for Two Year Recidivism
control <- trainControl(method = "cv")
nb.lasso.cv <- 1 - train(is_recid~sex + race + age + juv_misd_count +
priors_count + is_violent_recid + count.prior.jail, data = data, trControl = control, method =
'naive_bayes')$results[2, "Accuracy"]
nb.boruta.cv <- 1 - train( is_recid~. - c_charge_degree, data = data, trControl = control,
method = 'naive_bayes')$results[2, "Accuracy"]
logit.boruta.cv <- 1 - train(is_recid~. - c_charge_degree, data = data, trControl = control,
method = 'glm', family = 'binomial')$results[1, "Accuracy"]
logit.lasso.cv <- 1 - train(is_recid~sex + race + age + juv_misd_count + priors_count +
is_violent_recid + count.prior.jail, data = data, trControl = control, method = 'glm',
family = 'binomial')$results[1, "Accuracy"]
lasso.cv <- cv.glmnet(x, y, alpha = 1, lambda = grid, family = "binomial", type.measure = "class")$cvm[100]
which.min(c(nb.lasso.cv, nb.boruta.cv, logit.boruta.cv, logit.lasso.cv, lasso.cv))
## [1] 4
Cross Validation finds, just like the Validation Set Approach, that the Logistic Regression with Lasso-Selected Coefficients gives the best misclassification rate.
ROC Curve with Selected Model - Two Year Recidivism
log.preds <- logit %>% predict(data.test, type = "response")
log.perf <- prediction(log.preds, data.test$is_recid)
roc.perf <- performance(log.perf, measure = "tpr", x.measure = "fpr")
auc.train <- performance(log.perf, measure = "auc")
auc.train <- auc.train@y.values
plot(roc.perf)
abline(a=0, b= 1)
text(x = .25, y = .65 ,paste("AUC = ", round(auc.train[[1]],3), sep = ""))
The ROC curve shows that this is a good model for the classification of two year recidivism.
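If a cutoff other than .5 were desired, the ROCR objects built above expose the candidate thresholds directly; a sketch using Youden's J statistic (tpr minus fpr):
j <- roc.perf@y.values[[1]] - roc.perf@x.values[[1]]  # Youden's J at each cutoff
roc.perf@alpha.values[[1]][which.max(j)]  # probability cutoff that maximizes J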
Findings & Bias Analysis
#Two Year Recid
data.test$predictions <- log.pred
data.test$predictions[data.test$predictions == "No Recid"] <- 0
data.test$predictions[data.test$predictions == "Yes Recid"] <- 1
data.test$predictions <- as.numeric(data.test$predictions)
#Race
race.r <- ggplot(data.test, aes(x = race, fill = as.factor(predictions))) + geom_bar(position = 'dodge') +
scale_fill_discrete(name = "Recidivated", labels = c("No", "Yes")) +
xlab('Race') +
ylab('Number of People in Bin')
summary(aov(predictions ~ race, data.test))
## Df Sum Sq Mean Sq F value Pr(>F)
## race 5 24.7 4.947 28.47 <2e-16 ***
## Residuals 4901 851.5 0.174
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
race | predictions |
---|---|
Caucasian | 0.18 |
African-American | 0.30 |
Asian | 0.15 |
Hispanic | 0.12 |
Native American | 0.06 |
Other | 0.15 |
## Df Sum Sq Mean Sq F value Pr(>F)
## race 4 1.6 0.3963 2.883 0.0214 *
## Residuals 2485 341.6 0.1375
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Age
age.r <- ggplot(data.test, aes(x = age, fill = as.factor(predictions))) + geom_histogram(position = "fill") +
scale_fill_discrete(name = "Recidivated", labels = c("No", "Yes")) +
xlab('Age') +
ylab('Number of People in Bin')
#Sex
sex.r <- ggplot(data.test, aes(x = sex, fill = as.factor(predictions))) + geom_bar(position = 'dodge') +
scale_fill_discrete(name = "Recidivated", labels = c("No", "Yes")) +
xlab('Sex') +
ylab('Number of People in Bin')
t.test(predictions ~ sex, data.test)
##
## Welch Two Sample t-test
##
## data: predictions by sex
## t = 11.11, df = 2125.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.1138132 0.1626041
## sample estimates:
## mean in group Male mean in group Female
## 0.2618802 0.1236715
sex | predictions |
---|---|
Male | 0.26 |
Female | 0.12 |
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing missing values (geom_bar).
For this outcome, I fit 3 models: the Logistic Lasso, a regular logistic regression with lasso-selected predictors, and a regular logistic regression with boruta-selected predictors. A full model is not in this report because its error was exactly the same as the boruta-selected model's, and the boruta model was (slightly) less complicated.
I found that the logistic model with lasso-selected predictors had the lower misclassification error (.179 vs. .181). It was also better than the Logistic Lasso (.179 vs. .393).
A Naive Bayes model was also fit in the same way as the logistic regression (i.e., 3 similar models). The lasso-selected predictors gave slightly lower error than the boruta-selected predictors (.200 vs. .207). However, the logistic model with lasso-selected predictors had lower error still (.179 vs. .200). As such, the logistic model with lasso-selected predictors outperformed Naive Bayes on Two-Year Recidivism.
Was the Model Biased?
Unfortunately, this model was biased. African-Americans were predicted to recidivate more often (M = .30) than other races, and this is significant at the .05 level. When African-Americans are removed from the data set, the difference is no longer significant. Men were also predicted to recidivate more often (M = .26) than women (M = .12), and this is significant at the .05 level. Younger people were also predicted to recidivate more often than older people. This model captured the trends we found in the exploratory analysis, but can be considered biased as a result.
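Since ProPublica's headline finding was a false-positive-rate gap, a quick sketch of that check on these test-set predictions (using the 0/1 predictions and is_recid columns created above):
data.test %>%
  dplyr::filter(is_recid == 0) %>%  # people who did not actually recidivate
  dplyr::group_by(race) %>%
  dplyr::summarize(false.positive.rate = mean(predictions))  # share incorrectly flagged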
Violent Recidivism
Preparing Lasso Feature Selection
Feature Selection for Violent Recidivism
# Split data into test and train
# (x and y here should be the violent-recidivism design matrix and response,
# i.e., rebuilt as in the following chunk, before this code is run)
train <- sample(1:nrow(x), nrow(x)/2)
test <- (-train)
y.test <- y[test]
cv.logit <- cv.glmnet(x[train,], y[train], alpha = 1, family = "binomial")
plot(cv.logit)
#Lasso Model
lasso.model <- glmnet(x[train, ], y[train], alpha = 1, family = "binomial", lambda = grid)
lasso.coef <- predict(lasso.model, type = 'coefficients', s = cv.logit$lambda.min)
lasso.coef[lasso.coef != 0] #timeinprison, count.prior.jail, is_recid, c_charge_degree, & juv_misd_count features
## <sparse>[ <logic> ] : .M.sub.i.logical() maybe inefficient
## [1] -5.03469778950 0.01544833729 -0.16031614974 0.22645369846 0.03171181625
## [6] 3.50401926134 0.09861978727 0.00038822236 0.00006599299
lasso.preds <- predict(lasso.model, s = cv.logit$lambda.min, newx = x[test, ], type = "response")
#Boruta Feature Selection
y <- data$is_violent_recid
x <- subset(data, select = -c(is_violent_recid))
boruta_output <- Boruta(x, y)
boruta_output$finalDecision #All Variables, except sex, juv_fel_count, juv_misd_count, juv_other_count, count.prior.jail
## sex race age juv_fel_count
## Rejected Confirmed Confirmed Rejected
## juv_misd_count juv_other_count priors_count c_charge_degree
## Rejected Tentative Confirmed Confirmed
## is_recid count.prior.jail count.prior.prison timeinjail
## Confirmed Rejected Confirmed Tentative
## timeinprison
## Confirmed
## Levels: Tentative Confirmed Rejected
Lasso selected timeinprison, count.prior.jail, is_recid, c_charge_degree, & juv_misd_count as features. Boruta selected all variables except sex, juv_fel_count, juv_misd_count, juv_other_count, and count.prior.jail. I included full models for both Logistic and Naive Bayes because both of these feature selection models removed a large number of potential variables.
Predicting Violent Recidivism – Logistic Regression
#Violent Recidivism -- Logistic and Naive Bayes
# Define x matrix and y vector for use with glmnet
x <- model.matrix(is_violent_recid~., data)[,-1]
y <- data$is_violent_recid
grid=10^seq(10,-2, length =100)
##Create the samples
train <- sample(1:nrow(data), nrow(data)/2)
data.train <- data[train, ]
data.test <- data[-train, ]
#Logistic with Lasso Regression
logit.lasso <- glm(is_violent_recid~timeinprison + count.prior.jail + is_recid + c_charge_degree +
juv_misd_count, data = data.train, family = "binomial")
logit.probs.lasso <- logit.lasso %>% predict(data.test, type = "response")
#Logistic with Boruta
logit.boruta <- glm(is_violent_recid~. -sex -juv_fel_count -juv_misd_count -juv_other_count
-count.prior.jail, data = data.train, family = "binomial")
logit.probs.boruta <- logit.boruta %>% predict(data.test, type = "response")
#Full Logistic Model
logit.full <- glm(is_violent_recid~., data = data.train, family = "binomial")
logit.probs.full <- logit.full %>% predict(data.test, type = "response")
#Create Tables for Logistic and Lasso-Logistic Models
log.lasso.pred <- rep("No Recid", nrow(data.test))
log.boruta.pred <- rep("No Recid", nrow(data.test))
log.full.pred <- rep("No Recid", nrow(data.test))
lass.pred <- rep("No Recid", nrow(data.test))
log.lasso.pred[logit.probs.lasso > .5] <- "Yes Recid"
log.boruta.pred[logit.probs.boruta > .5] <- "Yes Recid"
log.full.pred[logit.probs.full > .5] <- "Yes Recid"
lass.pred[lasso.preds > .5] <- "Yes Recid"
#Find Error Rates
log.lasso.table <- table(log.lasso.pred, data.test$is_violent_recid)
log.lasso.error <- (log.lasso.table[1, 2] + log.lasso.table[2, 1])/nrow(data.test)
log.boruta.table <- table(log.boruta.pred, data.test$is_violent_recid)
log.boruta.error <- (log.boruta.table[1, 2] + log.boruta.table[2, 1])/nrow(data.test)
log.full.table <- table(log.full.pred, data.test$is_violent_recid)
log.full.error <- (log.full.table[1, 2] + log.full.table[2, 1])/nrow(data.test)
lass.table <- table(lass.pred, data.test$is_violent_recid)
#If the lasso never predicts "Yes Recid", the confusion table has only one row
if(length(levels(as.factor(lass.pred))) == 1){
lass.error <- lass.table[1, 2] /nrow(data.test)
}else{
lass.error <- (lass.table[1, 2] + lass.table[2, 1])/nrow(data.test)
}
#Compare Logistics & Lasso
log.lasso.error #Choose the logistic lasso-selected model -- it's the least complicated
## [1] 0.07784797
## [1] 0.07784797
## [1] 0.07805176
## [1] 0.07682902
We have four models, and, depending on the random training/test split, either the Logistic Lasso or the Logistic Regression with Lasso-Selected coefficients has the lowest misclassification rate. I chose to keep the Logistic Regression with Lasso-Selected coefficients because it is easier to interpret.
Predicting Violent Recidivism – Naive Bayes Classifier
#Naive Bayes Lasso
nb.model.lasso <- NaiveBayes(data = data.train, is_violent_recid~timeinprison + count.prior.jail + is_recid +
c_charge_degree + juv_misd_count, usekernel = TRUE)
nb.pred.lasso <- nb.model.lasso %>% predict(data.test)
#Naive Bayes Boruta
nb.model.boruta <- NaiveBayes(data = data.train, is_violent_recid~. -sex -juv_fel_count -juv_misd_count -juv_other_count
-count.prior.jail, usekernel = TRUE)
nb.pred.boruta <- nb.model.boruta %>% predict(data.test)
#Naive Bayes Full
nb.model.full <- NaiveBayes(data = data.train, is_violent_recid ~., usekernel = TRUE)
nb.pred.full <- nb.model.full %>% predict(data.test)
#Calculate Naive Bayes Error
nb.table.full <- table(nb.pred.full$class, data.test$is_violent_recid)
nb.error.full <- (nb.table.full[1, 2] + nb.table.full[2, 1])/nrow(data.test)
nb.table.lasso <- table(nb.pred.lasso$class, data.test$is_violent_recid)
nb.error.lasso <- (nb.table.lasso[1, 2] + nb.table.lasso[2, 1])/nrow(data.test)
nb.table.boruta <- table(nb.pred.boruta$class, data.test$is_violent_recid)
nb.error.boruta <- (nb.table.boruta[1, 2] + nb.table.boruta[2, 1])/nrow(data.test)
#Compare NB Rates
nb.error.full
## [1] 0.07927451
## [1] 0.07703281
## [1] 0.07927451
## [1] FALSE
Out of the Naive Bayes models, I keep the Naive Bayes with Lasso-Selected coefficients because it has the lowest misclassification rate. However, the Logistic Regression with Lasso-Selected coefficients yields a smaller misclassification error than this Naive Bayes model. Occasionally the Naive Bayes Lasso comes out better than the Logistic Lasso model, but only because the misclassification rates are so similar; the winner depends on how the dataset is randomly split. Most of the time, the Logistic model with Lasso-selected coefficients comes out on top; a sketch of how to check this across repeated splits follows.
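Because the winner flips with the split, here is a hedged sketch of repeating the split several times to see how stable the comparison is (B = 20 is an arbitrary choice; shown for the lasso-selected logistic model only):
set.seed(1)
errs <- replicate(20, {
  tr <- sample(1:nrow(data), nrow(data)/2)
  m <- glm(is_violent_recid ~ timeinprison + count.prior.jail + is_recid +
             c_charge_degree + juv_misd_count,
           data = data[tr, ], family = "binomial")
  p <- predict(m, data[-tr, ], type = "response")
  mean((p > .5) != (data[-tr, "is_violent_recid"] == 1))  # test error for this split
})
summary(errs)  # spread of the misclassification rate across random splits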
10-Fold CV Error for Violent Recidivism
control <- trainControl(method = "cv")
nb.lasso.cv <- 1 - train(is_violent_recid~timeinprison + count.prior.jail + is_recid +
c_charge_degree + juv_misd_count, data = data, trControl = control, method =
'naive_bayes')$results[2, "Accuracy"]
nb.boruta.cv <- 1 - train(is_violent_recid~. -sex -juv_fel_count -juv_misd_count -juv_other_count
-count.prior.jail, data = data, trControl = control,
method = 'naive_bayes')$results[2, "Accuracy"]
nb.full.cv <- 1 - train(is_violent_recid~., data = data, trControl = control,
method = 'naive_bayes')$results[2, "Accuracy"]
logit.boruta.cv <- 1 - train(is_violent_recid~. -sex -juv_fel_count -juv_misd_count -juv_other_count
-count.prior.jail, data = data, trControl = control,
method = 'glm', family = 'binomial')$results[1, "Accuracy"]
logit.lasso.cv <- 1 - train(is_violent_recid~timeinprison + count.prior.jail + is_recid + c_charge_degree +
juv_misd_count, data = data, trControl = control, method = 'glm',
family = 'binomial')$results[1, "Accuracy"]
logit.full.cv <- 1 - train(is_violent_recid~., data = data, trControl = control, method = 'glm',
family = 'binomial')$results[1, "Accuracy"]
lasso.cv <- cv.glmnet(x, y, alpha = 1, lambda = grid, family = "binomial", type.measure = "class")$cvm[100]
which.min(c(nb.lasso.cv, nb.boruta.cv, nb.full.cv, logit.boruta.cv, logit.lasso.cv, logit.full.cv, lasso.cv))
## [1] 6
Cross Validation selects either the Logistic Regression with Lasso-Selected coefficients, the full Logistic model, or the Logistic Lasso, depending on the random testing sample. All of these models had very similar misclassification rates, so I kept the Logistic Regression with Lasso-Selected coefficients for ease of interpretation.
ROC Curve with Selected Model - Violent Recidivism
log.preds <- logit.lasso %>% predict(data.test, type = "response")
log.perf <- prediction(log.preds, data.test$is_violent_recid)
roc.perf <- performance(log.perf, measure = "tpr", x.measure = "fpr")
auc.train <- performance(log.perf, measure = "auc")
auc.train <- auc.train@y.values
plot(roc.perf)
abline(a=0, b= 1)
text(x = .25, y = .65 ,paste("AUC = ", round(auc.train[[1]],3), sep = ""))
The ROC curve also shows that this is a well performing model for predicting is_violent_recid.
Findings & Bias Analysis
#Violent Recid
data.test$predictions <- log.lasso.pred
data.test$predictions[data.test$predictions == "No Recid"] <- 0
data.test$predictions[data.test$predictions == "Yes Recid"] <- 1
data.test$predictions <- as.numeric(data.test$predictions)
#Race
race.r <- ggplot(data.test, aes(x = race, fill = as.factor(predictions))) + geom_bar(position = 'dodge') +
scale_fill_discrete(name = "Recidivated", labels = c("No", "Yes")) +
xlab('Race') +
ylab('Number of People in Bin')
summary(aov(predictions ~ race, data.test))
## Df Sum Sq Mean Sq F value Pr(>F)
## race 5 0.083 0.01653 4.804 0.00022 ***
## Residuals 4901 16.858 0.00344
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
race | predictions |
---|---|
Caucasian | 0.00 |
African-American | 0.01 |
Asian | 0.00 |
Hispanic | 0.00 |
Native American | 0.07 |
Other | 0.00 |
#Age
age.r <- ggplot(data.test, aes(x = age, fill = as.factor(predictions))) + geom_histogram(position = "fill") +
scale_fill_discrete(name = "Recidivated", labels = c("No", "Yes")) +
xlab('Age') +
ylab('Number of People in Bin')
#Sex
sex.r <- ggplot(data.test, aes(x = sex, fill = as.factor(predictions))) + geom_bar(position = 'dodge') +
scale_fill_discrete(name = "Recidivated", labels = c("No", "Yes")) +
xlab('Sex') +
ylab('Number of People in Bin')
t.test(predictions ~ sex, data.test)
##
## Welch Two Sample t-test
##
## data: predictions by sex
## t = 1.1175, df = 2196.9, p-value = 0.2639
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.001437316 0.005245429
## sample estimates:
## mean in group Male mean in group Female
## 0.003861004 0.001956947
sex | predictions |
---|---|
Male | 0 |
Female | 0 |
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing missing values (geom_bar).
For this outcome, I fit 4 models: the Logistic Lasso, a regular logistic regression with lasso-selected predictors, a regular logistic regression with boruta-selected predictors, and a full model. The full model was included because both the lasso and boruta feature selections removed a fair number of predictors.
I found that the logistic model with lasso-selected predictors had the lowest misclassification error (.075), marginally better than the Logistic Lasso (.075 vs. .076).
A Naive Bayes model was also fit in the same way as the logistic regression (i.e., 4 similar models). The lasso-selected predictors gave slightly lower error (.077) than the boruta-selected predictors. However, the logistic model with lasso-selected predictors had lower error still (.075 vs. .077). As such, the logistic model with lasso-selected predictors outperformed Naive Bayes on Violent Recidivism.
Was the Model Biased?
This model appears less biased than the two-year recidivism model. It did predict Native Americans as more likely to violently recidivate (M = .06) than other races; however, if we remove Native Americans and African-Americans from the data set, the differences are no longer statistically significant. It was not biased between men (M = .008) and women (M = .003), as the difference was not statistically significant. Finally, younger individuals appeared more likely to be predicted to violently recidivate than older ones, although this is hard to tell. This model did not capture the exact trends we saw in the violent exploratory analysis; it does not appear biased across most groups, but it remains biased with respect to race.
Model Summary
I find that Logistic Regression produced lower misclassification rates than Naive Bayes for both Two Year and Violent Recidivism. Both had very similar AUC values, so I picked the model with the lowest misclassification rate. Logistic Regression with Lasso-Selected Coefficients classified recidivism very well.
KAY – Linear and Quadratic Discriminant Analysis
To be consistent with the team, I used the dataset that keeps only the necessary variables. I adjusted my data frame by converting the "race" variable to a factor and re-ordering its levels so that "Caucasian" is listed first and serves as the baseline. The reason is that when I first ran the LDA and QDA models, "African-American" was used as the baseline, which was not ideal for interpretation, especially since one of our goals is to examine whether the system is racially biased. I also removed "c_charge_degree" because it produced a constant-values error when running the LDA model. Two data frames were created, for is_recid and is_violent_recid respectively. For is_recid, the variable "is_violent_recid" is omitted because including it as a predictor may create bias, considering it could be highly correlated with the outcome variable; the same holds for is_violent_recid. Each data frame has 12 variables for the LDA and QDA models. A sketch of the releveling appears immediately below.
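The releveling chunk itself is not shown in this report; a minimal sketch of what is described above (assuming the raw race strings):
recid$race <- relevel(as.factor(recid$race), ref = "Caucasian")  # make Caucasian the baseline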
These 14 variables were selected based on their relevance to recidivism. I did some research on model selection for discriminant analysis. Generally, discriminant analyses are challenged when (1) there are a large number of predictors; (2) there are more predictors than samples; (3) the dataset has a complicated structure with too many variables present; or (4) there are multiple data sources (Zhang, Mai, and Pan 2019).
I also tried a method suggested by Mueller (2019): rank the variables by importance from a glm, then fit the top-ranked variables in an LDA to see if the accuracy rate improves. My pre-analysis showed barely any change (a 0.001 increase), so, considering these findings, I fit a full model for the discriminant analyses; a sketch of the ranking step follows.
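A hedged sketch of that ranking step, using caret's absolute-test-statistic importance for a glm (the top-5 cut is an arbitrary illustration, df1 is the data frame built in Part Two, and the dummy-coded terms must be mapped back to variables by hand):
glm.fit <- glm(is_recid ~ ., data = df1, family = "binomial")
imp <- caret::varImp(glm.fit)  # importance = absolute value of each term's test statistic
rownames(imp)[order(-imp$Overall)][1:5]  # five highest-ranked (dummy-coded) terms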
To evaluate the models, I split the data into a training set and a validation set. Instead of the 50/50 split used in the in-class lab, I applied common ratios: 80% training and 20% validation for the LDA models, and roughly 70/30 for the QDA models (Darelos 2009). Model evaluations are based on the misclassification rate and the ROC curve, more specifically the AUC.
Part One: Data Exploration
Reading the data (CSV file), data processing, and exploring important variables
rm(list=ls())
# read csv file to load the data
recid <- read.csv("recid_data.csv", header = TRUE)
# Delete Irrelevant Columns
# We don't need people's names to predict whether a person will recidivate
recid$name=NULL
recid$first=NULL
recid$last=NULL
# We don't need the age category because we have age itself
recid$age_cat=NULL
# Remove anything related to COMPAS
recid$compas_screening_date=NULL
recid$decile_score=NULL
recid$c_days_from_compas=NULL
# Case number/Identification number is an identification of the case, so won't be needed
recid$r_case_number=NULL
recid$vr_case_number=NULL
recid$X=NULL
recid$person_id=NULL
# Date of birth not needed because we have age
recid$dob=NULL
# We don't need the specific arrest dates, because being arrested on a particular date has nothing to do with whether the person will recidivate. We also have counts of how many times the person was arrested.
recid$c_arrest_date=NULL
recid$c_offense_date=NULL
recid$r_offense_date=NULL
recid$r_jail_in=NULL
recid$r_jail_out=NULL
recid$vr_offense_date=NULL
# vr_charge_degree / r_charge_degree are not needed because they describe the recidivism outcome itself
recid$vr_charge_degree=NULL
recid$r_charge_degree=NULL
recid$r_days_from_arrest=NULL
recid$num_r_cases=NULL
# Delete Irrelevant Rows
table(recid$c_charge_degree)
##
## (CO3) (CT) (F1) (F2) (F3) (F5) (F6) (F7) (M1) (M2) (MO3) (NI0) (TCX)
## 1 1 152 821 5265 5 5 94 2680 734 67 6 1
## (X)
## 1
bad.degrees <- c("(CO3)", "(CT)", "(TCX)", "(X)", "(F5)", "(F6)", "(NI0)")
recid = recid[!(recid$c_charge_degree %in% bad.degrees), ]
table(recid$c_charge_degree)
##
## (CO3) (CT) (F1) (F2) (F3) (F5) (F6) (F7) (M1) (M2) (MO3) (NI0) (TCX)
## 0 0 152 821 5265 0 0 94 2680 734 67 0 0
## (X)
## 0
Part Two: Model training and validation
LDA model 1: “is_recid”
# create a dataframe with variables except "is_violent_recid"
df1 <- subset(recid, select = c(sex, race, age, juv_fel_count, juv_misd_count, juv_other_count, priors_count, is_recid, count.prior.jail, count.prior.prison, timeinjail, timeinprison))
# split into training and validation sets
set.seed(2)
train.is_recid <- sample(1:nrow(df1), 0.8*nrow(df1))
test.is_recid <- setdiff(1:nrow(df1),train.is_recid)
#create a dataframe of test set
df.test <- df1 %>% filter(row_number() %in% test.is_recid)
# fit the lda model to the training data set
lda.fit.is_recid <- lda(is_recid~., data=df1, subset=train.is_recid)
lda.fit.is_recid
## Call:
## lda(is_recid ~ ., data = df1, subset = train.is_recid)
##
## Prior probabilities of groups:
## 0 1
## 0.6681529 0.3318471
##
## Group means:
## sexMale raceAfrican-American raceAsian raceHispanic raceNative American
## 0 0.7717827 0.4551001 0.005910391 0.10238322 0.003431840
## 1 0.8502879 0.6023033 0.003071017 0.06602687 0.002303263
## raceOther age juv_fel_count juv_misd_count juv_other_count priors_count
## 0 0.06730219 36.36244 0.03965682 0.04385129 0.06692088 2.208008
## 1 0.04491363 32.34894 0.10556622 0.14165067 0.16161228 4.781574
## count.prior.jail count.prior.prison timeinjail timeinprison
## 0 1.345853 0.3195424 18.86196 103.9353
## 1 3.042994 0.6034549 23.02783 211.1939
##
## Coefficients of linear discriminants:
## LD1
## sexMale 0.2456270718
## raceAfrican-American 0.1963835441
## raceAsian -0.1171255526
## raceHispanic -0.0245141393
## raceNative American -0.2678651692
## raceOther -0.0398447124
## age -0.0166707913
## juv_fel_count 0.0450023744
## juv_misd_count 0.0707335112
## juv_other_count 0.0560447482
## priors_count 0.0417633413
## count.prior.jail 0.6516008721
## count.prior.prison -0.0208121741
## timeinjail 0.0001075972
## timeinprison 0.0001033417
Prior Probabilities of groups show that 66.81529% of the training data did not recidivate and 33.18471% recidivated.
The group mean for sexMale indicates that being male is slightly more common in the recidivism group (0.8502879) than in the non-recidivism group (0.7717827).
Group means show that Asian, Hispanic, Native American, and Other defendants are less likely to recidivate.
However, the group means also show that African-Americans are more prevalent in the recidivism group (0.6023033) than in the non-recidivism group (0.4551001).
All variables about criminal history (juv_fel_count, priors_count, count.prior.jail, etc.) indicate that defendants with criminal history are more likely to recidivate.
The coefficient of linear discriminants for African-American is positive while those for the other races are not, indicating that compared to Caucasian (the baseline), African-Americans are more likely to be classified as recidivating.
We may say that the system is biased against African-Americans in terms of recidivism compared to other races.
LDA model 1 evaluation: misclassification rate and ROC curve
# fit the lda model to the test dataframe
lda.pred <- predict(lda.fit.is_recid, newdata = df.test)
names(lda.pred)
## [1] "class" "posterior" "x"
lda.class <- lda.pred$class
# compare the prediction to the test set using a confusion matrix
test.true <- df1$is_recid[test.is_recid]
confusion_matrix.lda <- table(lda.class, test.true)
confusion_matrix.lda
## test.true
## lda.class 0 1
## 0 1237 327
## 1 95 304
# calculate the misclassification rate
misclass.rate.lda <- (confusion_matrix.lda[1,2]+confusion_matrix.lda[2,1])/nrow(df.test)
misclass.rate.lda
## [1] 0.2149771
The model is fitted to the test data set. Model evaluations are based on the misclassification rate and the ROC curve, more specifically, the AUC. A statistical difference test is conducted to see if the model is biased.
# Construct a ROC curve:
# Get the posteriors as a dataframe
lda.pred.posteriors <- as.data.frame(lda.pred$posterior)
# Evaluate the model
pred <- prediction(lda.pred.posteriors[,2], df.test$is_recid)
roc.perf = performance(pred, measure = "tpr", x.measure = "fpr")
auc.train <- performance(pred, measure = "auc")
auc.train <- auc.train@y.values
# Plot
plot(roc.perf)
abline(a=0, b= 1)
text(x = .25, y = .65 ,paste("AUC = ", round(auc.train[[1]],3), sep = ""))
#Is the model biased in terms of race?
df.test$predictions <- lda.class
# Note: lda.class has factor levels "0"/"1" rather than "No Recid"/"Yes Recid",
# so the two recodes below are no-ops and as.numeric() maps the levels to 1/2;
# the group means reported below are therefore offset upward by 1.
df.test$predictions[df.test$predictions == "No Recid"] <- 0
df.test$predictions[df.test$predictions == "Yes Recid"] <- 1
df.test$predictions <- as.numeric(df.test$predictions)
#Race
race.r <- ggplot(df.test, aes(x = race, fill = as.factor(predictions))) + geom_bar(position = 'dodge') +
scale_fill_discrete(name = "Recidivated", labels = c("No", "Yes")) +
xlab('Race') +
ylab('Number of People in Bin')
race.r
## Df Sum Sq Mean Sq F value Pr(>F)
## race 5 19.26 3.853 25.25 <2e-16 ***
## Residuals 1957 298.63 0.153
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
race | predictions |
---|---|
Caucasian | 1.13 |
African-American | 1.30 |
Asian | 1.11 |
Hispanic | 1.06 |
Native American | 1.20 |
Other | 1.07 |
## Df Sum Sq Mean Sq F value Pr(>F)
## race 4 0.73 0.18324 1.886 0.111
## Residuals 1010 98.13 0.09716
The misclassification rate is 21.50%. The ROC curve has an AUC of 0.833.
The model is biased. African-Americans were predicted to recidivate more often than other races (M = 1.30 on the 1/2 coding noted in the chunk above), and this is significant at the 0.05 level with p-value < 2e-16. When African-Americans are removed from the data set, the difference is no longer significant (p-value = 0.111).
Conclusion 1: the LDA model for is_recid gives a misclassification rate of 21.50% and a ROC curve with AUC = 0.833.
LDA model 2: “is_violent_recid”
# create a dataframe with variables except "is_recid"
df1.violent <- subset(recid, select = c(sex, race, age, juv_fel_count, juv_misd_count, juv_other_count, priors_count, is_violent_recid, count.prior.jail, count.prior.prison, timeinjail,timeinprison))
# split into training and validation sets
set.seed(2)
train.is_violent_recid <- sample(1:nrow(df1.violent), 0.8 * nrow(df1.violent))
test.is_violent_recid <- setdiff(1:nrow(df1.violent),train.is_violent_recid)
#create a dataframe of test set
df.test2 <- df1.violent %>% filter(row_number() %in% test.is_violent_recid)
# fit the lda model to the training data set
lda.fit.is_violent_recid <- lda(is_violent_recid~., data=df1.violent, subset=train.is_violent_recid)
lda.fit.is_violent_recid
## Call:
## lda(is_violent_recid ~ ., data = df1.violent, subset = train.is_violent_recid)
##
## Prior probabilities of groups:
## 0 1
## 0.92343949 0.07656051
##
## Group means:
## sexMale raceAfrican-American raceAsian raceHispanic raceNative American
## 0 0.7911436 0.4944130 0.004966202 0.09325424 0.002896951
## 1 0.8785358 0.6189684 0.004991681 0.05490849 0.004991681
## raceOther age juv_fel_count juv_misd_count juv_other_count priors_count
## 0 0.06028418 35.32570 0.05393847 0.06552628 0.08925369 2.926197
## 1 0.05490849 31.47088 0.15307820 0.20632280 0.20798669 4.700499
## count.prior.jail count.prior.prison timeinjail timeinprison
## 0 1.790592 0.3935715 19.51159 130.4572
## 1 3.337770 0.6572379 29.08319 248.9459
##
## Coefficients of linear discriminants:
## LD1
## sexMale 0.2958845875
## raceAfrican-American 0.1979970319
## raceAsian 0.4095361664
## raceHispanic -0.0688810913
## raceNative American 0.7996058128
## raceOther 0.2187912558
## age -0.0105429079
## juv_fel_count 0.2337230598
## juv_misd_count 0.3598387673
## juv_other_count 0.1308619812
## priors_count -0.0213290796
## count.prior.jail 0.5999233178
## count.prior.prison 0.0293765792
## timeinjail 0.0016781294
## timeinprison 0.0001422081
Prior Probabilities of groups show that 92.34% of the observations in the training dataset did not violently recidivate, and 7.656% did.
The group mean for sexMale indicates that being male is slightly more common in the violent-recidivism group (0.8785358) than in the non-violent-recidivism group (0.7911436).
The group means for Asians are almost equal across the two groups, and the overall proportions are very small (about 0.4%).
Group means for Hispanic and Other show that these races are less likely to violently recidivate.
The group mean for African-Americans shows more violent recidivism (0.6189684); the results are similar to is_recid.
All variables about criminal history (juv_fel_count, priors_count, count.prior.jail, etc.) indicate that defendants with criminal history are more likely to violently recidivate.
The coefficients of linear discriminants for African-American, Asian, Native American, and Other are all positive, indicating that compared to Caucasian (the baseline), the model is more likely to classify these races as violently recidivating. It is not necessarily biased against African-Americans specifically in terms of violent recidivism.
LDA model 2 evaluation: misclassification rate and ROC curve
# fit the lda model to the test dataframe
lda.pred2 <- predict(lda.fit.is_violent_recid, newdata = df.test2)
names(lda.pred2)
## [1] "class" "posterior" "x"
lda.class2 <- lda.pred2$class
# compare the prediction to test set using confusion matrix
test.true2 <- df1.violent$is_violent_recid[test.is_violent_recid]
confusion_matrix.lda2 <- table(lda.class2, test.true2)
confusion_matrix.lda2
## test.true2
## lda.class2 0 1
## 0 1764 141
## 1 42 16
# calculate the misclassification rate
misclass.rate.lda2 <- (confusion_matrix.lda2[1,2]+confusion_matrix.lda2[2,1])/nrow(df.test2)
misclass.rate.lda2
## [1] 0.09322466
The LDA model is fitted to the test data set. Model evaluations are based on the misclassification rate and the ROC curve, more specifically, the AUC. A statistical difference test is conducted to see if the model is biased.
# Construct a ROC curve:
# Get the posteriors as a dataframe
lda.pred.posteriors2 <- as.data.frame(lda.pred2$posterior)
# Evaluate the model
pred2 <- prediction(lda.pred.posteriors2[,2], df.test2$is_violent_recid)
roc.perf2 = performance(pred2, measure = "tpr", x.measure = "fpr")
auc.train2 <- performance(pred2, measure = "auc")
auc.train2 <- auc.train2@y.values
# Plot
plot(roc.perf2)
abline(a=0, b= 1)
text(x = .25, y = .65 ,paste("AUC = ", round(auc.train2[[1]],3), sep = ""))
#Is the model biased in terms of race?
df.test2$predictions <- lda.class2
# Note: as above, lda.class2 has factor levels "0"/"1", so the recodes below
# are no-ops and as.numeric() maps the levels to 1/2.
df.test2$predictions[df.test2$predictions == "No Recid"] <- 0
df.test2$predictions[df.test2$predictions == "Yes Recid"] <- 1
df.test2$predictions <- as.numeric(df.test2$predictions)
#Race
race.r <- ggplot(df.test2, aes(x = race, fill = as.factor(predictions))) + geom_bar(position = 'dodge') +
scale_fill_discrete(name = "Recidivated", labels = c("No", "Yes")) +
xlab('Race') +
ylab('Number of People in Bin')
race.r
## Df Sum Sq Mean Sq F value Pr(>F)
## race 5 0.31 0.06144 2.148 0.0573 .
## Residuals 1957 55.98 0.02860
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
race | predictions |
---|---|
Caucasian | 1.02 |
African-American | 1.04 |
Asian | 1.00 |
Hispanic | 1.01 |
Native American | 1.00 |
Other | 1.02 |
## Df Sum Sq Mean Sq F value Pr(>F)
## race 4 0.014 0.003485 0.199 0.939
## Residuals 1010 17.667 0.017492
Conclusion 2: LDA model for is_violent_recid gives a misclassfication rate = 9.322%.
LDA model for is_violent_recid gives a ROC curve with AUC = 0.786.
The misclassification rate is 9.322%.
The ROC curve has an AUC of 0.786.
The coefficients of linear discriminants for African American, Native American, and Other are positive, indicating that compared to Caucasian (baseline), the system is more likely to classify these races to violent recidivism. It is not necessarily biased against African Americans in terms of violent recidivism.
A significance analysis also shows that the model is not biased: although African Americans are predicted to recidivate at a slightly higher rate than other races, the difference is not significant at the 0.05 level, and after removing African Americans we no longer have statistical significance.
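The ANOVA chunks themselves are not echoed in this report; the following is a minimal sketch of the two tests summarized above, assuming the df.test2 data frame with the predictions column added earlier:

# all races
summary(aov(predictions ~ race, data = df.test2))
# excluding African-American defendants; droplevels removes the now-empty level
no.aa <- droplevels(subset(df.test2, race != "African-American"))
summary(aov(predictions ~ race, data = no.aa))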
QDA model 1: “is_recid”
# create a dataframe without is_violent_recid to avoid collinearity
df2 <- recid[,c("is_recid", "sex", "race", "age", "juv_fel_count", "juv_misd_count", "juv_other_count", "priors_count", "count.prior.jail", "count.prior.prison", "timeinjail","timeinprison")]
set.seed(2)
ind = sample(2,nrow(df2),replace=TRUE,prob=c(0.7,0.3))
train.is_recid.qda = df2[ind==1,]
test.is_recid.qda = df2[ind==2,]
# fit the qda model to the training data set
qda.fit.is_recid <- qda(is_recid~., data=train.is_recid.qda)
qda.fit.is_recid
## Call:
## qda(is_recid ~ ., data = train.is_recid.qda)
##
## Prior probabilities of groups:
## 0 1
## 0.668913 0.331087
##
## Group means:
## sexMale raceAfrican-American raceAsian raceHispanic raceNative American
## 0 0.7645642 0.4500657 0.006351292 0.10183968 0.002628121
## 1 0.8530973 0.5933628 0.003539823 0.06327434 0.002654867
## raceOther age juv_fel_count juv_misd_count juv_other_count priors_count
## 0 0.06811213 36.31559 0.03613666 0.04402102 0.06373193 2.168857
## 1 0.04557522 32.07257 0.10707965 0.14955752 0.16504425 4.711947
## count.prior.jail count.prior.prison timeinjail timeinprison
## 0 1.339028 0.2989488 18.73007 99.41272
## 1 3.049558 0.5597345 23.65398 199.60752
The QDA model output contains group means but no coefficients, because the QDA classifier involves a quadratic, rather than a linear, function of the predictors.
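For reference, QDA assigns an observation x to the class k that maximizes the quadratic discriminant (standard form, with class prior \(\pi_k\), mean \(\mu_k\), and covariance \(\Sigma_k\)):

$$\delta_k(x) = -\frac{1}{2}(x-\mu_k)^{\top}\Sigma_k^{-1}(x-\mu_k) - \frac{1}{2}\log\lvert\Sigma_k\rvert + \log\pi_k$$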
From the group means, we can see that males still make up a larger share of the two-year-recidivism group than of the non-recidivism group.
We also observe that African Americans are overrepresented among recidivists. The ratio is very close to the LDA model (non-recidivism group mean = 0.4500657; recidivism group mean = 0.5933628).
Native Americans are about equally likely to recidivate or not. Asian, Hispanic, and Other are less likely to recidivate.
All variables about criminal history (juv_fel_count, priors_count, count.prior.jail, etc.) indicate that defendants with a criminal history are more likely to recidivate.
QDA model 1 evaluation: misclassification rate and ROC curve
# apply the fitted QDA model to the test dataframe
qda.pred1 <- predict(qda.fit.is_recid, test.is_recid.qda)
names(qda.pred1)
## [1] "class" "posterior"
qda.class1 <- qda.pred1$class
confusion_matrix.qda1 <- table(qda.class1,test.is_recid.qda$is_recid)
confusion_matrix.qda1
##
## qda.class1 0 1
## 0 1752 523
## 1 259 453
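# note: the denominator below is nrow(df2), the full data set, rather than
# the test-set size nrow(test.is_recid.qda)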
misclassification.rate.qda1 <- (confusion_matrix.qda1[1,2]+confusion_matrix.qda1[2,1])/nrow(df2)
misclassification.rate.qda1
## [1] 0.07969021
The QDA model is applied to the test data set. Model evaluation is based on the misclassification rate and the ROC curve, more specifically the AUC. A statistical difference test is conducted to see whether the model is biased.
# Construct a ROC curve:
# Get the posteriors as a dataframe
qda.pred.posteriors1 <- as.data.frame(qda.pred1$posterior)
# Evaluate the model
pred3 <- prediction(qda.pred.posteriors1[,2], test.is_recid.qda$is_recid)
roc.perf3 = performance(pred3, measure = "tpr", x.measure = "fpr")
auc.train3 <- performance(pred3, measure = "auc")
auc.train3 <- auc.train3@y.values
# Plot
plot(roc.perf3)
abline(a=0, b= 1)
text(x = .25, y = .65 ,paste("AUC = ", round(auc.train3[[1]],3), sep = ""))
#Is the model biased in terms of race?
test.is_recid.qda$predictions <- qda.class1
test.is_recid.qda$predictions[test.is_recid.qda$predictions == "No Recid"] <- 0
test.is_recid.qda$predictions[test.is_recid.qda$predictions == "Yes Recid"] <- 1
test.is_recid.qda$predictions <- as.numeric(test.is_recid.qda$predictions)
#Race
race.r <- ggplot(test.is_recid.qda, aes(x = race, fill = as.factor(predictions))) + geom_bar(position = 'dodge') +
scale_fill_discrete(name = "Recidivated", labels = c("No", "Yes")) +
xlab('Race') +
ylab('Number of People in Bin')
race.r
## Df Sum Sq Mean Sq F value Pr(>F)
## race 5 37.4 7.471 44.11 <2e-16 ***
## Residuals 2981 504.9 0.169
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
race | predictions |
---|---|
Caucasian | 1.16 |
African-American | 1.34 |
Asian | 1.00 |
Hispanic | 1.08 |
Native American | 1.09 |
Other | 1.05 |
## Df Sum Sq Mean Sq F value Pr(>F)
## race 4 3.27 0.8167 7.316 0.00000783 ***
## Residuals 1474 164.55 0.1116
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Age
ggplot(test.is_recid.qda, aes(x = age, fill = as.factor(predictions))) +
geom_histogram(position = 'fill') +
scale_fill_discrete(name = "Recidivated", labels = c("No", "Yes")) +
xlab('Age') +
ylab('Number of People in Bin')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing missing values (geom_bar).
#Sex
ggplot(test.is_recid.qda, aes(x = sex, fill = as.factor(predictions))) +
geom_bar(position = 'dodge') +
scale_fill_discrete(name = "Recidivated", labels = c("No", "Yes")) +
xlab('Sex') +
ylab('Number of People in Bin')
##
## Welch Two Sample t-test
##
## data: predictions by sex
## t = -11.947, df = 1431.8, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.2088107 -0.1499096
## sample estimates:
## mean in group Female mean in group Male
## 1.095935 1.275295
The misclassification rate is 7.969%.
The ROC curve has an AUC of 0.763.
A significance analysis doesn't necessarily show that the model is racially biased. African Americans are predicted to recidivate at a higher rate than other races, and the difference is statistically significant at the 0.05 level. But even after removing African Americans, the difference is still statistically significant, so the race-level differences are not driven by African Americans alone.
Younger people are predicted to have a higher probability of recidivating compared to older people.
Males are also predicted to have a significantly higher chance of recidivating than females.
QDA accurately captures the trends we see in the EDA.
Conclusion 3: The QDA model for is_recid gives a misclassification rate of 7.969% and a ROC curve with AUC = 0.763.
QDA model 2: “is_violent_recid”
# create a dataframe without is_recid to avoid collinearity
df3 <- recid[,c("is_violent_recid", "sex", "race", "age", "juv_fel_count", "juv_misd_count", "juv_other_count", "priors_count", "count.prior.jail", "count.prior.prison", "timeinjail","timeinprison")]
set.seed(2)
ind = sample(2,nrow(df3),replace=TRUE,prob=c(0.7,0.3))
train.is_violent_recid.qda = df3[ind==1,]
test.is_violent_recid.qda = df3[ind==2,]
# fit the qda model to the training data set
qda.fit.is_violent_recid <- qda(is_violent_recid~., data=train.is_violent_recid.qda)
qda.fit.is_violent_recid
## Call:
## qda(is_violent_recid ~ ., data = train.is_violent_recid.qda)
##
## Prior probabilities of groups:
## 0 1
## 0.92396718 0.07603282
##
## Group means:
## sexMale raceAfrican-American raceAsian raceHispanic raceNative American
## 0 0.7864278 0.4864436 0.005549390 0.09227842 0.002378310
## 1 0.8843931 0.6319846 0.003853565 0.05009634 0.005780347
## raceOther age juv_fel_count juv_misd_count juv_other_count priors_count
## 0 0.06151895 35.18392 0.05216426 0.06849532 0.08974156 2.867449
## 1 0.05009634 31.59152 0.15028902 0.20616570 0.18882466 4.753372
## count.prior.jail count.prior.prison timeinjail timeinprison
## 0 1.784208 0.3689551 19.58380 125.4194
## 1 3.377649 0.5838150 29.79672 219.6753
From the group means, we can see that males make up a larger share of the violent-recidivism group (0.8843931) than of the non-recidivism group (0.7864278). African Americans are also overrepresented among violent recidivists (0.6319846 vs. 0.4864436); the difference is slightly larger than for two-year recidivism. Native Americans are more likely to have violent recidivism. Asian, Hispanic, and Other are less likely to have violent recidivism.
All variables about criminal history (juv_fel_count, priors_count, count.prior.jail, etc.) indicate that defendants with a criminal history are more likely to have violent recidivism.
LDA and QDA have essentially the same finding for African Americans. The reason could be that African Americans are arrested more, and therefore there are more African Americans in the system than other races.
QDA model 2: validation and misclassification rate
# apply the fitted QDA model to the test dataframe
qda.pred2 <- predict(qda.fit.is_violent_recid, test.is_violent_recid.qda)
names(qda.pred2)
## [1] "class" "posterior"
qda.class2 <- qda.pred2$class
confusion_matrix.qda2 <- table(qda.class2,test.is_violent_recid.qda$is_violent_recid)
confusion_matrix.qda2
##
## qda.class2 0 1
## 0 2551 190
## 1 197 49
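# note: the denominator below is nrow(df3), the full data set, rather than
# the test-set size nrow(test.is_violent_recid.qda)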
misclassification.rate.qda2 <- (confusion_matrix.qda2[1,2]+confusion_matrix.qda2[2,1])/nrow(df3)
misclassification.rate.qda2
## [1] 0.03943748
The QDA model is applied to the test data set. Model evaluation is based on the misclassification rate and the ROC curve, more specifically the AUC. A statistical difference test is conducted to see whether the model is biased.
# Construct a ROC curve:
# Get the posteriors as a dataframe
qda.pred.posteriors2 <- as.data.frame(qda.pred2$posterior)
# Evaluate the model
pred4 <- prediction(qda.pred.posteriors2[,2], test.is_violent_recid.qda$is_violent_recid)
roc.perf4 = performance(pred4, measure = "tpr", x.measure = "fpr")
auc.train4 <- performance(pred4, measure = "auc")
auc.train4 <- auc.train4@y.values
# Plot
plot(roc.perf4)
abline(a=0, b= 1)
text(x = .25, y = .65 ,paste("AUC = ", round(auc.train4[[1]],3), sep = ""))
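# save the QDA predictions and test labels for the concordance comparison at the end of the report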
qda.pred.is_recid <- qda.class1
qda.pred.vis_recid <- qda.class2
qda.test.data.r <- test.is_recid.qda$is_recid
qda.test.data.v <- test.is_violent_recid.qda$is_violent_recid
#Is the model biased in terms of race?
test.is_violent_recid.qda$predictions <- qda.class2
test.is_violent_recid.qda$predictions[test.is_violent_recid.qda$predictions == "No Recid"] <- 0
test.is_violent_recid.qda$predictions[test.is_violent_recid.qda$predictions == "Yes Recid"] <- 1
test.is_violent_recid.qda$predictions <- as.numeric(test.is_violent_recid.qda$predictions)
#Race
race.r <- ggplot(test.is_violent_recid.qda, aes(x = race, fill = as.factor(predictions))) + geom_bar(position = 'dodge') +
scale_fill_discrete(name = "Recidivated", labels = c("No", "Yes")) +
xlab('Race') +
ylab('Number of People in Bin')
race.r
## Df Sum Sq Mean Sq F value Pr(>F)
## race 5 14.74 2.9485 41.66 <2e-16 ***
## Residuals 2981 211.00 0.0708
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
kable(test.is_violent_recid.qda %>%
group_by(race) %>%
summarize_at(c("predictions"), mean), digits = 2)
race | predictions |
---|---|
Caucasian | 1.04 |
African-American | 1.12 |
Asian | 1.00 |
Hispanic | 1.03 |
Native American | 2.00 |
Other | 1.02 |
## Df Sum Sq Mean Sq F value Pr(>F)
## race 4 10.27 2.5669 74.24 <2e-16 ***
## Residuals 1474 50.96 0.0346
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Age
ggplot(test.is_violent_recid.qda, aes(x = age, fill = as.factor(predictions))) +
geom_histogram(position = 'fill') +
scale_fill_discrete(name = "Recidivated", labels = c("No", "Yes")) +
xlab('Age') +
ylab('Number of People in Bin')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing missing values (geom_bar).
#Sex
ggplot(test.is_violent_recid.qda, aes(x = sex, fill = as.factor(predictions))) +
geom_bar(position = 'dodge') +
scale_fill_discrete(name = "Recidivated", labels = c("No", "Yes")) +
xlab('Sex') +
ylab('Number of People in Bin')
##
## Welch Two Sample t-test
##
## data: predictions by sex
## t = -8.7563, df = 1981.8, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.09185107 -0.05823599
## sample estimates:
## mean in group Female mean in group Male
## 1.022764 1.097808
The misclassification rate is 3.9437%.
The ROC curve has an AUC of 0.696.
A significance analysis doesn't necessarily show that the model is racially biased. African Americans are predicted to recidivate at a higher rate than other races, and the difference is statistically significant at the 0.05 level. But even after removing African Americans, the difference is still statistically significant, so the race-level differences are not driven by African Americans alone.
Younger people are predicted to violently recidivate more than older people.
Males are predicted to significantly violently recidivate more than females.
This QDA model accurately captures the trends we observed in the Violent Recidivism EDA.
Conclusion 4: The QDA model for is_violent_recid gives a misclassification rate of 3.9437% and a ROC curve with AUC = 0.696.
LDA/QDA Model Summary:
- is_recid: QDA has a lower misclassification rate than LDA; LDA has a larger AUC than QDA.
- is_violent_recid: QDA has a lower misclassification rate than LDA; LDA has a larger AUC than QDA.
We may only say that the LDA model for is_recid could be biased against African Americans.
DANIEL – Random Forest and Neural Networks
Before I started, I removed, for both the random forest model and the neural network model, all columns that contain information identifying the specific individual or the charge. c_charge_degree levels with fewer than 10 observations in the data set were removed because we concluded that is not enough sample to train a model to be used for prediction. The outcome variables, is_recid and is_violent_recid, had to be converted to a factor for the random forest model and to a Boolean for the neural network model. I separated the test and training datasets by choosing observations at random, with 70% used for training and 30% used for testing in both models. The neural network model needed extra cleaning because it only takes quantitative inputs, and every variable had to be scaled to [0,1]. All the factor variables were turned into dummy columns and all independent variables were normalized.
$$X_{normalized} = \frac{X - X_{min}}{X_{max} - X_{min}}$$
We also had to change the names of the columns that included the "-" symbol because the neuralnet() function could not handle that symbol. Neural network models were repeated 4 or 5 times to find the best model with the lowest error rate, while keeping the number of repetitions low to avoid overfitting the model to the training dataset.
We settled on the use of 14 variables in our data set. As John mentioned above, I didn't use all 14 variables at each split of the Random Forest model, to avoid collinearity and other issues; I chose the square root of 12 after training the model and examining a chart of the importance of each variable (an mtry tuning sketch follows below).
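As a hedged aside (not part of the original analysis), the randomForest package's tuneRF can search over mtry by OOB error instead of fixing it at the square root of the predictor count; this sketch assumes the recid.train split created in the chunks below:

library(randomForest)
set.seed(222)
x.cols <- setdiff(names(recid.train), c("is_recid", "is_violent_recid"))
# search mtry starting from floor(sqrt(12)), doubling each step, judged by OOB error
tuneRF(x = recid.train[, x.cols], y = as.factor(recid.train$is_recid),
       mtryStart = floor(sqrt(12)), ntreeTry = 200, stepFactor = 2, improve = 0.01)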
Import CSV files
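This chunk is not echoed; presumably it mirrors the import used later in the report (the file name is taken from that section):

recid <- read.csv(file = 'recid_data.csv', row.names = 1)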
Function to Analyze Confusion Matrix
classMetrics <- function(score, y,
type = c("all", "accuracy", "sensitivity",
"specificity", "ppv", "npv", "precision",
"recall")) {
# This command throws an error if the user specifies a "type" that
# isn't supported by this function
type <- match.arg(type, several.ok = TRUE)
predicted = score
observed = y
# Confusion matrix (rows are predictions, columns are observations)
conf.matrix = table(predicted,observed)
# Cell counts: A = true negatives, B = false negatives, C = false positives, D = true positives
A = conf.matrix[1,1]
B = conf.matrix[1,2]
C = conf.matrix[2,1]
D = conf.matrix[2,2]
acc = (A+D)/(A+B+C+D)
spec = A/(A+C)
sens = D/(B+D)
ppv = D/(C+D)
npv = A/(A+B)
precision = ppv
recall = sens
m.error <- (B + C)/(A+B+C+D)
nice = data.frame("accuracy"=acc, "sensitivity"=sens, "specificity"=spec, "ppv"=ppv, "npv"=npv, "precision"=precision,"recall"=recall,"misclassification rate"=m.error)
if(type != "all"){
nice = nice[type]
nice = t(nice)
colnames(nice) = c("value")
}else{
nice = t(nice)
colnames(nice) = c("value")
}
list(conf.mat=conf.matrix,perf = nice)
}
Exploratory Phase
Delete Irrelevant Columns
# We don't need people's names to predict whether a person will recidivate
recid$name=NULL
recid$first=NULL
recid$last=NULL
# Keep numeric age; drop the redundant age category column
####recid$age=NULL
recid$age_cat=NULL
# Remove anything related to COMPAS
recid$compas_screening_date=NULL
recid$decile_score=NULL
recid$c_days_from_compas=NULL
# Case/identification numbers only identify the case, so they won't be needed
recid$r_case_number=NULL
recid$vr_case_number=NULL
recid$X=NULL
recid$person_id=NULL
# Date of birth not needed because we have age
recid$dob=NULL
# We don't need the dates the person was arrested: a specific arrest date has little bearing on whether the person will recidivate, and we already have counts of how many times the person was arrested
recid$c_arrest_date=NULL
recid$c_offense_date=NULL
recid$r_offense_date=NULL
recid$r_jail_in=NULL
recid$r_jail_out=NULL
recid$vr_offense_date=NULL
# vr_charge_degree and r_charge_degree are not needed because they describe the recidivism outcome itself
recid$vr_charge_degree=NULL
recid$r_charge_degree=NULL
# Not necessary
recid$r_days_from_arrest=NULL
recid$num_r_cases=NULL
# Reorder the race variable and make Caucasian the baseline
recid$race = factor(recid$race, levels = c("Caucasian","African-American","Asian","Hispanic","Native American","Other"))
Remove columns that describe the individual or the charge status
Delete Irrelevant Rows
##
## (CO3) (CT) (F1) (F2) (F3) (F5) (F6) (F7) (M1) (M2) (MO3) (NI0) (TCX)
## 1 1 152 821 5265 5 5 94 2680 734 67 6 1
## (X)
## 1
# drop the charge-degree levels that have only single-digit counts
rare.degrees = c("(CO3)","(CT)","(TCX)","(X)","(F5)","(F6)","(NI0)")
recid = recid[!(recid$c_charge_degree %in% rare.degrees),]
table(recid$c_charge_degree)
##
## (CO3) (CT) (F1) (F2) (F3) (F5) (F6) (F7) (M1) (M2) (MO3) (NI0) (TCX)
## 0 0 152 821 5265 0 0 94 2680 734 67 0 0
## (X)
## 0
Charge-degree levels with only single-digit counts do not have enough observations to estimate whether the level has a significant impact on is_recid.
Random Forest
# Change outcome variables to a factor
recid$is_recid[recid$is_recid==1] = "yes"
recid$is_recid[recid$is_recid==0] = "no"
recid$is_violent_recid[recid$is_violent_recid==1] = "yes"
recid$is_violent_recid[recid$is_violent_recid==0] = "no"
Recode the outcome variables as "yes"/"no"; they are converted to factors in the model formulas below.
set.seed(222)
ind = sample(2,nrow(recid),replace=TRUE,prob=c(0.7,0.3))
recid.train = recid[ind==1,]
recid.test = recid[ind==2,]
Divide the recid dataset into two data sets, train and test. The training dataset is made up of 70% of observations randomly chosen from the recid dataset; the remaining 30% are assigned to the test dataset.
Use the randomForest command to fit a random forest to recid.train for is_recid. (Note: the argument name mytr below is a misspelling of mtry; randomForest silently ignores it and uses its default of floor(sqrt(p)) = 3 variables per split, as the output confirms.)
recid.rf.1 = randomForest(as.factor(is_recid) ~.-is_violent_recid, data = recid.train,mytr=sqrt(12),importance=TRUE)
recid.rf.1
##
## Call:
## randomForest(formula = as.factor(is_recid) ~ . - is_violent_recid, data = recid.train, mytr = sqrt(12), importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 20.39%
## Confusion matrix:
## no yes class.error
## no 3943 650 0.1415197
## yes 758 1553 0.3279965
Perform randomForest on is_recid, converting the outcome into a factor.
Use the randomForest command to fit a random forest to recid.train for is_violent_recid (the same mytr misspelling applies here).
recid.rf.2 = randomForest(as.factor(is_violent_recid) ~.-is_recid, data = recid.train,mytr=sqrt(12),importance=TRUE)
recid.rf.2
##
## Call:
## randomForest(formula = as.factor(is_violent_recid) ~ . - is_recid, data = recid.train, mytr = sqrt(12), importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 7.94%
## Confusion matrix:
## no yes class.error
## no 6347 17 0.002671276
## yes 531 9 0.983333333
Perform randomForest on is_violent_recid, converting the outcome into a factor.
Finding the standard classification error of the Random Forest model on the training data for is_recid
rf.misclass.1 = 1-(recid.rf.1$confusion[1,1]+recid.rf.1$confusion[2,2])/(recid.rf.1$confusion[1,1]+recid.rf.1$confusion[2,2]+recid.rf.1$confusion[1,2]+recid.rf.1$confusion[2,1])
rf.misclass.1
## [1] 0.2039397
The misclassification rate of the Random Forest model on is_recid is 20.4%.
Finding the standard classification error of the Random Forest model on the training data for is_violent_recid
rf.misclass.2 = 1-(recid.rf.2$confusion[1,1]+recid.rf.2$confusion[2,2])/(recid.rf.2$confusion[1,1]+recid.rf.2$confusion[2,2]+recid.rf.2$confusion[1,2]+recid.rf.2$confusion[2,1])
rf.misclass.2
## [1] 0.07937428
The misclassification rate of the Random Forest model on is_violent_recid is 7.94%. (From the confusion matrix, the class error for actual recidivists is 98.3%: the model almost never predicts yes.)
Construct a variable importance plot of is_recid
The count.prior.jail variable is the most important variable in the is_recid training dataset.
Construct a variable importance plot of is_violent_recid
The count.prior.jail and c_charge_degree variables are the most important variables in the is_violent_recid training dataset.
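The plotting chunks are not echoed; varImpPlot from the randomForest package is presumably the call used (a sketch, assuming recid.rf.1 and recid.rf.2 from the fits above):

varImpPlot(recid.rf.1, main = "Variable importance: is_recid")
varImpPlot(recid.rf.2, main = "Variable importance: is_violent_recid")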
Use the predict command to obtain class predictions and probability estimates on the test data for is_recid.
predict.rf.1 = predict(recid.rf.1,newdata=recid.test,type="class")
predict.rf.prob.1 = predict(recid.rf.1,newdata=recid.test,type="prob")
#------
classMetrics(predict.rf.1,recid.test$is_recid, type = "all")
## $conf.mat
## observed
## predicted no yes
## no 1689 299
## yes 295 626
##
## $perf
## value
## accuracy 0.7958061
## sensitivity 0.6767568
## specificity 0.8513105
## ppv 0.6796960
## npv 0.8495976
## precision 0.6796960
## recall 0.6767568
## misclassification.rate 0.2041939
The accuracy of the Random Forest model applied to the test dataset for is_recid is 79.6% and the misclassification rate is 20.4%.
Use the predict command to obtain class predictions and probability estimates on the test data for is_violent_recid.
predict.rf.2 = predict(recid.rf.2,newdata=recid.test,type="class")
predict.rf.prob.2 = predict(recid.rf.2,newdata=recid.test,type="prob")
#------
classMetrics(predict.rf.2,recid.test$is_violent_recid, type = "all")
## $conf.mat
## observed
## predicted no yes
## no 2685 215
## yes 6 3
##
## $perf
## value
## accuracy 0.92402888
## sensitivity 0.01376147
## specificity 0.99777035
## ppv 0.33333333
## npv 0.92586207
## precision 0.33333333
## recall 0.01376147
## misclassification.rate 0.07597112
The accuracy of the Random Forest model applied to the test dataset for is_violent_recid is 92.4% and the misclassification rate is 7.6%. Note, however, that the sensitivity is only 1.4%: the model predicts yes for almost no one, so the high accuracy largely reflects the class imbalance (most defendants do not violently recidivate).
The ROC curve for the Random Forest model of is_recid
#roc.rf.1 = roc(recid.test$is_recid,predict.rf.prob.1[,2])
#plot(roc.rf.1,col="steelblue")
# Evaluate the model
pred <- prediction(predict.rf.prob.1[,2], recid.test$is_recid)
roc.perf = performance(pred, measure = "tpr", x.measure = "fpr")
auc.train <- performance(pred, measure = "auc")
auc.train <- auc.train@y.values
# Plot
plot(roc.perf)
abline(a=0, b= 1)
text(x = .25, y = .65 ,paste("AUC = ", round(auc.train[[1]],3), sep = ""))
The AUC of the Random Forest model on is_recid is 0.842.
The ROC curve for the Random Forest model of is_violent_recid
#roc.rf.2 = roc(recid.test$is_violent_recid,predict.rf.prob.2[,2])
#plot(roc.rf.2,col="green")
# Evaluate the model
pred <- prediction(predict.rf.prob.2[,2], recid.test$is_violent_recid)
roc.perf = performance(pred, measure = "tpr", x.measure = "fpr")
auc.train <- performance(pred, measure = "auc")
auc.train <- auc.train@y.values
# Plot
plot(roc.perf)
abline(a=0, b= 1)
text(x = .25, y = .65 ,paste("AUC = ", round(auc.train[[1]],3), sep = ""))
The AUC of the Random Forest model on is_violent_recid is 0.795.
Artificial Neural Network
Load Additional Packages
##
## Attaching package: 'neuralnet'
## The following object is masked from 'package:ROCR':
##
## prediction
## The following object is masked from 'package:dplyr':
##
## compute
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
##
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
##
## nasa
Delete Irrelevant Columns
# We don't need people's names to predict whether a person will recidivate
recid$name=NULL
recid$first=NULL
recid$last=NULL
# Keep numeric age; drop the redundant age category column
####recid$age=NULL
recid$age_cat=NULL
# Remove anything related to COMPAS
recid$compas_screening_date=NULL
recid$decile_score=NULL
recid$c_days_from_compas=NULL
# Case/identification numbers only identify the case, so they won't be needed
recid$r_case_number=NULL
recid$vr_case_number=NULL
recid$X=NULL
recid$person_id=NULL
# Date of birth not needed because we have age
recid$dob=NULL
# We don't need the dates the person was arrested: a specific arrest date has little bearing on whether the person will recidivate, and we already have counts of how many times the person was arrested
recid$c_arrest_date=NULL
recid$c_offense_date=NULL
recid$r_offense_date=NULL
recid$r_jail_in=NULL
recid$r_jail_out=NULL
recid$vr_offense_date=NULL
# vr_charge_degree and r_charge_degree are not needed because they describe the recidivism outcome itself
recid$vr_charge_degree=NULL
recid$r_charge_degree=NULL
# Not necessary
recid$r_days_from_arrest=NULL
recid$num_r_cases=NULL
# Reorder the race variable and make Caucasian the baseline
recid$race = factor(recid$race, levels = c("Caucasian","African-American","Asian","Hispanic","Native American","Other"))
Delete Irrelevant Rows
##
## (CO3) (CT) (F1) (F2) (F3) (F5) (F6) (F7) (M1) (M2) (MO3) (NI0) (TCX)
## 1 1 152 821 5265 5 5 94 2680 734 67 6 1
## (X)
## 1
# drop the charge-degree levels that have only single-digit counts
rare.degrees = c("(CO3)","(CT)","(TCX)","(X)","(F5)","(F6)","(NI0)")
recid = recid[!(recid$c_charge_degree %in% rare.degrees),]
table(recid$c_charge_degree)
##
## (CO3) (CT) (F1) (F2) (F3) (F5) (F6) (F7) (M1) (M2) (MO3) (NI0) (TCX)
## 0 0 152 821 5265 0 0 94 2680 734 67 0 0
## (X)
## 0
Create a new data frame with Boolean outcomes to run the neuralnet package
# Change the outcome variables to Boolean to run the ANN
recid.new = recid %>% mutate(is_recid = ifelse(is_recid=="yes",TRUE,FALSE),is_violent_recid = ifelse(is_violent_recid=="yes",TRUE,FALSE))
# Change categorical variables to dummy variables
recid.new = fastDummies::dummy_cols(recid.new)
# Scale numerical variables to [0,1]
scale01 <- function(x){
(x - min(x)) / (max(x) - min(x))
}
recid.new = recid.new %>% mutate(juv_fel_count=scale01(juv_fel_count), juv_misd_count=scale01(juv_misd_count),juv_other_count=scale01(juv_other_count),priors_count=scale01(priors_count),count.prior.jail=scale01(count.prior.jail),count.prior.prison=scale01(count.prior.prison),age=scale01(age))
# Remove Redundant Columns
recid.new$sex=NULL
recid.new$race=NULL
recid.new$timeinjail=NULL
recid.new$timeinprison=NULL
recid.new$is.recid=NULL
recid.new$c_charge_degree=NULL
# Change Names of Columns
names(recid.new)[8] = "count_prior_jail"
names(recid.new)[9] = "count_prior_prison"
names(recid.new)[13] = "race_african_american"
names(recid.new)[16] = "race_native_american"
names(recid.new)[21] = "f2"
names(recid.new)[20] = "f1"
names(recid.new)[22] = "f3"
names(recid.new)[25] = "f7"
names(recid.new)[26] = "m1"
names(recid.new)[27] = "m2"
names(recid.new)[28] = "mo3"
# Rename and then remove the dummy columns for the charge degrees dropped earlier
names(recid.new)[18] = "co3"
names(recid.new)[19] = "ct"
names(recid.new)[23] = "f5"
names(recid.new)[24] = "f6"
names(recid.new)[29] = "ni0"
names(recid.new)[30] = "tcx"
names(recid.new)[31] = "x"
recid.new$co3 =NULL
recid.new$ct=NULL
recid.new$f5=NULL
recid.new$f6=NULL
recid.new$ni0=NULL
recid.new$tcx=NULL
recid.new$x=NULL
Neural networks only deal with quantitative variables, so I had to clean up the data. All the factor variables were converted into dummy variables, and the original factor variables were deleted. The independent variables were normalized so that they are all scaled in an equal manner.
Create Train and Test data set
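The chunks that create the split and fit the networks are not echoed. The following is a minimal sketch consistent with the later evaluation chunks: the 70/30 split is stated above, err.fct = "ce" with likelihood = TRUE is inferred from the AIC/BIC rows read off result.matrix below, and the seed, hidden-layer size, and rep count are assumptions:

library(neuralnet)
set.seed(222)  # seed is an assumption
ind = sample(2,nrow(recid.new),replace=TRUE,prob=c(0.7,0.3))
recid.training = recid.new[ind==1,]
recid.testing = recid.new[ind==2,]
recid.test.1 = recid.testing  # name used by the evaluation chunks below
# build the formulas explicitly (neuralnet does not reliably expand "~ .")
preds = setdiff(names(recid.training), c("is_recid","is_violent_recid"))
f1 = as.formula(paste("is_recid ~", paste(preds, collapse=" + ")))
f2 = as.formula(paste("is_violent_recid ~", paste(preds, collapse=" + ")))
# hidden = 2 and rep = 5 are assumptions
recid.ann.1 = neuralnet(f1, data=recid.training, hidden=2, rep=5, err.fct="ce", linear.output=FALSE, likelihood=TRUE)
recid.ann.2 = neuralnet(f2, data=recid.training, hidden=2, rep=5, err.fct="ce", linear.output=FALSE, likelihood=TRUE)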
Plot the Neural Network of is_recid
In the neural network model for is_recid, African-American has a weight of 5.48 and Caucasian has a weight of 6.64. The model is not racially biased against African-Americans.
Plot the Neural Network of is_violent_recid
In the neural network model for is_violent_recid, African-American has a weight of 4.28 and Caucasian has a weight of 4.82. The model is not racially biased against African-Americans.
Confusion Matrix and Misclassification Rate on Training & Testing Data on NN2
# Remove the output variable before using the function "compute"
ann.test.out.1 = neuralnet::compute(recid.ann.1,recid.testing[,c(-6,-7)],rep=1)
# Create a Confusion Matrix to find Misclassification Rate
ann.test.out.result.1 = ann.test.out.1$net.result
ann.test.pred.1 = ifelse(ann.test.out.result.1>0.5,1,0)
ann.test.tab.1 = table(recid.test.1$is_recid,ann.test.pred.1)
ann.test.tab.1
## ann.test.pred.1
## 0
## 0 1984
## 1 925
#------
#classMetrics(ann.test.pred.1,recid.test.1$is_recid, type = "all")
mis.1 = 1-sum(diag(ann.test.tab.1))/sum(ann.test.tab.1)
mis.1
## [1] 0.3179787
The neural network model on is_recid has a misclassification rate of 31.8%. Note that the confusion matrix has a single prediction column: the network predicts 0 (no recidivism) for every test observation, so the misclassification rate equals the share of recidivists in the test set.
# Remove the output variable before using the function "compute"
ann.test.out.2 = neuralnet::compute(recid.ann.2,recid.testing[,c(-6,-7)],rep=1)
# Create a Confusion Matrix to find Misclassification Rate
ann.test.out.result.2 = ann.test.out.2$net.result
ann.test.pred.2 = ifelse(ann.test.out.result.2>0.5,1,0)
ann.test.tab.2 = table(recid.test.1$is_violent_recid,ann.test.pred.2)
ann.test.tab.2
## ann.test.pred.2
## 0
## 0 2691
## 1 218
#------
#classMetrics(ann.test.pred.2,recid.test.1$is_violent_recid, type = "all")
mis.2 = 1-sum(diag(ann.test.tab.2))/sum(ann.test.tab.2)
mis.2
## [1] 0.07493984
The neural network model on is_violent_recid has a misclassification rate of 7.49%. This network likewise predicts 0 for every test observation.
Bar Plot of is_recid
# Bar plot of results
class.ann = tibble('Network' = rep(c("NN2"), each = 3),
'Metric' = rep(c('AIC', 'BIC', 'ce Error'), length.out = 3),
'Value' = c(recid.ann.1$result.matrix[4,1],
recid.ann.1$result.matrix[5,1], recid.ann.1$result.matrix[1,1]))
class.ann %>%
ggplot(aes(Network, Value, fill = Metric)) +
geom_col(position = 'dodge') +
ggtitle("AIC, BIC, and Cross-Entropy Error of the Classification ANNs for Recidivism")
Bar Plot of is_violent_recid
# Bar plot of results
class.ann = tibble('Network' = rep(c("NN2"), each = 3),
'Metric' = rep(c('AIC', 'BIC', 'ce Error'), length.out = 3),
'Value' = c(recid.ann.2$result.matrix[4,1],
recid.ann.2$result.matrix[5,1], recid.ann.2$result.matrix[1,1]))
class.ann %>%
ggplot(aes(Network, Value, fill = Metric)) +
geom_col(position = 'dodge') +
ggtitle("AIC, BIC, and Cross-Entropy Error of the Classification ANNs for Violent Recidivism")
Random Forest model & Neural Network Findings & Summary
Findings
Two Year Recidivism: I found that the Random Forest model resulted in better accuracy than the Neural Network model (A = 0.795 vs. 0.791). The misclassification rate was 0.204 for the Random Forest model and 0.318 for the Neural Network model, so the Random Forest model performed better. I would select the Random Forest model to predict recidivism.
Was the Model Biased?
Unfortunately, it is difficult to measure whether the Random Forest model is biased: the importance of each variable can be measured, but the model does not directly tell us which race is predicted to have higher "is_recid". However, I can estimate racial bias by evaluating the model on the test dataset divided into African-American and Caucasian subgroups. The FPR for Caucasian individuals is ~10% lower than the FPR for African-American individuals. This result suggests the model flags African-American individuals as future recidivists with a higher proportion of those flags being false. From this analysis, my Random Forest model for "is_recid" is racially biased. (A sketch of this subgroup comparison follows.)
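A minimal sketch of this subgroup comparison, assuming the predict.rf.1 predictions and recid.test data from the chunks above (fpr_by_race is a hypothetical helper, not part of the original analysis):

fpr_by_race <- function(pred, test, group) {
  # confusion matrix restricted to one race group (rows = predictions)
  idx <- test$race == group
  cm <- table(predicted = pred[idx], observed = test$is_recid[idx])
  # conventional FPR: predicted "yes" among those observed "no"
  cm["yes", "no"] / sum(cm[, "no"])
}
fpr_by_race(predict.rf.1, recid.test, "African-American")
fpr_by_race(predict.rf.1, recid.test, "Caucasian")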
Neural network models are less biased because the weight on African-American individuals is 5.48; in contrast, the weight on Caucasian individuals is 6.64.
Violent Recidivism: I found that the Random Forest model resulted in better accuracy than the Neural Network model (A = 0.924). The misclassification rate was 0.076 for the Random Forest model and 0.075 for the Neural Network model, so the Neural Network model performed marginally better on that measure. However, the difference in misclassification rate is very small, so I will pick the Random Forest model for "is_violent_recid".
Was the Model Biased?
Unfortunately, it is difficult to measure whether the Random Forest model is biased: the importance of each variable can be measured, but the model does not directly tell us which race is predicted to have higher "is_violent_recid". However, I can estimate racial bias by evaluating the model on the test dataset divided into African-American and Caucasian subgroups. The FPR for Caucasian individuals is ~0.022% lower than the FPR for African-American individuals. This result suggests my Random Forest model for "is_violent_recid" is racially unbiased.
Neural network models are less biased because the weight on African-American individuals is 4.28; in contrast, the weight on Caucasian individuals is 4.82.
In Summary
The Random Forest model performs better than the Neural Network model, with a lower misclassification rate (0.21 vs. 0.32) [is_recid]
The Random Forest model performs similarly to the Neural Network model in misclassification rate (0.076 vs. 0.075) [is_violent_recid]
By comparing the two models, I can conclude that the Random Forest model performs better than the Neural Network model.
JOSE – Linear Regression & K-Nearest-Neighbors Classifier
My part of the analysis consists of taking the recidivism data set and applying simple linear regression and k-nearest neighbors as classification methods. The analysis is applied by breaking the data set down into violent and non-violent recidivism. Before applying these methods, however, I apply k-means to the data to discern whether there are any discernible clusters of observations within the data set.
Data Selection
rm(list= ls()[!(ls() %in% c('qda.pred.is_recid', "qda.pred.vis_recid",
"qda.test.data.r", "qda.test.data.v"))])
recid <- read.csv(file = 'recid_data.csv', row.names = 1)
recid <- recid[which(recid$decile_score >= 0),]
# We don't need people's names to predict whether a person will recidivate
recid$name=NULL
recid$first=NULL
recid$last=NULL
# Keep numeric age; drop the redundant age category column
####recid$age=NULL
recid$age_cat=NULL
# Remove anything related to COMPAS
recid$compas_screening_date=NULL
recid$decile_score=NULL
recid$c_days_from_compas=NULL
# Case/identification numbers only identify the case, so they won't be needed
recid$r_case_number=NULL
recid$vr_case_number=NULL
recid$X=NULL
recid$person_id=NULL
# Date of birth not needed because we have age
recid$dob=NULL
# We don't need the dates the person was arrested: a specific arrest date has little bearing on whether the person will recidivate, and we already have counts of how many times the person was arrested
recid$c_arrest_date=NULL
recid$c_offense_date=NULL
recid$r_offense_date=NULL
recid$r_jail_in=NULL
recid$r_jail_out=NULL
recid$vr_offense_date=NULL
# vr_charge_degree and r_charge_degree are not needed because they describe the recidivism outcome itself
recid$vr_charge_degree=NULL
recid$r_charge_degree=NULL
# FOR NOW
#recid$c_charge_degree=NULL
recid$r_days_from_arrest=NULL
# Change na to 0
#recid$num_r_cases[is.na(recid$num_r_cases)]=0
recid$num_r_cases=NULL
names(recid)
## [1] "sex" "race" "age"
## [4] "juv_fel_count" "juv_misd_count" "juv_other_count"
## [7] "priors_count" "c_charge_degree" "is_recid"
## [10] "is_violent_recid" "count.prior.jail" "count.prior.prison"
## [13] "timeinjail" "timeinprison"
Divide the data into training and test sets before modeling
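The chunk that creates the split is not echoed. A minimal sketch consistent with the later subset = train and [-train] usage, where train must be an integer index vector (the seed and the 50/50 split proportion are assumptions):

set.seed(1)                                  # seed is an assumption
train <- sample(nrow(recid), nrow(recid)/2)  # split proportion is an assumption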
K-Means exploration: Principal Components Analysis
# example ISLR page 401
# extract the numeric variables
pca.nv <- recid[, c("age", "juv_fel_count", "juv_misd_count", "juv_other_count",
"priors_count", "count.prior.jail", "count.prior.prison", "timeinjail", "timeinprison")]
#mean
apply(pca.nv, 2, mean)
## age juv_fel_count juv_misd_count juv_other_count
## 34.91855849 0.05792528 0.07910007 0.09935865
## priors_count count.prior.jail count.prior.prison timeinjail
## 3.03512165 1.91499542 0.39784180 20.48976891
## timeinprison
## 138.10882622
# variance
apply(pca.nv, 2, var)
## age juv_fel_count juv_misd_count juv_other_count
## 140.6728616 0.1993525 0.2141661 0.2446575
## priors_count count.prior.jail count.prior.prison timeinjail
## 20.7427085 2.4261883 1.5700707 3618.7831625
## timeinprison
## 168264.9606154
# PCA chunk (reconstructed; not echoed in the original): prcomp with scale = TRUE is assumed, as in the ISLR example
pr.out = prcomp(pca.nv, scale = TRUE)
pr.out$rotation
## PC1 PC2 PC3 PC4 PC5
## age -0.01754909 0.53452473 -0.45883480 0.41365522 -0.086655431
## juv_fel_count -0.26237068 -0.02158468 0.10344439 0.04769020 0.934473892
## juv_misd_count -0.26035814 -0.45663548 -0.11649834 0.43686920 -0.081033793
## juv_other_count -0.18119197 -0.51597092 0.04576765 0.42033952 -0.148293842
## priors_count -0.50987211 0.02805504 -0.38693821 -0.05851043 0.005857272
## count.prior.jail -0.29737071 -0.26350659 -0.28722719 -0.66787581 -0.133321438
## count.prior.prison -0.47943029 0.31915187 -0.10500275 0.05839410 -0.055675891
## timeinjail -0.28724014 0.21899113 0.60282639 0.05217608 -0.164606037
## timeinprison -0.41167652 0.13797691 0.39554705 -0.05883212 -0.206610688
## PC6 PC7 PC8 PC9
## age -0.09083068 -0.17101758 -0.53062065 -0.07596484
## juv_fel_count -0.06581351 -0.03197515 -0.18725028 -0.06373914
## juv_misd_count 0.65954683 0.11268494 -0.11336789 -0.22836911
## juv_other_count -0.68533686 -0.16959654 -0.03306592 0.02508948
## priors_count 0.10168261 -0.15551986 0.26663153 0.69314229
## count.prior.jail -0.06260217 -0.21662852 -0.38023086 -0.31682855
## count.prior.prison -0.16664623 0.19493972 0.52772251 -0.55361532
## timeinjail 0.18332356 -0.66321869 -0.03553270 -0.03987111
## timeinprison -0.08421912 0.61586917 -0.41714736 0.22017842
pr.var =pr.out$sdev ^2
pve=pr.var/sum(pr.var)
plot(pve,xlab="Principal Component",ylab="Proportion of Variance Explained", ylim=c(0,1),type='b')
The PCA suggests that priors_count loads most heavily on PC1, juv_other_count on PC2, and age on PC3.
K-means classification using priors_count and juv_other_count
Are there discernible groups or clusters of data within the recid data set? If so, can those clusters help us discern between recidivist and non-recidivist observations?
# extract priors_count and juv_other_count and place them in one data frame
kmdf <- recid[, c("priors_count", "juv_other_count")]
#dim(kmdf)
km.out = kmeans(kmdf,2,nstart = 10)
#km.out$cluster
plot(kmdf, col =(km.out$cluster+1) , main="K-Means Clustering Results with K=2", xlab ="priors_count", ylab="juv_other_count", pch= 20, cex =1)
#plot for violent recid
ggplot(recid, aes(x = priors_count, y = juv_other_count, col = as.factor(is_violent_recid), shape = as.factor(is_violent_recid))) + geom_point() + scale_color_manual(values = c("green", "red")) + ggtitle("Violent Recid: priors_count vs juv_other_count")
K-means using priors_count and age
# extract age and priors_count and place them in one data frame
kmdf <- recid[, c("age", "priors_count")]
#dim(kmdf)
km.out = kmeans(kmdf,2,nstart = 10)
#km.out$cluster
plot(kmdf, col =(km.out$cluster+1) , main="K-Means Clustering Results with K=2", xlab ="age", ylab="priors_count", pch= 20, cex =2)
#plot for violent recid
ggplot(recid, aes(x = age, y = priors_count, col = as.factor(is_violent_recid), shape = as.factor(is_violent_recid))) + geom_point() + scale_color_manual(values = c("green", "red")) + ggtitle("Violent Recid")
Using K-means on the violent and non-violent recidivist data sets with age and priors_count seems to be an ineffective way of identifying clusters, as the distributions of recidivists and non-recidivists overlap for these variables. On the other hand, applying K-means using priors_count and juv_other_count for non-violent recidivists seems to be a good option for identifying clusters of observations.
Simple Linear Regression
# Find best predictors of recividism for linear regression
regfit.violent <- regsubsets(is_violent_recid ~. -is_recid, recid, nvmax = 5)
#summary(regfit.violent)
regfit.nonviolent <- regsubsets(is_recid ~. -is_violent_recid, recid, nvmax = 5)
#summary(regfit.nonviolent)
The following variables are the strongest predictors of recidivism in a linear regression context:
- Violent: count.prior.jail, c_charge_degree(F3), juv_misd_count
- Nonviolent: count.prior.jail, age, priors_count
Next, we'll produce a linear fit using the best predictors listed above for both the violent and non-violent recidivist data sets.
lmfit.nv <- lm(is_recid~count.prior.jail+age+priors_count, data = recid, subset = train)
mse.nv <- with(recid, mean((is_recid-predict(lmfit.nv,recid))[-train]^2))
mse.nv
## [1] 0.1591918
lmfit.v <- lm(is_violent_recid~count.prior.jail + juv_misd_count, data = recid, subset = train)
mse.v <- with(recid, mean((is_violent_recid-predict(lmfit.v,recid))[-train]^2))
mse.v
## [1] 0.06472581
#linear regression plots
qplot(data = recid, x = count.prior.jail, y = is_recid) + stat_smooth(method = "lm", aes(colour = "linear")) + stat_smooth(method = "lm", formula = y~poly(x,2), aes(colour = "quadratic")) + stat_smooth(method = "lm" , formula = y~poly(x,3), aes(colour = "cubic")) + scale_colour_discrete("Model") + theme_bw()
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
K-fold validation
#K-10 CV for non-violent recividism
cv.error10.nv <- rep(0,10)
for (i in 1:10) {
glm.nv <- glm(is_recid~poly(count.prior.jail,i), data = recid)
cv.error10.nv[i] <- cv.glm(recid, glm.nv, K = 10)$delta[1]
}
cv.error10.nv
## [1] 0.1617823 0.1522842 0.1510242 0.2317933 4.1815818
## [6] 15.0627720 76.8895228 349.8387882 3126.1044113 3261.4593437
cv.df10nv <- data.frame(deg=1:10, cv_error = cv.error10.nv)
qplot(data=cv.df10nv, x = deg, y = cv.error10.nv, geom = "line", ylab = "10-fold CV Error estimate", main = "Non-violent recid") + geom_point()
which.min(cv.error10.nv)
## [1] 3
#K-10 CV for violent recividism
cv.error10.v <- rep(0,10)
for (i in 1:10) {
glm.v <- glm(is_violent_recid~poly(count.prior.jail,i), data = recid)
cv.error10.v[i] <- cv.glm(recid, glm.v, K = 10)$delta[1]
}
cv.error10.v
## [1] 0.06633744 0.06583484 0.06600587 0.07469503 0.49547235
## [6] 0.25464351 25.16558619 0.11717836 1.96607643 78265.09530934
cv.df10v <- data.frame(deg=1:10, cv_error = cv.error10.v)
qplot(data=cv.df10v, x = deg, y = cv.error10.v, geom = "line", ylab = "10-fold CV Error estimate", main = "Violent recid") + geom_point()
which.min(cv.error10.v)
## [1] 2
K-NN Classification
#split data sets into violent and non-violent recividists (sex, race, and charge degree removed as non-numeric and thus cannot be normalized)
recid.nv <- recid[c("is_recid","age", "juv_fel_count", "juv_misd_count","juv_other_count", "priors_count","count.prior.jail","count.prior.prison","timeinjail","timeinprison")]
recid.v <- recid[c("is_violent_recid","age", "juv_fel_count", "juv_misd_count","juv_other_count", "priors_count","count.prior.jail","count.prior.prison","timeinjail","timeinprison")]
#Normalization of data
normalize <- function(x) {
return((x-min(x))/(max(x) - min(x))) }
recid.nv.norm <- as.data.frame(lapply(recid.nv[,2:10], normalize))
recid.v.norm <- as.data.frame(lapply(recid.v[,2:10], normalize))
#create training and test data sets for non-violent recividist classification
train.nv <- recid.nv.norm[1:4911,]
test.nv <- recid.nv.norm[4912:9822,]
#create training and test data sets for violent recividists classification
train.v <- recid.v.norm[1:4911,]
test.v <- recid.v.norm[4912:9822,]
recid.nv.pred <- knn(train.nv, test.nv, recid.nv[1:4911,1], k= 90, prob = TRUE)
conf.nv <- table(recid.nv.pred, recid.nv[4912:9822,1])
conf.nv
##
## recid.nv.pred 0 1
## 0 2933 768
## 1 328 882
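# note: with predictions as rows, the ratios below run along prediction rows:
# "tpr.nv" is TP/(TP+FP) (precision) and "fpr.nv" is FN/(TN+FN), the share of
# actual recidivists among predicted negatives, not the conventional TPR/FPR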
tpr.nv <- (conf.nv[2,2])/(conf.nv[2,1]+conf.nv[2,2])
fpr.nv <- (conf.nv[1,2])/(conf.nv[1,1]+conf.nv[1,2])
acc.nv <- (conf.nv[2,2]+conf.nv[1,1])/(conf.nv[1,1]+conf.nv[1,2]+conf.nv[2,2]+conf.nv[2,1])
acc.nv
## [1] 0.7768275
tpr.nv
## [1] 0.7289256
fpr.nv
## [1] 0.2075115
#KNN prediction for violent recividists
recid.v.pred <- knn(train.v, test.v, recid.v[1:4911,1], k= 7, prob = TRUE)
conf.v <- table(recid.v.pred, recid.v[4912:9822,1])
conf.v
##
## recid.v.pred 0 1
## 0 4481 379
## 1 37 14
tpr.v <- (conf.v[2,2])/(conf.v[2,1]+conf.v[2,2])
fpr.v <- (conf.v[1,2])/(conf.v[1,1]+conf.v[1,2])
acc.v <- (conf.v[2,2]+conf.v[1,1])/(conf.v[1,1]+conf.v[1,2]+conf.v[2,2]+conf.v[2,1])
acc.v
## [1] 0.9152922
tpr.v
## [1] 0.2745098
fpr.v
## [1] 0.07798354
##ROC CURVES
#non-violent recid
prob.nv <- attr(recid.nv.pred, "prob")
roc.nv <- roc(recid.nv[4912:9822,1], prob.nv)
## Setting levels: control = 0, case = 1
## Setting direction: controls > cases
#violent recid
prob.v <- attr(recid.v.pred, "prob")
roc.v <- roc(recid.v[4912:9822,1], prob.v)
## Setting levels: control = 0, case = 1
## Setting direction: controls > cases
The K-NN classifier seems to perform well for both the violent and non-violent recidivist data sets (a K-selection sketch follows below).
- For non-violent recidivists, a K value calculated as the square root of the number of observations seems to maximize accuracy.
- For violent recidivists, large values of K hurt accuracy, so we settle on a K value of 7.
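As a hedged illustration (not from the original analysis) of how such a K can be chosen, one can scan candidate values and track test accuracy on the non-violent split defined above:

library(class)  # for knn()
ks <- c(1, 5, 7, 25, 50, 70, 90, 120)
acc <- sapply(ks, function(k) {
  p <- knn(train.nv, test.nv, recid.nv[1:4911, 1], k = k)
  mean(p == recid.nv[4912:9822, 1])
})
data.frame(k = ks, accuracy = acc)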
Does the K-NN method tend to have a higher FPR for African-Americans compared to Caucasians? We perform this analysis by looking at FPRs for both groups' data subsets.
recid.nv <- recid[c("race","is_recid","age", "juv_fel_count", "juv_misd_count","juv_other_count", "priors_count","count.prior.jail","count.prior.prison","timeinjail","timeinprison")]
recid.v <- recid[c("race","is_violent_recid","age", "juv_fel_count", "juv_misd_count","juv_other_count", "priors_count","count.prior.jail","count.prior.prison","timeinjail","timeinprison")]
# Non-violent recividist classification: African-American vs. Caucasian
b.nv <- recid.nv[which(recid.nv$race == "African-American"),]
w.nv <- recid.nv[which(recid.nv$race == "Caucasian"),]
recid.nv.b <- b.nv[c("is_recid","age", "juv_fel_count", "juv_misd_count","juv_other_count", "priors_count","count.prior.jail","count.prior.prison","timeinjail","timeinprison")]
recid.nv.w <- w.nv[c("is_recid","age", "juv_fel_count", "juv_misd_count","juv_other_count", "priors_count","count.prior.jail","count.prior.prison","timeinjail","timeinprison")]
recid.nvb.norm <- as.data.frame(lapply(recid.nv.b[,2:10], normalize))
recid.nvw.norm <- as.data.frame(lapply(recid.nv.w[,2:10], normalize))
#training sets for both black and white groups
train.nvb <- recid.nvb.norm[1:2456,]
test.nvb <- recid.nvb.norm[2457:4912,]
train.nvw <- recid.nvw.norm[1:1679,]
test.nvw <- recid.nvw.norm[1680:3358,]
#KNN classification for non-violent black recividists
recid.nvb.pred <- knn(train.nvb, test.nvb, recid.nv.b[1:2456,1], k= 50)
conf.nvb <- table(recid.nvb.pred, recid.nv.b[2457:4912,1])
fpr.nvb <- (conf.nvb[1,2])/(conf.nvb[1,1]+conf.nvb[1,2])
conf.nvb
##
## recid.nvb.pred 0 1
## 0 1198 367
## 1 275 616
fpr.nvb
## [1] 0.2345048
#KNN classification for non-violent white recividists
recid.nvw.pred <- knn(train.nvw, test.nvw, recid.nv.w[1:1679,1], k= 50)
conf.nvw <- table(recid.nvw.pred, recid.nv.w[1680:3358,1])
conf.nvw
##
## recid.nvw.pred 0 1
## 0 1161 298
## 1 47 173
fpr.nvw <- (conf.nvw[1,2])/(conf.nvw[1,1]+conf.nvw[1,2])
fpr.nvw
## [1] 0.2042495
## Violent recividist classification: African-American vs. Caucasian
# Obtain data subsets for violent recividists for both groups
b.v <- recid.v[which(recid.v$race == "African-American"),]
w.v <- recid.v[which(recid.v$race == "Caucasian"),]
recid.v.b <- b.v[c("is_violent_recid","age", "juv_fel_count", "juv_misd_count","juv_other_count", "priors_count","count.prior.jail","count.prior.prison","timeinjail","timeinprison")]
recid.v.w <- w.v[c("is_violent_recid","age", "juv_fel_count", "juv_misd_count","juv_other_count", "priors_count","count.prior.jail","count.prior.prison","timeinjail","timeinprison")]
#normalize data sets for violent recid for both white and black groups
recid.vb.norm <- as.data.frame(lapply(recid.v.b[,2:10], normalize))
recid.vw.norm <- as.data.frame(lapply(recid.v.w[,2:10], normalize))
#obtain training and test datasets for both groups
train.vb <- recid.vb.norm[1:2456,]
test.vb <- recid.vb.norm[2457:4913,]
train.vw <- recid.vw.norm[1:1679,]
test.vw <- recid.vw.norm[1680:3358,]
#KNN prediction model for violent black
recid.vb.pred <- knn(train.vb, test.vb, recid.v.b[1:2456,1], k= 12)
conf.vb <- table(recid.vb.pred, recid.v.b[2457:4913,1])
fpr.vb <- (conf.vb[1,2])/(conf.vb[1,1]+conf.vb[1,2])
conf.vb
##
## recid.vb.pred 0 1
## 0 2204 247
## 1 4 2
fpr.vb
## [1] 0.1007752
#KNN prediction model for violent white
recid.vw.pred <- knn(train.vw, test.vw, recid.v.w[1:1679,1], k= 12)
conf.vw <- table(recid.vw.pred, recid.v.w[1680:3358,1])
conf.vw
##
## recid.vw.pred 0 1
## 0 1581 97
## 1 1 0
fpr.vw <- (conf.vw[1,2])/(conf.vw[1,1]+conf.vw[1,2])
fpr.vw
## [1] 0.05780691
Findings and Summary
K-means Clustering
Principal component analysis (PCA) suggests that the main sources of variance in the data set are priors_count (PC1), juv_other_count (PC2), and age (PC3). I apply these variables in the K-means method to explore whether we can find discernible clusters in the data that line up with differences between recidivists and non-recidivists. The main findings under this method are:
Using K-means on the violent and non-violent recidivist data sets with age and priors_count seems to be an ineffective way of identifying clusters, as the distributions of recidivists and non-recidivists overlap for these variables.
Applying K-means using priors_count and juv_other_count seems to be a good option for identifying clusters in the non-violent recidivist data set. Recidivists seem to be concentrated among observations with a high count of arrests (>10).
Simple Linear Regression
I begin by using the regsubsets function to find the best 5 predictors of both violent and non-violent recidivism. The output reveals the strongest predictors by group as follows:
- Violent: count.prior.jail, c_charge_degree(F3), juv_misd_count
- Nonviolent: count.prior.jail, age, priors_count
Next, a linear regression is performed using the variables specified above for the non-violent and violent recidivist data sets. The model is trained on a random subset of the data and tested on the unseen observations. Two plots are shown in the output for the strongest predictor of recidivism (count.prior.jail) in linear, quadratic, and cubic form.
The next code chunk performs a 10-fold cross-validation to determine the polynomial degree with the lowest test error. The results are:
- A degree 3 polynomial yields the lowest estimate of test error for the non-violent recidivism data set
- A degree 2 polynomial yields the lowest estimate of test error for the violent recidivism data set
K-NN Classification
For this method, I begin by again splitting the data set into violent and non-violent recidivists. Since the K-NN method requires normalization of all observations, all non-numeric variables (sex, race, and charge degree) are removed from the data sets. All observations are normalized once these modifications are made.
Next, I extract a training and test data set for both violent and non-violent recidivists, and K-NN is run on both. A confusion matrix, TPR, FPR, and accuracy rates are computed, and the respective ROC is shown. K-NN classification reveals the following: the K-NN classifier seems to perform well for both data sets; for non-violent recidivists, a K value near the square root of the number of observations maximizes accuracy, while for violent recidivists large values of K hurt accuracy, so a K value of 7 is used.
I continue to use K-NN to answer the following question: Does the K-NN method tend to have a higher FPR for African-Americans compared to Caucasians? In other words, is the model biased?
Was the Model Biased?
Both the linear and K-NN methods predict or classify African-American individuals as recidivists at a higher rate than Caucasians. I believe this is due to the fact that the proportion of African-American recidivists (both violent and non-violent) is higher than that of any other group in the data set. Even when race is not used as a predictor, the models classify or predict recidivism at a higher rate for African-Americans.
For K-NN classification, I attempt to analyze bias in a different way. Instead of looking at the proportion of individuals classified as likely recidivists, I use the false positive rate to see whether African-Americans are being misclassified as recidivists at a higher rate than Caucasians. I do this by dividing the non-violent and violent recidivist data sets by race group (African-Americans and Caucasians). The data are then put through the same normalization process and the K-NN method is applied. The result is 4 confusion matrices:
A comparison of FPRs for the African-American and Caucasian data subsets reveals that the difference in misclassification as non-violent recidivists is small. For the African-American group, the K-NN FPR is 0.235. For the Caucasian subset, the FPR is 0.204. The difference of about 3 percentage points does not seem significant.
This comparison of FPRs for the non-violent recid data set suggests that K-NN does not erroneously classify African-Americans as recidivists at a meaningfully higher rate than the Caucasian group. From this analysis, I conclude that the models themselves are not biased. Rather, bias might be inherent in the way African-Americans are arrested at higher rates by law enforcement.
Model Selection
Based on all of our models, we selected the QDA as our model of choice. It had the lowest misclassification rates out of all of the models we explored, at roughly 8% for Two-Year Recidivism (compared to the average of 18 - 20%) and about 4% for Violent Recidivism (compared to the average of 7 - 8%). Again, we went with the lowest misclassification rate because we believed that COMPAS had an accuracy problem – the COMPAS algorithm misclassified Black defendants more than White defendants. Our QDA model is also not necessarily biased by groups:
We see that, even after removing African Americans from the dataset and running an ANOVA, the differences are still statistically significant at the .05 level. This means the race-level differences are not driven by African Americans alone; visually, White offenders show the next-largest difference compared to other races. We take this to mean that the QDA model predicted Two Year Recidivism rates without bias toward any single racial group. In other words, it is correctly capturing the trends we see in the Two Year Recidivism EDA.
The QDA also predicts younger people as recidivating more than older people, as well as males more than females. This accurately captures the trends we saw in the Two Year Recidivism EDA.
Similarly, we see that after removing African Americans from the dataset and running an ANOVA on Violent Recidivism, significant differences still exist. Again, we take this to mean that the QDA model predicted Violent Recidivism rates without bias toward any single racial group. In other words, QDA is correctly capturing the trends that we saw in the Violent Recidivism EDA.
Younger people are also predicted to violently recidivate more than older people, as with males more than females. We believe that the QDA model accurately captures the trends we observed in the Violent Recidivism EDA.
Further, in order to accurately compare our model with COMPAS's predictive accuracy, we needed to calculate the concordance rate. This is because the Cox model used by Northpointe measures predictive accuracy via concordance. What follows is our calculation for both Two Year Recidivism and Violent Recidivism.
Concordance for Comparison with COMPAS
#Is_Recid
qda.pred.is_recid <- as.numeric(qda.pred.is_recid)
qda.recid.conc <- Concordance(qda.test.data.r, qda.pred.is_recid)
qda.recid.conc
## $Concordance
## [1] 0.4043621
##
## $Discordance
## [1] 0.5956379
##
## $Tied
## [1] 0
##
## $Pairs
## [1] 1962736
According to Angwin and colleagues (2016), Northpointe quoted a concordance rate of 68%, but ProPublica's own analysis achieved only a 63% concordance rate. Unfortunately, our concordance rate for Two-Year Recidivism is only 40.4%, a 40.59% relative decrease from Northpointe's quoted rate.
#Is_Violent_Recid
qda.pred.vis_recid <- as.numeric(qda.pred.vis_recid)  # factor -> numeric scores
qda.vrecid.conc <- Concordance(qda.test.data.v, qda.pred.vis_recid)
qda.vrecid.conc
## $Concordance
## [1] 0.1903233
##
## $Discordance
## [1] 0.8096767
##
## $Tied
## [1] 0
##
## $Pairs
## [1] 656772
Angwin and colleagues (2016) also reported a concordance rate of about 65.1% for Violent Recidivism. The QDA model, however, achieved only 19.03% concordance, a 70.77% relative decrease.
In the end, our model was not as accurate as Northpointe's Cox model for either Two-Year or Violent Recidivism.
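As a sanity check on these figures, concordance can also be computed directly from its definition, without the Concordance() helper: the share of recidivist/non-recidivist pairs in which the recidivist receives the higher predicted score. The sketch below uses hypothetical argument names; note that it counts tied pairs separately, so with binary 0/1 predictions its tie count may not match the package output above.
# Definition-based concordance: compare every subject who experienced the
# outcome (actuals == 1) with every subject who did not (actuals == 0)
concordance_manual <- function(actuals, scores) {
  pos <- scores[actuals == 1]  # predicted scores for recidivists
  neg <- scores[actuals == 0]  # predicted scores for non-recidivists
  pairs <- length(pos) * length(neg)
  conc <- sum(outer(pos, neg, ">"))  # recidivist scored strictly higher
  disc <- sum(outer(pos, neg, "<"))  # recidivist scored strictly lower
  list(Concordance = conc / pairs,
       Discordance = disc / pairs,
       Tied = (pairs - conc - disc) / pairs,
       Pairs = pairs)
}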
Important Predictors for Two-Year and Violent Recidivism Rates
In terms of importance, we believed the Random Forest importance plots would show us the most important predictors for both the two-year and violent recidivism rates. These plots can be found in the Random Forest analyses of two-year and violent recidivism, respectively, earlier in this document. We found that:
1. count.prior.jail was the most important predictor of Two-Year Recidivism rates, and
2. count.prior.jail was also the most important predictor of Violent Recidivism rates.
In both cases the variable is described as the 'number of times the defendant was booked in jail'. Its importance is intuitive: the more times an individual has been arrested and booked, the more likely that person is to recidivate.
How do the two models compare? count.prior.jail is the most important variable in both cases. However, the MeanDecreaseAccuracy for count.prior.jail is close to 200 in the is_recid model, compared to about 40 in the is_violent_recid model. This suggests that the is_recid Random Forest model depends heavily on the count.prior.jail variable, whereas the is_violent_recid Random Forest model gives more weight to the other six variables.
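For context, here is a minimal sketch of how these importance measures are typically produced with the randomForest package; train is a hypothetical training split, not our exact script variable:
library(randomForest)

# Fit a Random Forest with permutation-based importance enabled
rf.fit <- randomForest(as.factor(is_recid) ~ ., data = train,
                       importance = TRUE)

# MeanDecreaseAccuracy per predictor, and the corresponding dotchart
importance(rf.fit, type = 1)
varImpPlot(rf.fit, type = 1)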
Comparing Classifications to COMPAS
Unfortunately, the QDA model doesn’t easily output coefficients for us to compare to the COMPAS model. We also don’t have the exact Northpointe COMPAS model and their classifications to compare. However, looking at the predicted graphs here, we can easily see that the QDA predicts African-Americans as recidivating (and violently recidivating) more than White defendants. Angwin and colleagues (2016) found the same issues in the COMPAS model, for both two-year recidivism and violent recidivism. As such, we do not have any huge systematic differences between our chosen model and the COMPAS model because they produce similar classifications for the same race, sex, and age groups.
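The group-level comparison itself is straightforward to reproduce. A sketch, assuming a hypothetical test set test with a race column and factor-coded QDA predictions qda.pred:
# Proportion of each racial group that the QDA model predicts will recidivate
# (qda.pred is assumed to be a factor with levels "0" and "1")
tapply(qda.pred == "1", test$race, mean)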
Discussion
Data Limitations
With respect to the bias problem, we want to draw attention to the fact that the data themselves would be considered 'biased'. We found in our EDA that African-Americans, younger people, and males all tend to recidivate more. Angwin and colleagues used Northpointe's COMPAS scores to determine whether COMPAS itself was biased; Northpointe plugged these scores into a Cox model to predict recidivism, and it is clear that using the score could introduce bias. However, even without the COMPAS score included in the data set, we can see that young African-American males typically recidivate more. A model trained on this data will learn to capture the trends it sees. If the data appear this way, our models can appear racially biased, and could also be considered biased by sex.
If a model is trained on this type of data, the model will be considered biased. However, we want to point out that all models trained on similar data will typically be considered biased, because of a systematic bias in the data itself. Take, for example, a project that John worked on in undergrad (Lloyd et al. 2017): White participants held an innate bias against Black targets, such that White participants initially believed Black targets were telling lies more often than White targets. Applied to the criminal justice system, it is more than likely that African-Americans are arrested more often because of an innate, systematic bias against Black individuals. In actuality, Black defendants may be no more likely to recidivate than their White counterparts, but the justice system has the capacity to skew the data so that it appears otherwise. Models fit on data collected by the justice system (such as the data used in this project and in ProPublica's dataset) can therefore reasonably be expected to appear biased.
Although our models were thus judged to be biased, the group believes that QDA, with its low misclassification rates, stands the best chance of capturing the trends in similar recidivism data.
Model Limitations
Our model did not perform as well as the Cox model that Northpointe produced. This may be because we did not create our own evaluation instrument, as Northpointe did, but rather built a model directly from the provided data. There are many models and data mining methods we did not use in this project, most notably a Cox model. It is entirely possible that QDA is not the best model for predicting recidivism, and that a Cox model is better suited to this type of data and prediction. We chose to focus on the specific data mining methods we learned in class, with the exception of the Neural Network, to showcase what we learned over the mini. Unfortunately, the Cox model is something we did not learn (or only touched on briefly), so it was not included in our analysis and discussion.
Conclusion
We evaluated 8 different classification models on the task of classifying individuals in a specific dataset by whether they are predicted to recidivate within two years and whether they are predicted to violently recidivate. The COMPAS model has a misclassification problem: Black defendants were more likely than White defendants to be classified as recidivating. We worked to narrow this misclassification problem and found that QDA had the smallest misclassification rates for both two-year and violent recidivism, at 8% and 4% respectively. However, as all of us found in our individual analyses, the QDA model did suffer from bias: it predicted Black defendants as recidivating more than White defendants, males more than females, and younger individuals more than older individuals. We discuss the possibility that such biases are not a fault of the models, but an issue of systematic bias that may be present in the justice system's data. Despite this, we believe the QDA models best captured the trends we observed in the dataset, and that QDA may be the best choice for predicting trends in recidivism rates.
Citations
Angwin, Julia, Jeff Larson, Surya Mattu, Lauren Kirchner, and ProPublica. “Machine Bias.” ProPublica, May 23, 2016. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing.
Logan, Brent. “ROC Curves and the C Statistic” 19, no. 4 (2013): 5.
Flom, Peter, and David Cassell. “Stopping Stepwise: Why Stepwise and Similar Selection Methods Are Bad, and What You Should Use.” Paper, March 23, 2013. https://web.archive.org/web/20130323174011/http://www.nesug.org/proceedings/nesug07/sa/sa07.pdf.
Pan, Yuqing, Qing Mai, and Xin Zhang. “TULIP: A Toolbox for Linear Discriminant Analysis with Penalties.” ArXiv:1904.03469 [Stat], April 6, 2019. http://arxiv.org/abs/1904.03469.
“Classification: LDA and QDA Approaches.” Accessed May 12, 2020. https://pages.mtu.edu/~shanem/psy5220/daily/Day12/classification.html.
Glass Box. “Best Use of Train/Val/Test Splits, with Tips for Medical Data,” September 15, 2019. https://glassboxmedicine.com/2019/09/15/best-use-of-train-val-test-splits-with-tips-for-medical-data/.
Lloyd, E. Paige, et al. “Black and White Lies: Race-Based Biases in Deception Judgments.” Psychological Science 28, no. 8 (August 1, 2017): 1125–36. https://doi.org/10.1177/0956797617705399.