The Cook County Assessor’s Office (CCAO) uses LightGBM for its residential and condominium valuation models. LightGBM is a machine learning framework that works by iteratively growing many decision trees. It is especially suited to data with many categorical variables and complex interactions, such as housing data.
However, LightGBM is also complicated. Its outputs can be difficult to explain and interpret, especially as model complexity grows. Techniques such as SHAP values can help, but aren’t intuitive to the average person.
This vignette outlines a novel technique to explain LightGBM outputs specifically for housing data. The technique finds comparable sales from the model training data by exploiting the tree structure of a LightGBM model. Its goal is to help diagnose model issues and answer a common question from property owners, “What comparable sales did you use to value my property?”
To understand the comparable sale (comp) finding technique, you must first understand how decision trees determine a property’s value. Below is a simple decision tree trained on the Ames Housing Dataset. The training process uses sales data to discover patterns in how property variables, like `neighborhood` and `gr_liv_area`, contribute to sale price. Once trained, the tree can be used to predict a property’s value, even if that property hasn’t sold. Let’s see how.
This tree is made up of rectangular splits and oval leaves. Each split has a property variable (e.g. `gr_liv_area`) and a rule (e.g. `<= 1224.5`). The bold arrow points to the node you should go to when the rule is true for a given property. For example, at Split 4 in the tree above, properties with a living area of 1,224.5 or less should proceed to Split 9.
Each leaf shows the value a property will get from the tree after going through all the splits. For example, the Ames property below (ID = 2) follows the purple path through the tree based on its characteristics. Its predicted value from the tree is $184,306, as shown at the terminal leaf node (Leaf 4).
ID | Livable Area | Lot Area | Year Built | Condition | Neighborhood | Pred. Value |
---|---|---|---|---|---|---|
2 | 1,629 | 13,830 | 1997 | 4 | 6 | $184,306 |
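These split rules and leaf values can also be read directly from the fitted model. Below is a minimal sketch, assuming `ames_model` is a single-tree LightGBM booster trained on this data (the name is illustrative); `lgb.model.dt.tree()` dumps the tree into a table with one row per node.

```r
library(lightgbm)

# Dump the fitted tree into a data.table with one row per node. Split rows
# carry the splitting variable and threshold; leaf rows carry the leaf value.
# `ames_model` is a placeholder for a single-tree lightgbm booster.
tree_dt <- lgb.model.dt.tree(ames_model)

# Keep just the columns that correspond to the splits and leaves drawn above
tree_dt[, .(split_index, split_feature, threshold, leaf_index, leaf_value)]
```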
Other properties may have a different path through the tree. Below, property number 75 takes the green path through the tree and receives a predicted value of $195,367.
ID | Livable Area | Lot Area | Year Built | Condition | Neighborhood | Pred. Value |
---|---|---|---|---|---|---|
2 | 1,629 | 13,830 | 1997 | 4 | 6 | $184,306 |
75 | 1,968 | 12,003 | 2009 | 4 | 14 | $195,367 |
The process of “predicting” with a decision tree is just running each property through the rules established by the splits. Properties with similar split outcomes will also have similar characteristics and will end up in the same leaf node of the tree. As such, extracting comparable sales from the training data of a single tree is simple: just find all the properties that share a leaf node with your target property.
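As a rough sketch of this lookup in R: assuming `ames_model` is the single-tree booster and `ames_matrix` is the numeric feature matrix it was trained on (both names are illustrative), `predict()` can return the leaf index each observation falls into.

```r
# Leaf index for every training observation; one column per tree (here, one).
# Older lightgbm versions use `predleaf = TRUE` instead of `type = "leaf"`.
leaf_idx <- predict(ames_model, ames_matrix, type = "leaf")

# Row of the target property (ID = 2) in the training matrix
target_row <- 2

# Single-tree comps are simply every property sharing the target's leaf node
comp_rows <- which(leaf_idx[, 1] == leaf_idx[target_row, 1])
```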
Let’s use the property with ID = 2 as our target. As we saw above, it ends up in Leaf 4. Here are some comparable sales from the same leaf node:
ID | Livable Area | Lot Area | Year Built | Condition | Neighborhood | Pred. Value |
---|---|---|---|---|---|---|
3 | 1,604 | 9,978 | 1998 | 5 | 6 | $184,306 |
14 | 1,960 | 7,851 | 2002 | 4 | 6 | $184,306 |
21 | 2,110 | 8,880 | 1994 | 4 | 9 | $184,306 |
48 | 1,675 | 15,263 | 1959 | 4 | 18 | $184,306 |
55 | 1,694 | 10,475 | 2008 | 4 | 1 | $184,306 |
60 | 1,978 | 10,389 | 2003 | 4 | 1 | $184,306 |
61 | 2,098 | 9,375 | 1997 | 4 | 1 | $184,306 |
62 | 1,661 | 12,137 | 1998 | 4 | 1 | $184,306 |
70 | 1,652 | 19,645 | 1994 | 5 | 12 | $184,306 |
84 | 1,571 | 7,837 | 1993 | 6 | 6 | $184,306 |
All of these properties follow the same purple path through the tree and end up receiving the same predicted value from Leaf 4. However, their characteristics aren’t actually very comparable. They have vastly different lot areas and ages. Any appraiser seeing these comps would probably laugh at you.
So what’s the issue? Why aren’t our properties in the same leaf node actually comparable? It’s because our model is too simple. It doesn’t yet have enough rules to distinguish a house built in 1959 from one built in 2008. We can solve this by adding more splits to our single tree, by adding more trees, or by doing both.
In practice, frameworks like LightGBM don’t use just one decision tree. Instead, they combine many decision trees together, either by taking the average or adding their results. This can create incredibly complex rulesets that are difficult for humans to follow, but which lead to much better predictions. Let’s extend our decision tree from before with an additional tree.
Tree 0 is the same as before, and the purple path shows the rules followed by our target property (ID = 2).
Tree 1 is a new tree. The target property still follows the purple path, but notice the new values at each split and leaf. These values are substantially less than the values from Tree 0. That’s because they are added to Tree 0’s results. Using LightGBM, the final predicted value for a given property is the sum of predicted values from all trees.
In the case of our target property, the final predicted value is $187,627, which comes from adding the values of Tree 0: Leaf 4 and Tree 1: Leaf 4 (the displayed leaf values are rounded, so they may not sum exactly to the final figure).
ID | Livable Area | Lot Area | Year Built | Condition | Neighborhood | Pred. Value (Tree 0) | Pred. Value (Tree 1) | Pred. Value (Final) |
---|---|---|---|---|---|---|---|---|
2 | 1,629 | 13,830 | 1997 | 4 | 6 | $184,306 | $3,322 | $187,627 |
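This additivity is easy to verify, since `predict()` can be restricted to the first N trees via `num_iteration`. The sketch below assumes `ames_model_2` is the two-tree booster and `ames_matrix` its feature matrix (illustrative names).

```r
# Prediction after Tree 0 only, then after Tree 0 + Tree 1
pred_tree_0 <- predict(ames_model_2, ames_matrix, num_iteration = 1)
pred_both   <- predict(ames_model_2, ames_matrix, num_iteration = 2)

# Tree 1's contribution for each property is whatever the second tree added
contrib_tree_1 <- pred_both - pred_tree_0

# For the target property (row 2), Tree 0's value plus Tree 1's contribution
# reproduces the final predicted value
all.equal(pred_both[2], pred_tree_0[2] + contrib_tree_1[2])
```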
What about our comparable sales from earlier? Like before, they all share the same first leaf node (in Tree 0), but now they start to differ in the second tree. The newest property (ID = 55) receives a higher final value and the property with the smallest livable area (ID = 84) receives a lower final value.
ID | Livable Area | Lot Area | Year Built | Condition | Neighborhood | Pred. Value (Tree 0) | Pred. Value (Tree 1) | Pred. Value (Final) |
---|---|---|---|---|---|---|---|---|
3 | 1,604 | 9,978 | 1998 | 5 | 6 | $184,306 | $3,322 | $187,627 |
14 | 1,960 | 7,851 | 2002 | 4 | 6 | $184,306 | $3,322 | $187,627 |
21 | 2,110 | 8,880 | 1994 | 4 | 9 | $184,306 | $3,322 | $187,627 |
48 | 1,675 | 15,263 | 1959 | 4 | 18 | $184,306 | $3,322 | $187,627 |
55 | 1,694 | 10,475 | 2008 | 4 | 1 | $184,306 | $10,207 | $194,512 |
60 | 1,978 | 10,389 | 2003 | 4 | 1 | $184,306 | $3,322 | $187,627 |
61 | 2,098 | 9,375 | 1997 | 4 | 1 | $184,306 | $3,322 | $187,627 |
62 | 1,661 | 12,137 | 1998 | 4 | 1 | $184,306 | $3,322 | $187,627 |
70 | 1,652 | 19,645 | 1994 | 5 | 12 | $184,306 | $3,322 | $187,627 |
84 | 1,571 | 7,837 | 1993 | 6 | 6 | $184,306 | -$1,044 | $183,262 |
Properties that match the target property in both trees will be more similar than ones that match in just one tree. In this case, the properties that landed in Leaf 4 of both Tree 0 and Tree 1 will be most comparable to our target (ID = 2).
However, as more trees are added, the number of properties that share all their leaf nodes with the target will shrink rapidly. Here’s what happens to our target and comparable sold properties if we add 10 trees to the model.
The green cells below show leaf nodes shared with the target property. Note that by the final tree, there are no comparable sales that share all the target’s leaf nodes.
ID | Tree 0 | Tree 1 | Tree 2 | Tree 3 | Tree 4 | Tree 5 | Tree 6 | Tree 7 | Tree 8 | Tree 9 |
---|---|---|---|---|---|---|---|---|---|---|
2 | $184,306 | $3,322 | $2,791 | $1,779 | $1,853 | $1,928 | $1,398 | $1,483 | -$450 | $605 |
ID | Tree 0 | Tree 1 | Tree 2 | Tree 3 | Tree 4 | Tree 5 | Tree 6 | Tree 7 | Tree 8 | Tree 9 |
---|---|---|---|---|---|---|---|---|---|---|
3 | $184,306 | $3,322 | $2,791 | $1,779 | $1,853 | $1,928 | -$1,645 | $1,483 | -$450 | $605 |
14 | $184,306 | $3,322 | $9,008 | $1,779 | $4,435 | $7,398 | -$1,645 | $1,483 | $1,928 | $1,288 |
21 | $184,306 | $3,322 | $9,008 | $1,779 | $4,435 | $7,398 | $2,164 | $6,202 | $1,928 | $605 |
48 | $184,306 | $3,322 | -$2,739 | $1,779 | $1,853 | -$2,057 | $2,164 | -$1,975 | $3,036 | $605 |
55 | $184,306 | $10,207 | $7,507 | $1,779 | $1,853 | $1,928 | $2,164 | $5,202 | $5,038 | $3,666 |
60 | $184,306 | $3,322 | $9,008 | $1,779 | $4,435 | $7,398 | $2,164 | $1,483 | $3,036 | $3,666 |
61 | $184,306 | $3,322 | $9,008 | $1,779 | $4,435 | $7,398 | $2,164 | $6,202 | $3,036 | $605 |
62 | $184,306 | $3,322 | $2,791 | $1,779 | $1,853 | $1,928 | $2,164 | $1,483 | $3,036 | $605 |
70 | $184,306 | $3,322 | $2,791 | $1,779 | $1,853 | $5,629 | $2,164 | $1,483 | $3,036 | $605 |
84 | $184,306 | -$1,044 | $2,791 | $1,779 | $1,853 | $1,928 | -$1,645 | $1,483 | -$450 | $605 |
Given that individual properties are unlikely to share all of their leaf nodes with a target property, we need a way to measure comparability that doesn’t rely on perfect matching.
Now that we’ve seen how LightGBM trees are structured, we can use them to create a similarity score. This score is simply the percentage of shared leaf nodes between a target and comparable property, weighted by the importance of each tree.
Let’s see it in action. Here is the same green table from above, but with predicted values replaced by a boolean (T) when the property shares the same leaf node as the target within each tree. For the target property itself, the table instead shows the index of the leaf node it lands in within each tree. The rightmost columns show the number and percentage of leaf nodes shared with the target.
ID | Tree 0 | Tree 1 | Tree 2 | Tree 3 | Tree 4 | Tree 5 | Tree 6 | Tree 7 | Tree 8 | Tree 9 | Number | Percent |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | 4 | 4 | 4 | 2 | 4 | 7 | 6 | 4 | 6 | 4 |
ID | Tree 0 | Tree 1 | Tree 2 | Tree 3 | Tree 4 | Tree 5 | Tree 6 | Tree 7 | Tree 8 | Tree 9 | Number | Percent |
---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | T | T | T | T | T | T | F | F | F | F | 6 / 10 | 60% |
14 | T | T | F | F | F | F | F | F | F | F | 2 / 10 | 20% |
21 | T | T | F | F | F | F | F | F | F | F | 2 / 10 | 20% |
48 | T | T | F | F | F | F | F | F | F | F | 2 / 10 | 20% |
55 | T | F | F | F | F | F | F | F | F | F | 1 / 10 | 10% |
60 | T | T | F | F | F | F | F | F | F | F | 2 / 10 | 20% |
61 | T | T | F | F | F | F | F | F | F | F | 2 / 10 | 20% |
62 | T | T | T | T | T | T | F | F | F | F | 6 / 10 | 60% |
70 | T | T | T | T | T | F | F | F | F | F | 5 / 10 | 50% |
84 | T | F | T | T | T | T | F | F | F | F | 6 / 10 | 60% |
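Constructing this matching table in R is straightforward: get the leaf index matrix for the full model, then compare every row to the target’s row. The sketch below assumes `ames_model_10` is the 10-tree booster and `ames_matrix` its feature matrix (illustrative names).

```r
# Leaf indices for every observation: one row per property, one column per tree
leaf_idx <- predict(ames_model_10, ames_matrix, type = "leaf")

target_row <- 2

# TRUE wherever a property lands in the same leaf as the target for that tree
match_matrix <- sweep(leaf_idx, 2, leaf_idx[target_row, ], FUN = "==")

# Unweighted similarity: the share of trees with a matching leaf node
pct_shared <- rowMeans(match_matrix)
```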
The observations with a high percentage of shared leaf nodes should be most comparable to the target property. However, this assumes that all trees in the model are weighted equally, which is rarely the case. In LightGBM, trees typically have diminishing importance, i.e. each successive tree has less impact on the overall error of the model. To find accurate comparables, we need to quantify each tree’s importance and use it to weight each leaf node match.
Here is a simple function to determine tree importance based on how much each tree contributes to the overall decrease in model error. We can apply it to our training data to get weights for each tree.
get_weights <- function(metric, model, train, outcome_col, num_trees) {
model$params$metric <- list(metric)
train_lgb <- lgb.Dataset(as.matrix(train), label = train[[outcome_col]])
trained_model <- lgb.train(
params = model$params,
data = train_lgb,
valids = list(test = train_lgb),
nrounds = num_trees
)
# Get the initial error for base model before first tree
# this NEEDS to be after the model is trained
# (or else it won't train correctly)
set_field(train_lgb, "init_score", as.matrix(train[[outcome_col]]))
initial_predictions <- get_field(train_lgb, "init_score")
init_score <- mean(initial_predictions)
# Index into the errors list, and un-list so it is a flat/1dim list
errors <- unlist(trained_model$record_evals$test[[metric]]$eval)
errors <- c(init_score, errors)
diff_in_errors <- diff(errors, 1, 1)
# Take proportion of diff in errors over total diff in
# errors from all trees
weights <- diff_in_errors / sum(diff_in_errors)
return(weights)
}
# Prepare data using a tidymodels recipe
ames_train_prep <- bake(prep(ames_recp), ames_train)
# Get the decrease in error caused by each successive tree
ames_tree_weights <- get_weights(
metric = "rmse",
model = ames_fit_eng_10,
train = ames_train_prep,
outcome_col = "sale_price",
num_trees = 10
)
| Tree 0 | Tree 1 | Tree 2 | Tree 3 | Tree 4 | Tree 5 | Tree 6 | Tree 7 | Tree 8 | Tree 9 |
---|---|---|---|---|---|---|---|---|---|---|
Weight | 71.2% | 4.7% | 4.2% | 3.8% | 3.4% | 3.1% | 2.8% | 2.5% | 2.2% | 2.0% |
These weights are then multiplied row-wise by the boolean matching matrix from above. This means that for each tree and comparable sale, the weight value is either kept (when matching / `TRUE`) or zeroed out (when not matching / `FALSE`). The final similarity score for our target property (ID = 2) is the sum of each comparable row.
ID | Tree 0 | Tree 1 | Tree 2 | Tree 3 | Tree 4 | Tree 5 | Tree 6 | Tree 7 | Tree 8 | Tree 9 | Sim. Score |
---|---|---|---|---|---|---|---|---|---|---|---|
3 | 71.2% | 4.7% | 4.2% | 3.8% | 3.4% | 3.1% | 0.0% | 0.0% | 0.0% | 0.0% | 90.46% |
14 | 71.2% | 4.7% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 75.88% |
21 | 71.2% | 4.7% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 75.88% |
48 | 71.2% | 4.7% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 75.88% |
55 | 71.2% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 71.18% |
60 | 71.2% | 4.7% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 75.88% |
61 | 71.2% | 4.7% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 75.88% |
62 | 71.2% | 4.7% | 4.2% | 3.8% | 3.4% | 3.1% | 0.0% | 0.0% | 0.0% | 0.0% | 90.46% |
70 | 71.2% | 4.7% | 4.2% | 3.8% | 3.4% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 87.37% |
84 | 71.2% | 0.0% | 4.2% | 3.8% | 3.4% | 3.1% | 0.0% | 0.0% | 0.0% | 0.0% | 85.76% |
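In R, the row-wise weighting and summing described above reduces to a single matrix multiplication. This sketch assumes `match_matrix` is the boolean matrix built earlier and `ames_tree_weights` is the output of `get_weights()`.

```r
# Weighted similarity score: matrix product of the TRUE/FALSE match matrix
# (coerced to 0/1) and the per-tree weights
sim_score <- as.vector(match_matrix %*% ames_tree_weights)

# Rank candidate comps by similarity, dropping the target property itself
comp_order <- setdiff(order(sim_score, decreasing = TRUE), target_row)
```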
Now we can simply sort by similarity score to get the observations most similar to our target property (ID = 2).
ID | Livable Area | Lot Area | Year Built | Condition | Neighborhood | Sale Price | Sim. Score |
---|---|---|---|---|---|---|---|
2 | 1,629 | 13,830 | 1997 | 4 | 6 | $189,900 |
ID | Livable Area | Lot Area | Year Built | Condition | Neighborhood | Sale Price | Sim. Score |
---|---|---|---|---|---|---|---|
3 | 1,604 | 9,978 | 1998 | 5 | 6 | $195,500 | 90.46% |
62 | 1,661 | 12,137 | 1998 | 4 | 1 | $224,900 | 90.46% |
70 | 1,652 | 19,645 | 1994 | 5 | 12 | $203,135 | 87.37% |
84 | 1,571 | 7,837 | 1993 | 6 | 6 | $178,000 | 85.76% |
14 | 1,960 | 7,851 | 2002 | 4 | 6 | $216,500 | 75.88% |
21 | 2,110 | 8,880 | 1994 | 4 | 9 | $205,000 | 75.88% |
48 | 1,675 | 15,263 | 1959 | 4 | 18 | $173,000 | 75.88% |
60 | 1,978 | 10,389 | 2003 | 4 | 1 | $318,000 | 75.88% |
61 | 2,098 | 9,375 | 1997 | 4 | 1 | $240,000 | 75.88% |
55 | 1,694 | 10,475 | 2008 | 4 | 1 | $245,350 | 71.18% |
Now we’re talking! Properties 3 and 62 are nearly identical to our target property, while 60 and 61 aren’t too similar. But what about property 55? It looks fairly similar to our target, but has the lowest similarity score.
Property 55 reveals another advantage of this approach to finding comparable sales: variables get implicitly weighted by their importance. Variables that aren’t predictive in the LightGBM model are less likely to appear in splits and are therefore less likely to determine whether two properties share a terminal leaf node. In the case of property 55, `Year Built` is an important predictor of value and appears in many splits, so property 55 receives a low similarity score due to the large difference in age compared to the target.
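One way to see this implicit weighting is to look at how often each variable is used for splitting. The sketch below uses `lgb.importance()`, which reports (among other measures) the share of splits that use each feature; `ames_model_10` is again an illustrative name.

```r
# Per-feature importance; the Frequency column is the proportion of splits
# using each feature, which drives how strongly it affects leaf membership
lgb.importance(ames_model_10, percentage = TRUE)
```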
This comp-finding approach does have some disadvantages. Mainly, it requires you to generate a boolean matching matrix for every target property against every possible comparable. This many-to-many relationship quickly blows up the compute required for comp finding, particularly for large or complex models. However, with some clever coding, this isn’t too hard to work around.
Overall, this approach is robust, relatively intuitive, and delivers good comparables that will finally let us answer the question, “What comparable sales did you use to value my property?”