The Cook County Assessor’s Office (CCAO) uses LightGBM for its residential and condominium valuation models. LightGBM is a machine learning framework that works by iteratively growing many decision trees. It is especially suited to data with many categorical variables and complex interactions, such as housing data.
However, LightGBM is also complicated. Its outputs can be difficult to explain and interpret, especially as model complexity grows. Techniques such as SHAP values can help, but aren’t intuitive to the average person.
This vignette outlines a novel technique to explain LightGBM outputs specifically for housing data. The technique finds comparable sales from the model training data by exploiting the tree structure of a LightGBM model. Its goal is to help diagnose model issues and answer a common question from property owners, “What comparable sales did you use to value my property?”
To understand the comparable sale (comp) finding technique, you must first understand how decision trees determine a property’s value. Below is a simple decision tree trained on the Ames Housing Dataset. The training process uses sales data to discover patterns in how property variables, like `neighborhood` and `gr_liv_area`, contribute to sale price. Once trained, the tree can be used to predict a property’s value, even if that property hasn’t sold. Let’s see how.
This tree is made up of rectangular splits and oval leaves. Each split has a property variable (e.g. `gr_liv_area`) and a rule (e.g. `<= 1224.5`). The bold arrow points to the node you should go to when the rule is true for a given property. For example, at Split 4 in the tree above, properties with a living area of 1,224.5 or less should proceed to Split 9.
Each leaf shows the value a property will get from the tree after going through all the splits. For example, the Ames property below (ID = 2) follows the purple path through the tree based on its characteristics. Its predicted value from the tree is $184,306, as shown at the terminal leaf node (Leaf 4).
ID | Livable Area | Lot Area | Year Built | Condition | Neighborhood | Pred. Value |
---|---|---|---|---|---|---|
2 | 1,629 | 13,830 | 1997 | 4 | 6 | $184,306 |
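These split rules and leaf values can also be read directly from the fitted model. Below is a minimal sketch, assuming `ames_model` is a single-tree LightGBM booster trained on this data (the name is illustrative); `lgb.model.dt.tree()` dumps the tree into a table with one row per node.

```r
library(lightgbm)

# Dump the fitted tree into a data.table with one row per node. Split rows
# carry the splitting variable and threshold; leaf rows carry the leaf value.
# `ames_model` is a placeholder for a single-tree lightgbm booster.
tree_dt <- lgb.model.dt.tree(ames_model)

# Keep just the columns that correspond to the splits and leaves drawn above
tree_dt[, .(split_index, split_feature, threshold, leaf_index, leaf_value)]
```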
Other properties may have a different path through the tree. Below, property number 75 takes the green path through the tree and receives a predicted value of $195,367.
ID | Livable Area | Lot Area | Year Built | Condition | Neighborhood | Pred. Value |
---|---|---|---|---|---|---|
2 | 1,629 | 13,830 | 1997 | 4 | 6 | $184,306 |
75 | 1,968 | 12,003 | 2009 | 4 | 14 | $195,367 |
The process of “predicting” with a decision tree is just running each property through the rules established by the splits. Properties with similar split outcomes will also have similar characteristics and will end up in the same leaf node of the tree. As such, extracting comparable sales from the training data of a single tree is simple: just find all the properties that share a leaf node with your target property.
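As a rough sketch of this lookup in R: assuming `ames_model` is the single-tree booster and `ames_matrix` is the numeric feature matrix it was trained on (both names are illustrative), `predict()` can return the leaf index each observation falls into.

```r
# Leaf index for every training observation; one column per tree (here, one).
# Older lightgbm versions use `predleaf = TRUE` instead of `type = "leaf"`.
leaf_idx <- predict(ames_model, ames_matrix, type = "leaf")

# Row of the target property (ID = 2) in the training matrix
target_row <- 2

# Single-tree comps are simply every property sharing the target's leaf node
comp_rows <- which(leaf_idx[, 1] == leaf_idx[target_row, 1])
```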
Let’s use the property with ID = 2 as our target. As we saw above, it ends up in Leaf 4. Here are some comparable sales from the same leaf node:
ID | Livable Area | Lot Area | Year Built | Condition | Neighborhood | Pred. Value |
---|---|---|---|---|---|---|
3 | 1,604 | 9,978 | 1998 | 5 | 6 | $184,306 |
14 | 1,960 | 7,851 | 2002 | 4 | 6 | $184,306 |
21 | 2,110 | 8,880 | 1994 | 4 | 9 | $184,306 |
48 | 1,675 | 15,263 | 1959 | 4 | 18 | $184,306 |
55 | 1,694 | 10,475 | 2008 | 4 | 1 | $184,306 |
60 | 1,978 | 10,389 | 2003 | 4 | 1 | $184,306 |
61 | 2,098 | 9,375 | 1997 | 4 | 1 | $184,306 |
62 | 1,661 | 12,137 | 1998 | 4 | 1 | $184,306 |
70 | 1,652 | 19,645 | 1994 | 5 | 12 | $184,306 |
84 | 1,571 | 7,837 | 1993 | 6 | 6 | $184,306 |
All of these properties follow the same purple path through the tree and end up receiving the same predicted value from Leaf 4. However, their characteristics aren’t actually very comparable. They have vastly different lot areas and ages. Any appraiser seeing these comps would probably laugh at you.
So what’s the issue? Why aren’t our properties in the same leaf node actually comparable? It’s because our model is too simple. It doesn’t yet have enough rules to distinguish a house built in 1959 from one built in 2008. We can solve this by adding more splits to our single tree, by adding more trees, or by doing both.
In practice, frameworks like LightGBM don’t use just one decision tree. Instead, they combine many decision trees together, either by taking the average or adding their results. This can create incredibly complex rulesets that are difficult for humans to follow, but which lead to much better predictions. Let’s extend our decision tree from before with an additional tree.
Tree 0 is the same as before, and the purple path shows the rules followed by our target property (ID = 2).
Tree 1 is a new tree. The target property still follows the purple path, but notice the new values at each split and leaf. These values are substantially less than the values from Tree 0. That’s because they are added to Tree 0’s results. Using LightGBM, the final predicted value for a given property is the sum of predicted values from all trees.
In the case of our target property, the final predicted value is $187,627, which comes from adding the values of Tree 0: Leaf 4 and Tree 1: Leaf 4 (the displayed leaf values are rounded, so they may not sum exactly to the final figure).
ID | Livable Area | Lot Area | Year Built | Condition | Neighborhood | Pred. Value (Tree 0) | Pred. Value (Tree 1) | Pred. Value (Final) |
---|---|---|---|---|---|---|---|---|
2 | 1,629 | 13,830 | 1997 | 4 | 6 | $184,306 | $3,322 | $187,627 |
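This additivity is easy to verify, since `predict()` can be restricted to the first N trees via `num_iteration`. The sketch below assumes `ames_model_2` is the two-tree booster and `ames_matrix` its feature matrix (illustrative names).

```r
# Prediction after Tree 0 only, then after Tree 0 + Tree 1
pred_tree_0 <- predict(ames_model_2, ames_matrix, num_iteration = 1)
pred_both   <- predict(ames_model_2, ames_matrix, num_iteration = 2)

# Tree 1's contribution for each property is whatever the second tree added
contrib_tree_1 <- pred_both - pred_tree_0

# For the target property (row 2), Tree 0's value plus Tree 1's contribution
# reproduces the final predicted value
all.equal(pred_both[2], pred_tree_0[2] + contrib_tree_1[2])
```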
What about our comparable sales from earlier? Like before, they all share the same first leaf node (in Tree 0), but now they start to differ in the second tree. The newest property (ID = 55) receives a higher final value and the property with the smallest livable area (ID = 84) receives a lower final value.
ID | Livable Area | Lot Area | Year Built | Condition | Neighborhood | Pred. Value (Tree 0) | Pred. Value (Tree 1) | Pred. Value (Final) |
---|---|---|---|---|---|---|---|---|
3 | 1,604 | 9,978 | 1998 | 5 | 6 | $184,306 | $3,322 | $187,627 |
14 | 1,960 | 7,851 | 2002 | 4 | 6 | $184,306 | $3,322 | $187,627 |
21 | 2,110 | 8,880 | 1994 | 4 | 9 | $184,306 | $3,322 | $187,627 |
48 | 1,675 | 15,263 | 1959 | 4 | 18 | $184,306 | $3,322 | $187,627 |
55 | 1,694 | 10,475 | 2008 | 4 | 1 | $184,306 | $10,207 | $194,512 |
60 | 1,978 | 10,389 | 2003 | 4 | 1 | $184,306 | $3,322 | $187,627 |
61 | 2,098 | 9,375 | 1997 | 4 | 1 | $184,306 | $3,322 | $187,627 |
62 | 1,661 | 12,137 | 1998 | 4 | 1 | $184,306 | $3,322 | $187,627 |
70 | 1,652 | 19,645 | 1994 | 5 | 12 | $184,306 | $3,322 | $187,627 |
84 | 1,571 | 7,837 | 1993 | 6 | 6 | $184,306 | -$1,044 | $183,262 |
Properties that match the target property in both trees will be more similar than ones that match in just one tree. In this case, the properties that landed in Leaf 4 of both Tree 0 and Tree 1 will be most comparable to our target (ID = 2).
However, as more trees are added, the number of properties that share all their leaf nodes with the target will shrink rapidly. Here’s what happens to our target and comparable sold properties if we add 10 trees to the model.
The green cells below show leaf nodes shared with the target property. Note that by the final tree, there are no comparable sales that share all the target’s leaf nodes.
ID | Tree 0 | Tree 1 | Tree 2 | Tree 3 | Tree 4 | Tree 5 | Tree 6 | Tree 7 | Tree 8 | Tree 9 |
---|---|---|---|---|---|---|---|---|---|---|
2 | $184,306 | $3,322 | $2,791 | $1,779 | $1,853 | $1,928 | $1,398 | $1,483 | -$450 | $605 |
ID | Tree 0 | Tree 1 | Tree 2 | Tree 3 | Tree 4 | Tree 5 | Tree 6 | Tree 7 | Tree 8 | Tree 9 |
---|---|---|---|---|---|---|---|---|---|---|
3 | $184,306 | $3,322 | $2,791 | $1,779 | $1,853 | $1,928 | -$1,645 | $1,483 | -$450 | $605 |
14 | $184,306 | $3,322 | $9,008 | $1,779 | $4,435 | $7,398 | -$1,645 | $1,483 | $1,928 | $1,288 |
21 | $184,306 | $3,322 | $9,008 | $1,779 | $4,435 | $7,398 | $2,164 | $6,202 | $1,928 | $605 |
48 | $184,306 | $3,322 | -$2,739 | $1,779 | $1,853 | -$2,057 | $2,164 | -$1,975 | $3,036 | $605 |
55 | $184,306 | $10,207 | $7,507 | $1,779 | $1,853 | $1,928 | $2,164 | $5,202 | $5,038 | $3,666 |
60 | $184,306 | $3,322 | $9,008 | $1,779 | $4,435 | $7,398 | $2,164 | $1,483 | $3,036 | $3,666 |
61 | $184,306 | $3,322 | $9,008 | $1,779 | $4,435 | $7,398 | $2,164 | $6,202 | $3,036 | $605 |
62 | $184,306 | $3,322 | $2,791 | $1,779 | $1,853 | $1,928 | $2,164 | $1,483 | $3,036 | $605 |
70 | $184,306 | $3,322 | $2,791 | $1,779 | $1,853 | $5,629 | $2,164 | $1,483 | $3,036 | $605 |
84 | $184,306 | -$1,044 | $2,791 | $1,779 | $1,853 | $1,928 | -$1,645 | $1,483 | -$450 | $605 |
Given that individual properties are unlikely to share all of their leaf nodes with a target property, we need a way to measure comparability that doesn’t rely on perfect matching.
Now that we’ve seen how LightGBM trees are structured, we can use them to create a similarity score. This score is simply the percentage of shared leaf nodes between a target and comparable property, weighted by the importance of each tree.
Let’s see it in action. Here is the same green table from above, but with predicted values replaced by a boolean (T) when the property shares the same leaf node as the target within each tree. For the target property itself, the table instead shows the index of the leaf node it lands in within each tree. The rightmost columns show the number and percentage of leaf nodes shared with the target.
ID | Tree 0 | Tree 1 | Tree 2 | Tree 3 | Tree 4 | Tree 5 | Tree 6 | Tree 7 | Tree 8 | Tree 9 | Number | Percent |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | 4 | 4 | 4 | 2 | 4 | 7 | 6 | 4 | 6 | 4 |
ID | Tree 0 | Tree 1 | Tree 2 | Tree 3 | Tree 4 | Tree 5 | Tree 6 | Tree 7 | Tree 8 | Tree 9 | Number | Percent |
---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | T | T | T | T | T | T | F | F | F | F | 6 / 10 | 60% |
14 | T | T | F | F | F | F | F | F | F | F | 2 / 10 | 20% |
21 | T | T | F | F | F | F | F | F | F | F | 2 / 10 | 20% |
48 | T | T | F | F | F | F | F | F | F | F | 2 / 10 | 20% |
55 | T | F | F | F | F | F | F | F | F | F | 1 / 10 | 10% |
60 | T | T | F | F | F | F | F | F | F | F | 2 / 10 | 20% |
61 | T | T | F | F | F | F | F | F | F | F | 2 / 10 | 20% |
62 | T | T | T | T | T | T | F | F | F | F | 6 / 10 | 60% |
70 | T | T | T | T | T | F | F | F | F | F | 5 / 10 | 50% |
84 | T | F | T | T | T | T | F | F | F | F | 6 / 10 | 60% |
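Constructing this matching table in R is straightforward: get the leaf index matrix for the full model, then compare every row to the target’s row. The sketch below assumes `ames_model_10` is the 10-tree booster and `ames_matrix` its feature matrix (illustrative names).

```r
# Leaf indices for every observation: one row per property, one column per tree
leaf_idx <- predict(ames_model_10, ames_matrix, type = "leaf")

target_row <- 2

# TRUE wherever a property lands in the same leaf as the target for that tree
match_matrix <- sweep(leaf_idx, 2, leaf_idx[target_row, ], FUN = "==")

# Unweighted similarity: the share of trees with a matching leaf node
pct_shared <- rowMeans(match_matrix)
```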
The observations with a high percentage of shared leaf nodes should be most comparable to the target property. However, this assumes that all trees in the model are weighted equally, which is rarely the case. In LightGBM, trees typically have diminishing importance, i.e. each successive tree has less impact on the overall error of the model. To find accurate comparables, we need to quantify each tree’s importance and use it to weight each leaf node match.
Here is a simple function to determine tree importance based on how much each tree contributes to the overall decrease in model error. We can apply it to our training data to get weights for each tree.
get_weights <- function(metric, model, train, outcome_col, num_trees) {
model$params$metric <- list(metric)
train_lgb <- lgb.Dataset(as.matrix(train), label = train[[outcome_col]])
trained_model <- lgb.train(
params = model$params,
data = train_lgb,
valids = list(test = train_lgb),
nrounds = num_trees
)
# Get the initial error for base model before first tree
# this NEEDS to be after the model is trained
# (or else it won't train correctly)
set_field(train_lgb, "init_score", as.matrix(train[[outcome_col]]))
initial_predictions <- get_field(train_lgb, "init_score")
init_score <- mean(initial_predictions)
# Index into the errors list, and un-list so it is a flat/1dim list
errors <- unlist(trained_model$record_evals$test[[metric]]$eval)
errors <- c(init_score, errors)
diff_in_errors <- diff(errors, 1, 1)
# Take proportion of diff in errors over total diff in
# errors from all trees
weights <- diff_in_errors / sum(diff_in_errors)
return(weights)
}
# Prepare data using a tidymodels recipe
ames_train_prep <- bake(prep(ames_recp), ames_train)
# Get the decrease in error caused by each successive tree
ames_tree_weights <- get_weights(
metric = "rmse",
model = ames_fit_eng_10,
train = ames_train_prep,
outcome_col = "sale_price",
num_trees = 10
)
| Tree 0 | Tree 1 | Tree 2 | Tree 3 | Tree 4 | Tree 5 | Tree 6 | Tree 7 | Tree 8 | Tree 9 |
---|---|---|---|---|---|---|---|---|---|---|
Weight | 71.2% | 4.7% | 4.2% | 3.8% | 3.4% | 3.1% | 2.8% | 2.5% | 2.2% | 2.0% |
These weights are then multiplied row-wise by the boolean matching matrix from above. This means that for each tree and comparable sale, the weight value is either kept (when matching / `TRUE`) or zeroed out (when not matching / `FALSE`). The final similarity score for our target property (ID = 2) is the sum of each comparable row.
ID | Tree 0 | Tree 1 | Tree 2 | Tree 3 | Tree 4 | Tree 5 | Tree 6 | Tree 7 | Tree 8 | Tree 9 | Sim. Score |
---|---|---|---|---|---|---|---|---|---|---|---|
3 | 71.2% | 4.7% | 4.2% | 3.8% | 3.4% | 3.1% | 0.0% | 0.0% | 0.0% | 0.0% | 90.46% |
14 | 71.2% | 4.7% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 75.88% |
21 | 71.2% | 4.7% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 75.88% |
48 | 71.2% | 4.7% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 75.88% |
55 | 71.2% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 71.18% |
60 | 71.2% | 4.7% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 75.88% |
61 | 71.2% | 4.7% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 75.88% |
62 | 71.2% | 4.7% | 4.2% | 3.8% | 3.4% | 3.1% | 0.0% | 0.0% | 0.0% | 0.0% | 90.46% |
70 | 71.2% | 4.7% | 4.2% | 3.8% | 3.4% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 87.37% |
84 | 71.2% | 0.0% | 4.2% | 3.8% | 3.4% | 3.1% | 0.0% | 0.0% | 0.0% | 0.0% | 85.76% |
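In R, the row-wise weighting and summing described above reduces to a single matrix multiplication. This sketch assumes `match_matrix` is the boolean matrix built earlier and `ames_tree_weights` is the output of `get_weights()`.

```r
# Weighted similarity score: matrix product of the TRUE/FALSE match matrix
# (coerced to 0/1) and the per-tree weights
sim_score <- as.vector(match_matrix %*% ames_tree_weights)

# Rank candidate comps by similarity, dropping the target property itself
comp_order <- setdiff(order(sim_score, decreasing = TRUE), target_row)
```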
Now we can simply sort by similarity score to get the observations most similar to our target property (ID = 2).
ID | Livable Area | Lot Area | Year Built | Condition | Neighborhood | Sale Price | Sim. Score |
---|---|---|---|---|---|---|---|
2 | 1,629 | 13,830 | 1997 | 4 | 6 | $189,900 |
ID | Livable Area | Lot Area | Year Built | Condition | Neighborhood | Sale Price | Sim. Score |
---|---|---|---|---|---|---|---|
3 | 1,604 | 9,978 | 1998 | 5 | 6 | $195,500 | 90.46% |
62 | 1,661 | 12,137 | 1998 | 4 | 1 | $224,900 | 90.46% |
70 | 1,652 | 19,645 | 1994 | 5 | 12 | $203,135 | 87.37% |
84 | 1,571 | 7,837 | 1993 | 6 | 6 | $178,000 | 85.76% |
14 | 1,960 | 7,851 | 2002 | 4 | 6 | $216,500 | 75.88% |
21 | 2,110 | 8,880 | 1994 | 4 | 9 | $205,000 | 75.88% |
48 | 1,675 | 15,263 | 1959 | 4 | 18 | $173,000 | 75.88% |
60 | 1,978 | 10,389 | 2003 | 4 | 1 | $318,000 | 75.88% |
61 | 2,098 | 9,375 | 1997 | 4 | 1 | $240,000 | 75.88% |
55 | 1,694 | 10,475 | 2008 | 4 | 1 | $245,350 | 71.18% |
Now we’re talking! Properties 3 and 62 are nearly identical to our target property, while 60 and 61 aren’t too similar. But what about property 55? It looks fairly similar to our target, but has the lowest similarity score.
Property 55 reveals another advantage of this approach to finding comparable sales: variables get implicitly weighted by their importance. Variables that aren’t predictive in the LightGBM model are less likely to appear in splits and are therefore less likely to determine whether two properties share a terminal leaf node. In the case of property 55, `Year Built` is an important predictor of value and appears in many splits, so property 55 receives a low similarity score due to the large difference in age compared to the target.
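One way to see this implicit weighting is to look at how often each variable is used for splitting. The sketch below uses `lgb.importance()`, which reports (among other measures) the share of splits that use each feature; `ames_model_10` is again an illustrative name.

```r
# Per-feature importance; the Frequency column is the proportion of splits
# using each feature, which drives how strongly it affects leaf membership
lgb.importance(ames_model_10, percentage = TRUE)
```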
This comp-finding approach does have some disadvantages. Mainly, it requires you to generate a boolean matching matrix for every target property against every possible comparable. This many-to-many relationship quickly blows up the compute required for comp finding, particularly for large or complex models. However, with some clever coding, this isn’t too hard to work around.
Overall, this approach is robust, relatively intuitive, and delivers good comparables that will finally let us answer the question, “What comparable sales did you use to value my property?”