R/chars_funs.R
chars_sparsify.Rd
The ADDCHARS SQL table includes individual rows listing the PIN, start date, and characteristic updates associated with a 288 Home Improvement Exemption. This data format is difficult to work with and complicated by the fact that multiple 288s can be active at the same time for different periods, and some columns from ADDCHARS add to existing characteristics, while some overwrite existing characteristics. Additionally, 288s can be active for multi-code properties, meaning that one building on a PIN could have an improvement while the others remain untouched.
The goal of this function is to transform these single row records into a sparse data frame which lists a row and characteristic update per PIN per YEAR.
The transformation caused by this function is most easily visualized with a mock dataset.
The base ADDCHARS data looks like this:
qu_pin | year | qu_town | qu_upload_date | qu_sqft_bld | qu_rooms |
12345 | 2013 | 77 | 130702 | 200 | 0 |
12345 | 2015 | 77 | 150703 | 300 | 1 |
This function will transform it into this:
qu_pin | year | qu_sqft_bld | qu_rooms |
12345 | 2013 | 200 | 0 |
12345 | 2014 | 200 | 0 |
12345 | 2015 | 500 | 1 |
12345 | 2016 | 500 | 1 |
12345 | 2017 | 500 | 1 |
12345 | 2018 | 300 | 1 |
12345 | 2019 | 300 | 1 |
12345 | 2020 | 300 | 1 |
Each PIN will have a row for each year that a 288 is active and a column
for each characteristic specified by additive_source
and
replacement_source
. This table can be easily left joined to modeling
data to update characteristic values.
chars_sparsify(
data,
pin_col,
year_col,
town_col,
upload_date_col,
additive_source,
replacement_source
)
A data frame containing ADDCHARS columns.
A column name specifying the PIN column.
A column name specifying the year column.
A column name specifying the town column. This is used to
determine the length of the 288, given the starting date specified
by year_col
.
A column name specifying the upload date of the 288, this is use to determine which replacement characteristics take precedence if multiple 288s were filed in the same year (the latest one is chosen).
A tidyselect selection of columns which contains
additive characteristic values. These are values such as square feet which
get ADDED to existing characteristics. For example, if qu_sqft_bld is equal
to 100, then 100 sqft get added to the existing square footage of the
property. The easiest way to specify all additive columns is to use the
built-in crosswalk, any_of(ccao::chars_cols$add$source)
.
A tidyselect selection of columns which contains
replacement characteristic values. These are values such as number of rooms
which get OVERWRITE existing characteristics. For example, if QU_ROOMS is
equal to 3, then the property will be update to have 3 total rooms.
The easiest way to specify all replacement columns is to use the
built-in crosswalk, any_of(ccao::chars_cols$replace$source)
.
A sparsified data frame of characteristic updates per PIN per year. See example.
Use dplyr/tidyselect syntax for specifying column names. For example,
use qu_sqft_bld
instead of "qu_sqft_bld"
, or use a tidyselect
function such as starts_with("qu_")
.
Other chars_funs:
chars_288_active()
,
chars_fix_age()
,
chars_update()
# Load the included example dataset
data("chars_sample_hie")
chars_sparsify(
chars_sample_hie[1:3, ],
pin_col = pin,
year_col = year,
town_col = as.character(qu_town),
upload_date_col = qu_upload_date,
additive_source = any_of(chars_cols$add$source),
replacement_source = any_of(chars_cols$replace$source)
)
#> # A tibble: 11 × 31
#> pin year qu_rooms qu_beds qu_full_bath qu_half_bath qu_sqft_bld
#> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 10254170360000 2017 1 1 1 0 483
#> 2 10254170360000 2018 1 1 1 0 483
#> 3 10254170360000 2019 1 1 1 0 483
#> 4 10254170360000 2020 1 1 1 0 483
#> 5 13253230040000 2017 4 2 2 0 1170
#> 6 13253230040000 2018 4 2 2 0 1170
#> 7 13253230040000 2019 4 2 2 0 1170
#> 8 13253230040000 2020 4 2 2 0 1170
#> 9 13253230040000 2021 0 0 0 0 0
#> 10 13253230040000 2022 0 0 0 0 0
#> 11 13253230040000 2023 0 0 0 0 0
#> # ℹ 24 more variables: qu_lnd_sqft <dbl>, qu_no_com_unit <dbl>,
#> # qu_type_of_res <chr>, qu_use <chr>, qu_exterior_wall <chr>, qu_roof <chr>,
#> # qu_basement_type <chr>, qu_basement_finish <chr>, qu_heat <chr>,
#> # qu_air <chr>, qu_fire_place <dbl>, qu_attic_type <chr>,
#> # qu_attic_finish <chr>, qu_type_plan <chr>, qu_type_design <chr>,
#> # qu_construct_quality <chr>, qu_garage_const <chr>,
#> # qu_garage_attached <chr>, qu_garage_area <chr>, qu_porch <chr>, …