Replace numerically coded variables with human-readable values

ccao.vars_recode(data: DataFrame, cols: List[str] | None = None, code_type: str = 'long', as_factor: bool = True, dictionary: DataFrame | None = None) -> DataFrame

Replace numerically coded variables with human-readable values.

The system of record stores characteristic values in a numerically encoded format. This function translates those values into a human-readable format. For example, EXT_WALL = 2 will become EXT_WALL = “Masonry”. The values and their translations are looked up in a dictionary: by default vars_dict, or a user-defined dictionary passed via the dictionary argument.

Options for code_type are:

  • "long", which transforms EXT_WALL = 1 to EXT_WALL = Frame

  • "short", which transforms EXT_WALL = 1 to EXT_WALL = FRME

  • "code", which keeps the original values (useful for removing improperly coded values; see the note below)
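The three code types can be pictured as three columns of a lookup table. The sketch below is a minimal pandas illustration, not the library's implementation: the table layout, the recode_column helper, and the short label for code 2 are assumptions made for this example, and the real vars_dict has its own structure.

```python
import pandas as pd

# Hypothetical lookup for a single variable (EXT_WALL).
# Only 1 -> Frame / FRME and 2 -> Masonry come from the text above;
# "MASR" is a placeholder short label chosen for this sketch.
ext_wall_labels = pd.DataFrame(
    {
        "code": [1, 2],
        "long": ["Frame", "Masonry"],
        "short": ["FRME", "MASR"],
    }
)


def recode_column(values: pd.Series, labels: pd.DataFrame, code_type: str) -> pd.Series:
    """Map numeric codes to the label column named by code_type."""
    if code_type not in ("long", "short", "code"):
        raise ValueError(f"Unknown code_type: {code_type!r}")
    # Build a code -> label mapping; for "code" this maps each known code to itself
    mapping = dict(zip(labels["code"], labels[code_type]))
    return values.map(mapping)


ext_wall = pd.Series([1, 2, 1], name="EXT_WALL")

long_values = recode_column(ext_wall, ext_wall_labels, "long")
short_values = recode_column(ext_wall, ext_wall_labels, "short")
code_values = recode_column(ext_wall, ext_wall_labels, "code")
```

In this sketch the "code" type leaves known codes unchanged, which mirrors the description above: the values stay numeric, but anything absent from the lookup would not survive the mapping.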

Parameters:
  • data (pandas.DataFrame) – A pandas DataFrame with columns to have values replaced.

  • cols (list[str] | None) – A list of column names to be transformed, or None to select all columns.

  • code_type (str) – The recoding type. See description above for options.

  • as_factor (bool) – If True, re-encoded values will be returned as categorical variables (pandas Categorical). If False, re-encoded values will be returned as plain strings.

  • dictionary (pandas.DataFrame | None) – A pandas DataFrame representing the dictionary used to translate encodings. If None, the default dictionary vars_dict is used.

Raises:

ValueError – If the dictionary is missing required columns or if invalid input is provided.

Returns:

The input DataFrame with re-encoded values for the specified columns.

Return type:

pandas.DataFrame

Note

Values which are in the data but are NOT in the dictionary will be converted to NaN.
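Because unmatched values become NaN, it can help to check for them before or after recoding. The following is a minimal sketch in plain pandas, assuming a hypothetical mapping for EXT_WALL in which only codes 1 and 2 are known; it illustrates the behavior described in the note and does not call vars_recode itself.

```python
import pandas as pd

# Hypothetical mapping for EXT_WALL; only codes 1 and 2 are known here
known_codes = {1: "Frame", 2: "Masonry"}

# 99 stands in for an improperly coded value
ext_wall = pd.Series([1, 2, 99], name="EXT_WALL")

# Series.map leaves any value without a mapping entry as NaN
recoded = ext_wall.map(known_codes)

# Original values that were present but had no translation
unmatched = ext_wall[recoded.isna() & ext_wall.notna()]
```

Inspecting unmatched before overwriting the original column makes it easier to tell genuinely missing data apart from codes that the dictionary does not cover.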

Example:

import ccao

sample_data = ccao.sample_athena

# Defaults to `long` code type
ccao.vars_recode(data=sample_data)

# Recode to `short` code type
ccao.vars_recode(data=sample_data, code_type="short")

# Recode only specified columns
ccao.vars_recode(data=sample_data, cols=["GAR1_SIZE"])
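The as_factor parameter controls whether recoded columns come back as pandas categoricals or plain strings. The sketch below shows the difference between those two representations using plain pandas on a hand-written column; it is an illustration of the dtypes involved, not of how vars_recode builds them.

```python
import pandas as pd

# Hand-written labels standing in for an already recoded column
labels = pd.Series(["Frame", "Masonry", "Frame"], name="EXT_WALL")

# Roughly the shape of the result with as_factor=True: a categorical column
as_categorical = labels.astype("category")

# Roughly the shape of the result with as_factor=False: plain strings
as_strings = labels.astype(str)

# A categorical column exposes its distinct levels directly
levels = list(as_categorical.cat.categories)
```

Categoricals are convenient when the set of levels matters (for example, grouping or modeling), while plain strings are simpler to concatenate or export.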