skmiscpy.cbs

Functions

`compute_smd`(→ pandas.DataFrame)	Computes the standardized mean difference (SMD) for a list of variables.

Module Contents

skmiscpy.cbs.compute_smd(data: pandas.DataFrame, vars: list[str], group: str, wt_var: str = None, cat_vars: list[str] = None, std_binary: bool = False, estimand: str = 'ATE') → pandas.DataFrame

Computes the standardized mean difference (SMD) for a list of variables.

Parameters:

data : pd.DataFrame: A pandas DataFrame containing the columns specified in vars, group, and optionally wt_var.
vars : List[str]: A list of names of the variables for which to calculate the SMD. Each variable should be either continuous or binary. The values of a binary variable may be strings or numbers; if they are not already 0 and 1, they are converted so that the lower value becomes 0 and the higher value becomes 1. To compute the SMD for a discrete variable with more than two categories, pass that variable's name in a list to the cat_vars parameter.
group : str: The name of the binary group column on which the mean differences are calculated.
wt_var : str, optional: The name of the column containing weights. Defaults to None.
cat_vars : List[str], optional: A list of names of the categorical (i.e., discrete) variables among those specified in the vars parameter. Defaults to None.
std_binary : bool, optional: Whether to standardize the mean differences for binary variables (i.e., differences in proportions). Defaults to False. See Notes.
estimand : str, optional: The estimand type. Currently, only "ATE" (Average Treatment Effect) is supported. Defaults to "ATE".
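The 0/1 conversion described for binary variables in vars can be pictured with a small sketch. This is only an illustration of the stated rule (lower value to 0, higher value to 1), not the library's actual implementation; the helper name encode_binary is hypothetical.

```python
def encode_binary(values):
    """Map a two-valued sequence to 0/1: lower value -> 0, higher value -> 1.

    Hypothetical helper for illustration; not part of skmiscpy.
    """
    levels = sorted(set(values))
    if len(levels) != 2:
        raise ValueError(f"expected exactly 2 distinct values, got {len(levels)}")
    mapping = {levels[0]: 0, levels[1]: 1}
    return [mapping[v] for v in values]


# Strings sort lexicographically, so 'female' -> 0 and 'male' -> 1.
encoded_gender = encode_binary(['male', 'female', 'female', 'male'])

# Numeric codes other than 0/1 follow the same rule: 1 -> 0 and 2 -> 1.
encoded_codes = encode_binary([2, 1, 1, 2])
```

Under this rule, which level ends up as 1 depends only on sort order, so it may be worth checking the direction of a binary variable before reading the sign of its mean difference.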

Returns:

A DataFrame with columns:

variables: The name of the variable.
var_types: The type of the variable (Continuous or Binary).
unadjusted_smd: The standardized mean difference without adjustment.
adjusted_smd: The standardized mean difference with adjustment (if wt_var is provided).

Return type:

pd.DataFrame

Notes

The mean differences for continuous variables are standardized so that they are on the same scale and can be compared across variables. Standardization also allows for a simple interpretation even when the details of a variable's original scale are unclear to the analyst.
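To make the scale argument concrete, here is a minimal sketch of an unweighted SMD for a continuous variable. It assumes one common convention, dividing the mean difference by the square root of the average of the two group variances; the formula used inside compute_smd is not stated on this page and may differ. The function name smd_continuous and the data are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def smd_continuous(treated, control):
    """Mean difference divided by a pooled standard deviation.

    Assumed convention: pooled SD = sqrt((var_T + var_C) / 2),
    using sample variances. Illustrative only.
    """
    pooled_sd = sqrt((variance(treated) + variance(control)) / 2)
    return (mean(treated) - mean(control)) / pooled_sd


# Made-up ages in years for two small groups.
ages_treated = [30, 40, 50]
ages_control = [20, 30, 40]
smd_years = smd_continuous(ages_treated, ages_control)

# The same ages expressed in months: the raw mean difference grows
# twelvefold, but the standardized difference is unchanged.
smd_months = smd_continuous(
    [a * 12 for a in ages_treated],
    [a * 12 for a in ages_control],
)
```

Because the unit of measurement cancels out, the same threshold can be applied to variables recorded in years, kilograms, or dollars.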

None of these advantages carry over to binary variables, because binary variables are already on the same scale (i.e., a proportion) and that scale is easily interpretable. In addition, standardizing the proportion difference of a binary variable means dividing it by a standard deviation, and the standard deviation of a binary variable is itself a function of its proportion. This can yield a counterintuitive result: if P_T = 0.2 and P_C = 0.3, the standardized difference in proportions differs from the one obtained when P_T = 0.5 and P_C = 0.6, even though the balance statistic would be expected to be the same in both scenarios because both yield the same degree of bias in the effect estimate. If you still want the standardized mean difference for binary variables, use std_binary = True in compute_smd().
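The two scenarios in the note above can be checked numerically. The sketch below assumes the pooled standard deviation sqrt((P_T(1 - P_T) + P_C(1 - P_C)) / 2), a common convention that may not match what compute_smd does when std_binary = True; the function name std_prop_diff is hypothetical.

```python
from math import sqrt


def std_prop_diff(p_t, p_c):
    """Standardized difference in proportions under an assumed pooled SD.

    The variance of a binary variable with proportion p is p * (1 - p),
    so the denominator changes with the proportions themselves.
    """
    pooled_sd = sqrt((p_t * (1 - p_t) + p_c * (1 - p_c)) / 2)
    return (p_t - p_c) / pooled_sd


# Both scenarios have the same raw difference in proportions (-0.1).
low = std_prop_diff(0.2, 0.3)   # approximately -0.2325
mid = std_prop_diff(0.5, 0.6)   # approximately -0.2020
```

The raw difference is identical in both cases, yet the standardized values differ by about 0.03, which is the inconsistency the note describes.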

Examples

>>> import pandas as pd
>>> from skmiscpy import compute_smd
>>> import numpy as np

>>> sample_df = pd.DataFrame({
...     'age': np.random.randint(18, 66, size=100),
...     'weight': np.round(np.random.uniform(120, 200, size=100), 1),
...     'gender': np.random.choice(['male', 'female'], size=100),
...     'race': np.random.choice(
...         ['white', 'black', 'hispanic'],
...         size=100, p=[0.4, 0.3, 0.3]
...     ),
...     'educ_level': np.random.choice(
...         ['bachelor', 'master', 'doctorate'],
...         size=100, p=[0.3, 0.4, 0.3]
...     ),
...     'ps_wts': np.round(np.random.uniform(0.1, 1.0, size=100), 2),
...     'group': np.random.choice(['treated', 'control'], size=100),
...     'date': pd.date_range(start='2024-01-01', periods=100, freq='D')
... })

Basic usage with unadjusted SMD only:

>>> compute_smd(sample_df, vars=['age', 'weight', 'gender'], group='group')
# Returns a DataFrame with unadjusted SMD values for 'age', 'weight', and 'gender'.

Including weights for adjusted SMD:

>>> compute_smd(sample_df, vars=['age', 'weight', 'gender'], group='group', wt_var='ps_wts')
# Returns a DataFrame with both unadjusted and adjusted SMD values for 'age', 'weight', and 'gender'.

Including categorical variables for adjusted SMD:

>>> compute_smd(
...     sample_df,
...     vars=['age', 'weight', 'gender'],
...     group='group',
...     wt_var='ps_wts',
...     cat_vars=['race', 'educ_level']
... )
# Returns a DataFrame with unadjusted and adjusted SMD values for 'age', 'weight', 'gender', 'race', and 'educ_level'.