skmiscpy¶

Attributes¶

`__version__`

Functions¶

`here`(→ str)	Construct an absolute path relative to the project root directory.
`compute_smd`(→ pandas.DataFrame)	Computes the standardized mean difference (SMD) for a list of variables.
`plot_smd`(→ None)	Plots the standardized mean difference (SMD) for variables as a point plot (also known as a love plot).
`plot_mirror_histogram`(→ None)	Plots a mirror histogram of a variable by another grouping binary variable.

Package Contents¶

skmiscpy.__version__¶

skmiscpy.here(path: str) → str[source]¶

Construct an absolute path relative to the project root directory. Requires an activated virtual environment to determine the project root.

Parameters:¶

path : str¶: A relative path to be resolved from the project root.

Returns:¶

The absolute path constructed from the project root directory.

Return type:¶

str

Raises:¶

OSError – If the script is not running inside an activated virtual environment, or if the VIRTUAL_ENV environment variable is not set, empty, or points to a non-existent directory.
TypeError – If the path parameter is not a string.
ValueError – If the path parameter is empty or is an absolute path.

Examples

Constructing a path to a file in the project:

>>> from skmiscpy import here
>>> here("data/input.csv")
# If the project root is `/home/user/my_project` (the directory that holds the
# virtual environment directory), this returns an absolute path like
# `/home/user/my_project/data/input.csv`.

Constructing a path to a subdirectory:

>>> here("src/my_module")
# If the project root is `/home/user/my_project`, this will return an absolute path
# like `/home/user/my_project/src/my_module`.

Handling errors with an empty path:

>>> here("")
# Raises ValueError: The `path` parameter cannot be an empty string.

Handling errors with an absolute path:

>>> here("/absolute/path/to/file")
# Raises ValueError: The `path` parameter must be relative, not absolute.

Handling errors with an invalid path type:

>>> here(123)
# Raises TypeError: The `path` parameter must be a string.
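
The documented behavior of `here` can be approximated with the standard library alone. The following is a minimal sketch, not the package's actual implementation: `here_sketch` is a hypothetical name, and it assumes that the project root is the parent directory of the directory named by the VIRTUAL_ENV environment variable. The error messages for the path checks are copied from the examples above; the OSError message is made up for illustration.

```python
import os
import tempfile
from pathlib import Path


def here_sketch(path: str) -> str:
    """Hypothetical stand-in for `here`, mirroring the documented checks."""
    if not isinstance(path, str):
        raise TypeError("The `path` parameter must be a string.")
    if path == "":
        raise ValueError("The `path` parameter cannot be an empty string.")
    if os.path.isabs(path):
        raise ValueError("The `path` parameter must be relative, not absolute.")

    venv = os.environ.get("VIRTUAL_ENV")
    if not venv or not os.path.isdir(venv):
        # Illustrative message only; the real wording is not documented here.
        raise OSError("No activated virtual environment was found.")

    # Assumption: the virtual environment lives directly inside the project root.
    project_root = Path(venv).resolve().parent
    return str(project_root / path)


# Demonstration with a throwaway project directory that contains a fake venv.
previous = os.environ.get("VIRTUAL_ENV")
with tempfile.TemporaryDirectory() as project_root:
    venv_dir = Path(project_root) / ".venv"
    venv_dir.mkdir()
    os.environ["VIRTUAL_ENV"] = str(venv_dir)
    try:
        resolved = here_sketch("data/input.csv")
        expected = str(Path(project_root).resolve() / "data" / "input.csv")
    finally:
        # Restore the caller's environment whatever happened above.
        if previous is None:
            os.environ.pop("VIRTUAL_ENV", None)
        else:
            os.environ["VIRTUAL_ENV"] = previous

print(resolved == expected)
```

Resolving against the environment variable rather than the current working directory is what would make such a helper independent of where the script is launched from.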

skmiscpy.compute_smd(data: pandas.DataFrame, vars: list[str], group: str, wt_var: str = None, cat_vars: list[str] = None, std_binary: bool = False, estimand: str = 'ATE') → pandas.DataFrame[source]¶

Computes the standardized mean difference (SMD) for a list of variables.

Parameters:¶

data : pd.DataFrame¶: A pandas DataFrame containing the columns specified in vars, group, and optionally wt_var.
vars : List[str]¶: A list of variable names for which to calculate the SMD. Each variable should be either continuous or binary. Binary values may be strings or numbers; if they are not already coded as 0 and 1, the lower value is converted to 0 and the higher value to 1. To compute the SMD for a discrete variable with more than two categories, also pass its name in a list to the cat_vars parameter.
group : str¶: The name of the binary group column based on which the mean differences will be calculated.
wt_var : str, optional¶: The name of the column containing weights. Defaults to None.
cat_vars : List[str], optional¶: A list of strings representing the categorical (i.e. discrete) variables among the list specified in the vars parameter.
std_binary : bool, optional¶: Whether to standardize the mean differences for binary variables (i.e., differences in proportions). Defaults to False. See Notes.
estimand : str, optional¶: The estimand type. Currently, only "ATE" (Average Treatment Effect) is supported. Defaults to "ATE".
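
The 0/1 conversion described for binary variables in vars can be illustrated with plain pandas. This is a sketch under assumptions, not the package's code: `encode_binary` is a hypothetical helper, and "lower" and "higher" are taken to mean the ordinary sort order of the two distinct values (alphabetical for strings).

```python
import pandas as pd


def encode_binary(series: pd.Series) -> pd.Series:
    """Map the lower of two distinct values to 0 and the higher to 1."""
    levels = sorted(series.dropna().unique())
    if len(levels) != 2:
        raise ValueError("Expected exactly two distinct non-missing values.")
    return series.map({levels[0]: 0, levels[1]: 1})


gender = pd.Series(["male", "female", "male", "female"])
encoded = encode_binary(gender)
# 'female' sorts before 'male', so it becomes 0 and 'male' becomes 1.
print(encoded.tolist())
```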

Returns:¶

A DataFrame with columns:

variables: The name of the variable.
var_types: The type of the variable (Continuous or Binary).
unadjusted_smd: The standardized mean difference without adjustment.
adjusted_smd: The standardized mean difference with adjustment (if wt_var is provided).

Return type:¶

pd.DataFrame

Notes

Mean differences for continuous variables are standardized so that they are on the same scale and can be compared across variables. Standardization also allows a simple interpretation even when the details of a variable's original scale are unclear to the analyst.

None of these advantages carry over to binary variables, because binary variables are already on the same scale (a proportion) and that scale is easy to interpret. In addition, standardizing the proportion difference of a binary variable involves dividing it by a standard deviation, and the standard deviation of a binary variable is itself a function of its proportion. This can yield a counterintuitive result: with P_T = 0.2 and P_C = 0.3, the standardized difference in proportions differs from the one with P_T = 0.5 and P_C = 0.6, even though both scenarios would be expected to give the same balance statistic because both yield the same degree of bias in the effect estimate. If you still want the standardized mean difference for binary variables, set std_binary=True in compute_smd().
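
The two scenarios above can be checked numerically. The sketch below assumes one common choice of denominator, the square root of the average of the two group variances p(1 - p); the denominator that compute_smd actually uses is not stated on this page, so treat `std_prop_diff` as a hypothetical illustration only.

```python
import math


def std_prop_diff(p_t: float, p_c: float) -> float:
    """Difference in proportions divided by an assumed pooled standard deviation."""
    pooled_sd = math.sqrt((p_t * (1 - p_t) + p_c * (1 - p_c)) / 2)
    return (p_t - p_c) / pooled_sd


# Both scenarios have the same raw difference in proportions.
raw_a = 0.2 - 0.3
raw_b = 0.5 - 0.6

# The standardized versions differ because the variance depends on the proportion.
std_a = std_prop_diff(0.2, 0.3)
std_b = std_prop_diff(0.5, 0.6)

print(f"raw: {raw_a:.3f} vs {raw_b:.3f}")
print(f"standardized: {std_a:.3f} vs {std_b:.3f}")
```

Under this assumption the first scenario produces the larger standardized difference in absolute value, because proportions near 0.5 have the largest variance and therefore the largest denominator.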

Examples

>>> import pandas as pd
>>> from skmiscpy import compute_smd
>>> import numpy as np

>>> sample_df = pd.DataFrame({
...     'age': np.random.randint(18, 66, size=100),
...     'weight': np.round(np.random.uniform(120, 200, size=100), 1),
...     'gender': np.random.choice(['male', 'female'], size=100),
...     'race': np.random.choice(
...         ['white', 'black', 'hispanic'],
...         size=100, p=[0.4, 0.3, 0.3]
...     ),
...     'educ_level': np.random.choice(
...         ['bachelor', 'master', 'doctorate'],
...         size=100, p=[0.3, 0.4, 0.3]
...     ),
...     'ps_wts': np.round(np.random.uniform(0.1, 1.0, size=100), 2),
...     'group': np.random.choice(['treated', 'control'], size=100),
...     'date': pd.date_range(start='2024-01-01', periods=100, freq='D')
... })

Basic usage with unadjusted SMD only:

>>> compute_smd(sample_df, vars=['age', 'weight', 'gender'], group='group')
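# Returns a DataFrame with unadjusted SMD values for 'age', 'weight', and 'gender'.
# For the binary 'gender', the value is the unstandardized difference in proportions (std_binary=False).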

Including weights for adjusted SMD:

>>> compute_smd(sample_df, vars=['age', 'weight', 'gender'], group='group', wt_var='ps_wts')
# Returns a DataFrame with both unadjusted and adjusted SMD values for 'age', 'weight', and 'gender'.

Including categorical variables with more than two categories:

>>> compute_smd(
...     sample_df,
...     vars=['age', 'weight', 'gender', 'race', 'educ_level'],
...     group='group',
...     wt_var='ps_wts',
...     cat_vars=['race', 'educ_level']
... )
# Returns a DataFrame with unadjusted and adjusted SMD values for 'age', 'weight', 'gender', 'race', and 'educ_level'.
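
To make the unadjusted and adjusted columns concrete, the SMD for a single continuous variable can be computed by hand. This is a sketch under stated assumptions, not the package's implementation: `smd_continuous` is a hypothetical helper, the denominator is assumed to be the pooled standard deviation of the unweighted groups, and the adjusted version is assumed to replace the group means with weighted means.

```python
import numpy as np
import pandas as pd


def smd_continuous(df, var, group, treated_label, wt_var=None):
    """Mean difference (treated minus control) over an assumed pooled SD."""
    is_treated = df[group] == treated_label
    x_t = df.loc[is_treated, var].to_numpy(dtype=float)
    x_c = df.loc[~is_treated, var].to_numpy(dtype=float)

    # Assumption: the denominator always comes from the unweighted samples.
    pooled_sd = np.sqrt((x_t.var(ddof=1) + x_c.var(ddof=1)) / 2)

    if wt_var is None:
        diff = x_t.mean() - x_c.mean()
    else:
        w_t = df.loc[is_treated, wt_var].to_numpy(dtype=float)
        w_c = df.loc[~is_treated, wt_var].to_numpy(dtype=float)
        diff = np.average(x_t, weights=w_t) - np.average(x_c, weights=w_c)
    return diff / pooled_sd


toy = pd.DataFrame({
    "age": [1.0, 2.0, 3.0, 2.0, 3.0, 4.0],
    "group": ["treated"] * 3 + ["control"] * 3,
    "ps_wts": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
})

unadjusted = smd_continuous(toy, "age", "group", "treated")
adjusted = smd_continuous(toy, "age", "group", "treated", wt_var="ps_wts")
print(unadjusted, adjusted)
```

With equal weights the adjusted and unadjusted values coincide, which is a quick sanity check for any weighting scheme.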

skmiscpy.plot_smd(data: pandas.DataFrame, add_ref_line: bool = False, ref_line_value: int | float = 0.1, *args, **kwargs) → None[source]¶

Plots the standardized mean difference (SMD) for variables as a point plot (also known as a love plot), displaying unadjusted (and adjusted, if provided) SMDs. Optionally includes a vertical reference line.

Parameters:¶

data : pd.DataFrame¶: A pandas DataFrame with at least two columns, variables and unadjusted_smd, containing the variable names and their unadjusted SMD values. To include adjusted SMDs in the plot, the DataFrame must also contain an adjusted_smd column. The columns must use exactly these names.
add_ref_line : bool, optional¶: Whether to add a vertical reference line. Defaults to False.
ref_line_value : int or float, optional¶: The value at which to draw the vertical reference line. Defaults to 0.1. Must be between 0 and 1.
*args: Additional positional arguments passed to Seaborn’s pointplot.
**kwargs: Additional keyword arguments passed to Seaborn’s pointplot.

Raises:¶

ValueError – If ref_line_value is not between 0 and 1, or if the input DataFrame is empty.
TypeError – If data is not a pandas DataFrame, if add_ref_line is not a boolean, or if ref_line_value is not an integer or float.

Examples

Basic usage with only unadjusted SMD:

>>> import pandas as pd
>>> from skmiscpy import plot_smd

>>> data = pd.DataFrame({
...     'variables': ['var1', 'var2', 'var3'],
...     'unadjusted_smd': [0.2, 0.5, 0.3]
... })

>>> plot_smd(data)
# This will plot the unadjusted SMD values with default settings.

Including adjusted SMD with a reference line:

>>> data = pd.DataFrame({
...     'variables': ['var1', 'var2', 'var3'],
...     'unadjusted_smd': [0.2, 0.5, 0.3],
...     'adjusted_smd': [0.1, 0.4, 0.2]
... })

>>> plot_smd(data, add_ref_line=True, ref_line_value=0.3)
# This will plot both unadjusted and adjusted SMD values, with a vertical reference line at 0.3.

Customizing the plot appearance:

>>> data = pd.DataFrame({
...     'variables': ['var1', 'var2', 'var3'],
...     'unadjusted_smd': [0.2, 0.5, 0.3],
...     'adjusted_smd': [0.1, 0.4, 0.2]
... })

>>> plot_smd(
...     data,
...     add_ref_line=True,
...     ref_line_value=0.2,
...     palette='husl',
...     markers=['o', 'D'],
...     linestyle='--'
... )
# This will plot the SMD values with a custom color palette, markers, and line style.
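
A point plot with one marker per SMD type usually needs the data in long format, with one row per variable and SMD type. How plot_smd reshapes its input internally is not documented here, so the following is only a sketch of that idea using pandas `melt`; the column names `smd_type` and `smd` are made up for illustration.

```python
import pandas as pd

smd_wide = pd.DataFrame({
    "variables": ["var1", "var2", "var3"],
    "unadjusted_smd": [0.2, 0.5, 0.3],
    "adjusted_smd": [0.1, 0.4, 0.2],
})

# One row per (variable, SMD type) pair.
smd_long = smd_wide.melt(
    id_vars="variables",
    value_vars=["unadjusted_smd", "adjusted_smd"],
    var_name="smd_type",
    value_name="smd",
)

# Rows whose absolute SMD exceeds the default reference value of 0.1.
above_ref = smd_long[smd_long["smd"].abs() > 0.1]

print(smd_long)
print(len(above_ref), "of", len(smd_long), "values exceed 0.1")
```

A frame shaped like `smd_long` could then be drawn with Seaborn's pointplot by mapping `smd` to x, `variables` to y, and `smd_type` to hue.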

skmiscpy.plot_mirror_histogram(data: pandas.DataFrame, var: str, group: str, bins: int = 50, weights: str | None = None, xlabel: str | None = None, ylabel: str | None = None, title: str | None = None) → None[source]¶

Plots a mirror histogram of a variable by another grouping binary variable.

Parameters:¶

data : pd.DataFrame¶: A pandas DataFrame containing the var and group columns.
var : str¶: Name of the column for which the histogram needs to be drawn.
group : str¶: Name of the binary column based on which the histogram will be mirrored.
bins : int, optional¶: Number of bins for the histograms. Default is 50.
weights : str, optional¶: Name of the column based on which the histogram will be weighted. Default is None.
xlabel : str, optional¶: Label for the x-axis. If not provided, defaults to the name of the var column.
ylabel : str, optional¶: Label for the y-axis. If not provided, defaults to “Frequency”.
title : str, optional¶: Title of the plot. If not provided, defaults to “Mirror Histogram of var by group”.

Raises:¶

TypeError – If data is not a pandas DataFrame. If var or group is not of type str, or if weights, xlabel, ylabel, or title is provided and is not of type str. If the var column is not numerical. If the weights column is not numerical.
ValueError – If the bins parameter is not a positive integer. If the data DataFrame is empty. If the group column does not contain exactly two unique, non-NaN values.

Examples

Example 1: Basic usage with numerical data.

>>> import pandas as pd
>>> from skmiscpy import plot_mirror_histogram

>>> data = pd.DataFrame({
...     'group': [1, 1, 0, 0, 1, 0],
...     'var': [2.0, 3.5, 3.0, 2.2, 2.2, 3.3]
... })
>>> plot_mirror_histogram(data=data, var='var', group='group')

Example 2: With weights and custom labels.

>>> data = pd.DataFrame({
...     'group': [1, 1, 0, 0, 1, 0],
...     'var': [2.0, 3.5, 3.0, 2.2, 2.2, 3.3],
...     'weights': [1.0, 1.5, 2.0, 1.2, 1.1, 0.8]
... })
>>> plot_mirror_histogram(
...     data=data, var='var', group='group', weights='weights',
...     xlabel='Variable', ylabel='Count', title='Weighted Mirror Histogram'
... )
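
The idea behind a mirror histogram can be sketched with NumPy alone: bin both groups on shared edges, then negate one group's counts so that its bars point downward. This is an illustration under assumptions, not the package's implementation; `mirror_counts` is a hypothetical helper, and which group is drawn above the axis is an arbitrary choice here.

```python
import numpy as np
import pandas as pd


def mirror_counts(data, var, group, bins=50, weights=None):
    """Return shared bin edges plus upward and downward (negated) bar heights."""
    levels = sorted(data[group].dropna().unique())
    if len(levels) != 2:
        raise ValueError("The group column must have exactly two unique values.")

    # Shared edges keep the two halves aligned bin by bin.
    edges = np.histogram_bin_edges(data[var].to_numpy(dtype=float), bins=bins)

    halves = []
    for level in levels:
        subset = data[data[group] == level]
        w = None if weights is None else subset[weights].to_numpy(dtype=float)
        counts, _ = np.histogram(
            subset[var].to_numpy(dtype=float), bins=edges, weights=w
        )
        halves.append(counts)

    lower_level_counts, higher_level_counts = halves
    # Assumption: the higher group value goes on top, the lower one is mirrored.
    return edges, higher_level_counts, -lower_level_counts


data = pd.DataFrame({
    "group": [1, 1, 0, 0, 1, 0],
    "var": [2.0, 3.5, 3.0, 2.2, 2.2, 3.3],
    "weights": [1.0, 1.5, 2.0, 1.2, 1.1, 0.8],
})

edges, top, bottom = mirror_counts(data, "var", "group", bins=5, weights="weights")
print(edges)
print(top)
print(bottom)
```

The returned arrays could be passed to a bar plot, using `edges[:-1]` as the left bar positions and `np.diff(edges)` as the bar widths.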