Introduction

The tmle3 package differs from previous TMLE software efforts in that it attempts to directly model the key objects defined in the mathematical TMLE framework. That is, rather than focusing on implementing a specific TML estimator, or a small set of related estimators, the focus is on modeling the TMLE framework itself.

Therefore, we explicitly define objects to model the NPSEM, the factorized likelihood, counterfactual interventions, parameters, and TMLE update procedures. The hope is that, in so doing, it will be possible to support a substantial subset of the vast array of TML estimators currently present in the literature (cite TL books), as well as those that have yet to be developed. In this vignette, we describe these mathematical objects and their software analogs in tmle3, and illustrate them with a motivating example, described below. At the end, we describe how these objects can be bundled into a specification of a TML estimation procedure that can be easily applied by an end user.

Motivating Example

We use data from the Collaborative Perinatal Project (CPP), available in the sl3 package. To simplify this example, we define a binary intervention variable, parity01 – an indicator of having one or more children before the current child – and a binary outcome, haz01 – an indicator of above-average height for age.

library(tmle3)
library(sl3)
# load the CPP data and drop observations missing the outcome
data(cpp)
cpp <- cpp[!is.na(cpp[, "haz"]), ]
# binary exposure: one or more children before the current child
cpp$parity01 <- as.numeric(cpp$parity > 0)
# impute remaining missing values with 0
cpp[is.na(cpp)] <- 0
# binary outcome: above-average height for age
cpp$haz01 <- as.numeric(cpp$haz > 0)

NPSEM

TMLE requires the specification of a Nonparametric Structural Equation Model (NPSEM), which encodes our knowledge of the relationships between the variables.

We start with a set of endogenous variables, \(X=(X_1,\ldots,X_J)\), whose relationships we want to model. Each \(X_j\) is at least partially observed in the dataset. The NPSEM defines each variable (\(X_j\)) by a deterministic function (\(f_{X_j}\)) of its parent nodes (\(Pa(X_j)\)) and an exogenous random variable (\(U_{X_j}\)):

\[X_j=f_{X_j}(Pa(X_j), U_{X_j}),\;\; j\in \{1, \ldots, J\}\]

The exact functional form of the functions \(f_{X_j}\) is left unspecified at this step. If there is a priori knowledge for some of these functions, that can be specified during the likelihood step below.

Causal Considerations

The collection of exogenous random variables defined by the NPSEM is \(U=(U_{X_1}, \ldots,U_{X_J})\). Typically, non-testable assumptions about the joint distribution of \(U\) are necessary to identify causal parameters with statistical parameters of the observed data. These assumptions are not managed in the tmle3 framework, which instead focuses on the statistical estimation problem. Therefore, those developing tools for end users need to be clear about the additional causal assumptions necessary for a causal interpretation of estimates.

Example

In the case of our CPP example, we use the classic point treatment NPSEM which defines three nodes: \(X=(W,A,Y)\), where \(W\) is a set of baseline covariates, \(A\) is our exposure of interest (parity01), and \(Y\) is our outcome of interest (haz01). We define the following SCM:

\[W=f_W(U_W)\] \[A=f_A(W,U_A)\] \[Y=f_Y(A,W,U_Y)\]

In tmle3, this is done using the define_node function for each node. define_node allows a user to specify the node_name, which columns in the data comprise the node, and a list of parent nodes.

npsem <- list(
  define_node("W", c(
    "apgar1", "apgar5", "gagebrth", "mage",
    "meducyrs", "sexn"
  )),
  define_node("A", c("parity01"), c("W")),
  define_node("Y", c("haz01"), c("A", "W"))
)

Nodes also track information about the data types of the variables (continuous, categorical, binomial, etc.). Here, that information is inferred automatically from the data. In the future, each node will also contain information about censoring indicators, where applicable, but this is not yet implemented.
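If the automatic detection is not appropriate, the type can also be declared explicitly. The following is a minimal sketch, assuming define_node accepts a variable_type argument and using sl3's variable_type() helper; exact argument names may differ.

# hedged sketch: explicitly declare the outcome node as binomial rather than
# relying on automatic type detection (assumes a variable_type argument)
node_Y <- define_node("Y", c("haz01"), c("A", "W"),
  variable_type = variable_type("binomial")
)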

tmle3_Task

A tmle3_Task is an object comprising the data and the NPSEM defined above:

task <- tmle3_Task$new(cpp, npsem = npsem)

This task object contains methods to help subset the data as needed for various TMLE steps:

#get the outcome node data
head(task$get_tmle_node("Y"))
## [1] 1 1 1 0 0 1
#get the sl3 task corresponding to an outcome regression
task$get_regression_task("Y")
## A sl3 Task with 1441 obs and these nodes:
## $covariates
##          A         W1         W2         W3         W4         W5 
## "parity01"   "apgar1"   "apgar5" "gagebrth"     "mage" "meducyrs" 
##         W6 
##     "sexn" 
## 
## $outcome
## [1] "haz01"
## 
## $id
## NULL
## 
## $weights
## NULL
## 
## $offset
## NULL

Likelihood

Having defined the NPSEM, we can now define a joint likelihood (probability density function) over the observed variables \(X\): \[P((X_1, \ldots,X_J)\in D)=\int_D f_{X_1, \ldots,X_J}(x_1,\ldots,x_J) \, dx_1 \cdots dx_J\] This can then be factorized into a series of conditional densities according to the NPSEM: \[f_{X_1, \ldots,X_J}(x_1,\ldots,x_J)=\prod_{j=1}^J f_{X_j \mid Pa(X_j)}(x_j \mid Pa(x_j))\]

Each \(f_{X_j \mid Pa(X_j)}\) is a conditional pdf (or probability mass function for discrete \(X_j\)), where the conditioning set is the set of parent nodes defined in the NPSEM. We refer to these objects as likelihood factors.
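For the CPP example NPSEM defined above, this factorization consists of three such factors: \[p(W, A, Y) = p_W(W) \, p_{A \mid W}(A \mid W) \, p_{Y \mid A, W}(Y \mid A, W)\]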

TMLE depends on estimates (or a priori knowledge) of the functional form of these likelihood factors. However, not all factors of the likelihood are always necessary for estimation, and only those necessary will be estimated.

Likelihood Factor Objects

tmle3 models this likelihood as a list of likelihood factor objects, where each likelihood factor object describes either a priori knowledge or an estimation strategy for the corresponding likelihood factor. These objects all inherit from the LF_base base class, with different subclasses corresponding to the different estimation strategies and forms of a priori knowledge that may be appropriate.

In some cases, a full conditional density for a particular factor is not necessary; instead, a conditional mean (a much easier quantity to estimate) is all that is required. Although conditional means are not truly likelihood factors, they are also modeled using likelihood factor objects.

LF_emp

LF_emp represents a likelihood factor to be estimated using the NP-MLE. That is, probability mass \(1/n\) is placed on each observation in the observed dataset. In the future, weights will be used if specified, although this is not yet supported. LF_emp only supports marginal densities; that is, the conditioning set must be empty (\(Pa(X_j)=\emptyset\)). Therefore, it is only appropriate for estimating the marginal density of the baseline covariates.

LF_fit

LF_fit represents a likelihood factor to be estimated using the sl3 framework. Depending on the learner used, this can be a pmf (for binomial or categorical data; see sl3_list_learners("binomial") and sl3_list_learners("categorical") for lists), a conditional mean (most learners), or a conditional density (using condensier via Lrnr_condensier). LF_fit takes an sl3 learner object as an argument, which is fit to the data in the tmle3_Task automatically.

Example
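As an illustration, the following is a minimal sketch of a factorized likelihood for the CPP example, using LF_emp for the marginal distribution of \(W\) and LF_fit with a GLM learner for \(A\) and \(Y\). It assumes the define_lf helper, the type argument to LF_fit, and the Likelihood class behave as sketched here; exact argument names may differ.

# hedged sketch: a factorized likelihood for the CPP NPSEM
# (assumes define_lf() and Likelihood$new() as sketched)
lrnr_glm <- make_learner(Lrnr_glm)
factor_list <- list(
  define_lf(LF_emp, "W"),                     # NP-MLE for baseline covariates
  define_lf(LF_fit, "A", learner = lrnr_glm), # treatment mechanism
  define_lf(LF_fit, "Y", learner = lrnr_glm, type = "mean") # outcome regression
)
likelihood_def <- Likelihood$new(factor_list)
likelihood <- likelihood_def$train(task)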

Counterfactual Likelihoods

In tmle3, interventions are modeled as likelihoods in which one or more likelihood factors are replaced with counterfactual versions representing some intervention.

tmle3 defines the CF_Likelihood class, which inherits from Likelihood, and takes an observed_likelihood and an intervention_list.
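For example, a counterfactual likelihood that sets all observations' exposure to \(A=1\) might be constructed roughly as follows. This is a sketch, using the LF_static factor described below and the likelihood sketched in the previous example, and assuming the CF_Likelihood constructor takes the observed likelihood and an intervention as shown; exact arguments may differ.

# hedged sketch: counterfactual likelihood setting A = 1 for all observations
# (assumes LF_static and CF_Likelihood$new() as sketched)
intervention <- define_lf(LF_static, "A", value = 1)
cf_likelihood <- CF_Likelihood$new(likelihood, intervention)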

Below, we describe some examples of additional likelihood factors intended to be used to describe interventions. We expect this list to grow as tmle3 is extended to additional use cases.

LF_static

Likelihood factor for a static intervention, in which the exposure is set to a single value \(a'\) for all observations. These factors are indicator functions, \(f(a \mid Pa(A)) = I(a = a')\), placing all probability mass on the intervention value.

LF_rule

LF_shift

Still to come

Target Parameter

Psi(Pn)

Definition of a target parameter requires specification of a mapping \(\Psi\) applied to \(P_0\). \(\Psi\) maps any \(P \in \mathcal{M}\) into a vector of numbers \(\Psi(P)\). We write the mapping as \(\Psi: \mathcal{M} \rightarrow \mathbb{R}^d\) for a \(d\)-dimensional parameter, and \(\psi_0 = \Psi(P_0)\) is the true value of our parameter. The statistical estimation problem is to map the observed data \(O_1, \ldots, O_n\) into an estimator of \(\Psi(P_0)\) that incorporates the knowledge that \(P_0 \in \mathcal{M}\), accompanied by an assessment of the uncertainty in the estimator.
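For the CPP example, one natural choice is the treatment-specific mean under the static intervention sketched above, \(\Psi(P) = E_W\left[E[Y \mid A=1, W]\right]\). A minimal sketch of defining this parameter, assuming a Param_TSM class and define_param helper behave as shown (exact arguments may differ):

# hedged sketch: treatment-specific mean parameter for the A = 1 intervention
# (assumes Param_TSM and define_param() as sketched)
tsm <- define_param(Param_TSM, likelihood, intervention)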

Efficient Influence Function (EIF)

Update Procedure

Submodel

Loss

Solver

Putting it all together

tmle3_Fit object

TMLE Specification

This object defines a particular TMLE estimation procedure, along with options that can be set by the end user.
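For instance, an end user might apply a bundled specification to the CPP data roughly as follows. This is a sketch, assuming a tmle_TSM_all() specification and a tmle3() front-end function as shown; exact names and arguments may differ.

# hedged sketch: applying a bundled TMLE specification as an end user
# (assumes tmle_TSM_all() and tmle3() as sketched)
node_list <- list(
  W = c("apgar1", "apgar5", "gagebrth", "mage", "meducyrs", "sexn"),
  A = "parity01",
  Y = "haz01"
)
learner_list <- list(A = make_learner(Lrnr_glm), Y = make_learner(Lrnr_glm))
tmle_fit <- tmle3(tmle_TSM_all(), cpp, node_list, learner_list)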

Conclusion

Bundling a TMLE definition for the end user

tmle_spec

  • How to document these

The Delta method

Future Work

  • Support for fluctuating other likelihood factors
  • TMLE for multiple parameters
  • One-step (recursive TMLE)
  • Weights-based fluctuation
  • Support for dynamic rule and stochastic interventions
  • Extension to longitudinal case
  • Simplified user interface