Pool qualified clients from results of multiple definitions

This function filters and pools (i.e., row-binds) qualified clients/groups from different sources, with an option to summarize by client. Unlike bind_source(), it does not require variable names; the function infers which variables to include, and their names, from the definition built by build_def(). Whether a client qualifies depends on the flag variables set by define_case(). Therefore, this function is intended to be used only when the built-in define_case() is the def_fn in build_def().

Usage

pool_case(
  data,
  def,
  output_lvl = c("raw", "clnt"),
  include_src = c("all", "has_valid", "n_per_clnt"),
  ...
)

Arguments

data

A list of data.frames or remote tables, which should be the output of execute_def().

def

A tibble of case definitions generated by build_def().

output_lvl

Either:

"raw" - output all records (default),
or "clnt" - output one record per client with summaries, including the date and source of the first valid record ('first_valid_date/src') and of the latest record ('last_entry_date/src'). Source-specific record counts are also provided (see the return section).

include_src

Character. It determines which sources' records are included. This matters when a client was identified from some, but not all, of the sources. The choice does not change the number of clients identified, but it does affect the number of records and the latest entry date. It must be one of:

"all" - records from all sources are included;
"has_valid" - for each client, records from sources that contain at least one valid record are included;
"n_per_clnt" - for each client, if they had fewer than n_per_clnt records in a source (see restrict_n()), then records from that source are removed.
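
The difference between "all" and "has_valid" can be sketched in base R. This is only an illustration of the filtering rule described above, not the package's implementation; the data frame and the column names src and valid are hypothetical stand-ins for the pooled records and the flag set by define_case().

```r
# Hypothetical pooled records; `valid` stands in for the 0/1 flag from define_case()
records <- data.frame(
  clnt_id = c(1, 1, 1, 2, 2),
  src     = c("src1", "src1", "src2", "src1", "src2"),
  valid   = c(1, 0, 0, 0, 1)
)

# "all": keep every record
keep_all <- records

# "has_valid": count valid records within each client-source pair,
# then keep only records from pairs with at least one valid record
n_valid <- ave(records$valid, records$clnt_id, records$src, FUN = sum)
keep_has_valid <- records[n_valid > 0, ]

# The set of clients is the same either way, but "has_valid" keeps fewer records
length(unique(keep_all$clnt_id)) == length(unique(keep_has_valid$clnt_id))
nrow(keep_all) > nrow(keep_has_valid)
```

In this toy data, client 1 has no valid record in src2 and client 2 has none in src1, so those records are dropped under "has_valid" while both clients remain.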

...

Additional arguments passed to bind_source().

Value

A data.frame or remote table with clients that satisfied the predefined case definition. Columns starting with "raw_in_" are source-specific counts of raw records, and columns starting with "valid_in_" are the number of valid entries (or the number of flags) in each source.
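
As a rough illustration of what the "raw_in_" and "valid_in_" columns represent, the base R sketch below tabulates a small hypothetical set of pooled records. The data and the column names src and valid are assumptions made for the example; the package may compute these counts differently.

```r
# Hypothetical pooled records: one row per record, `valid` is a 0/1 flag
records <- data.frame(
  clnt_id = c(1, 1, 1, 2, 2),
  src     = c("src1", "src1", "src2", "src1", "src2"),
  valid   = c(1, 0, 0, 0, 1)
)

# Raw records per client (rows) and source (columns)
raw_n <- as.data.frame.matrix(table(records$clnt_id, records$src))
names(raw_n) <- paste0("raw_in_", names(raw_n))

# Valid (flagged) records per client and source: sum of the 0/1 flag
valid_n <- as.data.frame.matrix(xtabs(valid ~ clnt_id + src, data = records))
names(valid_n) <- paste0("valid_in_", names(valid_n))

# One row per client, with both sets of source-specific counts
counts <- cbind(clnt_id = as.numeric(rownames(raw_n)), raw_n, valid_n)
counts
```

Here client 1 has two raw records in src1, one of which is valid, so raw_in_src1 is 2 and valid_in_src1 is 1 for that client.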

Examples

# toy data
df1 <- make_test_dat()
df2 <- make_test_dat()

# use build_def to make a toy definition
sud_def <- build_def("SUD", # usually a disease name
  src_lab = c("src1", "src2"), # identify from multiple sources, e.g., hospitalization, ED visits.
  # functions that filter the data with some criteria
  def_fn = define_case,
  fn_args = list(
    vars = starts_with("diagx"),
    match = "start", # "start" will be applied to all sources as length = 1
    vals = list(c("304"), c("305")),
    clnt_id = "clnt_id", # list()/c() could be omitted for single element
    # c() can be used in place of list
    # if this argument only takes one value for each source
    n_per_clnt = c(2, 3)
  )
)

# save the definition for re-use
# saveRDS(sud_def, file = some_path)

# execute definition
sud_by_src <- sud_def %>% execute_def(with_data = list(src1 = df1, src2 = df2))
#> 
#> Actions for definition SUD using source df1:
#> → --------------Inclusion step--------------
#> ℹ Identify records with condition(s):
#> • where at least one of the diagx, diagx_1, diagx_2 column(s) in each record
#> • contains a value satisfied regular expression: ^304
#> 
#> All unique value(s) and frequency in the result (as the conditions require just one of the columns containing target values; irrelevant values may come from other vars columns): 
#>  304 3040 3041 3042 3043 3044 3045 3046 3047 3048 3049  305 3050 3051 3052 3053 
#>    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
#> 3054 3055 3056 3057 3058 3059  999 NA's 
#>    1    1    1    1    1    1    1    1 
#> → --------------No. rows restriction--------------
#> ℹ Of the 29 clients in the input, 20 were flagged as 0 by restricting that each client must have at least 2 records 
#> → -------------- Output all records--------------
#> 
#> Actions for definition SUD using source df2:
#> → --------------Inclusion step--------------
#> ℹ Identify records with condition(s):
#> • where at least one of the diagx, diagx_1, diagx_2 column(s) in each record
#> • contains a value satisfied regular expression: ^305
#> 
#> All unique value(s) and frequency in the result (as the conditions require just one of the columns containing target values; irrelevant values may come from other vars columns): 
#>  304 3040 3041 3042 3043 3045 3046 3047 3048 3049  305 3050 3051 3052 3053 3054 
#>    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
#> 3055 3056 3057 3058 3059  999 NA's 
#>    1    1    1    1    1    1    1 
#> → --------------No. rows restriction--------------
#> ℹ Of the 24 clients in the input, 21 were flagged as 0 by restricting that each client must have at least 3 records 
#> → -------------- Output all records--------------

# pool results from src1 and src2 together at client level
pool_case(sud_by_src, sud_def, output_lvl = "clnt")
#> # A tibble: 12 × 6
#>    def   clnt_id raw_in_src1 raw_in_src2 valid_in_src1 valid_in_src2
#>    <chr>   <int>       <dbl>       <dbl>         <int>         <int>
#>  1 SUD         1           0           1             0             1
#>  2 SUD         4           1           1             1             0
#>  3 SUD         7           1           0             1             0
#>  4 SUD        20           0           1             0             1
#>  5 SUD        21           1           1             1             0
#>  6 SUD        22           1           0             1             0
#>  7 SUD        23           1           1             1             0
#>  8 SUD        25           1           0             1             0
#>  9 SUD        29           1           0             1             0
#> 10 SUD        34           1           1             0             1
#> 11 SUD        39           1           1             1             0
#> 12 SUD        42           1           1             1             0