This function filters and pools (i.e., row-binds) qualified clients/groups from different sources, with an option to summarize by client. Unlike bind_source(), there is no need to supply variable names; the function infers which variables to include, and their names, from the definition built by build_def(). Whether a client is qualified depends on the flag variables set by define_case(). Therefore, this function is intended to be used only with the built-in define_case() as def_fn in build_def().
Arguments
- data
A list of data.frames or remote tables, which should be the output of execute_def().
- def
A tibble of case definitions generated by build_def().
- output_lvl
Either:
"raw" - output all records (default),
or "clnt" - output one record per client with summaries including date of first valid record ('first_valid_date'), date of the latest record ('last_entry_date'), and sources that contain valid records.
- include_src
Character. Determines which sources' records are included. This matters when a client was identified from some, but not all, of the sources. The choice does not affect the number of clients identified, but it does affect the number of records and the latest entry date. One of:
"all" - records from all sources are included;
"has_valid" - for each client, records from sources that contain at least one valid record are included;
"n_per_clnt" - for each client, if they had fewer than n_per_clnt records in a source (see restrict_n()), then records from that source are removed.
- ...
Additional arguments passed to bind_source().
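The three include_src options can be illustrated with a minimal base R sketch on toy data. This is not the package's implementation: the column names (clnt_id, src, valid), the 0/1 valid flag, and the n_min thresholds are hypothetical stand-ins chosen only to show which records each option would keep.

```r
# Toy pooled records, one row per record (hypothetical column names).
records <- data.frame(
  clnt_id = c(1, 1, 1, 2, 2, 2, 2, 2),
  src     = c("src1", "src1", "src2", "src1", "src1", "src2", "src2", "src2"),
  valid   = c(1, 1, 0, 0, 0, 1, 1, 1),  # assumed 0/1 flag for a valid record
  stringsAsFactors = FALSE
)

# Hypothetical per-source minimum record counts, mirroring n_per_clnt = c(2, 3)
n_min <- c(src1 = 2, src2 = 3)

# For every record: how many records, and how many valid records,
# the same client has in the same source
n_records <- ave(records$valid, records$clnt_id, records$src, FUN = length)
n_valid   <- ave(records$valid, records$clnt_id, records$src, FUN = sum)

# "all": keep every record
keep_all <- rep(TRUE, nrow(records))
# "has_valid": keep a client's records only from sources with >= 1 valid record
keep_has_valid <- n_valid > 0
# "n_per_clnt": drop a client's records from sources where they have too few records
keep_n_per_clnt <- unname(n_records >= n_min[records$src])

records[keep_has_valid, ]
records[keep_n_per_clnt, ]
```

In this toy data the set of clients is the same under every option; only the number of retained records changes, which is the behaviour described for include_src.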
Value
A data.frame or remote table containing the clients that satisfied the predefined case definition. Columns starting with "raw_in_" are source-specific counts of raw records, and columns starting with "valid_in_" are the number of valid entries (or the number of flags) in each source.
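As a rough illustration of what the "raw_in_" and "valid_in_" columns count, the following base R sketch builds them by hand from toy records. It is only an assumption-laden sketch: the input columns (clnt_id, src, valid) are hypothetical, and the package may compute these summaries differently.

```r
# Toy pooled records (hypothetical column names; valid is an assumed 0/1 flag).
records <- data.frame(
  clnt_id = c(1, 1, 1, 2, 2, 2),
  src     = c("src1", "src1", "src2", "src2", "src2", "src2"),
  valid   = c(1, 1, 0, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Raw record counts: clients in rows, sources in columns
raw_counts <- table(records$clnt_id, records$src)
# Valid record counts: sum of the 0/1 flag within each client-source cell
valid_counts <- xtabs(valid ~ clnt_id + src, data = records)

# One row per client, with one raw_in_* and one valid_in_* column per source
clnt_summary <- data.frame(
  clnt_id       = as.numeric(rownames(raw_counts)),
  raw_in_src1   = as.vector(raw_counts[, "src1"]),
  raw_in_src2   = as.vector(raw_counts[, "src2"]),
  valid_in_src1 = as.vector(valid_counts[, "src1"]),
  valid_in_src2 = as.vector(valid_counts[, "src2"])
)
clnt_summary
```

A client with no records in a source gets 0 in both columns for that source, as client 2 does for src1 here.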
Examples
library(healthdb)
# toy data
df1 <- make_test_dat()
df2 <- make_test_dat()
# use build_def to make a toy definition
sud_def <- build_def("SUD", # usually a disease name
  src_lab = c("src1", "src2"), # identify from multiple sources, e.g., hospitalization, ED visits.
  # functions that filter the data with some criteria
  def_fn = define_case,
  fn_args = list(
    vars = starts_with("diagx"),
    match = "start", # "start" will be applied to all sources as length = 1
    vals = list(c("304"), c("305")),
    clnt_id = "clnt_id", # list()/c() could be omitted for single element
    # c() can be used in place of list
    # if this argument only takes one value for each source
    n_per_clnt = c(2, 3)
  )
)
# save the definition for re-use
# saveRDS(sud_def, file = some_path)
# execute definition
sud_by_src <- sud_def %>% execute_def(with_data = list(src1 = df1, src2 = df2))
#>
#> Actions for definition SUD using source df1:
#> → --------------Inclusion step--------------
#> ℹ Identify records with condition(s):
#> • where at least one of the diagx, diagx_1, diagx_2 column(s) in each record
#> • contains a value satisfied regular expression: ^304
#>
#> All unique value(s) and frequency in the result (as the conditions require just one of the columns containing target values; irrelevant values may come from other vars columns):
#>  304 3040 3041 3042 3043 3044 3045 3046 3047 3048 3049  305 3050 3051 3052 3053
#>    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
#> 3054 3055 3056 3057 3058 3059  999 NA's
#>    1    1    1    1    1    1    1    1
#> → --------------No. rows restriction--------------
#> ℹ Of the 29 clients in the input, 20 were flagged as 0 by restricting that each client must have at least 2 records
#> → -------------- Output all records--------------
#>
#> Actions for definition SUD using source df2:
#> → --------------Inclusion step--------------
#> ℹ Identify records with condition(s):
#> • where at least one of the diagx, diagx_1, diagx_2 column(s) in each record
#> • contains a value satisfied regular expression: ^305
#>
#> All unique value(s) and frequency in the result (as the conditions require just one of the columns containing target values; irrelevant values may come from other vars columns):
#>  304 3040 3041 3042 3043 3045 3046 3047 3048 3049  305 3050 3051 3052 3053 3054
#>    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
#> 3055 3056 3057 3058 3059  999 NA's
#>    1    1    1    1    1    1    1
#> → --------------No. rows restriction--------------
#> ℹ Of the 24 clients in the input, 21 were flagged as 0 by restricting that each client must have at least 3 records
#> → -------------- Output all records--------------
# pool results from src1 and src2 together at client level
pool_case(sud_by_src, sud_def, output_lvl = "clnt")
#> # A tibble: 12 × 6
#>    def   clnt_id raw_in_src1 raw_in_src2 valid_in_src1 valid_in_src2
#>    <chr>   <int>       <dbl>       <dbl>         <int>         <int>
#>  1 SUD         1           0           1             0             1
#>  2 SUD         4           1           1             1             0
#>  3 SUD         7           1           0             1             0
#>  4 SUD        20           0           1             0             1
#>  5 SUD        21           1           1             1             0
#>  6 SUD        22           1           0             1             0
#>  7 SUD        23           1           1             1             0
#>  8 SUD        25           1           0             1             0
#>  9 SUD        29           1           0             1             0
#> 10 SUD        34           1           1             0             1
#> 11 SUD        39           1           1             1             0
#> 12 SUD        42           1           1             1             0