---
title: "Introduction to the msSPChelpR package - from long dataset to SIR analyses"
author: "Marian Eberl"
date: "26 October 2020"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to the msSPChelpR package - from long dataset to SIR analyses}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r , include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```


# Introduction

This vignette explains how to use the functions:

 * `calc_futime()` to calculate follow-up time from index event until next event, death or end of follow-up date
 * `pat_status()` to determine patient status at end of follow-up
 * `renumber_time_id()` to calculate a consecutive index of events per case ID
 * `reshape_long()` to transpose dataset in wide format to data in long format
 * `reshape_wide()` to transpose dataset in long format to data in wide format (the wide format is required for many package functions)
 * `sir_byfutime()` to calculate standardized incidence ratios (SIRs) with custom grouping variables stratified by follow-up time
 * `summarize_sir_results()` to summarize detailed SIR results produced by `sir_byfutime()`
 * `vital_status()` to determine whether a patient is alive or dead at end of follow-up

Some functions come in several variants built on different frameworks. The variants give the same results but differ in execution time and memory use:
 
 * [`tidytable`](https://markfairbanks.github.io/tidytable/) variants carry the suffix `_tt()`
 * [`tidyr`](https://tidyr.tidyverse.org) variants carry the suffix `_tidyr()`
 

# Recommended Workflows

## Calculate follow-up times {#wf-calc-fu}

Run the following steps in the order given to obtain accurate follow-up times:

 1. [Use the long version dataset](#step-long)

 2. [Filter all cases](#step-filter) in the long version of the dataset that are relevant for your analysis. Make sure that: 
 
   * for each `case_id` the index event (e.g. First Cancer FC) is still included and is the remaining row with the smallest `time_id` (`TUMID3` variable for ZfKD data, and `SEQ_NUM` for SEER data)
   * each `case_id` may or may not have a countable incident event (e.g. Second Primary Cancer SPC). If it is to be counted, this event must be the second entry per `case_id` (second smallest `time_id`)
   * in the long version dataset a `count_var` should indicate whether the countable incident event (SPC) has occurred or not. It is coded `0` for non-occurrence (or an event that is not counted) and `1` for a counted incident event.
   
 3. [Renumber filtered long dataset](#step-renumber): In the filtered long dataset, run the helper function `msSPChelpR::renumber_time_id_dt()` (or the non-data.table variant `msSPChelpR::renumber_time_id()`), which renumbers all events per `case_id` and (if the conditions of step 2 are fulfilled) assigns `time_var_new = 1` to each index event and `time_var_new = 2` to each second event (the possibly countable incident event). SIR-related functions only count the second event if, in addition to `time_var_new = 2`, `count_var = 1` is true for that row.
 
 4. [Reshape dataset](#step-reshape-wide): Run `msSPChelpR::reshape_wide_dt()` or the non-data.table variant `msSPChelpR::reshape_wide()` to transpose the dataset to wide format (one row per `case_id`, creating variables such as `count_var.2`).
 
 5. [Set flag for Second Primary Cancer diagnosis](#step-spc): After filtering and reshaping it is essential to set `p_spc` again. This variable will be used by later steps of the analysis.
 
 6. [Determine patient status](#step-pat-status) at a defined end of follow-up by using the `msSPChelpR::pat_status()` function. The end of follow-up date is always defined via the `fu_end =` parameter and must:
   * be in "YYYY-MM-DD" format
   * not be later than the end of data collection. E.g. if the last incident events in the dataset you are using were collected at the end of 2014, use `fu_end = "2014-12-31"` or earlier.
   
   * Based on the newly calculated patient status, you might want to [exclude cases for which patient status cannot be determined](#step-filter-pat-status)
   
 7. [Calculate follow-up time](#step-calc-futime) for the same dataset by using the `msSPChelpR::calc_futime()` function and the same `fu_end` as for step 6. By default, all functions of the `msSPChelpR` package require follow-up times as numeric years.
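
The logic of steps 6 and 7 can be illustrated with a small base R sketch. It is a simplified illustration and not the implementation used by `calc_futime()`: it assumes that follow-up starts at the first cancer diagnosis and ends at the earliest of SPC diagnosis, death or `fu_end`, and it approximates one year as 365.25 days. The data frame `toy_wide` and all of its column names are hypothetical.

```{r}
# Hypothetical toy data in wide format (one row per case)
toy_wide <- data.frame(
  case_id    = 1:3,
  fc_date    = as.Date(c("2010-01-01", "2012-06-15", "2013-03-01")),
  spc_date   = as.Date(c("2012-01-01", NA, NA)),
  death_date = as.Date(c(NA, "2013-06-15", NA))
)
toy_fu_end <- as.Date("2014-12-31")

# End of follow-up per case: earliest of SPC diagnosis, death and fu_end
# (missing dates are ignored via na.rm = TRUE)
toy_wide$fu_stop <- pmin(toy_wide$spc_date, toy_wide$death_date,
                         toy_fu_end, na.rm = TRUE)

# Follow-up time in years (assumption: 1 year = 365.25 days)
toy_wide$futime_yrs <- as.numeric(toy_wide$fu_stop - toy_wide$fc_date) / 365.25

toy_wide
```

Case 1 stops at the SPC diagnosis, case 2 at death and case 3 at `toy_fu_end`.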
 

## Calculate Standardized Incidence Ratios (SIR) 

In order to calculate SIRs using the package functions, the following data structure is needed:

 * Wide format data `wide_df` with one row per patient who has experienced the index event (i.e. was diagnosed with a first primary cancer FC)

 * The dataset `wide_df` needs to contain the following variables (columns) per patient (row):
   * `region_var` - variable in df that contains information on region where case was incident.
   * `agegroup_var` - variable in df that contains information on age-group.
   * `sex_var` - variable in df that contains information on biological sex.
   * `year_var` - variable in df that contains information on year or year-period when case was incident.
   * `site_var` - variable in df that contains information on case (count event) diagnosis. Cases are usually the second cancers. Diagnoses can use any coding system (e.g. ICD), but the coding system must be consistent between the dataset and the reference data.
   * `futime_var` - variable in df that contains follow-up time per person in years, from the date of first cancer until death, date of event (case) or end of FU date, whichever comes first. In case you have not calculated the FU time yet, you can use the workflow described in the previous [chapter](#wf-calc-fu).
   
If your data has the required structure, you can calculate and summarize SIRs with the following two steps:

 8. [Calculate SIR](#step-sir-byfutime) per SPC diagnosis with age-, sex-, region- and period-specific strata using the `msSPChelpR::sir_byfutime()` function. This calculation usually requires a reference dataset that defines the population standard rates. `refrates_df` must use the same category coding of age, sex, region, year and cancer_site as `agegroup_var`, `sex_var`, `region_var`, `year_var` and `site_var`.
   * The theory behind calculating stratified SIRs is explained in the chapter on [basics on SIRs](#theo-sir)
   
 9. [Summarize SIR results](#step-summarize-sir) using the `msSPChelpR::summarize_sir_results()` function on the stratified SIR results produced by the previous step.
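
As a rough orientation for step 8, an SIR is the ratio of observed to expected events, where the expected count is obtained by applying the reference rates to the person-years at risk in each stratum. The following base R sketch uses made-up numbers and hypothetical column names. It assumes reference rates per 100,000 person-years and uses an exact Poisson confidence interval, which is one common choice and not necessarily the method implemented in `sir_byfutime()`.

```{r}
# Hypothetical strata with observed events, person-years and reference rates
toy_strata <- data.frame(
  stratum      = c("Male 60-64", "Male 65-69", "Female 60-64"),
  observed     = c(4, 7, 3),
  person_years = c(1200, 1500, 1000),
  ref_rate     = c(150, 220, 90)   # assumption: rate per 100,000 person-years
)

# Expected events per stratum = person-years * reference rate
toy_strata$expected <- toy_strata$person_years * toy_strata$ref_rate / 100000

toy_obs   <- sum(toy_strata$observed)
toy_exp   <- sum(toy_strata$expected)
toy_alpha <- 0.05

# SIR with exact Poisson confidence limits (chi-square formulation)
toy_sir     <- toy_obs / toy_exp
toy_sir_lci <- stats::qchisq(toy_alpha / 2, df = 2 * toy_obs) / 2 / toy_exp
toy_sir_uci <- stats::qchisq(1 - toy_alpha / 2, df = 2 * (toy_obs + 1)) / 2 / toy_exp

round(c(observed = toy_obs, expected = toy_exp,
        sir = toy_sir, lci = toy_sir_lci, uci = toy_sir_uci), 2)
```

With 14 observed and 6 expected events, the toy SIR is about 2.33.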
 
# Theory behind SIRs {#theo-sir}

The theoretical considerations behind the calculation of SIRs will be explained in this chapter in the next version of this vignette.

# Examples

## SEER lung cancer

### Step 1 - Long dataset {#step-long}


```{r}
library(dplyr)
library(magrittr)
library(msSPChelpR)
#Load synthetic dataset of patients with cancer to demonstrate package functions
data("us_second_cancer")

#This dataset is in long format, so each tumor is a separate row in the data
us_second_cancer
```


### Step 2 - Filter long dataset {#step-filter}

```{r}
#filter for lung cancer
ids <- us_second_cancer %>%
  #detect ids with any lung cancer
  filter(t_site_icd == "C34") %>%
  select(fake_id) %>%
  as.vector() %>%
  unname() %>%
  unlist()

filtered_usdata <- us_second_cancer %>%
  #filter according to above detected ids with any lung cancer diagnosis
  filter(fake_id %in% ids) %>%
  arrange(fake_id)

filtered_usdata
```


### Step 3 - Renumber `time_id` {#step-renumber}

```{r}
renumbered_usdata <- filtered_usdata %>%
  renumber_time_id(new_time_id_var = "t_tumid", 
                   dattype = "seer",
                   case_id_var = "fake_id")

renumbered_usdata %>%
   select(fake_id, sex, t_site_icd, t_datediag, t_tumid)
```
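
To see why renumbering is needed after filtering, consider the following base R sketch on hypothetical toy data. It is not the implementation of `renumber_time_id()`; it assumes that events are ordered by diagnosis date within each case, and all object and column names (`toy_long`, `seq_num`, `new_id`) are made up for illustration.

```{r}
# Hypothetical long data: case 1 has three tumors, case 2 has two
toy_long <- data.frame(
  case_id  = c(1, 1, 1, 2, 2),
  seq_num  = c(1, 2, 3, 1, 2),
  diagdate = as.Date(c("2005-03-01", "2008-07-15", "2011-01-20",
                       "2006-05-05", "2009-09-09"))
)

# Filtering removes the second tumor of case 1, leaving a gap (1, 3)
toy_filtered <- toy_long[!(toy_long$case_id == 1 & toy_long$seq_num == 2), ]

# Renumber consecutively per case, ordered by diagnosis date
toy_filtered <- toy_filtered[order(toy_filtered$case_id, toy_filtered$diagdate), ]
toy_filtered$new_id <- ave(seq_along(toy_filtered$case_id),
                           toy_filtered$case_id, FUN = seq_along)

toy_filtered
```

After renumbering, the remaining events of case 1 carry the consecutive index 1 and 2 instead of 1 and 3.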


### Step 4 - Reshape to wide dataset {#step-reshape-wide}

```{r}
usdata_wide <- renumbered_usdata %>%
  reshape_wide_tidyr(case_id_var = "fake_id", time_id_var = "t_tumid", timevar_max = 10)

#now the data is in the wide format as required by many package functions. 
#This means each case is a row, and the tumors per case ID are 
#added as new columns using the time_id as column name suffix.
usdata_wide
```
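
The naming pattern of the wide format (variable name plus `.1`, `.2`, ... suffix) can be reproduced on toy data with base R's `stats::reshape()`. This is only meant to illustrate the resulting structure and is not how the package functions are implemented; `toy_events` and its columns are hypothetical.

```{r}
# Hypothetical long data: case 1 has two tumors, case 2 has one
toy_events <- data.frame(
  case_id = c(1, 1, 2),
  time_id = c(1, 2, 1),
  site    = c("C34", "C50", "C34"),
  year    = c(2005, 2009, 2006),
  stringsAsFactors = FALSE
)

# One row per case; time-varying columns get the time_id as suffix
toy_events_wide <- stats::reshape(toy_events,
                                  idvar = "case_id",
                                  timevar = "time_id",
                                  direction = "wide")

toy_events_wide
```

Case 2 has no second tumor, so `site.2` and `year.2` are `NA` for that row.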

### Step 5 - Recalculate `p_spc` {#step-spc}

```{r}

usdata_wide <- usdata_wide %>%
  dplyr::mutate(p_spc = dplyr::case_when(is.na(t_site_icd.2)   ~ "No SPC",
                         !is.na(t_site_icd.2)           ~ "SPC developed",
                         TRUE ~ NA_character_)) %>%
  #create the same information as numeric variable count_spc (0 = no SPC, 1 = SPC developed)
  dplyr::mutate(count_spc = dplyr::case_when(is.na(t_site_icd.2)   ~ 0,
                            TRUE ~ 1))
usdata_wide %>%
   dplyr::select(fake_id, sex.1, p_spc, count_spc, t_site_icd.1, 
                 t_datediag.1, t_site_icd.2, t_datediag.2)

```
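
Because the numeric flag is later passed to `sir_byfutime()` as `count_var`, it can be worth confirming that the text label and the numeric flag agree, i.e. that cases without a second tumor are coded `0` and cases with a second tumor are coded `1`. A minimal base R sketch of such a check on hypothetical toy data (`toy_spc` and its columns are made up):

```{r}
# Hypothetical wide data: cases 1 and 3 have a second tumor, case 2 does not
toy_spc <- data.frame(
  case_id = 1:3,
  site.1  = c("C34", "C34", "C34"),
  site.2  = c("C50", NA, "C18"),
  stringsAsFactors = FALSE
)

# Derive the label and the numeric flag from the same condition
toy_spc$p_spc     <- ifelse(is.na(toy_spc$site.2), "No SPC", "SPC developed")
toy_spc$count_spc <- ifelse(is.na(toy_spc$site.2), 0, 1)

# Cross-tabulation: all counts should sit on the matching label/flag cells
table(toy_spc$p_spc, toy_spc$count_spc)
```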

### Step 6 - Determine patient status at end of FU {#step-pat-status}

```{r}
usdata_wide <- usdata_wide %>%
  pat_status(., fu_end = "2017-12-31", dattype = "seer",
             status_var = "p_status", life_var = "p_alive.1",
             spc_var = "p_spc", birthdat_var = "datebirth.1",
             lifedat_var = "datedeath.1", fcdat_var = "t_datediag.1",
             spcdat_var = "t_datediag.2", life_stat_alive = "Alive",
             life_stat_dead = "Dead", spc_stat_yes = "SPC developed",
             spc_stat_no = "No SPC", lifedat_fu_end = "2019-12-31",
             use_lifedatmin = FALSE, check = TRUE, 
             as_labelled_factor = TRUE)

usdata_wide %>%
   dplyr::select(fake_id, p_status, p_alive.1, datedeath.1, t_site_icd.1, t_datediag.1, 
                 t_site_icd.2, t_datediag.2)

#alternatively, you can impute the date of death using lifedatmin_var
usdata_wide %>%
  pat_status(., fu_end = "2017-12-31", dattype = "seer",
             status_var = "p_status", life_var = "p_alive.1",
             spc_var = "p_spc", birthdat_var = "datebirth.1",
             lifedat_var = "datedeath.1", fcdat_var = "t_datediag.1",
             spcdat_var = "t_datediag.2", life_stat_alive = "Alive",
             life_stat_dead = "Dead", spc_stat_yes = "SPC developed",
             spc_stat_no = "No SPC", lifedat_fu_end = "2019-12-31",
             use_lifedatmin = TRUE, lifedatmin_var = "p_dodmin.1", 
             check = TRUE, as_labelled_factor = TRUE)


```

#### Step 6b - Remove patients irrelevant to analysis depending on status {#step-filter-pat-status}
```{r}
usdata_wide <- usdata_wide %>%
  dplyr::filter(!p_status %in% c("NA - Patient not born before end of FU",
                                 "NA - Patient did not develop cancer before end of FU",
                                 "NA - Patient date of death is missing"))

usdata_wide %>%
  dplyr::count(p_status)

```

### Step 7 - Calculate FU time {#step-calc-futime}

```{r}
usdata_wide <- usdata_wide %>%
   calc_futime(., futime_var_new = "p_futimeyrs", fu_end = "2017-12-31",
               dattype = "seer", time_unit = "years", 
               lifedat_var = "datedeath.1", 
               fcdat_var = "t_datediag.1", spcdat_var = "t_datediag.2")

usdata_wide %>%
   dplyr::select(fake_id, p_status, p_futimeyrs, p_alive.1, datedeath.1, t_datediag.1, t_datediag.2)

```


### Step 8 - Calculate SIR {#step-sir-byfutime}

```{r}
sircalc_results <- usdata_wide %>%
  sir_byfutime(
    dattype = "seer",
    ybreak_vars = c("race.1", "t_dco.1"),
    xbreak_var = "none",
    futime_breaks = c(0, 1/12, 2/12, 1, 5, 10, Inf),
    count_var = "count_spc",
    refrates_df = us_refrates_icd2,
    calc_total_row = TRUE,
    calc_total_fu = TRUE,
    region_var = "registry.1",
    age_var = "fc_agegroup.1",
    sex_var = "sex.1",
    year_var = "t_yeardiag.1",
    race_var = "race.1",
    site_var = "t_site_icd.2", #using grouping by second cancer incidence
    futime_var = "p_futimeyrs",
    alpha = 0.05)

sircalc_results %>% print(n = 100)

```
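
The `futime_breaks` used above split follow-up time into six intervals: first month, second month, two months to one year, one to five years, five to ten years, and more than ten years. How individual follow-up times fall into such intervals can be sketched with base R's `cut()`. Whether `sir_byfutime()` treats interval boundaries as left-closed or right-closed is not stated in this vignette, so `right = FALSE` below is an assumption, and `toy_futime` is made-up data.

```{r}
# Same breaks as in the sir_byfutime() call (in years)
toy_breaks <- c(0, 1/12, 2/12, 1, 5, 10, Inf)

# Hypothetical follow-up times in years, one per interval
toy_futime <- c(0.05, 0.1, 0.5, 3, 7, 12)

# Assumption: intervals are closed on the left, e.g. [1, 5)
toy_bins <- cut(toy_futime, breaks = toy_breaks, right = FALSE)

data.frame(futime_yrs = toy_futime, fu_interval = toy_bins)
```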

### Step 9 - Summarize SIR results {#step-summarize-sir}

```{r}
#The summarize function is versatile. For example, here is the summary with minimal output:

sircalc_results %>%
  #summarize results across region, age, year and race, and also across t_site
  summarize_sir_results(.,
                        summarize_groups = c("region", "age", "year", "race"),
                        summarize_site = TRUE,
                        output = "long",  output_information = "minimal",
                        add_total_row = "only",  add_total_fu = "no",
                        collapse_ci = FALSE,  shorten_total_cols = TRUE,
                        fubreak_var_name = "fu_time", ybreak_var_name = "yvar_name",
                        xbreak_var_name = "none", site_var_name = "t_site",
                        alpha = 0.05
                        ) %>%
  dplyr::select(-region, -age, -year, -race, -sex, -yvar_name)
```


## Built with

```{r}
sessionInfo()
```