--- title: "Getting Started" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Getting Started} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup} library(conmat) library(socialmixr) library(ggplot2) library(dplyr) library(tidyr) library(mgcv) library(patchwork) ``` The goal of conmat is to simplify the process of generating synthetic contact matrices for a given age population. **What is a contact matrix?** Contact matrices describe the degree of contact between individuals of given age groups. For example, this matrix describes the number of contacts between individuals. ```{r} #| label: show-contact #| echo: FALSE name_vec <- c("0-4", "5-9", "10-14") cmat <- matrix( data = rep(NA, 9), nrow = 3, ncol = 3, byrow = TRUE, dimnames = list( name_vec, name_vec ) ) diag(cmat) <- c(10, 11, 13) cmat[upper.tri(cmat)] <- 3:5 cmat[lower.tri(cmat)] <- 3:5 cmat ``` The rows and columns represent the age groups of the people. On the main diagonal we see that we have a higher number of contacts - showing that people of similar ages tend to interact more with one another. We can use the information in these matrices to model how diseases such as COVID-19 spread in a population through social contact. **Why do we need _synthetic_ contact matrices?** Contact matrices are produced from empirical data resulting from a contact survey, which requires individuals to diary the amount and manner of contact a person has in a day. However, these surveys are highly time-consuming and expensive to run, meaning that only a handful of these empirical datasets exist globally. We can use statistical methods to create _synthetic contact matrices_, which are new contact matrices that have been generalised to new countries based on existing surveys. **Why do we need `conmat`?** Existing methods only provide outputs of the contact matrices for each country, or at best, for urban and rural areas for a given country. We need methods that allow for flexibly creating synthetic contact matrices for a specified age population. This is because the age population distribution of many countries (e.g., Australia), are quite heterogeneous, and assuming it is homogeneous would result in inaccurate representation of community infection in many regions. ## Quick example using Australian data Suppose we want to get a contact matrix for a given region in Australia, let's say the city of Perth. We can get that from a helper function, `abs_age_lga`. ```{r} perth <- abs_age_lga("Perth (C)") perth ``` (You can learn more about the data sources we provide in `data-sources.Rmd`) We can get a contact matrix made for `perth` using the `extrapolate_polymod` function: ```{r} perth_contact <- extrapolate_polymod( population = perth ) perth_contact ``` We can plot this with `autoplot` ```{r} autoplot(perth_contact) ``` And you can see each contact matrix in a setting by referring to its name - for example, the "home" setting contact matrix: ```{r} perth_contact$home ``` These contact matrices could then be used in subsequent modelling, such as the input in a SIR (Susceptible, Infected, Recovered) model. ## Quick example using world data Similarly to above, we can access some world data from another data source - we have some helpers to pull data from the world population: ```{r} world_data <- socialmixr::wpp_age() head(world_data) ``` We can tidy the data up, filtering down to a specified location and year with the `age_population` function: ```{r} nz_2015 <- age_population( data = world_data, location_col = country, location = "New Zealand", age_col = lower.age.limit, year_col = year, year = 2015 ) nz_2015 ``` Then we could create a contact matrix for NZ population data for 2015 like so: ```{r} nz_contact <- extrapolate_polymod( population = nz_2015 ) autoplot(nz_contact) nz_contact$home ``` ## What next? From here you might want to: - Create a next generation matrix (NGM) - Apply vaccination to an NGM See the vignette, "example pipeline" for more details. # A More in depth example The above example showed how we might extract a contact matrix based on the polymod data - this example now shows all the steps that can be taken for full flexibility, and provides more detail on the initial datasets that could be used. First we want to fit the model to the POLYMOD data, which contains various survey and population data. ```{r get-polymod} library(conmat) polymod_contact_data_home <- get_polymod_contact_data(setting = "home") polymod_survey_data <- get_polymod_population() ``` The contact data is a data frame containing the age from and to, and the number of contacts for each of the specified settings, "home", "work", "school", "other", or "all" as well as the number of participants. By default, `polymod_contact_data` contains data from "all", but we're going to use the "work" set of data, as it produces an interesting looking dataset. Each row contains survey information of the number of contacts. Specifically, the number of contacts from one age group to another age group, and then the number of participants in that age group. The survey data, `polymod_survey_data` contains the lower age limit and the population in that age group. ```{r show-polymod} polymod_survey_data ``` We also provide control over the POLYMOD data retrieved from `get_polymod_contact_data()` via the arguments, `setting`, `country`, and `ages`. These allow you to specify the data to be only from certain settings or countries or ages. See `?get_polymod_contact_data` for more details. Below is a brief example of this: ```{r show-off-polymod-contact-data} polymod_contact_data_belgium_0_10 <- get_polymod_contact_data( setting = "work", countries = "Belgium", ages = c(0, 5, 10) ) polymod_contact_data_belgium_0_10 ``` Similarly, you can control the population data, retrieving it only for certain countries: ```{r show-off-polymod} get_polymod_population(countries = "Belgium") get_polymod_population(countries = "Finland") ``` You can see the available countries in the helpfile with `?get_polymod_population`. ## Predicting the contact rate We can create a model of the contact *rate* with the function `fit_single_contact_model`. We're first going to use some smaller sets of the data, to save on computation time. ```{r fit-polymod} set.seed(2022 - 10 - 04) polymod_contact_data_home_small <- polymod_contact_data_home %>% filter( age_from <= 30, age_to <= 30 ) polymod_survey_data_small <- polymod_survey_data %>% filter(lower.age.limit <= 30) contact_model <- fit_single_contact_model( contact_data = polymod_contact_data_home_small, population = polymod_survey_data_small ) ``` This fits a generalised additive model (GAM), predicting the contact rate, based on a series of prediction terms that describe various features of the contact rates. ```{r show-conctact-model} contact_model ``` We can use this contact model to then predict the contact rate in a new population. As a demonstration, let's take an age population from a given LGA in Australia (this was the initial motivation for the package, so there are some helper functions for Australian specific data). ```{r fairfield} fairfield <- abs_age_lga("Fairfield (C)") fairfield ``` We can then pass the contact model through to `predict_contacts`, along with the fairfield age population data, and some age breaks that we want to predict to. Note that these age breaks could be any size, we just have them set to 5 year age brackets in most of the examples, but these could be 1 year, 2 year, or even sub year. ```{r predict-contacts} set.seed(2022 - 10 - 04) synthetic_contact_fairfield <- predict_contacts( model = contact_model, population = fairfield, age_breaks = c(seq(0, 30, by = 5), Inf) ) synthetic_contact_fairfield ``` ## Plotting Let's visualise the matrix to get a sense of the predictions with `autoplot`. First we need to transform the predictions to a matrix: ```{r plot-matrix-differents} synthetic_contact_fairfield %>% predictions_to_matrix() %>% autoplot() ``` ## Note It is worth noting that the contact matrices created using this package are transposed when compared to the contact matrices discussed by [Prem](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005697) and [Mossong](https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.0050074). That is, the rows are "age group to", and the columns are "age group from". ## Applying the model across all settings. Our experience has been that we would be fitting the same models to each setting when doing data analysis when using conmat. Accordingly, you can also fit a model for all of the settings all at once with the functions, `fit_setting_contacts()`, and `predict_setting_contacts()`. This means we can do the above, but for each setting, "home", "work", "school", "other", and "all", at once. If we want to use all of the POLYMOD data, we can also use the `extrapolate_polymod()` function. ### Fit to all settings We can create a model for each of the settings with `fit_setting_contacts()`. ```{r fit-polymod-setting} set.seed(2021 - 09 - 24) polymod_setting_data <- get_polymod_setting_data() polymod_setting_data_small <- polymod_setting_data %>% lapply(FUN = function(x) x %>% filter(age_from <= 20, age_to <= 20)) |> new_setting_data() setting_models <- fit_setting_contacts( contact_data_list = polymod_setting_data_small, population = polymod_survey_data ) ``` This contains a list of models, one for each setting. We can look at one, and get summary information out: ```{r names-setting-models} names(setting_models) setting_models$home ``` So this gives us our baseline model of a contact model. We have fit this model to the existing contact survey data. We can now predict to another age population, to create our "synthetic" contact matrix. ### Predict to all settings Then we take the model we had earlier, and extrapolate to the fairfield data with `predict_setting_contacts`, also providing some age breaks we want to predict to ```{r fairfield-synth-5} set.seed(2021 - 10 - 04) synthetic_settings_5y_fairfield <- predict_setting_contacts( population = fairfield, contact_model = setting_models, age_breaks = c(seq(0, 20, by = 5), Inf) ) ``` This then returns a list of synthetic matrices, "home", "work", "school", "other", and "all", which is the sum of all matrices. ```{r str-synthetic-fairfield} str(synthetic_settings_5y_fairfield) synthetic_settings_5y_fairfield$home synthetic_settings_5y_fairfield$all ``` We can use `autoplot` to plot all at once ```{r fairfield-synth-5-plot} # this code is erroring for the moment - something to do with rendering a large plot I think. autoplot( synthetic_settings_5y_fairfield, title = "Setting-specific synthetic contact matrices (fairfield 2020 projected)" ) ```