---
title: "webtrackR"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{webtrackR}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(webtrackR)
```

`webtrackR` is an R package to preprocess and analyze web tracking data, i.e., web browsing histories of participants in an academic study. Web tracking data is often collected and analyzed in conjunction with survey data of the same participants.

`webtrackR` is part of a series of R packages to analyze web tracking data:

- [webtrackR](https://github.com/schochastics/webtrackR): preprocess raw web tracking data
- [domainator](https://github.com/schochastics/domainator): classify domains
- [adaR](https://github.com/gesistsa/adaR): parse URLs

## Installation

You can install the development version of webtrackR from [GitHub](https://github.com/) with:

``` r
# install.packages("devtools")
devtools::install_github("schochastics/webtrackR")
```

The [CRAN](https://CRAN.R-project.org/package=webtrackR) version can be installed with:

```r
install.packages("webtrackR")
```

## S3 class `wt_dt`

The package defines an S3 class called `wt_dt`, which inherits most of its functionality from the `data.frame` class. `summary` and `print` methods are included in the package.

Each row in a web tracking data set represents a visit. Raw data need to have at least the following variables:

- `panelist_id`: the individual from whom the data was collected
- `url`: the URL of the visit
- `timestamp`: the time of the URL visit

The function `as.wt_dt` assigns the class `wt_dt` to a raw web tracking data set. It also allows you to specify the names of the raw variables corresponding to `panelist_id`, `url` and `timestamp`. Additionally, it turns the timestamp variable into `POSIXct` format.

All preprocessing functions check whether these three variables are present and throw an error otherwise.
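
The conversion steps described above (checking the required variables, mapping raw column names to the standard ones, and parsing the timestamp) can be illustrated with a small base-R sketch. This is not the package's implementation of `as.wt_dt`: the helper `to_tracking_df()`, its arguments, the class name `tracking_df`, and the example data are all hypothetical and exist only for illustration.

```r
# Hypothetical helper, NOT part of webtrackR: a base-R illustration of the
# idea behind converting raw data into a tracking data set.
to_tracking_df <- function(x,
                           varnames = c(panelist_id = "panelist_id",
                                        url = "url",
                                        timestamp = "timestamp"),
                           timestamp_format = "%Y-%m-%d %H:%M:%S",
                           tz = "UTC") {
  required <- c("panelist_id", "url", "timestamp")

  # Look up the raw column name mapped to each required variable.
  src <- unname(varnames[required])

  # Stop if a required variable is unmapped or its column is absent.
  missing_vars <- required[is.na(src) | !(src %in% names(x))]
  if (length(missing_vars) > 0) {
    stop("missing required variable(s): ",
         paste(missing_vars, collapse = ", "))
  }

  # Rename the raw columns to the standard names.
  names(x)[match(src, names(x))] <- required

  # Parse the character timestamps into POSIXct.
  x$timestamp <- as.POSIXct(x$timestamp, format = timestamp_format, tz = tz)

  # Prepend an illustrative class tag while keeping data.frame behaviour.
  class(x) <- c("tracking_df", class(x))
  x
}

# Made-up raw data with non-standard column names.
raw <- data.frame(
  user = c("p1", "p1", "p2"),
  link = c("https://example.com/a",
           "https://example.org/b",
           "https://example.com/c"),
  time = c("2023-01-01 10:00:00",
           "2023-01-01 10:05:30",
           "2023-01-02 08:15:00"),
  stringsAsFactors = FALSE
)

visits <- to_tracking_df(
  raw,
  varnames = c(panelist_id = "user", url = "link", timestamp = "time")
)
str(visits)
```

Calling `to_tracking_df(raw)` without the `varnames` mapping stops with an error, because none of the three required variables can be found under their default names.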
## Preprocessing

Several other variables can be derived from the raw data with the following functions:

- `add_duration()` adds a variable called `duration` based on the sequence of timestamps. The duration of a visit is set to the time difference to the subsequent visit, unless this difference exceeds a certain value (defined by the argument `cutoff`), in which case the duration is replaced by `NA` or a user-defined value (defined by `replace_by`).
- `add_session()` adds a variable called `session`, which groups subsequent visits into a session until the difference to the next visit exceeds a certain value (defined by `cutoff`).
- `extract_host()`, `extract_domain()` and `extract_path()` extract the host, domain and path of the raw URL and add variables named accordingly. See the function descriptions for definitions of these terms. `drop_query()` lets you drop the query and fragment components of the raw URL.
- `add_next_visit()` and `add_previous_visit()` add the next or the previous URL, domain, or host (defined by `level`) as a new variable.
- `add_referral()` adds a new variable indicating whether a visit was referred by a social media platform. It follows the logic of Schmidt et al. [(2023)](https://doi.org/10.31235/osf.io/cks68).
- `add_title()` downloads the title of a website (the text within the `