Title: | Preprocessing and Analyzing Web Tracking Data |
---|---|
Description: | Data structures and methods to work with web tracking data. The functions cover data preprocessing steps, enriching web tracking data with external information and methods for the analysis of digital behavior as used in several academic papers (e.g., Clemm von Hohenberg et al., 2023 <doi:10.17605/OSF.IO/M3U9P>; Stier et al., 2022 <doi:10.1017/S0003055421001222>). |
Authors: | David Schoch [aut, cre], Bernhard Clemm von Hohenberg [aut], Frank Mangold [aut], Sebastian Stier [aut] |
Maintainer: | David Schoch <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.3.1.9000 |
Built: | 2024-11-22 05:37:51 UTC |
Source: | https://github.com/schochastics/webtrackR |
add_duration() approximates the time spent on a visit based on the difference between two consecutive timestamps, replacing differences exceeding cutoff with the value defined in replace_by.
add_duration(wt, cutoff = 300, replace_by = NA, last_replace_by = NA, device_switch_na = FALSE, device_var = NULL)
wt | webtrack data object. |
cutoff | numeric (seconds). If duration is greater than this value, it is reset to the value defined by replace_by. |
replace_by | numeric. Determines whether differences greater than the cutoff are set to NA (the default) or to another value. |
last_replace_by | numeric. Determines whether the last visit for an individual is set to NA (the default) or to another value. |
device_switch_na | boolean. Relevant only when data was collected from multiple devices. When visits are ordered by timestamp sequence, two consecutive visits can come from different devices, which makes the timestamp difference less likely to be the true duration. It may be preferable to set the duration of the visit to NA (TRUE) rather than to the difference to the next timestamp (FALSE). Defaults to FALSE. |
device_var | character. Column indicating device. Required if device_switch_na is set to TRUE. Defaults to NULL. |
webtrack data.frame with the same columns as wt and a new column called duration.
## Not run:
data("testdt_tracking")
wt <- as.wt_dt(testdt_tracking)
wt <- add_duration(wt)
# Defining cutoff at 10 minutes, replacing those exceeding cutoff to 5 minutes,
# and setting duration before device switch to `NA`:
wt <- add_duration(wt,
  cutoff = 600, replace_by = 300,
  device_switch_na = TRUE, device_var = "device"
)
## End(Not run)
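The approximation described for add_duration() can be sketched in base R. This is an illustration under stated assumptions, not the package's implementation: the toy data frame visits and the helper approx_duration() are hypothetical, and the sketch handles a single panelist on a single device only.

```r
# Toy data: one panelist, four visits (hypothetical values).
visits <- data.frame(
  panelist_id = "p1",
  timestamp = as.POSIXct(
    c("2024-01-01 10:00:00", "2024-01-01 10:00:30",
      "2024-01-01 10:20:30", "2024-01-01 10:21:00"),
    tz = "UTC"
  )
)

# Hypothetical helper: duration = seconds until the next timestamp.
approx_duration <- function(ts, cutoff = 300, replace_by = NA,
                            last_replace_by = NA) {
  d <- diff(as.numeric(ts))   # seconds between consecutive visits
  d[d > cutoff] <- replace_by # differences above the cutoff are replaced
  c(d, last_replace_by)       # the last visit has no successor
}

visits$duration <- approx_duration(visits$timestamp)
capped <- approx_duration(visits$timestamp, cutoff = 600, replace_by = 300)
```

With the defaults, the 20-minute gap becomes NA; with cutoff = 600 and replace_by = 300 it is set to 300 seconds instead.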
add_next_visit() adds the subsequent visit, as determined by order of timestamps, as a new column. The next visit can be added as either the full URL, the extracted host or the extracted domain, depending on level.
add_next_visit(wt, level = "url")
wt | webtrack data object. |
level | character. Either "url", "host" or "domain". Defaults to "url". |
webtrack data.frame with the same columns as wt and a new column called url_next, host_next or domain_next.
## Not run:
data("testdt_tracking")
wt <- as.wt_dt(testdt_tracking)
# Adding next full URL as new column
wt <- add_next_visit(wt, level = "url")
# Adding next host as new column
wt <- add_next_visit(wt, level = "host")
# Adding next domain as new column
wt <- add_next_visit(wt, level = "domain")
## End(Not run)
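Conceptually, adding the next visit amounts to shifting the URL column by one position within each panelist. A minimal base R sketch follows; the data frame visits and the helper lead_one() are hypothetical, the rows are assumed to be sorted by panelist and timestamp already, and the package may implement this differently.

```r
# Toy data, already ordered by panelist and time (hypothetical values).
visits <- data.frame(
  panelist_id = c("a", "a", "a", "b", "b"),
  url = c("u1", "u2", "u3", "u4", "u5"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: shift a vector one position forward, padding with NA.
lead_one <- function(x) c(x[-1], NA_character_)

# Apply the shift separately within each panelist.
visits$url_next <- ave(visits$url, visits$panelist_id, FUN = lead_one)
```

The last visit of each panelist has no successor and therefore gets NA.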
Adds information about panelists (e.g., from a survey) to the tracking data.
add_panelist_data(wt, data, cols = NULL, join_on = "panelist_id")
wt | webtrack data object. |
data | a data frame containing information about panelists. |
cols | character vector of columns to add. If NULL (the default), all columns are added. |
join_on | which columns to join on. Defaults to "panelist_id". |
webtrack object with the same columns and the columns from data specified in cols.
## Not run:
data("testdt_tracking")
data("testdt_survey_w")
wt <- as.wt_dt(testdt_tracking)
# add survey test data
add_panelist_data(wt, testdt_survey_w)
## End(Not run)
add_previous_visit() adds the previous visit, as determined by order of timestamps, as a new column. The previous visit can be added as either the full URL, the extracted host or the extracted domain, depending on level.
add_previous_visit(wt, level = "url")
wt | webtrack data object. |
level | character. Either "url", "host" or "domain". Defaults to "url". |
webtrack data.frame with the same columns as wt and a new column called url_previous, host_previous or domain_previous.
## Not run:
data("testdt_tracking")
wt <- as.wt_dt(testdt_tracking)
# Adding previous full URL as new column
wt <- add_previous_visit(wt, level = "url")
# Adding previous host as new column
wt <- add_previous_visit(wt, level = "host")
# Adding previous domain as new column
wt <- add_previous_visit(wt, level = "domain")
## End(Not run)
Identifies whether a visit was referred to from social media and adds it as a new column. See details for method.
add_referral(wt, platform_domains, patterns)
wt | webtrack data object. |
platform_domains | character. A vector of platform domains for which referrers should be identified. Order and length must correspond to the patterns argument. |
patterns | character. A vector of patterns for which referrers should be identified. Order and length must correspond to the platform_domains argument. |
To identify referrals, we rely on the method described as most valid in Schmidt et al.: when the domain of the preceding visit was the platform in question, and the query string of the visit's URL contains a certain pattern, we count it as a referred visit. For Facebook, the pattern has been identified by Schmidt et al. as 'fbclid=', although this can change in the future.
webtrack data.frame with the same columns as wt and a new column called referral, which takes on NA if no referral has been identified, or the name specified in platform_domains if a referral from that platform has been identified.
Schmidt, Felix, Frank Mangold, Sebastian Stier and Roberto Ulloa. "Facebook as an Avenue to News: A Comparison and Validation of Approaches to Identify Facebook Referrals". Working paper.
## Not run:
data("testdt_tracking")
wt <- as.wt_dt(testdt_tracking)
wt <- add_referral(wt, platform_domains = "facebook.com", patterns = "fbclid=")
wt <- add_referral(wt,
  platform_domains = c("facebook.com", "twitter.com"),
  patterns = c("fbclid=", "utm_source=twitter")
)
## End(Not run)
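The referral rule from the details (previous visit on the platform, plus a pattern in the current URL) can be sketched in base R as follows. The data frame visits and the helper flag_referral() are hypothetical, the domain column is assumed to be present already, and for simplicity the pattern is searched in the whole URL rather than only in the query string. This is not the package's implementation.

```r
# Toy data, ordered by time for a single panelist (hypothetical values).
visits <- data.frame(
  url = c(
    "https://www.facebook.com/feed",
    "https://news.example.org/story?fbclid=abc123",
    "https://news.example.org/other",
    "https://twitter.com/home",
    "https://blog.example.org/post?fbclid=xyz"
  ),
  domain = c("facebook.com", "example.org", "example.org",
             "twitter.com", "example.org"),
  stringsAsFactors = FALSE
)

# Domain of the preceding visit (NA for the first visit).
domain_previous <- c(NA_character_, visits$domain[-nrow(visits)])

# Hypothetical helper: a visit counts as referred if the previous visit was
# on the platform AND the current URL contains the pattern.
flag_referral <- function(url, domain_previous, platform_domain, pattern) {
  hit <- !is.na(domain_previous) &
    domain_previous == platform_domain &
    grepl(pattern, url, fixed = TRUE)
  ifelse(hit, platform_domain, NA_character_)
}

visits$referral <- flag_referral(
  visits$url, domain_previous, "facebook.com", "fbclid="
)
```

Only the second visit is flagged: the fifth URL also carries 'fbclid=', but the visit before it was not on facebook.com.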
add_session() groups visits into "sessions", defining a session to end when the difference between two consecutive timestamps exceeds a cutoff.
add_session(wt, cutoff)
wt | webtrack data object. |
cutoff | numeric (seconds). If the difference between two consecutive timestamps exceeds this value, a new browsing session is defined. |
webtrack data.frame with the same columns as wt and a new column called session.
## Not run:
data("testdt_tracking")
wt <- as.wt_dt(testdt_tracking)
# Setting cutoff to 30 minutes
wt <- add_session(wt, cutoff = 1800)
## End(Not run)
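The session rule can be illustrated with a cumulative sum over "gap exceeds cutoff" flags. The sketch below uses base R, a hypothetical helper assign_sessions() and made-up timestamps for a single panelist; it is not the package's code.

```r
# Five visits of one panelist (hypothetical values).
timestamps <- as.POSIXct(
  c("2024-01-01 10:00:00", "2024-01-01 10:05:00", "2024-01-01 11:00:00",
    "2024-01-01 11:10:00", "2024-01-01 13:00:00"),
  tz = "UTC"
)

# Hypothetical helper: a new session starts at the first visit and whenever
# the gap to the previous visit exceeds the cutoff (in seconds).
assign_sessions <- function(ts, cutoff) {
  gap <- diff(as.numeric(ts))
  cumsum(c(TRUE, gap > cutoff))
}

session <- assign_sessions(timestamps, cutoff = 1800)
```

With a 30-minute cutoff, the five visits fall into sessions 1, 1, 2, 2 and 3.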
Gets the title of a URL by accessing the web address online and adds the title as a new column. See details for the meaning of "title". You need an internet connection to run this function.
add_title(wt, lang = "en-US,en-GB,en")
wt | webtrack data object. |
lang | character (a language tag). Language accepted by the request. Defaults to "en-US,en-GB,en". |
The title of a website (the text within the <title> tag of a website's <head>) is the text that is shown on the "tab" when looking at the website in a browser. It can contain useful information about a URL's content and can be used, for example, for classification purposes. Note that it may take a while to run this function for a large number of URLs.
webtrack data.frame with the same columns as wt and a new column called "title", which will be NA if the title cannot be retrieved.
## Not run:
data("testdt_tracking")
wt <- as.wt_dt(testdt_tracking)[1:2]
# Get titles with `lang` set to default English
wt_titles <- add_title(wt)
# Get titles with `lang` set to German
wt_titles <- add_title(wt, lang = "de")
## End(Not run)
Symmetric Atkinson Index. Calculates the symmetric Atkinson index.
atkinson_index(grp_a, grp_b)
grp_a | vector (usually corresponds to a column in a webtrack data frame) indicating the number of individuals of group A using a website |
grp_b | vector (usually corresponds to a column in a webtrack data frame) indicating the number of individuals of group B using a website |
Frankel, David, and Oscar Volij. "Scale Invariant Measures of Segregation." Working Paper, 2008.
# perfect score
grp_a <- c(5, 5, 0, 0)
grp_b <- c(0, 0, 5, 5)
atkinson_index(grp_a, grp_b)
grp_a <- c(5, 5, 5, 5)
grp_b <- c(5, 5, 5, 5)
atkinson_index(grp_a, grp_b)
Bakshy Top500. Ideological alignment of 500 domains based on Facebook data.
bakshy
An object of class data.table (inherits from data.frame) with 500 rows and 7 columns.
Bakshy, Eytan, Solomon Messing, and Lada A. Adamic. "Exposure to ideologically diverse news and opinion on Facebook." Science 348.6239 (2015): 1130-1132.
classify_visits() categorizes visits by either extracting the visit URL's domain or host and matching them to a list of domains or hosts; or by matching a list of regular expressions against the visit URL.
classify_visits(wt, classes, match_by = "domain", regex_on = NULL, return_rows_by = NULL, return_rows_val = NULL)
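wt | webtrack data object. |
classes | a data frame containing classes that can be matched to visits. |
match_by | character. Whether to match list entries from classes to the domain of a visit ("domain"), to its host ("host"), or as regular expressions against the visit URL ("regex"). Defaults to "domain". |
regex_on | character. Column in classes that contains the patterns to match when match_by is set to "regex". Defaults to NULL. |
return_rows_by | character. A column in classes by which to subset the returned rows. Defaults to NULL. |
return_rows_val | character. The value of the column specified in return_rows_by for which rows should be returned. Defaults to NULL. |
webtrack data.frame with the same columns as wt and any column in classes except the column specified by match_by.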
## Not run:
data("testdt_tracking")
data("domain_list")
wt <- as.wt_dt(testdt_tracking)
# classify visits via domain
wt_domains <- extract_domain(wt)
wt_classes <- classify_visits(wt_domains, classes = domain_list, match_by = "domain")
# classify visits via host
# for the example, just copying the "domain" column to "host"
domain_list$host <- domain_list$domain
wt_hosts <- extract_host(wt)
wt_classes <- classify_visits(wt_hosts, classes = domain_list, match_by = "host")
# classify visits with pattern matching
# for the example, any value in "domain" treated as pattern
data("domain_list")
regex_list <- domain_list[type == "facebook"]
wt_classes <- classify_visits(wt[1:5000],
  classes = regex_list,
  match_by = "regex", regex_on = "domain"
)
# classify visits via domain and only return class "search"
data("domain_list")
wt_classes <- classify_visits(wt_domains,
  classes = domain_list,
  match_by = "domain", return_rows_by = "type",
  return_rows_val = "search"
)
## End(Not run)
Create a URL dummy variable
create_urldummy(wt, dummy, name)
wt | webtrack data object. |
dummy | a vector of URLs that should be dummy coded. |
name | name of dummy variable to create. |
webtrack object with the same columns and a new column, named as given in name, that contains the dummy variable.
## Not run:
data("testdt_tracking")
wt <- as.wt_dt(testdt_tracking)
wt <- extract_domain(wt)
code_urls <- "https://dkr1.ssisurveys.com/tzktsxomta"
create_urldummy(wt, dummy = code_urls, name = "test_dummy")
## End(Not run)
deduplicate() flags, drops or aggregates duplicates, which are defined as consecutive visits to the same URL within a certain time frame.
deduplicate(wt, method = "aggregate", within = 1, duration_var = "duration", keep_nvisits = FALSE, same_day = TRUE, add_grpvars = NULL)
wt | webtrack data object. |
method | character. One of "aggregate", "flag" or "drop". Defaults to "aggregate". |
within | numeric (seconds). If |
duration_var | character. Name of duration variable. Defaults to "duration". |
keep_nvisits | boolean. If method set to |
same_day | boolean. If method set to |
add_grpvars | vector. If method set to |
webtrack data.frame with the same columns as wt with updated duration.
## Not run:
data("testdt_tracking")
wt <- as.wt_dt(testdt_tracking)
wt <- add_duration(wt, cutoff = 300, replace_by = 300)
# Dropping duplicates with one-second default
wt_dedup <- deduplicate(wt, method = "drop")
# Flagging duplicates with one-second default
wt_dedup <- deduplicate(wt, method = "flag")
# Aggregating duplicates
wt_dedup <- deduplicate(wt[1:1000], method = "aggregate")
# Aggregating duplicates and keeping number of visits for aggregated visits
wt_dedup <- deduplicate(wt[1:1000], method = "aggregate", keep_nvisits = TRUE)
# Aggregating duplicates and keeping "domain" variable despite grouping
wt <- extract_domain(wt)
wt_dedup <- deduplicate(wt, method = "aggregate", add_grpvars = "domain")
## End(Not run)
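The notion of a duplicate used here (a consecutive visit to the same URL within a short time frame) can be sketched as a flagging step in base R. The data frame visits and the helper flag_duplicates() are hypothetical, the comparison "less than or equal to within" is an assumption, and the sketch covers a single panelist only; it is not the package's implementation.

```r
# Toy data for one panelist; offsets are seconds after 10:00:00 (hypothetical).
visits <- data.frame(
  url = c("a.com/x", "a.com/x", "a.com/x", "b.com/y", "a.com/x"),
  timestamp = as.POSIXct("2024-01-01 10:00:00", tz = "UTC") +
    c(0, 1, 10, 11, 12),
  stringsAsFactors = FALSE
)

# Hypothetical helper: flag a visit if it has the same URL as the visit
# directly before it AND follows it within `within` seconds (assumed <=).
flag_duplicates <- function(url, ts, within = 1) {
  same_url <- c(FALSE, url[-1] == url[-length(url)])
  close_in_time <- c(FALSE, diff(as.numeric(ts)) <= within)
  same_url & close_in_time
}

visits$duplicate <- flag_duplicates(visits$url, visits$timestamp)
```

Only the second row is flagged: the third row repeats the URL but arrives nine seconds later, and the last row follows a visit to a different URL.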
The Dissimilarity Index can be interpreted as the share of Group A visits that would need to be redistributed across media for the share of group A to be uniform across websites.
dissimilarity_index(grp_a, grp_b)
grp_a | vector (usually corresponds to a column in a webtrack data frame) indicating the number of individuals of group A using a website |
grp_b | vector (usually corresponds to a column in a webtrack data frame) indicating the number of individuals of group B using a website |
Cutler, David M., Edward L. Glaeser, and Jacob L. Vigdor. "The rise and decline of the American ghetto." Journal of political economy 107.3 (1999): 455-506.
# perfect dissimilarity
grp_a <- c(5, 5, 0, 0)
grp_b <- c(0, 0, 5, 5)
dissimilarity_index(grp_a, grp_b)
# no dissimilarity
grp_a <- c(5, 5, 5, 5)
grp_b <- c(5, 5, 5, 5)
dissimilarity_index(grp_a, grp_b)
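For orientation, the textbook form of the dissimilarity index is half the sum of absolute differences between the two groups' shares per website, D = 0.5 * sum(|a_i / A - b_i / B|). The base R sketch below implements that standard formula under the hypothetical name dissimilarity_sketch(); whether the package computes it in exactly this way is an assumption.

```r
# Hypothetical helper implementing the textbook dissimilarity index.
dissimilarity_sketch <- function(grp_a, grp_b) {
  share_a <- grp_a / sum(grp_a) # share of group A's visitors on each site
  share_b <- grp_b / sum(grp_b) # share of group B's visitors on each site
  0.5 * sum(abs(share_a - share_b))
}

d_perfect <- dissimilarity_sketch(c(5, 5, 0, 0), c(0, 0, 5, 5))
d_none <- dissimilarity_sketch(c(5, 5, 5, 5), c(5, 5, 5, 5))
```

The two calls mirror the examples above: fully separated audiences give 1, identical audiences give 0.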
Domain list. Classification of domains into news, portals, search, and social media.
domain_list
An object of class data.table (inherits from data.frame) with 663 rows and 2 columns.
Stier, S., Mangold, F., Scharkow, M., & Breuer, J. (2022). Post Post-Broadcast Democracy? News Exposure in the Age of Online Intermediaries. American Political Science Review, 116(2), 768-774.
drop_query() adds the URL without query and fragment as a new column. The query is defined as the part following a "?" after the path. The fragment is anything following a "#" after the query.
drop_query(wt, varname = "url")
wt | webtrack data object. |
varname | character. Name of the column containing the URLs. Defaults to "url". |
webtrack data.frame with the same columns as wt and a new column called '<varname>_noquery'
## Not run:
data("testdt_tracking")
wt <- as.wt_dt(testdt_tracking)
# Extract URL without query/fragment
wt <- drop_query(wt)
## End(Not run)
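The definition of query and fragment translates into a single regular expression: everything from the first "?" or "#" onward is removed. A minimal base R sketch, with the hypothetical helper strip_query() and made-up URLs (the package may use a URL parser instead):

```r
# Made-up URLs with query, fragment, both, or neither.
urls <- c(
  "https://example.org/path?x=1&y=2#top",
  "https://example.org/path#section",
  "https://example.org/path"
)

# Hypothetical helper: drop everything from the first "?" or "#" onward.
strip_query <- function(url) sub("[?#].*$", "", url)

urls_noquery <- strip_query(urls)
```

All three inputs reduce to "https://example.org/path".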
extract_domain() adds the domain of a URL as a new column. By "domain", we mean the "top private domain", i.e., the domain under the public suffix (e.g., "com") as defined by the Public Suffix List. See details.
Extracts the domain from URLs.
extract_domain(wt, varname = "url")
wt | webtrack data object. |
varname | character. Name of the column from which to extract the domain. Defaults to "url". |
We define a "web domain" in the common colloquial meaning, that is, the part of a web address that identifies the person or organization in control. An example of a domain in this sense is google.com. More technically, what we mean by "domain" is the "top private domain", i.e., the domain under the public suffix, as defined by the Public Suffix List. Note that this definition sometimes leads to counterintuitive results because not all public suffixes are "registry suffixes". That is, they are not controlled by a domain name registrar, but allow users to directly register a domain. One example of such a public, non-registry suffix is blogspot.com. For a URL like www.mysite.blogspot.com, our function, and indeed the packages we are aware of, would extract the domain as mysite.blogspot.com, although you might think of blogspot.com as the domain.
For details, see here
webtrack data.frame with the same columns as wt and a new column called 'domain' (or, if varname not equal to 'url', '<varname>_domain')
## Not run:
data("testdt_tracking")
wt <- as.wt_dt(testdt_tracking)
# Extract domain
wt <- extract_domain(wt)
## End(Not run)
extract_host() adds the host of a URL as a new column. The host is defined as the part following the scheme (e.g., "https://") and preceding the subdirectory (anything following the next "/"). Note that for URL entries like chrome-extension://something or http://192.168.0.1/something, the result will be set to NA.
extract_host(wt, varname = "url")
wt | webtrack data object. |
varname | character. Name of the column from which to extract the host. Defaults to "url". |
webtrack data.frame with the same columns as wt and a new column called 'host' (or, if varname not equal to 'url', '<varname>_host')
## Not run:
data("testdt_tracking")
wt <- as.wt_dt(testdt_tracking)
# Extract host
wt <- extract_host(wt)
## End(Not run)
extract_path() adds the path of a URL as a new column. The path is defined as the part following the host but not including a query (anything after a "?") or a fragment (anything after a "#").
extract_path(wt, varname = "url", decode = TRUE)
wt | webtrack data object. |
varname | character. Name of the column from which to extract the path. Defaults to "url". |
decode | logical. Whether to decode the path (see |
webtrack data.frame with the same columns as wt and a new column called 'path' (or, if varname not equal to 'url', '<varname>_path')
## Not run:
data("testdt_tracking")
wt <- as.wt_dt(testdt_tracking)
# Extract path
wt <- extract_path(wt)
## End(Not run)
Small fake web tracking data for testing purposes
fake_tracking
An object of class data.frame with 500 rows and 3 columns.
Given two groups (A and B) of individuals, the isolation index captures the extent to which group A disproportionately visits websites whose other visitors are also members of group A.
isolation_index(grp_a, grp_b, adjusted = FALSE)
grp_a | vector (usually corresponds to a column in a webtrack data frame) indicating the number of individuals of group A using a website |
grp_b | vector (usually corresponds to a column in a webtrack data frame) indicating the number of individuals of group B using a website |
adjusted | logical. Should the index be adjusted? Defaults to FALSE. |
A value of 1 indicates that the websites visited by group A and group B do not overlap. A value of 0 means both visit exactly the same websites.
numeric value between 0 and 1. 0 indicates no isolation and 1 perfect isolation.
Cutler, David M., Edward L. Glaeser, and Jacob L. Vigdor. "The rise and decline of the American ghetto." Journal of political economy 107.3 (1999): 455-506.
Gentzkow, Matthew, and Jesse M. Shapiro. "Ideological segregation online and offline." The Quarterly Journal of Economics 126.4 (2011): 1799-1839.
# perfect isolation
grp_a <- c(5, 5, 0, 0)
grp_b <- c(0, 0, 5, 5)
isolation_index(grp_a, grp_b)
# perfect overlap
grp_a <- c(5, 5, 5, 5)
grp_b <- c(5, 5, 5, 5)
isolation_index(grp_a, grp_b)
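A common unadjusted form of the isolation index, following Gentzkow and Shapiro (2011), is the average group-A share of a site's audience as experienced by group A, minus the same average as experienced by group B. The base R sketch below implements that form under the hypothetical name isolation_sketch(); it assumes every website has at least one visitor and may differ from the package's computation, in particular for adjusted = TRUE.

```r
# Hypothetical helper: unadjusted isolation index.
isolation_sketch <- function(grp_a, grp_b) {
  # Share of group A among all visitors of each website.
  share_a <- grp_a / (grp_a + grp_b)
  # Exposure of group A to group A, minus exposure of group B to group A.
  sum(grp_a / sum(grp_a) * share_a) - sum(grp_b / sum(grp_b) * share_a)
}

i_perfect <- isolation_sketch(c(5, 5, 0, 0), c(0, 0, 5, 5))
i_overlap <- isolation_sketch(c(5, 5, 5, 5), c(5, 5, 5, 5))
```

For the two example inputs above, this form yields 1 (no overlap) and 0 (identical audiences), in line with the value range described.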
Classification of domains into different news types
news_types
An object of class data.table (inherits from data.frame) with 690 rows and 2 columns.
Stier, S., Mangold, F., Scharkow, M., & Breuer, J. (2022). Post Post-Broadcast Democracy? News Exposure in the Age of Online Intermediaries. American Political Science Review, 116(2), 768-774.
parse_path() parses parts of a path, i.e., anything separated by "/", "-", "_" or ".", and adds them as a new variable. Parts that do not consist of letters only, or of a real word, can be filtered via the argument keep.
parse_path(wt, varname = "url", keep = "letters_only", decode = TRUE)
wt | webtrack data object. |
varname | character. Name of the column containing the URLs. Defaults to "url". |
keep | character. Defines which types of path components to keep. If set to |
decode | logical. Whether to decode the path (see |
webtrack data.frame with the same columns as wt and a new column called 'path_split' (or, if varname not equal to 'url', '<varname>_path_split') containing parts as a comma-separated string.
## Not run:
data("testdt_tracking")
wt <- as.wt_dt(testdt_tracking)
wt <- parse_path(wt)
## End(Not run)
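The splitting and filtering step can be illustrated on a single path in base R. The helper split_path() and the example path are hypothetical, only the letters-only filter is shown, and URL decoding is left out; the package's own parsing may differ.

```r
# Hypothetical helper: split a path on "/", "-", "_" and ".", keep the parts
# that consist of letters only, and return them as a comma-separated string.
split_path <- function(path) {
  parts <- strsplit(path, "[/._-]")[[1]]
  parts <- parts[nzchar(parts)] # drop empty pieces, e.g. before a leading "/"
  letters_only <- grepl("^[A-Za-z]+$", parts)
  paste(parts[letters_only], collapse = ",")
}

path_split <- split_path("/news/2024/world-cup_final.html")
```

The numeric part "2024" is filtered out, leaving "news,world,cup,final,html".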
Print web tracking data
## S3 method for class 'wt_dt' print(x, ...)
x | object of class wt_dt |
... | additional parameters for print |
No return value, called for side effects
sum_activity() counts the number of active time periods (i.e., days, weeks, months, years, or waves) by panelist_id. A period counts as "active" if the panelist provided at least one visit for that period.
sum_activity(wt, timeframe = "date")
wt | webtrack data object. |
timeframe | character. Indicates for what time frame to aggregate visits. Possible values are |
a data.frame with the column panelist_id and a column indicating the number of active time units.
## Not run:
data("testdt_tracking")
wt <- as.wt_dt(testdt_tracking)
# summarize activity by day
wt_sum <- sum_activity(wt, timeframe = "date")
## End(Not run)
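Counting active days amounts to counting distinct dates per panelist. A base R sketch with made-up data follows; the data frame visits and the result object active_days are hypothetical, and only the daily time frame is covered.

```r
# Toy data: p1 is active on two days, p2 on one (hypothetical values).
visits <- data.frame(
  panelist_id = c("p1", "p1", "p1", "p2", "p2"),
  timestamp = as.POSIXct(
    c("2024-01-01 09:00:00", "2024-01-01 18:00:00", "2024-01-03 12:00:00",
      "2024-01-02 08:00:00", "2024-01-02 08:05:00"),
    tz = "UTC"
  ),
  stringsAsFactors = FALSE
)

# Reduce each timestamp to its calendar date, then count distinct dates
# per panelist.
visits$date <- as.Date(visits$timestamp, tz = "UTC")
active_days <- vapply(
  split(visits$date, visits$panelist_id),
  function(d) length(unique(d)),
  integer(1)
)
```

Several visits on the same day count once, so p1 has two active days and p2 has one.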
sum_durations() summarizes the duration of visits by person within a timeframe, and optionally by visit_class of visit. Note:
If for a time frame all rows are NA on the duration column, the summarized duration for that time frame will be NA.
If only some of the rows of a time frame are NA on the duration column, the function will ignore those NA rows.
If there were no visits to a class (i.e., a value of the 'visit_class' column) for a time frame, the summarized duration for that time frame will be zero; if there were visits, but NA on duration, the summarized duration will be NA.
sum_durations(wt, var_duration = NULL, timeframe = NULL, visit_class = NULL)
wt | webtrack data object. |
var_duration | character. Name of the duration variable if already present. Defaults to NULL. |
timeframe | character. Indicates for what time frame to aggregate visit durations. Possible values are |
visit_class | character. Column that contains a classification of visits. For each value in this column, the output will have a column indicating the number of visits belonging to that value. Defaults to NULL. |
a data.frame with columns panelist_id, column indicating the time unit (unless timeframe set to NULL), duration_visits indicating the duration of visits (in seconds, or whatever the unit of the variable specified by var_duration parameter), and a column for each value of visit_class, if specified.
## Not run:
data("testdt_tracking")
wt <- as.wt_dt(testdt_tracking)
# summarize for whole period
wt_summ <- sum_durations(wt)
# summarize by week
wt_summ <- sum_durations(wt, timeframe = "week")
# create a class variable to summarize by class
wt <- extract_domain(wt)
wt$google <- ifelse(wt$domain == "google.com", 1, 0)
wt_summ <- sum_durations(wt, timeframe = "week", visit_class = "google")
## End(Not run)
sum_visits() summarizes the number of visits by person within a timeframe, and optionally by visit_class of visit.
sum_visits(wt, timeframe = NULL, visit_class = NULL)
wt | webtrack data object. |
timeframe | character. Indicates for what time frame to aggregate visits. Possible values are |
visit_class | character. Column that contains a classification of visits. For each value in this column, the output will have a column indicating the number of visits belonging to that value. Defaults to NULL. |
a data.frame with columns panelist_id, column indicating the time unit (unless timeframe set to NULL), n_visits indicating the number of visits, and a column for each value of visit_class, if specified.
## Not run:
data("testdt_tracking")
wt <- as.wt_dt(testdt_tracking)
# summarize for whole period
wt_summ <- sum_visits(wt)
# summarize by week
wt_summ <- sum_visits(wt, timeframe = "week")
# create a class variable to summarize by class
wt <- extract_domain(wt)
wt$google <- ifelse(wt$domain == "google.com", 1, 0)
wt_summ <- sum_visits(wt, timeframe = "week", visit_class = "google")
## End(Not run)
Summary function for web tracking data
## S3 method for class 'wt_dt' summary(object, ...)
object | object of class wt_dt |
... | additional parameters for summary |
No return value, called for side effects
Same randomly generated survey data, one row per person/wave (long format)
testdt_survey_l
An object of class tbl_df (inherits from tbl, data.frame) with 15 rows and 7 columns.
Randomly generated survey data only used for illustrative purposes (wide format)
testdt_survey_w
An object of class data.frame with 5 rows and 8 columns.
Sample of fully anonymized webtrack data from a research project with US participants
testdt_tracking
An object of class data.frame with 49612 rows and 5 columns.
vars_exist() checks if columns are present in a webtrack data object. By default, checks whether the data has a panelist_id, a url and a timestamp column.
vars_exist(wt, vars = c("panelist_id", "url", "timestamp"))
wt | webtrack data object. |
vars | character vector of variables. Defaults to c("panelist_id", "url", "timestamp"). |
A data.table object.
An S3 class to store web tracking data
Convert a data.frame containing web tracking data to a wt_dt object
as.wt_dt(x, timestamp_format = "%Y-%m-%d %H:%M:%OS", tz = "UTC", varnames = c(panelist_id = "panelist_id", url = "url", timestamp = "timestamp"))
is.wt_dt(x)
x | data.frame containing a necessary set of columns, namely panelist's ID, visit URL and visit timestamp. |
timestamp_format | string. Specifies the raw timestamp's formatting. Defaults to "%Y-%m-%d %H:%M:%OS". |
tz | time zone of the timestamps. Defaults to "UTC". |
varnames | Named vector of column names, which contain the panelist's ID ( |
A wt_dt table is a data.frame.
as.wt_dt(): a webtrack data object with at least the columns panelist_id, url and timestamp.
is.wt_dt(): logical. TRUE if x is a webtrack data object and FALSE otherwise.
data("testdt_tracking")
wt <- as.wt_dt(testdt_tracking)
is.wt_dt(wt)
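What the arguments varnames, timestamp_format and tz stand for can be illustrated with a base R sketch of the same kind of standardization: renaming the three required columns and parsing the raw timestamp strings. The data frame raw and its column names (user, page, time) are hypothetical, and the sketch does not create a wt_dt object or reproduce the package's internals.

```r
# Raw data with non-standard column names (hypothetical values).
raw <- data.frame(
  user = c("p1", "p2"),
  page = c("https://example.org/a", "https://example.org/b"),
  time = c("2024-01-01 10:00:00", "2024-01-01 10:00:30"),
  stringsAsFactors = FALSE
)

# Same shape as the varnames argument: standard name = name in the raw data.
varnames <- c(panelist_id = "user", url = "page", timestamp = "time")

# Select the three columns and give them their standard names.
std <- raw[, unname(varnames)]
names(std) <- names(varnames)

# Parse the timestamp strings with an explicit format and time zone.
std$timestamp <- as.POSIXct(
  std$timestamp,
  format = "%Y-%m-%d %H:%M:%OS", tz = "UTC"
)
```

After this step the timestamps are date-time values rather than strings, so differences between visits can be computed in seconds.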