Preparing to Analyze Your Data in R
If you'd like to follow along with this tutorial but don't have an R development environment set up, consider using RStudio Cloud, a free service from the RStudio team.
The majority of the code and directions on this page are now deprecated and have been integrated into the LAMP API. See 'Preparing to Analyze Your Data in Python' for up-to-date data analysis steps.
Connect to your LAMP server.
First, install the LAMP client library.
install.packages("devtools")
devtools::install_github('BIDMCDigitalPsychiatry/LAMP-r')
Then connect to the LAMP API. Your Researcher ID comes from the dashboard at dashboard.lamp.digital.
library(LAMP)
LAMP <- LAMP$new('https://api.lamp.digital', 'your_email_here@email.com', 'your_password_here')
researcher_id <- "YOUR_RESEARCHER_ID"
Download data from the LAMP server.
First, download all Participants and retrieve the ID mapping for all Activities in our Study.
library(dplyr)
library(anytime)
library(data.table)
participants <- LAMP$Participant$allByStudy(researcher_id) %>% pull(id)
activity_map <- LAMP$Activity$allByStudy(researcher_id) %>%
  dplyr::select(id, spec, name) %>%  # keep spec; the appendix lists it per activity
  rename(activity = id)
Now we run some pre-processing to flatten the nested data structure that we receive from the ActivityEvent API for each Participant, performed in a loop over all Participants in our Study. Because some entries may be null (missing or invalid), we ignore those and flatten the data after converting timestamps from the UNIX Epoch standard (in milliseconds) to the human-readable YYYY-MM-DD format. There’s a workaround for excessively-nested survey items as well.
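The timestamp conversion on its own can be tried in isolation. This is a minimal sketch that uses base R's as.POSIXct() in place of the anytime() call used in this tutorial; the millisecond value is a made-up example and the UTC time zone is an assumption chosen only to keep the result reproducible.

```r
# Hypothetical timestamp in milliseconds since the UNIX epoch.
ts_ms <- 1575806400000

# Divide by 1000 to get seconds, then convert to a date-time in UTC.
ts <- as.POSIXct(ts_ms / 1000, origin = "1970-01-01", tz = "UTC")

# Format as a human-readable YYYY-MM-DD string.
day <- format(ts, "%Y-%m-%d")
print(day)
```

Passing a different tz value (the loop in this tutorial uses "America/New_York") can shift the resulting calendar day for timestamps close to midnight.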
The below section of code is now deprecated and integrated into the LAMP API.
data_list <- list()
for (i in seq_along(participants)) {
  tmp <- LAMP$ActivityEvent$allByParticipant(participants[i])
  if (length(tmp) > 0) {
    # Attach activity names and convert millisecond timestamps to date-times.
    tmp <- right_join(activity_map %>% select(activity, name),
                      jsonlite::flatten(tmp) %>%
                        mutate(timestamp = anytime(as.numeric(timestamp)/1000, tz = "America/New_York")) %>%
                        rename(activity_duration = duration) %>%
                        mutate(id = participants[i]),
                      by = "activity")
    # Expand each event's nested temporal_events into one row per item.
    tmp_event_list <- list()
    for (j in seq_len(nrow(tmp))) {
      if (length(tmp[j,]$temporal_events[[1]]) > 0) {
        tmp_event_list[[j]] <- cbind(tmp[j,], tmp[j,]$temporal_events[[1]], row.names = FALSE) %>%
          dplyr::select(-temporal_events)
      }
    }
    data_list[[i]] <- rbindlist(tmp_event_list, fill = TRUE)
  }
}
# Drop the NULL slots left by participants that returned no events.
data_list[sapply(data_list, is.null)] <- NULL
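The final line of the loop code removes the empty slots left by Participants with no events. The sketch below reproduces that step on a toy list in base R; the list contents are made up for illustration.

```r
# Assigning to positions 1 and 3 leaves position 2 as NULL,
# just as a participant with no events would in the loop.
toy_list <- list()
toy_list[[1]] <- data.frame(a = 1)
toy_list[[3]] <- data.frame(a = 3)
length_before <- length(toy_list)

# Assigning NULL to the selected positions deletes them from the list.
toy_list[sapply(toy_list, is.null)] <- NULL
length_after <- length(toy_list)

print(c(before = length_before, after = length_after))
```

Filter(Negate(is.null), toy_list) is an equivalent base R spelling that returns a new list instead of modifying the existing one.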
Now, we combine the data frames we’ve pre-processed and ensure they follow the proper data format. Once we reorder the columns to include only variables of interest, we can call the head() function to preview the data frame. To view all available variables in your data frame, use colnames(results).
Some of the columns we’re selecting give meta-information about the Activity (such as whether it’s a survey or game), the individual survey question or game level results, and other optional game data (static_data, which may include NA values). To learn more about what each of these columns contains, represents, and can be used for, please see the help topic.
results <- rbindlist(data_list, fill = TRUE) %>%
  mutate(id = as.character(id)) %>%
  mutate(activity = as.character(activity)) %>%
  mutate(name = as.character(name))
head(results)
Code Output
id timestamp name item value
1 U1094374134 2019-12-08 PHQ-8 How often did you feel bad about yourself, or that you were a failure or let your family down? 1
2 U1094374134 2019-12-08 PHQ-8 How often did have you have trouble concentrating on things such as reading or watching tv? 1
3 U1094374134 2019-12-08 PHQ-8 How often did you find yourself moving so slowly, or so fidgety/restless, that others noticed? 1
4 U1094374134 2019-12-08 PHQ-8 How often did you have trouble falling or staying asleep, or sleep for more hours than you meant to? 1
5 U1094374134 2019-12-08 PHQ-8 How often did you feel tired or like you had little energy? 1
6 U1094374134 2019-12-08 PHQ-8 How often did you find yourself with no appetite, or eating more than you meant to? 1
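The column-selection step described above is not shown in the code. One way it might look is sketched below in base R on a toy data frame; the toy rows and the extra columns (activity, static_data) are placeholders, and the five kept columns are simply those that appear in the preview output.

```r
# Toy stand-in for the combined data frame (all values are made up).
toy_results <- data.frame(
  activity    = c("activity-1", "activity-1"),
  static_data = c(NA, NA),
  value       = c("1", "2"),
  item        = c("Question one", "Question two"),
  name        = c("PHQ-8", "PHQ-8"),
  timestamp   = as.Date(c("2019-12-08", "2019-12-08")),
  id          = c("U0000000001", "U0000000001")
)

# Keep only the variables of interest, in the order they should be displayed.
cols_of_interest <- c("id", "timestamp", "name", "item", "value")
toy_selected <- toy_results[, cols_of_interest]

head(toy_selected)
```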
Optional: Output the data to a CSV.
write.csv(results, "./output-1-27-20.csv", row.names = F)
Appendix: Sample Analyses
Check out all the activities in the study
This includes both default activities and custom activities.
activity_map %>% select(activity, spec, name)
Code Output
## activity spec name
## 1 QWN0aXZpdHk6MDoxMDM6MjM~ lamp.group Daily Survey Check-In
## 2 QWN0aXZpdHk6MToxMDM6MzIw lamp.survey PHQ-8
## 3 QWN0aXZpdHk6MToxMDM6MzQx lamp.survey Instructions
## 4 QWN0aXZpdHk6MToxMDM6MzQy lamp.survey GAD-7
## 5 QWN0aXZpdHk6MToxMDM6MzQz lamp.survey PIU-SF-6
## 6 QWN0aXZpdHk6MToxMDM6MzQ0 lamp.survey Warning
## 7 QWN0aXZpdHk6MToxMDM6MzQ1 lamp.survey Qualitative Digital Media Use Assessment
## 8 QWN0aXZpdHk6MToxMDM6MzQ2 lamp.survey Screen Time Use - iPhones only
## 9 QWN0aXZpdHk6MToxMDM6MzQ3 lamp.survey iPhone/Android Assessment
## 10 QWN0aXZpdHk6MjoxMDM6MA~~ lamp.nback N-Back
## 11 QWN0aXZpdHk6MzoxMDM6MA~~ lamp.trails_b Trails B
## 12 QWN0aXZpdHk6NDoxMDM6MA~~ lamp.spatial_span Spatial Span
## 13 QWN0aXZpdHk6NToxMDM6MA~~ lamp.simple_memory Simple Memory
## 14 QWN0aXZpdHk6NjoxMDM6MA~~ lamp.serial7s Serial 7s
## 15 QWN0aXZpdHk6NzoxMDM6MA~~ lamp.cats_and_dogs Cats and Dogs
## 16 QWN0aXZpdHk6ODoxMDM6MA~~ lamp.3d_figure_copy 3D Figure Copy
## 17 QWN0aXZpdHk6OToxMDM6MA~~ lamp.visual_association Visual Association
## 18 QWN0aXZpdHk6MTA6MTAzOjA~ lamp.digit_span Digit Span
## 19 QWN0aXZpdHk6MTE6MTAzOjA~ lamp.cats_and_dogs_new Cats and Dogs New
## 20 QWN0aXZpdHk6MTI6MTAzOjA~ lamp.temporal_order Temporal Order
## 21 QWN0aXZpdHk6MTM6MTAzOjA~ lamp.nback_new N-Back New
## 22 QWN0aXZpdHk6MTQ6MTAzOjA~ lamp.trails_b_new Trails B New
## 23 QWN0aXZpdHk6MTU6MTAzOjA~ lamp.trails_b_dot_touch Trails B Dot Touch
## 24 QWN0aXZpdHk6MTY6MTAzOjA~ lamp.jewels_a Jewels Trails A
## 25 QWN0aXZpdHk6MTc6MTAzOjA~ lamp.jewels_b Jewels Trails B
## 26 QWN0aXZpdHk6MTg6MTAzOjA~ lamp.scratch_image Scratch Image
## 27 QWN0aXZpdHk6MTk6MTAzOjA~ lamp.spin_wheel Spin Wheel
Get number of participants in the study
print(paste("Number of Participants:", length(participants)))
Code Output
## [1] "Number of Participants: 29"
Get engagement and plot activity histogram
engagement_data <- inner_join(results %>% dplyr::count(id),
                              results %>% group_by(id) %>% filter(row_number()==n()) %>% select(id, timestamp),
                              by="id") %>%
  rename(activities.completed = n) %>% rename(most.recent.activity = timestamp)
engagement_data
Code Output
## # A tibble: 27 x 3
## id activities.completed most.recent.activity
## <chr> <int> <dttm>
## 1 U1005979819 659 2019-11-16 22:50:57
## 2 U1094374134 2085 2019-08-14 12:23:15
## 3 U1126469507 503 2019-12-10 21:52:20
## 4 U1232915366 226 2020-01-29 00:51:34
## 5 U1235780769 767 2019-12-09 20:08:02
## 6 U1367615199 813 2019-11-13 16:11:09
## 7 U1500960001 742 2019-10-25 21:26:24
## 8 U1680931766 1070 2019-12-13 22:41:58
## 9 U176381486 585 2019-11-08 18:45:24
## 10 U2127860149 170 2019-11-13 16:51:52
## # … with 17 more rows
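The engagement table pairs a per-participant row count with the timestamp of each participant's last row, which matches the most recent activity only if rows are in chronological order. The base R sketch below uses made-up, deliberately unsorted numeric timestamps to show the count and an order-independent alternative that takes the maximum timestamp.

```r
# Toy events: timestamps are epoch seconds, deliberately out of order.
toy <- data.frame(
  id        = c("A", "B", "A", "A"),
  timestamp = c(300, 100, 200, 50)
)

# Number of rows (completed activity items) per participant.
activities_completed <- table(toy$id)

# Most recent timestamp per participant, regardless of row order.
most_recent <- sapply(split(toy$timestamp, toy$id), max)

print(activities_completed)
print(most_recent)
```

In this toy data the last row for participant A has timestamp 50, while the maximum is 300, so sorting by timestamp first (or taking the maximum) matters when the input order is not guaranteed.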
Let's view a histogram representation of the data.
hist(engagement_data$activities.completed, breaks=15)
Code Output
![../Topics/Preparing to analyze your data in R/Untitled.png](../Topics/Preparing to analyze your data in R/Untitled.png)
Get mean and standard deviation for any survey scale
We’ll include only anxiety survey (GAD-7) results and parse/convert their answer data to numbers using the readr library. Then, we aggregate by timestamp (which is unique to each Activity), and summarize the table.
library(readr)
data <- results %>%
  filter(name == "GAD-7") %>%
  mutate(value = as.numeric(parse_number(as.character(value)))) %>%
  group_by(id,timestamp) %>%
  summarise(average_GAD = mean(value))
summary(data$average_GAD)
Code Output
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.4286 0.8571 0.9139 1.4286 3.0000
print(paste("Mean GAD-7 across all participants:", round(mean(data$average_GAD), 3), "±", round(sd(data$average_GAD), 3)))
Code Output
## [1] "Mean GAD-7 across all participants: 0.914 ± 0.667"
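The per-session averaging can be checked by hand on a small table. This is a minimal sketch in base R, where aggregate() stands in for the group_by()/summarise() pair used above; the toy data frame and its values are made up.

```r
# Two survey sessions, each with two item scores.
toy <- data.frame(
  id        = c("A", "A", "B", "B"),
  timestamp = c(1, 1, 2, 2),
  value     = c(0, 2, 1, 3)
)

# One mean score per (participant, session) pair.
per_session <- aggregate(value ~ id + timestamp, data = toy, FUN = mean)

# Mean and standard deviation across the session averages.
overall_mean <- mean(per_session$value)
overall_sd   <- sd(per_session$value)

print(paste("Mean:", round(overall_mean, 3), "±", round(overall_sd, 3)))
```

The standard deviation here is computed across session averages, not across individual item scores, which mirrors the calculation above.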
Plot survey scores over time for an individual participant
Filter the first Participant’s mood (PHQ-8) results and parse the strings into numbers as they may be either numeric or text. Then, we aggregate by timestamp, which is unique for each Activity, and take the mean of all scores for a given timestamp.
data <- results %>%
  filter(id == participants[1] & name == "PHQ-8") %>%
  mutate(value = as.numeric(as.character(value))) %>%
  group_by(timestamp) %>%
  summarise(average_score = mean(value))
Plot the now-filtered data using the ggplot2 library; you’ll find our sample graph below.
library(ggplot2)
ggplot(data, aes(x = timestamp, y = average_score)) +
  geom_line(size=1, color="steelblue") +  # on ggplot2 >= 3.4.0, use linewidth = 1 instead of size
  theme_minimal(base_size = 15) +
  ylim(0,4) +
  theme(
    panel.grid.major = element_blank(),
    panel.background = element_blank(),
    axis.line = element_line(colour = "black"),
    axis.title.x = element_text(size = 15))
Compare self-reported anxiety with self-reported problematic internet use
First, we filter the data and use a left join (which is like a merge operation) between the GAD-7 and PIU-SF-6 survey instruments, such that each participant is a single row. All answer data is converted to numeric values first; we then aggregate by participant ID and take the mean of all scores for each participant.
library(tidyr)
library(Hmisc)
data <- left_join(results %>%
                    filter(name == "GAD-7") %>%
                    mutate(value = as.numeric(parse_number(as.character(value)))) %>%
                    group_by(id) %>%
                    summarise(average_GAD = mean(value)),
                  results %>%
                    filter(name == "PIU-SF-6") %>%
                    mutate(value = as.numeric(parse_number(as.character(value)))) %>%
                    group_by(id) %>%
                    summarise(average_PIU = mean(value)),
                  by="id")
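To see what the left join does to participants who completed only one of the two surveys, here is a minimal base R sketch. It uses merge(..., all.x = TRUE) as a stand-in for left_join() and complete.cases() as a stand-in for drop_na(); the participant IDs and scores are made up.

```r
# Participant C has GAD-7 data but no PIU-SF-6 data.
gad <- data.frame(id = c("A", "B", "C"), average_GAD = c(1, 2, 3))
piu <- data.frame(id = c("A", "B"), average_PIU = c(0.5, 1.5))

# Left join: keep every row of gad, filling missing PIU values with NA.
joined <- merge(gad, piu, by = "id", all.x = TRUE)

# Keep only participants with both scores, as needed for a regression.
complete <- joined[complete.cases(joined), ]

print(joined)
print(complete)
```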
Now, do a regression fit on the filtered data.
fit <- lm(average_PIU ~ average_GAD, data = drop_na(data))
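The plot title built later reads several quantities out of the fitted model. The sketch below shows those same accessors on a toy fit; the five data points are made up, and only base R's lm(), coef(), and summary() are used.

```r
# Toy data that is close to linear (values are made up).
toy <- data.frame(
  average_GAD = 1:5,
  average_PIU = c(3.1, 4.9, 7.2, 8.8, 11.0)
)
toy_fit <- lm(average_PIU ~ average_GAD, data = toy)

# Quantities typically reported alongside a regression line.
intercept <- coef(toy_fit)[[1]]
slope     <- coef(toy_fit)[[2]]
adj_r2    <- summary(toy_fit)$adj.r.squared
p_slope   <- summary(toy_fit)$coefficients[2, 4]  # row 2 = slope, column 4 = p-value

print(signif(c(intercept = intercept, slope = slope, adj_r2 = adj_r2, p = p_slope), 5))
```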
Plot the now-filtered data; you’ll find our sample graph below. (drop_na() removes rows with one or more NA values.)
ggplot(drop_na(data), aes(x = average_GAD, y = average_PIU)) +
  geom_point() +
  geom_smooth(method = 'lm', formula = y~x) +
  # From https://sejohnston.com/2012/08/09/a-quick-and-easy-function-to-plot-lm-results-in-r/
  labs(title = paste("Adj R2 = ",signif(summary(fit)$adj.r.squared, 5),
                     "Intercept =",signif(fit$coef[[1]],5 ),
                     " Slope =",signif(fit$coef[[2]], 5),
                     " P =",signif(summary(fit)$coef[2,4], 5)),
       x = "Average GAD-7",
       y = "Average PIU-SF-6") +
  theme_minimal(base_size = 12) +
  theme(
    panel.grid.major = element_blank(),
    panel.background = element_blank(),
    axis.line = element_line(colour = "black"),
    axis.title.x = element_text(size = 12))