7 Discussion
This dissertation set out to examine the extent to which the theoretical interpretation, granularity, and scope of digital trace data shape what can be validly inferred about digital media use and well-being. The motivation for this endeavor was rooted in a broader paradigm shift within the field: as researchers increasingly move from retrospective self-reports of screen time toward passively collected behavioral data, the promise of greater objectivity and precision has been accompanied by a set of underexamined validity challenges. While log data circumvent recall bias and social desirability effects, they do not eliminate measurement problems. Moreover, the choices researchers make in turning raw event logs into analyzable constructs, from preprocessing and feature construction to device selection and interpretation of behavioral indicators, shape the conclusions that follow.
Across a methods chapter, three empirical chapters, and a tool chapter, each addressing one or more sub-research questions, this dissertation drew on an intensive longitudinal dataset collected as part of project DISCONNECT, which combined experience sampling with passively logged smartphone and computer trace data.
First, the Methods chapter, Chapter 2, provided a detailed and explicit account of the trace-to-indicator pipeline underlying all empirical studies, documenting how decisions regarding data cleaning, event inference, time-window alignment, and feature construction are themselves sources of variation that shape the kind of theoretical claims that can be made. In doing so, the chapter aimed to contribute to ongoing discussions about validity, reproducibility, and transparency in digital trace data research.
Chapter 3 (“Connected Yet Cognitively Drained”) addressed the first sub-research question, namely whether passively logged behavioral data can correspond with psychologically meaningful states. The study examined the state of online vigilance, and its momentary and lagged relation to cognitive fatigue. The online vigilance construct was assessed both via self-report and using digital trace data. Surprisingly, the findings revealed that the behavioral smartphone features designed to capture online vigilant behavior showed only weak associations with self-reported online vigilance. At the same time, this self-reported online vigilance was a consistent predictor of mental fatigue, whereas associations with the behavioral measure were absent. The chapter concluded that subjective experience appears to be more consequential than observed behavior in explaining the downstream well-being implications of digital connectivity, thus showing the interpretative limits of treating trace data as stand-ins for psychological states.
Chapter 4 (“Always On, Always Rushed for Time”) shifted the focus to the second sub-research question, examining how granular and dynamic features of logged behavior relate to momentary well-being. By disaggregating smartphone use into four behavioral features (duration, frequency, fragmentation, and notifications) across four app categories (email, social media, chat, and work communication), the study demonstrated that category-specific and dynamic indicators revealed within-person associations with feeling rushed. Critically, these associations operated largely through a subjective mechanism, perceived juggling load, thereby also contributing to the first sub-research question by reinforcing that trace data require theoretical interpretation to become psychologically meaningful.
Chapter 5 (“Computers Matter”) addressed the third sub-research question by including computer log data alongside smartphone logs for the first time. The chapter showed that adding computer data substantially altered screen time estimates (on average by +136%), revealed frequent cross-device dynamics such as device switching (around 66 times per day) and overlapping use, and markedly changed the observed associations between digital media use and momentary and daily mood. Models using only smartphone data produced few and inconsistent associations, while computer and cross-device models showed more consistent patterns, particularly for work- and email-related activities. These findings demonstrate that measurement scope is not a minor methodological detail, but an important determinant of the inferences researchers can draw.
Chapter 6 (“Making the Impossible Possible”) complemented the empirical work by presenting a novel procedure for collecting granular screen time data from iOS devices – a platform that remained largely inaccessible to researchers due to Apple’s restrictive policies. This procedure also includes a tool we developed which does some of the “heavy lifting”: it parses the raw biome files into a usable format, queries the sync database for device identifiers, and creates combined data frames of Screen Time data per device which can be used for subsequent feature creation. The tool is offered in two deployment modes: researchers can either access a publicly hosted Marimo Notebook online, or run a containerized local instance through Docker. By providing both options, the tool accommodates varying institutional constraints: from researchers who need data to remain on local infrastructure for privacy or ethical reasons, to those who prefer the convenience of a browser-based workflow. Moreover, both the tool and all accompanying materials are shared openly via the Open Science Framework, so that the data processing pipeline becomes open and reusable by the broader research community.
Together, these chapters arrive at a central insight: the shift from self-reported to logged measures of digital behavior does not in itself resolve the measurement challenges facing digital well-being research. The validity of what we can conclude about digital media use and well-being depends not only on whether we use trace data, but on how we select and interpret behavioral indicators, at what level of granularity we operationalize them, and from which devices we collect them. The sections that follow discuss the implications of these findings, first for the use of log data in digital well-being research specifically, then for broader debates within the field, before turning to limitations and societal implications.
7.1 Contributions of the Use of Log Data in Digital Well-Being Research
7.1.1 Interpretative Gap
An overarching finding is that logged behavioral indicators, on their own, provide an incomplete account of the relevant cognitive-behavioral states they supposedly represent. Behavioral indicators capture when, how long, and how often someone interacted with a device, but not how these interactions are truly experienced by the user. More specifically:
In Chapter 3, we attempted to construct theory-driven behavioral features of online vigilance, namely monitoring sessions, notification reaction speed, and responsiveness, which mapped to two specific dimensions of the construct. Yet, these measures showed only weak associations with self-reported online vigilance and performed poorly in predicting mental fatigue, whereas self-reported online vigilance was a robust predictor.
Chapter 4 did present meaningful associations between smartphone use and feeling rushed, but mostly operating through a subjective mechanism. Specifically, the link between granular smartphone usage features and feeling rushed was mediated by perceived juggling load. The subjective appraisal of having too many tasks to manage across roles served as a bridge between behavioral data and well-being outcomes.
Combined, these two chapters illustrate an interpretative gap, whereby subjective experience appeared to trump behavior in explaining the implications of digital media use. This raises the question: were the specific features chosen as indicators of the underlying state poor behavioral representations, or should we accept that self-reported subjective experiences simply outweigh behavioral measures in explaining the implications of digital media use? A growing body of work indeed points towards subjective beliefs, perceptions and evaluations of digital media use as playing a central role in shaping well-being outcomes. For instance, Ernala et al. (2022) demonstrated subjective moderation, i.e., the same behavior related differently to well-being depending on users’ beliefs; Lee & Hancock (2024) showed that mindsets explained more variance in well-being than behavioral use measures, a finding that Parry & Coetzee (2025) replicated (albeit with some differences from the original work of Lee and Hancock).
On the latter reading, this gap is not an artefact of one particular operationalization or construct, but perhaps a more general feature of the relationship between logged behavior and subjective well-being. Therefore, the question for the broader field of digital well-being research is not whether to use either self-reported or behavioral data, but how to create research designs that integrate both in theoretically productive ways. Such a path forward is presented in the recent social constructivist framework of Wolfers (2024). This framework considers perceptions of media use not as biased proxies of objective behavior, but as conceptually distinct constructs that are themselves socially constructed and independently consequential for mental health. This dissertation’s empirical findings are broadly consistent with this emerging theoretical direction: well-being implications of digital behavior appear to be filtered through different subjective processes that are not reducible to the behavior itself.
The interpretative gap, however, operates not just at the level of the user; it also runs through the research process itself, as we have shown in our Methods chapter (Chapter 2). Each of the steps taken in this process involves the researcher’s judgement about what constitutes relevant behavior. This means that trace data occupy a position between objective and subjective. This dual subjectivity, in how users experience their behavior and how researchers construct behavioral indicators, makes it difficult to label trace data as (completely) objective. Rather than seeing these traces as neutral records, and self-reports as biased, the field should acknowledge that both data sources require interpretation, albeit of different kinds.
Apart from the necessity to document decisions fully (see below for a discussion of the implications in relation to Open Science), this interpretative gap leads to two broader implications. First, for research design: if subjective mechanisms are essential for understanding the well-being implications of digital behavior, then linkage designs combining trace data with ESM are not merely a nice-to-have, but more of a necessity. Studies relying on trace data alone risk producing findings that are perhaps precise in their behavioral description, but lacking in psychological explanation.
Second, for theory: the findings raise the question of what trace data then are indicators of. If they are not reliably capturing psychological states such as online vigilance, and if their associations with well-being depend more on subjective mediators, then how do they contribute independently? Trace data are perhaps best understood as indicators of behavioral structure that become meaningful in research when interpreted through theory and triangulated with subjective data.
In sum, the dissertation’s findings suggest that trace data are best understood as capturing a different level of description – the behavioral-structural level – that requires integration with subjective data to become psychologically meaningful, not as a more accurate version of what self-reports measure or as a fully independent source of information. This has consequences for the methodological debate, where the central question should be: at which level of description does each data source operate, and how should they be combined?
A more practical implication of working with these data is that investing in ever more fine-grained logging could yield diminishing returns unless it can be matched with equally fine-grained subjective data. Increasing behavioral precision without attempting the same for psychological precision would, in other words, not help close the interpretative gap.
7.1.2 Granularity
A recurrent critique of existing research on digital media use and well-being is its reliance on undifferentiated measures of screen time, typically total duration or total frequency of device use aggregated at a daily level (Kaye et al., 2020; Meier & Reinecke, 2021). As discussed in the introduction, treating screen time as a monolithic construct clashes with theoretical models that conceptualize digital well-being as the product of dynamic interactions between person-, context-, and device-specific factors (Vanden Abeele, 2021). If what matters is not just how much people use their device, but what they do, in which temporal pattern, and in which context, then aggregate indicators may flatten the variation the theory predicts to be relevant. This dissertation treated the explanatory value of increased granularity as an empirical question. Across the empirical chapters, the findings suggest that granularity does matter, but not unconditionally, and not in a straightforward “more is better” manner.
Chapter 4 provides the clearest test of this proposition. By disaggregating smartphone use into four behavioral features across four app categories, we were able to examine a matrix of category × feature combinations. This disaggregation revealed associations that aggregate measures would have obscured. For instance, email and work communication features showed the strongest association with feeling rushed, while social media features showed no significant direct associations. Had we relied on total smartphone duration alone, these differential patterns would have collapsed into a single, likely weak and uninformative, coefficient. In this sense, granularity proved analytically informative: it allowed us to identify which kinds of digital behavior related to a specific well-being outcome, and enabled us to begin distinguishing functional contexts of use.
This categorical decomposition represents a shift from studying screen time to studying screen use, where the functional character of digital behavior differs across categories which map onto distinct theoretical expectations (Meier & Reinecke, 2021). Email and work communication apps, for instance, carry connotations of professional obligations and task demands, whereas social media use may serve as leisure. Treating both as interchangeable units of screen time obscures these functional differences. Notably, Chapter 5 extended this observation by showing that even within the same category, the device on which an activity occurred shaped its association with mood: email use on the computer was consistently linked with lower happiness and higher feelings of being down, while email use on the smartphone showed weaker or non-significant associations. This implies that categorical granularity alone is not sufficient; the device context further explains what a given category of use means in practice.
Beyond the question of what people do, this dissertation also explores how behavior is distributed in time. The fragmentation feature, capturing whether screen time accumulated through sustained sessions or frequent short bursts, proved to be a relevant predictor in both Chapters 4 and 5. In Chapter 4, fragmented smartphone use was positively associated with feeling rushed, with this association operating mostly through perceived juggling load. In Chapter 5, multiple smartphone, computer, and combined-device fragmentation patterns showed positive associations with feeling down, and negative associations with feeling happy. These findings suggest that the temporal structure of digital behavior carries information that volume measures alone do not capture.
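To make the distinction between volume and temporal structure concrete, the following is a minimal sketch of how duration, frequency, and a fragmentation index might be derived from session records within a single time window. The session format, the function name, and in particular the fragmentation definition (sessions per minute of use) are illustrative assumptions; they are not the operationalization used in Chapters 4 and 5.

```python
from dataclasses import dataclass


@dataclass
class Session:
    start: float  # seconds since the start of the window
    end: float    # seconds since the start of the window


def window_features(sessions):
    """Return duration, frequency, and a fragmentation index for one window.

    Assumption: fragmentation is defined as the number of sessions per
    minute of use. Many short bursts give a high value, a few sustained
    sessions give a low value.
    """
    duration = sum(s.end - s.start for s in sessions)  # total seconds of use
    frequency = len(sessions)                          # number of sessions
    fragmentation = frequency / (duration / 60) if duration > 0 else 0.0
    return {"duration": duration, "frequency": frequency,
            "fragmentation": fragmentation}


# Two hypothetical windows with the same total duration (600 seconds)
sustained = [Session(0, 600)]                                  # one long session
bursty = [Session(i * 120, i * 120 + 60) for i in range(10)]   # ten 60-second bursts

print(window_features(sustained))
print(window_features(bursty))
```

In this toy example both windows have identical duration, so a volume measure cannot tell them apart, while the fragmentation index differs by a factor of ten.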
Yet, it would be misleading to conclude that finer granularity automatically produces stronger or more coherent findings. In Chapter 4, effect sizes for granular within-person associations remained small, even when category-specific and temporal features were employed. Chapter 5 showed that not all granular features outperformed their aggregate counterparts: some category-specific models did not yield significant associations while more aggregated ones did. Granularity adds analytical resolution, but more resolution does not guarantee more signal. It can also mean more noise and a greater risk of theoretically incoherent patterns. Consequently, deciding which kind of granularity is most theoretically relevant given the research question is key.
Finally, granularity raises a practical tension that should be acknowledged. Finer-grained operationalization demands more behavioral data, more preprocessing decisions, and more analytical complexity. Each of these introduces additional researcher degrees of freedom and, with them, new threats to reproducibility. As we have shown in the Methods chapter, the computation of these features requires multiple decisions. A move towards more granular features should therefore be accompanied by fuller documentation practices that allow these operationalizations to be evaluated and reproduced. Otherwise, it risks producing findings that are more complex, but not more trustworthy.
7.1.3 Scope
A third contribution of this dissertation concerns the scope of behavioral measurement. While digital well-being research has increasingly embraced passive smartphone logging, it has largely treated the smartphone as a sufficient proxy for individuals’ broader digital engagement. The findings of Chapter 5 demonstrate that this assumption is flawed.
First, restricting measurement to smartphones results in substantial construct underrepresentation. In our sample, including computer log data increased average daily screen time estimates by roughly 136%. This is not a small correction, but could signal a larger shift in how digital engagement is quantified. Participants with near-identical smartphone use displayed very different overall screen time patterns once computer use was included. Thus, reliance on a single device both underestimates usage and misrepresents individuals’ behavioral exposure.
Second, limiting measurement scope obscures qualitatively distinct behavioral dynamics. Multi-device logging revealed frequent device switching, overlapping use, and context transitions. These patterns, which smartphone data alone would not allow us to infer, were both common and associated with lower well-being. In this sense, scope affects which theoretical mechanisms can be tested, not only how much behavior is measured. Processes such as attentional fragmentation or media multitasking would not be visible in single-device designs.
Third, device scope shapes inferential conclusions. Smartphone-only models yielded few and inconsistent associations with mood, whereas computer and cross-device models produced clearer patterns, particularly for work- and email-related activities. This indicates that, especially among adult populations, conclusions about digital well-being depend on what parts of their digital behavior are captured.
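The first two points above can be illustrated with a small sketch: the arithmetic behind a relative increase in screen time, and one possible way of counting device switches and overlapping use from session logs. The session tuple format, the function names, and all numbers are hypothetical and chosen for illustration; the actual feature definitions are those documented in Chapter 5 and the Methods chapter.

```python
def percent_increase(smartphone_minutes, computer_minutes):
    """Relative change in estimated screen time once computer use is added.

    Simplification (assumption): the two devices are summed, so any
    overlapping use is counted on both devices.
    """
    return 100 * computer_minutes / smartphone_minutes


def count_switches(sessions):
    """Count transitions between devices in a start-time-ordered session list.

    Each session is a (device, start, end) tuple with times in seconds.
    """
    ordered = sorted(sessions, key=lambda s: s[1])
    return sum(1 for prev, cur in zip(ordered, ordered[1:]) if prev[0] != cur[0])


def overlap_seconds(sessions_a, sessions_b):
    """Total seconds during which a session in A and a session in B are both active.

    Assumes sessions on the same device do not overlap each other.
    """
    total = 0
    for _, a_start, a_end in sessions_a:
        for _, b_start, b_end in sessions_b:
            total += max(0, min(a_end, b_end) - max(a_start, b_start))
    return total


# An increase of 136% means the combined estimate is 2.36 times the
# smartphone-only estimate, e.g. 100 + 136 = 236 minutes.
print(percent_increase(100, 136))  # 136.0

# Two hypothetical participants with identical smartphone use
print(percent_increase(120, 30))   # 25.0
print(percent_increase(120, 300))  # 250.0

# Hypothetical session logs for one participant
phone = [("phone", 0, 60), ("phone", 300, 360), ("phone", 900, 960)]
computer = [("computer", 30, 600), ("computer", 1000, 1200)]

print(count_switches(phone + computer))   # 3
print(overlap_seconds(phone, computer))   # 90
```

Neither the switch count nor the overlap can be computed from the smartphone sessions alone, which is the sense in which scope determines which mechanisms are testable.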
These findings relate to broader concerns about tracking undercoverage in digital trace research, where incomplete device capture leads to systematic bias (Bosch et al., 2025). In the context of digital well-being research, undercoverage risks misestimating both exposure and its psychological consequences. Device selection therefore constitutes a theoretical boundary decision: it determines what counts as digital media use in the first place.
Chapter 5 of this dissertation demonstrates that multi-device logging is a feasible methodological direction, and by introducing a procedure for granular iOS data extraction along with open-source computer logging tools, Chapter 6 provides a practical pathway towards expanding measurement scope across devices.
Even the multi-device designs adopted in Chapter 5 capture only a partial view of participants’ full device ecologies. Our data encompass smartphones and computers, but leave out, for instance, tablets, TVs, gaming consoles, and wearables – devices that, for some, may constitute a non-trivial amount of daily digital engagement. Furthermore, the measurement foundation across the two devices was not symmetric. Certain behavioral indicators that proved relevant in Chapters 3 and 4, most notably notifications, were available exclusively from the smartphone, as the computer-based logging tool did not capture equivalent events.
The scope argument advanced here is not a claim that our two-device configuration is sufficient to capture a person’s full digital ecology. Smartphones and computers represent only part of a broader set of digital devices that may include many others. Each additional device introduces its own data structures, logging conventions, technical limitations and constraints that complicate integration into a coherent measurement framework. Fully capturing an entire device ecology thus remains an aspiration that current tools cannot achieve. What Chapter 5 demonstrates is that expanding beyond a single device already meaningfully changes what we observe. There is little reason to assume that further expansions to additional devices would not produce similar shifts. However, as we discussed before, expanding studies to include more devices and even richer data will not automatically close the distance between observed behavior and psychological experience. Investing in broader device coverage therefore risks increasing measurement complexity without adding explanatory value, unless it is accompanied by increased efforts to interpret what these additional behavioral records mean for the individuals generating them. In other words, expanding scope is most productive when contextual and subjective data can provide meaning to the behavioral patterns and the psychological processes they are meant to explain.
7.1.4 Transparency and Reproducibility
The methodological shift towards trace data was motivated, in part, by the inherent validity problems that come with self-report data. Logged behavioral data offer the promise of greater objectivity. Across this dissertation, and specifically in the Methods chapter, however, we have come to see this promise is incomplete.
While trace data do eliminate recall bias and social desirability effects, the trace-to-indicator workflow documented in Chapter 2 reveals that the transformation of raw device logs into analytical variables involves numerous consequential decisions – about what to retain, how to parse events, where to set thresholds, and how to operationalize features – each of which introduces researcher judgement into the data. Because these decisions cascade through the pipeline, with early-stage choices impacting subsequent indicators, they are not just isolated technical operations, but sources of variation that shape what the data can and cannot show. Providing transparency about these decisions is therefore key to increasing reproducibility of log-based research.
The pipeline documented in Chapter 2 makes visible a set of decisions that are rarely reported in sufficient detail for others to evaluate or reproduce. This raises a broader question about where such pre-analysis processing sits within current thinking on computational reproducibility. Existing frameworks have articulated the importance of code sharing, environment encapsulation, and tiered verification (Bleier, 2025; Schoch et al., 2023), but have, understandably, focused more on the analysis stage of the research pipeline, where statistical code operates on data that is already in its final, analyzable form. The pre-analysis processing decisions that are particularly consequential for trace data have received comparatively less attention. The detailed account offered in Chapter 2 is a response to calls for such documentation (Baumgartner et al., 2023; Parry & Klingelhoefer, 2025; Parry & Toth, 2025), but transparency alone does not fully resolve the problem. Even a thoroughly documented pipeline represents a single path chosen from a range of alternatives. Complementary approaches could further strengthen the reproducibility and evaluability of trace-based research. Examples include multiverse-style sensitivity analyses that assess how different processing decisions affect downstream conclusions (e.g., Murphy et al., 2025), and interactive computational documents (e.g., R Markdown, Quarto) that allow readers to adjust pipeline parameters and observe resulting changes.
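As a rough illustration of the multiverse idea applied to pre-analysis processing, the sketch below runs the same raw events through every combination of two processing decisions and records how a downstream indicator changes. The two thresholds (`merge_gap` and `min_length`), their candidate values, and the event data are hypothetical. They stand in for the kinds of decisions documented in Chapter 2 and are not the values used there.

```python
from itertools import product


def build_sessions(events, merge_gap, min_length):
    """Turn (start, end) screen events into sessions.

    Events separated by at most `merge_gap` seconds are merged into one
    session (the gap then counts as use); sessions shorter than
    `min_length` seconds are dropped. Both thresholds are illustrative
    processing decisions, not values taken from the dissertation.
    """
    sessions = []
    for start, end in sorted(events):
        if sessions and start - sessions[-1][1] <= merge_gap:
            sessions[-1][1] = max(sessions[-1][1], end)
        else:
            sessions.append([start, end])
    return [(s, e) for s, e in sessions if e - s >= min_length]


# Hypothetical raw events for one participant-day (seconds)
events = [(0, 4), (10, 70), (75, 130), (400, 402), (1000, 1300)]

# Each combination of decisions is one "universe" of the multiverse
results = {}
for merge_gap, min_length in product([0, 5, 30], [0, 5]):
    sessions = build_sessions(events, merge_gap, min_length)
    results[(merge_gap, min_length)] = {
        "n_sessions": len(sessions),
        "duration": sum(e - s for s, e in sessions),
    }

for spec, outcome in results.items():
    print(spec, outcome)
```

In this toy example the session count ranges from 2 to 5 across specifications while total duration barely moves, which suggests that frequency-type and fragmentation-type indicators may be more sensitive to such decisions than volume measures. A full multiverse analysis would carry each specification through to the fitted model and compare the resulting estimates.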
7.2 Contributions to Broader Discussions in the Field
Above, I have provided a general synthesis of the findings of my empirical work, and have formulated an overarching response to the three core research questions guiding it. In the current section, I explore some broader implications of the study findings for the field of research on digital well-being, and digital media effects more generally.
7.2.1 Small Effects
A recurring finding across the empirical chapters is that effect sizes of behavioral indicators are small. This finding is consistent with the broader literature where small effect sizes have been reported across studies employing logged behavioral data and ESM designs (e.g., de Segovia Vicente et al., 2024; Elmer et al., 2025; Siebers et al., 2024). Several considerations could help interpret these small effect sizes. I want to discuss three of these considerations, which are also limitations of the current dissertation, in greater depth: (1) the potential for momentary effects to accumulate over time, (2) the possibility that effects have already stabilized, and (3) the sensitivity of effect estimates to the inclusion of control variables.
First, as we investigated momentary associations, an important question is whether these small effects might accumulate into a larger, more consequential impact. As we will also discuss in the limitations (see below), we put forward this accumulation hypothesis when investigating online vigilance (Chapter 3), as we found both contemporaneous and lagged effects of online vigilance on mental fatigue. If these small effects linger and are not fully regulated, they could accumulate over the course of the day. Thomas (2022) offers a systematic methodological framework for analyzing such dynamics, distinguishing between different patterns in the appearance and duration of media effects. Their typology includes, among others, the “drip” scenario, a pattern in which a single media exposure produces weak or absent effects, but the influence across repeated exposures can accumulate, leading to a meaningful overall effect. Modelling the effects observed in our study using such a drip framework could have provided more insight into whether momentary associations with fatigue build up during the day. At the time of performing the study, I was not familiar enough with these models to confidently employ them, but I believe the observation of a t-2 lagged effect of online vigilance warrants such a future exploration.
Second, a complementary perspective, advanced by Baumgartner (2025), proposes that media effects may stabilize after repeated exposure. Drawing on theories of habituation, adaptation, and learning, Baumgartner argues that media effects are likely to diminish and eventually plateau as users grow accustomed to their media environments. If this is the case, effects are only empirically observable during an “effect-sensitive period”. For more established media users, such as the adult participants in our study, this would mean that the period during which digital behavior most strongly affected well-being outcomes may have already passed before data collection began. This could offer an explanation as to why our momentary associations remained small, although statistically significant. I think future research should consider how research designs could be developed to capture media habits when they start to form, or when changes in digital media use may be observed, for instance as a result of major life changes.
Finally, the inclusion of control variables when modelling media effects can make a noticeable difference to the effect sizes observed. Ferguson (2026), for instance, demonstrated that bivariate correlations between self-reported social media screen time and various well-being indicators largely disappeared once trait-level controls were included. This suggests that small bivariate associations might be artefacts, reflecting shared underlying factors, instead of direct media effects. A similar dynamic was observed in our own work: in Chapter 4, we initially modeled the association between smartphone use and time pressure controlling only for whether or not a participant had worked. Reviewers argued this did not sufficiently account for potential role blurring between work and personal life, hence an additional control for the time of day was added, which altered several of the observed effect sizes.
Relatedly, across the empirical chapters, behavioral features – such as duration, frequency, and fragmentation – were typically entered in separate models instead of being included together. Whether the small individual effects reported reflect overlapping or distinct variance in digital well-being outcomes remains an open question, and future research could benefit from modelling multiple behavioral features concurrently to clarify their relative and combined contributions.
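The role of control variables discussed above can be illustrated with the standard first-order partial correlation, which removes from the association between x and y the part that both share with a third variable z:

r_xy.z = (r_xy − r_xz · r_yz) / sqrt((1 − r_xz²)(1 − r_yz²))

The sketch below applies this formula to constructed data. It is a between-person simplification using made-up scores, not the multilevel models used in the empirical chapters, and the variable names are hypothetical.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation between two equally long lists of numbers."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def partial_correlation(x, y, z):
    """Correlation between x and y after removing what both share with z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical person-level scores, constructed for illustration only
trait = [1, 2, 3, 4, 5, 6]          # a shared trait-level factor
screen_time = [2, 1, 3, 4, 4, 7]    # trait plus idiosyncratic variation
outcome = [2, 2, 2, 3, 5, 7]        # trait plus idiosyncratic variation

bivariate = pearson(screen_time, outcome)
controlled = partial_correlation(screen_time, outcome, trait)
print(round(bivariate, 3), round(controlled, 3))
```

In this constructed example the bivariate correlation is about 0.91, while the partial correlation controlling for the trait is 0.5: a large share of the raw association reflects the shared factor and not a direct link between the two variables.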
7.3 Limitations
7.3.1 Partial Utilization of the Dataset
A major limitation of this dissertation concerns its single-wave scope. The broader ERC DISCONNECT project employed a multi-wave burst design with three separate data collection waves, yet this dissertation draws solely from Wave 1 data. As a result, longer-term longitudinal dynamics across waves remain unexplored, including whether associations between digital behavior and well-being change, stabilize or build up over time. Recent theoretical work has argued that such dynamics are central to understanding media effects (Baumgartner, 2025; Thomas, 2022). A core strength of the project’s design has therefore not yet been fully realized within this dissertation.
That said, the choice to focus on Wave 1 data was a deliberate and pragmatic one. As documented in the Methods chapter, the processing pipeline for combining Android trace data, computer trace data and mobile experience sampling, even within a single wave, was substantial. The sheer volume of data already required considerable investment before any meaningful analysis could take place. This type of work – cleaning, parsing, debugging – constitutes a form of hidden labor that is essential to the success of trace-based research, yet remains largely invisible in the final publications it enables, rarely recognized by current academic incentive structures (Parry & Klingelhoefer, 2025).
In addition, between waves, changes were introduced in the tools used for data collection, including the app used for experience sampling. By Wave 2, a new version of the passive logging app for Android users was released, which included an ESM module. This had consequences for both the ESM protocol and, to an extent, for the structure of the raw trace data. These changes, which were beneficial from the perspective of (Android) participants, meant that cross-wave alignment of trace data and within-wave alignment of ESM data were not reducible to simply appending datasets. This required additional harmonization steps, once again introducing further decision points and potential sources of error.
Data harmonization and alignment steps were carried out, albeit after Wave 3 data collection was completed. As such, these steps fell outside of the scope and timing of the current dissertation. My hope is that we can share these data with the broader scientific community, allowing other researchers to benefit from our data collection efforts and this unique dataset. This ambition responds to growing calls in computational social science to move beyond data-available-upon-request claims, which often go unfulfilled over time (Schoch et al., 2023), and toward more structured archiving practices that support third-party verification and cumulative scientific progress (Bleier, 2025).
Finally, a cross-wave analysis would itself have introduced a new set of methodological challenges related to sample fragmentation. Across three waves, not all participants contributed to every wave: some participants from Wave 1 did not return for subsequent waves, while new participants joined in later waves. The combination of participant attrition and new recruitment means that any longitudinal analysis spanning multiple waves would necessarily operate on a fragmented sample, one in which the usable subset shrinks with every additional wave in which participants are required to be present. This fragmentation is not unique to the cross-wave dimension, however; as the next section discusses, it also occurs in other forms within the empirical chapters of this dissertation.
7.3.2 Fragmented Samples
All three empirical chapters draw from the same Wave 1 data collection, but each chapter effectively operates on a different analytic sample. Our full sample consists of 1,315 participants, but only 774 provided usable Android trace data, and of those, only 106 also provided computer trace data. Chapters 3 and 4, which rely on smartphone traces alone, draw on a much larger subsample than Chapter 5, where both smartphone and computer trace data were required. This means that findings across chapters are not strictly comparable, as each is based on a different subset of the participant pool, and selectivity likely plays a larger role as data requirements become more stringent. Further, this selectivity might not be random: participants who provided both smartphone and computer data likely differ from those who only provided smartphone data (e.g., in terms of motivation, digital literacy, device ecology), echoing Bosch et al.’s (2025) point that undercoverage is shaped by participant characteristics.
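To make this narrowing concrete, the retention at each stage can be computed directly from the three counts reported above. The following is a minimal illustrative sketch rather than part of the original analysis pipeline; the stage labels and the `retention` helper are names introduced here for illustration only.

```python
# Participant counts for Wave 1 as reported in the text.
stages = [
    ("full ESM sample", 1315),
    ("usable Android trace data", 774),
    ("Android + computer trace data", 106),
]


def retention(stages):
    """Return (label, n, share of full sample, share of previous stage)."""
    full = stages[0][1]
    prev = full
    rows = []
    for label, n in stages:
        rows.append((label, n, n / full, n / prev))
        prev = n
    return rows


for label, n, of_full, of_prev in retention(stages):
    print(f"{label:32s} n={n:5d}  {of_full:6.1%} of full  {of_prev:6.1%} of previous")
```

Under these counts, roughly 59% of the full sample contributed usable smartphone traces, and roughly 8% contributed both smartphone and computer traces, which is the subsample on which Chapter 5 rests.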
A different, more granular, form of fragmentation arose within each chapter, particularly in Chapters 4 and 5, where category-specific features (email, social media, chat, and work [communication]) were modelled. The issue here concerns the interpretation and handling of data absence: when a particular participant showed no category-specific use during a given ESM time window, it is ambiguous whether this reflects genuine non-use (a meaningful zero) or data absence (a missing value that should be excluded). If a participant never uses email apps on their smartphone across the entire study period, should their email duration be coded as a zero – implying that email non-use is a meaningful behavioral state to include in models – or should they be excluded from email-specific analyses entirely? This decision determines whether the analytic sample for a given app category includes only users of that category or the full sample. To our knowledge, no established convention exists in the field for resolving this ambiguity, and different choices would produce different samples and different results, connecting back to our discussion on researchers’ degrees of freedom (Methods chapter). This problem is compounded by the inherent ambiguity in app categorization itself, where decisions about how to label applications (using existing or custom labeling schemes) can alter which participants register any use in a given category (see also Elmer et al., 2025; Schoedel et al., 2022).
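The consequence of this coding decision can be illustrated with a toy example. The sketch below uses invented durations for three hypothetical participants and two assumed conventions (`"zero"` and `"users_only"`); neither the data nor the function reflects the actual pipeline of this dissertation. It only shows how the same raw windows yield a different sample and a different average depending on the rule.

```python
from statistics import mean

# Hypothetical email durations (minutes) per ESM window for three participants.
# None marks a window in which no email events were logged.
email_minutes = {
    "p1": [12.0, None, 5.0],   # email user with one window without use
    "p2": [None, None, None],  # no email use logged during the whole study
    "p3": [3.0, 8.0, None],
}


def code_absence(durations, rule):
    """Apply one of two assumed conventions to windows without logged use.

    rule="zero":       every window without events is coded as 0 minutes.
    rule="users_only": participants without any logged use are excluded;
                       for the remaining users, empty windows are coded as 0.
    """
    coded = {}
    for pid, windows in durations.items():
        if rule == "users_only" and all(w is None for w in windows):
            continue  # treated as missing for this category, not as non-use
        coded[pid] = [0.0 if w is None else w for w in windows]
    return coded


def summarize(coded):
    """Return the number of participants and the mean duration across windows."""
    values = [v for windows in coded.values() for v in windows]
    return len(coded), round(mean(values), 2)


for rule in ("zero", "users_only"):
    n, avg = summarize(code_absence(email_minutes, rule))
    print(f"{rule:10s} participants={n} mean_minutes={avg}")
```

In this toy case, the zero-coding rule retains all three participants with a lower mean duration, while the users-only rule drops one participant and raises the mean. The trace data themselves cannot tell which of the two readings is correct.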
A further point related to missing data is that, in our study on digital well-being, the behavior of interest – disconnecting from digital devices – may produce theoretically relevant missing data in both trace and ESM data streams. If a participant deliberately goes offline or shuts down their device, they will not receive or respond to their ESM prompts and will not generate log data via their smartphone. In other words, well-being during disconnection periods goes unrecorded. Importantly, the behaviors that produce this data loss are not marginal: Keusch et al. (2022) found that substantial proportions of smartphone users reported regularly leaving their device at home or not carrying it on their person, behaviors which would suppress both trace data and ESM data in a study like ours. This creates a non-random form of missingness: the moments where participants are most disconnected are also the moments least likely to be captured by the study design. In Chapter 3, we explicitly discuss this limitation, as several participants reported in the final questionnaire that their missing ESM values were often due to deliberate disconnection decisions. This observation, also noted by Klingelhoefer et al. (2024), reveals an intrinsic tension in ESM research on digital disconnection: the method intervenes on the reality it aims to capture, as the smartphone notifications required to collect ESM data are also a form of digital demand on participants.
7.3.3 Reactivity
The interdependence between study design and the behavior under investigation extends beyond the problem of missing data. A related concern is reactivity: the possibility that participants alter their digital behavior precisely because they know it is being monitored. In studies that combine passive logging with experience sampling, as ours does, reactivity can be present in both data streams and on different timescales. Toth et al. (2025) disentangled these two sources in their study. They found that smartphone logging induced small reductions in use frequency and duration that were most pronounced during the first five days before fading. This is consistent with earlier, smaller-scale evidence of habituation to monitoring (Toth & Trifonova, 2021). ESM prompts, by contrast, produced short-term increases in smartphone use lasting approximately two to three hours after each survey, likely because the act of picking up the phone to complete a questionnaire served as a connection cue that triggered further device use. This latter finding is particularly relevant to our design, where participants received six ESM prompts per day via their smartphone. On the ESM side, there is an additional, more conceptual, form of reactivity: our study of online vigilance (Chapter 3) required participants to attend to incoming notifications from the ESM app, effectively asking them to be vigilant toward their smartphone in order to report on their vigilance toward their smartphone. As we noted in that chapter, some participants explicitly remarked on this irony, indicating that the ESM notifications themselves became part of the attentional demands the study aimed to measure. Together, these observations suggest that our study likely captured behavior that was shaped, in part, by the measurement process itself. 
The existing evidence offers some reassurance that logging-induced reactivity tends to diminish within the first days of a study period, and that ESM-induced reactivity, while recurrent, is not long-lasting (Toth et al., 2025; Toth & Trifonova, 2021). Nevertheless, the possibility that both forms of reactivity shaped the behavioral data in our empirical chapters cannot be fully ruled out, and remains a source of uncertainty common to all intensive longitudinal designs that combine experience sampling with passive sensing.
7.4 Societal Implications
7.4.1 Screen Time Discourse and Public Policy
While this dissertation has primarily adopted a methodological lens, its findings are not without broader societal relevance. The methodological choices examined throughout these chapters have direct consequences for how digital well-being is understood and acted upon in public discourse and policy.
Public debate on screen time and well-being overwhelmingly treats screen time as a single number: hours per day. As we have discussed, this reflects some of the “conceptual and methodological mayhem” related to screen time (Kaye et al., 2020); the concept is underspecified, yet central to popular concern and legislative action. Although research on the topic has increasingly moved away from self-reported and aggregated duration measures, this shift has not yet reached the evidence base that informs policy.
As an example, take the topic of smartphone bans in schools, which has received considerable policy attention in recent years. The evidence base informing these policies, however, remains limited, and critically, continues to rely on the very measurement approaches this dissertation has called into question. Recent work on the consequences of these bans in the UK (Goodyear et al., 2025) and the Netherlands (Vanluydt et al., 2026) has concluded that there is little to no impact on youth’s screen time or well-being, leading others to state that these bans are unproven or have failed (Ferguson, 2025, 2026). In each of these cases, the measurement of smartphone use is limited to what students self-report. In some cases, this involved prompted self-reports, where students were asked to consult iOS Screen Time or Android Digital Wellbeing dashboards; in others, screen time was captured using a single survey item asking students to estimate daily hours across all devices. As this dissertation has demonstrated, self-reported screen time not only differs systematically from logged behavior (Parry et al., 2021), but also obscures granularity and usage patterns that may be more relevant than aggregated duration alone. For instance, the question of whether these bans might only shift or redistribute screen time towards other periods outside school hours, or towards other devices such as laptops present in classrooms, has already been raised (Vanluydt et al., 2026; Weiss & Bonell, 2025). Until log-data approaches find their way into policy-relevant research, the evidence base for interventions such as smartphone bans will remain constrained by the measurement limitations this dissertation has sought to address.
7.4.2 Digital Divide(s) in Research Participation
The methodological advances in digital behavior tracking discussed in the previous sections have consequences that are less frequently discussed: they can produce and reproduce digital divides that shape who is studied and who can conduct these studies. boyd & Crawford (2012) warned that unequal access to Big Data creates new stratifications in research, between the “Big Data rich” and “Big Data poor”, and that these inequalities bias both the data and the kinds of research that follow. In trace-based digital well-being research, this claim holds.
A first source of bias concerns operating system constraints. The majority of smartphone logging research relies on Android devices, as Android has historically afforded researchers access to detailed, event-level usage logs (Parry & Klingelhoefer, 2025), while iOS imposes far more restrictive data access policies. Existing workarounds, such as screen capture donations of Apple Screen Time (Baumgartner et al., 2023; Ohme et al., 2021), yield only aggregated data and can impose a higher burden on participants, possibly leading to attrition. The procedure presented in Chapter 6, which extracts raw iOS Screen Time data, is a promising step but requires participants to own both an iPhone and a Mac, which restricts eligibility to users embedded in the Apple ecosystem.
Beyond platform constraints, participant self-selection further narrows who is represented. Trace data collection typically requires installing software, granting extensive permissions, and sustaining that participation over time – conditions that favor individuals with higher digital literacy, lower privacy concerns, and higher trust in researchers (Keusch et al., 2024; Pankowska et al., 2025). The DISCONNECT project illustrates this concretely: our recruitment in collaboration with De Standaard yielded a sample where 93% held at least a bachelor’s degree; participation was incentivized through personalized well-being feedback without offering financial compensation, likely attracting people already motivated to reflect on their digital habits; and of 1,315 ESM participants, around 70% were Android users. Each selection stage progressively narrows the sample towards a specific profile: educated, digitally reflective, Android-using adults. The risk, then, is that our understanding of digital well-being is built on the experiences of a particular subset of the population. Whether these findings generalize to those who are less digitally literate, who use different devices, or who lack the motivation or time for intensive longitudinal studies remains an open question.
The divide extends equally to the researcher side. Working with behavioral trace data requires expertise in data collection design, processing of complex log files, and computational analysis. These skills are not uniformly distributed across the research community. As boyd & Crawford (2012) observed, computational skills are generally restricted to those with a technical background, raising the question of who is advantaged in research contexts that increasingly require these skills.
In practice, even the data access methods themselves are fragile. As we outlined in Chapter 6, during the development of our original iOS logging tool, an Apple operating system update rendered our data extraction procedure inoperable mid-study, illustrating the power imbalance between researchers and platform companies. Parry & Klingelhoefer (2025) note more broadly that commercial organizations frequently act as gatekeepers, mediating and restricting access to trace data, and that the specialized infrastructure required can present a significant barrier for researchers without institutional support.
Reproducibility can also be undermined by knowledge gaps in, for instance, programming practices among computational social scientists, many of whom are self-taught and lack formal training in software engineering (Schoch et al., 2023). Proprietary tools and APIs impose additional economic barriers, and heavy computational resource requirements can render research effectively irreproducible for those without access (Bleier, 2025; Schoch et al., 2023).
These dynamics risk reinforcing the hierarchies boyd & Crawford (2012) cautioned against: well-resourced research groups with computational infrastructure and skills are best positioned to conduct trace-data research, while those at small institutions or in less computationally oriented disciplines are excluded from these types of research designs. If we want to understand digital well-being across diverse populations, then we must remain critical of the exclusionary nature of these dynamics.
7.5 Concluding Remarks
This dissertation began from a simple premise: that the move from self-reported screen time to passively logged behavior promised more objective and precise measurement of digital media use, and thereby clearer answers about its relation to well-being. The work presented here suggests that this promise is real but conditional. Trace data do circumvent the recall and desirability problems of self-report, and they open analytical possibilities (e.g., category- and feature-level granularity, cross-device scope, temporal dynamics) that aggregated self-reports cannot. Yet across the empirical chapters, the same lesson recurred in different forms: the validity of what we can conclude does not follow automatically from the act of logging behavior. It depends on how indicators are constructed, at what granularity they are operationalized, from which devices they are drawn, and on whether they are interpreted alongside the subjective experiences that give behavior its meaning. The take-home message, then, is not that trace data are unreliable, nor that self-reports should be reinstated, but that neither data source is self-sufficient. I hope this dissertation can guide others in designing studies that combine them deliberately, document the choices they make in connecting raw trace data to analytical claims, and remain honest about the populations and devices their designs can and cannot reach. If this dissertation has a forward-looking contribution, it is an invitation for others to treat the integration of behavioral and subjective data – and the transparency of the pipeline that links them – not as a methodological afterthought, but as a central design problem of the field.