2024
- The term Behavioral Networks describes networks that contain relational information on human behavior. This ranges from social networks that contain friendships or cooperations between individuals, to navigational networks that contain geographical or web navigation, and many more. Understanding the forces driving behavior within these networks can be beneficial to improving the underlying network, for example, by generating new hyperlinks on websites, or by proposing new connections and friends on social networks. Previous approaches considered different hypotheses on a single network and evaluated which hypothesis fits best. These hypotheses can represent human intuition and expert opinions or be based on previous insights. In this work, we extend these approaches to enable the comparison of a single hypothesis between multiple networks. We unveil several issues of naive approaches that potentially impact comparisons and lead to undesired results. Based on these findings, we propose a framework with five flexible components that allow addressing specific analysis goals tailored to the application scenario. We show the benefits and limits of our approach by applying it to synthetic data and several real-world datasets, including web navigation, bibliometric navigation, and geographic navigation. Our work supports practitioners and researchers with the aim of understanding similarities and differences in human behavior between environments.
- Links between human milk (HM) and infant development are poorly understood and often focus on individual HM components. Here we apply multi-modal predictive machine learning to study HM and head circumference (a proxy for brain development) among 1022 mother-infant dyads of the CHILD Cohort. We integrated HM data (19 oligosaccharides, 28 fatty acids, 3 hormones, 28 chemokines) with maternal and infant demographic, health, dietary and home environment data. Head circumference was significantly predictable at 3 and 12 months. Two of the most associated features were HM n3-polyunsaturated fatty acid C22:6n3 (docosahexaenoic acid, DHA; p = 9.6e−05) and maternal intake of fish (p = 4.1e−03), a key dietary source of DHA with established relationships to brain function. Thus, using a systems biology approach, we identified meaningful relationships between HM and brain development, which validates our statistical approach, gives credence to the novel associations we observed, and sets the foundation for further research with additional cohorts and HM analytes.
2023
- Preterm birth (PTB) is the leading cause of death in children under five, yet comprehensive studies are hindered by its multiple complex etiologies. Epidemiological associations between PTB and maternal characteristics have been previously described. This work used multiomic profiling and multivariate modeling to investigate the biological signatures of these characteristics. Maternal covariates were collected during pregnancy from 13,841 pregnant women across five sites. Plasma samples from 231 participants were analyzed to generate proteomic, metabolomic, and lipidomic datasets. Machine learning models showed robust performance for the prediction of PTB (AUROC = 0.70), time-to-delivery (r = 0.65), maternal age (r = 0.59), gravidity (r = 0.56), and BMI (r = 0.81). Time-to-delivery biological correlates included fetal-associated proteins (e.g., ALPP, AFP, and PGF) and immune proteins (e.g., PD-L1, CCL28, and LIFR). Maternal age negatively correlated with collagen COL9A1, gravidity with endothelial NOS and inflammatory chemokine CXCL13, and BMI with leptin and structural protein FABP4. These results provide an integrated view of epidemiological factors associated with PTB and identify biological signatures of clinical covariates affecting this disease. Deep biological profiling and machine learning reveal insights into the epidemiology of preterm birth.
- Although prematurity is the single largest cause of death in children under 5 years of age, the current definition of prematurity, based on gestational age, lacks the precision needed for guiding care decisions. Here, we propose a longitudinal risk assessment for adverse neonatal outcomes in newborns based on a deep learning model that uses electronic health records (EHRs) to predict a wide range of outcomes over a period starting shortly before conception and ending months after birth. By linking the EHRs of the Lucile Packard Children’s Hospital and the Stanford Healthcare Adult Hospital, we developed a cohort of 22,104 mother-newborn dyads delivered between 2014 and 2018. Maternal and newborn EHRs were extracted and used to train a multi-input multitask deep learning model, featuring a long short-term memory neural network, to predict 24 different neonatal outcomes. An additional cohort of 10,250 mother-newborn dyads delivered at the same Stanford Hospitals from 2019 to September 2020 was used to validate the model. Areas under the receiver operating characteristic curve at delivery exceeded 0.9 for 10 of the 24 neonatal outcomes considered and were between 0.8 and 0.9 for 7 additional outcomes. Moreover, comprehensive association analysis identified multiple known associations between various maternal and neonatal features and specific neonatal outcomes. This study used linked EHRs from more than 30,000 mother-newborn dyads and would serve as a resource for the investigation and prediction of neonatal outcomes. An interactive website is available for independent investigators to leverage this unique dataset: https://maternal-child-health-associations.shinyapps.io/shiny_app/. Machine learning models that use maternal and neonatal EHR data can assist in the assessment of risk of multiple adverse neonatal outcomes. Reduction of neonatal mortality and morbidity requires timely risk assessment so that care can be appropriately managed.
Using multiple cohorts of mother and newborn dyads, De Francesco et al. trained and externally validated a deep learning model to predict different adverse neonatal outcomes by mining the paired electronic health records (EHRs). Their method largely outperformed currently used EHR-based clinical risk scores and can be applied to EHR data at time points ranging from early gestation to at or after birth, positioning their risk assessment model to be of potential clinical utility. (CAC)
2022
- Machine learning comprises algorithms that can perform tasks they were not explicitly programmed to perform. Explicitly programmed algorithms perform tasks according to a predefined sequence of instructions. Conversely, machine learning algorithms are programmed to learn to perform tasks using input data. In the era of abundant data, affordable data storage, and computational capabilities, understanding machine learning algorithms is critical to better explore and answer questions that can advance surgical science.
2021
- Current methods to predict spontaneous labor are fairly inaccurate. To provide better estimates and biomarkers of labor onset, the biological processes that lead up to labor need to be better understood. Stelzer et al. performed metabolome, proteome, and immunome studies on blood samples from 63 women in the 100 days before delivery. They identified a surge in IL-1R4 and steroid hormones in the weeks before delivery, which was coordinated with a switch from immune activation to regulation of inflammatory responses. A model was then constructed to predict time to labor independent of gestational age. These results may be helpful for development of more accurate methods to predict labor.
Estimating the time of delivery is of high clinical importance because pre- and postterm deviations are associated with complications for the mother and her offspring. However, current estimations are inaccurate. As pregnancy progresses toward labor, major transitions occur in fetomaternal immune, metabolic, and endocrine systems that culminate in birth. The comprehensive characterization of maternal biology that precedes labor is key to understanding these physiological transitions and identifying predictive biomarkers of delivery. Here, a longitudinal study was conducted in 63 women who went into labor spontaneously. More than 7000 plasma analytes and peripheral immune cell responses were analyzed using untargeted mass spectrometry, aptamer-based proteomic technology, and single-cell mass cytometry in serial blood samples collected during the last 100 days of pregnancy. The high-dimensional dataset was integrated into a multiomic model that predicted the time to spontaneous labor [R = 0.85, 95% confidence interval (CI) [0.79 to 0.89], P = 1.2 × 10^-40, N = 53, training set; R = 0.81, 95% CI [0.61 to 0.91], P = 3.9 × 10^-7, N = 10, independent test set].
Coordinated alterations in maternal metabolome, proteome, and immunome marked a molecular shift from pregnancy maintenance to prelabor biology 2 to 4 weeks before delivery. A surge in steroid hormone metabolites and interleukin-1 receptor type 4 that preceded labor coincided with a switch from immune activation to regulation of inflammatory responses. Our study lays the groundwork for developing blood-based methods for predicting the day of labor, anchored in mechanisms shared in preterm and term pregnancies.
- A healthy pregnancy depends on complex interrelated biological adaptations involving placentation, maternal immune responses, and hormonal homeostasis. Recent advances in high-throughput technologies have provided access to multiomics biological data that, combined with clinical and social data, can provide a deeper understanding of normal and abnormal pregnancies. Integration of these heterogeneous datasets using state-of-the-art machine-learning methods can enable the prediction of short- and long-term health trajectories for a mother and offspring and the development of treatments to prevent or minimize complications. We review advanced machine-learning methods that could: provide deeper biological insights into a pregnancy not yet unveiled by current methodologies; clarify the etiologies and heterogeneity of pathologies that affect a pregnancy; and suggest the best approaches to address disparities in outcomes affecting vulnerable populations.
2020
- To assess the exposure of citizens to pollutants like NOx or particulate matter in urban areas, land use regression (LUR) models are a well-established method. LUR models leverage information about environmental and anthropogenic factors such as cars, heating, or industry to predict air pollution in areas where no measurements have been made. However, existing approaches are often not globally applicable and require tedious hyper-parameter tuning to enable high-quality predictions. In this work, we tackle these issues by introducing OpenLUR, an off-the-shelf approach for modeling air pollution that (i) works on a set of novel features solely extracted from the globally and openly available data source OpenStreetMap and (ii) is based on state-of-the-art machine learning featuring automated hyper-parameter tuning in order to minimize manual effort. We show that our proposed features are able to outperform their counterparts from local and closed sources, and illustrate how automated hyper-parameter tuning can yield competitive results while alleviating the need for expert knowledge in machine learning and manual effort. Importantly, we further demonstrate the potential of the global availability of our features by applying cross-learning across different cities in order to reduce the need for a large amount of training samples. Overall, OpenLUR represents an off-the-shelf approach that facilitates easily reproducible experiments and the development of globally applicable models.
- High-throughput single-cell analysis technologies produce an abundance of data that is critical for profiling the heterogeneity of cellular systems. We introduce VoPo (https://github.com/stanleyn/VoPo), a machine learning algorithm for predictive modeling and comprehensive visualization of the heterogeneity captured in large single-cell datasets. In three mass cytometry datasets, with the largest measuring hundreds of millions of cells over hundreds of samples, VoPo defines phenotypically and functionally homogeneous cell populations. VoPo further outperforms state-of-the-art machine learning algorithms in classification tasks, and identifies immune correlates of clinically relevant parameters.
2018
- The k-Nearest Neighbor (kNN) classification approach is conceptually simple, yet widely applied since it often performs well in practical applications. However, using a global constant k does not always provide an optimal solution, e.g., for datasets with an irregular density distribution of data points. This paper proposes an adaptive kNN classifier where k is chosen dynamically for each instance (point) to be classified, such that the expected accuracy of classification is maximized. We define the expected accuracy as the accuracy of a set of structurally similar observations. An arbitrary similarity function can be used to find these observations. We introduce and evaluate different similarity functions. For the evaluation, we use five different classification tasks based on geo-spatial data. Each classification task consists of (tens of) thousands of items. We demonstrate that the presented expected accuracy measures can be a good estimator for kNN performance, and that the proposed adaptive kNN classifier outperforms common kNN and previously introduced adaptive kNN algorithms. Also, we show that the range of considered k can be significantly reduced to speed up the algorithm without negative influence on classification accuracy.
2017
- Sequential traces of user data are frequently observed online and offline, e.g., as sequences of visited websites or as sequences of locations captured by GPS. However, understanding factors explaining the production of sequence data is a challenging task, especially since the data generation is often not homogeneous. For example, navigation behavior might change in different phases of browsing a website or movement behavior may vary between groups of users. In this work, we tackle this task and propose MixedTrails, a Bayesian approach for comparing the plausibility of hypotheses regarding the generative processes of heterogeneous sequence data. Each hypothesis is derived from existing literature, theory, or intuition and represents a belief about transition probabilities between a set of states that can vary between groups of observed transitions. For example, when trying to understand human movement in a city and given some data, a hypothesis assuming tourists to be more likely to move towards points of interest than locals can be shown to be more plausible than a hypothesis assuming the opposite. Our approach incorporates such hypotheses as Bayesian priors in a generative mixed transition Markov chain model, and compares their plausibility utilizing Bayes factors. We discuss analytical and approximate inference methods for calculating the marginal likelihoods for Bayes factors, give guidance on interpreting the results, and illustrate our approach with several experiments on synthetic and empirical data from Wikipedia and Flickr. Thus, this work enables a novel kind of analysis for studying sequential data in many application areas.
- Polarized (POL) training intensity distribution (TID) emphasizes high-volume low-intensity exercise in zone (Z)1 (< first lactate threshold) with a greater proportion of high-intensity Z3 (> second lactate threshold) compared to Z2 (between first and second lactate threshold). In highly trained rowers there is a lack of prospective controlled evidence whether POL is superior to pyramidal (PYR; i.e. greater volume in Z1 vs. Z2 vs. Z3) TID. The aim of the study was to compare the effect of POL vs. PYR TID in rowers during an 11-wk preparation period. Fourteen national elite male rowers participated (age: 20±2 years, maximal oxygen uptake (⩒O2max): 66±5 mL/min/kg). The sample was split into PYR and POL by varying the percentage spent in Z2 and Z3 while Z1 was clamped to ~93% and matched for total and rowing volume. Actual TIDs were based on time within heart rate zones (Z1 and Z2) and duration of Z3-intervals. The main outcome variables were average power in 2000 m ergometer-test (P2000m), power associated with 4 mmol/L [blood lactate] (P4[BLa]), and ⩒O2max. To quantify the level of polarization, we calculated a Polarization-Index as log (%Z1 x %Z3/%Z2). PYR and POL did not significantly differ regarding rowing or total volume, but POL had a higher percentage of Z3 intensities (6±3% vs. 2±1%; p < .005) while Z2 was lower (1±1% vs. 3±2%; p < .05) and Z1 was similar (94±3% vs. 93±2%, p = .37). Consequently, Polarization-Index was significantly higher in POL (3.0±0.7 a.u. vs. 1.9±0.4 a.u.; p < .01). P2000m did not significantly change with PYR (1.5±1.7%, p = .06) nor POL (1.5±2.6%, p = .26). ⩒O2max did not change (1.7±5.6%, p = .52 or 0.6±2.6, p = .67) and a small increase in P4[BLa] was observed in PYR only (1.9±4.8%, p = .37 or -0.5±4.1%, p = .77). Changes from pre to post were not significantly different between groups in any performance measure.
POL did not prove to be superior to PYR, possibly due to the high and very similar percentage of Z1 in this study.
- The aim of this pilot study was to analyze the off-training physical activity (PA) profile in national elite German U23 rowers during 31 days of their preparation period. The hours spent in each PA category (i.e. sedentary: <1.5 MET; light physical activity: 1.5–3 MET; moderate physical activity: 3–6 MET and vigorous intense physical activity: >6 MET) were calculated for every valid day (i.e. > 480 min of wear time). The off-training PA during 21 weekdays and 10 weekend days of the final 11-wk preparation period was assessed by a wrist-worn multisensory device (Microsoft Band II (MSBII)). A total of 11 rowers provided valid data (i.e. > 480 min/day) for 11.6 weekdays and 4.8 weekend days during the 31-day observation period. The average sedentary time was 11.63±1.25 hours per day during the week and 12.49±1.10 hours per day on the weekend, with a tendency to be higher on the weekend compared to weekdays (p = 0.06; d = 0.73). The average time in light, moderate and vigorous PA during the weekdays was 1.27±1.15, 0.76±0.37, 0.51±0.44 hours per day and 0.67±0.43, 0.59±0.37, 0.53±0.32 hours per weekend day. Light physical activity was higher during weekdays compared to the weekend (p = 0.04; d = 0.69). Based on our pilot study of eleven national elite rowers we conclude that rowers display a considerable sedentary off-training behavior of more than 11.5 hours/day.
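The Polarization-Index from the rowing training-intensity study in the 2017 entries above, log (%Z1 x %Z3/%Z2), can be illustrated with a short calculation. This is only a sketch: the abstract writes "log" without a base, so base 10 is an assumption here, and the function name `polarization_index` is hypothetical. Feeding in the group mean percentages is not expected to reproduce the reported group means exactly, since those were presumably averaged over individual athletes.

```python
import math


def polarization_index(z1_pct: float, z2_pct: float, z3_pct: float) -> float:
    """Return log10(%Z1 * %Z3 / %Z2) in arbitrary units.

    Base 10 is an assumption; the abstract only says "log".
    Inputs are percentages of training spent in each intensity zone.
    """
    if min(z1_pct, z2_pct, z3_pct) <= 0:
        # The logarithm and the division are undefined for non-positive shares.
        raise ValueError("all zone percentages must be positive")
    return math.log10(z1_pct * z3_pct / z2_pct)


# Group mean percentages quoted in the abstract (Z1, Z2, Z3).
pol = polarization_index(94, 1, 6)  # polarized group
pyr = polarization_index(93, 3, 2)  # pyramidal group
print(f"POL: {pol:.2f} a.u., PYR: {pyr:.2f} a.u.")
```

A larger value indicates a distribution with relatively more Z3 than Z2, which matches the direction of the group difference reported in the abstract.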
2016
- Social tagging systems have established themselves as a quick and easy way to organize information by annotating resources with tags. In recent work, user behavior in social tagging systems was studied, that is, how users assign tags, and consume content. However, it is still unclear how users make use of the navigation options they are given. Understanding their behavior and differences in behavior of different user groups is an important step towards assessing the effectiveness of a navigational concept and of improving it to better suit the users' needs. In this work, we investigate navigation trails in the popular scholarly social tagging system BibSonomy from six years of log data. We discuss dynamic browsing behavior of the general user population and show that different navigational subgroups exhibit different navigational traits. Furthermore, we provide strong evidence that the semantic nature of the underlying folksonomy is an essential factor for explaining navigation.
- We present a new method for detecting interpretable subgroups with exceptional transition behavior in sequential data. Identifying such patterns has many potential applications, e.g., for studying human mobility or analyzing the behavior of internet users. To tackle this task, we employ exceptional model mining, which is a general approach for identifying interpretable data subsets that exhibit unusual interactions between a set of target attributes with respect to a certain model class. Although exceptional model mining provides a well-suited framework for our problem, previously investigated model classes cannot capture transition behavior. To that end, we introduce first-order Markov chains as a novel model class for exceptional model mining and present a new interestingness measure that quantifies the exceptionality of transition subgroups. The measure compares the distance between the Markov transition matrix of a subgroup and the respective matrix of the entire data with the distance of random dataset samples. In addition, our method can be adapted to find subgroups that match or contradict given transition hypotheses. We demonstrate that our method is consistently able to recover subgroups with exceptional transition models from synthetic data and illustrate its potential in two application examples. Our work is relevant for researchers and practitioners interested in detecting exceptional transition behavior in sequential data.
- Identifying plot structure in novels is a valuable step towards automatic processing of literary corpora. We present an approach to classify novels as either having a happy ending or not. To achieve this, we use features based on different sentiment lexica as input for an SVM classifier, which yields an average F1-score of about 73%.
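The idea in the exceptional transition behavior entry of the 2016 list above, comparing a subgroup's first-order Markov transition matrix with that of the entire data and with random samples, can be sketched roughly as follows. This is an illustrative stand-in, not the paper's interestingness measure: the unweighted Manhattan distance, the "share of random samples that are closer" score, and all function names are assumptions made for this example.

```python
import random


def transition_matrix(sequences, states):
    """Estimate a row-normalized first-order transition matrix from sequences."""
    index = {state: i for i, state in enumerate(states)}
    n = len(states)
    counts = [[0] * n for _ in range(n)]
    for seq in sequences:
        # Count each observed transition between consecutive states.
        for src, dst in zip(seq, seq[1:]):
            counts[index[src]][index[dst]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Rows without any observed transition stay all-zero.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


def manhattan_distance(m1, m2):
    """Sum of absolute entry-wise differences (assumed distance, not the paper's)."""
    return sum(abs(a - b) for r1, r2 in zip(m1, m2) for a, b in zip(r1, r2))


def exceptionality(subgroup, all_sequences, states, n_samples=200, seed=0):
    """Return the subgroup's distance to the overall matrix and the share of
    equally sized random samples that lie closer to the overall matrix."""
    rng = random.Random(seed)
    overall = transition_matrix(all_sequences, states)
    observed = manhattan_distance(transition_matrix(subgroup, states), overall)
    closer = 0
    for _ in range(n_samples):
        sample = rng.sample(all_sequences, len(subgroup))
        if manhattan_distance(transition_matrix(sample, states), overall) < observed:
            closer += 1
    return observed, closer / n_samples


# Hypothetical toy data: most sequences alternate, two stay in state "a".
alternating = [list("abababab") for _ in range(10)]
staying = [list("aaaaaaaa") for _ in range(2)]
distance, share = exceptionality(staying, alternating + staying, ["a", "b"])
print(distance, share)
```

A subgroup whose distance exceeds that of most random samples of the same size would be a candidate for exceptional transition behavior under these assumptions.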
2015
- This paper presents exploratory subgroup analytics on ubiquitous data: We propose subgroup discovery and assessment approaches for obtaining interesting descriptive patterns and provide a novel graph-based analysis approach for assessing the relations between the obtained subgroup set. This exploratory visualization approach allows for the comparison of subgroups according to their relations to other subgroups and for the inclusion of further parameters, e.g., geo-spatial distribution indicators. We present and discuss analysis results utilizing real-world data given by geo-tagged noise measurements with associated subjective perceptions and a set of tags describing the semantic context.
- A distance measure between objects is a key requirement for many data mining tasks like clustering, classification or outlier detection. However, for objects characterized by categorical attributes, defining meaningful distance measures is a challenging task since the values within such attributes have no inherent order, especially without additional domain knowledge. In this paper, we propose an unsupervised distance measure for objects with categorical attributes based on the idea that categorical attribute values are similar if they appear with similar value distributions on correlated context attributes. Thus, the distance measure is automatically derived from the given data set. We compare our new distance measure to existing categorical distance measures and evaluate on different data sets from the UCI machine-learning repository. The experiments show that our distance measure is recommendable, since it achieves similar or better results in a more robust way than previous approaches.
- The issue of sustainability is at the top of the political and societal agenda, being considered of extreme importance and urgency. Human individual action impacts the environment both locally (e.g., local air/water quality, noise disturbance) and globally (e.g., climate change, resource use). Urban environments represent a crucial example, with an increasing realization that the most effective way of producing a change is involving the citizens themselves in monitoring campaigns (a citizen science bottom-up approach). This is possible by developing novel technologies and IT infrastructures enabling large citizen participation. Here, in the wider framework of one of the first such projects, we show results from an international competition where citizens were involved in mobile air pollution monitoring using low cost sensing devices, combined with a web-based game to monitor perceived levels of pollution. Measures of shift in perceptions over the course of the campaign are provided, together with insights into participatory patterns emerging from this study. Interesting effects related to inertia and to direct involvement in measurement activities rather than indirect information exposure are also highlighted, indicating that direct involvement can enhance learning and environmental awareness. In the future, this could result in better adoption of policies towards decreasing pollution.
2014
- The combination of ubiquitous and social computing is an emerging research area which integrates different but complementary methods, techniques and tools. In this paper, we focus on the Ubicon platform, its applications, and a large spectrum of analysis results. Ubicon provides an extensible framework for building and hosting applications targeting both ubiquitous and social environments. We summarize the architecture and exemplify its implementation using four real-world applications built on top of Ubicon. In addition, we discuss several scientific experiments in the context of these applications in order to give a better picture of the potential of the framework, and discuss analysis results using several real-world data sets collected utilizing Ubicon.
- Sensor data is objective. But when measuring our environment, measured values are contrasted with our perception, which is always subjective. This makes interpreting sensor measurements difficult for a single person in her personal environment. In this context, the EveryAware project directly connects the concepts of objective sensor data with subjective impressions and perceptions by providing a collective sensing platform with several client applications that allow users to explicitly associate those two data types. The goal is to provide the user with personalized feedback, a characterization of the global as well as her personal environment, and enable her to position her perceptions in this global context. In this poster we summarize the collected data of two EveryAware applications, namely WideNoise for noise measurements and AirProbe for participatory air quality sensing. Basic insights are presented including user activity, learning processes and sensor data to perception correlations. These results provide an outlook on how this data can further be used to understand the connection between sensor data and perceptions.
2013
- The development of ICT infrastructures has facilitated the emergence of new paradigms for looking at society and the environment over the last few years. Participatory environmental sensing, i.e. directly involving citizens in environmental monitoring, is one example, which is hoped to encourage learning and enhance awareness of environmental issues. In this paper, an analysis of the behaviour of individuals involved in noise sensing is presented. Citizens have been involved in noise measuring activities through the WideNoise smartphone application. This application has been designed to record both objective (noise samples) and subjective (opinions, feelings) data. The application has been open to be used freely by anyone and has been widely employed worldwide. In addition, several test cases have been organised in European countries. Based on the information submitted by users, an analysis of emerging awareness and learning is performed. The data show that changes in the way the environment is perceived after repeated usage of the application do appear. Specifically, users learn how to recognise different noise levels they are exposed to. Additionally, the subjective data collected indicate an increased user involvement in time and a categorisation effect between pleasant and less pleasant environments.
- With the rising popularity of smart mobile devices, sensor data-based applications have become more and more popular. Their users record data during their daily routine or specifically for certain events. The application WideNoise Plus allows users to record sound samples and to annotate them with perceptions and tags. The app is being used to document and map the soundscape all over the world. The procedure of recording, including the assignment of tags, has to be as easy-to-use as possible. We therefore discuss the application of tag recommender algorithms in this particular scenario. We show that this task is fundamentally different from the well-known tag recommendation problem in folksonomies, as users no longer tag fixed resources but rather sensory data and impressions. The scenario requires efficient recommender algorithms that are able to run on the mobile device, since Internet connectivity cannot be assumed to be available. Therefore, we evaluate the performance of several tag recommendation algorithms and discuss their applicability in the mobile sensing use-case.
- An increasing number of platforms like Xively or ThingSpeak are available to manage ubiquitous sensor data enabling the Internet of Things. Strict data formats allow interoperability and informative visualizations, supporting the development of custom user applications. Yet, these strict data formats as well as the common feed-centric approach limit the flexibility of these platforms. We aim at providing a concept that supports data ranging from text-based formats like JSON to images and video footage. Furthermore, we introduce the concept of extensions, which allows existing data points to be enriched with additional information, thus taking a data point centric approach. This enables us to gain semantic and user specific context by attaching subjective data to objective values. This paper provides an overview of our architecture including concept, implementation details and present applications. We distinguish our approach from several other systems and describe two sensing applications, namely AirProbe and WideNoise, that were implemented for our platform.
2012
- The connection of ubiquitous and social computing is an emerging research area which combines two prominent areas of computer science. In this paper, we tackle this topic from different angles: We describe data mining methods for ubiquitous and social data, specifically focusing on physical and social activities, and provide exemplary analysis results. Furthermore, we give an overview of the Ubicon platform, which provides a framework for the creation and hosting of ubiquitous and social applications for diverse tasks and projects. Ubicon features the collection and analysis of both physical and social activities of users for enabling inter-connected applications in ubiquitous and social contexts. We summarize three real-world systems built on top of Ubicon, and discuss the corresponding mining and analysis aspects by example.
- Exceptional model mining has been proposed as a variant of subgroup discovery especially focusing on complex target concepts. Currently, efficient mining algorithms are limited to heuristic (non-exhaustive) methods. In this paper, we propose a novel approach for fast exhaustive exceptional model mining: We introduce the concept of valuation bases as an intermediate condensed data representation, and present the general GP-growth algorithm based on FP-growth. Furthermore, we discuss the scope of the proposed approach by drawing an analogy to data stream mining and provide examples for several different model classes. Runtime experiments show improvements of more than an order of magnitude in comparison to a naive exhaustive depth-first search.