- XYT : from raw GPS data to Mobility analysis
- 1. Abstract
- 2. Background
- 2.1 New data, new tools
- 2.2 Privacy concerns
- 2.3 Spatial data science
- 2.4 Previous works
- 3. Objectives and contribution
- 3.1 Data ingestion, management, and enrichment
- 3.2 Pipeline of XYT
- 3.3 Taxonomy of XYT
- 4. Methods and specifics
- 4.1 Generate fake gps data for easy manipulation
- 4.2 Preprocessing pipeline from raw GPS to mobility data
- 4.3 Organization of information
- 4.4 Graph-based approach
- 4.5 Spatial analytics
- 4.6 Privacy
- 5. Input gps format
- 5.1 Things to know about geodata
- 5.2 Typical data format for practitioners and researchers
- 5.3 Example of existing datasets fully compatible with XYT
- 5.4 Contextual data: enriching GPS and Mobility Analysis
- 6. List of instances and ‘public’ methods in XYT
- 7. Web-app
- 7.1 Streamlit demo
- 7.2 Swiss proximity
- 8. Future improvements
- 8.1 Full roadmap of the library
- 8.2 Upcoming developments
- 9. Release
- 10. Conclusion
- 11. Acknowledgments
- References
XYT : from raw GPS data to Mobility analysis
1. Abstract
The XYT project is a Python library that contributes to the growing practice of open research data in the field of Human Urban Mobility and City Science. With the increasing amount of urban dynamics generating extensive and heterogeneous digital data, there is a need to effectively handle large, multi-sourced datasets in various formats and standards. While geolocation data is private and sensitive, public transit schedules are open data, presenting a mix of restriction levels.
Traditionally, travel surveys such as transport micro-census or paper-based surveys have been used for travel analysis. However, the pervasiveness of digital technologies in the past decade has given data collection a whole new dimension, providing researchers with the opportunity to collect accurate travel data using GPS receivers.
There is an increasing need from practitioners and researchers in the field of transport and mobility analysis and planning to leverage GPS data. Yet, GPS data in its raw form only provides geolocalized timestamps, which offer limited insight on their own. The Python library XYT proposes an integrated pipeline to transform raw GPS data into mobility data.
Several instances are developed to: generate fake GPS data so the library can be manipulated without privacy risks; pre-process, label, and enrich the GPS data; obfuscate and aggregate data for privacy purposes; parse the GPS data into a format readable by mobility experts, including a pipeline for studying the action space of users; and abstract mobility diaries into graphs to perform fast, scalable, privacy-conscious operations.
By packaging these contributions into an open-source Python library, XYT aims to enhance accessibility and foster collaboration within the transport and urban researcher community. Lastly, a user-friendly Streamlit app is also available to showcase the library's capabilities and facilitate integration into diverse projects in the field of urban data analytics.
2. Background
The convergence of mature technologies, digital services, regulations, and strategic initiatives has recently unveiled unprecedented potential within the field of Urban Ecology – in particular because of the growing availability of fine-grained data – ranging from open data (e.g. OSM, GTFS) to restricted data (e.g. personal geolocation data) – that await to be transformed and valorized.
2.1 New data, new tools
In the contemporary landscape, urban dynamics generate vast digital datasets characterized by size, heterogeneity, noise, and diverse formats.
Python and R are the two most widely used programming languages in the coding community. Python's data science libraries are popular for their centralized and organized structure, as well as their compatibility with third-party platforms for creating dashboards and visualizations. However, some suggest that R still offers a more comprehensive selection of geospatial libraries compared to the libraries available for data science in general, giving it an edge for specific geospatial tasks.
Whether R or Python is the better language for a particular task depends on the project's specifics. For this research, Python was the primary language used for coding due to its compatibility with third-party platforms for drawing dashboards or visualizing data.
In addition to the two main languages, there are a number of essential tools available for geospatial data analysis and manipulation. Figure XX shows a non-exhaustive list of existing tools that can be used to manipulate, transform, and analyze spatial data. These tools may include libraries for R, libraries for Python, and platforms for storage, visualization, dashboarding, or Geographic Information System (GIS) processing. Additionally, spatial data can be quite large and complex, often requiring specific storage standards and resources to manage them (e.g. PostGIS and Postgres).
Over the last few years, the transportation-related Open Research community has converged and shared their work. The awesome-transit community (Center for Urban Transportation Research, 2021), the Urban Data Lab (Boeing, 2020), and the MATSim community (2022) have collectively attracted more than a hundred contributors in the field.
Within this context, the XYT project aims to address four critical needs: (i) providing a framework to unify diverse urban dynamics data, (ii) articulating open and restricted data, (iii) harmonizing supply and demand data, and (iv) tracking a privacy metric. This project seeks to develop and release an open Python package that contributes to Urban Mobility Open Research Data practices. The proposed package will augment existing open Python packages like "tracktotrip" (2021), "movingpandas" (2019), "scikit-mobility" (2021), "gtfs_function" (2020), "osmnx" (2017), and "MATSim" (2022).
2.2 Privacy concerns
Managing such data involves navigating privacy concerns too, with geolocation data being sensitive while public transit schedules remain open data. In the current scenario, smartphones play a pivotal role in generating digital data, but concerns about privacy have emerged due to data collection scandals and location-sharing controls in mobile operating systems.
To address these concerns, the XYT Python library is designed to process raw GPS data while safeguarding user privacy as much as possible.
2.3 Spatial data science
Spatial data science is a subset of data science that focuses on the unique characteristics of spatial data, moving beyond simply looking at "where things happen" to understand "why they happen there" (CARTO, 2023). The Center for Spatial Data Science at the University of Chicago is a leading institute in the field. It develops state-of-the-art methods for geospatial analysis, spatial econometrics, and geo-visualization, and implements them through open-source software tools. More broadly, spatial data science has become increasingly important in recent years, as the need to analyze large amounts of data has grown substantially. There are now numerous ways in which spatial data science can be used to better understand the spatial dimension of activity-travel behaviors. In XYT, spatial data science methods are used for graph and network manipulation, cartography, and action space analysis.
2.4 Previous works
This project benefits from a large real-world dataset of GPS data from individuals using a tracking app in their daily lives, including the MOBIS data (2022) and the Lemanic Panel data (2024).
The XYT project builds upon Timur Lavrov's (2020) and Michal Pleskowicz's (2020) projects at the Swiss Data Science Center (EPFL), focusing on mobility analysis using GPS data and Privacy Enhancing Technologies (PETs). Lavrov's use of the Ramer-Douglas-Peucker algorithm for dataset simplification is omitted here for the sake of the higher accuracy needed for transit itinerary inference.
The subsequent section provides an overview of methods described in past literature for cleaning, splitting, and detecting modes and transit itineraries, citing notable works by Schuessler and Axhausen (2008), Stenneth et al. (2011), Patterson and Fitzimmons (2016), Bucher et al. (2019), etc.
3. Objectives and contribution
The core contribution of our library is a comprehensive data processing pipeline for upgrading raw GPS data to (analytics-ready) mobility data. This includes:
- leveraging machine learning approaches to generate fake GPS data;
- synthesizing multiple pipeline steps, leveraging pre-existing solutions, to yield complete trip information from raw GPS smartphone data;
- introducing two privacy-preserving methods, detailed in their implementation, with insights into potential attacks;
- instantiating data transformation and spatial data science methods so that the data can be used and visualized by practitioners.
To enhance accessibility, we package these contributions into the XYT library on PyPI, enabling developers to easily install and integrate our library into their projects.
An interactive Streamlit app showcasing our library's capabilities is available online. It provides a practical demonstration of our library's potential, creating a bridge between developers and the powerful functionalities of XYT. This direct engagement ensures that our library is not only robust but also user-friendly, fostering a seamless integration into diverse projects within the rapidly evolving landscape of urban data analytics.
3.1 Data ingestion, management, and enrichment
Preprocessing of spatial data is a necessary step before performing spatial data science. This process typically consists of four core steps: data ingestion and management, data enrichment, data analysis, and data visualization. Data ingestion and management includes tasks such as data cleaning, data compatibility, privacy measurements, and data transformation to convert raw geolocation data into mobility data. Raw geolocation data is valuable because it can be collected seamlessly and widely. However, geolocation data is far from usable in behavioral analyses as it only contains noisy georeferenced timestamps (longitude, latitude, and timestamp). Data enrichment completes the transformation of raw data into actual mobility data (e.g., trips, modes, points-of-interest). It consists of, for example, labeling the data with mode of transport or activity type, and integrating contextual data (often open data). Data analysis uses spatial data science and statistical pattern recognition to draw meaningful insights from the data. Data visualization uses cartography and graphs to present the data in an understandable format, making it easier to interpret the results and draw conclusions.
3.2 Pipeline of XYT
Fake Data generation
GPS data are sensitive data in terms of the GDPR, meaning that personal information can be retrieved from them. We suggest that, before manipulating real data, any analytic pipeline should first be built on fake data. The objective is to provide fake (non-sensitive) data so that scripts can be developed safely without infringing on users' privacy.
Data Preprocessing
Analyzing raw GPS data for mobility analysis involves important preprocessing steps: data cleaning, splitting the data into trip segments, incorporating contextual data, and mode detection. By following these steps, we can transform raw GPS data into a format suitable for mobility analysis. Each step plays a crucial role in ensuring the accuracy and usability of the data, ultimately leading to valuable insights about user mobility patterns.
Data privacy
XYT is a privacy-conscious GPS analytical pipeline. It lets the developer obfuscate and aggregate the data to match their needs in terms of granularity. For instance, one may not need to know the exact locations of home and work, but only a home and a work area. Similarly, one may not need to know the exact times of departure or arrival, but the number of home-loops in a day. XYT offers several ways to adjust the granularity of the data.
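As an illustration of this kind of granularity reduction, here is a minimal sketch (not XYT's actual API) that snaps coordinates to a coarse grid, so that only an area rather than an exact point is retained:

```python
def obfuscate_to_grid(lat, lon, cell_deg=0.01):
    """Snap a coordinate to a grid cell (~1.1 km at the equator for
    0.01 degrees), so only the area, not the exact point, is kept."""
    lat_cell = round(lat / cell_deg) * cell_deg
    lon_cell = round(lon / cell_deg) * cell_deg
    return round(lat_cell, 6), round(lon_cell, 6)
```

Two nearby points fall into the same cell, which is exactly the property one wants when publishing home or work *areas* rather than addresses.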
Space-based analytics
The accuracy of the sensing device is crucial to yield clean and accurate analyses. XYT therefore instantiates a pipeline to cluster nearest neighbors using density-based spatial clustering of applications with noise (DBSCAN). The distance-based approach employs distance metrics to measure the similarity between data points. It can be used to measure the discrepancy between two metrics, such as travel demand and transit supply, with the least squares method. It can also be used to clean GPS data with the Gaussian smoothing algorithm (i.e., remove outliers) or to perform activity clustering to yield major Points of Interest (a.k.a. POIs).
Time-based analytics
Several temporal challenges appear when manipulating geolocation data. In particular, this calls for several levels of time-based aggregation, be it to the minute, the hour, or the day. GPS data is a sequence of single geolocated points in time, which is often far too granular (and therefore heavy) for mobility analysis.
Graph analysis
The weight of the data is often an issue when processing large-scale mobility data. The typical way of representing mobility data is a large table in which each row represents a leg or a stay-point, identified with a user id. We argue that structuring the data as a directed graph is a fast, computationally tractable, privacy conscious approach to study multi-day geolocation data. We suggest that these premises are promising for studying large-scale micro-level multi-day mobility behaviors. Python libraries such as NetworkX provide efficient tools for further exploring the network structure, dynamics and functions (Hagberg et al., 2008), as well as for geospatial analysis. The XYT project instantiates a set of transformations and methods to make GPS data graph-ready.
Action space analysis
Lastly, XYT proposes an interpretation of the graph topology (or shape) by exploring activity space and locational regularity.
3.3 Taxonomy of XYT
The XYT project aims at being a collaborative library that leverages scientific works. Collaborations are primarily expected from the EPFL transport laboratories (e.g. LaSUR directed by Vincent Kaufmann or TRANSP-OR directed by Michel Bierlaire), and will hopefully benefit from a wider community.
At the end of 2023, we launched a first version with the main instances that give a first impulse to the XYT project. These include “preprocess”, “analyze” and “apply”, as introduced in the previous sections and in Figure XX. The “model” instance will be integrated in future versions.
4. Methods and specifics
In section 4, we will discuss the XYT library in detail. This includes insights on real GPS data and generating synthetic location-based data. We will outline the preprocessing pipeline from raw GPS data to mobility data, which includes steps like data cleaning, segmentation, contextual data incorporation, and mode detection. We will also cover privacy concerns in managing geolocation data. Additionally, we will explore spatial data science concepts and techniques for urban data analysis. XYT utilizes these methods for graph analysis, action space analysis, and spatial-temporal analytics.
4.1 Generate fake gps data for easy manipulation
4.1.1 Insight on real GPS data
Real-world GPS data generally consists of a series of latitude (x) and longitude (y) tuples with a timestamp (t). In some cases, the smartphone also reports an accuracy measure that depends on the signal acquisition and other redundancies in the signal collection process (e.g. use of the accelerometer embedded in the device).
When delivered from GPS data providers (e.g. MotionTag), data generally have the following columns : < user id, tracked at, latitude, longitude, created at, accuracy, speed, altitude >.
Table X: What GPS data generally looks like when delivered from GPS data providers: the first rows of the waypoints file for a selected user

| user_id | type | started_at | started_at_timezone | finished_at | mode | purpose | geometry |
|---------|------|------------|---------------------|-------------|------|---------|----------|
| AAALY | Staypoint | 2023-05-30 16:28:30 | Europe/Zurich | 2023-05-31 01:38:59 | NaN | home | POINT (lon1 lat1) |
| AAALY | Leg | 2023-05-31 01:38:59 | Europe/Zurich | 2023-05-31 01:46:16 | Mode::Bicycle | NaN | LINESTRING (lon1 lat1, lon2 lat2... |
| AAALY | Staypoint | 2023-05-31 01:46:16 | Europe/Zurich | 2023-05-31 01:47:30 | NaN | work | POINT (lon1 lat1) |
| AAALY | Leg | 2023-05-31 01:47:30 | Europe/Zurich | 2023-05-31 01:48:26 | Mode::Walk | NaN | LINESTRING (lon1 lat1, lon2 lat2... |
XYT provides a simple way to generate synthetic data based on the data structure introduced in Table X using the class `FakeDataGenerator`.
4.1.2 Pipeline to generate fake data
`FakeDataGenerator` is a Python class that generates synthetic location-based data for testing and demonstration purposes. It allows developers to conduct preprocessing steps, develop location-based applications, and run demonstrations, prototyping, and data analysis on controlled datasets, all without compromising users' privacy. `FakeDataGenerator` can generate various types of location-based data tailored to specific use cases and requirements.
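A minimal sketch of how such synthetic data can be produced with pandas is shown below. The schema follows Table X (geometry column omitted for brevity); `FakeDataGenerator`'s real interface may differ:

```python
import random
from datetime import datetime, timedelta

import pandas as pd


def fake_day(user_id="AAALY", start=datetime(2023, 5, 31, 7, 0), seed=42):
    """Generate one synthetic day alternating staypoints and legs."""
    rng = random.Random(seed)
    rows, t = [], start
    purposes = ["home", "work", "leisure", "home"]
    for i, purpose in enumerate(purposes):
        stay_end = t + timedelta(minutes=rng.randint(60, 300))
        rows.append({"user_id": user_id, "type": "Staypoint",
                     "started_at": t, "finished_at": stay_end,
                     "mode": None, "purpose": purpose})
        t = stay_end
        if i < len(purposes) - 1:  # a leg links consecutive staypoints
            leg_end = t + timedelta(minutes=rng.randint(5, 45))
            rows.append({"user_id": user_id, "type": "Leg",
                         "started_at": t, "finished_at": leg_end,
                         "mode": rng.choice(["Mode::Walk", "Mode::Bicycle",
                                             "Mode::Train"]),
                         "purpose": None})
            t = leg_end
    return pd.DataFrame(rows)
```

Seeding the generator makes the synthetic diaries reproducible, which is useful when writing tests against the preprocessing pipeline.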
In some cases, the GPS data provider may also offer a preprocessed, slightly enhanced dataset that includes several automated steps. This dataset consists of < user id, leg id, started at, finished at, validated mode, detected mode, purpose, type, confirmed at, geometry >.
4.2 Preprocessing pipeline from raw GPS to mobility data
The process of analyzing raw GPS data from a mobility perspective involves several important preprocessing steps.
4.2.1 Cleaning
This step is designed to address any issues or inconsistencies in the raw GPS data. Raw GPS data often contains corrupted or irrelevant points that are not useful for analysis. Additionally, there may be inaccuracies in latitude and longitude readings that need to be resolved. By performing data cleaning, we can ensure that the data is of high quality and suitable for further analysis.
In addition to filtering out data points that fall outside the analysis area (e.g., points from neighboring countries in the case of a Swiss dataset), the cleaning process mainly involves removing outliers in trajectories, as well as smoothing and clustering points of interest (POIs).
Smoothing
The purpose of Gaussian smoothing is to reduce errors in the GPS coordinates' accuracy, resulting in clearer speed and acceleration approximations. We smooth the GPS coordinates instead of the user's speed, as speed readings are often unavailable in the files. Our assumption is that people do not make frequent changes in heading while in transit or engaged in an activity (with the exception of sports, such as squash or football). We use the Gaussian kernel implemented by Schuessler and Axhausen (2009). For each coordinate in {latitude, longitude}, the smoothed value at time t is calculated individually.
$$\hat{x}_c(t) = \frac{\sum_i w(t, t_i)\, x_c(t_i)}{\sum_i w(t, t_i)}$$

with $x_c(t_i)$ being the raw value of the coordinate at time $t_i$, and $w$ the Gaussian kernel function, computed for each point in time by

$$w(t, t_i) = \exp\left(-\frac{(t - t_i)^2}{2\sigma^2}\right)$$

The kernel bandwidth, represented by $\sigma$, is set to 10 seconds, which results in a 15-second smoothing range. This is assumed to be a reasonable time frame for real behavioural changes as opposed to signal jumps.
We implemented a vectorized version of this method that leverages NumPy's optimized operations. While it worked well on small waypoints files, it proved unusable on bigger ones. The cause is that the function builds an n × n weights matrix, where n is the number of datapoints in the input: each row i of the matrix contains the kernel weights $w(t_i, t_j)$, calculated at every time $t_j$. Considering that the positional coordinates are represented by 64-bit floating point numbers, and that the size of the weights matrix grows quadratically with the size of the input data, we quickly run out of memory when cleaning big files. A waypoints file with n = 200 000 entries, which is not uncommon in our dataset, needs 64 bits × 200 000² / (8 × 1024³) ≈ 298 GB for the weight matrix alone, which is clearly not tractable.
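A memory-safe alternative computes one row of weights at a time instead of materializing the full n × n matrix. The sketch below illustrates the idea (it is not the library's exact implementation):

```python
import numpy as np


def gaussian_smooth(values, times, sigma=10.0):
    """Kernel-smooth one coordinate series (times in seconds).
    Weights are built one row at a time, so peak memory stays O(n)."""
    times = np.asarray(times, dtype=float)
    values = np.asarray(values, dtype=float)
    out = np.empty_like(values)
    for i, t in enumerate(times):
        # one row of the would-be weights matrix
        w = np.exp(-((times - t) ** 2) / (2.0 * sigma ** 2))
        out[i] = np.dot(w, values) / w.sum()
    return out
```

Trading the fully vectorized form for a per-row loop keeps each iteration a fast NumPy operation while bounding memory to a single weight vector.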
Figure XX : Example of the raw GPS data before smoothing, the consecutive points have been connected for the sake of visualizing the order of points
After smoothing our coordinates, we calculate additional information about the datapoints, namely speed and acceleration. The directional speed for each coordinate c is the first derivative with respect to t of the smoothed position, and the acceleration is the second derivative with respect to t. Both are set to 0 at $t_0$. The Haversine formula is used to calculate the distance between two consecutive points.
The Haversine formula defines the great-circle distance between two points on a sphere, given their latitudes and longitudes. It is defined as

$$d = 2r \arcsin\left(\sqrt{\sin^2\left(\frac{\varphi_2 - \varphi_1}{2}\right) + \cos\varphi_1 \cos\varphi_2 \sin^2\left(\frac{\lambda_2 - \lambda_1}{2}\right)}\right)$$

where
- $\varphi_1, \varphi_2$ are the latitudes of point 1 and point 2 (in radians),
- $\lambda_1, \lambda_2$ are the longitudes of point 1 and point 2 (in radians),
- $r$ is the radius of the sphere.
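The formula translates directly into code (Earth radius assumed to be 6371 km):

```python
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2, r=6371.0):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * r * asin(sqrt(a))
```

One degree of latitude corresponds to roughly 111 km, which is a convenient sanity check.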
Clustering
In addition to the processing above, the POI detection module detects home and workplace locations and tags them with home or work in the purpose column of the above DataFrame. The algorithm uses a 500 m × 500 m grid and assigns each waypoint to the corresponding cell in the grid. Each cell is given a unique number with the Cantor pairing function $\pi(a, b) = \tfrac{1}{2}(a + b)(a + b + 1) + b$, where a and b are derived from the latitude and longitude, expressed in degrees. The average point in the cell with the most points is defined as the home location, while the average point in the most visited cell on weekdays is the work location (or the second most visited weekday cell in case the first one coincides with the home location). All the activity legs are then tagged accordingly with one of the corresponding POIs.
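A sketch of the cell-id assignment, under the assumption that a and b are first converted to non-negative integer grid indices (the 0.0045° cell size, which approximates 500 m of latitude, is an assumption):

```python
def cantor_pair(a, b):
    """Unique id for a pair of non-negative integers."""
    return (a + b) * (a + b + 1) // 2 + b


def cell_id(lat, lon, cell_size_deg=0.0045):
    """Assign a waypoint to a ~500 m grid cell and return its Cantor id.
    Offsets shift latitude/longitude into non-negative ranges first."""
    a = int((lat + 90.0) / cell_size_deg)
    b = int((lon + 180.0) / cell_size_deg)
    return cantor_pair(a, b)
```

Because the Cantor pairing function is a bijection on pairs of non-negative integers, two waypoints get the same id exactly when they fall in the same cell.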
4.2.2 Splitting
Once the data has been cleaned, the next task is to group the individual GPS points into staypoints or legs. Waypoints can be categorized as either activities (i.e. staypoint such as staying at home or at work) or transport (i.e. leg such as a bus or train ride).
Splitting involves:
- adding an MTP (mode transfer point) column to flag start and end points,
- detecting activities using a sliding window approach. The window size is 180s, and the mean speed within the window is calculated. If the mean speed is below the threshold of 0.5m/s (see Yazdizadeh et al., 2018), it is considered an activity. A centroid and radius are calculated, and points within the radius and time frame are added to the activity window. Points in the window are flagged as activities, with the first and last points flagged as 'start' and 'end'. The sliding window is then moved outside the flagged window, and the procedure continues.
- identifying changes in mode of transport.
Leg files are created with additional statistics for the subsequent mode of transport classification.
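The activity-detection step above can be sketched as a simplified sliding window (the centroid-and-radius expansion and the MTP flagging are omitted; the 180 s window and 0.5 m/s threshold follow the text):

```python
def flag_activities(timestamps, speeds, window_s=180, speed_threshold=0.5):
    """Flag points whose window has a mean speed below the threshold.
    timestamps in seconds, speeds in m/s; returns 'activity'/'transport'."""
    flags = ["transport"] * len(timestamps)
    i = 0
    while i < len(timestamps):
        # collect all points within window_s of point i
        j = i
        while j < len(timestamps) and timestamps[j] - timestamps[i] <= window_s:
            j += 1
        window = speeds[i:j]
        if window and sum(window) / len(window) < speed_threshold:
            for k in range(i, j):
                flags[k] = "activity"
            i = j  # move the window past the flagged block
        else:
            i += 1
    return flags
```

The real pipeline additionally computes the activity centroid and radius and absorbs nearby points, which this sketch leaves out.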
4.2.3 Mode detection
Ideally, we aim to determine the user's mode of transport based on the characteristics of their trips and contextual data. By analyzing factors such as speed, acceleration, and other relevant features, we can develop algorithms or models to accurately classify the mode of transport used by the user. Currently, we have implemented a simple algorithm, but further research is needed to improve the classification accuracy.
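For illustration only, a naive classifier in the spirit of such a simple algorithm could threshold on leg speed statistics. The thresholds below are illustrative assumptions, not XYT's calibrated values:

```python
def naive_mode(median_speed_kmh, max_speed_kmh):
    """Very rough mode guess from leg speed statistics.
    Thresholds are illustrative assumptions only."""
    if max_speed_kmh < 7:
        return "Mode::Walk"
    if max_speed_kmh < 30:
        return "Mode::Bicycle"
    if median_speed_kmh < 60:
        return "Mode::Car"
    return "Mode::Train"
```

A production classifier would also use acceleration profiles and contextual data (e.g. proximity to transit stops), as discussed above.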
4.3 Organization of information
Mobility data can be represented in different ways, i.e. at different levels of aggregation in both time and space. We describe below the different states that are used in the XYT project.
4.3.1 Three states of data
4.3.2 Different representations in space and time
Graphs and networks are commonly used to represent transit networks as edges and vertices. In Python, the Networkx library allows for the manipulation and analysis of complex networks. It provides data structures for graphs, digraphs, and multigraphs, and implements many standard graph algorithms and network structure and analysis measures.
Action space analysis utilizes point pattern analysis, which is a method for examining and describing patterns of points in space. It involves making inferences about recognized patterns or clusters based on the positioning of the points. Point pattern analysis is often combined with other methods such as network analysis and cluster analysis to gain a more comprehensive understanding of pattern formation.
The first step of spatial analysis involves organizing the spatial (and/or temporal) information contained in the data, as described in Figure XX.
The same information, which includes a sequence of activities and trips over several days, can be represented in three different ways. The first way is to draw a timeline that shows the sequence of activities, with each station labeled to indicate the purpose of the activity. This allows us to analyze the allocation of time and the chaining of activities.
The second way is to abstract the daily chain of trips and activities as a "mobility motif," using edges and vertices to represent the structure and complexity of the mobility project. This allows us to analyze complexity based on entropy and the characteristics of the graph.
The third way is to project the points of interest in space and calculate metrics such as the frequency of visits to each location or the standard deviation ellipse, which serves as an indicator of the action space. From there, we can analyze the distance from the usual action space, the rate of innovation in locational behaviors, and the extent of spatial consumption.
4.4 Graph-based approach
In recent years, researchers have focused on "human mobility motifs" using network theory's pattern identification methods (Schneider et al., 2013). Mobility motifs are the most frequent activity-travel diary structures observed over several days. This approach represents activity chaining as a graph, preserving all data. Schneider et al. (2013) argue that using mobility motifs can enhance the generation of synthetic populations, providing a cost-effective alternative to travel diary surveys for multi-agent models. They also studied mode switching locations and found clear patterns in motif choices over several days. Jiang et al. (2017) translated raw weekday call detail records into meaningful mobility patterns using motifs. Su et al. (2020) proposed a joint pattern recognition method combining motif-based analysis and activity sequence-based analysis. Research shows that a small set of motifs can explain most activity-travel behavior. However, there is currently no research on joint analysis of mobility motifs and activity space. We describe below the methods we used in the instance `GPStoGraphs()`.
4.4.1 Euclidean Directed graphs
In a first approach, we utilize the legs and stay-points of the geolocation data to construct a complete directed graph for each user in the dataset. The Python library NetworkX (Hagberg et al., 2008) is used to encode the graph objects. Each graph, node, and edge can contain attribute-value pairs in an associated attribute dictionary, such as travel mode or activity purpose. The edges are directed to maintain the sequence of travel. This method allows for efficient manipulation of large amounts of data points. After data cleaning, the data is represented as rows with information for each day of observation and each user. Figure XX illustrates an example graph obtained using this method. The complete directed graph includes the geographic coordinates of each node.
The multi-day graphs can be segmented into daily graphs, which can then be further abstracted as directed graphs by removing the geographic information. In these graphs, edges represent legs between visited places, and nodes represent stay-points such as stations or events. They are represented as flattened adjacency matrices, enabling fast calculations and easy storage.
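Building such a directed graph with NetworkX might look as follows. This is a sketch: the node and edge attribute names are assumptions, not XYT's exact schema:

```python
import networkx as nx


def diary_to_graph(stays, modes):
    """Build a directed graph from an ordered list of stay-points.
    stays: [(location_id, purpose), ...]; modes: mode of the leg between
    consecutive stays (len(stays) - 1 entries)."""
    g = nx.DiGraph()
    for loc, purpose in stays:
        g.add_node(loc, purpose=purpose)
    # directed edges preserve the sequence of travel
    for (a, _), (b, _), mode in zip(stays, stays[1:], modes):
        g.add_edge(a, b, mode=mode)
    return g
```

Repeated visits to the same location collapse onto one node, which is what makes the representation compact for multi-day data.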
Previous research has shown that capturing the mobility diary structures can be achieved with 10 to 20 daily graphs. These commonly occurring graphs, known as "mobility motifs," are depicted in Figure XX.
4.4.2 Mobility motifs
To capture the set of mobility motifs, the following steps are computed under the assumption of day-to-day independence:
- First, segment (and label) the data into legs and stay-points based on activity purpose (work, leisure, home, duties). Then filter the mobility diaries in three steps: (i) removing those that begin and end at a different location than the main home, (ii) eliminating those with a total trip distance greater than 300 km, and (iii) labeling all-day-home-stays for later consideration.
- Compute the unique stay-sequence for all diaries (k) and all travelers (i).
- Compute the size (n) of the network, which is the length of the sequence of unique stays.
- Populate a square binary matrix of size n × n with "1" if there are trip(s) between the indexed unique locations (rows and columns).
- Treat isomorphic graphs on a case-by-case basis.
This algorithm yields location-based motifs (LBM). Similarly, replacing the sequence of stay-point locations with the activity-chain yields activity-based motifs (ABM). Nodes are not weighted based on the number of visits, nor are they marked with location or activity information.
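The core of these steps can be sketched as a function turning one day's stay sequence into a flattened binary adjacency matrix (isomorphism handling omitted):

```python
import numpy as np


def daily_motif(stays):
    """Location-based motif: flattened n x n binary matrix of trips
    between the unique stays of one diary, in order of first visit."""
    unique = list(dict.fromkeys(stays))  # unique stays, first-visit order
    idx = {loc: i for i, loc in enumerate(unique)}
    n = len(unique)
    m = np.zeros((n, n), dtype=int)
    for a, b in zip(stays, stays[1:]):
        if a != b:  # consecutive identical stays yield no trip
            m[idx[a], idx[b]] = 1
    return m.flatten()
```

Replacing the location ids with activity labels in the input sequence yields the activity-based variant (ABM) described above.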
Variations of the motif construction process have been recently discussed in other studies (see Jiang et al., 2017:210-5; Su et al., 2020).
4.5 Spatial analytics
We describe below the methods we used in the instance `GPSAnalytics()`.
4.5.1 Centrography
The objective is to generate key metrics that characterize the activity space for a more detailed exploration of spatial familiarity.
Spatial familiarity metrics encompass a comprehensive evaluation of location history, daily activity-space variability, and spatial innovation. Achieving this involves complex data transformations using advanced point-pattern centrography. By leveraging a dataset with labeled locations, including purpose and visit counts over a specific time frame, marked point pattern analysis (PPA) facilitates the study of individual action spaces (Baddeley, Rubak, and Turner 2015).
The implementation of centrography (using the Python Spatial Analysis library) extracts characteristics to describe the activity space:
- Points: Marked visited places with counts of visits, purpose labels (home, work, leisure, duties), unique location IDs, and intensity (average number of event points per unit of the convex hull area).
- Centers: The mean center and weighted mean centers (weighted by the count of visits).
- Distances: Standard distance, which provides a one-dimensional measure of how dispersed visited locations are around their mean center, and the sum of distances from home.
- Shapes: Standard deviational ellipse, which provides a two-dimensional measure of the dispersion of visited locations, and the minimum convex hull of frequently visited places.
This approach heavily relies on the Python library for spatial analysis, PySAL.
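The basic centrographic quantities reduce to a few array operations. The standalone sketch below mirrors what PySAL's pointpats module provides (it is an illustration, not XYT's implementation):

```python
import numpy as np


def centrography(points, visit_counts):
    """points: (n, 2) array of projected coordinates; visit_counts: (n,).
    Returns mean center, weighted mean center, and standard distance."""
    points = np.asarray(points, dtype=float)
    w = np.asarray(visit_counts, dtype=float)
    mean_center = points.mean(axis=0)
    weighted_center = (points * w[:, None]).sum(axis=0) / w.sum()
    # standard distance: dispersion of locations around the mean center
    std_distance = np.sqrt(((points - mean_center) ** 2).sum(axis=1).mean())
    return mean_center, weighted_center, std_distance
```

Weighting by visit counts pulls the center toward habitual places, which is the quantity the home-shift metric below compares against the home location.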
The `GPSAnalytics()` instance yields several joint metrics of the points, centers, distances, and shapes, including regularity, frequency of visits, proximity, and home shift. These metrics are described below:
- Regularity is the fraction of frequently visited places over all places, including places visited once. A small regularity implies high locational innovation (see Schönfelder and Axhausen 2010, 153), and a regularity that tends to 1 implies that the traveler mostly visits well-known locations. Regularity varies continuously between 0 and 1.
- Frequency of visits is a categorical variable with four categories. It differentiates the most visited places over several days of observation from the frequently visited places, the occasionally visited places, and the places visited once (outlying locations).
- Proximity refers to the relative dispersion of the habitual action space. Figure XX shows significant differences between the overlaps of the habitual action space (represented by the hull) and the global action space (represented by the ellipse). Proximity is a measure of this overlap. For instance, a proximity greater than one indicates a dispersed habitual activity space, where places frequently visited over an 8-week period are spread out across the territory. Conversely, a close innovation activity space is characterized by places visited only once or occasionally that are closer to the main home location. Note that proximity is a ratio of standard distances rather than a ratio of hull surfaces over ellipse surfaces; this avoids corner cases where geographically aligned frequent locations would produce a degenerate (near zero-area) hull. Empirically, proximity values generally range between 0 and 3.
- Lastly, home shift is the Euclidean distance between home and the weighted mean center (see Figure 1), which provides a measure of residential isolation. A small home shift means that most activities are performed locally, in an area relatively close to home; a high home shift implies a remote activity space.
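As a rough illustration of these definitions (not the library's internal implementation), the joint metrics reduce to a few lines:

```python
import numpy as np

def std_distance(points):
    """1-D dispersion of points around their mean center."""
    c = points.mean(axis=0)
    return np.sqrt(((points - c) ** 2).sum(axis=1).mean())

def regularity(n_frequent_places, n_all_places):
    """Fraction of frequently visited places over all places (0..1)."""
    return n_frequent_places / n_all_places

def proximity(frequent_points, all_points):
    """Ratio of standard distances: habitual vs. global action space."""
    return std_distance(frequent_points) / std_distance(all_points)

def home_shift(home, points, visits):
    """Euclidean distance between home and the visit-weighted mean center."""
    pts = np.asarray(points, dtype=float)
    w = np.asarray(visits, dtype=float)
    wmc = (pts * w[:, None]).sum(axis=0) / w.sum()
    return float(np.linalg.norm(wmc - np.asarray(home, dtype=float)))
```

For example, a traveler with 4 frequently visited places out of 10 has a regularity of 0.4, indicating substantial locational innovation.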
4.5.2 Action space
Spatial variability is less often discussed in the literature, but can be addressed through the characteristics of individuals' action and activity spaces. The action space is the geographic area where individual or group spatial interaction happens, consisting of both places actually visited and those potentially visited. An activity space, meanwhile, is the subset of all locations with which an individual has direct contact. Activity spaces are largely defined by (a) movement in the vicinity of home, (b) interactions with regular activity locations, and (c) movement between the centers of daily life travel. XYT allows these action spaces to be analyzed, as introduced in Figure XX.
In operational terms, activity spaces are geometric indicators of daily travel patterns which facilitate big data processing by leveraging centrography and spatial point-pattern analyses. Similar to the temporal "habitual degree", the extension of the action space is influenced by the orientation and commitment of out-of-home activities. Constrained activities on weekdays result in a more stable spatial behavior. In contrast, weekends are dominantly influenced by unobserved factors. Studies of action space over multiple days show low spatial variability and low place variety seeking. There are usually 2-4 locations which cover about 70% of all trips within 6 weeks. Although the maximum number of visited locations can reach 60, about 90% of all trips are made to the same 8 locations. These places are often referred to as "important places" or "spatial familiarity". The traditional transportation engineering approach often relies on objective metrics such as "frequency of visit" or "radius of gyration" to characterize the action space. However, this approach is arguably too functional and reductive, considering the literature on place attachment and sense of place, which recalls that "home" is the epicenter of the sensible action space, as it is where social spheres, support networks, and constraints intersect.
Depending on the privacy constraints, the developer has the possibility to leverage XYT to plot the action space within its territorial context.
4.5.3 Contextual data
In future work, our plan is to improve our dataset by including contextual data during the processing stage. This will allow us to achieve more accurate trip splitting, classify modes of transport, and analyze user mobility. Specifically, for modes of transport that are part of a public transport network, we aim to determine the exact service lines used by the user. By incorporating this additional contextual data, we can gain deeper insights into user mobility patterns.
4.5.4 Innovation rate
The innovation rate in transport and mobility analysis refers to the pace at which new locations are visited by users.
As shown in Figure XX, the innovation rate appears to follow a universal law. Testing the innovation rate can help assess the representativeness of the data, among other analytics.
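A minimal sketch of how an innovation ("new places") curve can be computed from a chronological sequence of visited location IDs (illustrative, not the XYT API):

```python
def innovation_curve(visit_sequence):
    """For each step in a chronological sequence of location IDs, return the
    cumulative number of distinct locations seen so far. The innovation rate
    is the pace at which this curve grows."""
    seen, curve = set(), []
    for loc in visit_sequence:
        seen.add(loc)
        curve.append(len(seen))
    return curve

# A traveler oscillating between habitual places with occasional new ones
print(innovation_curve(['home', 'work', 'home', 'gym', 'work', 'cafe', 'home']))
# -> [1, 2, 2, 3, 3, 4, 4]
```

Plateaus in the curve correspond to habitual behavior; jumps correspond to spatial innovation.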
4.6 Privacy
The literature (see Liu et al., 2018) mentions four categories of LPPMs (Location Privacy-Preserving Mechanisms): obfuscation, anonymization, cryptography, and reducing location information sharing. XYT currently integrates the obfuscation and aggregation mechanisms in the GPSDataPrivacy() instance.
The following subsections were substantially developed in Pleskowicz, Schultheiss and Bouillet (2022).
4.6.1 Obfuscation
Obfuscation in the XYT library allows the programmer to perform spatial obfuscation of important Points of Interest (POIs), utilizing the detection of home and work locations described in the previous subsection. This feature provides the option to either remove all points in the proximity of these POIs or assign the same noisy location within that area to all of these points. The user can specify the radius and an offset for the obfuscation window. The process involves sampling a random location uniformly within a circle of the specified radius around the POI location, as illustrated in Figure XX. That new noisy location is then treated as the center of the obfuscation window (see Figure XX).
Shifting the home location first is a privacy protection measure to mitigate the risk of inference attacks. In an inference attack, an attacker identifies the largest circle without any data points in a densely populated area and calculates its center to determine the location of interest. The closer a point is to the edge of the obfuscation circle, the easier it is to infer the location of a point of interest (POI). For example, if a point is near the edge, there may be many points leading to the entrance of a building, indicating the presence of a POI. To address this, an offset parameter is set to ensure that the location is at least a certain distance from the edge of the obfuscation area.
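A minimal sketch of this uniform-in-disc sampling with an edge offset (illustrative names; a flat-Earth meters-to-degrees conversion is assumed for the small radii involved):

```python
import math
import random

def sample_noisy_center(poi_lat, poi_lon, radius_m, offset_m, meters_per_deg=111_320):
    """Sample a location uniformly at random within the disc of radius
    (radius_m - offset_m) around the POI. The offset keeps the sampled
    center away from the edge of the obfuscation circle, making
    edge-based inference attacks harder."""
    # sqrt of a uniform draw makes the points uniform over the disc AREA
    r = (radius_m - offset_m) * math.sqrt(random.random())
    theta = random.uniform(0.0, 2.0 * math.pi)
    dlat = (r * math.cos(theta)) / meters_per_deg
    dlon = (r * math.sin(theta)) / (meters_per_deg * math.cos(math.radians(poi_lat)))
    return poi_lat + dlat, poi_lon + dlon
```

Without the `sqrt`, samples would cluster near the POI; without the offset, samples near the circle's edge would leak the POI's approximate position.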
It's important to note that there are other types of attacks that can compromise the effectiveness of obfuscation techniques. One significant risk is when hiding locations in areas with low population density. If the obfuscation circle contains only one building, such as a house in a rural area, it becomes easy to infer the location of the POI, rendering this approach ineffective for protecting privacy. One possible solution is to increase the radius of the obfuscation circle based on the density of buildings in a given area, ensuring that it encompasses at least two possible POI locations.
The XYT library proposes an obfuscation utility metric to assess the procedure above. The statistic is calculated by incrementing the obfuscation radius for a selected subset of data. It measures the percentage of affected legs, i.e., legs that contain points that were either removed or shifted to the center of the obfuscation circle: the numerator is the number of affected legs, while the denominator is the total number of legs returned by the pipeline. The results of this calculation are presented in the Results section below.
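As a hedged sketch (names are illustrative, not the library API), the utility statistic reduces to a ratio of leg counts:

```python
def obfuscation_utility(n_affected_legs, n_total_legs):
    """Percentage of legs 'affected' by obfuscation, i.e. legs containing
    points that were removed or shifted to the obfuscation-circle center.
    Lower values mean more of the mobility signal survives obfuscation."""
    if n_total_legs == 0:
        return 0.0
    return 100.0 * n_affected_legs / n_total_legs

print(obfuscation_utility(12, 480))  # -> 2.5 (percent of legs affected)
```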
4.6.2 Aggregation
The second method divides the map into a lattice and aggregates the dataset (that contains data from many users) over cells in the lattice as well as a given time period. It produces a dataset with the following columns: < timestamp, cell latitude, cell longitude, count >, where cell latitude and cell longitude denote the center point of a cell in the map lattice, and count is the count of unique users that visited that cell during a given time period. The method takes cell size and timedelta as parameters. The results of the aggregation with cell size = 0.4km and timedelta = 1 hour can be seen in Figure XX. Zooming in closer (Figure XX) we can spot the centers of the cells of the lattice. The bigger the cell size and timedelta, the smaller the resolution of the final data.
Further visual inspection of the heatmap reveals the potential for attacks, particularly those that exploit contextual information. These types of attacks are particularly effective in datasets with a small number of users. In Figure XX, the path taken by a single user during the specified time period is depicted. Despite the low spatial resolution (400m x 400m cells), the attacker can easily infer the route taken by the user, as the path of visited cells aligns with one of the main roads.
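A minimal pure-Python sketch of this aggregation (illustrative names; the actual GPSDataPrivacy.aggregate() signature may differ, and a real implementation would size cells in meters rather than degrees):

```python
from collections import defaultdict

def aggregate(records, cell_deg=0.005, bucket_s=3600):
    """Aggregate <user_id, epoch_seconds, lat, lon> records over a map
    lattice (cells of cell_deg degrees) and time buckets, counting the
    unique users that visited each cell during each period."""
    cells = defaultdict(set)
    for user_id, epoch_s, lat, lon in records:
        bucket = epoch_s // bucket_s * bucket_s
        # snap coordinates to the center of their lattice cell
        cell_lat = (int(lat // cell_deg) + 0.5) * cell_deg
        cell_lon = (int(lon // cell_deg) + 0.5) * cell_deg
        cells[(bucket, cell_lat, cell_lon)].add(user_id)
    return {key: len(users) for key, users in cells.items()}

records = [
    ("u1", 3600, 46.5197, 6.6323),  # two users, same cell, same hour
    ("u2", 3700, 46.5197, 6.6323),
    ("u1", 7300, 46.5197, 6.6323),  # same cell, next hour
]
print(sorted(aggregate(records).values()))  # -> [1, 2]
```

Larger `cell_deg` and `bucket_s` values lower the output resolution, trading utility for privacy.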
5. Input gps format
5.1 Things to know about geodata
Here are some specific things to know regarding geodata processing:
- CRS is important – this is the geographic projection system, set either to EPSG:4326 (international standard, aka WGS 84) or EPSG:2056 (Swiss standard, aka CH1903+ / LV95) – please stick to WGS 84.
- WGS 84 is expressed in degrees, CH1903+ in meters – this matters when computing distances, running DBSCAN, etc.
- datetimes are set to a specific time zone (UTC+1 for Switzerland).
- Typically there are three types of geometries:
  - type == 'waypoint' is a shapely Point(), unlabeled, i.e., raw GPS data
  - type == 'staypoint' is a shapely Point() detected as an activity
  - type == 'leg' is a shapely LineString() detected as a trip
- Waypoints typically have the following columns (note that some other columns such as 'detected_mode' or 'detected_purpose' may be inferred by the GPS data provider):
  ['user_id', 'type', 'tracked_at', 'latitude', 'longitude', 'accuracy']
- Leg and/or staypoint dataframes typically have the following columns:
  ['user_id', 'type', 'started_at', 'finished_at', 'timezone', 'length_meters', 'detected_mode', 'purpose', 'geometry', 'home_location', 'work_location']
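To illustrate why the degree-based CRS matters when computing distances, here is a standard haversine helper (not part of XYT):

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two WGS 84 points given in
    degrees. Needed because WGS 84 coordinates are degrees, not meters:
    subtracting raw coordinates does NOT yield a distance."""
    R = 6_371_000  # mean Earth radius [m]
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * R * math.asin(math.sqrt(a))

# Lausanne -> Geneva, roughly 50 km
print(round(haversine_m(46.5197, 6.6323, 46.2044, 6.1432) / 1000, 1))
```

With projected CH1903+ (EPSG:2056) coordinates, a plain Euclidean distance in meters is sufficient instead.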
5.2 Typical data format for practitioners and researchers
The tracking was scheduled to last for eight weeks. The collected location records were sent to a web-based platform where various treatments were performed, such as trip segmentation and travel mode detection. As a result, the data comes in different states of aggregation: raw geolocation data segmented into legs and stay-points, and labeled mobility data with inferred modes of travel and activity purposes, as described in Table XX and Table XX. The "waypoint" dataset contains raw GPS records collected through a mobile application. Each record includes a user ID, location coordinates, and a timestamp. The "leg" dataset contains preprocessed location data where GPS records are segmented into trips or stays. We have two files available for each user:
Waypoints
Each record (row) represents a GPS point. Columns: <'user_id', 'tracked_at', 'latitude', 'longitude', 'created_at', 'accuracy', 'speed', 'altitude'>
Table XX : Attributes of the “waypoint” dataset
| Waypoint attributes | dtype | description |
|---|---|---|
| user_id | string | Unique pseudonym with 5 random characters |
| tracked_at | timestamp | Timestamp when the user was at the location |
| latitude | double | - |
| longitude | double | - |
| created_at | timestamp | Timestamp when this record is sent and stored on the server |
| accuracy | double | [m] |
| speed | double | [m/s] |
| altitude | double | [m] |
Legs
Each record (row) represents a trip or activity aggregated from the corresponding user's waypoints. Columns: <'user_id', 'leg_id', 'started_at', 'finished_at', 'started_at_utc', 'finished_at_utc', 'validated_mode', 'detected_mode', 'purpose', 'type', 'confirmed_at', 'geometry'>
Table XX : Attributes of the “leg” dataset
| Leg attributes | dtype | description |
|---|---|---|
| user_id | string | Unique pseudonym with 5 random characters |
| leg_id | integer | Unique id for this leg |
| started_at | timestamp | When the user started the trip, in the timezone of the start point |
| finished_at | timestamp | When the user finished the trip, in the timezone of the finish point |
| validated_mode | string | Travel mode validated by the user |
| detected_mode | string | Detected travel mode |
| purpose | string | Trip purpose |
Table X: variable dictionary
| Field | Description |
|---|---|
| user id | Unique user identification number |
| detected mode | Mode detected by the data collection app (e.g., 'Car', 'Transit', 'Walk', 'Bicycle') |
| validated mode | Whether the mode has been validated by the user |
| purpose | Purpose of the trip (e.g., home, work, leisure, other) |
| type | Type of data (track or stay) |
| geometry | Geolocation information in degrees or meters depending on the CRS |
| latitude | Latitude of the location |
| longitude | Longitude of the location |
| tracked at | Timestamp of the waypoint |
| accuracy | Device accuracy |
| leg id | Unique id of a leg |
| started at | Starting time of the leg; in some cases the timezone is specified |
| finished at | Finishing time of the leg; in some cases the timezone is specified |
5.3 Example of existing datasets fully compatible with XYT
Panel Data Dataset
Masse Florian, Alexis Gumy et al. (2024). Enquête: Panel lémanique de suivi de la durabilité des pratiques. EPFL. https://www.epfl.ch/labs/lasur/fr/index-html/enquetes/panel-lemanique/
SDSC MOBIS Dataset
Molloy et al. (2020) A national-scale mobility pricing experiment using GPS tracking and online surveys in Switzerland
5.4 Contextual data: enriching GPS and Mobility Analysis
In the context of GPS and mobility data processing, contextual data significantly enhances the understanding of urban dynamics. The XYT library will enable easy integration of transformed GPS data with General Transit Feed Specifications (GTFS) and OpenStreetMap (OSM). However, this feature will be available in a future version of the library, as it is still being implemented and tested.
5.4.1 GTFS: Standardizing Transit Data
In recent years, there has been a collective effort to harmonize public transport data, resulting in the General Transit Feed Specification (GTFS). GTFS is a global standard for public transport data that enables transit agencies to provide real-time feeds. These feeds include information on timetables and network geometry.
The availability of real-time feeds also allows third parties to develop services on top of operators' schedules. This can include seamless ticketing, multi-modal routing, or simply providing real-time and personalized information. The GTFS data is structured in several files that contain information about stops, stop times, trips, and routes, all connected through unique IDs (Google Transit API, 2022). These files reference the specific details of the public transport network and operation, including different agencies, fare attributes, specific calendar days, and more.
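As a minimal illustration of this file structure (toy feed fragments, not a real agency's data), stops and stop times can be joined on their shared `stop_id`:

```python
import csv
import io

# Minimal GTFS-like snippets; real feeds ship these as stops.txt / stop_times.txt
stops_txt = """stop_id,stop_name,stop_lat,stop_lon
S1,Lausanne-Flon,46.5207,6.6303
S2,EPFL,46.5221,6.5660
"""
stop_times_txt = """trip_id,arrival_time,departure_time,stop_id,stop_sequence
T1,08:00:00,08:00:30,S1,1
T1,08:12:00,08:12:30,S2,2
"""

# Index stops by their unique id, then enrich each stop_time record
stops = {row["stop_id"]: row for row in csv.DictReader(io.StringIO(stops_txt))}
timetable = [
    {**row, "stop_name": stops[row["stop_id"]]["stop_name"]}
    for row in csv.DictReader(io.StringIO(stop_times_txt))
]
for row in timetable:
    print(row["trip_id"], row["departure_time"], row["stop_name"])
```

The same id-based joins extend to `trips.txt` (trip_id → route_id) and `routes.txt`, which is what makes GTFS convenient for matching GPS legs to transit services.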
5.4.2 OSM: Collaborative Geographic Database
OSM, a community-driven project, creates a free global geographic database based on local knowledge. Unlike proprietary alternatives, OSM follows an open data model. The database can be searched using Nominatim, which supports geocoding (forward search based on names or addresses) and reverse geocoding (backward search using coordinates). Some third-party products, like opentripplanner.org and routing.osm.ch, utilize OSM to offer free routing services. OSM provides editing and read-only APIs for accessing and manipulating raw geodata in the database.
GTFS and OSM are contextual data sources that play crucial roles in enhancing GPS and mobility data analyses. They provide a comprehensive understanding of urban landscapes in the scientific exploration of location-based services.
6. List of instances and ‘public’ methods in XYT
List of instances in the python library xyt:
from fake_gps_generator import FakeDataGenerator
from gps_data_processor import GPSDataProcessor
from xyt_plot import *
FakeDataGenerator()
A fake gps data generator to play with the library without infringing on users’ privacy
fakegps = FakeDataGenerator(location_name="Suisse", num_users=5, home_radius_km = 20)
waypoints = fakegps.generate_waypoints(num_rows=12, num_extra_od_points=10, max_displacement_meters = 10)
legs = fakegps.generate_legs(num_rows=12)
stays = fakegps.generate_staypoints(num_rows=12)
GPSDataProcessor()
A geotagging instance to transform raw gps data into mobility data
data_processor = GPSDataProcessor(radius=0.03)
poi_waypoints = data_processor.guess_home_work(waypoints_df, cell_size=0.3)
smoothed_df = data_processor.smooth(poi_waypoints, sigma=10)
segmented_df = data_processor.segment(smoothed_df)
mode_df = data_processor.mode_detection(segmented_df)
legs_ = data_processor.get_legs(df = mode_df)
xyt_plot
A set of functions to easily plot the outputs of xyt's instances
plot_gps_on_map(poi_waypoints, home_col='home_loc', work_col='work_loc')
GPSDataPrivacy()
An instance to artificially degrade the data for privacy purposes
data_privacy = GPSDataPrivacy()
df_obfuscated = data_privacy.obfuscate()
utility = data_privacy.get_obfuscation_utility()
df_aggregated = data_privacy.aggregate()
GPSAnalytics()
An instance to perform space-based and time-based analytics on the mobility data
metrics = GPSAnalytics()
metrics.check_inputs()
staypoint1 = metrics.split_overnight(staypoint)
staypoint2 = metrics.spatial_clustering(staypoint1)
staypoint3 = metrics.split_overnight(staypoint2)
extended_staypoint = metrics.get_metrics(staypoint3)
day_staypoint = metrics.get_daily_metrics(extended_staypoint)
GPStoGraph()
An instance to abstract mobility diaries as a graph
graphs = GPStoGraph()
multiday_graph = graphs.get_graphs(extended_staypoint)
graphs.plot_motif(multiday_graph)
graphs.plot_graph(multiday_graph)
motif_seq = graphs.motif_sequence(multiday_graph)
GPStoActionspace()
An instance to compute spatial data analytics on the Action Space
actionspace = GPStoActionspace()
AS = actionspace.compute_action_space(act, aggreg_method='user_id', plot_ellipses=False)  # aggreg_method: 'user_id' or 'user_id_day'; returns a DataFrame
actionspace.covariance_matric(AS)
actionspace.plot_action_space(act, AS, user_subset=['CH15029', 'CH16871'], how='vignette', save=False)  # how: 'vignette' or 'folium'
actionspace.inno_rate(mtf_, AS_day, user_id_, phase=None, treatment=None)
7. Web-app
7.1 Streamlit demo
A web app is under development to demonstrate the XYT capabilities (instances and methods); it will be released in December 2023.
7.2 Swiss proximity
Swiss Proximity is a planning and decision support tool for thinking more sustainably about our urban territories of tomorrow. This interface was developed by researchers at EPFL to help users, practitioners, communities, and decision-makers better understand their territories from the perspective of "urban proximities".
The platform is accessible at swiss-proximity.epfl.ch (see Figure XX).
8. Future improvements
8.1 Full roadmap of the library
8.2 Upcoming developments
In addition to the cleaning and splitting mentioned above, the following improvements should be considered for future extensions of this work:
Connect GPS data to contextual data
- Considering more bus stops in the inference of the transit itinerary. For now we only use the closest stop, which might not always be right due to the accuracy of the GPS position or to splitting (as the bus/tram/train lines are inferred for the endpoints of legs)
- GTFS Realtime should be incorporated into the transit itinerary inference to get more accurate results
- GTFS shapes should be used to calculate the overlap of candidate routes with the leg. Timur’s overlap score can be used
- Analysis of the usage of given lines of public transport can then be performed
Improve the mode detection algorithm
Machine learning methods, such as random forests, can be used for transport mode detection. The model can then be trained and validated on this dataset, and the modes of transport can be inferred for the legs using the developed methods. Further investigation and parameter tweaking of the fuzzy engine responsible for mode detection should be conducted to improve its performance.
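A minimal sketch of this idea with scikit-learn, assuming hypothetical per-leg speed and length features (not the library's fuzzy engine, and the feature choice is illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical leg features: [median speed (m/s), 95th pct speed (m/s), leg length (km)]
X = np.array([
    [1.3, 1.8, 0.9],     # walk
    [1.1, 1.6, 0.5],     # walk
    [4.5, 7.0, 3.0],     # bicycle
    [5.0, 8.0, 4.2],     # bicycle
    [12.0, 30.0, 15.0],  # car
    [14.0, 33.0, 22.0],  # car
])
y = ["walk", "walk", "bicycle", "bicycle", "car", "car"]

# A small forest suffices on such clearly separated toy data
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(clf.predict([[1.2, 1.7, 0.7]]))  # likely 'walk'
```

In a real setting, the labels would come from user-validated modes (`validated_mode`), and the model would be evaluated on held-out users rather than on its training data.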
Improve the fake data generation algorithm
Further develop the RNN approach released in XYT v1
Activity scheduling models
Integrate models to perform predictions and further data generation (see Pougala, Hillel and Bierlaire, 2021).
Sensitivity analyses
Generally test and challenge the algorithms we developed
9. Release
The XYT library was officially released on the Python Package Index (PyPI), which serves as a repository for Python software. To install the library using pip, simply run the following command in the command line:
pip install xyt
The documentation for the XYT library can be accessed on the Read the Docs website. The source code of the library is hosted on GitHub, and the package follows semantic versioning.
10. Conclusion
The XYT project is a comprehensive and innovative approach to processing geolocation data and transforming it into valuable mobility insights. Through various methods and techniques, the project aims to enhance the understanding of urban dynamics and improve the analysis of location-based services.
The project encompasses several key components, including data generation, data processing, visualization, privacy protection, and analytics. With the XYT library, researchers and practitioners have access to a powerful toolkit that facilitates the transformation of raw GPS data into meaningful mobility data. The library offers functionalities such as obfuscation, aggregation, analysis, and integration with contextual data sources like GTFS and OSM.
The purposes of the XYT project are manifold. It provides a standardized and efficient way to process geolocation data, enabling researchers to gain insights into user behavior, travel patterns, and urban dynamics. By leveraging the power of data analytics, researchers can make informed decisions and recommendations to improve transportation systems, urban planning, and mobility services.
Collaboration is crucial to the success of the XYT project. Researchers, practitioners, and data scientists are encouraged to contribute their expertise and insights to further enhance the capabilities of the library. By working together, we can develop new methods, refine existing algorithms, and address emerging challenges in the field of geolocation data analysis.
In conclusion, the XYT project offers a comprehensive solution for geolocation data processing and mobility analysis. It helps researchers and practitioners to unlock valuable insights from raw GPS data, leading to improved urban planning, transportation systems, and location-based services. Join us in this collaborative journey to advance the field of mobility analysis and make cities smarter, more sustainable, and more efficient.
11. Acknowledgments
I would like to express my sincere gratitude to the following individuals and organizations for their contributions and support during the development of the XYT project:
- The team at Renkulab for providing a web-based platform for collaborative data analysis and hosting the XYT project.
- The ETH Domain Open Research Data (ORD) Program for the financial support.
- Eric Bouillet for his valuable scientific and technical support.
- Institute for Transport Planning and Systems (IVT) at ETH Zürich for providing the initial set of GPS data.
- The Renkulab platform team at SDSC for their work on development versioning and data privacy management.
- Michal Pleskowicz and Timur Lavrov for their participation in the development process.
- Mattéo Berthet and Rémi Delacourt for their contribution to the integration and instantiation of the library.
References
Baddeley, A., Rubak, E., & Turner, R. (2015). Spatial Point Patterns: Methodology and Applications with R. Taylor & Francis. https://www.routledge.com/Spatial-Point-Patterns-Methodology-and-Applications-with-R/Baddeley-Rubak-Turner/p/book/9781482210200
Boeing, G. (2017). OSMnx: New methods for acquiring, constructing, analyzing, and visualizing complex street networks. Computers, Environment and Urban Systems, 65, 126–139. https://doi.org/10.1016/j.compenvurbsys.2017.05.004
Boeing, G. (2020, June 4). Urban Data Lab. https://geoffboeing.com/lab/
Bucher, D., Mangili, F., Cellina, F., Bonesana, C., Jonietz, D., & Raubal, M. (2019). From location tracking to personalized eco-feedback: A framework for geographic information collection, processing and visualization to promote sustainable mobility behaviors. Travel Behaviour and Society, 14, 43–56. https://doi.org/10.1016/j.tbs.2018.09.005
Center for Urban Transportation Research. (2021). Library awesome-transit [Computer software]. https://github.com/CUTR-at-USF/awesome-transit
Gil, R. (2021). TrackToTrip [Python]. https://github.com/ruipgil/TrackToTrip (Original work published 2016)
Google Transit API. (2022). GTFS documentation. https://developers.google.com/transit/gtfs/reference
Hagberg, A. A., Schult, D. A., & Swart, P. J. (2008). Exploring Network Structure, Dynamics, and Function using NetworkX.
Jiang, S., Ferreira, J., & Gonzalez, M. C. (2017). Activity-Based Human Mobility Patterns Inferred from Mobile Phone Data: A Case Study of Singapore. IEEE Transactions on Big Data, 3(2), 208–219. https://doi.org/10.1109/TBDATA.2016.2631141
Lavrov, T., Bouillet, E., & Schultheiss, M.-E. (2020). Detecting Public Transport Usage from GPS Data.
MATSim. (2022). About MATSim, multi-agent transport simulation. MATSim.Org. https://www.matsim.org/about-matsim
Molloy, J., Castro, A., Götschi, T., Schoeman, B., Tchervenkov, C., Tomic, U., Hintermann, B., & Axhausen, K. W. (2022). The MOBIS dataset: A large GPS dataset of mobility behaviour in Switzerland. Transportation. https://doi.org/10.1007/s11116-022-10299-4
MovingPandas. (2019). MovingPandas. https://anitagraser.github.io/movingpandas/
Pappalardo, L., Simini, F., Barlacchi, G., & Pellungrini, R. (2021). scikit-mobility: A Python library for the analysis, generation and risk assessment of mobility data. arXiv:1907.07062 [Physics]. http://arxiv.org/abs/1907.07062
Patterson, Z., & Fitzsimmons, K. (2016). DataMobile: Smartphone Travel Survey Experiment. Transportation Research Record, 2594(1), 35–43. https://doi.org/10.3141/2594-07
Pleskowicz, M., Bouillet, E., & Schultheiss, M.-E. (2020). Match Matching algorithms and Open Data: Leveraging GTFS and OSM data to improve accuracy.
Pleskowicz, M., Schultheiss, M.-E., & Bouillet, E. (2022). From raw GPS data to mobility data.
Pougala, J., Hillel, T., & Bierlaire, M. (2021). Capturing trade-offs between daily scheduling choices. 27.
Schneider, C. M., Rudloff, C., Bauer, D., & González, M. C. (2013). Daily travel behavior: Lessons from a week-long survey for the extraction of human mobility motifs related information. Proceedings of the 2nd ACM SIGKDD International Workshop on Urban Computing - UrbComp ’13, 1. https://doi.org/10.1145/2505821.2505829
Schönfelder, S., & Axhausen, K. W. (2004). Structure and innovation of human activity spaces. Arbeitsberichte Verkehrs- Und Raumplanung, 258. https://doi.org/10.3929/ethz-b-000023551
Schuessler, N., & Axhausen, K. W. (2008). Identifying trips and activities and their characteristics from GPS raw data without further information. Transportation Research Record: Journal of the Transportation Research Board., 2105, 1–28. https://doi.org/10.3141/2105-04
Stenneth, L., Wolfson, O., Yu, P. S., & Xu, B. (2011). Transportation mode detection using mobile phones and GIS information. Proceedings of the 19th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, 54–63. https://doi.org/10.1145/2093973.2093982
Su, R., McBride, E. C., & Goulias, K. G. (2020). Pattern recognition of daily activity patterns using human mobility motifs and sequence analysis. Transportation Research Part C: Emerging Technologies, 120, 102796. https://doi.org/10.1016/j.trc.2020.102796
Toso, S. (2020). Library GTFS functions [Python]. https://github.com/Bondify/gtfs_functions