The term Big Data appeared in the mid-1990s within the scientific community, gained popularity around 2008 and became widely recognized by 2010. It generally refers to data that is voluminous (volume), varied (variety), fast-accumulating (velocity) and of uncertain quality (veracity). This data may be structured, semi-structured or unstructured, with the majority being unstructured. With the advent of smartphones and the wide adoption of GPS, there has been an explosion in the amount of spatial data being produced by sources such as smartphones, space telescopes and medical devices. As a result, there is a need for systems and technologies that can quickly ingest large spatial data sets, process them and deliver the output needed for decision making.
Big Data Analytics emerged from the need to process Big Data. It refers to the use of advanced analytic techniques on very large, diverse and complex data sets (structured, semi-structured and unstructured) from different sources, such as Artificial Intelligence (AI), mobile, social media and the Internet of Things (IoT). Big Data Analytics enables analysts, researchers and organizations to make better decisions faster, using data that was previously unavailable, by applying techniques such as text analytics, machine learning, predictive analytics, data mining, statistics and natural language processing.
As mentioned earlier, a large percentage of the world’s Big Data has a spatial component, either as a single location (geo-tagged items, GPS positions) or as a representation of geographic features using primitives such as points, lines and polygons. Adding a location attribute to Big Data means that traditional analysis methods are inadequate for determining spatial patterns and correlations. This does not mean, however, that the problems associated with Big Data (volume and complexity) are absent from Big Spatial Data; in some respects they are magnified. In addition, spatial analysis is more computationally intensive than non-spatial analysis, and the computational requirements tend to increase depending on the type of analysis being performed.
Big Spatial Data analytics
Regrettably, the urgent need to manage and analyze spatial data is hampered by a lack of specialized systems, techniques and algorithms to support such data. For example, while traditional Big Data is supported by a variety of MapReduce systems and cloud infrastructure such as Hadoop, Hive, HBase and Spark, none of these provides built-in support for spatial or spatio-temporal data. Thus, the only ways to support big spatial data on them are either to treat it as non-spatial data or to write a set of functions as wrappers around the existing non-spatial systems.
In implementing Big Spatial Data Analytics, three different approaches have been used to overcome the limitations of classical Big Data Analytics tools.
These can be classified as:
a) The on-top approach: This uses an existing non-spatial system as a black box (foundation), and spatial queries are implemented through user-defined functions. It is simple to implement but inefficient, since the underlying system is still unaware of the spatial data.
b) The from-scratch approach: A new system is constructed from scratch to handle a specific application. This approach is very efficient, but such a system is very difficult to build and maintain.
c) The built-in approach: This extends an existing system by injecting spatial data awareness into its core, which achieves good performance while avoiding building a new system from scratch.
Big Spatial Data is processed using parallel, distributed processing on clusters of servers. To achieve this, it is necessary to keep track of all the elements that form part of a query and to organize the results into a final dataset. MapReduce is a programming model developed by Google to assist with this.
Fig 1: Implementation of the MapReduce algorithm
The Map() component of the algorithm sorts and filters the source information into queues of smaller problems, which are then passed to worker nodes for processing. The worker nodes may break the problems down further and pass them to additional worker nodes, creating a nested tree of processing nodes. Once processing is completed, each worker node passes its result back to the master node. The master node then performs the Reduce() step by collecting the answers to all sub-problems and combining them to form the final answer. Spatial bounds are used at both the Map() and Reduce() steps to constrain the analysis spatially.
From the above description, it is clear that the primary role of the MapReduce algorithm is to coordinate and marshal (order) all the distributed processes working in parallel, and to handle fault tolerance and redundancy.
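The Map() and Reduce() steps described above can be illustrated with a small single-process sketch in Python. It is not Hadoop code and does not distribute work or handle faults; it only mimics the data flow. The function names, the `CELL_SIZE` constant and the sample points are assumptions made for illustration: the map step keeps points inside a spatial bound and emits a grid cell key for each, and the reduce step counts the points per cell.

```python
from collections import defaultdict

CELL_SIZE = 10.0  # assumed grid cell width, in coordinate units


def map_point(point, bounds):
    """Map step: emit (cell, 1) for a point that lies inside the bounds."""
    x, y = point
    min_x, min_y, max_x, max_y = bounds
    if min_x <= x <= max_x and min_y <= y <= max_y:
        cell = (int((x - min_x) // CELL_SIZE), int((y - min_y) // CELL_SIZE))
        yield cell, 1


def shuffle(pairs):
    """Group mapped values by key, as the framework would between steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_cell(cell, values):
    """Reduce step: combine all values for one cell into a single count."""
    return cell, sum(values)


def count_points_per_cell(points, bounds):
    mapped = (pair for p in points for pair in map_point(p, bounds))
    return dict(reduce_cell(c, v) for c, v in shuffle(mapped).items())


# Hypothetical sample data; the last point falls outside the bounds.
points = [(1.0, 2.0), (3.5, 4.0), (12.0, 1.0), (25.0, 25.0), (99.0, 99.0)]
counts = count_points_per_cell(points, bounds=(0.0, 0.0, 30.0, 30.0))
```

In a real cluster, the framework would run many map and reduce tasks in parallel on different nodes and perform the grouping between them.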
Big Spatial Data Analytic tools
The best-known Big Data tool on the market is the Apache Software Foundation’s Hadoop, which implements the MapReduce algorithm. It can be used to run analysis on extremely large volumes of data on server clusters with any number of nodes. However, as mentioned above, Hadoop is spatially unaware, so spatial predicates cannot be used in queries. Spotting this gap, Esri created GIS Tools for Hadoop, a toolkit that provides three components:
a) Esri Geometry API for Java, a generic geometry library that provides Hadoop with vector geometry types and operations
b) Spatial Framework for Hadoop, which enables the Hive Query Language (HQL) to use spatial data types and operations
c) Geoprocessing Tools for Hadoop, a set of geoprocessing tools for ArcGIS that enables users to move their data in and out of Hadoop and execute workflows
These tools make it possible to take data held in a spatial repository, package it and upload it into a Hadoop cluster. Complex analysis can then be performed on the data and the results downloaded directly into ArcGIS for Desktop, where further detailed analysis can be performed. Moreover, the toolkit is open source and its source code is freely available to the public.
Components of a Spatial Big Data Analytics Solution
The core components of a system for big spatial data analysis and visualization include the following:
Language. The system must offer a high-level language so that non-technical users can operate it without considerable knowledge of its inner workings or design. The Open Geospatial Consortium (OGC) has played a major role in defining the data types and functions that such high-level languages support.
Indexes. Spatial indexes enable a system to organize stored data according to its spatial attributes, which allows queries on the data to run faster. The indexes used vary between systems and include flat indexes and hierarchical indexes.
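As a rough illustration of a flat index, the sketch below implements a uniform grid over points in Python. The class name `GridIndex`, its methods and the cell size are hypothetical and not taken from any particular system. Each point is stored under the grid cell that contains it, so a rectangular query only inspects the cells that overlap the rectangle instead of scanning every point.

```python
from collections import defaultdict
from math import floor


class GridIndex:
    """A flat spatial index: a uniform grid of cells holding points."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(list)  # (col, row) -> [(x, y, item), ...]

    def _cell(self, x, y):
        return (floor(x / self.cell_size), floor(y / self.cell_size))

    def insert(self, x, y, item):
        self.cells[self._cell(x, y)].append((x, y, item))

    def range_query(self, min_x, min_y, max_x, max_y):
        """Return items whose point lies inside the query rectangle."""
        col0, row0 = self._cell(min_x, min_y)
        col1, row1 = self._cell(max_x, max_y)
        hits = []
        # Visit only the cells that overlap the rectangle.
        for col in range(col0, col1 + 1):
            for row in range(row0, row1 + 1):
                for x, y, item in self.cells.get((col, row), []):
                    # Cells on the border may hold points outside the range.
                    if min_x <= x <= max_x and min_y <= y <= max_y:
                        hits.append(item)
        return hits


# Hypothetical sample data.
index = GridIndex(cell_size=10)
index.insert(1, 1, "a")
index.insert(15, 5, "b")
index.insert(55, 55, "c")
index.insert(9, 9, "d")
found = sorted(index.range_query(0, 0, 20, 10))
```

A hierarchical index would instead split space recursively, which tends to cope better with data that is unevenly distributed.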
Queries. This is a core component of any system for big spatial data processing, as it covers all the spatial queries that the system supports. Queries fall into several categories: basic queries such as the range query, join queries such as the spatial join, data mining queries such as k-means, geometry queries such as polygon union, and raster operations.
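To make one of these query types concrete, the following Python sketch shows a naive spatial join that pairs each point with the polygons containing it. The containment test uses the standard ray-casting method; the function names and sample data are assumptions for illustration, and a production system would use an index rather than comparing every point with every polygon.

```python
def point_in_polygon(point, polygon):
    """Ray casting: count how many edges a ray to the right crosses."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can cross.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def spatial_join(points, polygons):
    """Return (point_id, polygon_id) pairs where the point is contained."""
    return [
        (point_id, polygon_id)
        for point_id, point in points.items()
        for polygon_id, polygon in polygons.items()
        if point_in_polygon(point, polygon)
    ]


# Hypothetical sample data: two square polygons and three points.
polygons = {
    "A": [(0, 0), (4, 0), (4, 4), (0, 4)],
    "B": [(6, 6), (9, 6), (9, 9), (6, 9)],
}
points = {"p1": (2, 2), "p2": (5, 2), "p3": (7, 7)}
pairs = spatial_join(points, polygons)
```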
Visualization. Visualization can be described as the process of generating an image that describes a dataset, for instance a heat map of temperature. There are two types of imagery. Single-level imagery is produced at a fixed size, with no capacity for zooming in to see more detail. Multilevel imagery is created as multiple images at different levels of detail, allowing users to zoom in and out.
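The difference between the two kinds of imagery can be sketched in Python as follows. `rasterize` produces a single-level image as a fixed-size grid of point counts, which is the raw form of a heat map. `build_pyramid` produces multilevel imagery by rasterizing the same data at several resolutions. The doubling of resolution at each level, the function names and the sample points are assumptions for illustration, not a description of any specific system.

```python
def rasterize(points, bounds, width, height):
    """Single-level image: a height x width grid of point counts."""
    min_x, min_y, max_x, max_y = bounds
    image = [[0] * width for _ in range(height)]
    for x, y in points:
        if not (min_x <= x < max_x and min_y <= y < max_y):
            continue  # skip points outside the image extent
        col = min(width - 1, int((x - min_x) / (max_x - min_x) * width))
        row = min(height - 1, int((y - min_y) / (max_y - min_y) * height))
        image[row][col] += 1  # row 0 holds the smallest y values
    return image


def build_pyramid(points, bounds, levels):
    """Multilevel imagery: level z is a (2**z) x (2**z) image (assumed)."""
    return [rasterize(points, bounds, 2 ** z, 2 ** z) for z in range(levels)]


# Hypothetical sample data.
points = [(1, 1), (1, 2), (9, 9), (6, 2)]
pyramid = build_pyramid(points, bounds=(0, 0, 10, 10), levels=3)
```

A viewer would show a coarse level when zoomed out and switch to a finer level as the user zooms in.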
SHAHED is a MapReduce-based system for visualizing and analyzing satellite data. It provides two main features: visualization, and spatio-temporal selection and aggregation queries. These features are available through a user-friendly web interface.
In the interface, users navigate to an area of interest on the world map and choose a time range. Based on the user’s selection, the system can run a spatio-temporal selection query that extracts the values (for instance, temperature) in the specified range, or a spatio-temporal aggregate query that returns, for instance, the minimum, maximum and average in the range. Alternatively, the values in the specified range can be visualized, for example as a heat map.
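The selection and aggregation queries described above can be sketched in a few lines of Python. This is not SHAHED's implementation; the record layout, the function names and the sample readings are assumptions made for illustration. The selection filters records by a bounding box and a date range, and the aggregation summarizes the selected values.

```python
from datetime import date


def select_values(records, bbox, start, end):
    """Spatio-temporal selection: values inside the box and date range."""
    min_lon, min_lat, max_lon, max_lat = bbox
    return [
        r["value"]
        for r in records
        if min_lon <= r["lon"] <= max_lon
        and min_lat <= r["lat"] <= max_lat
        and start <= r["day"] <= end
    ]


def aggregate(values):
    """Aggregate query: minimum, maximum and average of the selection."""
    if not values:
        return None
    return {
        "min": min(values),
        "max": max(values),
        "avg": sum(values) / len(values),
    }


# Hypothetical temperature readings.
records = [
    {"lon": 36.8, "lat": -1.3, "day": date(2020, 1, 1), "value": 10.0},
    {"lon": 36.9, "lat": -1.2, "day": date(2020, 1, 2), "value": 20.0},
    {"lon": 37.0, "lat": -1.1, "day": date(2020, 1, 3), "value": 30.0},
    {"lon": 10.0, "lat": 50.0, "day": date(2020, 1, 2), "value": 99.0},
    {"lon": 36.8, "lat": -1.3, "day": date(2021, 6, 1), "value": 50.0},
]
bbox = (36.0, -2.0, 38.0, 0.0)
values = select_values(records, bbox, date(2020, 1, 1), date(2020, 1, 31))
summary = aggregate(values)
```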
TAGHREED is an example of a system that supports querying, analysis and visualization of geotagged tweets. It continuously collects geotagged tweets from Twitter and indexes them using SpatialHadoop. It creates a separate index for each day and later merges the daily indexes into larger ones, for instance weekly or monthly, to keep the number of indexes under control. Additionally, TAGHREED creates an inverted index that enables searching the text of the tweets. Users are presented with a world map on which they can navigate to any location and search for text within a given time range.
TAGHREED then retrieves the tweets that match the user’s specifications and search text and runs some analysis on the retrieved tweets. The tweets and the analysis results are then visualized in the interface, where users can interact with them to get more information.
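The idea of building small per-day inverted indexes and merging them into larger ones can be sketched in Python as below. This is only a toy model of the text index, not TAGHREED's code, and it leaves out the spatial index entirely. The function names, the whitespace tokenization and the sample tweets are assumptions for illustration.

```python
from collections import defaultdict


def build_daily_index(tweets):
    """Inverted index for one day: word -> set of tweet ids."""
    index = defaultdict(set)
    for tweet_id, text in tweets:
        for word in text.lower().split():
            index[word].add(tweet_id)
    return index


def merge_indexes(daily_indexes):
    """Merge several daily indexes into one larger (e.g. weekly) index."""
    merged = defaultdict(set)
    for index in daily_indexes:
        for word, tweet_ids in index.items():
            merged[word] |= tweet_ids
    return merged


def search(index, word):
    """Return the ids of tweets containing the word, in sorted order."""
    return sorted(index.get(word.lower(), set()))


# Hypothetical sample tweets for two days.
monday = build_daily_index([(1, "Flood in Nairobi"), (2, "Traffic jam")])
tuesday = build_daily_index([(3, "Nairobi traffic update")])
weekly = merge_indexes([monday, tuesday])
nairobi_ids = search(weekly, "Nairobi")
```

Merging keeps the number of indexes that a query must consult small, at the cost of periodically rebuilding the larger index.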
Future of Big Data Analysis and visualization
The example of Esri enhancing Hadoop to provide spatial capabilities, and a means of working with Hadoop directly from the ArcGIS platform, demonstrates the value of cross-discipline working. The ability to offload complex analysis of very large datasets to a cloud-based parallel processing engine, and then bring the results back into the GIS for further manipulation, represents an important opportunity to challenge and change traditional ways of working with spatial information.