As the name suggests, big data refers to a large volume of data, which can be
structured or unstructured. Big data is powerful because it can have a huge
impact on your business and on the strategic planning that goes into it. But
it’s not the amount of data that’s important; it’s what organizations decide
to do with that data. Big data gives insight that can lead to better decisions
and strategic business moves.
So what kind of data can be called big data? Big data is big enough that it
won’t fit on a single machine, which means you need more specialized tools to
work with it. As a rough rule of thumb, data above 1 TB can fall into this
category. Big data is often described in terms of the three V’s.
The three V’s:
1. Volume: Data arrives in large amounts from many sources, such as social
media websites, sensors monitoring traffic, and online transactions.
2. Velocity: Data streams in from multiple sources at high speed and should be
dealt with in a timely manner.
3. Variety: Data can be of any type, from numbers and strings in traditional
databases to unstructured text documents, emails, video, and audio.
As the size of the data increases, so does the complexity of dealing with it.
Traditional approaches cannot be used in this case: they may not produce any
result and, even if they do, it may not arrive in time.
Why is it so damn important?
The reason it is so important is that the insight you get from your big data
can lead to:
1. Cost reductions
2. Time reductions
3. New product development and optimized offerings
4. Smart decision making
To achieve all these results, big data needs to be combined with high-powered
analytics, which can determine the root causes of failures, issues and defects
in near real time.
A real-world example is air traffic control. Air traffic controllers are the
personnel responsible for managing the routes and altitudes of aircraft from
different airlines. Their main goal is to monitor the speed, altitude,
location, etc. of each aircraft and to contact the crew when something goes
wrong. They receive a huge amount of data every minute from different
aircraft, and they have to make sense of that data in time to avoid any
collision. The data is too big and comes with time constraints. In such
conditions traditional techniques fail to provide results, and something more
powerful is required.
Big data can be used in different sectors such as government, education,
banking, health care, manufacturing, retail, etc.
How to store and manage it?
Storage would have been a problem several years ago, but there are now
low-cost options available for storing and managing data.
How much of it to analyse?
Some organizations like to take all of their data into consideration, which is
possible with today’s high-performance technologies such as grid computing.
Another option is to determine which data is relevant before analysing it.
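
As a minimal sketch of the second option, the Python snippet below filters a
stream of records down to the relevant ones before any analysis runs. The
record layout (the "sensor_id", "kind" and "value" keys), the set of relevant
kinds and both function names are assumptions made up for illustration; they
do not come from any particular tool.

```python
from typing import Iterable, Iterator

# Hypothetical record layout: each reading is a dict with "sensor_id",
# "kind" and "value" keys. These names are illustrative assumptions.
RELEVANT_KINDS = {"speed", "altitude"}


def relevant_only(records: Iterable[dict]) -> Iterator[dict]:
    """Yield only the records worth analysing, one at a time.

    A generator is used so the full dataset never has to sit in memory.
    """
    for record in records:
        if record.get("kind") in RELEVANT_KINDS and record.get("value") is not None:
            yield record


def average_by_kind(records: Iterable[dict]) -> dict:
    """Compute the mean value per kind over the filtered stream."""
    totals: dict = {}
    counts: dict = {}
    for record in relevant_only(records):
        kind = record["kind"]
        totals[kind] = totals.get(kind, 0.0) + record["value"]
        counts[kind] = counts.get(kind, 0) + 1
    return {kind: totals[kind] / counts[kind] for kind in totals}


sample = [
    {"sensor_id": 1, "kind": "speed", "value": 400.0},
    {"sensor_id": 1, "kind": "cabin_light", "value": 1.0},   # irrelevant kind
    {"sensor_id": 2, "kind": "speed", "value": 500.0},
    {"sensor_id": 2, "kind": "altitude", "value": None},     # missing value
    {"sensor_id": 3, "kind": "altitude", "value": 30000.0},
]
print(average_by_kind(sample))
```

Filtering first keeps the analysis step small; the trade-off is that anything
dropped here is invisible to later analysis, so the relevance rule needs to be
chosen with care.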
How to use any insights you uncover?
The information retrieved from the data after processing will play a major
role in making business decisions and in shining a light on failures.
What are the big data databases?
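NoSQL databases, MPP databases and Hadoop are a few examples of big data
databases. (Strictly speaking, Hadoop is a distributed storage and processing
framework rather than a database, but it is commonly grouped with them.) NoSQL
can be used to capture big data from users, and Hadoop can be used to provide
analytical insight for analysts and scientists.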
So, Is NoSQL Better For Analysis?
This depends on a lot of factors, for example the type of data you are
analyzing, how much data you have and how quickly you need results. For
applications such as user behavior analysis, a relational DB is often the
best choice.
If the data fits into a spreadsheet, then it is better suited to a SQL-type
database such as PostgreSQL or BigQuery, as relational databases are good at
analyzing data in rows and columns. For semi-structured data (think social
media, texts or geographical data), which requires a large amount of text
mining or image processing, a NoSQL-type database such as MongoDB or CouchDB
works best. Since running analytics on semi-structured data requires a heavy
coding background, analyzing these types of DBs usually requires a data
scientist.
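
To make the "fits into a spreadsheet" test concrete, here is a rough Python
sketch. The helper `is_tabular` and the sample records are hypothetical: the
check simply asks whether every record is flat and has the same fields, which
is only one possible reading of "rows and columns".

```python
def is_tabular(records: list) -> bool:
    """Return True if every record is flat and has the same set of keys."""
    if not records:
        return True
    expected_keys = set(records[0])
    for record in records:
        # A row with extra or missing fields does not line up in columns.
        if set(record) != expected_keys:
            return False
        # Nested lists or dicts do not map onto a single spreadsheet cell.
        if any(isinstance(value, (dict, list)) for value in record.values()):
            return False
    return True


# Illustrative data: orders look like spreadsheet rows, posts do not.
orders = [
    {"order_id": 1, "customer": "alice", "total": 30.5},
    {"order_id": 2, "customer": "bob", "total": 12.0},
]
posts = [
    {"user": "alice", "text": "hello", "tags": ["intro"]},
    {"user": "bob", "text": "hi", "location": {"lat": 1.0, "lon": 2.0}},
]

print(is_tabular(orders))  # flat records with identical keys
print(is_tabular(posts))   # nested values and differing keys
```

Data like `orders` maps naturally onto a relational table, while data like
`posts` is the kind of semi-structured material a document store is designed
to hold as-is.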
When it comes to the size of the data, PostgreSQL and MySQL usually give good
performance for under 1 terabyte of data, while Amazon Redshift is preferred
at petabyte scale. And for smaller teams of engineers focused on building
pipelines, relational DBs take less effort to manage than NoSQL.
Another advantage of relational databases is that one can use SQL to query
them. SQL is well known among data analysts and engineers and is also easier
to learn than most programming languages.
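
As a small illustration of how compact such a query can be, the sketch below
counts user actions with plain SQL. It uses Python's built-in `sqlite3` module
with an in-memory database purely as a stand-in for a server such as
PostgreSQL or MySQL; the `events` table and its columns are invented for the
example.

```python
import sqlite3

# In-memory SQLite database, used here only as a lightweight stand-in
# for a relational server. Table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, action TEXT)")
conn.executemany(
    "INSERT INTO events (user_id, action) VALUES (?, ?)",
    [(1, "view"), (1, "purchase"), (2, "view"), (2, "view"), (3, "view")],
)

# One declarative statement answers "how often does each action occur?"
rows = conn.execute(
    """
    SELECT action, COUNT(*) AS n
    FROM events
    GROUP BY action
    ORDER BY n DESC
    """
).fetchall()
conn.close()

print(rows)
```

The same `SELECT ... GROUP BY` statement would run largely unchanged on most
relational databases, which is part of why SQL skills transfer so easily
between tools.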