July 23, 2020

Big Data from scratch

By Paula Martinez

What is Big Data?

The term Big Data is used to define huge volumes of data, both structured and unstructured, that are expensive and complex to process using traditional database and analytics technologies.

The 3 Vs of Big Data

Volume: Big Data systems handle data at massive scale, from terabytes to petabytes. The size of the data being processed plays an important role in being able to extract value from it.

Variety: Big Data platforms process data from heterogeneous sources. Traditionally, the data to be analyzed came from spreadsheets and databases; nowadays the sources are far more varied, including many kinds of unstructured data such as emails, blogs, audio, photos, videos, and readings from different sensors. This variety of data types and sources adds extra complexity when storing and analyzing the information.

Velocity: This refers to the speed at which information is generated and processed. Organizations generate data ever faster, while the time in which they expect to extract useful, actionable information from it keeps shrinking. The flow of information is continuous and massive, and processing may even be needed in real time.

Why might you need Big Data?

The objective is to identify actionable knowledge in large volumes of data, in varied formats, within relatively short times.

Big Data & IoT

The Internet of Things (IoT) is a great data source for Big Data platforms. Different types of sensors collect data continuously, and often the volume of information generated is too massive and diverse to be processed with traditional tools; in those cases, Big Data platforms come into action.

Big Data vs Business Intelligence

Big Data and Business Intelligence (BI) are complementary technologies. Both are intended to process and analyze information to help organizations understand their business, but the way to do it and the volume and type of data with which they work are very different.

BI seeks to help organizations make decisions quickly and with agility, presenting data at a reasonable cost. Big Data, instead, deals with extracting, storing, processing, and visualizing massive volumes of data.
Some of the most important differences are:

  • Big Data uses distributed file systems, while BI in general stores the information on a central server.
  • BI processes only structured data (traditional databases), whereas Big Data can also process unstructured information (blogs, photos, audio, etc).
  • Big Data uses massive parallel processing, allowing large volumes of data to be analyzed efficiently.
  • The data analyzed with Big Data can be historical or real-time, while BI platforms usually work with historical data.

Big Data can even be a source of data for BI systems, which are ideal to identify and report trends, results, and indicators.

Why now?

In order to use Big Data platforms, we need infrastructure that can store large volumes of data at low cost; this has not been a severe limitation for some time now.

In addition, it is necessary to be able to process this data fast and at a relatively low cost. This has been possible thanks to an exponential increase in computing capacity in recent years. This phenomenon has had a great impact on the technological scene, for example, it has also made the use of artificial intelligence tools such as deep learning viable.

Big Data tools and frameworks

Big Data is an umbrella concept that encompasses many technologies. One of the most popular frameworks for dealing with these massive volumes of data is Hadoop.

Hadoop is an open source system used to store, process and analyze large volumes of data. Instead of using a single, very powerful server to store and process information, it allows you to create clusters of servers to analyze data in parallel.

It has two main components: HDFS (Hadoop Distributed File System) and MapReduce.

HDFS is a distributed file system, which allows information to be stored and processed in a cluster of servers.

The cluster can grow from a few nodes to thousands or tens of thousands, making the solution highly scalable. As the amount of information grows and the need for processing increases, new machines are added to the cluster. The main advantage is that increasing the platform's power does not require replacing machines with ones that have ever greater storage and compute capacity; you simply add more nodes.

The MapReduce framework is used to perform the processing and analysis of the information in the distributed file system. This programming paradigm allows the information processing to be massively scaled in clusters of thousands of servers (workers or nodes of the cluster). It divides the large tasks into many small ones, which are executed in parallel on different servers of the cluster (Map operation), to then unite the partial results in a single final result (Reduce operation).
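The MapReduce pattern described above can be sketched in plain Python with the classic word-count example. This is only an illustrative single-machine sketch of the programming model (the function names `map_phase`, `shuffle_phase`, and `reduce_phase` are ours), not actual Hadoop code; in a real cluster each phase runs distributed across many nodes.

```python
from collections import defaultdict

# Map: emit a (word, 1) pair for every word in every input line.
def map_phase(lines):
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)

# Shuffle: group all intermediate pairs by their key (the word).
def shuffle_phase(pairs):
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups

# Reduce: combine the grouped values into one result per key.
def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data from scratch", "big data and big clusters"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts["big"])   # 3
print(counts["data"])  # 2
```

In Hadoop, the framework itself handles the shuffle step and the distribution of map and reduce tasks across the cluster; the programmer only supplies the map and reduce functions.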

We will use an analogy to understand the concept of Map and Reduce operations.

Suppose we want to carry out a census in the city of Montevideo. We hire hundreds of census takers, assign each of them an area of the city, and ask them to count how many people live in that area (going through all the houses in it).

Then, with the results from all the census takers, we add the values to obtain the total number of people who live in Montevideo.

Dividing the census task among many people, each counting in parallel, is analogous to the Map operation, while adding up the partial totals to obtain the census result for Montevideo is analogous to the Reduce operation. Splitting the task and executing it in parallel is much more efficient than having a single person traverse the whole city counting people.

As simple as the MapReduce concept seems, writing programs that run tasks in parallel is a major challenge when working directly with Hadoop. For this reason, many tools have been developed within the Hadoop ecosystem to simplify this task.

For example, Hive is a tool that lets us query data stored in Hadoop using a language similar to SQL. It was initially developed by Facebook, is now maintained as an Apache project, and is open source.

On the other hand, Spark is also a very popular framework, which can run on top of the Hadoop ecosystem, replacing the MapReduce framework. It is a distributed processing system optimized to rapidly execute iterative machine learning and data analysis workloads. It is open source and provides development APIs in Java, Python, and Scala.