Apache Spark

Data is one of the most important assets of any organization, and it is being generated at an incredible scale. The variety of data, and the volume that is processed and stored, keep growing. Even in small organizations, data is growing from gigabytes to terabytes to petabytes. Processing needs are growing for the same reason, and they demand the ability to process data faster.

Processing this much data poses major technical challenges. Many applications were not designed to use all the processors in a multi-core computer, so much of the available processing power was wasted.


Below are some qualities a data processing framework should have:


  • It should be capable of processing blocks of data in parallel, so that a huge data processing job can be divided into multiple tasks that run at the same time, reducing the processing time considerably.

  • It should be capable of using the processing power of all the cores or processors in a computer.

  • It should be capable of running on commodity hardware.
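The first two qualities can be illustrated with a minimal sketch using only the Python standard library. It is not how Hadoop or Spark are implemented; the function names (`split_into_blocks`, `process_block`, `parallel_sum_of_squares`) and the sum-of-squares workload are hypothetical stand-ins chosen for illustration.

```python
from concurrent.futures import ProcessPoolExecutor
import os


def split_into_blocks(data, block_size):
    """Divide the input into fixed-size blocks, one block per task."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def process_block(block):
    """Stand-in for real work: sum the squares of one block."""
    return sum(x * x for x in block)


def parallel_sum_of_squares(data, block_size=10_000, workers=None):
    """Process the blocks in parallel and combine the partial results."""
    # Assumption: one worker process per available core.
    workers = workers or os.cpu_count()
    blocks = split_into_blocks(data, block_size)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        partial_results = pool.map(process_block, blocks)
        return sum(partial_results)
```

Because the blocks are independent, each one can be handled by a separate worker process, and only the small partial results need to be combined at the end. On platforms that start worker processes by spawning a fresh interpreter (Windows and macOS, for example), `parallel_sum_of_squares` should be called from under an `if __name__ == "__main__":` guard.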


Two open source data processing frameworks that satisfy all of these requirements are worth mentioning:

  • Apache Hadoop.

  • Apache Spark.


Apache Hadoop

Hadoop comprises two core components:

  • a distributed file system called HDFS (Hadoop distributed file system)

  • a processing layer called MapReduce.
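To give a feel for the MapReduce programming model, here is a toy, single-process word count in plain Python. It is only a sketch of the map, shuffle, and reduce steps; it does not use Hadoop's API, and the function names are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values for each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    """Run the three phases in sequence over a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle_phase(pairs))
```

In a real cluster the map and reduce functions run on many machines at once, and the framework handles the grouping step between them. Here everything runs in one process so the data flow is easy to follow.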


In Hadoop 1.x, resource management was handled by Hadoop's MapReduce framework itself. Hadoop 2.0 introduced YARN to manage the resources of the Hadoop cluster, which makes the cluster less dependent on MapReduce.


Apache Spark

Apache Spark is an open source distributed data processing project. Spark is written in Scala, which runs on the Java Virtual Machine (JVM). This makes Spark a platform-independent framework capable of running on Windows, Linux, and other operating systems. Spark and Hadoop are closely related as critical components of the big data landscape.

Spark Is Fast

Hadoop’s MapReduce implementation persists intermediate data to disk between the Map and Reduce processing phases. Spark instead implements a distributed, fault-tolerant, in-memory structure called a Resilient Distributed Dataset (RDD). By keeping data in memory across multiple machines where possible, Spark avoids much of that disk I/O, which can improve overall performance substantially, in some workloads by orders of magnitude.
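The difference between the two approaches can be sketched with a toy two-stage pipeline in plain Python. This is an illustration of the idea only, not Spark's or Hadoop's actual API; the stage functions and the JSON temporary file are assumptions made for the example.

```python
import json
import os
import tempfile


def stage_one(numbers):
    """First stage: square every input value."""
    return [n * n for n in numbers]


def stage_two(values):
    """Second stage: add up the intermediate values."""
    return sum(values)


def run_with_disk(numbers):
    """Write the intermediate result to disk, then read it back."""
    intermediate = stage_one(numbers)
    fd, path = tempfile.mkstemp(suffix=".json")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(intermediate, handle)
        with open(path) as handle:
            return stage_two(json.load(handle))
    finally:
        # Clean up the temporary file whether or not the stages succeed.
        os.remove(path)


def run_in_memory(numbers):
    """Hand the intermediate result straight to the next stage."""
    return stage_two(stage_one(numbers))
```

Both functions return the same answer. The disk-based version pays for serializing, writing, and re-reading the intermediate data at every stage boundary, and that cost adds up in jobs with many stages or repeated passes over the same data.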

The Spark ecosystem includes several frameworks, such as Spark SQL, Spark Streaming, GraphX, and SparkR, to name a few.

Spark itself is written in Scala and runs in Java virtual machines (JVMs). It provides native support for programming interfaces in the following languages:

  • Scala

  • Python

  • Java

  • SQL

  • R
