How to set up a Big Data Development Environment?

Requirements
  • A modern laptop with a 64-bit OS and at least 16 GB of RAM (the 64-bit OS is needed to support all the software, the RAM for speed).
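
As a quick sanity check of the requirement above, here is a minimal Python sketch that tests whether a machine looks like a 64-bit system with roughly 16 GB of RAM. The function names, the list of 64-bit machine strings, and the 10% tolerance are assumptions made for illustration (the OS usually reports slightly less than the installed RAM). The RAM detection relies on `os.sysconf`, which is not available on Windows, so there the script only asks you to check manually.

```python
import os
import platform

# Thresholds taken from the requirement above. The tolerance is an assumption:
# the OS usually reports a little less than the physically installed RAM.
MIN_RAM_GIB = 16
RAM_TOLERANCE = 0.9
# Assumed (not exhaustive) set of machine strings that indicate a 64-bit system.
SIXTY_FOUR_BIT_MACHINES = {"x86_64", "amd64", "arm64", "aarch64"}


def meets_requirements(machine, ram_bytes, min_ram_gib=MIN_RAM_GIB):
    """Return True if the machine type is 64-bit and RAM is roughly sufficient."""
    is_64bit = machine.lower() in SIXTY_FOUR_BIT_MACHINES
    enough_ram = ram_bytes >= min_ram_gib * RAM_TOLERANCE * 1024**3
    return is_64bit and enough_ram


def detect_ram_bytes():
    """Best-effort physical RAM in bytes, or None if it cannot be detected."""
    try:
        return os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")
    except (AttributeError, ValueError, OSError):
        return None  # for example on Windows, where os.sysconf does not exist


if __name__ == "__main__":
    ram = detect_ram_bytes()
    if ram is None:
        print("Could not detect RAM on this platform; please check it manually.")
    else:
        ok = meets_requirements(platform.machine(), ram)
        verdict = "OK" if ok else "below the stated requirements"
        print(f"{platform.machine()}, {ram / 1024**3:.1f} GiB RAM -> {verdict}")
```

Keeping `meets_requirements` separate from the detection code makes it easy to test the rule itself with known values.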
Description

Most Big Data tools are open source, and there are many technologies one needs to learn to be proficient with ecosystem tools such as Hadoop, Spark, Hive, Pig, and Sqoop. This blog covers how to set up a development environment on a personal computer or laptop using a distribution such as Cloudera or Hortonworks. Both Cloudera and Hortonworks provide a virtual machine image that contains the Big Data ecosystem tools prepackaged. This blog will provide:

  • A comparison of virtualization software such as VirtualBox and VMware
  • Step-by-step instructions to set up virtualization software such as VirtualBox or VMware
  • Guidance on choosing a Cloudera or Hortonworks image
  • Step-by-step instructions to set up a VM using the chosen image
  • Setup of necessary additional components such as a MySQL database and a log generation tool
  • A review of HDFS, MapReduce, Sqoop, Pig, Hive, Spark, etc.

VirtualBox vs. VMware comparison:

What is Virtualization?

Virtualization is a combination of software and hardware engineering that creates Virtual Machines (VMs), an abstraction of the computer hardware that allows a single machine to act as if it were many machines.

Two widely used virtualization products are VirtualBox and VMware. Let’s compare their features.

VirtualBox

VirtualBox is best for virtualizing single desktop environments.

          Platform: Windows, Mac, Linux

          Price: Free

    Features

  • Easy installation of popular operating systems like Windows, Linux, and Mac OS X
  • Run multiple virtualized environments simultaneously
  • Run a guest OS in “seamless mode”, which puts the guest's applications directly on your host desktop
  • Fast performance all around
  • Supports snapshots of your virtual machines, so you can start a VM from any saved configuration or point in its life
  • 3D virtualization
  • Open virtual disk images made in VirtualBox, VMware, or Microsoft Virtual PC

   Why use VirtualBox

           First, because it is free to use. VirtualBox makes running other operating systems (whether Linux, other versions of Windows, or even Mac OS X) very easy on your home computer. Just insert your install disc (or point VirtualBox to an ISO on your computer), and you can install the OS in a virtual machine with as much or as little RAM, CPU, and hard drive space as you want. It integrates with your mouse pointer, so you don’t even have to click on the window to start using it. It also lets you create “snapshots” of your machines that work like restore points: you can boot a machine from any saved point in its history and continue from there. You can even share your clipboard back and forth between the guest and host OS.
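
The same steps (creating a VM, sizing its RAM and CPU, taking a snapshot) can also be scripted with VirtualBox's `VBoxManage` command-line tool. The following is a minimal Python sketch, not a complete setup: the VM name `bigdata-sandbox`, the snapshot name `clean-install`, the memory and CPU values, and the helper function names are all hypothetical and chosen only for illustration. The script builds and prints the commands; it runs them only if you call `run_all`, which assumes `VBoxManage` is installed and on your PATH.

```python
import shlex
import subprocess


def build_vm_commands(vm_name, memory_mb, cpus, snapshot_name):
    """Build VBoxManage commands to create a VM, size it, and snapshot it."""
    return [
        # Create and register an empty 64-bit Linux VM.
        ["VBoxManage", "createvm", "--name", vm_name,
         "--ostype", "Linux_64", "--register"],
        # Give the VM the chosen amount of RAM (in MB) and number of CPUs.
        ["VBoxManage", "modifyvm", vm_name,
         "--memory", str(memory_mb), "--cpus", str(cpus)],
        # Take a snapshot that can later be used like a restore point.
        ["VBoxManage", "snapshot", vm_name, "take", snapshot_name],
    ]


def restore_snapshot_command(vm_name, snapshot_name):
    """Build the command that rolls a powered-off VM back to a snapshot."""
    return ["VBoxManage", "snapshot", vm_name, "restore", snapshot_name]


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical names and sizes; adjust them to your own machine.
    commands = build_vm_commands("bigdata-sandbox", 8192, 2, "clean-install")
    commands.append(restore_snapshot_command("bigdata-sandbox", "clean-install"))
    for cmd in commands:
        print(shlex.join(cmd))  # print only; call run_all(commands) to execute
```

Printing the commands first is a deliberate choice: you can review them before anything changes on your machine. The sketch does not attach a disk or an installer ISO, so the VM it creates is empty.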

   Why avoid it

           VirtualBox can seem a little intimidating to most beginners, but so can any virtualization program. In addition, its “seamless” mode, while cool, isn’t done quite as well as VMware’s: it brings the entire toolbar of your guest OS with it, and moving the windows around isn’t the smoothest experience. Overall, though, it is still very feature-rich, with great documentation and a large user base.

VMware

VMware is best for server virtualization.

Platform: Windows, Mac, Linux

Price: Not free

Features

  • Includes all the features of VirtualBox plus some advanced features
  • Desktop products such as VMware Player provide all the services you need within one software package, while VMware's server hypervisor (ESXi) runs directly on the hardware itself
  • VMware also supports restricted virtual machines, which is useful when you want to prevent unauthorized personnel from tampering with configuration settings

 Why use VMware

             It includes some extra features that VirtualBox lacks. VMware also provides more user-friendly and smoother options.

Why avoid it

          It is not free, and it provides so many options that a newcomer may not understand all of the complexities.