Python vs R: The big data war

Jithin Johnson, Software Engineer

It's hard to know whether to use Python or R for data analysis. This is a question we come across whenever we think about data analysis. There are many good resources that can help you figure out the strengths and weaknesses of both languages. People often look for a predefined answer to questions such as "What should I use for machine learning?" or "I need a fast solution, should I go for Python or R?". In today's post, I present "Data Science Wars: R vs Python", which highlights in detail the differences between these two languages from a data science point of view.

Data science

Today we come across a lot of data. To a casual observer this data means nothing special, but to someone who wants to understand the world and achieve more, it means a lot. Each piece of data carries valuable information that can lead us to new and better conclusions, and there is a lot of insight hidden in it. This is where data science comes in. Data science gives meaning to large sets of complex and unstructured data. The focus is on concepts like data analysis and visualization. In addition, machine learning, from the field of artificial intelligence, has now been adopted by organizations and is becoming a core area for many data scientists to explore and implement.

Data science starts with a basic question: which language should be used? The two major languages are R and Python. Let us discuss them in detail.

The great battle begins

The four main areas:

  1. Data mining
  2. Data analysis
  3. Data visualization
  4. Machine learning

When and how to use R?

R is mainly used for data analysis tasks that require standalone computing or analysis on individual servers. It is also good for almost any type of data analysis, because of its large number of packages and readily usable tests, which often provide the tools you need to get up and running in one go. It can be described as a standalone solution.

When getting started with R, a good first step is to install the RStudio IDE. Once this is done, we recommend that you have a look at the following popular packages:

  1. dplyr, plyr and data.table to easily manipulate data
  2. stringr to manipulate strings
  3. zoo to work with regular and irregular time series
  4. ggvis, lattice, and ggplot2 to visualize data
  5. caret for machine learning

When and how to use Python?

Our goal is to analyse the data and come to a conclusion, but presenting that conclusion to users is an equally important step in producing the final output. This is where Python helps. Using Python we can integrate the data analysis with web apps, show results as statistical graphs, or write them to a database. Because it is a full-fledged programming language, it’s a great tool to implement algorithms for production use.
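As a rough illustration of writing analysis results to a database, here is a minimal sketch that uses only Python's standard library (`statistics` and `sqlite3`). The sample data, the `summaries` table and the helper names `summarize` and `store_summary` are assumptions made up for this example. A real project would likely use its own schema and a production database.

```python
import sqlite3
import statistics

# Hypothetical sample data: daily page views for an imaginary site.
page_views = [120, 135, 128, 160, 142, 138, 151]


def summarize(values):
    """Return a few basic descriptive statistics as a dict."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }


def store_summary(conn, name, summary):
    """Write one summary row into a (hypothetical) 'summaries' table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS summaries "
        "(name TEXT PRIMARY KEY, count INTEGER, mean REAL, median REAL, stdev REAL)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO summaries VALUES (?, ?, ?, ?, ?)",
        (name, summary["count"], summary["mean"],
         summary["median"], summary["stdev"]),
    )
    conn.commit()


# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
summary = summarize(page_views)
store_summary(conn, "page_views", summary)

# Read the row back, as a web app or report might do later.
row = conn.execute("SELECT name, count, median FROM summaries").fetchone()
print(row)
```

The same `summary` dictionary could just as well be passed to a web framework or a plotting library. The point is only that analysis and delivery can live in one language.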

While the immaturity of Python packages for data analysis was an issue in the past, this has improved significantly over the years. We can use NumPy/SciPy (scientific computing) and pandas (data manipulation) to make Python usable for data analysis. Also have a look at matplotlib to make graphics, and scikit-learn for machine learning.
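To give a feel for these libraries, here is a small sketch that assumes NumPy and pandas are installed. The data frame contents and the column names `group` and `value` are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements for two made-up groups.
df = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "value": [1.0, 3.0, 2.0, 4.0, 6.0],
})

# pandas: split the rows by group and compute the mean of each group.
group_means = df.groupby("group")["value"].mean()

# NumPy: operate on the whole array at once, without an explicit loop.
values = df["value"].to_numpy()
centered = values - np.mean(values)

print(group_means)
print(centered)
```

Here pandas handles the labelled, table-like manipulation, while NumPy does the element-wise arithmetic on the underlying array.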

Unlike R, Python has no clear “winning” IDE. We recommend that you have a look at Spyder, IPython Notebook, Rodeo and PyCharm to see which one best fits your needs.

Important Numbers

The numbers clearly show that more people are moving towards Python nowadays.

Pros and cons of R

Pro: A picture says more than a thousand words
Data presented visually can often explain far more than raw numbers, and R is excellent at this. Some must-see visualization packages are ggplot2, ggvis, googleVis and rCharts.

Pro: R Community
R has an active, constantly growing community and a rich set of packages. Packages are available on CRAN, Bioconductor and GitHub. You can search through all R packages at Rdocumentation.

Pro: R is the lingua franca of data science
R was developed by statisticians for statisticians rather than by people with a computer science background. They can communicate ideas and concepts through R code and packages, and you don’t necessarily need a computer science background to get started.

Con: R is slow
R was designed to make data handling convenient for its users rather than easy on the computer, so it can be slow and memory-hungry when handling large amounts of data. There are multiple projects that aim to improve R’s performance: pqR, FastR, Riposte and many more.

Con: R has a steep learning curve
R’s learning curve is non-trivial, especially if you come from a GUI for your statistical analysis. Even finding packages can be time-consuming if you’re not familiar with it.

Pros and cons of Python

Pro: IPython Notebook
The IPython Notebook makes it easier to work with Python and data. You can easily share your notebooks with your colleagues without them having to install anything. This reduces the overhead of organizing code, output and notes files, leaving more time for coding.

Pro: A general purpose language
Python is a general purpose language that is easy to learn, which increases the speed at which you can write a program. You need less time to code and you have more time to play with your data. Furthermore, Python's testing framework is built in and has a low barrier to entry, which encourages good test coverage and makes the code more reusable and dependable.
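To show how low that barrier is, here is a minimal sketch that assumes the built-in framework meant here is the standard library's `unittest` module. The `mean` function and the `MeanTests` class are hypothetical names used only for this example.

```python
import unittest


def mean(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    if not values:
        raise ValueError("mean() requires at least one value")
    return sum(values) / len(values)


class MeanTests(unittest.TestCase):
    def test_typical_values(self):
        self.assertEqual(mean([1, 2, 3, 4]), 2.5)

    def test_empty_input_is_rejected(self):
        with self.assertRaises(ValueError):
            mean([])


# Build and run the suite explicitly so the example also works in a notebook.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(MeanTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

In a regular project the tests would usually live in their own file and be run with `python -m unittest`, but the structure stays the same.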

Pro: A multi-purpose language
Because Python is a common, easy to understand language that is known by programmers and can easily be learned by all kinds of people, you can build a single tool that integrates with every part of your workflow.

Pro/Con: Visualizations
Visualizations are an important criterion when choosing data analysis software. Although Python has some nice visualization libraries, such as Seaborn, Bokeh and Pygal, there are perhaps too many options to choose from. Moreover, compared to R, visualizations usually take more effort to produce, and the results are not always so pleasing to the eye.

Con: Python is a challenger
Python is a challenger to R. Its data analysis packages are not yet as strong as R's.

Finally, who wins the title?

We cannot really say who wins or loses; the answer depends on your situation. The following basic questions can help you decide:

  1. What problems do you want to solve?
  2. What are the net costs for learning a language?
  3. What are the commonly used tools in your field?
  4. What are the other available tools and how do these relate to the commonly used tools?