Written as a part of ML101 Teaching_ML
Data visualization is central to data modeling. We use it to explore the data before the modeling step, and then again to present the finished model in graphical form to a non-technical audience. There are numerous visualization libraries and tools available, for Python and otherwise, but this post covers some basic functions of Python's Matplotlib library.
Note: The code snippets were tested on Python 3.4 with Anaconda 2.1.
1. LINE GRAPH:
Using Matplotlib is fairly simple. Let's begin by generating some data points to work with. The easiest way is to use the linspace function, which comes from NumPy (and is also exposed through Matplotlib's pylab interface).
The linspace function returns a NumPy array of evenly spaced floating-point values between the start and stop values (both inclusive by default).
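The original snippet is not reproduced here, so the following is only a minimal sketch of how the data points might be generated. The start, stop and count values, and the choice of b as the square of a, are assumptions made for illustration.

```python
import numpy as np

# Hypothetical start, stop and count; the post's own values are not shown.
a = np.linspace(0, 10, 5)  # five values: 0, 2.5, 5, 7.5 and 10 (both ends included)
b = a ** 2                 # a second array to serve as y values (illustrative choice)

print(a)
print(b)
```

Passing endpoint=False to linspace would leave the stop value out of the result.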
Now, let's try to plot our data.
Assume a holds the x-axis values and b the y-axis values. We can then draw a simple line passing through these points. The syntax for a line graph is simple:
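A basic line graph could look roughly like this. The arrays a and b are the same hypothetical values as before, not the data from the original post.

```python
import numpy as np
import matplotlib.pyplot as plt

a = np.linspace(0, 10, 5)  # x values (hypothetical range)
b = a ** 2                 # y values (illustrative choice)

# plot() draws a line through the (a, b) points and returns a list of line objects
lines = plt.plot(a, b)
plt.show()
```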
By default, the line is blue and about 1 unit thick (1.5 points in Matplotlib 2.0 and later), but one can tweak these parameters to obtain graphs in other formats. In the following figure, a title, axis labels, a color and a marker type were added to the graph. Calling show() displays only the figure, instead of echoing the text representation of the returned plot object.
One can also plot multiple line graphs in a single figure and use them for parametric comparisons.
Here pi is the constant π (approximately 3.14159), while sin(x) and cos(x) return the sine and cosine of x. The 'ro' format string in the first graph means the points are drawn as red filled circles, without a connecting line. Similarly, 'g--' means a green dashed line (a curve in this case).
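As a rough sketch of this kind of customization, the snippet sets a color, marker, line width, title and axis labels. The specific values and label texts are assumptions, and the axes-based calls (ax.plot, ax.set_title) are one of several equivalent ways to write this.

```python
import numpy as np
import matplotlib.pyplot as plt

a = np.linspace(0, 10, 5)  # hypothetical x values
b = a ** 2                 # hypothetical y values

fig, ax = plt.subplots()
# Override the default color and thickness, and mark each data point with a circle
(line,) = ax.plot(a, b, color="green", marker="o", linewidth=2)
ax.set_title("Squares of evenly spaced values")
ax.set_xlabel("a")
ax.set_ylabel("b")
plt.show()
```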
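A comparison of sine and cosine in one figure might be sketched as follows. The range of x and the number of points are assumptions; the two format strings are the ones discussed above.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical range: one full period, sampled at 50 points
x = np.linspace(0, 2 * np.pi, 50)

fig, ax = plt.subplots()
(sin_line,) = ax.plot(x, np.sin(x), "ro", label="sin(x)")   # red circle markers, no line
(cos_line,) = ax.plot(x, np.cos(x), "g--", label="cos(x)")  # green dashed line
legend = ax.legend(loc="best")
plt.show()
```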
2. BAR GRAPH:
Another type of basic graph is the bar graph. Bar graphs are used when “you want to show how some quantity varies among some discrete set of items” -Joel Grus.
By default, the width of each bar is 0.8 and the color is blue. Before enumerating the x-axis, we added 0.2 to the left coordinate so that the first bar does not stick to the y-axis.
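The population figures from the original post are not reproduced here, so this sketch uses placeholder country names and made-up values. Note also that from Matplotlib 2.0 onward bars are centered on their x positions by default, so the 0.2 offset is only needed with the older left-aligned behavior and is left out here.

```python
import matplotlib.pyplot as plt

# Placeholder labels and values, NOT the figures used in the original post
countries = ["Country A", "Country B", "Country C"]
population_millions = [120, 300, 60]

positions = list(range(len(countries)))  # one x position per bar: 0, 1, 2

fig, ax = plt.subplots()
bars = ax.bar(positions, population_millions, width=0.8, color="blue")
ax.set_xticks(positions)
ax.set_xticklabels(countries)  # label each bar with its country name
ax.set_ylabel("Population (millions)")
plt.show()
```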
3. PIE GRAPH:
This graph gives an overview of how the data is distributed, as percentages of the whole. Using the same population values (source: Google) as in the previous section, we draw a pie graph representing the population of the 3 nations. The legend entries are automatically matched with the slices, following the sequence in which the values were entered. By specifying loc="best", we let Matplotlib decide the best location for the legend.
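A pie graph along these lines could be sketched as follows, again with placeholder names and made-up values rather than the post's actual population data. The autopct argument, which prints each slice's percentage, is an addition for illustration.

```python
import matplotlib.pyplot as plt

# Placeholder labels and values, NOT the figures used in the original post
countries = ["Country A", "Country B", "Country C"]
population_millions = [120, 300, 60]

fig, ax = plt.subplots()
# autopct writes each slice's share of the total on the slice
wedges, texts, autotexts = ax.pie(population_millions, autopct="%1.1f%%")
# Legend entries are paired with the slices in the order the values were given
legend = ax.legend(wedges, countries, loc="best")
plt.show()
```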
4. SCATTER PLOTS:
A scatter plot is used to compare two paired sets of data and find the relation between them. In addition to revealing a general trend, it helps determine the regions where most of the data is concentrated, and thus highlights the outliers (if any).
Assume we have Twitter data on a person's number of followers and their daily tweets. Let's explore whether there is any trend between the two parameters; maybe they are correlated.
Our assumed dataset:
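The original dataset is not reproduced here. The sketch uses made-up follower and tweet counts, chosen only to illustrate an upward trend, and adds a Pearson correlation coefficient as one possible way to quantify it.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up values for illustration only; not the dataset from the original post
followers = np.array([100, 250, 400, 800, 1500, 3000, 5000])
daily_tweets = np.array([1, 2, 2, 4, 6, 9, 12])

fig, ax = plt.subplots()
points = ax.scatter(followers, daily_tweets)
ax.set_xlabel("Number of followers")
ax.set_ylabel("Daily tweets")

# Pearson correlation coefficient between the two variables
r = np.corrcoef(followers, daily_tweets)[0, 1]
print("correlation:", round(r, 2))
plt.show()
```

A coefficient close to 1 indicates a strong positive linear relation, though with so few made-up points it says nothing about real Twitter behavior.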
It looks like the more followers a person has, the higher the frequency of their daily tweets. We can build further on this trend and develop a model to predict the number of tweets given a person's number of followers. But before reaching the stage of developing a model, we had to visualize the data and extract some information that forms the base of our modeling stage.
Such is the power of exploratory data visualization.
Code used in the above blog is available HERE