ggplot in Python: Part 2

Continuing our study of the diamonds data, let us apply some basic exploration functions and see what information we can draw from them.


1. Length of the data:

Use the len() function to see the number of rows of data.
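As a rough sketch, assuming the data sits in a pandas DataFrame named diamonds, the row count can be read as shown here. The small frame built in the example is a hypothetical stand-in with placeholder values, not the real data set.

```python
import pandas as pd

# Hypothetical stand-in for the diamonds data: placeholder rows, not the real set.
diamonds = pd.DataFrame({
    "carat": [0.23, 0.21, 0.29, 0.31],
    "cut": ["Ideal", "Premium", "Premium", "Good"],
    "price": [326, 326, 334, 335],
})

n_rows = len(diamonds)                   # len() of a DataFrame counts its rows
shape_rows, shape_cols = diamonds.shape  # shape gives (rows, columns) together
print(n_rows, shape_rows, shape_cols)
```

The shape attribute is a handy companion to len() because it reports the column count at the same time.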


2. Names of the columns:

Use the columns attribute of the DataFrame. It is an attribute rather than a function, so it is written without parentheses.
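A minimal illustration of reading the column names, again on a hypothetical stand-in frame (the name diamonds and the values are assumptions made for the example):

```python
import pandas as pd

# Hypothetical stand-in for the diamonds data: placeholder rows, not the real set.
diamonds = pd.DataFrame({
    "carat": [0.23, 0.21, 0.29],
    "cut": ["Ideal", "Premium", "Good"],
    "price": [326, 326, 327],
})

# columns is an attribute (an Index object); list() turns it into a plain list.
column_names = list(diamonds.columns)
print(column_names)
```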


3. Analyse the first few values:

Use the head() function; by default, the first 5 rows are displayed.


4. Analyse the last few values:

Use the tail() function; by default, the last 5 rows are displayed.
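The head() and tail() calls from steps 3 and 4 might look like the following sketch. The eight-row frame is a hypothetical stand-in with generated placeholder values, used only so the example runs on its own.

```python
import pandas as pd

# Hypothetical stand-in: eight generated placeholder rows, not real diamonds.
diamonds = pd.DataFrame({
    "carat": [round(0.20 + 0.01 * i, 2) for i in range(8)],
    "price": [300 + 10 * i for i in range(8)],
})

first_rows = diamonds.head()     # default: first 5 rows
last_rows = diamonds.tail()      # default: last 5 rows
first_three = diamonds.head(3)   # an explicit count overrides the default

print(first_rows)
print(last_rows)
```

Both functions accept a row count, so head(3) or tail(10) can be used when 5 rows are too many or too few.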


5. Random selection:

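View a random selection of rows, which is useful for a large data set where head() and tail() show only the two ends. In pandas, the sample() method does this.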
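One way to draw random rows in pandas is the sample() method; whether that is the exact call used for this step originally is an assumption. The frame is a hypothetical stand-in with placeholder values, and random_state is fixed only to make the draw repeatable.

```python
import pandas as pd

# Hypothetical stand-in: ten generated placeholder rows, not real diamonds.
diamonds = pd.DataFrame({
    "carat": [round(0.20 + 0.05 * i, 2) for i in range(10)],
    "price": [300 + 25 * i for i in range(10)],
})

# Draw 3 rows at random without replacement; the seed makes the draw repeatable.
picked = diamonds.sample(n=3, random_state=42)
print(picked)
```

Leaving out random_state gives a different selection on every run, which is often what you want when just browsing the data.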


6. Statistical information:

Numeric fields can be summarised with the describe() function, which reports the count, mean, standard deviation, minimum, quartiles (the 50% row is the median) and maximum of each field.
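A small sketch of describe() on a hypothetical stand-in frame (placeholder values, not the real data set) shows which rows the summary contains and how the median and range can be read from it:

```python
import pandas as pd

# Hypothetical stand-in for the diamonds data: placeholder rows, not the real set.
diamonds = pd.DataFrame({
    "carat": [0.23, 0.21, 0.29, 0.31],
    "cut": ["Ideal", "Premium", "Premium", "Good"],
    "price": [326, 326, 334, 335],
})

summary = diamonds.describe()   # by default only numeric columns are summarised
print(summary)

# The median is the "50%" row; the range is max minus min.
price_median = summary.loc["50%", "price"]
price_range = summary.loc["max", "price"] - summary.loc["min", "price"]
print(price_median, price_range)
```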


7. Determine the correlation between fields:

The corr() function computes the pairwise correlation between the numeric fields in the data set.
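The correlation step could be sketched as below. The frame is a hypothetical stand-in with placeholder values. The numeric columns are selected explicitly before calling corr() because, as far as I am aware, recent pandas versions no longer drop text columns silently.

```python
import pandas as pd

# Hypothetical stand-in for the diamonds data: placeholder rows, not the real set.
diamonds = pd.DataFrame({
    "carat": [0.23, 0.21, 0.29, 0.31, 0.24],
    "cut": ["Ideal", "Premium", "Premium", "Good", "Very Good"],
    "price": [326, 326, 334, 335, 336],
    "x": [3.95, 3.89, 4.20, 4.34, 3.94],
})

# Keep only numeric columns, then compute pairwise Pearson correlations.
numeric = diamonds.select_dtypes(include="number")
corr = numeric.corr()
print(corr)
```

The result is a square table with one row and one column per numeric field, so each field is also compared with itself along the diagonal.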


8. Values stored in the non-numeric fields:

We can view the values directly with diamonds['color'], but the output contains many repeated values, so it is better to view only the unique entries. Use the unique() function for this.
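For illustration, here is unique() applied to the color column of a hypothetical stand-in frame (placeholder values, not the real data set). The value_counts() call is an addition of mine that shows how often each entry occurs.

```python
import pandas as pd

# Hypothetical stand-in for the diamonds data: placeholder rows, not the real set.
diamonds = pd.DataFrame({
    "color": ["E", "E", "E", "I", "J", "J"],
    "price": [326, 326, 327, 334, 335, 336],
})

unique_colors = diamonds["color"].unique()      # distinct values, in order of appearance
color_counts = diamonds["color"].value_counts() # how many rows carry each value
print(unique_colors)
print(color_counts)
```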


Observations so far:

OBSERVATION 1: The data has both numeric and non-numeric values.

OBSERVATION 2: The means and medians of x and y are approximately the same. Do diamonds have proportionate length and breadth?

OBSERVATION 3: The minimum of x, y and z is 0. Can the length, breadth or height of a real object be 0?

OBSERVATION 4: Diagonal correlations are all 1. Why?

OBSERVATION 5: Price, carat, x, y and z seem closely related to each other. Can a predictive model be developed? What about the non-numeric values?
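Observation 3 can be followed up by pulling out the rows in which any dimension is 0. The sketch below does this on a hypothetical stand-in frame; the zeros in it are deliberate placeholders planted for the example, not values taken from the real data set.

```python
import pandas as pd

# Hypothetical stand-in: placeholder rows with zeros planted on purpose.
diamonds = pd.DataFrame({
    "x": [3.95, 0.0, 4.20, 4.34],
    "y": [3.98, 0.0, 4.23, 4.35],
    "z": [2.43, 0.0, 2.63, 0.0],
})

# Flag rows where at least one of x, y, z equals 0, then keep only those rows.
has_zero = (diamonds[["x", "y", "z"]] == 0).any(axis=1)
zero_rows = diamonds[has_zero]
print(len(zero_rows))
print(zero_rows)
```

Looking at such rows directly helps decide whether the zeros are missing measurements recorded as 0 or something else.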

