Demo: NumPy, Pandas#
UW Geospatial Data Analysis
CEE467/CEWA567
David Shean
modified by Quinn Brencher
Introduction#
This is a quick demo of some key functionality for these core Python packages, emphasizing topics that will help with lab exercises this week and later in the quarter. It is by no means complete!
Please consult the reading assignment and lists of other excellent, more complete online resources.
NumPy#
NumPy is the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.
Pandas#
pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.
pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis/manipulation tool available in any language. It is already well on its way toward this goal.
Matplotlib#
Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.
Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application servers, and four graphical user interface toolkits.
Matplotlib tries to make easy things easy and hard things possible. You can generate plots, histograms, power spectra, bar charts, errorcharts, scatterplots, etc., with just a few lines of code. For examples, see the sample plots and thumbnail gallery.
For simple plotting the pyplot module provides a MATLAB-like interface, particularly when combined with IPython. For the power user, you have full control of line styles, font properties, axes properties, etc, via an object oriented interface or via a set of functions familiar to MATLAB users.
Import necessary modules#
Use shorthand aliases, so you don’t have to type out the full module name each time
Note the different structure for Matplotlib: we import the pyplot submodule, not the top-level package
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
NumPy 1D array#
#Create 1D array of random integers
#Arguments are low (inclusive), high (exclusive), and the number of values to draw
a = np.random.randint(0,10,10)
a
array([4, 5, 9, 9, 8, 6, 4, 9, 1, 1])
type(a)
numpy.ndarray
#np.ndarray?
Constructing an array#
#np.array?
np.array(0, 1, 2)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[6], line 1
----> 1 np.array(0, 1, 2)
TypeError: array() takes from 1 to 2 positional arguments but 3 were given
#Pass in an array-like object - need brackets around the numbers
np.array([0, 1, 2])
array([0, 1, 2])
mylist = [0, 1, 2]
np.array(mylist)
array([0, 1, 2])
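A small sketch of what np.array does with different array-like inputs. The variable names are made up for illustration; the point is that nested lists give a 2D array, and that the dtype is inferred from the values unless you set it yourself.

```python
import numpy as np

# A flat list gives a 1D array
from_list = np.array([0, 1, 2])

# A list of equal-length lists gives a 2D array (2 rows, 3 columns)
from_nested = np.array([[0, 1, 2], [3, 4, 5]])

# Mixing integers and floats promotes everything to float64
mixed = np.array([0, 1.5, 2])

# The dtype can also be requested explicitly
as_float = np.array([0, 1, 2], dtype=float)

print(from_list.shape, from_nested.shape)
print(mixed.dtype, as_float.dtype)
```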
Array properties and datatypes#
a
array([4, 5, 9, 9, 8, 6, 4, 9, 1, 1])
a.shape
(10,)
a.size
10
a.dtype
dtype('int64')
What is ‘int64’?#
Signed integer represented by 64 bits
Each bit can be 0 or 1
0 = 0000000000000000000000000000000000000000000000000000000000000000
1 = 0000000000000000000000000000000000000000000000000000000000000001
2 = 0000000000000000000000000000000000000000000000000000000000000010
…
https://numpy.org/doc/stable/user/basics.types.html
#Possible unique combinations of 64 bits
int64_range = 2**64
int64_range
18446744073709551616
print(f"{int64_range:.2e}")
1.84e+19
mm = 2**64 // 2
mm
9223372036854775808
f'A 64-bit signed integer can store values between -{mm} and +{mm-1}'
'A 64-bit signed integer can store values between -9223372036854775808 and +9223372036854775807'
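Rather than working out the limits by hand, you can ask NumPy for them. This sketch uses np.iinfo, which reports the minimum and maximum of an integer dtype, and checks the result against the arithmetic above.

```python
import numpy as np

# np.iinfo reports the limits of an integer dtype
info = np.iinfo(np.int64)
print(info.min, info.max)

# Same limits derived by hand: one of the 64 bits encodes the sign
assert info.min == -(2**63)
assert info.max == 2**63 - 1

# Limits for an unsigned 8-bit integer
info8 = np.iinfo(np.uint8)
print(info8.min, info8.max)
```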
# Overkill for our single-digit integer values
a
array([4, 5, 9, 9, 8, 6, 4, 9, 1, 1])
#Number of bytes (8 bits each) for each element in the array
a.itemsize
8
#Total number of bytes for 10 elements
a.nbytes
80
# Recast to 8-bit unsigned integer (valid range: 0-255)
b = a.astype('uint8')
b
array([4, 5, 9, 9, 8, 6, 4, 9, 1, 1], dtype=uint8)
b.dtype
dtype('uint8')
2**8
256
b
array([4, 5, 9, 9, 8, 6, 4, 9, 1, 1], dtype=uint8)
b.nbytes
10
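To make the memory savings explicit, here is a short sketch with a stand-in array (`values` is hypothetical, not the random array above). The total size in bytes is just the number of elements times the bytes per element.

```python
import numpy as np

# Hypothetical stand-in for a 10-element array of small integers
values = np.arange(10)

as_int64 = values.astype('int64')
as_uint8 = values.astype('uint8')

# nbytes is the number of elements times the bytes per element
assert as_int64.nbytes == as_int64.size * as_int64.itemsize
assert as_uint8.nbytes == as_uint8.size * as_uint8.itemsize

print(as_int64.nbytes, as_uint8.nbytes)
```

The same values take one eighth of the memory as uint8, which starts to matter for large rasters.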
#Assign value within valid range
b[0] = 255
b
array([255, 5, 9, 9, 8, 6, 4, 9, 1, 1], dtype=uint8)
#Assign value outside of valid range - overflow!
# https://en.wikipedia.org/wiki/Integer_overflow
b[0] = 257
b
---------------------------------------------------------------------------
OverflowError Traceback (most recent call last)
Cell In[26], line 3
1 #Assign value outside of valid range - overflow!
2 # https://en.wikipedia.org/wiki/Integer_overflow
----> 3 b[0] = 257
4 b
OverflowError: Python integer 257 out of bounds for uint8
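Assigning a Python integer raised an error here, but casting a whole array with astype behaves differently: out-of-range values silently wrap around. The sketch below uses made-up values to show the wrap, plus one common way to avoid it (clipping before the cast).

```python
import numpy as np

# Hypothetical values, some outside the uint8 range of 0-255
values = np.array([0, 255, 256, 257, 300])

# astype does a C-style cast, so out-of-range values wrap modulo 256
wrapped = values.astype('uint8')

# Clipping first saturates at the limits instead of wrapping
clipped = np.clip(values, 0, 255).astype('uint8')

print(wrapped)
print(clipped)
```

Which behavior you want depends on the data; the key is to check the value range before recasting to a smaller type.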
2D arrays#
a2 = np.random.random((10,10))
a2
a2.shape
a2.size
a2.dtype
#Get first element along first axis
#Question: is this the first row or the first col?
a2[0]
#Get first element along second axis
a2[:,0]
#Get first element along both axes
a2[0,0]
#Get slice along first axis - first 3 rows
a2[0:3]
#Slice along second axis - first 3 cols
a2[:,0:3]
a2[0:3,0:3]
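With random values it is hard to see which elements a slice returns. Here is the same indexing on a small array of consecutive integers (`grid` is a made-up name), where the rows and columns are easy to read off.

```python
import numpy as np

# 3 rows x 4 columns of consecutive integers:
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]
grid = np.arange(12).reshape(3, 4)

first_row = grid[0]      # first element along axis 0 is a row
first_col = grid[:, 0]   # first element along axis 1 is a column
corner = grid[0, 0]      # single element
block = grid[0:2, 0:3]   # first 2 rows, first 3 columns

print(first_row)
print(first_col)
print(block)
```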
ufunc#
Universal functions (ufuncs) efficiently apply an operation to every element of an array in “vectorized” fashion (not to be confused with a GIS vector dataset)
Do not loop over arrays in Python (unless absolutely necessary)
https://numpy.org/doc/stable/reference/ufuncs.html
a2 * 10
# Don't do this!
#for n, i in enumerate(a2):
#    a2[n] = i + 10
#np.power(a2, 2)
a2**2
#a2**0.5
np.sqrt(a2)
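To convince yourself that the vectorized form gives the same answer as a loop, here is a minimal sketch comparing the two. The seeded generator and the name `arr` are assumptions for reproducibility, not part of the demo above.

```python
import numpy as np

# Seeded random generator so the example is reproducible
rng = np.random.default_rng(0)
arr = rng.random((10, 10))

# Explicit Python loop over every element (slow for large arrays)
looped = np.empty_like(arr)
for i in range(arr.shape[0]):
    for j in range(arr.shape[1]):
        looped[i, j] = arr[i, j] * 10

# Vectorized version: one expression, evaluated in compiled code
vectorized = arr * 10

# The * operator calls the np.multiply ufunc under the hood
same_as_ufunc = np.array_equal(np.multiply(arr, 10), vectorized)

print(np.allclose(looped, vectorized), same_as_ufunc)
```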
NumPy methods#
Operate over entire array, specified axes, or slice
Very fast/efficient
a2.mean()
a2.std()
a2.min()
a2
Note on axis order#
When indexing, the first axis (0) selects rows and the second axis (1) selects columns
When aggregating (e.g., computing the mean along an axis), you are specifying the dimension of the array that will be collapsed, not the dimension that will be returned
So axis=0 will aggregate values across all rows, returning one value for each column in a 2D array
a2[0:3,0:3]
a2[0:3,0:3].min(axis=0)
a2[0:3,0:3].min(axis=1)
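The axis convention is easier to check on a tiny array where you can do the math in your head. A sketch with a made-up 2 x 3 array:

```python
import numpy as np

# 2 rows, 3 columns
small = np.array([[1, 2, 3],
                  [4, 5, 6]])

# axis=0 collapses the rows: one value per column, shape (3,)
col_min = small.min(axis=0)
col_mean = small.mean(axis=0)

# axis=1 collapses the columns: one value per row, shape (2,)
row_min = small.min(axis=1)

print(col_min, col_min.shape)
print(row_min, row_min.shape)
print(col_mean)
```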
nD arrays#
# how many dimensions can a numpy array have?
# let's try creating arrays with increasingly more dimensions
shape = []
for i in range(100):
    shape.append(1) # add a new dimension of size 1 to the array shape
    an = np.random.random(shape) # create a random array with the given shape
# Expect a ValueError once NumPy's maximum number of dimensions is exceeded
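A variation on the loop above that catches the error and reports where it happened. This is a sketch; the limit depends on your NumPy version (to my knowledge 32 in NumPy 1.x and 64 in 2.x), so the code does not hard-code it.

```python
import numpy as np

shape = []
max_ndim = 0
for i in range(100):
    shape.append(1)  # add a new dimension of size 1
    try:
        arr = np.zeros(shape)
    except ValueError as err:
        # NumPy refuses to create arrays beyond its dimension limit
        print(f"Failed at {len(shape)} dimensions: {err}")
        break
    max_ndim = arr.ndim  # last dimension count that worked

print(f"Largest number of dimensions that worked: {max_ndim}")
```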
Basic array plotting and visualization#
a
plt.plot(a)
a2
plt.plot(a2)
plt.plot(a2[0])
#2D array visualization
plt.imshow(a2, cmap='gray')
plt.colorbar()
plt.hist(a2.ravel(), bins='auto')
Boolean arrays and fancy indexing#
a2
a2 > 0.5
idx = (a2 > 0.5)
idx
# Quick visualization, True = yellow (1)
plt.imshow(idx)
plt.colorbar()
# Return only elements where condition is True
a2[idx]
# Original shape
a2.shape
# Selected shape
a2[idx].shape
#idx.nonzero()[0].size
#Can also be used for assignment
#a2[idx] = 0
### Bitwise operators, combining boolean arrays
(a2 > 0.5)
(a2 < 0.7)
#Bitwise and - True if both are True
idx = (a2 > 0.5) & (a2 < 0.7)
#Bitwise or - True if either is True
#idx = (a2 < 0.5) | (a2 > 0.9)
idx
plt.imshow(idx)
a2[idx]
a2[idx].shape
#Invert the boolean array
~idx
plt.imshow(idx)
plt.imshow(~idx)
a2[~idx].shape
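Here is the same masking logic on five made-up values, so you can verify each result by eye. Note that the parentheses around each comparison are required, because & and | bind more tightly than < and >.

```python
import numpy as np

vals = np.array([0.1, 0.55, 0.65, 0.8, 0.95])

# Bitwise and: True where both conditions hold
mid = (vals > 0.5) & (vals < 0.7)

# Bitwise or: True where either condition holds
ends = (vals < 0.5) | (vals > 0.9)

# Boolean indexing returns a 1D array of the selected values
selected = vals[mid]

# ~ inverts the mask; sum() counts the True values
n_outside = (~mid).sum()

# Masks also work for assignment (on a copy, to keep vals intact)
masked = vals.copy()
masked[mid] = 0

print(mid, ends)
print(selected, n_outside)
print(masked)
```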
Pandas!#
#pd.DataFrame?
df = pd.DataFrame(a2)
df
Wow! Now we have row and column labels for our NumPy array, which will make the values much easier to keep track of.
df.index = ['a','b','c','d','e','f','g','h','i','j']
df
# Still just NumPy array under the hood
df.values
df.index.values
# Mean of each column
df.mean()
# Mean of each row
df.mean(axis=1)
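The same steps on a small labeled DataFrame with known values (the names `frame`, 'x' and 'y' are made up), which makes the column and row means easy to verify.

```python
import numpy as np
import pandas as pd

data = np.array([[1.0, 2.0],
                 [3.0, 4.0],
                 [5.0, 6.0]])

# Wrap the array with row labels (index) and column labels
frame = pd.DataFrame(data, index=['a', 'b', 'c'], columns=['x', 'y'])

# Default axis=0 collapses the rows: one mean per column
col_means = frame.mean()

# axis=1 collapses the columns: one mean per row
row_means = frame.mean(axis=1)

print(col_means)
print(row_means)

# The underlying values are still a NumPy array
print(type(frame.to_numpy()))
```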
Reading files with Pandas#
Most of the time, you will read in tabular data and let Pandas do the work
# Path to csv file
csv_fn = './data/GLAH14_tllz_conus_lulcfilt_demfilt.csv'
pd.read_csv(csv_fn)
# Store output as a new Pandas DataFrame
glas_df = pd.read_csv(csv_fn)
glas_df
type(glas_df)
# For demonstration purposes - multiply index to illustrate difference between loc and iloc
glas_df.set_index(glas_df.index*10+1, inplace=True)
glas_df
# Awesome descriptive statistics for each column
glas_df.describe()
Indexing and selecting#
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#different-choices-for-indexing
# Integer indexing like NumPy
glas_df.iloc[2]
glas_df.iloc[0:3]
glas_df.loc[21]
# Get rows with index labels between 0 and 20 (label slices include the end)
glas_df.loc[0:20]
# Get rows at integer positions 0 through 19 (position slices exclude the end)
glas_df.iloc[0:20]
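The difference between loc and iloc is easier to see on a tiny table. This sketch builds a made-up four-row DataFrame and rescales its index the same way as above, so the labels become 1, 11, 21, 31.

```python
import pandas as pd

demo = pd.DataFrame({'value': [100, 101, 102, 103]})
demo.index = demo.index * 10 + 1   # labels are now 1, 11, 21, 31

# loc uses index labels; iloc uses integer positions
by_label = demo.loc[21]   # row whose label is 21
by_pos = demo.iloc[2]     # third row, which is the same record here

# Label slice: every label from 0 to 20, end included
label_slice = demo.loc[0:20]

# Position slice: positions 0 and 1, end excluded
pos_slice = demo.iloc[0:2]

print(by_label['value'], by_pos['value'])
print(list(label_slice.index), list(pos_slice.index))
```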
Selecting columns#
glas_df.columns
glas_df['glas_z']
glas_df.glas_z
glas_df.iloc[:,4]
glas_df.loc[:,'glas_z']
#Multiple columns - passing the names without a list raises a KeyError
glas_df['glas_z', 'dem_z']
# Need to pass in a list of column names
glas_df[['glas_z', 'dem_z']]
glas_df.loc[:,['glas_z', 'dem_z']]
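A quick sketch of why the list is needed, using a made-up two-row table with the same column names. Without the inner brackets, pandas treats the two names as a single tuple key and cannot find it.

```python
import pandas as pd

# Hypothetical stand-in for the GLAS table
demo = pd.DataFrame({'glas_z': [1.0, 2.0],
                     'dem_z': [1.5, 2.5],
                     'lulc': [12, 31]})

# One column name returns a Series
single = demo['glas_z']

# A list of column names returns a DataFrame
pair = demo[['glas_z', 'dem_z']]

# Two names without a list are read as one tuple key
try:
    demo['glas_z', 'dem_z']
    raised = False
except KeyError as err:
    raised = True
    print("KeyError:", err)

print(type(single).__name__, type(pair).__name__)
```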
Boolean indexing#
glas_df['lulc']
glas_df['lulc'].value_counts()
glas_df['lulc'] == 12
# Boolean Series (index and single column) will be True for records with 'lulc' == 12
idx2 = glas_df['lulc'] == 12
type(idx2)
idx2.shape
glas_df.shape
# Use to select corresponding rows, returns a new DataFrame with all columns
glas_df[idx2]
glas_df[idx2].shape
glas_df[idx2].mean()
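The same boolean selection on a made-up four-row table, where the result is easy to check. The last step combines two conditions with &, just like the NumPy masks earlier.

```python
import pandas as pd

# Hypothetical values; column names match the GLAS table
demo = pd.DataFrame({'glas_z': [10.0, 20.0, 30.0, 40.0],
                     'lulc': [12, 31, 12, 42]})

# Boolean Series: True for rows where lulc is 12
mask = demo['lulc'] == 12

# Select the matching rows (all columns are kept)
subset = demo[mask]
subset_mean = subset['glas_z'].mean()

# Combine conditions with & and parentheses
combined = demo[(demo['lulc'] == 12) & (demo['glas_z'] > 15)]

print(mask.tolist())
print(subset.shape, subset_mean)
print(combined)
```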
Groupby#
Let’s consider statistics for groups of rows that share the same column attribute
glas_df.groupby('lulc')
glas_df.groupby('lulc').count()
glas_df.groupby('lulc').mean()
glas_df.groupby('lulc').agg(['mean', 'std'])
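A minimal groupby sketch on made-up data with two land cover classes, so the per-group statistics can be verified by hand. Selecting a single column before agg keeps the output compact.

```python
import pandas as pd

# Hypothetical values; column names match the GLAS table
demo = pd.DataFrame({'lulc': [12, 12, 31, 31],
                     'glas_z': [10.0, 20.0, 30.0, 50.0]})

# One row per lulc class, one column per statistic
stats = demo.groupby('lulc')['glas_z'].agg(['count', 'mean', 'std'])

print(stats)
```

Each row of the result is labeled by its lulc value, so you can pull out a single group with stats.loc[12].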