Pandas objects can be thought of as enhanced versions of NumPy structured arrays in which the rows and columns are identified with labels rather than simple integer indices.

Let's introduce these three fundamental Pandas data structures: the `Series`, `DataFrame`, and `Index`.

In [14]:

```
import numpy as np
import pandas as pd
```

A Pandas `Series` is a one-dimensional array of indexed data:

In [15]:

```
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data
```

Out[15]:

0 0.25 1 0.50 2 0.75 3 1.00 dtype: float64

`Series` objects wrap both a sequence of values and a sequence of indices, which we can access with the `values` and `index` attributes.

The `values` are simply a familiar NumPy array:

In [16]:

```
data.values
```

Out[16]:

array([0.25, 0.5 , 0.75, 1. ])

In [17]:

```
type(_)
```

Out[17]:

numpy.ndarray

The `index` is an array-like object of type `pd.Index`:

In [18]:

```
data.index
```

Out[18]:

RangeIndex(start=0, stop=4, step=1)

In [19]:

```
print(pd.RangeIndex.__doc__)
```

Like with a NumPy array, data can be accessed by the associated index via the familiar Python square-bracket notation:

In [20]:

```
data[1]
```

Out[20]:

0.5

In [21]:

```
data[1:3]
```

Out[21]:

1 0.50 2 0.75 dtype: float64

In [22]:

```
type(_)
```

Out[22]:

pandas.core.series.Series

`Series` as generalized NumPy array

It may look like the `Series` object is basically interchangeable with a one-dimensional NumPy array.
The essential difference is the presence of the index: while the NumPy array has an *implicitly defined* integer index used to access the values, the Pandas `Series` has an *explicitly defined* index associated with the values.

In [23]:

```
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
data
```

Out[23]:

a 0.25 b 0.50 c 0.75 d 1.00 dtype: float64

In [24]:

```
data['b'] # item access works as expected
```

Out[24]:

0.5

We can even use non-contiguous or non-sequential indices:

In [25]:

```
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])
data
```

Out[25]:

2 0.25 5 0.50 3 0.75 7 1.00 dtype: float64

In [11]:

```
data[5]
```

Out[11]:

0.5

You can think of a Pandas `Series` a bit like a specialization of a Python dictionary.
A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a `Series` is a structure which maps typed keys to a set of typed values:

In [27]:

```
population_dict = {'California': 38332521,
'Texas': 26448193,
'New York': 19651127,
'Florida': 19552860,
'Illinois': 12882135}
population = pd.Series(population_dict)
population
```

Out[27]:

California 38332521 Texas 26448193 New York 19651127 Florida 19552860 Illinois 12882135 dtype: int64

In [28]:

```
type(_)
```

Out[28]:

pandas.core.series.Series

In [29]:

```
population.index
```

Out[29]:

Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

In [30]:

```
population['California'] # typical dictionary-style item access
```

Out[30]:

38332521

In [31]:

```
population['California':'Illinois'] # array-like slicing
```

Out[31]:

California 38332521 Texas 26448193 New York 19651127 Florida 19552860 Illinois 12882135 dtype: int64

The next fundamental structure in Pandas is the `DataFrame`, which can be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary.

If a `Series` is an analog of a one-dimensional array with flexible indices, a `DataFrame` is an analog of a two-dimensional array with both flexible row indices and flexible column names.

You can think of a `DataFrame` as a sequence of *aligned* `Series` objects.
Here, by *aligned* we mean that they share the same index:

In [32]:

```
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area
```

Out[32]:

California 423967 Texas 695662 New York 141297 Florida 170312 Illinois 149995 dtype: int64

In [36]:

```
states = pd.DataFrame({'population': population, 'area': area})
states
```

Out[36]:

population | area | |
---|---|---|

California | 38332521 | 423967 |

Texas | 26448193 | 695662 |

New York | 19651127 | 141297 |

Florida | 19552860 | 170312 |

Illinois | 12882135 | 149995 |

In [37]:

```
type(_)
```

Out[37]:

pandas.core.frame.DataFrame

In [38]:

```
states.index
```

Out[38]:

Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

Additionally, the `DataFrame` has a `columns` attribute, which is an `Index` object holding the column labels:

In [39]:

```
states.columns
```

Out[39]:

Index(['population', 'area'], dtype='object')

In [40]:

```
type(_)
```

Out[40]:

pandas.core.indexes.base.Index

Thus the `DataFrame` can be thought of as a generalization of a two-dimensional NumPy array, where both the rows and columns have a generalized index for accessing the data.

Similarly, we can also think of a `DataFrame` as a specialization of a dictionary.
Where a dictionary maps a key to a value, a `DataFrame` maps a column name to a `Series` of column data:

In [41]:

```
states['area'] # "feature"
```

Out[41]:

California 423967 Texas 695662 New York 141297 Florida 170312 Illinois 149995 Name: area, dtype: int64

A `DataFrame` is a collection of `Series` objects, and a single-column `DataFrame` can be constructed from a single `Series`:

In [42]:

```
pd.DataFrame(population, columns=['population'])
```

Out[42]:

population | |
---|---|

California | 38332521 |

Texas | 26448193 |

New York | 19651127 |

Florida | 19552860 |

Illinois | 12882135 |

In [48]:

```
data = [{'a': i, 'b': 2 * i} for i in range(3)]
data
```

Out[48]:

[{'a': 0, 'b': 0}, {'a': 1, 'b': 2}, {'a': 2, 'b': 4}]

In [49]:

```
pd.DataFrame(data)
```

Out[49]:

a | b | |
---|---|---|

0 | 0 | 0 |

1 | 1 | 2 |

2 | 2 | 4 |

In [23]:

```
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}]) # Pandas will fill missing keys with ``NaN``
```

Out[23]:

a | b | c | |
---|---|---|---|

0 | 1.0 | 2 | NaN |

1 | NaN | 3 | 4.0 |

Given a two-dimensional array of data, we can create a `DataFrame` with any specified column and index names:

In [50]:

```
np.random.rand(3, 2)
```

Out[50]:

array([[0.30282887, 0.48376433], [0.53588853, 0.97428136], [0.94756199, 0.46766408]])

In [51]:

```
pd.DataFrame(np.random.rand(3, 2),
columns=['foo', 'bar'],
index=['a', 'b', 'c'])
```

Out[51]:

foo | bar | |
---|---|---|

a | 0.759907 | 0.458958 |

b | 0.776779 | 0.767430 |

c | 0.131552 | 0.740137 |

In [52]:

```
A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
A
```

Out[52]:

array([(0, 0.), (0, 0.), (0, 0.)], dtype=[('A', '<i8'), ('B', '<f8')])

In [53]:

```
pd.DataFrame(A)
```

Out[53]:

A | B | |
---|---|---|

0 | 0 | 0.0 |

1 | 0 | 0.0 |

2 | 0 | 0.0 |

This `Index` object is an interesting structure in itself, and it can be thought of either as an *immutable array* or as an *ordered set* (technically a multi-set, as `Index` objects may contain repeated values).

In [55]:

```
ind = pd.Index([2, 3, 5, 7, 11])
ind
```

Out[55]:

Int64Index([2, 3, 5, 7, 11], dtype='int64')

The `Index` in many ways operates like an array.

In [56]:

```
ind[1]
```

Out[56]:

3

In [58]:

```
ind[::2]
```

Out[58]:

Int64Index([2, 5, 11], dtype='int64')

`Index` objects also have many of the attributes familiar from NumPy arrays:

In [59]:

```
ind.size, ind.shape, ind.ndim, ind.dtype,
```

Out[59]:

(5, (5,), 1, dtype('int64'))

One difference is that indices are immutable–that is, they cannot be modified via the normal means:

In [60]:

```
ind[1] = 0
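# this raises a TypeError: Index does not support mutable operations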
```

The `Index` object follows many of the conventions used by Python's built-in `set` data structure, so that unions, intersections, differences, and other combinations can be computed in a familiar way:

In [61]:

```
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])
```

In [62]:

```
indA.intersection(indB) # intersection
```

Out[62]:

Int64Index([3, 5, 7], dtype='int64')

In [39]:

```
indA.union(indB) # union
```

Out[39]:

Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')

In [40]:

```
indA.symmetric_difference(indB) # symmetric difference
```

Out[40]:

Int64Index([1, 2, 9, 11], dtype='int64')

To modify values in NumPy arrays we use indexing (e.g., `arr[2, 1]`), slicing (e.g., `arr[:, 1:5]`), masking (e.g., `arr[arr > 0]`), fancy indexing (e.g., `arr[0, [1, 5]]`), and combinations thereof (e.g., `arr[:, [1, 5]]`).
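For reference, here is a minimal sketch of those NumPy patterns on a small example array (the array `arr` below is purely illustrative):

```
arr = np.arange(12).reshape(3, 4)
arr[2, 1] = 99       # indexing: modify a single element
arr[:, 1:3] = 0      # slicing: modify a block of columns
arr[arr > 5] = -1    # masking: modify all elements matching a condition
arr[0, [1, 3]] = 7   # fancy indexing: modify a list of positions in row 0
```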

Here we'll look at similar means of accessing and modifying values in Pandas `Series` and `DataFrame` objects.
If you have used the NumPy patterns, the corresponding patterns in Pandas will feel very familiar, though there are a few quirks to be aware of.

A `Series` object acts in many ways like a one-dimensional NumPy array, and in many ways like a standard Python dictionary.

Like a dictionary, the `Series` object provides a mapping from a collection of keys to a collection of values:

In [63]:

```
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
data
```

Out[63]:

a 0.25 b 0.50 c 0.75 d 1.00 dtype: float64

In [64]:

```
data['b'] # mnemonic indexing
```

Out[64]:

0.5

In [65]:

```
'a' in data # dictionary-like Python expressions...
```

Out[65]:

True

In [67]:

```
data.keys() # ...and methods.
```

Out[67]:

Index(['a', 'b', 'c', 'd'], dtype='object')

In [68]:

```
list(data.items())
```

Out[68]:

[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

`Series` objects can even be modified with a dictionary-like syntax:

In [69]:

```
data['e'] = 1.25
data
```

Out[69]:

a 0.25 b 0.50 c 0.75 d 1.00 e 1.25 dtype: float64

This easy mutability of the objects is a convenient feature: under the hood, Pandas is making decisions about memory layout and data copying that might need to take place.

A `Series` builds on this dictionary-like interface and provides array-style item selection via the same basic mechanisms as NumPy arrays – that is, *slices*, *masking*, and *fancy indexing*:

In [47]:

```
data['a':'c'] # slicing by explicit index
```

Out[47]:

a 0.25 b 0.50 c 0.75 dtype: float64

In [48]:

```
data[0:2] # slicing by implicit integer index
```

Out[48]:

a 0.25 b 0.50 dtype: float64

In [49]:

```
data[(data > 0.3) & (data < 0.8)] # masking
```

Out[49]:

b 0.50 c 0.75 dtype: float64

This works because the masking expression itself returns a Boolean `Series`:

In [70]:

```
(data > 0.3) & (data < 0.8)
```

Out[70]:

a False b True c True d False e False dtype: bool

In [71]:

```
type(_)
```

Out[71]:

pandas.core.series.Series

In [50]:

```
data[['a', 'e']] # fancy indexing
```

Out[50]:

a 0.25 e 1.25 dtype: float64

Notice that when slicing with an explicit index (i.e., `data['a':'c']`), the final index is *included* in the slice, while when slicing with an implicit index (i.e., `data[0:2]`), the final index is *excluded* from the slice.

If your `Series` has an explicit integer index, an indexing operation such as `data[1]` will use the explicit indices, while a slicing operation like `data[1:3]` will use the implicit Python-style index.

In [72]:

```
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data
```

Out[72]:

1 a 3 b 5 c dtype: object

In [73]:

```
data[1] # explicit index when indexing
```

Out[73]:

'a'

In [74]:

```
data[1:3] # implicit index when slicing
```

Out[74]:

3 b 5 c dtype: object

Because of this potential confusion in the case of integer indexes, Pandas provides some special *indexer* attributes that explicitly expose certain indexing schemes.
These are not functional methods, but attributes that expose a particular slicing interface to the data in the `Series`.

First, the `loc` attribute allows indexing and slicing that always references the explicit index:

In [75]:

```
data.loc[1]
```

Out[75]:

'a'

In [76]:

```
data.loc[1:3]
```

Out[76]:

1 a 3 b dtype: object

The `iloc` attribute allows indexing and slicing that always references the implicit Python-style index:

In [77]:

```
data.iloc[1:3]
```

Out[77]:

3 b 5 c dtype: object

A third indexing attribute, `ix`, is a hybrid of the two, and for `Series` objects is equivalent to standard `[]`-based indexing.
The purpose of the `ix` indexer will become more apparent in the context of `DataFrame` objects.

Recall that a `DataFrame` acts in many ways like a two-dimensional or structured array, and in other ways like a dictionary of `Series` structures sharing the same index.

In [78]:

```
area = pd.Series({'California': 423967, 'Texas': 695662,
'New York': 141297, 'Florida': 170312,
'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
'New York': 19651127, 'Florida': 19552860,
'Illinois': 12882135})
```

In [79]:

```
data = pd.DataFrame({'area':area, 'pop':pop})
data
```

Out[79]:

area | pop | |
---|---|---|

California | 423967 | 38332521 |

Texas | 695662 | 26448193 |

New York | 141297 | 19651127 |

Florida | 170312 | 19552860 |

Illinois | 149995 | 12882135 |

In [80]:

```
data['area'] # columns can be accessed via dict-style indexing
```

Out[80]:

California 423967 Texas 695662 New York 141297 Florida 170312 Illinois 149995 Name: area, dtype: int64

In [81]:

```
data.area # alternatively, use attribute-style access with column names
```

Out[81]:

California 423967 Texas 695662 New York 141297 Florida 170312 Illinois 149995 Name: area, dtype: int64

This dictionary-style syntax can also be used to modify the object, in this case adding a new column:

In [82]:

```
data['density'] = data['pop'] / data['area']
data
```

Out[82]:

area | pop | density | |
---|---|---|---|

California | 423967 | 38332521 | 90.413926 |

Texas | 695662 | 26448193 | 38.018740 |

New York | 141297 | 19651127 | 139.076746 |

Florida | 170312 | 19552860 | 114.806121 |

Illinois | 149995 | 12882135 | 85.883763 |

A `DataFrame` can also be viewed as an enhanced two-dimensional array:

In [83]:

```
data.values # examine the raw underlying data array
```

Out[83]:

array([[4.23967000e+05, 3.83325210e+07, 9.04139261e+01], [6.95662000e+05, 2.64481930e+07, 3.80187404e+01], [1.41297000e+05, 1.96511270e+07, 1.39076746e+02], [1.70312000e+05, 1.95528600e+07, 1.14806121e+02], [1.49995000e+05, 1.28821350e+07, 8.58837628e+01]])

In [87]:

```
data.values.T
```

Out[87]:

array([[4.23967000e+05, 6.95662000e+05, 1.41297000e+05, 1.70312000e+05, 1.49995000e+05], [3.83325210e+07, 2.64481930e+07, 1.96511270e+07, 1.95528600e+07, 1.28821350e+07], [9.04139261e+01, 3.80187404e+01, 1.39076746e+02, 1.14806121e+02, 8.58837628e+01]])

In [88]:

```
type(_)
```

Out[88]:

numpy.ndarray

In [89]:

```
data.T # transpose the full DataFrame object
```

Out[89]:

California | Texas | New York | Florida | Illinois | |
---|---|---|---|---|---|

area | 4.239670e+05 | 6.956620e+05 | 1.412970e+05 | 1.703120e+05 | 1.499950e+05 |

pop | 3.833252e+07 | 2.644819e+07 | 1.965113e+07 | 1.955286e+07 | 1.288214e+07 |

density | 9.041393e+01 | 3.801874e+01 | 1.390767e+02 | 1.148061e+02 | 8.588376e+01 |

In [90]:

```
type(_)
```

Out[90]:

pandas.core.frame.DataFrame

In [91]:

```
data.values[0] # passing a single index to an array accesses a row
```

Out[91]:

array([4.23967000e+05, 3.83325210e+07, 9.04139261e+01])

In [92]:

```
data['area'] # passing a single "index" accesses a column
```

Out[92]:

California 423967 Texas 695662 New York 141297 Florida 170312 Illinois 149995 Name: area, dtype: int64

Using the `iloc` indexer, we can index the underlying array as if it were a simple NumPy array (using the implicit Python-style index):

In [93]:

```
data.iloc[:3, :2]
```

Out[93]:

area | pop | |
---|---|---|

California | 423967 | 38332521 |

Texas | 695662 | 26448193 |

New York | 141297 | 19651127 |

Similarly, using the `loc` indexer we can index the underlying data in an array-like style, but using the explicit index and column names:

In [67]:

```
data.loc[:'Illinois', :'pop']
```

Out[67]:

area | pop | |
---|---|---|

California | 423967 | 38332521 |

Texas | 695662 | 26448193 |

New York | 141297 | 19651127 |

Florida | 170312 | 19552860 |

Illinois | 149995 | 12882135 |

Any of the familiar NumPy-style data access patterns can be used within these indexers.

In [68]:

```
data.loc[data.density > 100, ['pop', 'density']]
```

Out[68]:

pop | density | |
---|---|---|

New York | 19651127 | 139.076746 |

Florida | 19552860 | 114.806121 |

Any of these indexing conventions may also be used to set or modify values; this is done in the standard way that you might be accustomed to from working with NumPy:

In [69]:

```
data.iloc[0, 2] = 90
data
```

Out[69]:

area | pop | density | |
---|---|---|---|

California | 423967 | 38332521 | 90.000000 |

Texas | 695662 | 26448193 | 38.018740 |

New York | 141297 | 19651127 | 139.076746 |

Florida | 170312 | 19552860 | 114.806121 |

Illinois | 149995 | 12882135 | 85.883763 |

In [70]:

```
data['Florida':'Illinois'] # *slicing* refers to rows
```

Out[70]:

area | pop | density | |
---|---|---|---|

Florida | 170312 | 19552860 | 114.806121 |

Illinois | 149995 | 12882135 | 85.883763 |

In [71]:

```
data[data.density > 100] # direct masking operations are also interpreted row-wise
```

Out[71]:

area | pop | density | |
---|---|---|---|

New York | 141297 | 19651127 | 139.076746 |

Florida | 170312 | 19552860 | 114.806121 |

One of the essential pieces of NumPy is the ability to perform quick element-wise operations, both with basic arithmetic (addition, subtraction, multiplication, etc.) and with more sophisticated operations (trigonometric functions, exponential and logarithmic functions, etc.).

Pandas inherits much of this functionality from NumPy.

Pandas includes a couple useful twists, however: for unary operations like negation and trigonometric functions, these ufuncs will *preserve index and column labels* in the output, and for binary operations such as addition and multiplication, Pandas will automatically *align indices* when passing the objects to the ufunc.

Because Pandas is designed to work with NumPy, any NumPy ufunc will work on Pandas `Series` and `DataFrame` objects:

In [94]:

```
rng = np.random.RandomState(42)
ser = pd.Series(rng.randint(0, 10, 4))
ser
```

Out[94]:

0 6 1 3 2 7 3 4 dtype: int64

In [96]:

```
rng.randint(0, 10, (3, 4))
```

Out[96]:

array([[1, 7, 5, 1], [4, 0, 9, 5], [8, 0, 9, 2]])

In [97]:

```
df = pd.DataFrame(rng.randint(0, 10, (3, 4)), columns=['A', 'B', 'C', 'D'])
df
```

Out[97]:

A | B | C | D | |
---|---|---|---|---|

0 | 6 | 3 | 8 | 2 |

1 | 4 | 2 | 6 | 4 |

2 | 8 | 6 | 1 | 3 |

If we apply a NumPy ufunc on either of these objects, the result will be another Pandas object *with the indices preserved:*

In [100]:

```
np.exp(ser)
```

Out[100]:

0 403.428793 1 20.085537 2 1096.633158 3 54.598150 dtype: float64

In [101]:

```
type(_)
```

Out[101]:

pandas.core.series.Series

In [102]:

```
np.sin(df * np.pi / 4) # a slightly more complex calculation
```

Out[102]:

A | B | C | D | |
---|---|---|---|---|

0 | -1.000000e+00 | 0.707107 | -2.449294e-16 | 1.000000e+00 |

1 | 1.224647e-16 | 1.000000 | -1.000000e+00 | 1.224647e-16 |

2 | -2.449294e-16 | -1.000000 | 7.071068e-01 | 7.071068e-01 |

In [103]:

```
type(_)
```

Out[103]:

pandas.core.frame.DataFrame

For binary operations on two `Series` or `DataFrame` objects, Pandas will align indices in the process of performing the operation.

Suppose we are combining two different data sources, and find only the top three US states by *area* and the top three US states by *population*:

In [104]:

```
area = pd.Series({'Alaska': 1723337, 'Texas': 695662, 'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193, 'New York': 19651127}, name='population')
```

In [106]:

```
population / area
```

Out[106]:

Alaska NaN California 90.413926 New York NaN Texas 38.018740 dtype: float64

The resulting array contains the *union* of indices of the two input arrays, which could be determined using standard Python set arithmetic on these indices:

In [107]:

```
area.index.union(population.index) # this does create a new index and doesn't modify in place.
```

Out[107]:

Index(['Alaska', 'California', 'New York', 'Texas'], dtype='object')

In [108]:

```
area.index
```

Out[108]:

Index(['Alaska', 'Texas', 'California'], dtype='object')

Any item for which one or the other does not have an entry is marked with `NaN`, or "Not a Number," which is how Pandas marks missing data.
This index matching happens for any of Python's built-in arithmetic expressions; any missing values are filled in with NaN by default:

In [109]:

```
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
A + B
```

Out[109]:

0 NaN 1 5.0 2 9.0 3 NaN dtype: float64

If using NaN values is not the desired behavior, the fill value can be modified using appropriate object methods in place of the operators:

In [110]:

```
A.add(B, fill_value=0)
```

Out[110]:

0 2.0 1 5.0 2 9.0 3 5.0 dtype: float64

A similar type of alignment takes place for *both* columns and indices when performing operations on `DataFrame`s:

In [111]:

```
A = pd.DataFrame(rng.randint(0, 20, (2, 2)), columns=list('AB'))
A
```

Out[111]:

A | B | |
---|---|---|

0 | 13 | 17 |

1 | 8 | 1 |

In [113]:

```
B = pd.DataFrame(rng.randint(0, 10, (3, 3)), columns=list('BAC'))
B
```

Out[113]:

B | A | C | |
---|---|---|---|

0 | 1 | 5 | 5 |

1 | 9 | 3 | 5 |

2 | 1 | 9 | 1 |

In [114]:

```
A + B
```

Out[114]:

A | B | C | |
---|---|---|---|

0 | 18.0 | 18.0 | NaN |

1 | 11.0 | 10.0 | NaN |

2 | NaN | NaN | NaN |

In [116]:

```
fill = A.stack().mean()
fill
```

Out[116]:

9.75

In [117]:

```
A.add(B, fill_value=fill)
```

Out[117]:

A | B | C | |
---|---|---|---|

0 | 18.00 | 18.00 | 14.75 |

1 | 11.00 | 10.00 | 14.75 |

2 | 18.75 | 10.75 | 10.75 |

The following table lists Python operators and their equivalent Pandas object methods:

| Python Operator | Pandas Method(s) |
|---|---|
| `+` | `add()` |
| `-` | `sub()`, `subtract()` |
| `*` | `mul()`, `multiply()` |
| `/` | `truediv()`, `div()`, `divide()` |
| `//` | `floordiv()` |
| `%` | `mod()` |
| `**` | `pow()` |
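As a small illustration (using two hypothetical Series `s1` and `s2`), each operator has an equivalent method, and only the method form accepts extra keywords such as `fill_value`:

```
s1 = pd.Series([1, 2, 3])
s2 = pd.Series([10, 20, 30])
s1 * s2                    # operator form
s1.mul(s2)                 # equivalent method form
s1.mul(s2, fill_value=1)   # the method form also accepts a fill value
```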

When performing operations between a `DataFrame` and a `Series`, the index and column alignment is similarly maintained.
Operations between a `DataFrame` and a `Series` are similar to operations between a two-dimensional and one-dimensional NumPy array.

In [87]:

```
A = rng.randint(10, size=(3, 4))
A
```

Out[87]:

array([[3, 8, 2, 4], [2, 6, 4, 8], [6, 1, 3, 8]])

In [89]:

```
type(A)
```

Out[89]:

numpy.ndarray

In [88]:

```
A - A[0]
```

Out[88]:

array([[ 0, 0, 0, 0], [-1, -2, 2, 4], [ 3, -7, 1, 4]])

According to NumPy's broadcasting rules, subtraction between a two-dimensional array and one of its rows is applied row-wise.

In Pandas, the convention similarly operates row-wise by default:

In [90]:

```
df = pd.DataFrame(A, columns=list('QRST'))
df - df.iloc[0]
```

Out[90]:

Q | R | S | T | |
---|---|---|---|---|

0 | 0 | 0 | 0 | 0 |

1 | -1 | -2 | 2 | 4 |

2 | 3 | -7 | 1 | 4 |

If you would instead like to operate column-wise, you have to specify the `axis` keyword:

In [91]:

```
df.subtract(df['R'], axis=0)
```

Out[91]:

Q | R | S | T | |
---|---|---|---|---|

0 | -5 | 0 | -6 | -4 |

1 | -4 | 0 | -2 | 2 |

2 | 5 | 0 | 2 | 7 |

The difference between data found in many tutorials and data in the real world is that real-world data is rarely clean and homogeneous. In particular, many interesting datasets will have some amount of data missing.

To make matters even more complicated, different data sources may indicate missing data in different ways.

To indicate the presence of missing data in a table or DataFrame we can use two strategies: using a *mask* that globally indicates missing values, or choosing a *sentinel value* that indicates a missing entry.

In the masking approach, the mask might be an entirely separate Boolean array, or it may involve appropriation of one bit in the data representation to locally indicate the null status of a value.

In the sentinel approach, the sentinel value could be some data-specific convention, such as indicating a missing integer value with -9999 or some rare bit pattern, or it could be a more global convention, such as indicating a missing floating-point value with NaN (Not a Number).

None of these approaches is without trade-offs: use of a separate mask array requires allocation of an additional Boolean array. A sentinel value reduces the range of valid values that can be represented, and may require extra (often non-optimized) logic in CPU and GPU arithmetic.
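To make the two strategies concrete, here is a minimal sketch (purely illustrative, not how Pandas itself stores data) contrasting NumPy's masked-array approach with a NaN sentinel:

```
import numpy as np
import numpy.ma as ma

# masking approach: a separate Boolean array marks the missing entry
masked = ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])
masked.sum()         # 4.0 -- the masked value is ignored

# sentinel approach: a special value (NaN) stands in for the missing entry
sentinel = np.array([1.0, np.nan, 3.0])
np.nansum(sentinel)  # 4.0 -- NaN must be skipped explicitly
```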

The way in which Pandas handles missing values is constrained by its reliance on the NumPy package, which does **not have** a built-in notion of NA values for non-floating-point data types.

NumPy does have support for masked arrays – that is, arrays that have a separate Boolean mask array attached for marking data as "good" or "bad." Pandas could have derived from this, but the overhead in storage, computation, and code maintenance makes that an unattractive choice.

With these constraints in mind, Pandas chose to use sentinels for missing data, and further chose to use two already-existing Python null values: the special floating-point `NaN` value, and the Python `None` object.

`None`: Pythonic missing data

The first sentinel value used by Pandas is `None`, a Python singleton object that is often used for missing data in Python code.
Because it is a Python object, `None` cannot be used in any arbitrary NumPy/Pandas array, but only in arrays with data type `'object'` (i.e., arrays of Python objects):

In [118]:

```
vals1 = np.array([1, None, 3, 4])
vals1
```

Out[118]:

array([1, None, 3, 4], dtype=object)

Any operations on the data will be done at the Python level, with much more overhead than the typically fast operations seen for arrays with native types:

In [93]:

```
for dtype in ['object', 'int']:
    print("dtype =", dtype)
    %timeit np.arange(1E6, dtype=dtype).sum()
    print()
```

The use of Python objects in an array also means that if you perform aggregations like `sum()` or `min()` across an array with a `None` value, you will generally get an error:

In [94]:

```
vals1.sum()
```

`NaN`: Missing numerical data

The other missing data representation, `NaN` (an acronym for *Not a Number*), is different; it is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation:

In [119]:

```
vals2 = np.array([1, np.nan, 3, 4])
vals2.dtype
```

Out[119]:

dtype('float64')

In [120]:

```
1 + np.nan, 0 * np.nan
```

Out[120]:

(nan, nan)

In [121]:

```
vals2.sum(), vals2.min(), vals2.max()
```

Out[121]:

(nan, nan, nan)

NumPy does provide some special aggregations that will ignore these missing values:

In [98]:

```
np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2)
```

Out[98]:

(8.0, 1.0, 4.0)

`NaN` and `None` both have their place, and Pandas is built to handle the two of them nearly interchangeably, converting between them where appropriate:

In [99]:

```
pd.Series([1, np.nan, 2, None])
```

Out[99]:

0 1.0 1 NaN 2 2.0 3 NaN dtype: float64

The following table lists the upcasting conventions in Pandas when NA values are introduced:

| Typeclass | Conversion When Storing NAs | NA Sentinel Value |
|---|---|---|
| `floating` | No change | `np.nan` |
| `object` | No change | `None` or `np.nan` |
| `integer` | Cast to `float64` | `np.nan` |
| `boolean` | Cast to `object` | `None` or `np.nan` |
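As a quick illustration of the integer rule in this table (exact behavior may vary slightly across Pandas versions), assigning a missing value to an integer Series upcasts the whole Series to floating point:

```
x = pd.Series(range(2), dtype=int)
print(x.dtype)   # int64 (on most platforms)
x[0] = None      # None is converted to NaN...
print(x.dtype)   # ...and the Series is upcast to float64
```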

Keep in mind that in Pandas, string data is always stored with an `object` dtype.

As we have seen, Pandas treats `None` and `NaN` as essentially interchangeable for indicating missing or null values.
To facilitate this convention, there are several useful methods for detecting, removing, and replacing null values in Pandas data structures.
They are:

- `isnull()`: Generate a Boolean mask indicating missing values
- `notnull()`: Opposite of `isnull()`
- `dropna()`: Return a filtered version of the data
- `fillna()`: Return a copy of the data with missing values filled or imputed

Pandas data structures have two useful methods for detecting null data: `isnull()` and `notnull()`.
Either one will return a Boolean mask over the data:

In [122]:

```
data = pd.Series([1, np.nan, 'hello', None])
data.isnull()
```

Out[122]:

0 False 1 True 2 False 3 True dtype: bool

In addition to the masking used before, there are the convenience methods `dropna()` (which removes NA values) and `fillna()` (which fills in NA values):

In [101]:

```
data.dropna()
```

Out[101]:

0 1 2 hello dtype: object

For a `DataFrame`, there are more options:

In [123]:

```
df = pd.DataFrame([[1, np.nan, 2],
[2, 3, 5],
[np.nan, 4, 6]])
df
```

Out[123]:

0 | 1 | 2 | |
---|---|---|---|

0 | 1.0 | NaN | 2 |

1 | 2.0 | 3.0 | 5 |

2 | NaN | 4.0 | 6 |

In [124]:

```
df.dropna() # drop all rows in which *any* null value is present
```

Out[124]:

0 | 1 | 2 | |
---|---|---|---|

1 | 2.0 | 3.0 | 5 |

In [104]:

```
df.dropna(axis='columns') # drop all columns containing a null value
```

Out[104]:

2 | |
---|---|

0 | 2 |

1 | 5 |

2 | 6 |

The default is `how='any'`, such that any row or column (depending on the `axis` keyword) containing a null value will be dropped.

In [105]:

```
df[3] = np.nan
df
```

Out[105]:

0 | 1 | 2 | 3 | |
---|---|---|---|---|

0 | 1.0 | NaN | 2 | NaN |

1 | 2.0 | 3.0 | 5 | NaN |

2 | NaN | 4.0 | 6 | NaN |

You can also specify `how='all'`, which will only drop rows/columns that are *all* null values:

In [106]:

```
df.dropna(axis='columns', how='all')
```

Out[106]:

0 | 1 | 2 | |
---|---|---|---|

0 | 1.0 | NaN | 2 |

1 | 2.0 | 3.0 | 5 |

2 | NaN | 4.0 | 6 |

The `thresh` parameter lets you specify a minimum number of non-null values for the row/column to be kept:

In [107]:

```
df.dropna(axis='rows', thresh=3)
```

Out[107]:

0 | 1 | 2 | 3 | |
---|---|---|---|---|

1 | 2.0 | 3.0 | 5 | NaN |

Sometimes rather than dropping NA values, you'd rather replace them with a valid value.
This value might be a single number like zero, or it might be some sort of imputation or interpolation from the good values.
You could do this in place using the `isnull()` method as a mask, but because it is such a common operation Pandas provides the `fillna()` method, which returns a copy of the array with the null values replaced.

In [108]:

```
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
data
```

Out[108]:

a 1.0 b NaN c 2.0 d NaN e 3.0 dtype: float64

In [109]:

```
data.fillna(0) # fill NA entries with a single value
```

Out[109]:

a 1.0 b 0.0 c 2.0 d 0.0 e 3.0 dtype: float64

In [110]:

```
data.fillna(method='ffill') # specify a forward-fill to propagate the previous value forward
```

Out[110]:

a 1.0 b 1.0 c 2.0 d 2.0 e 3.0 dtype: float64

In [111]:

```
data.fillna(method='bfill') # specify a back-fill to propagate the next values backward
```

Out[111]:

a 1.0 b 2.0 c 2.0 d 3.0 e 3.0 dtype: float64

For `DataFrame`s, the options are similar, but we can also specify an `axis` along which the fills take place:

In [112]:

```
df
```

Out[112]:

0 | 1 | 2 | 3 | |
---|---|---|---|---|

0 | 1.0 | NaN | 2 | NaN |

1 | 2.0 | 3.0 | 5 | NaN |

2 | NaN | 4.0 | 6 | NaN |

In [113]:

```
df.fillna(method='ffill', axis=1)
```

Out[113]:

0 | 1 | 2 | 3 | |
---|---|---|---|---|

0 | 1.0 | 1.0 | 2.0 | 2.0 |

1 | 2.0 | 3.0 | 5.0 | 5.0 |

2 | NaN | 4.0 | 6.0 | 6.0 |

Up to this point we've been focused primarily on one-dimensional and two-dimensional data, stored in Pandas `Series` and `DataFrame` objects, respectively.
Often it is useful to go beyond this and store higher-dimensional data–that is, data indexed by more than one or two keys.

A far more common pattern in practice is to make use of *hierarchical indexing* (also known as *multi-indexing*) to incorporate multiple index *levels* within a single index.
In this way, higher-dimensional data can be compactly represented within the familiar one-dimensional `Series` and two-dimensional `DataFrame` objects.

Let's start by considering how we might represent two-dimensional data within a one-dimensional `Series`.

Suppose you would like to track data about states from two different years. Using the Pandas tools we've already covered, you might be tempted to simply use Python tuples as keys:

In [125]:

```
index = [('California', 2000), ('California', 2010),
('New York', 2000), ('New York', 2010),
('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
18976457, 19378102,
20851820, 25145561]
pop = pd.Series(populations, index=index)
pop
```

Out[125]:

(California, 2000) 33871648 (California, 2010) 37253956 (New York, 2000) 18976457 (New York, 2010) 19378102 (Texas, 2000) 20851820 (Texas, 2010) 25145561 dtype: int64

If you need to select all values from 2010, you'll need to do some messy (and potentially slow) munging to make it happen:

In [115]:

```
pop[[i for i in pop.index if i[1] == 2010]]
```

Out[115]:

(California, 2010) 37253956 (New York, 2010) 19378102 (Texas, 2010) 25145561 dtype: int64

Our tuple-based indexing is essentially a rudimentary multi-index, and the Pandas `MultiIndex` type gives us the type of operations we wish to have:

In [126]:

```
index = pd.MultiIndex.from_tuples(index)
index
```

Out[126]:

MultiIndex([('California', 2000), ('California', 2010), ( 'New York', 2000), ( 'New York', 2010), ( 'Texas', 2000), ( 'Texas', 2010)], )

In [127]:

```
type(_)
```

Out[127]:

pandas.core.indexes.multi.MultiIndex

A `MultiIndex` contains multiple *levels* of indexing–in this case, the state names and the years–as well as multiple *labels* for each data point which encode these levels.

If we re-index our series with this `MultiIndex`, we see the hierarchical representation of the data:

In [128]:

```
pop = pop.reindex(index)
pop
```

Out[128]:

California 2000 33871648 2010 37253956 New York 2000 18976457 2010 19378102 Texas 2000 20851820 2010 25145561 dtype: int64

Here the first two columns of the `Series` representation show the multiple index values, while the third column shows the data.

Notice that some entries are missing in the first column: in this multi-index representation, any blank entry indicates the same value as the line above it.

Now to access all data for which the second index is 2010, we can simply use the Pandas slicing notation:

In [118]:

```
pop[:, 2010]
```

Out[118]:

California 37253956 New York 19378102 Texas 25145561 dtype: int64

The result is a singly indexed array with just the keys we're interested in. This syntax is much more convenient (and the operation is much more efficient!) than the home-spun tuple-based multi-indexing solution that we started with.

We could have stored the same data using a simple `DataFrame` with index and column labels; in fact, Pandas is built with this equivalence in mind.

The `unstack()` method will quickly convert a multiply indexed `Series` into a conventionally indexed `DataFrame`:

In [129]:

```
pop_df = pop.unstack()
pop_df
```

Out[129]:

2000 | 2010 | |
---|---|---|

California | 33871648 | 37253956 |

New York | 18976457 | 19378102 |

Texas | 20851820 | 25145561 |

In [131]:

```
type(pop_df)
```

Out[131]:

pandas.core.frame.DataFrame

Naturally, the `stack()` method provides the opposite operation:

In [130]:

```
pop_df.stack()
```

Out[130]:

Seeing this, you might wonder why we would bother with hierarchical indexing at all.

The reason is simple: just as we were able to use multi-indexing to represent two-dimensional data within a one-dimensional `Series`, we can also use it to represent data of three or more dimensions in a `Series` or `DataFrame`.

Each extra level in a multi-index represents an extra dimension of data; taking advantage of this property gives us much more flexibility in the types of data we can represent.

Concretely, we might want to add another column of demographic data for each state at each year (say, population under 18); with a `MultiIndex` this is as easy as adding another column to the `DataFrame`:

In [132]:

```
pop_df = pd.DataFrame({'total': pop,
'under18': [9267089, 9284094,
4687374, 4318033,
5906301, 6879014]})
pop_df
```

Out[132]:

total | under18 | ||
---|---|---|---|

California | 2000 | 33871648 | 9267089 |

2010 | 37253956 | 9284094 | |

New York | 2000 | 18976457 | 4687374 |

2010 | 19378102 | 4318033 | |

Texas | 2000 | 20851820 | 5906301 |

2010 | 25145561 | 6879014 |

In addition, all the ufuncs work with hierarchical indices as well:

In [125]:

```
f_u18 = pop_df['under18'] / pop_df['total']
f_u18.unstack()
```

Out[125]:

2000 | 2010 | |
---|---|---|

California | 0.273594 | 0.249211 |

New York | 0.247010 | 0.222831 |

Texas | 0.283251 | 0.273568 |

The most straightforward way to construct a multiply indexed `Series` or `DataFrame` is to simply pass a list of two or more index arrays to the constructor:

In [126]:

```
df = pd.DataFrame(np.random.rand(4, 2),
index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
columns=['data1', 'data2'])
df
```

Out[126]:

data1 | data2 | ||
---|---|---|---|

a | 1 | 0.482545 | 0.352967 |

2 | 0.574280 | 0.063582 | |

b | 1 | 0.102271 | 0.569372 |

2 | 0.753026 | 0.194597 |

Similarly, if you pass a dictionary with appropriate tuples as keys, Pandas will automatically recognize this and use a `MultiIndex` by default:

In [127]:

```
data = {('California', 2000): 33871648,
('California', 2010): 37253956,
('Texas', 2000): 20851820,
('Texas', 2010): 25145561,
('New York', 2000): 18976457,
('New York', 2010): 19378102}
pd.Series(data)
```

Out[127]:

California 2000 33871648 2010 37253956 Texas 2000 20851820 2010 25145561 New York 2000 18976457 2010 19378102 dtype: int64

For more flexibility in how the index is constructed, you can instead use the class method constructors available in `pd.MultiIndex`.

For example, you can construct the `MultiIndex` from a simple list of arrays giving the index values within each level:

In [128]:

```
pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])
```

Out[128]:

MultiIndex([('a', 1), ('a', 2), ('b', 1), ('b', 2)], )

You can even construct it from a Cartesian product of single indices:

In [129]:

```
pd.MultiIndex.from_product([['a', 'b'], [1, 2]])
```

Out[129]:

MultiIndex([('a', 1), ('a', 2), ('b', 1), ('b', 2)], )

Sometimes it is convenient to name the levels of the `MultiIndex`.
This can be accomplished by passing the `names` argument to any of the above `MultiIndex` constructors, or by setting the `names` attribute of the index after the fact:

In [130]:

```
pop.index.names = ['state', 'year']
pop
```

Out[130]:

state year California 2000 33871648 2010 37253956 New York 2000 18976457 2010 19378102 Texas 2000 20851820 2010 25145561 dtype: int64
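Equivalently, the level names could have been supplied directly to a constructor; for example, a small sketch using `from_product`:

```
pd.MultiIndex.from_product([['California', 'New York', 'Texas'], [2000, 2010]],
                           names=['state', 'year'])
```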

In a `DataFrame`, the rows and columns are completely symmetric, and just as the rows can have multiple levels of indices, the columns can have multiple levels as well:

In [132]:

```
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]], names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']], names=['subject', 'type'])
data = np.round(np.random.randn(4, 6), 1) # mock some data
data[:, ::2] *= 10
data += 37
health_data = pd.DataFrame(data, index=index, columns=columns)
health_data # create the DataFrame
```

Out[132]:

subject | Bob | Guido | Sue | ||||
---|---|---|---|---|---|---|---|

type | HR | Temp | HR | Temp | HR | Temp | |

year | visit | ||||||

2013 | 1 | 48.0 | 38.1 | 19.0 | 38.4 | 52.0 | 38.8 |

2 | 34.0 | 38.0 | 37.0 | 36.9 | 31.0 | 37.6 | |

2014 | 1 | 41.0 | 37.0 | 52.0 | 38.9 | 38.0 | 37.4 |

2 | 47.0 | 36.9 | 46.0 | 36.4 | 42.0 | 36.6 |

This is fundamentally four-dimensional data, where the dimensions are the subject, the measurement type, the year, and the visit number; we can index the top-level column by the person's name and get a full `DataFrame` containing just that person's information:

In [133]:

```
health_data['Guido']
```

Out[133]:

type | HR | Temp | |
---|---|---|---|

year | visit | ||

2013 | 1 | 19.0 | 38.4 |

2 | 37.0 | 36.9 | |

2014 | 1 | 52.0 | 38.9 |

2 | 46.0 | 36.4 |

In [134]:

```
pop
```

Out[134]:

In [135]:

```
pop['California', 2000] # access single elements by indexing with multiple terms
```

Out[135]:

33871648

The `MultiIndex` also supports *partial indexing*, or indexing just one of the levels in the index.
The result is another `Series`, with the lower-level indices maintained:

In [136]:

```
pop['California']
```

Out[136]:

year 2000 33871648 2010 37253956 dtype: int64

Other types of indexing and selection could be based either on Boolean masks:

In [137]:

```
pop[pop > 22000000]
```

Out[137]:

state year California 2000 33871648 2010 37253956 Texas 2010 25145561 dtype: int64

or on fancy indexing:

In [138]:

```
pop[['California', 'Texas']]
```

Out[138]:

state year California 2000 33871648 2010 37253956 Texas 2000 20851820 2010 25145561 dtype: int64

A multiply indexed `DataFrame` behaves in a similar manner:

In [139]:

```
health_data
```

Out[139]:

subject | Bob | Guido | Sue | ||||
---|---|---|---|---|---|---|---|

type | HR | Temp | HR | Temp | HR | Temp | |

year | visit | ||||||

2013 | 1 | 48.0 | 38.1 | 19.0 | 38.4 | 52.0 | 38.8 |

2 | 34.0 | 38.0 | 37.0 | 36.9 | 31.0 | 37.6 | |

2014 | 1 | 41.0 | 37.0 | 52.0 | 38.9 | 38.0 | 37.4 |

2 | 47.0 | 36.9 | 46.0 | 36.4 | 42.0 | 36.6 |

Remember that columns are primary in a `DataFrame`, and the syntax used for multiply indexed `Series` applies to the columns.

We can recover Guido's heart rate data with a simple operation:

In [140]:

```
health_data['Guido', 'HR']
```

Out[140]:

year visit 2013 1 19.0 2 37.0 2014 1 52.0 2 46.0 Name: (Guido, HR), dtype: float64

Also, as with the single-index case, we can use the `loc`, `iloc`, and `ix` indexers:

In [141]:

```
health_data.iloc[:2, :2]
```

Out[141]:

subject | Bob | ||
---|---|---|---|

type | HR | Temp | |

year | visit | ||

2013 | 1 | 48.0 | 38.1 |

2 | 34.0 | 38.0 |

These indexers provide an array-like view of the underlying two-dimensional data, but each individual index in `loc` or `iloc` can be passed a tuple of multiple indices:

In [142]:

```
health_data.loc[:, ('Bob', 'HR')]
```

Out[142]:

year visit 2013 1 48.0 2 34.0 2014 1 41.0 2 47.0 Name: (Bob, HR), dtype: float64

One of the keys to working with multiply indexed data is knowing how to effectively transform the data.
There are a number of operations that will preserve all the information in the dataset, but rearrange it for the purposes of various computations.
We saw a brief example of this in the `stack()` and `unstack()` methods, but there are many more ways to finely control the rearrangement of data between hierarchical indices and columns.

Earlier, we briefly mentioned a caveat, but we should emphasize it more here: *many of the `MultiIndex` slicing operations will fail if the index is not sorted.*

We'll start by creating some simple multiply indexed data where the indices are *not lexicographically sorted*:

In [143]:

```
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
data = pd.Series(np.random.rand(6), index=index)
data.index.names = ['char', 'int']
data
```

Out[143]:

char int a 1 0.002105 2 0.280923 c 1 0.008604 2 0.631968 b 1 0.072270 2 0.273800 dtype: float64

In [144]:

```
try:
    data['a':'b']  # try to take a partial slice of this index
except KeyError as e:
    print(type(e))
    print(e)
```

This is the result of the `MultiIndex` not being sorted; in general, partial slices and other similar operations require the levels in the `MultiIndex` to be in sorted (i.e., lexicographical) order.

Pandas provides a number of convenience routines to perform this type of sorting; examples are the `sort_index()` and `sortlevel()` methods of the `DataFrame`.

In [145]:

```
data = data.sort_index()
data
```

Out[145]:

char int a 1 0.002105 2 0.280923 b 1 0.072270 2 0.273800 c 1 0.008604 2 0.631968 dtype: float64

With the index sorted in this way, partial slicing will work as expected:

In [146]:

```
data['a':'b']
```

Out[146]:

char int a 1 0.002105 2 0.280923 b 1 0.072270 2 0.273800 dtype: float64

As we saw briefly before, it is possible to convert a dataset from a stacked multi-index to a simple two-dimensional representation, optionally specifying the level to use:

In [147]:

```
pop.unstack(level=0)
```

Out[147]:

state | California | New York | Texas |
---|---|---|---|

year | |||

2000 | 33871648 | 18976457 | 20851820 |

2010 | 37253956 | 19378102 | 25145561 |

In [148]:

```
pop.unstack(level=1)
```

Out[148]:

year | 2000 | 2010 |
---|---|---|

state | ||

California | 33871648 | 37253956 |

New York | 18976457 | 19378102 |

Texas | 20851820 | 25145561 |

The opposite of `unstack()` is `stack()`, which here can be used to recover the original series:

In [149]:

```
pop.unstack().stack()
```

Out[149]:

Another way to rearrange hierarchical data is to turn the index labels into columns; this can be accomplished with the `reset_index` method.

Calling this on the population Series will result in a `DataFrame` with a *state* and *year* column holding the information that was formerly in the index.

In [150]:

```
pop_flat = pop.reset_index(name='population') # specify the name of the data for the column
pop_flat
```

Out[150]:

state | year | population | |
---|---|---|---|

0 | California | 2000 | 33871648 |

1 | California | 2010 | 37253956 |

2 | New York | 2000 | 18976457 |

3 | New York | 2010 | 19378102 |

4 | Texas | 2000 | 20851820 |

5 | Texas | 2010 | 25145561 |

Often when working with data in the real world, the raw input data looks like this and it's useful to build a `MultiIndex` from the column values.
This can be done with the `set_index` method of the `DataFrame`, which returns a multiply indexed `DataFrame`:

In [151]:

```
pop_flat.set_index(['state', 'year'])
```

Out[151]:

population | ||
---|---|---|

state | year | |

California | 2000 | 33871648 |

2010 | 37253956 | |

New York | 2000 | 18976457 |

2010 | 19378102 | |

Texas | 2000 | 20851820 |

2010 | 25145561 |

We've previously seen that Pandas has built-in data aggregation methods, such as `mean()`, `sum()`, and `max()`.
For hierarchically indexed data, these can be passed a `level` parameter that controls which subset of the data the aggregate is computed on.

In [152]:

```
health_data
```

Out[152]:

subject | Bob | Guido | Sue | ||||
---|---|---|---|---|---|---|---|

type | HR | Temp | HR | Temp | HR | Temp | |

year | visit | ||||||

2013 | 1 | 48.0 | 38.1 | 19.0 | 38.4 | 52.0 | 38.8 |

2 | 34.0 | 38.0 | 37.0 | 36.9 | 31.0 | 37.6 | |

2014 | 1 | 41.0 | 37.0 | 52.0 | 38.9 | 38.0 | 37.4 |

2 | 47.0 | 36.9 | 46.0 | 36.4 | 42.0 | 36.6 |

Perhaps we'd like to average out the measurements in the two visits each year. We can do this by naming the index level we'd like to explore, in this case the year:

In [153]:

```
data_mean = health_data.mean(level='year')
data_mean
```

Out[153]:

subject | Bob | Guido | Sue | |||
---|---|---|---|---|---|---|

type | HR | Temp | HR | Temp | HR | Temp |

year | ||||||

2013 | 41.0 | 38.05 | 28.0 | 37.65 | 41.5 | 38.2 |

2014 | 44.0 | 36.95 | 49.0 | 37.65 | 40.0 | 37.0 |

By further making use of the `axis` keyword, we can take the mean among levels on the columns as well:

In [154]:

```
data_mean.mean(axis=1, level='type')
```

Out[154]:

type | HR | Temp |
---|---|---|

year | ||

2013 | 36.833333 | 37.966667 |

2014 | 44.333333 | 37.200000 |

Some of the most interesting studies of data come from combining different data sources.
These operations can involve anything from very straightforward concatenation of two different datasets, to more complicated database-style joins and merges that correctly handle any overlaps between the datasets.
`Series` and `DataFrame`s are built with this type of operation in mind, and Pandas includes functions and methods that make this sort of data wrangling fast and straightforward.

Here we'll take a look at simple concatenation of `Series` and `DataFrame`s with the `pd.concat` function; later we'll dive into more sophisticated in-memory merges and joins implemented in Pandas.

For convenience, we'll define this function, which creates a `DataFrame` of a particular form that will be useful below:

In [176]:

```
def make_df(cols, ind):
    """Quickly make a DataFrame"""
    data = {c: [str(c) + str(i) for i in ind]
            for c in cols}
    return pd.DataFrame(data, ind)

# example DataFrame
make_df('ABC', range(3))
```

Out[176]:

A | B | C | |
---|---|---|---|

0 | A0 | B0 | C0 |

1 | A1 | B1 | C1 |

2 | A2 | B2 | C2 |

In addition, we'll create a quick class that allows us to display multiple `DataFrame`s side by side. The code makes use of the special `_repr_html_` method, which IPython uses to implement its rich object display:

In [178]:

```
class display(object):
    """Display HTML representation of multiple objects"""
    template = """<div style="float: left; padding: 10px;">
    <p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
    </div>"""

    def __init__(self, *args):
        self.args = args

    def _repr_html_(self):
        return '\n'.join(self.template.format(a, eval(a)._repr_html_())
                         for a in self.args)

    def __repr__(self):
        return '\n\n'.join(a + '\n' + repr(eval(a))
                           for a in self.args)
```

`pd.concat`

Pandas has a function, `pd.concat()`, which has a similar syntax to `np.concatenate` but contains a number of options that we'll discuss momentarily:

```
# Signature in Pandas v0.18
pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
          keys=None, levels=None, names=None, verify_integrity=False,
          copy=True)
```

`pd.concat()` can be used for a simple concatenation of `Series` or `DataFrame` objects, just as `np.concatenate()` can be used for simple concatenations of arrays:

In [180]:

```
ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])
pd.concat([ser1, ser2])
```

Out[180]:

1 A 2 B 3 C 4 D 5 E 6 F dtype: object

In [181]:

```
df1 = make_df('AB', [1, 2])
df2 = make_df('AB', [3, 4])
display('df1', 'df2', 'pd.concat([df1, df2])')
```

Out[181]:

df1

A | B | |
---|---|---|

1 | A1 | B1 |

2 | A2 | B2 |

df2

A | B | |
---|---|---|

3 | A3 | B3 |

4 | A4 | B4 |

pd.concat([df1, df2])

A | B | |
---|---|---|

1 | A1 | B1 |

2 | A2 | B2 |

3 | A3 | B3 |

4 | A4 | B4 |

By default, the concatenation takes place row-wise within the `DataFrame` (i.e., `axis=0`).
Like `np.concatenate`, `pd.concat` allows specification of an axis along which concatenation will take place:

In [183]:

```
df3 = make_df('AB', [0, 1])
df4 = make_df('CD', [0, 1])
display('df3', 'df4', "pd.concat([df3, df4], axis=1)")
```

Out[183]:

df3

A | B | |
---|---|---|

0 | A0 | B0 |

1 | A1 | B1 |

df4

C | D | |
---|---|---|

0 | C0 | D0 |

1 | C1 | D1 |

pd.concat([df3, df4], axis=1)

A | B | C | D | |
---|---|---|---|---|

0 | A0 | B0 | C0 | D0 |

1 | A1 | B1 | C1 | D1 |

One important difference between `np.concatenate` and `pd.concat` is that Pandas concatenation *preserves indices*, even if the result will have duplicate indices:

In [184]:

```
x = make_df('AB', [0, 1])
y = make_df('AB', [2, 3])
y.index = x.index # make duplicate indices!
display('x', 'y', 'pd.concat([x, y])')
```

Out[184]:

x

A | B | |
---|---|---|

0 | A0 | B0 |

1 | A1 | B1 |

y

A | B | |
---|---|---|

0 | A2 | B2 |

1 | A3 | B3 |

pd.concat([x, y])

A | B | |
---|---|---|

0 | A0 | B0 |

1 | A1 | B1 |

0 | A2 | B2 |

1 | A3 | B3 |

Notice the repeated indices in the result.
While this is valid within `DataFrame`s, the outcome is often undesirable, and `pd.concat()` gives us a few ways to handle it. One is the `verify_integrity` flag: with this set to `True`, the concatenation will raise an exception if there are duplicate indices:

In [185]:

```
try:
pd.concat([x, y], verify_integrity=True)
except ValueError as e:
print("ValueError:", e)
```

ValueError: Indexes have overlapping values: Int64Index([0, 1], dtype='int64')

Sometimes the index itself does not matter, and you would prefer it to simply be ignored.
This option can be specified using the `ignore_index` flag.
With this set to `True`, the concatenation will create a new integer index for the result:

In [186]:

```
display('x', 'y', 'pd.concat([x, y], ignore_index=True)')
```

Out[186]:

x

A | B | |
---|---|---|

0 | A0 | B0 |

1 | A1 | B1 |

y

A | B | |
---|---|---|

0 | A2 | B2 |

1 | A3 | B3 |

pd.concat([x, y], ignore_index=True)

A | B | |
---|---|---|

0 | A0 | B0 |

1 | A1 | B1 |

2 | A2 | B2 |

3 | A3 | B3 |

Another option is to use the `keys` option to specify a label for the data sources; the result will be a hierarchically indexed `DataFrame` containing the data:

In [187]:

```
display('x', 'y', "pd.concat([x, y], keys=['x', 'y'])")
```

Out[187]:

x

A | B | |
---|---|---|

0 | A0 | B0 |

1 | A1 | B1 |

y

A | B | |
---|---|---|

0 | A2 | B2 |

1 | A3 | B3 |

pd.concat([x, y], keys=['x', 'y'])

A | B | ||
---|---|---|---|

x | 0 | A0 | B0 |

1 | A1 | B1 | |

y | 0 | A2 | B2 |

1 | A3 | B3 |

In practice, data from different sources might have different sets of column names, and `pd.concat` offers several options in this case.
Consider the concatenation of the following two `DataFrame`s, which have some (but not all!) columns in common:

In [188]:

```
df5 = make_df('ABC', [1, 2])
df6 = make_df('BCD', [3, 4])
display('df5', 'df6', 'pd.concat([df5, df6])')
```

Out[188]:

df5

A | B | C | |
---|---|---|---|

1 | A1 | B1 | C1 |

2 | A2 | B2 | C2 |

df6

B | C | D | |
---|---|---|---|

3 | B3 | C3 | D3 |

4 | B4 | C4 | D4 |

pd.concat([df5, df6])

A | B | C | D | |
---|---|---|---|---|

1 | A1 | B1 | C1 | NaN |

2 | A2 | B2 | C2 | NaN |

3 | NaN | B3 | C3 | D3 |

4 | NaN | B4 | C4 | D4 |

By default, the join is a union of the input columns (`join='outer'`), but we can change this to an intersection of the columns using `join='inner'`:
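For example, with `df5` and `df6` from above, only the shared columns `B` and `C` remain in the result:

```
display('df5', 'df6', "pd.concat([df5, df6], join='inner')")
```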

Another option is to directly specify the index of the remaining columns using the `join_axes` argument, which takes a list of index objects.
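A brief sketch of that option (note that `join_axes` belongs to the older v0.18-era signature shown above; it has since been removed from Pandas, where reindexing the result achieves the same effect):

```
# with the older signature above: keep only the columns of df5
# pd.concat([df5, df6], join_axes=[df5.columns])

# equivalent in modern Pandas: concatenate, then reindex the columns
pd.concat([df5, df6]).reindex(columns=df5.columns)
```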

In [190]:

```
display('df5', 'df6', "pd.concat([df5, df6])")
```

Out[190]:

df5

A | B | C | |
---|---|---|---|

1 | A1 | B1 | C1 |

2 | A2 | B2 | C2 |

df6

B | C | D | |
---|---|---|---|

3 | B3 | C3 | D3 |

4 | B4 | C4 | D4 |

pd.concat([df5, df6])

A | B | C | D | |
---|---|---|---|---|

1 | A1 | B1 | C1 | NaN |

2 | A2 | B2 | C2 | NaN |

3 | NaN | B3 | C3 | D3 |

4 | NaN | B4 | C4 | D4 |

The `append()` method

Because direct array concatenation is so common, `Series` and `DataFrame` objects have an `append` method that can accomplish the same thing in fewer keystrokes.
For example, rather than calling `pd.concat([df1, df2])`, you can simply call `df1.append(df2)`:

In [192]:

```
display('df1', 'df2', 'df1.append(df2)')
```

Out[192]:

df1

A | B | |
---|---|---|

1 | A1 | B1 |

2 | A2 | B2 |

df2

A | B | |
---|---|---|

3 | A3 | B3 |

4 | A4 | B4 |

df1.append(df2)

A | B | |
---|---|---|

1 | A1 | B1 |

2 | A2 | B2 |

3 | A3 | B3 |

4 | A4 | B4 |

Keep in mind that unlike the `append()` and `extend()` methods of Python lists, the `append()` method in Pandas does not modify the original object; instead it creates a new object with the combined data.

It also is not a very efficient method, because it involves creation of a new index *and* data buffer.
Thus, if you plan to do multiple `append` operations, it is generally better to build a list of `DataFrame`s and pass them all at once to the `concat()` function.
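For example, a minimal sketch of that pattern (with a small hypothetical list of pieces built from `make_df`):

```
pieces = [make_df('AB', [1, 2]), make_df('AB', [3, 4]), make_df('AB', [5, 6])]
pd.concat(pieces)  # one concatenation instead of repeated append() calls
```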

One essential feature offered by Pandas is its high-performance, in-memory join and merge operations.
If you have ever worked with databases, you should be familiar with this type of data interaction.
The main interface for this is the `pd.merge` function, and we'll see a few examples of how this can work in practice.

For convenience, we will start by redefining the `display()` functionality:

In [133]:

```
class display(object):
    """Display HTML representation of multiple objects"""
    template = """<div style="float: left; padding: 10px;">
    <p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
    </div>"""

    def __init__(self, *args):
        self.args = args

    def _repr_html_(self):
        return '\n'.join(self.template.format(a, eval(a)._repr_html_())
                         for a in self.args)

    def __repr__(self):
        return '\n\n'.join(a + '\n' + repr(eval(a))
                           for a in self.args)
```

The behavior implemented in `pd.merge()` is a subset of what is known as *relational algebra*, which is a formal set of rules for manipulating relational data, and forms the conceptual foundation of operations available in most databases.
The strength of the relational algebra approach is that it proposes several primitive operations, which become the building blocks of more complicated operations on any dataset.
With this lexicon of fundamental operations implemented efficiently in a database or other program, a wide range of fairly complicated composite operations can be performed.

Pandas implements several of these fundamental building blocks in the `pd.merge()` function and the related `join()` method of `Series` and `DataFrame`s.

The `pd.merge()` function implements a number of types of joins: the *one-to-one*, *many-to-one*, and *many-to-many* joins.
All three types of joins are accessed via an identical call to the `pd.merge()` interface; the type of join performed depends on the form of the input data.

Perhaps the simplest type of merge expression is the one-to-one join, which is in many ways very similar to the column-wise concatenation we have already seen.
As a concrete example, consider the following two `DataFrame`s, which contain information on several employees in a company:

In [134]:

```
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})
display('df1', 'df2')
```

Out[134]:

df1

employee | group | |
---|---|---|

0 | Bob | Accounting |

1 | Jake | Engineering |

2 | Lisa | Engineering |

3 | Sue | HR |

df2

employee | hire_date | |
---|---|---|

0 | Lisa | 2004 |

1 | Bob | 2008 |

2 | Jake | 2012 |

3 | Sue | 2014 |

To combine this information into a single `DataFrame`

, we can use the `pd.merge()`

function:

In [135]:

```
df3 = pd.merge(df1, df2)
df3
```

Out[135]:

employee | group | hire_date | |
---|---|---|---|

0 | Bob | Accounting | 2008 |

1 | Jake | Engineering | 2012 |

2 | Lisa | Engineering | 2004 |

3 | Sue | HR | 2014 |

The `pd.merge()`

function recognizes that each `DataFrame`

has an "employee" column, and automatically joins using this column as a key.
The result of the merge is a new `DataFrame`

that combines the information from the two inputs.
Notice that the order of entries in each column is not necessarily maintained: in this case, the order of the "employee" column differs between `df1`

and `df2`

, and the `pd.merge()`

function correctly accounts for this.
Additionally, keep in mind that the merge in general discards the index, except in the special case of merges by index (see the `left_index`

and `right_index`

keywords, discussed momentarily).
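As a quick illustrative check (not from the original text), the merged result above carries a fresh integer index rather than either input's index:

```
# The default merge builds a new RangeIndex for its result
pd.merge(df1, df2).index    # RangeIndex(start=0, stop=4, step=1)
```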

Many-to-one joins are joins in which one of the two key columns contains duplicate entries.
For the many-to-one case, the resulting `DataFrame`

will preserve those duplicate entries as appropriate:

In [136]:

```
df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
                    'supervisor': ['Carly', 'Guido', 'Steve']})
display('df3', 'df4', 'pd.merge(df3, df4)')
```

Out[136]:

df3

employee | group | hire_date | |
---|---|---|---|

0 | Bob | Accounting | 2008 |

1 | Jake | Engineering | 2012 |

2 | Lisa | Engineering | 2004 |

3 | Sue | HR | 2014 |

df4

group | supervisor | |
---|---|---|

0 | Accounting | Carly |

1 | Engineering | Guido |

2 | HR | Steve |

pd.merge(df3, df4)

employee | group | hire_date | supervisor | |
---|---|---|---|---|

0 | Bob | Accounting | 2008 | Carly |

1 | Jake | Engineering | 2012 | Guido |

2 | Lisa | Engineering | 2004 | Guido |

3 | Sue | HR | 2014 | Steve |

Many-to-many joins are a bit confusing conceptually, but are nevertheless well defined.
If the key column in both the left and right arrays contains duplicates, then the result is a many-to-many merge.
Consider the following, where we have a `DataFrame`

showing one or more skills associated with a particular group. By performing a many-to-many join, we can recover the skills associated with any individual person:

In [137]:

```
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting',
                              'Engineering', 'Engineering', 'HR', 'HR'],
                    'skills': ['math', 'spreadsheets', 'coding', 'linux',
                               'spreadsheets', 'organization']})
display('df1', 'df5', "pd.merge(df1, df5)")
```

Out[137]:

df1

employee | group | |
---|---|---|

0 | Bob | Accounting |

1 | Jake | Engineering |

2 | Lisa | Engineering |

3 | Sue | HR |

df5

group | skills | |
---|---|---|

0 | Accounting | math |

1 | Accounting | spreadsheets |

2 | Engineering | coding |

3 | Engineering | linux |

4 | HR | spreadsheets |

5 | HR | organization |

pd.merge(df1, df5)

employee | group | skills | |
---|---|---|---|

0 | Bob | Accounting | math |

1 | Bob | Accounting | spreadsheets |

2 | Jake | Engineering | coding |

3 | Jake | Engineering | linux |

4 | Lisa | Engineering | coding |

5 | Lisa | Engineering | linux |

6 | Sue | HR | spreadsheets |

7 | Sue | HR | organization |

We've already seen the default behavior of `pd.merge()`

: it looks for one or more matching column names between the two inputs, and uses this as the key.
However, often the column names will not match so nicely, and `pd.merge()`

provides a variety of options for handling this.

`on`

keyword¶Most simply, you can explicitly specify the name of the key column using the `on`

keyword, which takes a column name or a list of column names:

In [198]:

```
display('df1', 'df2', "pd.merge(df1, df2, on='employee')")
```

Out[198]:

df1

employee | group | |
---|---|---|

0 | Bob | Accounting |

1 | Jake | Engineering |

2 | Lisa | Engineering |

3 | Sue | HR |

df2

employee | hire_date | |
---|---|---|

0 | Lisa | 2004 |

1 | Bob | 2008 |

2 | Jake | 2012 |

3 | Sue | 2014 |

pd.merge(df1, df2, on='employee')

employee | group | hire_date | |
---|---|---|---|

0 | Bob | Accounting | 2008 |

1 | Jake | Engineering | 2012 |

2 | Lisa | Engineering | 2004 |

3 | Sue | HR | 2014 |

`left_on`

and `right_on`

keywords¶At times you may wish to merge two datasets with different column names; for example, we may have a dataset in which the employee name is labeled as "name" rather than "employee".
In this case, we can use the `left_on`

and `right_on`

keywords to specify the two column names:

In [138]:

```
df3 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'salary': [70000, 80000, 120000, 90000]})
display('df1', 'df3', 'pd.merge(df1, df3, left_on="employee", right_on="name")')
```

Out[138]:

df1

employee | group | |
---|---|---|

0 | Bob | Accounting |

1 | Jake | Engineering |

2 | Lisa | Engineering |

3 | Sue | HR |

df3

name | salary | |
---|---|---|

0 | Bob | 70000 |

1 | Jake | 80000 |

2 | Lisa | 120000 |

3 | Sue | 90000 |

pd.merge(df1, df3, left_on="employee", right_on="name")

employee | group | name | salary | |
---|---|---|---|---|

0 | Bob | Accounting | Bob | 70000 |

1 | Jake | Engineering | Jake | 80000 |

2 | Lisa | Engineering | Lisa | 120000 |

3 | Sue | HR | Sue | 90000 |

The result has a redundant column that we can drop if desired; for example, by using the `drop()` method of `DataFrame`s:

In [200]:

```
pd.merge(df1, df3, left_on="employee", right_on="name").drop('name', axis=1)
```

Out[200]:

employee | group | salary | |
---|---|---|---|

0 | Bob | Accounting | 70000 |

1 | Jake | Engineering | 80000 |

2 | Lisa | Engineering | 120000 |

3 | Sue | HR | 90000 |

`left_index`

and `right_index`

keywords¶Sometimes, rather than merging on a column, you would instead like to merge on an index. For example, your data might look like this:

In [142]:

```
df1.set_index?
```

In [147]:

```
df1a = df1.set_index(['group', 'employee'])
df1a
```

Out[147]:

group | employee |
---|---|

Accounting | Bob |

Engineering | Jake |

| Lisa |

HR | Sue |

In [139]:

```
df1a = df1.set_index(['employee', 'group'])
df2a = df2.set_index('employee')
display('df1a', 'df2a')
```

Out[139]:

df1a

group | |
---|---|

employee | |

Bob | Accounting |

Jake | Engineering |

Lisa | Engineering |

Sue | HR |

df2a

hire_date | |
---|---|

employee | |

Lisa | 2004 |

Bob | 2008 |

Jake | 2012 |

Sue | 2014 |

You can use the index as the key for merging by specifying the `left_index`

and/or `right_index`

flags in `pd.merge()`

:

In [140]:

```
display('df1a', 'df2a', "pd.merge(df1a, df2a, left_index=True, right_index=True)")
```

Out[140]:

df1a

group | |
---|---|

employee | |

Bob | Accounting |

Jake | Engineering |

Lisa | Engineering |

Sue | HR |

df2a

hire_date | |
---|---|

employee | |

Lisa | 2004 |

Bob | 2008 |

Jake | 2012 |

Sue | 2014 |

pd.merge(df1a, df2a, left_index=True, right_index=True)

group | hire_date | |
---|---|---|

employee | ||

Bob | Accounting | 2008 |

Jake | Engineering | 2012 |

Lisa | Engineering | 2004 |

Sue | HR | 2014 |

In [141]:

```
df1a, df2a, pd.merge(df1a, df2a, left_index=True, right_index=True)
```

Out[141]:

( group employee Bob Accounting Jake Engineering Lisa Engineering Sue HR, hire_date employee Lisa 2004 Bob 2008 Jake 2012 Sue 2014, group hire_date employee Bob Accounting 2008 Jake Engineering 2012 Lisa Engineering 2004 Sue HR 2014)

For convenience, `DataFrame`

s implement the `join()`

method, which performs a merge that defaults to joining on indices:

In [203]:

```
display('df1a', 'df2a', 'df1a.join(df2a)')
```

Out[203]:

df1a

group | |
---|---|

employee | |

Bob | Accounting |

Jake | Engineering |

Lisa | Engineering |

Sue | HR |

df2a

hire_date | |
---|---|

employee | |

Lisa | 2004 |

Bob | 2008 |

Jake | 2012 |

Sue | 2014 |

df1a.join(df2a)

group | hire_date | |
---|---|---|

employee | ||

Bob | Accounting | 2008 |

Jake | Engineering | 2012 |

Lisa | Engineering | 2004 |

Sue | HR | 2014 |

If you'd like to mix indices and columns, you can combine `left_index`

with `right_on`

or `left_on`

with `right_index`

to get the desired behavior:

In [204]:

```
display('df1a', 'df3', "pd.merge(df1a, df3, left_index=True, right_on='name')")
```

Out[204]:

df1a

group | |
---|---|

employee | |

Bob | Accounting |

Jake | Engineering |

Lisa | Engineering |

Sue | HR |

df3

name | salary | |
---|---|---|

0 | Bob | 70000 |

1 | Jake | 80000 |

2 | Lisa | 120000 |

3 | Sue | 90000 |

pd.merge(df1a, df3, left_index=True, right_on='name')

group | name | salary | |
---|---|---|---|

0 | Accounting | Bob | 70000 |

1 | Engineering | Jake | 80000 |

2 | Engineering | Lisa | 120000 |

3 | HR | Sue | 90000 |

We have glossed over one important consideration in performing a join: the type of set arithmetic used in the join. This comes up when a value appears in one key column but not the other:

In [149]:

```
df6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'], 'food': ['fish', 'beans', 'bread']},
                   columns=['name', 'food'])
df7 = pd.DataFrame({'name': ['Mary', 'Joseph'], 'drink': ['wine', 'beer']},
                   columns=['name', 'drink'])
display('df6', 'df7', 'pd.merge(df6, df7)')
```

Out[149]:

df6

name | food | |
---|---|---|

0 | Peter | fish |

1 | Paul | beans |

2 | Mary | bread |

df7

name | drink | |
---|---|---|

0 | Mary | wine |

1 | Joseph | beer |

pd.merge(df6, df7)

name | food | drink | |
---|---|---|---|

0 | Mary | bread | wine |

Here we have merged two datasets that have only a single "name" entry in common: Mary.
By default, the result contains the *intersection* of the two sets of inputs; this is what is known as an *inner join*.
We can specify this explicitly using the `how`

keyword, which defaults to `"inner"`

:

In [150]:

```
pd.merge(df6, df7, how='inner')
```

Out[150]:

name | food | drink | |
---|---|---|---|

0 | Mary | bread | wine |

Other options for the `how`

keyword are `'outer'`

, `'left'`

, and `'right'`

.
An *outer join* returns a join over the union of the input columns, and fills in all missing values with NAs:

In [151]:

```
display('df6', 'df7', "pd.merge(df6, df7, how='outer')")
```

Out[151]:

df6

name | food | |
---|---|---|

0 | Peter | fish |

1 | Paul | beans |

2 | Mary | bread |

df7

name | drink | |
---|---|---|

0 | Mary | wine |

1 | Joseph | beer |

pd.merge(df6, df7, how='outer')

name | food | drink | |
---|---|---|---|

0 | Peter | fish | NaN |

1 | Paul | beans | NaN |

2 | Mary | bread | wine |

3 | Joseph | NaN | beer |

The *left join* and *right join* return joins over the left entries and right entries, respectively:

In [152]:

```
display('df6', 'df7', "pd.merge(df6, df7, how='left')")
```

Out[152]:

df6

name | food | |
---|---|---|

0 | Peter | fish |

1 | Paul | beans |

2 | Mary | bread |

df7

name | drink | |
---|---|---|

0 | Mary | wine |

1 | Joseph | beer |

pd.merge(df6, df7, how='left')

name | food | drink | |
---|---|---|---|

0 | Peter | fish | NaN |

1 | Paul | beans | NaN |

2 | Mary | bread | wine |

In [153]:

```
display('df6', 'df7', "pd.merge(df6, df7, how='right')")
```

Out[153]:

df6

name | food | |
---|---|---|

0 | Peter | fish |

1 | Paul | beans |

2 | Mary | bread |

df7

name | drink | |
---|---|---|

0 | Mary | wine |

1 | Joseph | beer |

pd.merge(df6, df7, how='right')

name | food | drink | |
---|---|---|---|

0 | Mary | bread | wine |

1 | Joseph | NaN | beer |

In [154]:

```
pd.merge?
```

`suffixes`

keyword¶Finally, you may end up in a case where your two input `DataFrame`

s have conflicting column names:

In [210]:

```
df8 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'rank': [1, 2, 3, 4]})
df9 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'rank': [3, 1, 4, 2]})
display('df8', 'df9', 'pd.merge(df8, df9, on="name")')
```

Out[210]:

df8

name | rank | |
---|---|---|

0 | Bob | 1 |

1 | Jake | 2 |

2 | Lisa | 3 |

3 | Sue | 4 |

df9

name | rank | |
---|---|---|

0 | Bob | 3 |

1 | Jake | 1 |

2 | Lisa | 4 |

3 | Sue | 2 |

pd.merge(df8, df9, on="name")

name | rank_x | rank_y | |
---|---|---|---|

0 | Bob | 1 | 3 |

1 | Jake | 2 | 1 |

2 | Lisa | 3 | 4 |

3 | Sue | 4 | 2 |

Because the output would have two conflicting column names, the merge function automatically appends a suffix `_x`

or `_y`

to make the output columns unique.
If these defaults are inappropriate, it is possible to specify a custom suffix using the `suffixes`

keyword:

In [211]:

```
display('df8', 'df9', 'pd.merge(df8, df9, on="name", suffixes=["_L", "_R"])')
```

Out[211]:

df8

name | rank | |
---|---|---|

0 | Bob | 1 |

1 | Jake | 2 |

2 | Lisa | 3 |

3 | Sue | 4 |

df9

name | rank | |
---|---|---|

0 | Bob | 3 |

1 | Jake | 1 |

2 | Lisa | 4 |

3 | Sue | 2 |

pd.merge(df8, df9, on="name", suffixes=["_L", "_R"])

name | rank_L | rank_R | |
---|---|---|---|

0 | Bob | 1 | 3 |

1 | Jake | 2 | 1 |

2 | Lisa | 3 | 4 |

3 | Sue | 4 | 2 |

Merge and join operations come up most often when combining data from different sources. Here we will consider an example of some data about US states and their populations. The data files can be found at http://github.com/jakevdp/data-USstates/:

In [221]:

```
pop = pd.read_csv('data/state-population.csv')
areas = pd.read_csv('data/state-areas.csv')
abbrevs = pd.read_csv('data/state-abbrevs.csv')
display('pop.head()', 'areas.head()', 'abbrevs.head()')
```

Out[221]:

pop.head()

state/region | ages | year | population | |
---|---|---|---|---|

0 | AL | under18 | 2012 | 1117489.0 |

1 | AL | total | 2012 | 4817528.0 |

2 | AL | under18 | 2010 | 1130966.0 |

3 | AL | total | 2010 | 4785570.0 |

4 | AL | under18 | 2011 | 1125763.0 |

areas.head()

state | area (sq. mi) | |
---|---|---|

0 | Alabama | 52423 |

1 | Alaska | 656425 |

2 | Arizona | 114006 |

3 | Arkansas | 53182 |

4 | California | 163707 |

abbrevs.head()

state | abbreviation | |
---|---|---|

0 | Alabama | AL |

1 | Alaska | AK |

2 | Arizona | AZ |

3 | Arkansas | AR |

4 | California | CA |

Given this information, say we want to compute a relatively straightforward result: rank US states and territories by their 2010 population density. We clearly have the data here to find this result, but we'll have to combine the datasets to do so.

We'll start with a many-to-one merge that will give us the full state name within the population `DataFrame`

.
We want to merge based on the `state/region`

column of `pop`

, and the `abbreviation`

column of `abbrevs`

.
We'll use `how='outer'`

to make sure no data is thrown away due to mismatched labels.

In [222]:

```
merged = pd.merge(pop, abbrevs, how='outer',
                  left_on='state/region', right_on='abbreviation')
merged = merged.drop('abbreviation', axis=1)  # drop duplicate info
merged.head()
```

Out[222]:

state/region | ages | year | population | state | |
---|---|---|---|---|---|

0 | AL | under18 | 2012 | 1117489.0 | Alabama |

1 | AL | total | 2012 | 4817528.0 | Alabama |

2 | AL | under18 | 2010 | 1130966.0 | Alabama |

3 | AL | total | 2010 | 4785570.0 | Alabama |

4 | AL | under18 | 2011 | 1125763.0 | Alabama |

Let's double-check whether there were any mismatches here, which we can do by looking for rows with nulls:

In [223]:

```
merged.isnull().any()
```

Out[223]:

state/region False ages False year False population True state True dtype: bool

Some of the `population`

info is null:

In [224]:

```
merged[merged['population'].isnull()].head()
```

Out[224]:

state/region | ages | year | population | state | |
---|---|---|---|---|---|

2448 | PR | under18 | 1990 | NaN | NaN |

2449 | PR | total | 1990 | NaN | NaN |

2450 | PR | total | 1991 | NaN | NaN |

2451 | PR | under18 | 1991 | NaN | NaN |

2452 | PR | total | 1993 | NaN | NaN |

It appears that all the null population values are from Puerto Rico prior to the year 2000; this is likely due to this data not being available from the original source.

More importantly, we also see that some of the new `state` entries are null, which means that there was no corresponding entry in the `abbrevs` key!
Let's figure out which regions lack this match:

In [225]:

```
merged.loc[merged['state'].isnull(), 'state/region'].unique()
```

Out[225]:

array(['PR', 'USA'], dtype=object)

We can quickly infer the issue: our population data includes entries for Puerto Rico (PR) and the United States as a whole (USA), while these entries do not appear in the state abbreviation key. We can fix these quickly by filling in appropriate entries:

In [226]:

```
merged.loc[merged['state/region'] == 'PR', 'state'] = 'Puerto Rico'
merged.loc[merged['state/region'] == 'USA', 'state'] = 'United States'
merged.isnull().any()
```

Out[226]:

state/region False ages False year False population True state False dtype: bool

No more nulls in the `state`

column: we're all set!

Now we can merge the result with the area data using a similar procedure.
Examining our results, we will want to join on the `state`

column in both:

In [227]:

```
final = pd.merge(merged, areas, on='state', how='left')
final.head()
```

Out[227]:

state/region | ages | year | population | state | area (sq. mi) | |
---|---|---|---|---|---|---|

0 | AL | under18 | 2012 | 1117489.0 | Alabama | 52423.0 |

1 | AL | total | 2012 | 4817528.0 | Alabama | 52423.0 |

2 | AL | under18 | 2010 | 1130966.0 | Alabama | 52423.0 |

3 | AL | total | 2010 | 4785570.0 | Alabama | 52423.0 |

4 | AL | under18 | 2011 | 1125763.0 | Alabama | 52423.0 |

Again, let's check for nulls to see if there were any mismatches:

In [228]:

```
final.isnull().any()
```

Out[228]:

state/region False ages False year False population True state False area (sq. mi) True dtype: bool

There are nulls in the `area`

column; we can take a look to see which regions were ignored here:

In [230]:

```
final['state'][final['area (sq. mi)'].isnull()].unique()
```

Out[230]:

array(['United States'], dtype=object)

We see that our `areas`

`DataFrame`

does not contain the area of the United States as a whole.
We could insert the appropriate value (using the sum of all state areas, for instance), but in this case we'll just drop the null values because the population density of the entire United States is not relevant to our current discussion:

In [231]:

```
final.dropna(inplace=True)
final.head()
```

Out[231]:

state/region | ages | year | population | state | area (sq. mi) | |
---|---|---|---|---|---|---|

0 | AL | under18 | 2012 | 1117489.0 | Alabama | 52423.0 |

1 | AL | total | 2012 | 4817528.0 | Alabama | 52423.0 |

2 | AL | under18 | 2010 | 1130966.0 | Alabama | 52423.0 |

3 | AL | total | 2010 | 4785570.0 | Alabama | 52423.0 |

4 | AL | under18 | 2011 | 1125763.0 | Alabama | 52423.0 |
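Had we instead wanted to keep the national totals, a minimal sketch of filling in the missing value before dropping (the assignment below is illustrative, not part of the original analysis) might look like this:

```
# Fill the missing national area with the sum of all state areas
us_area = areas['area (sq. mi)'].sum()
final.loc[final['state'] == 'United States', 'area (sq. mi)'] = us_area
```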

Now we have all the data we need. To answer the question of interest, let's first select the portion of the data corresponding to the year 2010 and the total population.
We'll use the `query()` method to do this quickly:

In [232]:

```
data2010 = final.query("year == 2010 & ages == 'total'")
data2010.head()
```

Out[232]:

state/region | ages | year | population | state | area (sq. mi) | |
---|---|---|---|---|---|---|

3 | AL | total | 2010 | 4785570.0 | Alabama | 52423.0 |

91 | AK | total | 2010 | 713868.0 | Alaska | 656425.0 |

101 | AZ | total | 2010 | 6408790.0 | Arizona | 114006.0 |

189 | AR | total | 2010 | 2922280.0 | Arkansas | 53182.0 |

197 | CA | total | 2010 | 37333601.0 | California | 163707.0 |

Now let's compute the population density and display it in order. We'll start by re-indexing our data on the state, and then compute the result:

In [233]:

```
data2010.set_index('state', inplace=True)
density = data2010['population'] / data2010['area (sq. mi)']
```

In [234]:

```
density.sort_values(ascending=False, inplace=True)
density.head()
```

Out[234]:

state District of Columbia 8898.897059 Puerto Rico 1058.665149 New Jersey 1009.253268 Rhode Island 681.339159 Connecticut 645.600649 dtype: float64

The result is a ranking of US states plus Washington, DC, and Puerto Rico in order of their 2010 population density, in residents per square mile. We can see that by far the densest region in this dataset is Washington, DC (i.e., the District of Columbia); among states, the densest is New Jersey.

We can also check the end of the list:

In [235]:

```
density.tail()
```

Out[235]:

state South Dakota 10.583512 North Dakota 9.537565 Montana 6.736171 Wyoming 5.768079 Alaska 1.087509 dtype: float64

We see that the least dense state, by far, is Alaska, averaging slightly over one resident per square mile.

This type of messy data merging is a common task when trying to answer questions using real-world data sources.

An essential piece of analysis of large data is efficient summarization: computing aggregations like `sum()`

, `mean()`

, `median()`

, `min()`

, and `max()`

, in which a single number gives insight into the nature of a potentially large dataset.
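As a tiny illustration (the values below are made up), such aggregates reduce a whole column of numbers to a single summary value:

```
ser = pd.Series([1.0, 2.0, 3.0, 4.0])
ser.sum(), ser.mean()   # -> (10.0, 2.5)
```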

We'll use the same `display` helper function as usual:

In [90]:

```
class display(object):
    """Display HTML representation of multiple objects"""
    template = """<div style="float: left; padding: 10px;">
    <p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
    </div>"""
    def __init__(self, *args):
        self.args = args
    def _repr_html_(self):
        return '\n'.join(self.template.format(a, eval(a)._repr_html_())
                         for a in self.args)
    def __repr__(self):
        return '\n\n'.join(a + '\n' + repr(eval(a))
                           for a in self.args)
```