## What is Pandas ?

Pandas - short for PANel DAta - is a special python library that can be used for data handling, manipulation, statistical analysis and much more. In simple words Pandas can be described as Excel for Python geeks.

It's statistics functions are very powerful and useful for generating insights from data. Pandas has large number of methods to compute descriptive statistics and other related operations on Series and DataFrames.

## Pandas Series:

Pandas Series is similar to a NumPy array (in fact it is built on top of the NumPy array object). The differentiates is that a Series can have axis labels, meaning it can be indexed by a label, instead of just a number location. It also doesn't limited numeric data, it can hold any arbitrary Python Object. Lets create initial data.

```
import numpy as np
import pandas as pd
labels = ['a','b','c']
my_list = [10,20,30]
arr = np.array([10,20,30])
d = {'a':10,'b':20,'c':30}
pd.Series(data=my_list,index=labels)
```

OUT:

a 10

b 20

c 30

dtype: int64

A pandas Series can hold a variety of data types.

```
pd.Series(data=labels)
```

OUT:

0 a

1 b

2 c

dtype: object

The key to using a Series is understanding its index.

```
ser1 = pd.Series([1,2,3,4],index = ['USA', 'Germany','USSR', 'Japan'])
ser1
```

OUT:

USA 1

Germany 2

USSR 3

Japan 4

dtype: int64

```
ser2 = pd.Series([1,2,5,4],index = ['USA', 'Germany','Italy', 'Japan'])
ser2
```

OUT:

USA 1

Germany 2

Italy 5

Japan 4

dtype: int64

```
ser1['USA']
```

OUT: 1

Simple operations can be done based on index:

```
ser1 + ser2
```

OUT:

Germany 4.0

Italy NaN

Japan 8.0

USA 2.0

USSR NaN

dtype: float64

## Pandas DataFrames:

DataFrames are the core of pandas. We can define a DataFrame as a bunch of Series objects put together to share the same index ( each column is a series data type in fact ). Lets create initial data.

```
import pandas as pd
import numpy as np
from numpy.random import randn
np.random.seed(101)
df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())
df['W']
```

OUT:

A 2.706850

B 0.651118

C -2.018168

D 0.188695

E 0.190794

Name: W, dtype: float64

SQL Syntax.

```
df.W
```

OUT:

A 2.706850

B 0.651118

C -2.018168

D 0.188695

E 0.190794

Name: W, dtype: float64

You can pass a list of column names.

```
df[['W','Z']]
```

DataFrame Columns are just Series.

```
type(df['W'])
```

OUT: pandas.core.series.Series

Removing/Deleting Columns.

```
df.drop('new',axis=1,inplace=True)
```

Removing/Deleting rows.

```
df.drop('E',axis=0)
```

Selecting Rows.

```
df.loc['A']
```

Selecting Rows based on position instead of label.

```
df.iloc[2]
```

Selecting subset of rows and columns.

```
df.loc['B','Y']
```

```
df.loc[['A','B'],['W','Y']]
```

Selecting subset of rows with a certain condition.

```
df[df['W']>0]
```

Selecting column with a certain condition.

```
df[df['W']>0]['Y']
```

OUT:

A 0.907969

B -0.848077

D -0.933237

E 2.605967

Name: Y, dtype: float64

Selecting 2 columns with a certain condition.

```
df[df['W']>0][['Y','X']]
```

For several conditions you can use | and & with parenthesis.

```
df[(df['W']>0) & (df['Y'] > 1)]
```

Reset to default 0,1...n index.

```
df.reset_index()
```

Setting new index.

```
newind = 'CA NY WY OR CO'.split()
df['States'] = newind
df.set_index('States',inplace=True)
```

## Missing Data:

Dropping rows with empty values.

```
df.dropna()
```

Dropping columns with empty values.

```
df.dropna(axis=1)
```

Filling empty values.

```
df.fillna(value='FILL VALUE')
```

Filling empty values with column mean/average.

```
df['A'].fillna(value=df['A'].mean())
```

## Standart operations:

Info on Unique Values

```
df['col2'].unique()
```

Number of distinct elements.

```
df['col2'].nunique()
```

Counts of unique rows in the DataFrame.

```
df['col2'].value_counts()
```

Applying Functions.

```
def times2(x):
return x*2
df['col1'].apply(times2)
```

Applying SUM to column.

```
df['col1'].sum()
```

Permanently Removing a Column.

```
del df['col1']
```

Get column and index names.

```
df.columns
df.index
```

Sorting and Ordering a DataFrame.

```
df.sort_values(by='col1')
#inplace=False by default
```

Check for Null Values.

```
df.isnull()
```

## Data Input and Output:

CSV Input.

```
df = pd.read_csv('example')
```

Excel Input.

```
pd.read_excel('Excel_1.xlsx',sheetname='Sheet1')
```

Excel Output.

```
df.to_excel('Excel_2.xlsx',sheet_name='Sheet1')
```

HTML Input.

```
conda install lxml
conda install html5lib
conda install BeautifulSoup4
df = pd.read_html('http://www.fdic.gov/bank/individual/failed/banklist.html')
```

SQL Input.

```
from sqlalchemy import create_engine
engine = create_engine('sqlite:///:memory:')
df.to_sql('data', engine)
sql_df = pd.read_sql('data',con=engine)
```

## Statistical functions and operations:

Groupby.

```
df.groupby('Col')
```

Aggregate methods.

```
df.mean()
df.groupby('Col1').mean()
df.std()
df.min()
df.max()
df.count()
df.describe()
df.describe().transpose()
```