Data Wrangling Made Easy: Pandas Powers Your Analysis.

What is Pandas ?


Pandas - short for PANel DAta - is a special python library that can be used for data handling, manipulation, statistical analysis and much more. In simple words Pandas can be described as Excel for Python geeks.

Pandas DS cheat sheet.
Pandas for Data Science meme.

Python Knowledge Base: Make coding great again.
- Updated: 2024-07-26 by Andrey BRATUS, Senior Data Analyst.




    It's statistics functions are very powerful and useful for generating insights from data. Pandas has large number of methods to compute descriptive statistics and other related operations on Series and DataFrames.


    Pandas is a powerful Python library that has become the go-to tool for data scientists when it comes to data manipulation, cleaning, and analysis. To help you get the most out of this fantastic library, we've compiled a list of essential Pandas tips every data scientist should know.

    Use the .read_csv() function to quickly import large datasets and easily manipulate them as DataFrames, allowing for seamless data exploration and manipulation.

    Apply the .drop() function to remove unnecessary columns or rows, keeping your DataFrame lean and easy to handle, boosting performance and making your analysis more efficient.

    Use the .groupby() function to effortlessly group your data using specified columns, allowing for advanced aggregation and simplified analytical processes across subsets of your data.

    Implement the built-in .concat(), .merge(), and .join() functions to combine multiple DataFrames, creating more extensive and comprehensive datasets for in-depth analysis.

    Take advantage of .fillna() and .dropna() functions to deal with missing values appropriately, which is an essential part of any data cleaning process.

    Utilize the .apply(), .map(), and .applymap() methods for effective element-wise operations, significantly simplifying complex data manipulation and transformation tasks.

    Master the .pivot_table() function to create powerful multi-dimensional summaries, enabling meaningful insight extraction from large datasets with ease.

    Benefit from the .rolling() function for conducting rolling-window calculations, making time-series analysis more manageable and informative.

    Use the .dt accessor to manipulate date and time columns effectively, as working with datetime objects is a common challenge for data scientists.

    Finally, don't forget to utilize the .to_csv(), .to_excel() or .to_sql() functions to export your cleaned and processed DataFrame into the desired format, ensuring that your results are ready to be shared and analyzed further.

    In conclusion, mastering these essential Pandas tips will undoubtedly make your data manipulation and analysis tasks more efficient and effective, enabling you to get the most out of your Data Science projects using Python and Pandas.



  1. Pandas Series:


  2. Pandas Series is similar to a NumPy array (in fact it is built on top of the NumPy array object). The differentiates is that a Series can have axis labels, meaning it can be indexed by a label, instead of just a number location. It also doesn't limited numeric data, it can hold any arbitrary Python Object. Lets create initial data.

    
    import numpy as np
    import pandas as pd
    
    labels = ['a','b','c']
    my_list = [10,20,30]
    arr = np.array([10,20,30])
    d = {'a':10,'b':20,'c':30}
    
    pd.Series(data=my_list,index=labels)
    

    OUT:
    a 10
    b 20
    c 30
    dtype: int64


    A pandas Series can hold a variety of data types.

    
    pd.Series(data=labels)
    

    OUT:
    0 a
    1 b
    2 c
    dtype: object


    The key to using a Series is understanding its index.

    
    ser1 = pd.Series([1,2,3,4],index = ['USA', 'Germany','USSR', 'Japan']) 
    ser1 
    

    OUT:
    USA 1
    Germany 2
    USSR 3
    Japan 4
    dtype: int64



    
    ser2 = pd.Series([1,2,5,4],index = ['USA', 'Germany','Italy', 'Japan'])  
    ser2 
    

    OUT:
    USA 1
    Germany 2
    Italy 5
    Japan 4
    dtype: int64


    
    ser1['USA']
    

    OUT: 1


    Simple operations can be done based on index:

    
    ser1 + ser2
    

    OUT:
    Germany 4.0
    Italy NaN
    Japan 8.0
    USA 2.0
    USSR NaN
    dtype: float64


  3. Pandas DataFrames:


  4. DataFrames are the core of pandas. We can define a DataFrame as a bunch of Series objects put together to share the same index ( each column is a series data type in fact ). Lets create initial data.

    
    import pandas as pd
    import numpy as np
    
    from numpy.random import randn
    np.random.seed(101)
    
    df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())
    
    df['W']
    

    OUT:
    A 2.706850
    B 0.651118
    C -2.018168
    D 0.188695
    E 0.190794
    Name: W, dtype: float64


    SQL Syntax.

    
    df.W
    

    OUT:
    A 2.706850
    B 0.651118
    C -2.018168
    D 0.188695
    E 0.190794
    Name: W, dtype: float64


    You can pass a list of column names.

    
    df[['W','Z']]
    

    DataFrame Columns are just Series.

    
    type(df['W'])
    

    OUT: pandas.core.series.Series


    Removing/Deleting Columns.

    
    df.drop('new',axis=1,inplace=True)
    

    Removing/Deleting rows.

    
    df.drop('E',axis=0)
    

    Selecting Rows.

    
    df.loc['A']
    

    Selecting Rows based on position instead of label.

    
    df.iloc[2]
    

    Selecting subset of rows and columns.

    
    df.loc['B','Y']
    
    
    df.loc[['A','B'],['W','Y']]
    

    Selecting subset of rows with a certain condition.

    
    df[df['W']>0]
    

    Selecting column with a certain condition.

    
    df[df['W']>0]['Y']
    

    OUT:
    A 0.907969
    B -0.848077
    D -0.933237
    E 2.605967
    Name: Y, dtype: float64


    Selecting 2 columns with a certain condition.

    
    df[df['W']>0][['Y','X']]
    

    For several conditions you can use | and & with parenthesis.

    
    df[(df['W']>0) & (df['Y'] > 1)]
    

    Reset to default 0,1...n index.

    
    df.reset_index()
    

    Setting new index.

    
    newind = 'CA NY WY OR CO'.split()
    df['States'] = newind
    df.set_index('States',inplace=True)
    
  5. Missing Data:


  6. Dropping rows with empty values.

    
    df.dropna()
    

    Dropping columns with empty values.

    
    df.dropna(axis=1)
    

    Filling empty values.

    
    df.fillna(value='FILL VALUE')
    

    Filling empty values with column mean/average.

    
    df['A'].fillna(value=df['A'].mean())
    
  7. Standart operations:


  8. Info on Unique Values

    
    df['col2'].unique()
    

    Number of distinct elements.

    
    df['col2'].nunique()
    

    Counts of unique rows in the DataFrame.

    
    df['col2'].value_counts()
    

    Applying Functions.

    
    def times2(x):
        return x*2
    
    
    df['col1'].apply(times2)    
    

    Applying SUM to column.

    
    df['col1'].sum()    
    

    Permanently Removing a Column.

    
    del df['col1']    
    

    Get column and index names.

    
    df.columns
    
    df.index    
    

    Sorting and Ordering a DataFrame.

    
    df.sort_values(by='col1') 
    #inplace=False by default    
    

    Check for Null Values.

    
    df.isnull()    
    
  9. Data Input and Output:


  10. CSV Input.

    
    df = pd.read_csv('example')  
    

    Excel Input.

    
    pd.read_excel('Excel_1.xlsx',sheetname='Sheet1') 
    

    Excel Output.

    
    df.to_excel('Excel_2.xlsx',sheet_name='Sheet1') 
    

    HTML Input.

    
    conda install lxml
    conda install html5lib
    conda install BeautifulSoup4
    
    df = pd.read_html('http://www.fdic.gov/bank/individual/failed/banklist.html')
    

    SQL Input.

    
    from sqlalchemy import create_engine
    
    engine = create_engine('sqlite:///:memory:')
    
    df.to_sql('data', engine)
    
    sql_df = pd.read_sql('data',con=engine)
    
  11. Statistical functions and operations:


  12. Groupby.

    
    df.groupby('Col')
    

    Aggregate methods.

    
    df.mean()
    
    df.groupby('Col1').mean()
    
    df.std()
    
    df.min()
    
    df.max()
    
    df.count()
    
    df.describe()
    
    df.describe().transpose()
    



See also related topics: