Pandas

위키피디아에 따르면 Pandas를 다음과 같이 정의할 수 있다.

In computer programming, pandas is a software library written for the Python programming language for data manipulation and analysis.

Pandas는 행과 열을 가지는 2차원 데이터를 다루는데 유용한 파이선 Library이고, 이는 Data를 조작하거나 분석하는데 유용하다. Pandas의 Dataframe으로 이러한 작업을 진행하고, 하나의 열을 Series라고 한다.

Series

  • 데이터 생성하기

  • 데이터 분석하기

  • 데이터 선택하기

  • 데이터 변경하기

Other functions

  • Series → numpy array

  • Series.max

    • 코드

    >>> idx = pd.MultiIndex.from_arrays([
    ...     ['warm', 'warm', 'cold', 'cold'],
    ...     ['dog', 'falcon', 'fish', 'spider']],
    ...     names=['blooded', 'animal'])
    >>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx)
    >>> s
    blooded  animal
    warm     dog       4
            falcon    2
    cold     fish      0
            spider    8
    Name: legs, dtype: int64
    

DataFrame

  • 데이터 생성하기

  • 데이터 선택하기

  • 데이터 변환하기

  • 데이터 종류와 전처리

  • group by, transform

  • pivot/pivot_table, stack/unstack

  • 데이터 병합하기

Other functions

  • Delete rows

    • 코드

    # Deleting columns
    
    # Delete the "Area" column from the dataframe
    data = data.drop("Area", axis=1)
    
    # alternatively, delete columns using the columns parameter of drop
    data = data.drop(columns="area")
    
    # Delete the Area column from the dataframe in place
    # Note that the original 'data' object is changed when inplace=True
    data.drop("Area", axis=1, inplace=True)
    
    # Delete multiple columns from the dataframe
    data = data.drop(["Y2001", "Y2002", "Y2003"], axis=1)
    
  • Append rows

    • 코드

    # Importing pandas as pd
    import pandas as pd
    
    # Creating the first Dataframe using dictionary
    df1 = df = pd.DataFrame({"a":[1, 2, 3, 4],
                            "b":[5, 6, 7, 8]})
    
    # Creating the Second Dataframe using dictionary
    df2 = pd.DataFrame({"a":[1, 2, 3],
                        "b":[5, 6, 7]})
    
    # to append df2 at the end of df1 dataframe
    df1 = df1.append(df2)
    
  • numpy array → DataFrame

    • 코드

    numpy_data = np.array([[1, 2], [3, 4]])
    df = pd.DataFrame(data=numpy_data, index=["row1", "row2"], columns=["column1", "column2"])
    
  • read_csv and to_csv

# read_csv
pd.read_csv('sample1.csv')
pd.read_csv('sample2.csv', names=['c1', 'c2', 'c3'])
pd.read_csv('sample1.csv', index_col='c1')
# This defines separator as being one single white space or more
pd.read_table('sample3.txt', sep='\s+')

# to_csv
df.to_csv('sample6.csv')
df.to_csv('sample7.txt', sep='|')
df.to_csv('sample8.csv', na_rep='누락')
  • Handle the dataframe

# Drop rows of Pandas DataFrame whose value in certain columns is NaN
df[pd.notnull(df['IC50 (nM)'])]

# Reset the index
df.reset_index(drop=True)

# Get header list
list(df.columns.values)
list(df)

# Select rows from a Pandas DataFrame based on values in a column
df.loc[df['favorite_color'] == 'yellow']
df[df['A'].str.contains("hello")]
df[df['A'].str.contains("Hello|Britain")==True]
array = ['yellow', 'green']
df.loc[df['favorite_color'].isin(array)]

# Change column names
df.rename(index=str, columns={"A": "a", "C": "c"})
  • Column 내용 수정

  • Reset index

    • 코드

    >>> df = pd.DataFrame([('bird', 389.0),
    ...                 ('bird', 24.0),
    ...                 ('mammal', 80.5),
    ...                 ('mammal', np.nan)],
    ...                 index=['falcon', 'parrot', 'lion', 'monkey'],
    ...                 columns=('class', 'max_speed'))
    >>> df
            class  max_speed
    falcon    bird      389.0
    parrot    bird       24.0
    lion    mammal       80.5
    monkey  mammal        NaN
    
  • Iterate rows

    • ilocs

      • 코드

      # import pandas package as pd
      import pandas as pd
      
      # Define a dictionary containing students data
      data = {'Name': ['Ankit', 'Amit', 'Aishwarya', 'Priyanka'],
                      'Age': [21, 19, 20, 18],
                      'Stream': ['Math', 'Commerce', 'Arts', 'Biology'],
                      'Percentage': [88, 92, 95, 70]}
      
      # Convert the dictionary into DataFrame
      df = pd.DataFrame(data, columns = ['Name', 'Age', 'Stream', 'Percentage'])
      
      print("Given Dataframe :\n", df)
      
      print("\nIterating over rows using iloc function :\n")
      
      # iterate through each row and select
      # 0th and 2nd index column respectively.
      for i in range(len(df)) :
      print(df.iloc[i, 0], df.iloc[i, 2])
      
      • 출력

      Given Dataframe :
              Name  Age    Stream  Percentage
      0      Ankit   21      Math          88
      1       Amit   19  Commerce          92
      2  Aishwarya   20      Arts          95
      3   Priyanka   18   Biology          70
      
      Iterating over rows using iloc function :
      
      Ankit Math
      Amit Commerce
      Aishwarya Arts
      Priyanka Biology
      
  • 정렬

    • 코드

    In [1]: import pandas as pd

    In [2]: personnel_df = pd.DataFrame({‘sequence’: [1, 3, 2], …: ‘name’: [‘park’, ‘lee’, ‘choi’], …: ‘age’: [30, 20, 40]})

    In [3]: personnel_df

    Out[3]: age name sequence 0 30 park 1 1 20 lee 3 2 40 choi 2

Reference