Pandas完全指南

文章目录
  1. 1. 前言
  2. 2. 导入pandas
  3. 3. 创建列表
    1. 3.1. 创建普通列表
    2. 3.2. 创建时间列表
  4. 4. 创建矩阵
    1. 4.1. 根据列表(Series)创建矩阵
    2. 4.2. 根据字典创建矩阵
  5. 5. 矩阵属性、检视数据
    1. 5.1. 行数列数
    2. 5.2. 索引
    3. 5.3. 列名
    4. 5.4.
    5. 5.5. 矩阵信息
    6. 5.6. 矩阵描述信息
    7. 5.7. top5 数据
    8. 5.8. tail5 数据
    9. 5.9. 某一列值统计
    10. 5.10. 查看每一列唯一值统计
  6. 6. 排序
    1. 6.1. 根据索引(index)排序
    2. 6.2. 根据值排序
  7. 7. 选取数据
    1. 7.1. 选取某一列
    2. 7.2. 根据索引选取某几行数据
    3. 7.3. 根据列名选择某几列数据
    4. 7.4. 根据索引和列名选择数据
    5. 7.5. 根据行索引和列索引取数据
    6. 7.6. 根据布尔表达式表达式取数据
  8. 8. 添加/删除列、更新、替换数据
    1. 8.1. 设置某矩阵项值
    2. 8.2. 根据条件创建新列
    3. 8.3. 根据已有列创建新列
    4. 8.4. 替换数据
    5. 8.5. 列名重命名
    6. 8.6. 重设索引
    7. 8.7. 删除列
    8. 8.8. 处理Nan数据
      1. 8.8.1. 检查是否Nan值
      2. 8.8.2. 删除掉包含null值的行
      3. 8.8.3. 替换Nan
  9. 9. 聚合、分组、统计
    1. 9.1. 返回每一列的平均数
    2. 9.2. 返回列之间的相关性
    3. 9.3. 返回每一列中非null值数量
    4. 9.4. 返回每一列中最大值
    5. 9.5. 返回每一列中最小值
    6. 9.6. 返回每一列的中值
    7. 9.7. 返回每一列的标准偏差
    8. 9.8. 分组后取TopN
    9. 9.9. 多重分组后取TopN
    10. 9.10. 分组之后取平均值
    11. 9.11. 分组之后计数
    12. 9.12. 分组后唯一值统计
    13. 9.13. 分组后求和
    14. 9.14. 分组后求平均值,最大值,最小值
    15. 9.15. 分组后不同列使用不同求值函数
    16. 9.16. 多个分组结果拼接
    17. 9.17. 遍历分组
    18. 9.18. 获取分组信息
    19. 9.19. 取分组后的某一组
    20. 9.20. 分组后过滤
  10. 10. 数据透视
  11. 11. 合并、连接、拼接(Merge, join, and concatenate)
    1. 11.1. 拼接(concatenate)
    2. 11.2. 连接(Join)
    3. 11.3. 根据某一列进行join
  12. 12. 数据导入导出
    1. 12.1. 从csv中导入数据
    2. 12.2. 导出数据到csv

前言

Pandas 是一个Python语言实现的,开源,易于使用的数据架构以及数据分析工具。在Pandas中主要有两种数据类型,可以简单的理解为:

  • Series:一维数组(列表)
  • DateFrame:二维数组(矩阵)

在线实验:Pandas完全指南.ipynb

学习资料:

导入pandas

1
2
3
import pandas as pd
import numpy as np
from IPython.display import Image

创建列表

创建普通列表

1
s = pd.Series([1, 3, 6, np.nan, 23, 3]) # type(s) === 'pandas.core.series.Series'

创建时间列表

1
dates = pd.date_range('20200101', periods=6)

创建矩阵

根据列表(Series)创建矩阵

1
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=['a', 'b', 'c', 'd'])
1
df
abcd
2020-01-01-1.3657741.1698990.607591-2.029687
2020-01-02-0.967683-0.8004480.1236730.700337
2020-01-031.7906090.5606660.3440510.799520
2020-01-042.0686630.320610-1.6606310.416631
2020-01-05-0.956351-0.6570501.241433-0.652496
2020-01-06-1.1358701.8880930.5333640.080852
1
2
3
4
5
df2 = pd.DataFrame({
'a':pd.Series([1, 2, 3, 4]),
'b':pd.Timestamp('20180708'),
'c':pd.Categorical(['cate1', 'cate2', 'cate3', 'cate4'])
})
1
df2
abc
012018-07-08cate1
122018-07-08cate2
232018-07-08cate3
342018-07-08cate4

根据字典创建矩阵

1
2
3
4
5
6
7
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy', 'Jack', 'Tim'], 
'age': [20, 32, 36, 24, 23, 18, 27],
'gender': np.random.choice(['M','F'],size=7),
'score': [25, 94, 57, 62, 70, 88, 67],
'country': np.random.choice(['US','CN'],size=7),
}
df3 = pd.DataFrame(data, columns = ['name', 'age', 'gender', 'score', 'country'])
1
df3
nameagegenderscorecountry
0Jason20F25US
1Molly32F94US
2Tina36F57US
3Jake24M62CN
4Amy23F70US
5Jack18M88CN
6Tim27F67CN

矩阵属性、检视数据

行数列数

1
df.shape
(6, 4)

索引

1
df.index
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06'],
              dtype='datetime64[ns]', freq='D')

列名

1
df.columns
Index(['a', 'b', 'c', 'd'], dtype='object')

1
df.values
array([[-1.36577441,  1.16989918,  0.60759059, -2.02968684],
       [-0.96768326, -0.80044798,  0.12367311,  0.70033731],
       [ 1.79060939,  0.56066552,  0.34405077,  0.79952019],
       [ 2.06866329,  0.32060998, -1.6606308 ,  0.41663058],
       [-0.95635134, -0.65704975,  1.24143335, -0.65249624],
       [-1.1358703 ,  1.88809265,  0.53336403,  0.08085195]])

矩阵信息

1
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 6 entries, 2020-01-01 to 2020-01-06
Freq: D
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   a       6 non-null      float64
 1   b       6 non-null      float64
 2   c       6 non-null      float64
 3   d       6 non-null      float64
dtypes: float64(4)
memory usage: 240.0 bytes

矩阵描述信息

1
df.describe()
abcd
count6.0000006.0000006.0000006.000000
mean-0.0944010.4136280.198247-0.114141
std1.5772601.0389030.9849211.074899
min-1.365774-0.800448-1.660631-2.029687
25%-1.093824-0.4126350.178768-0.469159
50%-0.9620170.4406380.4387070.248741
75%1.1038691.0175910.5890340.629411
max2.0686631.8880931.2414330.799520
1
2
3
4
### 更改索引
df.index = pd.date_range('2020/06/01', periods=df.shape[0])

df
abcd
2020-06-01-1.3657741.1698990.607591-2.029687
2020-06-02-0.967683-0.8004480.1236730.700337
2020-06-031.7906090.5606660.3440510.799520
2020-06-042.0686630.320610-1.6606310.416631
2020-06-05-0.956351-0.6570501.241433-0.652496
2020-06-06-1.1358701.8880930.5333640.080852

top5 数据

1
df.head(1)
abcd
2020-06-01-1.3657741.1698990.607591-2.029687

tail5 数据

1
df.tail(5)
abcd
2020-06-02-0.967683-0.8004480.1236730.700337
2020-06-031.7906090.5606660.3440510.799520
2020-06-042.0686630.320610-1.6606310.416631
2020-06-05-0.956351-0.6570501.241433-0.652496
2020-06-06-1.1358701.8880930.5333640.080852

某一列值统计

1
df['a'].value_counts(dropna=False)
 1.790609    1
-1.135870    1
 2.068663    1
-0.967683    1
-1.365774    1
-0.956351    1
Name: a, dtype: int64

查看每一列唯一值统计

1
df.apply(pd.Series.value_counts)
abcd
-2.029687NaNNaNNaN1.0
-1.660631NaNNaN1.0NaN
-1.3657741.0NaNNaNNaN
-1.1358701.0NaNNaNNaN
-0.9676831.0NaNNaNNaN
-0.9563511.0NaNNaNNaN
-0.800448NaN1.0NaNNaN
-0.657050NaN1.0NaNNaN
-0.652496NaNNaNNaN1.0
0.080852NaNNaNNaN1.0
0.123673NaNNaN1.0NaN
0.320610NaN1.0NaNNaN
0.344051NaNNaN1.0NaN
0.416631NaNNaNNaN1.0
0.533364NaNNaN1.0NaN
0.560666NaN1.0NaNNaN
0.607591NaNNaN1.0NaN
0.700337NaNNaNNaN1.0
0.799520NaNNaNNaN1.0
1.169899NaN1.0NaNNaN
1.241433NaNNaN1.0NaN
1.7906091.0NaNNaNNaN
1.888093NaN1.0NaNNaN
2.0686631.0NaNNaNNaN

排序

根据索引(index)排序

1
2
3
# sort_index(axis=, ascending=)
# axis:0-行排序,1-列排序; ascending:True-升序,False-降序
df.sort_index(axis=0, ascending=False)
abcd
2020-06-06-1.1358701.8880930.5333640.080852
2020-06-05-0.956351-0.6570501.241433-0.652496
2020-06-042.0686630.320610-1.6606310.416631
2020-06-031.7906090.5606660.3440510.799520
2020-06-02-0.967683-0.8004480.1236730.700337
2020-06-01-1.3657741.1698990.607591-2.029687
1
df.sort_index(axis=1, ascending=False)
dcba
2020-06-01-2.0296870.6075911.169899-1.365774
2020-06-020.7003370.123673-0.800448-0.967683
2020-06-030.7995200.3440510.5606661.790609
2020-06-040.416631-1.6606310.3206102.068663
2020-06-05-0.6524961.241433-0.657050-0.956351
2020-06-060.0808520.5333641.888093-1.135870

根据值排序

1
df.sort_values(by='a', ascending=False)
abcd
2020-06-042.0686630.320610-1.6606310.416631
2020-06-031.7906090.5606660.3440510.799520
2020-06-05-0.956351-0.6570501.241433-0.652496
2020-06-02-0.967683-0.8004480.1236730.700337
2020-06-06-1.1358701.8880930.5333640.080852
2020-06-01-1.3657741.1698990.607591-2.029687
1
df.sort_values(by=['a','b'], ascending=True)
abcd
2020-06-01-1.3657741.1698990.607591-2.029687
2020-06-06-1.1358701.8880930.5333640.080852
2020-06-02-0.967683-0.8004480.1236730.700337
2020-06-05-0.956351-0.6570501.241433-0.652496
2020-06-031.7906090.5606660.3440510.799520
2020-06-042.0686630.320610-1.6606310.416631

选取数据

选取某一列

1
df['a'] # 等效于df.a
2020-06-01   -1.365774
2020-06-02   -0.967683
2020-06-03    1.790609
2020-06-04    2.068663
2020-06-05   -0.956351
2020-06-06   -1.135870
Freq: D, Name: a, dtype: float64

根据索引选取某几行数据

1
df['2020-06-01':'2020-06-02'] # 选取索引以2020-06-01开始,到2020-06-02结束的数据
abcd
2020-06-01-1.3657741.1698990.607591-2.029687
2020-06-02-0.967683-0.8004480.1236730.700337

根据列名选择某几列数据

1
df[['c', 'b']]
cb
2020-06-010.6075911.169899
2020-06-020.123673-0.800448
2020-06-030.3440510.560666
2020-06-04-1.6606310.320610
2020-06-051.241433-0.657050
2020-06-060.5333641.888093

根据索引和列名选择数据

loc[行名选择, 列名选择],未指定行名或列名,或者指定为:则表示选择当前所有行,或列

1
df.loc['2020-06-01']
a   -1.365774
b    1.169899
c    0.607591
d   -2.029687
Name: 2020-06-01 00:00:00, dtype: float64
1
df.loc['2020-06-01', 'b']
1.1698991845802456
1
df.loc[:, 'b'] # type(df.loc[:, 'b']) === 'pandas.core.series.Series',而type(df.loc[:, ['b']]) === ’pandas.core.frame.DataFrame‘
2020-06-01    1.169899
2020-06-02   -0.800448
2020-06-03    0.560666
2020-06-04    0.320610
2020-06-05   -0.657050
2020-06-06    1.888093
Freq: D, Name: b, dtype: float64
1
df.loc[:, ['a', 'b']]
ab
2020-06-01-1.3657741.169899
2020-06-02-0.967683-0.800448
2020-06-031.7906090.560666
2020-06-042.0686630.320610
2020-06-05-0.956351-0.657050
2020-06-06-1.1358701.888093

根据行索引和列索引取数据

1
df.iloc[0,0] # === df.loc['2020-06-01', 'a']
-1.3657744117360429
1
df.iloc[0, :] # ==== df.loc['2020-06-01', :]
a   -1.365774
b    1.169899
c    0.607591
d   -2.029687
Name: 2020-06-01 00:00:00, dtype: float64

根据布尔表达式表达式取数据

只有当布尔表达式为真时的数据才会被选择

1
df[df.a > 1]
abcd
2020-06-031.7906090.5606660.3440510.799520
2020-06-042.0686630.320610-1.6606310.416631
1
df[(df['a'] > 1) & (df['d'] <0)]
abcd

添加/删除列、更新、替换数据

设置某矩阵项值

1
2
df.loc['2020-06-01', 'a'] = np.nan
df.loc['2020-06-06', 'c'] = np.nan
1
df
abcd
2020-06-01NaN1.1698990.607591-2.029687
2020-06-02-0.967683-0.8004480.1236730.700337
2020-06-031.7906090.5606660.3440510.799520
2020-06-042.0686630.320610-1.6606310.416631
2020-06-05-0.956351-0.6570501.241433-0.652496
2020-06-06-1.1358701.888093NaN0.080852

根据条件创建新列

1
df['e'] = np.where((df['a'] > 1) & (df['d']<0), 1, 0)
1
df
abcde
2020-06-01NaN1.1698990.607591-2.0296870
2020-06-02-0.967683-0.8004480.1236730.7003370
2020-06-031.7906090.5606660.3440510.7995200
2020-06-042.0686630.320610-1.6606310.4166310
2020-06-05-0.956351-0.6570501.241433-0.6524960
2020-06-06-1.1358701.888093NaN0.0808520

根据已有列创建新列

1
2
tmp = df.copy()
df.loc[:,'f'] = tmp.apply(lambda row: row['b']+ row['d'], axis=1)
1
df
abcdef
2020-06-01NaN1.1698990.607591-2.0296870-0.859788
2020-06-02-0.967683-0.8004480.1236730.7003370-0.100111
2020-06-031.7906090.5606660.3440510.79952001.360186
2020-06-042.0686630.320610-1.6606310.41663100.737241
2020-06-05-0.956351-0.6570501.241433-0.6524960-1.309546
2020-06-06-1.1358701.888093NaN0.08085201.968945

替换数据

1
2
# 将所有等于1的值替换成20
df.replace(1,20)
abcdef
2020-06-01NaN1.1698990.607591-2.0296870-0.859788
2020-06-02-0.967683-0.8004480.1236730.7003370-0.100111
2020-06-031.7906090.5606660.3440510.79952001.360186
2020-06-042.0686630.320610-1.6606310.41663100.737241
2020-06-05-0.956351-0.6570501.241433-0.6524960-1.309546
2020-06-06-1.1358701.888093NaN0.08085201.968945
1
2
# 使用one替换1,three替换3
df.replace([1,3],['one','three'])
abcdef
2020-06-01NaN1.1698990.607591-2.0296870-0.859788
2020-06-02-0.967683-0.8004480.1236730.7003370-0.100111
2020-06-031.7906090.5606660.3440510.79952001.360186
2020-06-042.0686630.320610-1.6606310.41663100.737241
2020-06-05-0.956351-0.6570501.241433-0.6524960-1.309546
2020-06-06-1.1358701.888093NaN0.08085201.968945

列名重命名

1
df.rename(columns={'c':'cc'})
abccdef
2020-06-01NaN1.1698990.607591-2.0296870-0.859788
2020-06-02-0.967683-0.8004480.1236730.7003370-0.100111
2020-06-031.7906090.5606660.3440510.79952001.360186
2020-06-042.0686630.320610-1.6606310.41663100.737241
2020-06-05-0.956351-0.6570501.241433-0.6524960-1.309546
2020-06-06-1.1358701.888093NaN0.08085201.968945

重设索引

1
2
# 将a设置为索引
df.set_index('a')
bcdef
a
NaN1.1698990.607591-2.0296870-0.859788
-0.967683-0.8004480.1236730.7003370-0.100111
1.7906090.5606660.3440510.79952001.360186
2.0686630.320610-1.6606310.41663100.737241
-0.956351-0.6570501.241433-0.6524960-1.309546
-1.1358701.888093NaN0.08085201.968945

删除列

1
df.drop(columns=['a', 'f'])
bcde
2020-06-011.1698990.607591-2.0296870
2020-06-02-0.8004480.1236730.7003370
2020-06-030.5606660.3440510.7995200
2020-06-040.320610-1.6606310.4166310
2020-06-05-0.6570501.241433-0.6524960
2020-06-061.888093NaN0.0808520

处理Nan数据

检查是否Nan值

1
df.isnull()
abcdef
2020-06-01TrueFalseFalseFalseFalseFalse
2020-06-02FalseFalseFalseFalseFalseFalse
2020-06-03FalseFalseFalseFalseFalseFalse
2020-06-04FalseFalseFalseFalseFalseFalse
2020-06-05FalseFalseFalseFalseFalseFalse
2020-06-06FalseFalseTrueFalseFalseFalse
1
df.notnull() # df.isnull()反操作
abcdef
2020-06-01FalseTrueTrueTrueTrueTrue
2020-06-02TrueTrueTrueTrueTrueTrue
2020-06-03TrueTrueTrueTrueTrueTrue
2020-06-04TrueTrueTrueTrueTrueTrue
2020-06-05TrueTrueTrueTrueTrueTrue
2020-06-06TrueTrueFalseTrueTrueTrue

删除掉包含null值的行

1
2
3
4
### dropna(axis=, how=):丢弃NaN数据,
# axis:0-按行丢弃),1-按列丢弃; how:'any'-只要含有NaN数据就丢弃,'all'-所有数据都为NaN时丢弃

df.dropna(axis=0)
abcdef
2020-06-02-0.967683-0.8004480.1236730.7003370-0.100111
2020-06-031.7906090.5606660.3440510.79952001.360186
2020-06-042.0686630.320610-1.6606310.41663100.737241
2020-06-05-0.956351-0.6570501.241433-0.6524960-1.309546

替换Nan

1
2
#### 使用1000替换Nan
df.fillna(1000)
abcdef
2020-06-011000.0000001.1698990.607591-2.0296870-0.859788
2020-06-02-0.967683-0.8004480.1236730.7003370-0.100111
2020-06-031.7906090.5606660.3440510.79952001.360186
2020-06-042.0686630.320610-1.6606310.41663100.737241
2020-06-05-0.956351-0.6570501.241433-0.6524960-1.309546
2020-06-06-1.1358701.8880931000.0000000.08085201.968945
1
2
# 使用平均值替换所有null值
df.fillna(df.mean())
abcdef
2020-06-010.1598741.1698990.607591-2.0296870-0.859788
2020-06-02-0.967683-0.8004480.1236730.7003370-0.100111
2020-06-031.7906090.5606660.3440510.79952001.360186
2020-06-042.0686630.320610-1.6606310.41663100.737241
2020-06-05-0.956351-0.6570501.241433-0.6524960-1.309546
2020-06-06-1.1358701.8880930.1312230.08085201.968945

聚合、分组、统计

返回每一列的平均数

1
df.mean()
a    0.159874
b    0.413628
c    0.131223
d   -0.114141
e    0.000000
f    0.299488
dtype: float64

返回列之间的相关性

1
df.corr()
abcdef
a1.0000000.101781-0.6800850.508954NaN0.318586
b0.1017811.000000-0.171353-0.266608NaN0.587598
c-0.680085-0.1713531.000000-0.437212NaN-0.605077
d0.508954-0.266608-0.4372121.000000NaN0.623208
eNaNNaNNaNNaNNaNNaN
f0.3185860.587598-0.6050770.623208NaN1.000000

返回每一列中非null值数量

1
df.count()
a    5
b    6
c    5
d    6
e    6
f    6
dtype: int64

返回每一列中最大值

1
df.max()
a    2.068663
b    1.888093
c    1.241433
d    0.799520
e    0.000000
f    1.968945
dtype: float64

返回每一列中最小值

1
df.min()
a   -1.135870
b   -0.800448
c   -1.660631
d   -2.029687
e    0.000000
f   -1.309546
dtype: float64

返回每一列的中值

1
df.median()
a   -0.956351
b    0.440638
c    0.344051
d    0.248741
e    0.000000
f    0.318565
dtype: float64

返回每一列的标准偏差

1
df.std()
a    1.620114
b    1.038903
c    1.085770
d    1.074899
e    0.000000
f    1.280342
dtype: float64

分组后取TopN

1
2
3
4
5
### 取每个国家下,分值前二的记录

# 先排序
df4 = df3.sort_values(['country','score'],ascending=[1, 0],inplace=False)
df4
nameagegenderscorecountry
5Jack18M88CN
6Tim27F67CN
3Jake24M62CN
1Molly32F94US
4Amy23F70US
2Tina36F57US
0Jason20F25US
1
2
# 取值
df4.groupby(['country']).head(2)
nameagegenderscorecountry
5Jack18M88CN
6Tim27F67CN
1Molly32F94US
4Amy23F70US

多重分组后取TopN

1
2
3
4
5
### 取每个国家下,分值前二的记录

# 先排序
df5 = df3.sort_values(['country','gender', 'score'],ascending=[1, 0, 0],inplace=False)
df5
nameagegenderscorecountry
5Jack18M88CN
3Jake24M62CN
6Tim27F67CN
1Molly32F94US
4Amy23F70US
2Tina36F57US
0Jason20F25US
1
2
df5 = df5.groupby(['country', 'gender']).head(1) # 注意此处取1
df5
nameagegenderscorecountry
5Jack18M88CN
6Tim27F67CN
1Molly32F94US
1
df5.groupby(['country']).head(2)
nameagegenderscorecountry
5Jack18M88CN
6Tim27F67CN
1Molly32F94US

分组之后取平均值

1
2
3
scoreMean = df3.groupby(['gender'])['score'].mean()
scoreMean = pd.DataFrame(scoreMean) # 等效于socreMean = scoreMean.to_frame()
scoreMean
score
gender
F62.6
M75.0
1
2
#### 合并
df3.merge(scoreMean,left_on='gender',right_index=True)
nameagegenderscore_xcountryscore_y
0Jason20F25US62.6
1Molly32F94US62.6
2Tina36F57US62.6
4Amy23F70US62.6
6Tim27F67CN62.6
3Jake24M62CN75.0
5Jack18M88CN75.0
1
df3
nameagegenderscorecountry
0Jason20F25US
1Molly32F94US
2Tina36F57US
3Jake24M62CN
4Amy23F70US
5Jack18M88CN
6Tim27F67CN

分组之后计数

1
df3.groupby(['country'])['gender'].count().to_frame()
gender
country
CN3
US4
1
2
3
### 按性别统计每个国家的人数

df3.groupby(['country', 'gender'])['gender'].count().to_frame()
gender
countrygender
CNF1
M2
USF4

分组后唯一值统计

1
df3.groupby(['country'])['gender'].nunique().to_frame()
gender
country
CN2
US1

分组后求和

1
2
# 默认是所有数值类型列求和
df3.groupby('country').sum()
agescore
country
CN69217
US111246
1
2
# 指定列求和
df3.groupby('country')['score'].sum() # 等效于df3.groupby(['country'])['score'].apply(np.sum)
country
CN    217
US    246
Name: score, dtype: int64
1
import matplotlib.pyplot as plt
1
2
3
plt.clf()
df3.groupby('country').sum().plot(kind='bar')
plt.show()
<Figure size 432x288 with 0 Axes>

png

1
df3.groupby('country')['score'].sum().plot(kind='bar')
<matplotlib.axes._subplots.AxesSubplot at 0x7f040a967a90>

png

分组后求平均值,最大值,最小值

1
df3.groupby('country').agg({'score':['min','max','mean']})
score
minmaxmean
country
CN628872.333333
US259461.500000
1
2
# 跟上面效果一致
df3.groupby('country')['score'].agg([np.min, np.max, np.mean])
aminamaxmean
country
CN628872.333333
US259461.500000

分组后不同列使用不同求值函数

1
2
df3.groupby('country').agg({'score': ['max','min', 'std'],
'age': ['sum', 'count', 'max']})
scoreage
maxminstdsumcountmax
country
CN886213.79613569327
US942528.757608111436

多个分组结果拼接

1
2
3
4
t1=df3.groupby('country')['score'].mean().to_frame()
t2 = df3.groupby('country')['age'].sum().to_frame()

t1.merge(t2,left_index=True,right_index=True)
scoreage
country
CN72.33333369
US61.500000111

遍历分组

1
2
3
4
grouped = df3.groupby('country')
for name,group in grouped:
print(name)
print(group)
CN
   name  age gender  score country
3  Jake   24      M     62      CN
5  Jack   18      M     88      CN
6   Tim   27      F     67      CN
US
    name  age gender  score country
0  Jason   20      F     25      US
1  Molly   32      F     94      US
2   Tina   36      F     57      US
4    Amy   23      F     70      US
1
2
3
4
grouped = df3.groupby(['country', 'gender'])
for name,group in grouped:
print(name)
print(group)
('CN', 'F')
  name  age gender  score country
6  Tim   27      F     67      CN
('CN', 'M')
   name  age gender  score country
3  Jake   24      M     62      CN
5  Jack   18      M     88      CN
('US', 'F')
    name  age gender  score country
0  Jason   20      F     25      US
1  Molly   32      F     94      US
2   Tina   36      F     57      US
4    Amy   23      F     70      US

获取分组信息

1
df3.groupby('country').groups
{'CN': Int64Index([3, 5, 6], dtype='int64'),
 'US': Int64Index([0, 1, 2, 4], dtype='int64')}

取分组后的某一组

1
df3.groupby('country').get_group('CN')
nameagegenderscorecountry
3Jake24M62CN
5Jack18M88CN
6Tim27F67CN

分组后过滤

1
df3.groupby('name').filter(lambda x: len(x) >= 3)
nameagegenderscorecountry

数据透视

1
2
3
4
5
# 数据透视的值项只能是数值类型
# pivot(index =,columns=,values=):透视数据
# index:透视的列(作为索引, 且值都是唯一的); columns-用于进一步细分index;values查看具体值

df3.pivot(index ='name',columns='gender',values=['score','age'])
scoreage
genderFMFM
name
Amy70.0NaN23.0NaN
JackNaN88.0NaN18.0
JakeNaN62.0NaN24.0
Jason25.0NaN20.0NaN
Molly94.0NaN32.0NaN
Tim67.0NaN27.0NaN
Tina57.0NaN36.0NaN
1
2
3
4
# pivot_table(index =,columns=,values=):透视数据
# index:透视的列(作为索引, 且值都是唯一的); columns-用于进一步细分index;values查看具体值;fill_value:0-用0替换Nan; margins:True-汇总

pd.pivot_table(df3,index=['country', 'gender'], values=['score'],aggfunc=np.sum)
score
countrygender
CNF67
M150
USF246
1
pd.pivot_table(df3,index=['country', 'gender'], values=['score', 'age'],aggfunc=[np.sum, np.mean],fill_value=0,margins=True)
summean
agescoreagescore
countrygender
CNF276727.00000067.000000
M4215021.00000075.000000
USF11124627.75000061.500000
All18046325.71428666.142857
1
df3
nameagegenderscorecountry
0Jason20F25US
1Molly32F94US
2Tina36F57US
3Jake24M62CN
4Amy23F70US
5Jack18M88CN
6Tim27F67CN

合并、连接、拼接(Merge, join, and concatenate)

拼接(concatenate)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
t1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']},
index=[0, 1, 2, 3])
print('-----t1----')
print(t1)

t2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
'B': ['B4', 'B5', 'B6', 'B7'],
'C': ['C4', 'C5', 'C6', 'C7'],
'D': ['D4', 'D5', 'D6', 'D7']},
index=[4, 5, 6, 7])

print('----t2-----')
print(t2)

t3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
'B': ['B8', 'B9', 'B10', 'B11'],
'C': ['C8', 'C9', 'C10', 'C11'],
'D': ['D8', 'D9', 'D10', 'D11']},
index=[8, 9, 10, 11])

print('-----t3----')
print(t2)
frames = [t1, t2, t3]

pd.concat(frames)
-----t1----
    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3
----t2-----
    A   B   C   D
4  A4  B4  C4  D4
5  A5  B5  C5  D5
6  A6  B6  C6  D6
7  A7  B7  C7  D7
-----t3----
    A   B   C   D
4  A4  B4  C4  D4
5  A5  B5  C5  D5
6  A6  B6  C6  D6
7  A7  B7  C7  D7
ABCD
0A0B0C0D0
1A1B1C1D1
2A2B2C2D2
3A3B3C3D3
4A4B4C4D4
5A5B5C5D5
6A6B6C6D6
7A7B7C7D7
8A8B8C8D8
9A9B9C9D9
10A10B10C10D10
11A11B11C11D11
1
2
3
# concat类似:linux的split命令把文件分成多个,然后在拼接成一个完成文件

Image(url="http://static.cyub.vip/images/202001/pandas.concat.png")
1
2
3
4
5
6
7
8
t4 = pd.DataFrame({'B': ['B2', 'B3', 'B6', 'B7'],
'D': ['D2', 'D3', 'D6', 'D7'],
'F': ['F2', 'F3', 'F6', 'F7']},
index=[2, 3, 6, 7])

print('-----t4----')

pd.concat([t1, t4], axis=1, sort=False) # 此时相当于out joiner
-----t4----
ABCDBDF
0A0B0C0D0NaNNaNNaN
1A1B1C1D1NaNNaNNaN
2A2B2C2D2B2D2F2
3A3B3C3D3B3D3F3
6NaNNaNNaNNaNB6D6F6
7NaNNaNNaNNaNB7D7F7
1
Image(url="http://static.cyub.vip/images/202001/pandas.concat.outer_join.png")
1
pd.concat([t1, t4], axis=1, join='inner')
ABCDBDF
2A2B2C2D2B2D2F2
3A3B3C3D3B3D3F3
1
Image(url="http://static.cyub.vip/images/202001/pandas.concat.inner_join.png")
1
t1.append([t2,t3]) # 相当于pd.concat([t1, t2, t3])
ABCD
0A0B0C0D0
1A1B1C1D1
2A2B2C2D2
3A3B3C3D3
4A4B4C4D4
5A5B5C5D5
6A6B6C6D6
7A7B7C7D7
8A8B8C8D8
9A9B9C9D9
10A10B10C10D10
11A11B11C11D11

连接(Join)

join(on=None, how=‘left’, lsuffix=’’, rsuffix=’’, sort=False)

on:join的键,默认是矩阵的index, how:join方式,left-相当于左连接,outer,inner

更多查看Database-style DataFrame or named Series joining/merging

1
2
3
4
5
6
7
8
9
10
11
12
13
14
left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2']},
index=['K0', 'K1', 'K2'])

print('----left----')
print(left)

right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
'D': ['D0', 'D2', 'D3']},
index=['K0', 'K2', 'K3'])
print('---right----')
print(right)

left.join(right) # 相当于 pd.merge(left, right, left_index=True, right_index=True, how='left')
----left----
     A   B
K0  A0  B0
K1  A1  B1
K2  A2  B2
---right----
     C   D
K0  C0  D0
K2  C2  D2
K3  C3  D3
ABCD
K0A0B0C0D0
K1A1B1NaNNaN
K2A2B2C2D2
1
Image(url="http://static.cyub.vip/images/202001/pandas.join.left.png")
1
left.join(right, how='outer') # 相当于pd.merge(left, right, left_index=True, right_index=True, how='outer')
ABCD
K0A0B0C0D0
K1A1B1NaNNaN
K2A2B2C2D2
K3NaNNaNC3D3
1
Image(url="http://static.cyub.vip/images/202001/pandas.join.outer.png")
1
left.join(right, how='inner') #相当于pd.merge(left, right, left_index=True, right_index=True, how='inner')
ABCD
K0A0B0C0D0
K2A2B2C2D2
1
Image(url="http://static.cyub.vip/images/202001/pandas.join.inner.png")

根据某一列进行join

left.join(right, on=key_or_keys)= pd.merge(left, right, left_on=key_or_keys, right_index=True,
how=‘left’, sort=False) // 使用left矩阵的key_or_keys列与right矩阵的index进行join

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'key': ['K0', 'K1', 'K0', 'K1']})

print('----left----')
print(left)

right = pd.DataFrame({'C': ['C0', 'C1'],
'D': ['D0', 'D1']},
index=['K0', 'K1'])

print('----right----')
print(right)


left.join(right, on='key') # 相当于pd.merge(left, right, left_on='key', right_index=True,how='left', sort=False);
----left----
    A   B key
0  A0  B0  K0
1  A1  B1  K1
2  A2  B2  K0
3  A3  B3  K1
----right----
     C   D
K0  C0  D0
K1  C1  D1
ABkeyCD
0A0B0K0C0D0
1A1B1K1C1D1
2A2B2K0C0D0
3A3B3K1C1D1
1
Image(url="http://static.cyub.vip/images/202001/pandas.join.key.left.png")
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
#### 多列的join

left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'key1': ['K0', 'K0', 'K1', 'K2'],
'key2': ['K0', 'K1', 'K0', 'K1']})

print('----left----')
print(left)

index = pd.MultiIndex.from_tuples([('K0', 'K0'), ('K1', 'K0'),
('K2', 'K0'), ('K3', 'K11')])


right = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']},
index=index)

print('----right----')
print(right)

left.join(right, on=['key1', 'key2'])
----left----
    A   B key1 key2
0  A0  B0   K0   K0
1  A1  B1   K0   K1
2  A2  B2   K1   K0
3  A3  B3   K2   K1
----right----
         C   D
K0 K0   C0  D0
K1 K0   C1  D1
K2 K0   C2  D2
K3 K11  C3  D3
ABkey1key2CD
0A0B0K0K0C0D0
1A1B1K0K1NaNNaN
2A2B2K1K0C1D1
3A3B3K2K1NaNNaN
1
Image(url="http://static.cyub.vip/images/202001/pandas.join.keys.left.png")
1
left.join(right, on=['key1', 'key2'], how='inner')
ABkey1key2CD
0A0B0K0K0C0D0
2A2B2K1K0C1D1
1
Image(url="http://static.cyub.vip/images/202001/pandas.join.keys.inner.png")

数据导入导出

从csv中导入数据

1
pd.read_csv('../dataset/game_daily_stats_20200127_20200202.csv', names=['id', '日期', '游戏id', '游戏名称', '国家', '国家码', '下载数', '下载用户数', '成功下载数', '成功下载用户数','安装数', '安装用户数'],na_filter = False)
id日期游戏id游戏名称国家国家码下载数下载用户数成功下载数成功下载用户数安装数安装用户数
075643162020-01-271Uphill Rush Water Park Racing俄罗斯RU111111
175643172020-01-271Uphill Rush Water Park Racing肯尼亚KE222200
275643182020-01-271Uphill Rush Water Park Racing刚果金CD110000
375643192020-01-271Uphill Rush Water Park Racing尼泊尔NP110000
475643202020-01-271Uphill Rush Water Park Racing索马里SO111111
.......................................
17988680104812020-02-02175Soccer Star 2022 World Legend: Football game赞比亚ZM220000
17988780104822020-02-02175Soccer Star 2022 World Legend: Football game尼日利亚NG112222
17988880104832020-02-02175Soccer Star 2022 World Legend: Football game埃及EG220000
17988980104842020-02-02175Soccer Star 2022 World Legend: Football game科特迪瓦CI332222
17989080104852020-02-02175Soccer Star 2022 World Legend: Football game约旦JO110000

179891 rows × 12 columns

导出数据到csv

1
df.to_csv('/tmp/pandas.csv', encoding="utf_8_sig")