How to merge data in R using R merge, dplyr, or data.table
R has a number of quick, elegant ways to join data frames by a common column. I’d like to show you three of them:
· base R’s merge() function
· dplyr’s join family of functions
· data.table’s bracket syntax
Get and import the data
For this example I’ll use one of my favorite demo data sets—flight delay times from the U.S. Bureau of Transportation Statistics. If you want to follow along, head to
Or, you can download these two data sets—plus my R code in a single file and a PowerPoint explaining different types of data merges—here:
To read in the file with base R, I’d first unzip the flight delay file and then import both flight delay data and the code lookup file with read.csv(). If you’re running the code, the delay file you downloaded will likely have a different name than in the code below. Also, note the lookup file’s unusual .csv_ extension.
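# Unzip the downloaded archive (your file name will likely differ)
unzip("673598238_T_ONTIME_REPORTING.zip")
# Import the flight delay data
mydf <- read.csv("673598238_T_ONTIME_REPORTING.csv",
                 sep = ",", quote = "\"")
# Import the carrier code lookup table (note the .csv_ extension)
mylookup <- read.csv("L_UNIQUE_CARRIERS.csv_",
                     quote = "\"", sep = ",")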
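The unusual .csv_ extension does not need special handling, because read.csv() parses the file by its contents rather than its name. A minimal sketch, assuming a small made-up lookup file (the file contents and the lookup_demo name are hypothetical, not part of the downloaded data):

```r
# Hypothetical demo file; the ".csv_" suffix mimics the lookup file's extension
tmp <- tempfile(fileext = ".csv_")
writeLines(c("Code,Description",
             "\"AA\",\"American Airlines Inc.\"",
             "\"DL\",\"Delta Air Lines Inc.\""), tmp)

# read.csv() ignores the extension and reads the comma-separated contents
lookup_demo <- read.csv(tmp, sep = ",", quote = "\"",
                        stringsAsFactors = FALSE)
str(lookup_demo)

# Clean up the temporary file
unlink(tmp)
```

The sep and quote values shown here are already the defaults for read.csv(), so spelling them out mainly documents intent.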
Next, I’ll take a peek at both files with head():
head(mydf)
     FL_DATE OP_UNIQUE_CARRIER ORIGIN DEST DEP_DELAY_NEW  X
1 2019-08-01                DL    ATL  DFW            31 NA
2 2019-08-01                DL    DFW  ATL             0 NA
3 2019-08-01                DL    IAH  ATL            40 NA
4 2019-08-01                DL    PDX  SLC             0 NA
5 2019-08-01                DL    SLC  PDX             0 NA
6 2019-08-01                DL    DTW  ATL            10 NA
head(mylookup)
  Code                                          Description
1  02Q                                        Titan Airways
2  04Q                                   Tradewind Aviation
3  05Q                                  Comlux Aviation, AG
4  06Q                        Master Top Linhas Aereas Ltd.
5  07Q                                  Flair Airlines Ltd.
6  09Q Swift Air, LLC d/b/a Eastern Air Lines d/b/a Eastern
Merges with base R
The mydf delay data frame only has airline information by code. I’d like to add a column with the airline names from mylookup. One base R way to do this is with the merge() function, using the basic syntax merge(df1, df2). The order of data frame 1 and data frame 2 doesn't matter, but whichever one is first is considered x and the second one is y.
If the columns you want to join by don’t have the same name, you need to tell merge which columns you want to join by: by.x for the x data frame column name, and by.y for the y one, such as merge(df1, df2, by.x = "df1ColName", by.y = "df2ColName").
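As a small illustration of by.x and by.y, here is a sketch using two made-up data frames (flights_demo and lookup_demo are hypothetical stand-ins for the real files, with invented rows). With no all arguments, merge() keeps only rows whose keys appear in both data frames, and the key column takes its name from the x data frame:

```r
# Hypothetical toy data standing in for the delay and lookup files
flights_demo <- data.frame(OP_UNIQUE_CARRIER = c("DL", "AA", "ZZ"),
                           DEP_DELAY_NEW = c(31, 0, 12),
                           stringsAsFactors = FALSE)
lookup_demo <- data.frame(Code = c("AA", "DL", "PA"),
                          Description = c("American Airlines Inc.",
                                          "Delta Air Lines Inc.",
                                          "Pan American World Airways"),
                          stringsAsFactors = FALSE)

# Join on differently named key columns; default is matching rows only
inner_demo <- merge(flights_demo, lookup_demo,
                    by.x = "OP_UNIQUE_CARRIER", by.y = "Code")
print(inner_demo)
```

The "ZZ" flight and the "PA" lookup row both drop out here because neither has a partner in the other data frame.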
You can also tell merge whether you want all rows, including ones without a match, or just rows that match, with the arguments all.x and all.y. In this case, I’d like all the rows from the delay data; if there’s no airline code in the lookup table, I still want the information. But I don’t need rows from the lookup table that aren’t in the delay data (there are some codes for old airlines that don’t fly anymore in there). So, all.x equals TRUE but all.y equals FALSE. Here's the code:
# Keep every delay row (all.x), drop lookup rows with no matching flight (all.y)
joined_df <- merge(mydf, mylookup, by.x = "OP_UNIQUE_CARRIER",
                   by.y = "Code", all.x = TRUE, all.y = FALSE)
The new joined data frame includes a column called Description with the name of the airline based on the carrier code:
head(joined_df)
  OP_UNIQUE_CARRIER    FL_DATE ORIGIN DEST DEP_DELAY_NEW  X       Description
1                9E 2019-08-12    JFK  SYR             0 NA Endeavor Air Inc.
2                9E 2019-08-12    TYS  DTW             0 NA Endeavor Air Inc.
3                9E 2019-08-12    ORF  LGA             0 NA Endeavor Air Inc.
4                9E 2019-08-13    IAH  MSP             6 NA Endeavor Air Inc.
5                9E 2019-08-12    DTW  JFK            58 NA Endeavor Air Inc.
6                9E 2019-08-12    SYR  JFK             0 NA Endeavor Air Inc.
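To see what all.x and all.y change, it can help to compare row counts on a tiny example. This is only a sketch with invented data (x_demo, y_demo, and their rows are hypothetical): "ZZ" exists only in the x data frame and "PA" only in the y data frame.

```r
# Hypothetical toy data: "ZZ" has no lookup entry, "PA" has no flights
x_demo <- data.frame(carrier = c("DL", "AA", "ZZ"),
                     delay = c(31, 0, 12),
                     stringsAsFactors = FALSE)
y_demo <- data.frame(Code = c("AA", "DL", "PA"),
                     Description = c("American Airlines Inc.",
                                     "Delta Air Lines Inc.",
                                     "Pan American World Airways"),
                     stringsAsFactors = FALSE)

# Matching rows only
inner_demo <- merge(x_demo, y_demo, by.x = "carrier", by.y = "Code")
# All x rows; unmatched ones get NA in the y columns
left_demo <- merge(x_demo, y_demo, by.x = "carrier", by.y = "Code",
                   all.x = TRUE, all.y = FALSE)
# All rows from both sides
full_demo <- merge(x_demo, y_demo, by.x = "carrier", by.y = "Code",
                   all = TRUE)

c(inner = nrow(inner_demo), left = nrow(left_demo), full = nrow(full_demo))
```

In the left-style merge, the "ZZ" row survives with NA as its Description, which is the behavior wanted for delay rows whose carrier code is missing from the lookup table.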
Joins with dplyr
The dplyr package uses SQL database syntax for its join functions. A left join means: Include everything on the left (what was the x data frame in merge()) and all rows that match from the right (y) data frame. If the join columns have the same name, all you need is left_join(x, y). If they don’t have the same name, you need a by argument, such as left_join(x, y, by = c("df1ColName" = "df2ColName")).
Note the syntax for by: It’s a named vector, with both the left and right column names in quotation marks.
Update: Starting with dplyr version 1.1.0 (on CRAN as of January 29, 2023), dplyr joins have an additional by syntax using join_by():
left_join(x, y, by = join_by(df1ColName == df2ColName))
The new join_by() helper function uses unquoted column names and the == boolean operator, which package authors say makes more sense in an R context than c("col1" = "col2"), since = is meant for assigning a value to a variable, not testing for equality.
A left join keeps all rows in the left data frame and only matching rows from the right data frame.
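The two by styles should produce the same result. A minimal sketch, assuming dplyr 1.1.0 or later is installed and using made-up data frames (flights_demo and lookup_demo are hypothetical):

```r
library(dplyr)

# Hypothetical toy data standing in for the delay and lookup files
flights_demo <- data.frame(OP_UNIQUE_CARRIER = c("DL", "AA", "ZZ"),
                           DEP_DELAY_NEW = c(31, 0, 12),
                           stringsAsFactors = FALSE)
lookup_demo <- data.frame(Code = c("AA", "DL", "PA"),
                          Description = c("American Airlines Inc.",
                                          "Delta Air Lines Inc.",
                                          "Pan American World Airways"),
                          stringsAsFactors = FALSE)

# Older named-vector syntax: quoted names, single =
old_style <- left_join(flights_demo, lookup_demo,
                       by = c("OP_UNIQUE_CARRIER" = "Code"))

# join_by() syntax: unquoted names, ==
new_style <- left_join(flights_demo, lookup_demo,
                       by = join_by(OP_UNIQUE_CARRIER == Code))

# Compare the two results
all.equal(old_style, new_style)
```

Both results keep the three rows of flights_demo in their original order, with NA in Description for the unmatched "ZZ" carrier.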
The code to import and merge both data sets using left_join() is below. It starts by loading the dplyr and readr packages, and then reads in the two files with read_csv(). When using read_csv(), I don’t need to unzip the file first.
library(dplyr)
library(readr)
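# read_csv() can read the zipped file directly
mytibble <- read_csv("673598238_T_ONTIME_REPORTING.zip")
mylookup_tibble <- read_csv("L_UNIQUE_CARRIERS.csv_")
# Keep every flight row and add the matching airline name
joined_tibble <- left_join(mytibble, mylookup_tibble,
                           by = join_by(OP_UNIQUE_CARRIER == Code))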
Note that dplyr's older by syntax without join_by() still works:
joined_tibble <- left_join(mytibble, mylookup_tibble,
                           by = c("OP_UNIQUE_CARRIER" = "Code"))
read_csv() creates tibbles, which are a type of data frame with some extra features. left_join() merges the two. Take a look at the syntax: In this case, order matters. left_join() means include all rows on the left, or first, data set, but only rows that match from the second one. And, because I need to join by two differently named columns, I included a by argument.
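A quick way to convince yourself that order matters is to swap the arguments on a toy example. This sketch uses invented data (flights_demo and lookup_demo are hypothetical) in which "ZZ" appears only in the flights and "AA" and "PA" appear only in the lookup table:

```r
library(dplyr)

# Hypothetical toy data: two DL flights plus one carrier missing from the lookup
flights_demo <- data.frame(OP_UNIQUE_CARRIER = c("DL", "DL", "ZZ"),
                           DEP_DELAY_NEW = c(31, 0, 12),
                           stringsAsFactors = FALSE)
lookup_demo <- data.frame(Code = c("AA", "DL", "PA"),
                          Description = c("American Airlines Inc.",
                                          "Delta Air Lines Inc.",
                                          "Pan American World Airways"),
                          stringsAsFactors = FALSE)

# Flights on the left: one row per flight, "ZZ" kept with NA Description
flights_first <- left_join(flights_demo, lookup_demo,
                           by = c("OP_UNIQUE_CARRIER" = "Code"))

# Lookup on the left: one row per code/flight pair, "ZZ" dropped
lookup_first <- left_join(lookup_demo, flights_demo,
                          by = c("Code" = "OP_UNIQUE_CARRIER"))

nrow(flights_first)
nrow(lookup_first)
```

With the flights on the left, every flight is kept. With the lookup table on the left, every code is kept instead, including carriers that have no flights in the data.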
The join_by() syntax and the named-vector syntax produce the same join. The rest of this article uses dplyr's original named-vector syntax, which works in both older and newer versions of dplyr.
We can look at the structure of the result with dplyr’s glimpse() function, which is another way to see the top few items of a data frame:
glimpse(joined_tibble)
Observations: 658,461
Variables: 7
$ FL_DATE           <date> 2019-08-01, 2019-08-01, 2019-08-01, 2019-08-01, 2019-08-01…
$ OP_UNIQUE_CARRIER <chr> "DL", "DL", "DL", "DL", "DL", "DL", "DL", "DL", "DL", "DL",…
$ ORIGIN            <chr> "ATL", "DFW", "IAH", "PDX", "SLC", "DTW", "ATL", "MSP", "JF…
$ DEST              <chr> "DFW", "ATL", "ATL", "SLC", "PDX", "ATL", "DTW", "JFK", "MS…
$ DEP_DELAY_NEW     <dbl> 31, 0, 40, 0, 0, 10, 0, 22, 0, 0, 0, 17, 5, 2, 0, 0, 8, 0, …
$ X6                <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ Description       <chr> "Delta Air Lines Inc.", "Delta Air Lines Inc.", "Delta Air …
This joined data set now has a new column with the name of the airline. If you run a version of this code yourself, you’ll probably notice that dplyr is way faster than base R.
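If you would rather measure the difference than take it on faith, system.time() gives a rough comparison. The sketch below runs both joins on synthetic data (the data, the sizes, and the names flights_demo and lookup_demo are all made up for illustration); actual timings depend on your machine, your R and dplyr versions, and the shape of your data, so treat the numbers as indicative only.

```r
library(dplyr)

# Synthetic data: 200,000 rows keyed by 26 made-up carrier codes
set.seed(42)
n <- 200000
flights_demo <- data.frame(Code = sample(LETTERS, n, replace = TRUE),
                           delay = runif(n),
                           stringsAsFactors = FALSE)
lookup_demo <- data.frame(Code = LETTERS,
                          Description = paste("Airline", LETTERS),
                          stringsAsFactors = FALSE)

# Time the base R left-style merge
base_time <- system.time(
  base_result <- merge(flights_demo, lookup_demo, by = "Code", all.x = TRUE)
)

# Time the equivalent dplyr left join
dplyr_time <- system.time(
  dplyr_result <- left_join(flights_demo, lookup_demo, by = "Code")
)

# Elapsed seconds for each approach
print(base_time["elapsed"])
print(dplyr_time["elapsed"])
```

One difference to keep in mind when comparing the outputs: merge() sorts the result by the key column by default, while left_join() keeps the row order of the left data frame.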