Pandas Joining and merging DataFrame: Exercise-14 with SolutionWrite a Pandas program to merge two given dataframes with different columns. Show
Test Data: data1: key1 key2 P Q 0 K0 K0 P0 Q0 1 K0 K1 P1 Q1 2 K1 K0 P2 Q2 3 K2 K1 P3 Q3 data2: key1 key2 R S 0 K0 K0 R0 S0 1 K1 K0 R1 S1 2 K1 K0 R2 S2 3 K2 K0 R3 S3 Sample Solution: Python Code :
Sample Output: Original DataFrames: key1 key2 P Q 0 K0 K0 P0 Q0 1 K0 K1 P1 Q1 2 K1 K0 P2 Q2 3 K2 K1 P3 Q3 -------------------- key1 key2 R S 0 K0 K0 R0 S0 1 K1 K0 R1 S1 2 K1 K0 R2 S2 3 K2 K0 R3 S3 Merge two dataframes with different columns: P Q R S key1 key2 0 P0 Q0 NaN NaN K0 K0 1 P1 Q1 NaN NaN K0 K1 2 P2 Q2 NaN NaN K1 K0 3 P3 Q3 NaN NaN K2 K1 4 NaN NaN R0 S0 K0 K0 5 NaN NaN R1 S1 K1 K0 6 NaN NaN R2 S2 K1 K0 7 NaN NaN R3 S3 K2 K0 Python Code Editor: Have another way to solve this solution? Contribute your code (and comments) through Disqus. Previous: Write a Pandas program to combine the columns of two potentially differently-indexed DataFrames into a single result DataFrame. Python: Tips of the DayUsing yield to create a simple iterator: >>> def foo(lst): >>> for x in lst: >>> yield x >>> yield x*2 >>> a = [1, 3] >>> list(foo(a)) [1, 2, 3, 6] Topics covered in this section: There are a few methods you can use to combine data frames in Python. These methods are .concat(), and .merge(). Both of them are apart of the Pandas library. So let’s import Pandas! pandas.concat() methodThe pandas.concat() method combines two data frames by stacking them on top of each other. If one of the data frames does not contain a variable column or variable rows, observations in that data frame will be filled with NaN values. pandas.merge() MethodThe pandas.merge() method joins two data frames by a “key” variable that contains unique values. With pandas.merge(), you can only combine 2 data frames at a time. If you have more than 2 data frames to merge, you will have to use this method multiple times. Let’s go over pandas.merge() and some of the available arguments to pass. Here is the general structure and the recommended bare minimum arguments to pass. This method has more argument to pass if desired. pandas.merge(left_data_frame, right_data_frame, on= , how= )
Data used for ExamplesNormally, we would use a real data set for our examples. However, for this section we will create a few data sets so it will be easier to demonstrate what is occurring. Creating 4 data frames dataframe1 = pd.DataFrame({'ID': ['0011','0013','0014','0016','0017'], 'First Name': ['Joseph', 'Mike', 'Jordan', 'Steven', 'Susan']}) dataframe2 = pd.DataFrame({'ID': ['0010','0011','0013','0014','0017'], 'Last Name': ['Gordan', 'Johnson', 'Might' , 'Jackson', 'Shack']}) dataframe3 = pd.DataFrame({'ID': ['0020','0022','0025'], 'First Name': ['Adam', 'Jackie', 'Sue']}) dataframe4 = pd.DataFrame({'Key': ['0020','0022','0025'], 'Scores': [95, 90, 80]})
Concatenate ExamplesWith pandas.concat(), you pass the data frames that you want to join as a list. You can pass as many as you need to join. Remember, this method joins the data frames by rows (stacking them on top of each other) by default. To join them by columns (put the data frames side to side), you pass the “axis= 1” argument. Concatenation Example 1. Joining data frames by rows (stacking one on top of another)If you were to join data frames by rows with an uneven number of columns, i.e. one data frame has more columns than the other, the data frame without the columns will have “NaN” values filled in where data is missing. This will be shown in example 2. new_concat_ROWS_dataframe = pd.concat([dataframe1, dataframe3]) new_concat_ROWS_dataframe
Look at the index, notice how Python kept the original index from both data frames and just stacked them on top of each other as well. If you wish for a new index starting at 0, pass the “ignore_index” argument as “true”. new_concat_ROWS_dataframe = pd.concat([dataframe1, dataframe3], ignore_index= "true") new_concat_ROWS_dataframe
Passing the “axis= 1” argument will join the data frames by columns, placing the data frames next to each other. If one data frame has more rows than the other, the data frame that has less rows will be filled with “NaN” values where the extra rows will occur. new_concat_COL_dataframe = pd.concat([dataframe1, dataframe3], axis=1) new_concat_COL_dataframe
In this situation, it’s clear that stacking the data frames on top of each other (joining by rows) is what we want. The second example would make for a horrible data frame. Merge ExamplesMerge Example 1. Keep matching observations onlyPassing the “how = ‘inner'” argument will keep observations that have a match on the merge variable in both data frames. new_merged_dataframe = pd.merge(dataframe1, dataframe2, on= "ID", how= "inner") new_merged_dataframe
Passing the “how = ‘outer'” argument will keep all observations from both data frames. new_OUTER_merged_dataframe = pd.merge(dataframe1, dataframe2, on= "ID", how= "outer") new_OUTER_merged_dataframe
Passing the “how= ‘left'” argument will keep all observations in the data frame that is being passed in the left argument regardless if there is a matching value in the data frame that is being passed in the right argument. Rows without a matching merge variable from the right data frame will be discarded. new_LEFT_merged_dataframe = pd.merge(dataframe1, dataframe2, on= "ID", how= "left") new_LEFT_merged_dataframe
Passing the “how= ‘right'” argument will keep all observations in the data frame that is being passed in the right argument regardless if there is a matching value in the data frame that is being passed in the left argument. Rows without a matching merge variable from the left data frame will be discarded. new_RIGHT_merged_dataframe = pd.merge(dataframe1, dataframe2, on= "ID", how= "right") new_RIGHT_merged_dataframe
If the two data frames each contain the unique identifier, but are stored under different columns, you can merge using the left_on and right_on arguments. If going this route, you have to pass both arguments. The general structure is left_on = column_name_with_unique_identifier, right_on = column_name_with_unique_identifier. Note: Merging data frames this way will keep both columns stated in the left_on and right_on arguments. merged_dataframe = pd.merge(dataframe3, dataframe4, left_on= "ID", right_on= "Key", how= "inner") merged_dataframe
There is more useful information and advanced concatenation/ merging methods on the official documentation page. How do I merge two DataFrames with different number of rows?The concat() function can be used to concatenate two Dataframes by adding the rows of one to the other. The merge() function is equivalent to the SQL JOIN clause. 'left', 'right' and 'inner' joins are all possible.
Can you merge DataFrames of different lengths Pandas?It can be done using the merge() method.
How do I merge two DataFrames in Pandas based on rows?The concat() function in pandas is used to append either columns or rows from one DataFrame to another. The concat() function does all the heavy lifting of performing concatenation operations along an axis while performing optional set logic (union or intersection) of the indexes (if any) on the other axes.
How do I merge two DataFrames with different indexes?How to Merge Two Pandas DataFrames on Index. Use join: By default, this performs a left join. df1. join(df2). Use merge. By default, this performs an inner join. pd. merge(df1, df2, left_index=True, right_index=True). Use concat. By default, this performs an outer join.. |