Pandas merge two dataframes with different number of rows

Last update on August 19 2022 21:50:47 (UTC/GMT +8 hours)

Pandas Joining and merging DataFrame: Exercise-14 with Solution

Write a Pandas program to merge two given dataframes with different columns.

Test Data:

data1:
  key1 key2   P   Q
0   K0   K0  P0  Q0
1   K0   K1  P1  Q1
2   K1   K0  P2  Q2
3   K2   K1  P3  Q3
data2:
  key1 key2   R   S
0   K0   K0  R0  S0
1   K1   K0  R1  S1
2   K1   K0  R2  S2
3   K2   K0  R3  S3

Sample Solution:

Python Code :

import pandas as pd
data1 = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'P': ['P0', 'P1', 'P2', 'P3'],
                     'Q': ['Q0', 'Q1', 'Q2', 'Q3']}) 
data2 = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'R': ['R0', 'R1', 'R2', 'R3'],
                      'S': ['S0', 'S1', 'S2', 'S3']})
print("Original DataFrames:")
print(data1)
print("--------------------")
print(data2)
print("\nMerge two dataframes with different columns:")
result = pd.concat([data1,data2], axis=0, ignore_index=True)
print(result)

Sample Output:

Original DataFrames:
  key1 key2   P   Q
0   K0   K0  P0  Q0
1   K0   K1  P1  Q1
2   K1   K0  P2  Q2
3   K2   K1  P3  Q3
--------------------
  key1 key2   R   S
0   K0   K0  R0  S0
1   K1   K0  R1  S1
2   K1   K0  R2  S2
3   K2   K0  R3  S3

Merge two dataframes with different columns:
     P    Q    R    S key1 key2
0   P0   Q0  NaN  NaN   K0   K0
1   P1   Q1  NaN  NaN   K0   K1
2   P2   Q2  NaN  NaN   K1   K0
3   P3   Q3  NaN  NaN   K2   K1
4  NaN  NaN   R0   S0   K0   K0
5  NaN  NaN   R1   S1   K1   K0
6  NaN  NaN   R2   S2   K1   K0
7  NaN  NaN   R3   S3   K2   K0

Python Code Editor:

Have another way to solve this solution? Contribute your code (and comments) through Disqus.

Previous: Write a Pandas program to combine the columns of two potentially differently-indexed DataFrames into a single result DataFrame.
Next: Write a Pandas program to Combine two DataFrame objects by filling null values in one DataFrame with non-null values from other DataFrame.

Python: Tips of the Day

Using yield to create a simple iterator:

>>> def foo(lst):
>>>   for x in lst:
>>>       yield x
>>>       yield x*2

>>> a = [1, 3]
>>> list(foo(a))
[1, 2, 3, 6]

Topics covered in this section:

  • Join and merge options available
  • Data used in this example
  • Concatenate examples
    • Concatenate (stack) by rows
    • Concatenate (stack) by columns
  • Merge examples
    • Merge and keep matching observations only
    • Merge and keep all observations
    • Merge and keep matching observations and all observations from the left data frame
    • Merge and keep matching observations and all observations from the right data frame
    • Merging on different columns with unique values

There are a few methods you can use to combine data frames in Python. These methods are .concat(), and .merge(). Both of them are apart of the Pandas library. So let’s import Pandas!

pandas.concat() method

The pandas.concat() method combines two data frames by stacking them on top of each other. If one of the data frames does not contain a variable column or variable rows, observations in that data frame will be filled with NaN values.

pandas.merge() Method

The pandas.merge() method joins two data frames by a “key” variable that contains unique values.

With pandas.merge(), you can only combine 2 data frames at a time. If you have more than 2 data frames to merge, you will have to use this method multiple times. Let’s go over pandas.merge() and some of the available arguments to pass. Here is the general structure and the recommended bare minimum arguments to pass. This method has more argument to pass if desired.

pandas.merge(left_data_frame, right_data_frame, on= , how= )

  • left is one of the data frames
  • right is the other data frame
  • on is the variable, a.k.a the column, you want to merge on. This variable has to have the same name in both data frames.
    • If the data frames has different column names for the merge variables you can use left_on and right_on.
      • left_on is the variable name in the left data frame to be merged on
      • right_on is the variable name in the left data frame to be merged on
  • how is where you pass the options of merging. These include:
    • “inner”, where only the observations with matching values based on the “on” argument that is passed are kept.
    • “left”, where all observations will be kept from the data frame in the left argument regardless if there is matching values with the data frame in the right argument. Observations that do not have a matching value based on the on argument in the “right” data frame will be discarded.
    • “right”, where all observations will be kept from the data frame in the right argument regardless if there is matching values with the data frame in the left argument. Observations that do not have a matching value based on the on argument in the “left” data frame will be discarded.
    • “outer”, all observations will be kept from both data frames.

Data used for Examples

Normally, we would use a real data set for our examples. However, for this section we will create a few data sets so it will be easier to demonstrate what is occurring.

Creating 4 data frames

dataframe1 = pd.DataFrame({'ID': ['0011','0013','0014','0016','0017'], 
                           'First Name': ['Joseph', 'Mike', 'Jordan', 'Steven', 'Susan']})
dataframe2 = pd.DataFrame({'ID': ['0010','0011','0013','0014','0017'], 
                           'Last Name': ['Gordan', 'Johnson', 'Might' , 'Jackson', 'Shack']})
dataframe3 = pd.DataFrame({'ID': ['0020','0022','0025'],
                           'First Name': ['Adam', 'Jackie', 'Sue']})
dataframe4 = pd.DataFrame({'Key': ['0020','0022','0025'],
                           'Scores': [95, 90, 80]})

First NameID
0 Joseph 0011
1 Mike 0013
2 Jordan 0014
3 Steven 0016
4 Susan 0017
IDLast Name
0 0010 Gordan
1 0011 Johnson
2 0013 Might
3 0014 Jackson
4 0017 Shack
First NameID
0 Adam 0020
1 Jackie 0022
2 Sue 0025
KeyScores
0 0020 95
1 0022 90
2 0025 80

Concatenate Examples

With pandas.concat(), you pass the data frames that you want to join as a list. You can pass as many as you need to join. Remember, this method joins the data frames by rows (stacking them on top of each other) by default. To join them by columns (put the data frames side to side), you pass the “axis= 1” argument.

Concatenation Example 1. Joining data frames by rows (stacking one on top of another)

If you were to join data frames by rows with an uneven number of columns, i.e. one data frame has more columns than the other, the data frame without the columns will have “NaN” values filled in where data is missing. This will be shown in example 2.

new_concat_ROWS_dataframe = pd.concat([dataframe1, dataframe3])

new_concat_ROWS_dataframe

First NameID
0 Joseph 0011
1 Mike 0013
2 Jordan 0014
3 Steven 0016
4 Susan 0017
0 Adam 0020
1 Jackie 0022
2 Sue 0025

Look at the index, notice how Python kept the original index from both data frames and just stacked them on top of each other as well. If you wish for a new index starting at 0, pass the “ignore_index” argument as “true”.

new_concat_ROWS_dataframe = pd.concat([dataframe1, dataframe3], ignore_index= "true")

new_concat_ROWS_dataframe

First NameID
0 Joseph 0011
1 Mike 0013
2 Jordan 0014
3 Steven 0016
4 Susan 0017
5 Adam 0020
6 Jackie 0022
7 Sue 0025
Concatenation Example 2. Joining data frames by columns (stacking them side by side)

Passing the “axis= 1” argument will join the data frames by columns, placing the data frames next to each other. If one data frame has more rows than the other, the data frame that has less rows will be filled with “NaN” values where the extra rows will occur.

new_concat_COL_dataframe = pd.concat([dataframe1, dataframe3], axis=1)

new_concat_COL_dataframe

First NameIDFirst NameID
0 Joseph 0011 Adam 0020
1 Mike 0013 Jackie 0022
2 Jordan 0014 Sue 0025
3 Steven 0016 NaN Nan
4 Susan 0017 NaN Nan

In this situation, it’s clear that stacking the data frames on top of each other (joining by rows) is what we want. The second example would make for a horrible data frame.

Merge Examples

Merge Example 1. Keep matching observations only

Passing the “how = ‘inner'” argument will keep observations that have a match on the merge variable in both data frames.

new_merged_dataframe = pd.merge(dataframe1, dataframe2, on= "ID", how= "inner")

new_merged_dataframe

First NameIDLast Name
0 Joseph 0011 Johnson
1 Mike 0013 Might
2 Jordan 0014 Jackson
3 Susan 0017 Shack
Merge Example 2. Keep all observations

Passing the “how = ‘outer'” argument will keep all observations from both data frames.

new_OUTER_merged_dataframe = pd.merge(dataframe1, dataframe2, on= "ID", how= "outer")

new_OUTER_merged_dataframe

First NameIDLast Name
0 Joseph 0011 Johnson
1 Mike 0013 Might
2 Jordan 0014 Jackson
3 Steven 0016 NaN
4 Susan 0017 Shack
5 Nan 0010 Gordan
Merge Example 3. Keeping match observations and all observations from the left data frame

Passing the “how= ‘left'” argument will keep all observations in the data frame that is being passed in the left argument regardless if there is a matching value in the data frame that is being passed in the right argument. Rows without a matching merge variable from the right data frame will be discarded.

new_LEFT_merged_dataframe = pd.merge(dataframe1, dataframe2, on= "ID", how= "left")

new_LEFT_merged_dataframe

First NameIDLast Name
0 Joseph 0011 Johnson
1 Mike 0013 Might
2 Jordan 0014 Jackson
3 Steven 0016 NaN
4 Susan 0017 Shack
Merge Example 4. Keeping match observations and all observations from the right data frame

Passing the “how= ‘right'” argument will keep all observations in the data frame that is being passed in the right argument regardless if there is a matching value in the data frame that is being passed in the left argument. Rows without a matching merge variable from the left data frame will be discarded.

new_RIGHT_merged_dataframe = pd.merge(dataframe1, dataframe2, on= "ID", how= "right")

new_RIGHT_merged_dataframe

First NameIDLast Name
0 Joseph 0011 Johnson
1 Mike 0013 Might
2 Jordan 0014 Jackson
3 Susan 0017 Shack
4 NaN 0010 Gordan
Merge Example 5. Merging on different columns with unique values

If the two data frames each contain the unique identifier, but are stored under different columns, you can merge using the left_on and right_on arguments. If going this route, you have to pass both arguments. The general structure is left_on = column_name_with_unique_identifier, right_on = column_name_with_unique_identifier.

Note: Merging data frames this way will keep both columns stated in the left_on and right_on arguments.

merged_dataframe = pd.merge(dataframe3, dataframe4, left_on= "ID", right_on= "Key", how= "inner")

merged_dataframe

First NameIDKeyScores
0 Adam 0020 0020 95
1 Jackie 0022 0022 90
2 Sue 0025 0025 80

There is more useful information and advanced concatenation/ merging methods on the official documentation page.

How do I merge two DataFrames with different number of rows?

The concat() function can be used to concatenate two Dataframes by adding the rows of one to the other. The merge() function is equivalent to the SQL JOIN clause. 'left', 'right' and 'inner' joins are all possible.

Can you merge DataFrames of different lengths Pandas?

It can be done using the merge() method.

How do I merge two DataFrames in Pandas based on rows?

The concat() function in pandas is used to append either columns or rows from one DataFrame to another. The concat() function does all the heavy lifting of performing concatenation operations along an axis while performing optional set logic (union or intersection) of the indexes (if any) on the other axes.

How do I merge two DataFrames with different indexes?

How to Merge Two Pandas DataFrames on Index.
Use join: By default, this performs a left join. df1. join(df2).
Use merge. By default, this performs an inner join. pd. merge(df1, df2, left_index=True, right_index=True).
Use concat. By default, this performs an outer join..