Watch out for duplicate column names in pandas DataFrames

Pandas is a powerful data analysis library. Most of the time it is easy to use and behaves as a good panda should, but sometimes you need to be aware of a few details and their implications to stay on its friendly side. For example, a DataFrame

  • a) allows columns with duplicate names, and
  • b) when you select columns by such a name, returns all columns with that name, once for every time the name appears in your selection.

This happens silently and may quickly and unexpectedly increase the size of your DataFrame quite significantly!
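Both behaviours can be seen in a few lines. This is only a minimal sketch: the column names "a" and "b" and the values are made up for illustration, and it assumes a reasonably recent pandas version.

```python
import pandas as pd

# Two columns deliberately share the name "a" (illustrative data).
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["a", "a", "b"])

# a) pandas accepts the duplicate name without complaint.
print(df.columns.tolist())

# b) Selecting "a" once already returns a DataFrame holding both "a" columns ...
single = df["a"]
print(type(single).__name__, single.shape)

# ... and mentioning "a" twice returns both of them for each mention.
double = df[["a", "a"]]
print(double.shape)
```

Note that `df["a"]` is no longer a Series here, which is often the first visible symptom of a duplicated name.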

Consider, as an example, a dataset containing two feature groups (with duplicate feature names) and one target (e.g., for learning a linear regression model).
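Such a dataset could be sketched as follows. The concrete column names and values are assumptions, chosen only to match the names used in this post (group1_a, features_from_group1); the original data may have looked different.

```python
import pandas as pd

# Hypothetical dataset: two feature groups plus a target.
# Note that the name "group1_a" appears twice.
columns = ["group1_a", "group1_b", "group1_a", "group2_a", "group2_b", "target"]
data = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0, 5.0, 0.5],
     [6.0, 7.0, 8.0, 9.0, 10.0, 1.5]],
    columns=columns,
)

# Collect the names of all group1 features by prefix.
# Because the name is duplicated, "group1_a" ends up in the list twice.
features_from_group1 = [c for c in data.columns if c.startswith("group1")]
print(features_from_group1)
```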

For this example, one would probably expect the feature DataFrame to contain only the features from group1, with one column for each of them.

However, due to the duplicate column names, our selected features (features_from_group1) contain the group1_a column name twice. Since pandas returns all matching columns for each column name given in features_from_group1, we get a DataFrame with “too many” columns.
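To make the difference between expectation and result concrete, here is a small sketch. The data and column names are again hypothetical; only the names group1_a and features_from_group1 come from the text above.

```python
import pandas as pd

# Hypothetical dataset with the name "group1_a" used for two columns.
columns = ["group1_a", "group1_b", "group1_a", "group2_a", "target"]
data = pd.DataFrame([[1, 2, 3, 4, 0], [5, 6, 7, 8, 1]], columns=columns)

# The name list contains "group1_a" twice.
features_from_group1 = [c for c in data.columns if c.startswith("group1")]

# Select the features by name.
features = data[features_from_group1]

# One column per selected name is what one would expect ...
expected_width = len(features_from_group1)
# ... but every mention of "group1_a" pulls in both columns with that name.
actual_width = features.shape[1]

print(expected_width, actual_width)
print(features.columns.tolist())
```

With only one duplicated name the growth is small, but each additional duplicate multiplies: a name that occurs k times in the data and k times in the selection contributes k × k columns.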

This makes sense, in a way. However, if you are not aware of it, you can quickly increase the size of your DataFrame (or, in the example above, the feature set to learn from) by a large margin without noticing. Depending on the number of duplicate column names, this may blow up your memory, keep your machine learning methods from learning efficiently (or from learning at all), or cause your HDD to fill up quickly when saving models 😉
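One possible way to guard against this is to check for duplicate names before selecting. The helper below is a hypothetical sketch (select_unique is not a pandas function); it keeps only the first column for each duplicated name, which silently discards the others, so it only fits cases where the duplicates are truly redundant.

```python
import pandas as pd


def select_unique(df, names):
    """Hypothetical helper: select columns by name, each at most once."""
    if not df.columns.is_unique:
        # Keep only the first column for every duplicated name.
        df = df.loc[:, ~df.columns.duplicated()]
    # Remove repeated names from the selection while preserving order.
    unique_names = list(dict.fromkeys(names))
    return df[unique_names]


# Illustrative data with the name "group1_a" used twice.
data = pd.DataFrame(
    [[1, 2, 3, 4, 0]],
    columns=["group1_a", "group1_b", "group1_a", "group2_a", "target"],
)
features = select_unique(data, ["group1_a", "group1_b", "group1_a"])
print(features.columns.tolist())
```

If the duplicated columns actually hold different data, renaming them to distinct names is probably the safer choice than dropping them.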
