More

Joining two datasets based on matching values in multiple fields

Joining two datasets based on matching values in multiple fields


I am trying to find a way to join CSV data to a vector layer only when they both share matching values in multiple fields, rather than just one field.

This is a problem I'm running into with multiple projects, but I'll just describe one here as an example:

The data I am working with is as follows:

  • a shapefile layer containing a list of all U.S. towns/cities with population over 10,000.
  • 49 CSV files (one for each of the lower 48 states + 1 for D.C.) from the Bureau of Justice Statistics (BJS), containing data on violent crime rates by city.

I am trying to join the data from the BJS .csv files onto the "U.S. Towns and Cities" layer so that each town/city feature contains data for violent crime in that city.

The problem is that the the "Joins" interface will only let me select one field to join on. So for instance, when go to "US Towns & Cities" --> Properties --> Joins and click "+" to create a new join, this is what I have (as of QGIS version 2.12):

So here I am joining the BJS data for Georgia to the "U.S Towns and Cities" layer by city name… now assume that I've done the same for the BJS Florida data.

Now, there are several ways that this causes trouble. The most significant problem is that there are name collisions for cities in different states. For instance, the "U.S. towns & cities" map layer contains 8 cities named "Gainesville", ALL of which now have the BJS data for both Gainesville, Georgia AND Gainesville, Florida associated with them:

I need to only join the crime data for for a city when both the city name AND the state name coincide so that Gainesville, Florida's data only gets associated with Gainesville, Florida and not Gainesville, Georgia… How do I do this?

Also, another problem caused by the above approach, which isn't necessarily causing incorrect results, but seems "wrong" is that every single city now contains nearly 1000 NULL fields (~20 of them for each of the 49 CSV files)… I feel like there must be a better way to do this.


One way would be to concatenate the two id columns to a new virtual field with the field calculator for each table. Then join on the virtual fields.

Personally I would go for a db solution instead like Spatialite or PostGIS where you have more control of the join.