Shell-fu - join

Let's say you have multiple two-fields csv files that you want to join on their first field. For example:

a.csv     b.csv
-----     -----
a,1       b,10
b,2       c,20

=>

Result
------
a,1,0
b,2,10
c,0,20

Here are the basic join switches that will be used:

  • -a1 -a2: we also want the unpairable lines of each file
  • -j1: join on the first field of each file
  • -t,: specify the separator

For a more complete explanation of those options, see join's man page.

That is all what is needed to join two files (the sort command is added for reference although the files are already sorted):

$ join -a1 -a2 -j1 -t, <(sort -t, -k1 a.csv) <(sort -t, -k1 b.csv)
a,1
b,2,10
c,20

So far so good, however we now don't know to which file belonged a and c. Let's fix that.

Now comes the -o options which allows to control what will be outputted. A '0' means to output the joined field and additional FILENUM.FIELD which field of which file (1 or 2) is to be added to the output. There's however a nice simplification using the keyword auto that automagically determines how many fields from each file to output based on their first line.

Using -o 'auto' the output becomes:

$ join -a1 -a2 -j1 -t, -o 'auto' a.csv b.csv
a,1,
b,2,10
c,,20

Now it's clear that a belongs to a.csv and c to b.csv. It is however nice to complete the command with -e0 to replace missing fields with a 0.

$ join -a1 -a2 -j1 -t, -o 'auto' -e0 a.csv b.csv
a,1,0
b,2,10
c,0,20

Perfect. Now we can add a new file:

d.csv
-----
a,100
x,200

and join them all:

$ join -a1 -a2 -j1 -t, -o 'auto' -e0 <(join -a1 -a2 -j1 -t, -o 'auto' -e0 a.csv b.csv) d.csv
a,1,0,100
b,2,10,0
c,0,20,0
x,0,0,200

You can then integrate that in some crazy one-liner or into a shell script to join any number of csv you want.