One of my on-and-off-again Drupal projects is a food information database, built mostly around whatever the current official government database is. These days that is the USDA FNDDS, or Food and Nutrient Database for Dietary Studies. There is an excellent Drupal module called Views Database Connector (or "VDC") which allows you to create views on databases other than the one the site runs on. It has an unfortunate limitation caused by Drupal's database system, but it is easily mitigated.
The database is distributed only in CSV and MS Access formats. This is unfortunate because the provided data dictionary is not very good, MS Access is hard to get data out of even with access to Access, and CSV files include no information about the format of the data they contain. Just import them into a spreadsheet and analyze them, you say? Alas, some of these files have over a million rows, and even the latest Excel cannot handle them, let alone the much more limited LibreOffice Calc, which crashes just trying.
So what we want to do in order to make actual use of this information is to get it into an RDBMS where we can use SQL to query it, in my case with Drupal and VDC. At the time I was interested in getting it into Postgres because that was what I was using for Drupal, but this turns out to be something of a best case, because the psql command has excellent support for importing CSV. It can load very large volumes of data very quickly, which is exactly what we need with files this size. Even if our main database is MariaDB/MySQL, we can still load the appropriate modules and access a Postgres database from Drupal.
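For example, once a table exists, psql can slurp an entire CSV file with a single \copy meta-command (the table and file names here are just placeholders); unlike the server-side COPY statement, \copy reads the file from the client machine, so it does not need filesystem access on the database server:
\copy food FROM 'food.csv' CSV HEADER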
Attached to this blog you will find two gzipped Perl scripts, csvcount.pl and csvtopg.pl. The former is what I started with, and it's useful even if you're not going to import into Postgres; it characterizes CSV files by reading their contents. It tells you not only the maximum length used for each field (which is important for figuring out what data types you might use) but also the other thing you need to know for that: what kind of data is in each field. It detects integers, floating point numbers, alphanumeric data, alpha with a newline in it, and also empty fields which never contain any data at all. For Postgres this is really all you could want to know - you only really need the field length when choosing among the numeric types, because all Postgres character fields are based on the same type under the skin; the other character types are provided only for compatibility reasons.
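The core idea is simple enough to sketch. The following is not csvcount.pl itself, just a minimal illustration of the approach, assuming the Text::CSV module is available: for each column, track the longest value seen and the narrowest kind of data that still describes everything seen so far.
#!/usr/bin/perl
# Minimal sketch of the idea behind csvcount.pl - not the actual script.
# For each column, remember the longest value seen and the narrowest data
# kind (int, float, alpha, alpha with newline, empty) that fits everything.
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new({ binary => 1 });
open my $fh, '<', $ARGV[0] or die "can't open $ARGV[0]: $!";

my $header = $csv->getline($fh);
my (@maxlen, @kind);

while (my $row = $csv->getline($fh)) {
    for my $i (0 .. $#$row) {
        my $v = $row->[$i] // '';
        $maxlen[$i] = length $v if length $v > ($maxlen[$i] // 0);
        next if $v eq '';                          # empty values don't change the kind
        my $k = $v =~ /^-?\d+$/      ? 'int'
              : $v =~ /^-?\d*\.\d+$/ ? 'float'
              : $v =~ /\n/           ? 'alpha_nl'  # alpha with an embedded newline
              :                        'alpha';
        $kind[$i] = widen($kind[$i], $k);
    }
}

# Combine the kind seen so far with the kind of the current value:
# int can widen to float, and anything can widen to alpha.
sub widen {
    my ($old, $new) = @_;
    return $new unless defined $old;
    return $old if $old eq $new;
    return 'float'    if "$old $new" =~ /^(int float|float int)$/;
    return 'alpha_nl' if $old eq 'alpha_nl' or $new eq 'alpha_nl';
    return 'alpha';
}

for my $i (0 .. $#$header) {
    printf "column %d name %s length %d %s\n",
        $i, $header->[$i], $maxlen[$i] // 0, $kind[$i] // 'empty';
}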
csvtopg.pl, on the other hand, outputs the psql commands necessary to import CSV files into the database. It does the same job as csvcount.pl, but instead of producing output designed to be human readable, it emits SQL that tells Postgres to create tables and then import the data into them. The script was written on Linux and has not been tested on Windows, but I think it ought to work there as well. You can either redirect csvtopg.pl's output into a file and then feed that to psql, or just pipe it through psql directly.
I'll illustrate these scripts briefly using a CSV export of the frequencies on a Baofeng radio. Here are the top few lines of the file:
Location,Name,Frequency,Duplex,Offset,Tone,rToneFreq,cToneFreq,DtcsCode,DtcsPolarity,Mode,TStep,Skip,Comment,URCALL,RPT1CALL,RPT2CALL
0,FRS1,462.562500,,0.000000,,88.5,88.5,023,NN,NFM,5.00,,,,,,
1,FRS2,462.587500,,0.000000,,88.5,88.5,023,NN,NFM,5.00,,,,,,
2,FRS3,462.612500,,0.000000,,88.5,88.5,023,NN,NFM,5.00,,,,,,
But what kind of data is in each of these fields? If the file is very long then maybe you can't assume that the data is consistent all the way down. csvcount will tell us:
$ csvcount.pl baofeng-freqs.csv
file baofeng-freqs.csv - 61 lines
column 0 name Location length 2 int 60
column 1 name Name length 7 alpha 60
column 2 name Frequency length 10 float 60
column 3 name Duplex length 1 empty 59
And so on. But what if we want to pull this into a database table? csvtopg has the following output when run against the same file:
$ csvtopg.pl baofeng-freqs.csv
DROP TABLE IF EXISTS baofeng-freqs;
CREATE TABLE IF NOT EXISTS baofeng-freqs(
location INTEGER,
name TEXT,
frequency REAL,
duplex TEXT,
offset REAL,
tone TEXT,
rtonefreq REAL,
ctonefreq REAL,
dtcscode INTEGER,
dtcspolarity TEXT,
mode TEXT,
tstep REAL,
skip VARCHAR(1),
comment TEXT,
urcall TEXT,
rpt1call TEXT,
rpt2call TEXT,
unnamed1 TEXT
);
COPY baofeng-freqs
FROM '/home/drink/Documents/baofeng-freqs.csv'
DELIMITER ','
NULL ''
CSV HEADER;
This could be copied and pasted into a psql window, or piped through "psql databasename" to one-shot the whole process.
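For example (the database name here is just a placeholder):
$ csvtopg.pl baofeng-freqs.csv | psql mydatabase
Or, if you want to inspect the generated SQL before running it:
$ csvtopg.pl baofeng-freqs.csv > baofeng-freqs.sql
$ psql mydatabase -f baofeng-freqs.sql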
There are a few settings in the script; for my use case I didn't see a reason to make them configurable any way other than by editing the script itself:
#$settings{'dbschema'} = 'public';
Postgres' default schema is 'public'. Change this to use a different schema, which lets you store multiple data sets with the same table names in the same database. Commented out by default since there is no need to specify it if you're not using it.
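For instance, two releases of the FNDDS data could live side by side under different schemas (the schema names here are just examples):
CREATE SCHEMA fndds_2019;
CREATE SCHEMA fndds_2021;
-- the same table name can now exist once in each schema,
-- e.g. fndds_2019.food and fndds_2021.food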
$settings{'csv_delimiter'} = ',';
If your CSV file is actually separated by something other than commas, which is unusual but does happen, change the delimiter character here.
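For instance, for a tab-separated file this would become (assuming the value is used verbatim as the separator character):
$settings{'csv_delimiter'} = "\t";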
$settings{'text_threshold'} = 1;
In Postgres, all VARCHAR data is actually just TEXT behind the scenes, except that VARCHAR enforces a length limit when data is entered. You can set this to 0 if you want to just use all TEXT data types.
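A quick way to see the difference (varchar_demo is just a throwaway table for illustration):
CREATE TABLE varchar_demo (skip VARCHAR(1));
INSERT INTO varchar_demo VALUES ('y');   -- fine
INSERT INTO varchar_demo VALUES ('yes'); -- ERROR: value too long for type character varying(1)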
$settings{'drop_table'} = 1;
If you don't want the DROP TABLE statements, set this to 0; that will keep you from accidentally dropping and recreating tables you've already imported.