Playing with IDataReader and SqlBulkCopy

Aug 25
2009

For importing huge amounts of data into SQL Server, there’s really nothing quite like SqlBulkCopy. I recently needed to manipulate a (roughly) 330,000-line CSV file and import the results of that manipulation into a single table. Doing this record by record can take minutes, but with SqlBulkCopy, importing that many records takes about 4 seconds on my development machine (and it’s definitely not the fastest PC in the world).
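For comparison, the record-by-record version is just a parameterised INSERT in a loop – something roughly like this (only a sketch, assuming a hypothetical two-column table and a pre-parsed list of rows):

using (SqlConnection conn = new SqlConnection(connectionString))
using (SqlCommand cmd = new SqlCommand(
    "INSERT INTO MyTable (ColumnA, ColumnB) VALUES (@a, @b)", conn))
{
    cmd.Parameters.Add("@a", SqlDbType.VarChar, 255);
    cmd.Parameters.Add("@b", SqlDbType.VarChar, 255);
    conn.Open();

    //One round trip to the server per record - this is where the minutes go.
    foreach (string[] row in parsedRows)
    {
        cmd.Parameters["@a"].Value = row[0];
        cmd.Parameters["@b"].Value = row[1];
        cmd.ExecuteNonQuery();
    }
}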

Out of Memory

Originally I was reading in the file, manipulating the data and writing out another CSV file I could use with DTS. However, SqlBulkCopy.WriteToServer doesn’t take a CSV file directly; it only accepts a DataTable, a DataRow[] or an IDataReader. So at first, while writing out the CSV file, I was also building up a DataTable to pass to it. For a file like mine of only a few hundred thousand records, building that DataTable in memory wasn’t a big problem – it was only taking a few hundred MB – but it occurred to me that there could be a problem if the number of records increased modestly to a million or so. In fact, with a file of only 4 million records, I’d probably be looking at a System.OutOfMemoryException.
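That first attempt looked roughly like this (a sketch, with hypothetical file and connection variables – the point being that every row gets added to the DataTable before anything is sent to the server):

//Build the whole file up as an in-memory DataTable first.
DataTable table = new DataTable();

using (StreamReader reader = new StreamReader(fileName))
{
    foreach (string header in reader.ReadLine().Split(','))
        table.Columns.Add(header, typeof(string));

    while (!reader.EndOfStream)
        table.Rows.Add(reader.ReadLine().Split(','));
}

SqlBulkCopy sbc = new SqlBulkCopy(mySqlConnection);
sbc.DestinationTableName = "MyTable";
sbc.WriteToServer(table); //every row is already sitting in memory by this point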

IDataReader

The solution to this problem is to write a class which implements IDataReader and pass this to SqlBulkCopy. There are a few implementations out there already, but I couldn’t find anything both free and in C#. I didn’t look terribly hard though, and I was curious to try writing a basic implementation myself just to see how difficult it would be.
It turns out it’s not very difficult at all; it really depends on how much effort you want to put in. For a simple spike like this, I just wanted to see how long it took to implement enough of IDataReader for SqlBulkCopy to work, so I could then see how much memory was being used. This is (part of) what I ended up with:

public class CSVDataReader : IDataReader
{
    private StreamReader stream;
    private Dictionary<string, int> columnsByName = new Dictionary<string,int>();
    private Dictionary<int, string> columnsByOrdinal = new Dictionary<int,string>();
    private string[] currentRow;
    private bool _isClosed = true;
 
    public CSVDataReader(string fileName)
    {
        if (!File.Exists(fileName))
            throw new FileNotFoundException("File [" + fileName + "] does not exist.", fileName);
 
        this.stream = new StreamReader(fileName);
 
        string[] headers = stream.ReadLine().Split(',');
        for (int i=0; i < headers.Length; i++)
        {
            columnsByName.Add(headers[i], i);
            columnsByOrdinal.Add(i, headers[i]);
        }
 
        _isClosed = false;
    }
 
    public void Close()
    {
        if (stream != null) stream.Close();
        _isClosed = true;
    }

    public bool IsClosed
    {
        get { return _isClosed; }
    }

    public int FieldCount
    {
        get { return columnsByName.Count; }
    }
 
    /// <summary>
    /// This is the main function that does the work - it reads in the next line of data and parses the values into ordinals.
    /// </summary>
    /// <returns>True if another row was read; false if the end of the file has been reached.</returns>
    public bool Read()
    {
        if (stream == null) return false;
        if (stream.EndOfStream) return false;
 
        currentRow = stream.ReadLine().Split(',');
        return true;
    }
 
    public object GetValue(int i)
    {
        return currentRow[i];
    }
 
    public string GetName(int i)
    {
        return columnsByOrdinal[i];
    }
 
    public int GetOrdinal(string name)
    {
        return columnsByName[name];
    }
 
    //Other IDataReader methods/properties here, but all throwing not implemented exceptions.
}
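For the record, the “not implemented” part is exactly what it sounds like – a rough sketch of what the rest of the class body contains (IDataReader also drags in IDataRecord and IDisposable, so there are roughly a couple of dozen members left to stub):

public int Depth { get { throw new NotImplementedException(); } }
public int RecordsAffected { get { throw new NotImplementedException(); } }
public DataTable GetSchemaTable() { throw new NotImplementedException(); }
public bool NextResult() { throw new NotImplementedException(); }
public string GetString(int i) { throw new NotImplementedException(); }
//...and so on for the other typed getters (GetInt32, GetDateTime, etc.), the indexers,
//GetValues, IsDBNull and Dispose (in anything beyond a spike, Dispose should really call Close()).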

It turns out the few properties and methods implemented above are all SqlBulkCopy needs (I’m not even sure you need that much). Once I had this, it was a mere four lines to import the CSV file into SQL Server:

SqlBulkCopy sbc = new SqlBulkCopy(mySqlConnection);
sbc.DestinationTableName = "MyTable";
sbc.BulkCopyTimeout = 6000; //100 Minutes (the timeout is in seconds)
sbc.WriteToServer(new CSVDataReader(myFileName));

Of course, this relies on all of MyTable’s columns being of type varchar, and on the column headers in the CSV file matching the column names in the table, but this is supposed to be a simple spike.
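(If the header names didn’t line up with the table’s column names, SqlBulkCopy’s ColumnMappings collection could bridge the gap – something like the line below, with hypothetical names; note that once any mapping is added, every column you want copied needs one.)

sbc.ColumnMappings.Add("csv_header_name", "TableColumnName");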
The first time I ran the import, it used about 12MB of memory for a 12MB file (my original 330K-line file), and while that was an improvement over the hundreds of MB needed to build the DataTable, it didn’t really tell me anything about how it might scale. So I generated a file with about 35 million rows just to see what would happen. I was pleasantly surprised to find it still only used about 12MB from start to finish – this is clearly the benefit of the DataReader model: the whole file is never held in memory, so we’re not building enormous data structures just to pass around.

If I have to do something similar to this in the future, I’ll probably tidy up this CSVDataReader and use it again. I may even implement the rest of it…
