VAR, COVAR and COREL in Actionscript

Aug 28
2009

Someone was asking whether or not it’s possible to call Excel functions from a Flex project because they needed to use VAR, COVAR and COREL. The short answer is, of course, no (at least not to my knowledge, and, even if it were possible, I don’t think it’s something you really want to encourage). A better answer would explain that these functions aren’t especially complex and that an Actionscript implementation is fairly straightforward. The wikipedia page on covariance is a little daunting if you’ve never really thought about what these functions entail (or have forgotten), but essentially it boils down to this :

private function corel(X:Array, Y:Array):Number
{
    //correlation coeff between two random variables X and Y is defined as :
    //correlation(X,Y) = covar(X,Y)/(sqrt(Var(X)) * sqrt(Var(Y)))
    //
    //var(X) = covar(X,X);
    //covar(X,Y) = E((X-xm)(Y-ym)); where xm, ym are the population means.
    return covar(X, Y)/Math.sqrt(covar(X, X) * covar(Y, Y));
}
 
private function covar(X:Array, Y:Array):Number
{
    //Sample covariance is Sum((X-xm)(Y-ym))/n-1
    //where n is the sample size and xm & ym are sample means.
    //I'll assume that X and Y are the same size...
    var total:Number = 0;
    var xm:Number = average(X);
    var ym:Number = (X == Y)? xm : average(Y);
    for (var i:int=0; i<X.length; i++)
        total += (X[i]-xm)*(Y[i]-ym);
    return total/(X.length - 1); 
}
 
private function average(X:Array):Number
{
    //Sample mean or average = Sum(X)/(sample size)
    var total:Number = 0;
    for each(var x:Number in X)
        total += x;
    return total/X.length;
}

Of course, it’s not quite that simple – my covar won’t return the same value as COVAR in Excel, because Excel uses the biased estimator (i.e. it divides by n rather than (n-1)), but this cancels out when calculating the correlation coefficient, and correcting covar to the biased estimator is trivial (multply by (n-1)/n). Personally, I think Excel is wrong for defaulting to the biased estimator for COVAR, especially since VAR uses the unbiased one (and, as I’ve written in the comments, VAR(X)=COVAR(X,X)).

Playing with IDataReader and SqlBulkCopy

Aug 25
2009

For importing huge amounts of data into SQL Server, there’s really nothing quite like SqlBulkCopy. I’ve recently had a need to manipulate a (roughly) 330,000 line CSV file and import the results of that manipulation into a single table. Doing this record by record can take minutes, but with SqlBulkCopy, importing that many records can be done in about 4 seconds on my development machine (and it’s definitely not the fastest PC in the world).

Out of Memory

Originally I was reading in the file, manipulating the data and writing out another CSV file I could use with DTS. However, SqlBulkCopy.WriteToServer doesn’t take a CSV file directly, it only takes either a DataTable, DataRow[] or IDataReader, so at first, while writing out the CSV file I was also building up a DataTable to pass to it. For a file like mine of only a few hundred thousand records, it wasn’t a big problem to build that DataTable in memory – it was only taking a few hundred MB – but it occurred to me that there could be a problem if the number of records increased modestly to a million or so. In fact, with a file of only 4 million records, I’d probably be looking at a System.OutOfMemoryException.

IDataReader

The solution to this problem is to write a class which implements IDataReader and pass this to SqlBulkCopy. There are a few implementations out there already, but I couldn’t find anything both free and in C#. I didn’t look terribly hard though, and I was curious to try writing a basic implementation myself just to see how difficult it would be.
It turns out it’s not very difficult at all, it depends on how much effort you want to put in. For a simple spike like this I just wanted to see how long it took to implement enough of IDataReader for SqlBulkCopy to work so I could then see how much memory was being used. This is (part of) what I ended up with :

public class CSVDataReader : IDataReader
{
    private StreamReader stream;
    private Dictionary<string, int> columnsByName = new Dictionary<string,int>();
    private Dictionary<int, string> columnsByOrdinal = new Dictionary<int,string>();
    private string[] currentRow;
    private bool _isClosed = true;
 
    public CSVDataReader(string fileName)
    {
        if (!File.Exists(fileName))
            throw new Exception("File [" + fileName + "] does not exist.");
 
        this.stream = new StreamReader(fileName);
 
        string[] headers = stream.ReadLine().Split(',');
        for (int i=0; i < headers.Length; i++)
        {
            columnsByName.Add(headers[i], i);
            columnsByOrdinal.Add(i, headers[i]);
        }
 
        _isClosed = false;
    }
 
    public void Close()
    {
        if (stream != null) stream.Close();
        _isClosed = true;
    }
 
    public int FieldCount
    {
        get { return columnsByName.Count; }
    }
 
    /// <summary>
    /// This is the main function that does the work - it reads in the next line of data and parses the values into ordinals.
    /// </summary>
    /// <returns>A value indicating whether the EOF was reached or not.</returns>
    public bool Read()
    {
        if (stream == null) return false;
        if (stream.EndOfStream) return false;
 
        currentRow = stream.ReadLine().Split(',');
        return true;
    }
 
    public object GetValue(int i)
    {
        return currentRow[i];
    }
 
    public string GetName(int i)
    {
        return columnsByOrdinal[i];
    }
 
    public int GetOrdinal(string name)
    {
        return columnsByName[name];
    }
 
    //Other IDataReader methods/properties here, but all throwing not implemented exceptions.
}

It turns out you only need to implement these few properties and methods for SqlBulkCopy (I’m not even sure you need implement this much). Once I had this, it was a mere four lines to import the CSV file into SQL Server :

SqlBulkCopy sbc = new SqlBulkCopy(mySqlConnection);
sbc.DestinationTableName = "MyTable";
sbc.BulkCopyTimeout = 6000; //10 Minutes
sbc.WriteToServer(new CSVDataReader(myFileName));

Of course, this relies on all of MyTable’s columns being of type varchar and the column headers in the CSV file need to match up with the column headers in the table, but this is supposed to be a simple spike.
The first time I ran this, it was using about 12MB of memory for a 12MB file (my original 330K line file), and while this was an improvement over the 100s of MB for building the DataTable, it didn’t really tell me anything about how it might scale. So, I generated a file with about 35 million rows in it just to see what would happen. I was pleasantly surprised to find that it only used about 12MB from start to finish – this is clearly the benefit of using this DataReader model, the whole file/data structure is never in memory so we’re not generating enormous data structures to pass around.

If I have to do something similar to this in the future, I’ll probably tidy up this CSVDataReader and use it again. I may even implement the rest of it…

Visit Our Friends!

A few highly recommended friends...

Archives

All entries, chronologically...

Pages List

General info about this blog...