Uploading large files with ASP.NET

Oct 20, 2009

I’ve recently had a need for users of an intranet application to upload comparatively large text files (1-200MB) to a web server. There are only a couple of ways I can think of to get around the limits imposed by IIS and ASP.NET without writing code : train the users to upload smaller files which could be concatenated at the server, or allow the users to “upload” from a network share which the server has access to. These are obviously inelegant solutions, and after a little research I’ve found that the necessary code to enable large uploads yourself is surprisingly easy to write.

Jon Galloway has written a useful article which gives a more complete and eloquent view of the ASP.NET large file upload problem than I have time for here. Of particular interest is the discussion he links to, titled “HttpHandler or HttpModule for file upload, large files, progress indicator?”. I’ve adapted or rewritten some of the suggested code from there for my solution. I’m particularly indebted to Travis Whidden, whose code is much more complete than mine (it handles multiple files, for a start). I ended up rewriting it partly because I didn’t think his solution handled a particular edge case, and partly because I needed to understand what it was doing; the best way for me to do that was to rewrite it from scratch. This took me about a day of thinking, poking around and testing (I don’t claim to be a fast learner :).

Essentially the code consists of two classes, UploadModule and RequestProcessor, along with some minor changes in the web.config file :

UploadModule

In order to intercept the HttpRequest and deal with it in a stream-wise fashion, we have to implement an IHttpModule :

using System;
using System.Diagnostics;
using System.Text;
using System.Web;
using System.Reflection;
 
namespace MyIntranetSite
{
    public class UploadModule : IHttpModule
    {
        #region IHttpModule Members
 
        void IHttpModule.Dispose()
        {
        }
 
        void IHttpModule.Init(HttpApplication context)
        {
            context.BeginRequest += new EventHandler(context_BeginRequest);
        }
 
        void  context_BeginRequest(object sender, EventArgs e)
        {
            HttpApplication application = (HttpApplication) sender;
            HttpContext context = application.Context;
 
            if (context.Request.ContentType.IndexOf("multipart/form-data") == -1)
            {
                //Not our bag, baby.
                return;
            }
 
            try
            {
                HttpWorkerRequest workerRequest = (HttpWorkerRequest) context.GetType().GetProperty("WorkerRequest", BindingFlags.Instance | BindingFlags.NonPublic).GetValue(context, null);
                if (workerRequest.HasEntityBody())
                {
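                    const int defaultBuffer = 500000;
                    long contentLength = long.Parse(workerRequest.GetKnownRequestHeader(HttpWorkerRequest.HeaderContentLength));
                    long remaining = contentLength;
 
                    using (RequestProcessor rp = new RequestProcessor(contentLength))
                    {
                        //GetPreloadedEntityBody returns null when nothing has been preloaded.
                        byte[] preloadedBufferData = workerRequest.GetPreloadedEntityBody();
                        if (preloadedBufferData != null && preloadedBufferData.Length > 0)
                        {
                            rp.ReadBuffer(ref preloadedBufferData);
                            remaining -= preloadedBufferData.Length;
                        }
 
                        while (remaining > 0)
                        {
                            byte[] bufferData = new byte[(int) Math.Min(remaining, defaultBuffer)];
                            //ReadEntityBody may return fewer bytes than requested, so count what actually arrived.
                            int bytesRead = workerRequest.ReadEntityBody(bufferData, bufferData.Length);
                            if (bytesRead <= 0) break; //the client has stopped sending
                            if (bytesRead < bufferData.Length) Array.Resize(ref bufferData, bytesRead);
                            remaining -= bytesRead;
                            rp.ReadBuffer(ref bufferData);
                        }
                    }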
                }
            }
            catch(Exception ex)
            {
                EventLog.WriteEntry("Custom ASP.NET Upload Module", ex.Message);
            }
 
            context.Response.Redirect(context.Request.RawUrl);
        }
 
        #endregion
    }
}

This handles ALL “multipart/form-data” requests at the moment. You would probably want to check the url of the request and match it against a list of expected pages, otherwise all of your web requests (that is, every postback) would get processed by this module, and much of your code would be bypassed!
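One way to narrow the module down is a small allow-list check at the top of context_BeginRequest. The following is only a sketch: the UploadRequestFilter class and the two page paths are made up for illustration, and you would substitute the pages that actually host your upload forms.

```csharp
using System;
using System.Collections.Generic;

namespace MyIntranetSite
{
    // Hypothetical helper (not part of the original module): decides whether
    // UploadModule should take over a request.
    public static class UploadRequestFilter
    {
        // Assumed page paths; replace with the pages that really contain upload forms.
        private static readonly HashSet<string> UploadPages =
            new HashSet<string>(StringComparer.OrdinalIgnoreCase)
            {
                "/Upload.aspx",
                "/Admin/ImportData.aspx"
            };

        public static bool ShouldHandle(string path, string contentType)
        {
            if (string.IsNullOrEmpty(path) || string.IsNullOrEmpty(contentType)) return false;

            // Only multipart form posts can carry file uploads.
            if (contentType.IndexOf("multipart/form-data", StringComparison.OrdinalIgnoreCase) == -1) return false;

            // Everything else (ordinary postbacks included) is left to ASP.NET.
            return UploadPages.Contains(path);
        }
    }
}
```

In the module this could replace the existing content type test, along the lines of: if (!UploadRequestFilter.ShouldHandle(context.Request.Path, context.Request.ContentType)) return;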

RequestProcessor

This class takes a series of byte array buffers and parses them for a start and end pattern. Hopefully, I’ve commented my code well enough for it to be read and understood. However, I should say that this only deals with single file uploads at the moment and expects to be able to write to “C:\temp\”. It would be possible to improve the code to handle multiple files, and making the upload directory configurable would be fairly trivial, but I think it’s more useful as a learning tool if I keep it simple for now.

using System;
using System.Diagnostics;
using System.Collections.Generic;
using System.IO;
using System.Text;
 
namespace MyIntranetSite
{
 
    //
    /// <summary>
    /// Takes byte[] chunks from an HTTP request and processes them looking for (currently) the first file.
    /// Each file will be wrapped by the lines :
    /// (start) "Content-Type: [some content type]\r\n"
    /// (end) "-----------------------------[a form post ID]\r\n\r\n" 
    ///       (that's 29 "-"s followed by a number, followed by 2 * carriage return + newline
    /// 
    /// Problems arise because the start and end patterns could span two buffers.
    /// This means we can't write from the latest buffer - we have to always be writing from the previous buffer,
    /// since we can never know if the latest buffer (assuming there are more bytes to read) contains the start of the
    /// end pattern, but not all of it.
    /// </summary>
    public class RequestProcessor : IDisposable 
    {
        public long Length { get; private set; }
        public long BytesRead { get; private set; }
        public List<string> FinishedFiles = new List<string>();
 
        private BufferChunk previous;
        private bool _startFound = false;
        private bool _endFound = false;
        private List<byte> startPatternBegin;
        private List<byte> startPatternEnd;
        private List<byte> endPattern;
        private FileStream currentFileStream;
        private string currentFileName = Guid.NewGuid() + ".bin";
 
        public RequestProcessor(long length)
        {
            Length = length;
            BytesRead = 0;
            startPatternBegin = new List<byte>(Encoding.UTF8.GetBytes("Content-Type: "));
            startPatternEnd = new List<byte>(Encoding.UTF8.GetBytes("\r\n\r\n"));
        }
 
        public void ReadBuffer(ref byte[] buffer)
        {
            if (_endFound) return;
 
            BufferChunk current = new BufferChunk(ref buffer);
            if (previous == null)
            {
                //first buffer chunk
                //the first line of this will give the form content separator, which is also the endPattern
                int i = 0;
                endPattern = new List<byte>();
                while (i < current.Data.Count && current.Data[i] != (byte) '\r')
                {
                    endPattern.Add(current.Data[i]);
                    i++;
                }
            }
 
            //Merge the previous and current buffers
            List<byte> mergedBuffers = new List<byte>();
            if (previous != null) mergedBuffers.AddRange(previous.Data);
            mergedBuffers.AddRange(current.Data);
 
            if (!_startFound)
            {
                //Look for start pattern in the current buffer.
                //It could span this buffer and the one before (in which case the start point is in THIS buffer)
                //or it could span this buffer and the next (in which case the start point is in the NEXT buffer)
                //The latter case has to be checked when the next buffer comes in.
 
                int startBegin;
                if ((startBegin = FindBytePattern(mergedBuffers, startPatternBegin, 0)) != -1)
                {
                    //found a content-type declaration, look for the end of that line :
                    int startEnd;
                    if ((startEnd = FindBytePattern(mergedBuffers, startPatternEnd, startBegin + startPatternBegin.Count)) != -1)
                    {
                        //found the end of the line
                        if (startEnd + startPatternEnd.Count < mergedBuffers.Count - 1)
                        {
                            int startByte = startEnd + startPatternEnd.Count;
                            if (previous != null)
                            {
                                current.Start = startByte - previous.Data.Count;
                            }
                            else
                            {
                                current.Start = startByte;
                            }
                            _startFound = true;
                        }
                        // else the start byte is in the next buffer.
                    }
                }
 
                if (!_startFound) current.Start = current.Data.Count;
            }
 
            if (_startFound && !_endFound)
            {
                //Look for the end pattern in the current buffer
                //As with the start it could span beginning (in which case the last byte is in the PREVIOUS buffer)
                //Or it could span the end (in which case the last byte is in THIS buffer)
                //The latter case has to be checked when the next buffer comes in.
 
                int endBegin;
                int searchStart = previous != null? previous.Start : current.Start;
                if ((endBegin = FindBytePattern(mergedBuffers, endPattern, searchStart)) != -1)
                {
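                    //The boundary is preceded by "\r\n", which belongs to the form encoding rather than the file,
                    //so the file data stops two bytes before the boundary (End is exclusive).
                    int endByte = endBegin - 2;
                    if (previous != null)
                    {
                        if (endByte < previous.Data.Count)
                        {
                            previous.End = endByte;
                            //Nothing in the current buffer belongs to the file.
                            current.End = current.Start;
                        }
                        else
                            current.End = endByte - previous.Data.Count;
                    }
                    else
                    {
                        current.End = endByte;
                    }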
                    _endFound = true;
                }
                // else the end byte is in the next buffer.
 
                if (!_endFound && previous != null) previous.End = previous.Data.Count;
            }
            BytesRead += current.Data.Count;
 
            //FILE CREATION
            if (previous != null && _startFound && previous.WriteBytes > 0)
            {
                //Write out the previous buffer from Start to End
                if (currentFileStream == null)
                {
                    currentFileStream = File.OpenWrite(@"C:\temp\" + currentFileName);
                }
                currentFileStream.Write(previous.Data.ToArray(), previous.Start, previous.WriteBytes);
            }
 
            if (_startFound && _endFound || BytesRead == Length)
            {
                //Write out the current buffer from Start to End
                if (currentFileStream == null)
                {
                    currentFileStream = File.OpenWrite(@"C:\temp\" + currentFileName);
                }
                currentFileStream.Write(current.Data.ToArray(), current.Start, current.WriteBytes);
                currentFileStream.Close();
                currentFileStream.Dispose();
            }
 
            previous = current;
        }
 
        private static int FindBytePattern(List<byte> container, List<byte> pattern, int startIndex)
        {
            int i, position;
            if (pattern.Count > container.Count - startIndex) return -1;
 
            for (position = startIndex; position < container.Count; position++)
            {
                if (container[position] == pattern[0])
                {
                    for(i = 1; i < pattern.Count; i++)
                    {
                        if (position + i == container.Count || pattern[i] != container[position + i]) break;
                    }
                    if (i == pattern.Count) return position;
                }
            }
 
            return -1;
        }
 
        #region IDisposable Members
 
        void IDisposable.Dispose()
        {
            if (currentFileStream != null)
            {
                currentFileStream.Close();
                currentFileStream.Dispose();
            }
        }
 
        #endregion
    }
 
    public class BufferChunk
    {
        public List<byte> Data;
        public int Start;
        public int End;
        public int WriteBytes { get { return End - Start; } }
 
        public BufferChunk(ref byte[] buffer)
        {
            Data = new List<byte>(buffer);
            Start = 0;
            End = Data.Count;
        }
    }
}

These files can sit within your web project or in a separate assembly if you want. Personally I’d rather have them sitting with the web project since that makes them more straightforward to debug and more obvious as to where they belong.
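To see why RequestProcessor always searches the previous and current buffers together, here is a small standalone sketch. BoundaryDemo and its sample data are invented for this illustration; only the idea (merge the two chunks, then search) comes from the class above.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Standalone illustration only; not part of the upload module.
public static class BoundaryDemo
{
    // Same idea as RequestProcessor.FindBytePattern: returns the first index of
    // pattern at or after startIndex, or -1 if it does not occur.
    public static int Find(IList<byte> container, IList<byte> pattern, int startIndex)
    {
        for (int position = startIndex; position <= container.Count - pattern.Count; position++)
        {
            int i = 0;
            while (i < pattern.Count && container[position + i] == pattern[i]) i++;
            if (i == pattern.Count) return position;
        }
        return -1;
    }

    // Searches two consecutive chunks as one sequence, so a pattern that is
    // split across the chunk boundary is still found.
    public static int FindAcross(byte[] previous, byte[] current, byte[] pattern)
    {
        List<byte> merged = new List<byte>(previous);
        merged.AddRange(current);
        return Find(merged, pattern, 0);
    }
}
```

If the body "abc\r\n--XYZ" arrives as the two chunks "abc\r\n--" and "XYZ", searching either chunk alone for "--XYZ" returns -1, whereas FindAcross returns 5, the offset of the boundary in the merged data. The file content therefore ends two bytes earlier, just before the "\r\n".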

Web.Config changes

This is almost laughably trivial :

<httpModules>
    <add name="UploadModule" type="MyIntranetSite.UploadModule"/>
</httpModules>

And that’s it! There’s obviously a lot more that could be done (such as the progress indicator Travis incorporated), but this seems like a decent start to me.
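The httpModules section above is what IIS 5/6 (and IIS 7 in classic mode) reads. If the site were hosted on IIS 7 in integrated pipeline mode, which is an assumption on my part and not something I have tested with this module, the registration would go under system.webServer instead:

```xml
<system.webServer>
    <modules>
        <add name="UploadModule" type="MyIntranetSite.UploadModule"/>
    </modules>
</system.webServer>
```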

Ideally, ASP.NET 3.0(?) and IIS 7.0 would address this kind of problem once and for all, but I’m not holding my breath. I also suspect a lot of businesses will remain on IE6.0, IIS 5/6 and ASP.NET 2.0 for another few years, so this approach will remain relevant a little while longer.

Update (a warning)

It’s entirely possible that the parser will bomb out with some kind of error on occasion. If this happens when the number of bytes left to process is greater than the ASP.NET maxRequestLength (or the IIS request length) then the site will (seemingly) silently fail and you’ll get the dreaded “Connection was reset” error page!
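For reference, the two limits mentioned above live in the following places in web.config (both fragments sit inside the configuration element). This is only a sketch: the 200MB values are illustrative assumptions, and the requestLimits element applies to IIS 7 request filtering only. Raising them means that a request which falls through to the normal pipeline is less likely to be cut off with a reset connection.

```xml
<system.web>
    <!-- maxRequestLength is in kilobytes (default 4096); 204800 KB = 200 MB. -->
    <httpRuntime maxRequestLength="204800"/>
</system.web>
<system.webServer>
    <security>
        <requestFiltering>
            <!-- IIS 7 only; maxAllowedContentLength is in bytes; 209715200 bytes = 200 MB. -->
            <requestLimits maxAllowedContentLength="209715200"/>
        </requestFiltering>
    </security>
</system.webServer>
```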

Playing with IDataReader and SqlBulkCopy

Aug 25, 2009

For importing huge amounts of data into SQL Server, there’s really nothing quite like SqlBulkCopy. I’ve recently had a need to manipulate a (roughly) 330,000 line CSV file and import the results of that manipulation into a single table. Doing this record by record can take minutes, but with SqlBulkCopy, importing that many records can be done in about 4 seconds on my development machine (and it’s definitely not the fastest PC in the world).

Out of Memory

Originally I was reading in the file, manipulating the data and writing out another CSV file I could use with DTS. However, SqlBulkCopy.WriteToServer doesn’t take a CSV file directly, it only takes either a DataTable, DataRow[] or IDataReader, so at first, while writing out the CSV file I was also building up a DataTable to pass to it. For a file like mine of only a few hundred thousand records, it wasn’t a big problem to build that DataTable in memory – it was only taking a few hundred MB – but it occurred to me that there could be a problem if the number of records increased modestly to a million or so. In fact, with a file of only 4 million records, I’d probably be looking at a System.OutOfMemoryException.

IDataReader

The solution to this problem is to write a class which implements IDataReader and pass this to SqlBulkCopy. There are a few implementations out there already, but I couldn’t find anything both free and in C#. I didn’t look terribly hard though, and I was curious to try writing a basic implementation myself just to see how difficult it would be.
It turns out it’s not very difficult at all; it just depends on how much effort you want to put in. For a simple spike like this I wanted to see how long it took to implement enough of IDataReader for SqlBulkCopy to work, so that I could then see how much memory was being used. This is (part of) what I ended up with :

public class CSVDataReader : IDataReader
{
    private StreamReader stream;
    private Dictionary<string, int> columnsByName = new Dictionary<string,int>();
    private Dictionary<int, string> columnsByOrdinal = new Dictionary<int,string>();
    private string[] currentRow;
    private bool _isClosed = true;
 
    public CSVDataReader(string fileName)
    {
        if (!File.Exists(fileName))
            throw new FileNotFoundException("File [" + fileName + "] does not exist.", fileName);
 
        this.stream = new StreamReader(fileName);
 
        string[] headers = stream.ReadLine().Split(',');
        for (int i=0; i < headers.Length; i++)
        {
            columnsByName.Add(headers[i], i);
            columnsByOrdinal.Add(i, headers[i]);
        }
 
        _isClosed = false;
    }
 
    public void Close()
    {
        if (stream != null) stream.Close();
        _isClosed = true;
    }
 
    public int FieldCount
    {
        get { return columnsByName.Count; }
    }
 
    /// <summary>
    /// This is the main function that does the work - it reads in the next line of data and parses the values into ordinals.
    /// </summary>
    /// <returns>A value indicating whether the EOF was reached or not.</returns>
    public bool Read()
    {
        if (stream == null) return false;
        if (stream.EndOfStream) return false;
 
        currentRow = stream.ReadLine().Split(',');
        return true;
    }
 
    public object GetValue(int i)
    {
        return currentRow[i];
    }
 
    public string GetName(int i)
    {
        return columnsByOrdinal[i];
    }
 
    public int GetOrdinal(string name)
    {
        return columnsByName[name];
    }
 
    //Other IDataReader methods/properties here, but all throwing not implemented exceptions.
}

It turns out you only need to implement these few properties and methods for SqlBulkCopy (I’m not even sure you need to implement this much). Once I had this, it was a mere four lines to import the CSV file into SQL Server :

SqlBulkCopy sbc = new SqlBulkCopy(mySqlConnection);
sbc.DestinationTableName = "MyTable";
sbc.BulkCopyTimeout = 600; //10 minutes (the value is in seconds)
sbc.WriteToServer(new CSVDataReader(myFileName));

Of course, this relies on all of MyTable’s columns being of type varchar, and on the columns in the CSV file lining up with the columns in the table (with no ColumnMappings set, SqlBulkCopy maps source columns to destination columns by ordinal position), but this is supposed to be a simple spike.
The first time I ran this, it was using about 12MB of memory for a 12MB file (my original 330K line file), and while this was an improvement over the hundreds of MB needed to build the DataTable, it didn’t really tell me anything about how it might scale. So I generated a file with about 35 million rows in it just to see what would happen. I was pleasantly surprised to find that it only used about 12MB from start to finish. This is clearly the benefit of the DataReader model: the whole file is never in memory at once, so we’re not generating enormous data structures to pass around.
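The same pull-one-row-at-a-time idea can be shown without SqlBulkCopy at all. The following is a minimal sketch using a C# iterator; CsvRows is a hypothetical helper, not part of CSVDataReader, and it shares the same naive Split(',') parsing, so quoted fields containing commas are not handled.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper for illustration; it is not part of CSVDataReader.
public static class CsvRows
{
    // Lazily yields one parsed row at a time. Nothing is read until the caller
    // asks for the next row, so only the current line is held in memory.
    public static IEnumerable<string[]> Read(TextReader reader)
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            yield return line.Split(',');
        }
    }
}
```

A caller would simply loop over it, for example foreach (string[] row in CsvRows.Read(new StreamReader(myFileName))) { ... }, and memory use stays flat however many rows the file contains, for the same reason as with the IDataReader above.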

If I have to do something similar to this in the future, I’ll probably tidy up this CSVDataReader and use it again. I may even implement the rest of it…
