wsgiref can be slow with large payloads

I mentioned using wsgiref in my previous entry on the architecture of my PowerPoint search application, slideboxx.  Part of slideboxx delivers PowerPoint files to the browser, and as you might imagine, this can mean large payloads of data to transmit.  During early prototyping and development things went smoothly; however, when I started testing with larger, more realistic data I noticed a substantial slowdown.  The reason seems obvious in retrospect but was subtle at the time.  If you work your way back through the object hierarchy, you find that by default SimpleHandler in wsgiref iterates over the value it is asked to return (SimpleHandler inherits this behaviour from BaseHandler) [see the for loop in the snippet from wsgiref/handlers.py]:

    def finish_response(self):
        """Send any iterable data, then close self and the iterable

        Subclasses intended for use in asynchronous servers will
        want to redefine this method, such that it sets up callbacks
        in the event loop to iterate over the data, and to call
        'self.close()' once the response is finished.
        """

        if not self.result_is_file() or not self.sendfile():
            for data in self.result:
                self.write(data)
            self.finish_content()
        self.close()
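
The loop calls self.write() once for every item the result yields, so the number of writes depends entirely on what kind of iterable the application hands back. Here is a minimal sketch that mimics the loop and counts the iterations. It is written for Python 3 (unlike the Python 2 listings in this post), and count_writes is a hypothetical helper, not part of wsgiref. In Python 3, iterating a bytes object yields integers rather than one-character strings, but the count is the same.

```python
def count_writes(result):
    """Mimic the loop in finish_response: one write() call per item yielded."""
    writes = 0
    for data in result:
        writes += 1  # stand-in for self.write(data)
    return writes


# Hypothetical payload, the same size as the file timed later in this post
payload = b"x" * 688128

print(count_writes(payload))    # 688128 (one iteration per byte)
print(count_writes([payload]))  # 1 (the list yields the payload once)
```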

So if you return a string to the handler, it will send each character (byte) of the string individually, with one write() call apiece. For a multi-megabyte file this can be quite slow.  The solution is to wrap any string you return in a list, so that the loop sees a single item. Alternatively, if you are expecting a very large file, you can return a file object directly, because file objects are iterable. Note that a file object iterates line by line, so a binary file goes out in chunks of whatever size happens to fall between newline bytes.  There are likely some optimizations possible here, but for my purposes this simple method works.  Here's some example code to demonstrate the differences: first a simple server with functions that return the payload as a string, as a list, and as a file, respectively:

from wsgiref import simple_server
import time

def returnAsString( env, start_response ) :
    '''return the payload as a bare string; wsgiref iterates it byte by byte'''
    t0 = time.time( )
    # read-only binary mode is enough; 'rb+' would also demand write access
    payload = open( 'example.ppt', 'rb' ).read( )
    t1 = time.time( )
    print 'Time to read %d bytes = %.2f seconds' % ( len( payload ), t1 - t0 )
    headers = [ ('Content-type', 'application/vnd.ms-powerpoint' ) ]
    start_response( '200 OK', headers )
    return payload

def returnAsList( env, start_response ) :
    '''return the payload wrapped in a list; wsgiref writes it in one call'''
    t0 = time.time( )
    payload = open( 'example.ppt', 'rb' ).read( )
    t1 = time.time( )
    print 'Time to read %d bytes = %.2f seconds' % ( len( payload ), t1 - t0 )
    headers = [ ('Content-type', 'application/vnd.ms-powerpoint' ) ]
    start_response( '200 OK', headers )
    return [ payload ]

def returnAsFile( env, start_response ) :
    '''return the open file; wsgiref iterates over it line by line'''
    f = open( 'example.ppt', 'rb' )
    headers = [ ('Content-type', 'application/vnd.ms-powerpoint' ) ]
    start_response( '200 OK', headers )
    return f

def exampleApp( env, start_response ) :
    '''dispatch according to URL'''
    pathInfo = env[ 'PATH_INFO' ]
    if pathInfo == '/returnAsString' :
        return returnAsString( env, start_response )
    if pathInfo == '/returnAsList' :
        return returnAsList( env, start_response )
    if pathInfo == '/returnAsFile' :
        return returnAsFile( env, start_response )
    # unknown path: answer with a 404 instead of returning None
    start_response( '404 Not Found', [ ('Content-type', 'text/plain' ) ] )
    return [ 'Not found\n' ]

httpd = simple_server.make_server( 'localhost', 8081, exampleApp )
httpd.serve_forever( )

Then a simple driver using httplib to make requests and time the results:

import httplib
import time

connection = httplib.HTTPConnection( 'localhost', 8081 )
connection.connect( )

t0 = time.time( )
connection.request( 'GET', '/returnAsString' )
data = connection.getresponse( ).read( )
t1 = time.time( )
dtString = t1 - t0
print 'Time to fetch %d bytes as string = %.2f seconds' % \
    ( len( data ), dtString )

del data
data = None
t0 = time.time( )
connection.request( 'GET', '/returnAsList' )
data = connection.getresponse( ).read( )
t1 = time.time( )
dtList = t1 - t0
print 'Time to fetch %d bytes as list = %.2f seconds' % \
    ( len( data ),  dtList )

del data
data = None

t0 = time.time( )
connection.request( 'GET', '/returnAsFile' )
data = connection.getresponse( ).read( )
t1 = time.time( )
dtFile = t1 - t0
print 'Time to fetch %d bytes as file = %.2f seconds' % \
    ( len( data ),  dtFile )

And the results:

Time to fetch 688128 bytes as string = 16.84 seconds
Time to fetch 688128 bytes as list = 0.01 seconds
Time to fetch 688128 bytes as file = 0.07 seconds

So you can see that reading the entire file into memory and passing it back to the wsgiref handler as a list containing a single string is the fastest. Returning a bare string, which is sent one byte at a time, is much, much slower (roughly 40 KB/s in this run). Returning the file object to the wsgiref handler is a good alternative as well.
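
One possible refinement of the file-object approach, which I did not time above, is to hand the file back in fixed-size blocks rather than line by line. The sketch below is written for Python 3 and uses wsgiref.util.FileWrapper, falling back to it when the server does not supply wsgi.file_wrapper in the environ. The serve_in_blocks name, the 64 KB block size and the in-memory stand-in for the PowerPoint file are all assumptions made for illustration.

```python
import io
from wsgiref.util import FileWrapper

BLOCK_SIZE = 64 * 1024  # assumed block size, not tuned


def serve_in_blocks(fileobj, environ, start_response):
    """Return the contents of fileobj as an iterable of fixed-size blocks."""
    headers = [('Content-Type', 'application/vnd.ms-powerpoint')]
    start_response('200 OK', headers)
    # Use the server's own wrapper when it offers one, else wsgiref's
    wrapper = environ.get('wsgi.file_wrapper', FileWrapper)
    return wrapper(fileobj, BLOCK_SIZE)


# In-memory stand-in for an open PowerPoint file of the size used above
payload = b'\x00' * 688128
statuses = []


def fake_start_response(status, headers):
    """Record the status line so the sketch runs without a server."""
    statuses.append(status)


blocks = list(serve_in_blocks(io.BytesIO(payload), {}, fake_start_response))
print(len(blocks), 'blocks,', sum(len(b) for b in blocks), 'bytes')
```

With this block size the handler would make 11 write calls for the 688128-byte payload instead of one per byte or one per line.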
