Mitchell Currie

Elegant handling of file uploads (Python/Bottle).

Mon 21 November 2016

Rationale

Uploading a file seems like a pretty simple thing, huh? It is, but there are also a few gotchas that can really cause trouble if you're not careful. We'll examine the simplest approach we can think of, and then explain how to overcome the issues it can cause.

Basics

Let's define a file uploader page where a user can see all the files that have been uploaded, and can upload a file of their own from their computer. Let's also say that the website has a folder called /uploads where it puts all the uploads.

This page will just receive the file from the form and save it to disk as received. The list of files will just show what's in the directory, pretty simple.

Worked Example

First, let's define an HTML template we'll use for both examples. It uses Jinja2 syntax and is relatively simple.

Jinja2 HTML template

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>File Uploads</title>
</head>
<body>
{% if message %} <h3> {{ message }} </h3> {% endif %}
<h2>File Uploader</h2>
<form id="file_form" method="post" enctype="multipart/form-data">
    <input type="file" name="my_upload" id="upload_file"/>
    <input type="submit" value="Start upload" />
</form>
<table>
    <thead><tr><th>File</th><th>Type</th><th>Date</th><th>Size</th></tr></thead>
    {% for file in files %}
    <tr><td><a href="{{ file.path }}"> {{ file.name }}</a></td><td>{{ file.type }}</td><td>{{ file.date }}</td><td>{{ file.size }}</td></tr>
    {% endfor %}
</table>
</body>
</html>

This produces a simple webpage that shouldn't be too difficult to grasp. Let's take a look at the service next.

Python service for the simplest example

from bottle import route, run, default_app, request, jinja2_view, static_file, BaseRequest
from os import listdir, stat, path
from datetime import datetime
import magic


@route('/uploads/<file>', name='static')
def serve_upload(file):
    return static_file(file, root='uploads')


# Index
@route('/', method=['GET', 'POST'])
@jinja2_view('files.html', template_lookup=['views'])
def file_list():
    message = None
    if request.method == 'POST':
        upload = request.files.get('my_upload')
        if upload is None:
            # The form was submitted without choosing a file
            message = 'Error no file selected'
        elif not path.exists(path.join(uploads_dir, upload.filename)):
            upload.save(uploads_dir)
            message = ('Saved ok' if path.exists(path.join(uploads_dir, upload.filename)) else 'Error Saving')
        else:
            message = "Error file already exists"

    files = list()
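    for f in listdir(uploads_dir):
        full = path.join(uploads_dir, f)
        mime_type = mime.from_file(full)
        if isinstance(mime_type, bytes):  # some python-magic versions return bytes, others str
            mime_type = mime_type.decode('utf-8')
        files.append({'name': f,
                      'size': stat(full).st_size,
                      'type': mime_type,
                      'path': 'uploads/'+f,
                      'date': datetime.fromtimestamp(path.getctime(full))})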
    return {'files': files, 'message': message}

# Create app instance
app = application = default_app()
BaseRequest.MEMFILE_MAX = 8 * 1024 * 1024  # 8 MiB
uploads_dir = 'uploads'
mime = magic.Magic(mime=True)
# Run Bottle's internal test server when invoked directly, i.e. non-uWSGI mode
if __name__ == '__main__':
    run(app=app, host='0.0.0.0', port=8081)

So, as we intended, it's fairly simple. There's only one page ('/'), which handles both listing and posting (and then lists the end result), plus serve_upload, which just tells Bottle to serve a static file from the uploads folder based on the name given.

If you run this, you should see that it works and your file is visible. However, let's examine some potential drawbacks.

Multiple files with the same name can't be uploaded: the upload will either error out or replace the existing file.
There could be characters in the filename that are problematic for some file systems or web servers (you won't see this under uWSGI).
Not in this example, but depending on how you concatenate strings, and if someone manually posts the form, someone could send '../.htconfig' as a path (and it might work).
You can't limit access based on ACL or ownership when serving plain static files.
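
To make the third point concrete, here is a minimal sketch of how a naive join lets a crafted filename escape the uploads folder, along with one way to guard against it. The helper name safe_upload_path is hypothetical and not part of Bottle, and it assumes you only ever want a bare filename inside one flat directory.

```python
from os import path


def safe_upload_path(base_dir, filename):
    """Return a path inside base_dir for filename, or raise ValueError.

    Hypothetical helper for illustration: it keeps only the last path
    component and then double-checks the result stays under base_dir.
    """
    # Normalise Windows separators so 'a\\b.txt' is treated like 'a/b.txt'
    name = path.basename(filename.replace('\\', '/'))
    if name in ('', '.', '..'):
        raise ValueError('unusable filename: {0!r}'.format(filename))
    base = path.realpath(base_dir)
    target = path.realpath(path.join(base, name))
    # After resolving, the target must still live directly under base
    if path.dirname(target) != base:
        raise ValueError('path escapes upload folder: {0!r}'.format(filename))
    return target


# A naive join happily walks out of the uploads folder
naive = path.normpath(path.join('uploads', '../.htconfig'))
```

Here naive ends up pointing outside uploads, while safe_upload_path('uploads', '../.htconfig') stays inside it. Note that Bottle's FileUpload.filename is already a cleaned-up version of what the client sent (the raw value is available as raw_filename), which is why the example above gets away with it.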

Relying on filenames for the physical disk path is probably the biggest limitation; the other items relate more to security. What we need is more than just physical files on disk. Let's use some sort of storage to record details about the files, based on the following list of nice things:

Server storage path not related to filename user provides
User can still see the original filename of file when uploaded
File behaves correctly based on MIME type when downloading (e.g. an image file opens in the browser rather than being downloaded).
Possibility to identify file and record who is owner (won't cover in this example).

Essentially, we want a database. You could use a SQL database such as SQLite, with or without an ORM like SQLAlchemy. For this example I'm going to wuss out and use shelve (based on pickle), as it barely adds any lines of code (but it has many portability and performance issues in the real world!).
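
If you haven't used shelve before, the sketch below shows the handful of calls the next example relies on: open a store, save a dict under a string key, and read it back. The file name demo_uploads.db, the key and the record contents are made up for illustration, and a temporary directory is used so nothing is left behind.

```python
import shelve
import tempfile
from os import path

with tempfile.TemporaryDirectory() as tmp:
    store = path.join(tmp, 'demo_uploads.db')  # hypothetical store name

    # Write one record; shelve keys must be strings, values can be any picklable object
    with shelve.open(store) as d:
        d['1479451548.0'] = {'name': 'README.TXT', 'type': 'text/plain'}

    # Reopen the store and read the record back by its key
    with shelve.open(store) as d:
        record = d['1479451548.0']
        keys = list(d.keys())
```

The with form closes the store for you; the full example below calls d.close() by hand instead, which does the same job as long as nothing raises in between.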

Here's the complete solution (using the same HTML view):

The more sophisticated way

from bottle import route, run, default_app, request, jinja2_view, response, BaseRequest
from os import path
from datetime import datetime
from time import mktime
import magic
import shelve


@route('/uploads2/<file_id>', method='GET')
def download_attachment(file_id):
    file_bytes = None
    d = shelve.open(file_store)
    if file_id in d:
        file = d[file_id]
        # Read the stored bytes; the on-disk name is the id, not the user's filename
        with open(file['path'], 'rb') as f:
            file_bytes = f.read()
        response.set_header('Content-Disposition', 'inline; filename="{0}"'.format(file['name']))
        response.set_header('Content-Type', file['type'])
    else:
        response.status = 404
    d.close()
    return file_bytes


# Index
@route('/', method=['GET', 'POST'])
@jinja2_view('files.html', template_lookup=['views'])
def file_list():
    message = None
    d = shelve.open(file_store)
    if request.method == 'POST':
        time_stamp = datetime.now()  # get a timestamp
        upload = request.files.get('my_upload')
        if upload is None:
            # The form was submitted without choosing a file
            message = 'Error no file selected'
        else:
            string_id = str(mktime(time_stamp.timetuple()))
            details = {'name': upload.filename,
                       'type': upload.content_type,
                       'path': path.join(uploads_dir, string_id),
                       'date': time_stamp}
            upload.save(details['path'])
            if path.exists(details['path']):
                # Measure the saved file; the upload part rarely carries its own Content-Length
                details['size'] = path.getsize(details['path'])
                message = 'Saved ok'
                d[string_id] = details
            else:
                message = 'Error Saving'

    files = list(d.values())
    d.close()
    return {'files': files, 'message': message}

# Create app instance
app = application = default_app()
BaseRequest.MEMFILE_MAX = 8 * 1024 * 1024  # 8 MiB
uploads_dir = 'uploads2'
file_store = 'uploads.db'
mime = magic.Magic(mime=True)
# Run Bottle's internal test server when invoked directly, i.e. non-uWSGI mode
if __name__ == '__main__':
    run(app=app, host='0.0.0.0', port=8082)

Phew! Let's break it down.

As you can see, its overall structure is similar to before, but we've added some more bits in the middle.

We're using a .db file, written by shelve (pickle under the hood), to store records. The primary takeaway here is that items are identified by their upload time (timestamp), which serves as the unique identifier both on disk and in the data store. The two could differ, but then the mapping would need to be explicit rather than implicit, and this serves our needs.
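
One thing to keep in mind with timestamp ids: timetuple() drops the microseconds, so two uploads that land in the same second end up with the same id. The sketch below shows that, along with one possible alternative using uuid4 from the standard library. The function names timestamp_id and random_id are made up here; the example in this post only uses the timestamp form.

```python
from datetime import datetime
from time import mktime
from uuid import uuid4


def timestamp_id(moment):
    # Same expression as in file_list(): whole seconds since the epoch, as a string
    return str(mktime(moment.timetuple()))


def random_id():
    # Alternative (an assumption, not from the original): a random 32-char hex id
    return uuid4().hex


first = datetime(2016, 11, 18, 12, 0, 0, 100000)
second = datetime(2016, 11, 18, 12, 0, 0, 900000)  # same second, later microsecond

same_second_clash = timestamp_id(first) == timestamp_id(second)
random_ids_differ = random_id() != random_id()
```

For a demo with one user clicking a button this hardly matters, but it's worth knowing before you put real traffic through it.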

If you look in the uploads folder (uploads2 for us), you will see something like this:

➜  file_uploader_test ls uploads2
1479451548.0

So that's pretty simple: all the files just use a timestamp. But won't this confuse people?

No, it won't, because this isn't what people see. As we're storing the name in the pickle, we show it on the listing page, and we also do a clever trick in the download_attachment method. More specifically, we send the client the original filename and MIME type through Content-Disposition and Content-Type respectively. If you upload README.TXT, the user will get README.TXT locally and their browser will treat it exactly like a text file, despite the URL being opened as http://<host>/uploads2/1479451548.0, for instance. Without those headers it would normally be downloaded as a binary file named 1479451548.0, since the browser by default looks to the filename for a MIME mapping unless we set Content-Type ourselves (as we do).
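
The header trick can be looked at in isolation. The sketch below builds the two headers from a stored record, without Bottle, so you can see exactly what the browser receives. The function name download_headers is hypothetical; the record keys match the ones stored by file_list.

```python
def download_headers(record):
    """Build the response headers that restore the user's filename and MIME type."""
    return {
        # 'inline' asks the browser to display the file if it can, instead of saving it
        'Content-Disposition': 'inline; filename="{0}"'.format(record['name']),
        # Without this, the browser would have to guess the type from the URL
        'Content-Type': record['type'],
    }


record = {'name': 'README.TXT', 'type': 'text/plain', 'path': 'uploads2/1479451548.0'}
headers = download_headers(record)
```

Swapping 'inline' for 'attachment' would prompt a download instead, still under the original filename.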

If you try to upload a file with the same name multiple times, you'll see it doesn't care. Sure, it'll show multiple files with that name, but they're really separate items and can all be accessed (it might be confusing for the user, but in real life you'd probably use an owner name too).

Since we're not just mapping static files but actually using a handler that serves the content, we can intercept the request and prevent the file from being downloaded if the JWT (session) token is missing or unauthorised for the content. In this example, however, I wanted to keep it simple.
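
As a rough idea of what that interception could look like, here is a sketch that decides on a status code before any bytes are read. Everything about it is an assumption: the records in this post have no 'owner' field, current_user stands in for whatever your session or JWT check yields, and the function name check_access is made up.

```python
def check_access(record, current_user):
    """Return the HTTP status a download handler should answer with.

    Hypothetical: assumes each record carries an 'owner' key and that
    current_user is None when no valid session/JWT was presented.
    """
    if record is None:
        return 404  # unknown file id
    if current_user is None:
        return 401  # not logged in
    if record.get('owner') != current_user:
        return 403  # logged in, but not the owner
    return 200


record = {'name': 'README.TXT', 'owner': 'alice'}
statuses = [check_access(record, user) for user in (None, 'bob', 'alice')]
```

In download_attachment you would call something like this right after the lookup, and only open the file when it returns 200.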

Conclusions

Handle file uploads with care: servers and filesystems don't treat all characters and symbols the same, or even nicely. You can never have any guarantee about what types of filenames a user will send, and handling clashes is important. Give file uploads an entity structure and record those details in whatever data store you prefer (it'll pay dividends later).
