Note: This is a long post focused on the approach I took to developing a new project. If you’re interested in more technical details, how I solved the challenges it brought up, or insights into some of the processes, I will be writing other posts under the “ns3000” tag.
For about four years now I’ve had a project in mind: an all-in-one file-storage solution that meets a whole bunch of criteria I have. There are some COTS (commercial off-the-shelf) and open source products that meet some of the needs, but nothing meets all the features I need. Beyond that, I wanted something designed and built by my own hand.
I’ve made attempts to start it every 6 months since I first had the idea, but they’ve all fallen apart after a few days and I’ve put the idea back in the ideas book again.
However, coming up to April, I set myself a goal to work on one of my larger-scale projects and really push myself with it. I even decided to crack open the beautiful planner I got years ago from Evil Supply Co and use it to track progress and push me towards my goals.
In the end I settled on building NS3000 - my file storage solution.
With the project now selected, I gathered together all my notes and made my second big decision: how I would approach building it.
I decided on an iterative MVP (Minimum Viable Product) approach. I was going to split things up into small tranches, each one building on the last. At times this would mean some previous work might get discarded, but it also meant I’d have regular output I could see and use, which would give me the motivation to keep pushing ahead.
I planned to build it in the following stages:
- Basic file upload with tagging and user metadata
- Dockerised app and automated metadata/content extraction
- Image thumbnailing
I spent the first week just building the basics of the site. I decided to use Laravel for the web side of things, as I had some familiarity with it and could avoid major issues, while still improving my skills at the same time.
Beginning with modelling and relationships, I quickly set up migrations and models. To go with them, I built skeleton views attached to resource controllers to ensure that file uploads and tagging worked well. With that complete, I moved on to something I wanted to focus on for the project: testing.
I spent a lot of time refactoring code for better unit tests, as well as feature tests to ensure things were perfect at the macro (as well as micro) scale. Although I’d done testing before, I hadn’t really done it with a completely new project, so it was a great learning experience (key phrase for a lot of this project).
I then introduced the user-supplied metadata section of the site, with a style more similar to TDD (test-driven development). This enabled a rapid development process and I could ensure features would work before I even opened the web browser.
With all the main website done, I did some basic styling, picking a colour scheme using Cloudflare’s excellent colour design tool.
Docker, Metadata Extraction
I spent the next day running an intensive investigation into using Docker and docker-compose to containerise the application. For the uninitiated, containerisation is similar to running things in virtual machines, but more lightweight and usually designed with the ability to be destroyed and recreated quickly to enable scaling, resilience, etc.
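To make that concrete, a docker-compose file for this kind of stack might look roughly like the sketch below. This is an illustrative guess at the shape, not the project’s actual file; the service names, images, and ports are all assumptions.

```yaml
# Hypothetical docker-compose sketch for an NS3000-style stack.
version: "3"
services:
  web:
    build: .              # the Laravel app, built on an Alpine base image
    ports:
      - "8080:80"
  tika:
    image: apache/tika    # Apache Tika server for metadata/content extraction
  redis:
    image: redis:alpine   # message broker for the Celery task queue
  worker:
    build: ./worker       # Celery worker wrapping tika-python
    depends_on:
      - tika
      - redis
```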
Once I had my website running in a lightweight Docker container (using an Alpine Linux base image) I moved on to automated metadata and content extraction. For that I decided to use the fantastic Apache Tika, something I’d eyed for years but never had an excuse or the motivation to use.
I didn’t particularly want to build a Java application to wrap around Tika, so I used the handy tika-python binding and then set up a Celery task queue with a Flask web/API route. Why make it so complex? So people wouldn’t have to wait ages for Tika to finish processing when they uploaded files. Now, as soon as a file is uploaded they can view it, while in the background Tika processes it and uploads the extracted metadata/OCR/content to an API.
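Ignoring the Flask routing and Celery wiring (in the real thing the function below would be registered with `@app.task`), the core of such a worker might look like this sketch. The API endpoint and payload shape are assumptions for illustration; only `parser.from_file` comes from tika-python itself.

```python
def build_payload(parsed):
    """Flatten a tika-python result into the shape a (hypothetical) metadata API expects."""
    return {
        "content": (parsed.get("content") or "").strip(),
        "metadata": parsed.get("metadata") or {},
    }

def extract_and_push(file_id, path, api_url):
    """Run in the background (e.g. as a Celery task) after a file upload."""
    # Deferred imports: tika-python talks to a running Tika server, and
    # requests is only needed inside the worker container.
    from tika import parser
    import requests

    parsed = parser.from_file(path)  # blocks while Tika does extraction/OCR
    requests.post(f"{api_url}/files/{file_id}/metadata", json=build_payload(parsed))
```

Keeping the payload-building step as a pure function makes it easy to unit test without a Tika server running.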
To stop users from needing to browse the entire file index or tag list, I wanted a full-text search function worth its salt. From my previous research into the topic, it came down to Elasticsearch or a newer contender, Xapiand. While Elasticsearch has the maturity and greater feature set, I wasn’t happy with needing about 4GB of RAM just to run a single node (on top of the 1GB required by the Apache Tika container). In the end I went for Xapiand, although as a future project I plan to implement pluggable search backends so people can choose what they want.
Getting Xapiand set up and containerised was a dream, and building it into the web app was relatively fun. I decided to wrap it up using Laravel events, thanks to ideas from Matthew Daly’s excellent blog. This meant the indexing could also run asynchronously if I wanted to prevent it blocking the file upload/edit process.
Currently the search is quite naive, using an approach which checks all fields for the results, including potentially lots of technical metadata (especially for images). I plan to implement a more complex and nuanced search function later - as well as advanced search - but so far I’ve found this to be an effective early system.
The last feature I decided to implement was image thumbnailing. I wanted people to be able to see small examples of any images they uploaded without needing to download the whole file. To do this I modified my Apache Tika class and created a new class to handle image thumbnailing which would be put on the Celery task queue after metadata extraction.
I decided to use the Pillow Python library and found it an intensely easy system to work with. This whole feature took about one day to fully implement, including writing tests for it.
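As a rough illustration of how simple Pillow makes this, a minimal thumbnailing helper might look like the following. The 128px bound and JPEG output are arbitrary choices for the example, not necessarily what NS3000 uses.

```python
from io import BytesIO

from PIL import Image

THUMB_SIZE = (128, 128)  # illustrative; pick whatever fits your UI

def make_thumbnail(data: bytes) -> bytes:
    """Produce a small JPEG preview from the original image bytes."""
    image = Image.open(BytesIO(data))
    image = image.convert("RGB")   # JPEG has no alpha channel
    image.thumbnail(THUMB_SIZE)    # resizes in place, preserving aspect ratio
    out = BytesIO()
    image.save(out, format="JPEG", quality=85)
    return out.getvalue()
```

In this design the helper would run as another Celery task after metadata extraction, so uploads are never blocked on image processing.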
We return to the present now, having completed the MVP of the project. I’m starting to dogfood it, using this as an opportunity both to organise the computer wallpapers I’ve collected over the years and to get some real-world testing of the project, making small adjustments and fixes as I go.
This has so far been an invaluable tactic, and I recommend everyone do it with their own projects to get a better understanding of user workflows.
There’s still a lot to build, with so many directions I can take this project. I do plan to open up the source code at some point in the not-too-distant future, so people can download and set up their own version. That’s an exciting prospect too, with me needing to version my work, write docs, etc.