TL;DR: I use Rclone with AWS S3 as a backend. Versioning is enabled on the bucket, and a cron job in TrueNAS runs a sync with bandwidth limiting, performance and cost tuning, file filtering, and logging of the output.
A long time ago I would just manually do an off-site backup of my data to Google Drive, a laborious and time-intensive process thanks to ADSL internet and the finicky nature of the Google Drive web interface. Not long before my original post about NAS upgrade ideas I migrated to ad-hoc runs of Rclone against key directories, using S3 as a backend for storage. But for something I only needed to do quarterly, it was easy to forget for a year and then watch it spend hours catching up, or to misremember which NAS shares go in which bucket or subfolder and end up with redundant copies scattered all over the place.
I’ve been playing with offsite backups again lately and now have it running monthly via cron using the capable TrueNAS web UI. For anyone who plans to do the same, what follows are my experiences and recommendations.
Before I dive in, here’s the command executed by cron:
rclone sync /mnt/pool-data s3:naslbs --filter-from /mnt/pool-data/conf/filter-lbs.txt -v --bwlimit 300k --size-only --fast-list --log-file /mnt/pool-data/conf/backup.log
I live in a country with an artificial scarcity of bandwidth thanks to years of mismanagement by every elected party. I get 18Mbps upload, which in practice works out to approximately 1.8MB/s to share across all the devices and servers running at home that either access the internet or can be accessed from it. I want to ensure that whenever scheduled backups take place, my NAS isn't hogging our internet tube and making the connection unusable for my spousy-boo or Mastodon server. Rclone instantly became one of my favourite pieces of software for its simple --bwlimit 300k option. Regardless of how much data is queued for upload, I'm guaranteed that it won't impact our perceptual use of the connection.
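As an aside: if your household's usage follows a predictable schedule, --bwlimit also accepts a timetable, so you can throttle hard during waking hours and open things up overnight. The times and rates below are purely illustrative, not what I actually run:

rclone sync /mnt/pool-data s3:naslbs --bwlimit "08:00,300k 23:00,off"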
We've turned versioning on for our S3 bucket, so any time a file is changed it'll keep a copy of the old version. The main motivator for this is backup protection through a naive Write Once Read Many (WORM) approach: if our files are accidentally deleted or encrypted by a ransomware infection and the cron task runs, we won't lose our offsite backups too. In the modern age of live cloud backups this is critical; the original 3-2-1¹ strategy usually assumed an offsite backup that was a weekly tape mailed offsite, which is why modern recommendations are for a 3-2-1-1-0 or 4-3-2 strategy.
Of course, if files are liable to change this isn’t appropriate, so we save our NAS backups for files that are considered stable or unlikely to be changed.
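For reference, if you're setting this up from scratch, enabling versioning on an existing bucket is a one-liner with the AWS CLI (shown here against my naslbs bucket):

aws s3api put-bucket-versioning --bucket naslbs --versioning-configuration Status=Enabled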
Our NAS has a lot of stuff, about 8TB of the digital ephemera produced by work and life in the digital age. A lot of this doesn't need to be backed up; for now we're only focusing on a few key shares, which you'll see in the filter file below.
Thankfully the --filter-from file.txt option in Rclone makes this a breeze. We start by adding exclusions for files and folders that are just OS-specific junk:
- .DS_Store
- ._.DS_Store
- Thumbs.db
- **/.AppleDouble/*
- .AppleDouble/*
Then we can go through and add the folders we want backed up, excluding subdirectories that don’t match the criteria we set above and then finally we say “exclude everything else”:
- projects/Assets/**
- projects/in-progress/**
+ projects/**
+ learning/uni/**
+ pictures/**
- *
In this case we're excluding an "Assets" folder because the original source for the files is still up, and the files are large, uncompressed, and binary, which would cost more to store for little benefit. If we find out the original service providing them is shutting down, we'll want to ensure we continue to own the things we "bought", and compressing and backing them up would become more important.
We're also excluding our "in-progress" projects because the contents are subject to frequent change, which would require them to be uploaded each run and also end up creating many redundant copies, as per the previous section on S3 versioning. For this kind of work we have the projects stored in some form of version control and manage backups in a way that suits that type of work.
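Filter files are easy to get subtly wrong, so it's worth checking what will actually be transferred before committing to a real run. Adding --dry-run to the sync command makes rclone report what it would do without touching anything:

rclone sync /mnt/pool-data s3:naslbs --filter-from /mnt/pool-data/conf/filter-lbs.txt --dry-run -v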
We get verbose logging (-v) so we can check the results during and after each run, looking for any problems.
We also log the output of each run with --log-file file.txt. A future update will probably push the logs to a local logging service and have a trigger on any WARN/ERROR levels to ping me with the deets. It's also pretty handy to go back through and see just how much data each run moved and how long it took.
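In the meantime, a quick grep over the log file covers the basics. Rclone tags problem lines with levels like ERROR and NOTICE, so something along these lines should surface anything worth investigating:

grep -E "ERROR|NOTICE" /mnt/pool-data/conf/backup.log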
The last two flags are ones recommended by Rclone in its guide on S3 tuning. We use --size-only to reduce the number of HEAD requests made, saving a bunch of time and requests during backup runs. Any changes to the kinds of files we back up are likely to cause size changes too, which makes this an easy win. Finally we have --fast-list, which we use to reduce the number of list requests; while this should save some money, it's mostly about saving time when iterating over all our files. We found this cut our "no change" sync times by about 90%.
Please institute backups at home (and I guess at work). Never underestimate how horrible it feels to lose years of precious memories (or shareholder value) when a hard disk carks it or gets written over because your housemate needed to install the latest Modern Warfare patch.
Even just putting a copy onto an external hard drive is fine, but make sure to check it every 6 months to ensure there's no corruption. An untested backup is a calamity waiting to be discovered. If you want to do a low-budget offsite backup, find a friend who also needs to back up their data and just fill and swap your external drives every 6 months.
Also, if anyone wants to ask why I don't have backups in another geographic region: if there's any incident big enough to wipe both my NAS and the AWS datacentres, then I have bigger problems to worry about (like finding clean water after the apocalypse).
Strategy: 3 copies of the data, 2 on different media, 1 offsite. ↩︎
I take a lot of silly photos throughout my day on Snapchat, sending them back and forth with friends to keep abreast of our goings-on, especially for those of us who live in geographically isolated locales. I usually save a copy of the photos I take so I've got my own record of my adventures, hobbies, or interesting things I want to investigate later. During a recent backup of my phone to LBS I noticed that each Snapchat photo was a JPEG about 0.5MB - 2MB in size, which isn't massive by itself but made me curious about how much space this was taking up in my archives.
I use a Pixel 4a with a 12.2MP rear camera and an 8MP front camera. That rear camera gives me a 4032x3024 photo, and when using the standard Google Android camera app those pictures are between 1MB - 4MB. Meanwhile, the saved photos from Snapchat are 4000x1957, or about 7.8MP. Since Snapchat is designed to take photos or videos quickly, and the focus is on ephemeral pics that disappear rather than photos to be saved/posted/printed, there's not nearly as much in the way of optimisation as Google provides based on its research (such as HDR+). Thus the photos are often a bit blurry, a bit crunchy, a bit unfocused, etc. I still want a copy of these moments, but they won't be ones I plan to print and frame.
Okay, so I've been using Snapchat daily for nearly a decade; I probably have a few photos. Let's find out by searching my pictures directory:
du -a . | grep -E "Snapchat.*(jpeg|jpg)" | wc -l
This gives me 18,702 photos. Alright, that’s a lot. But I’ve had older and shittier phones in the past so maybe they’re not taking that much space?
du -a . | grep -E "Snapchat.*(jpeg|jpg)" | awk '{ sum += $1 } END{ print sum}'
Ooof, 16,868,926KB, or roughly 16.8GB. Out of the 78GB of photos I have, that's 21% stored really inefficiently for the quality of picture. Maybe we can compress them further, aiming to retain perceptual quality while saving disk space at home and money on online backups?
ffmpeg is an amazingly powerful command line tool for media manipulation whose full syntax can only be recalled by the most arcane and powerful of computer mages. Thankfully there's a bunch of one-liners scattered about the internet so I can crib the fragments I need.
Here’s one of those example one-liners to recompress a JPEG:
ffmpeg -i input.jpg -q:v 20 output.jpg
The -q:v 20 lets me pick a quality to encode the JPEG file at; the parameter is a number (roughly 2 to 31 for the JPEG encoder) on a scale that approximates best quality through to best compression.
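To find a value worth committing to, you can sweep a few settings over a sample image and eyeball the results. A minimal sketch (sample.jpg is a hypothetical stand-in for one of your own photos):

for q in 5 10 15 20 25; do
  ffmpeg -i sample.jpg -q:v "$q" "sample-q$q.jpg"
done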
I went through my photos and picked out a representative sample of Snapchat photos over the years, trying to get a good gamut of shots. I tested a variety of quality parameters against each sample pic, with the aim of finding a single "magic" value that provided a balance of quality and compression for all images, with a view to running a batch convert of all my Snapchat images. The review of the photos was entirely subjective, based on my own perception of the before and after photos when zoomed in and out on both a computer and smartphone. In the end, I settled on 10 as the magic number for the following reasons:
Higher values could give better compression but with quickly diminishing returns (80% - 90% vs 90% - 95%) and were likely to show a lot more crunchy visual artifacts in regular viewing scenarios.
While I originally planned to do all the conversions on my TrueNAS box inside a jail, there were a few issues with the package management that would require an OS upgrade. That's not a task for today, so I decided to do it on my Windows PC and just take the hit of reading over the network. Another key requirement was to keep the original creation date from the old files (when the photo was taken) and set the "last modified" time to match, so that the timestamps on the compressed files would line up with the originals, allowing for easier viewing/sorting. I ended up putting together the following PowerShell script:
Get-ChildItem -Path '\\nas.home\pool-data\pictures' -File -Recurse | ForEach-Object {
    $newname = [System.IO.Path]::GetFileNameWithoutExtension($_.FullName)
    # Drop the UNC prefix from the directory path so the folder structure
    # can be mirrored locally (adjust 39 to the length of your own prefix).
    $newdirectory = $_.DirectoryName.Substring(39)
    $created = $_.CreationTime
    $extn = [IO.Path]::GetExtension($_.FullName)
    # Only process saved Snapchat JPEGs.
    if ($newname -like "*Snapchat-*" -And ($extn -eq ".jpg" -Or $extn -eq ".jpeg")) {
        if (!(Test-Path -Path ".\compressed\$newdirectory")) {
            New-Item -Name ".\compressed\$newdirectory" -ItemType Directory
            Write-Host "New folder created successfully!" -f Green
        }
        # Recompress to a local copy, then stamp both timestamps with the
        # original creation time so date-based viewing/sorting still works.
        ffmpeg -y -i $_.FullName -q:v 10 ".\compressed\$newdirectory\$newname.jpg"
        Get-ChildItem ".\compressed\$newdirectory\$newname.jpg" | % { $_.LastWriteTime = $created }
        Get-ChildItem ".\compressed\$newdirectory\$newname.jpg" | % { $_.CreationTime = $created }
    }
}
This would recursively go through all the folders of my pictures on LBS, finding only saved Snapchats and capturing the appropriate metadata. It would then compress them, write the output to a file on my local PC, and set the metadata on the new versions. I didn't want to immediately overwrite the images, so I could do some more spot checks on quality and size changes and then back up the originals in another folder (one that doesn't have offsite/offline backup).
I let the script run while I ducked out to the corner store and came back to it completed. All my spot checks matched my earlier findings about quality, the files had the correct metadata, and I was ready to migrate them to the NAS.
The size of the compressed files ended up being 3.2GB, roughly 80% compression over the whole collection, and I've now got a script I can run whenever I back up my phone.
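If you want to double-check the savings yourself, the same du pipeline from earlier works against the output folder:

du -a ./compressed | grep -E "Snapchat.*(jpeg|jpg)" | awk '{ sum += $1 } END { print sum }'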
Don’t worry, I always maintain backups in a 3-2-1 or 3-2-1-1-0 strategy which I’ll go over in a future post. ↩︎
A common scenario you’ll probably discover in frontend JavaScript or backend Node.js is the need to run a bunch of asynchronous functions. A typical form of this might be getting the results from a few different APIs or downloading a bunch of files from a remote server. Below you’ll find two potential solutions to this common problem:
const asyncFunction = (value) => {
return new Promise((resolve) => {
setTimeout(() => resolve(value), 1000);
});
};
// An array of elements 0 to 9.
const arr = Array.from(Array(10).keys());
async function testWithForOf() {
let elements = []
const time0 = Date.now()
for (const value of arr) {
const test = await asyncFunction(value);
elements.push(test);
}
console.log(`for...of took: ${(Date.now() - time0) / 1000} seconds`);
}
async function testWithMap() {
const time1 = Date.now()
let elements = await Promise.all(arr.map(asyncFunction));
console.log(`Promise.all() took: ${(Date.now() - time1) / 1000} seconds`);
}
testWithForOf();
testWithMap();
In this test setup, asyncFunction(value) represents any asynchronous function you need to run, taking approximately one second to return a response. If you run this code, you should discover that testWithMap() returns in just over a second while testWithForOf() takes over ten seconds. In both cases we are await-ing the results - expecting all the asynchronous operations to finish before we carry on - but the advantage of testWithMap() is that we're using the JavaScript event loop to run the asynchronous operations concurrently.
It's important to note that this example is very simplified; you might not see something exactly like this in your code. But there are a variety of scenarios where you're making a bunch of asynchronous calls which may be at least partially independent of each other.
Maybe you need to use the results from each API call to build a final result where the order of the finished calls is still important, but the calls themselves are independent of each other. Below I’ve put together an example where we complete all our asynchronous calls and then process all the results in order.
const asyncFunction = (value) => {
return new Promise((resolve) => {
setTimeout(() => resolve(value), 1000);
});
};
// An array of elements 0 to 9.
const arr = Array.from(Array(10).keys());
const elements = await Promise.all(arr.map(asyncFunction));
const result = elements.reduce((acc, e) => {
return acc + e;
}, 0);
Another common setup is when you've got a few separate functions you need to run that are all asynchronous and you want to use the results of all of them. Something like the code below will kick off all the functions concurrently and then await the results, which come back in an array indexed in the same order as the functions.
const asyncFunction1 = () => {
return new Promise((resolve) => {
setTimeout(() => resolve('a cool result'), 1000);
});
};
const asyncFunction2 = () => {
return new Promise((resolve) => {
setTimeout(() => resolve('another cool result'), 1100);
});
};
const arr = [asyncFunction1, asyncFunction2];
const results = await Promise.all(arr.map(e => e()));
If you're doing any kind of billed-by-the-millisecond 'serverless' functions, this kind of setup may help shave some money off your bill by reducing the amount of time you spend waiting for operations to finish. One potential issue, if you're doing more than a hundred iterations, is that your vendor might have limits on the number of simultaneous outgoing connections you can have. In this case, you can find a middle ground by "chunking" your array into sequential batches of a size the platform will allow and then running each chunk's operations concurrently.
const arr = Array.from(Array(10).keys());
const results = [];
const chunkSize = 2;
// asyncFunction is the same one-second helper from the examples above.
for (let i = 0; i < arr.length; i += chunkSize) {
  console.log(`Chunk starting at index ${i}`);
  const chunk = arr.slice(i, i + chunkSize);
  // concat() returns a new array and would discard the results,
  // so push the resolved values into the existing array instead.
  results.push(...(await Promise.all(chunk.map(asyncFunction))));
}
The 80/20 rule in software development says that the final 20% of a project will take 80% of the total time to build.
Algorithmic tools are designed to make life easier. The new raft of AI systems is a great example of this; tools for translation, image creation, text generation, and more that allow laypeople to do a whole bunch of stuff they never could before with ease and simplicity. But all of these systems share a similar problem: they're only about 80% of the way there. For the everyday use case that's all you need! But when these systems shift from helping end users to replacing human professionals, the importance of that missing 20% becomes starkly apparent. This problem is twofold, as we'll see below.
Machine translation systems are great for when you’re travelling abroad and trying to understand a menu or read place names. The ones that take a photo, run it through OCR and replace the foreign text with your preferred language are so handy, even though we’ve all seen they can get a bit rough around the edges. In those cases, you just adjust the camera and try again or work out the intended meaning from context clues. You use human intervention to ensure a good result and make the final decision, rather than explicitly trust the machine output.
An IBM slide from 1979.
But if you're translating asylum claims with automated tools and using that as the basis for a confirmation or denial of someone's hope to survive, you need that missing 20% of functionality to understand the deeper context. This is the kind of work that professional translators excel at, but they're being supplanted in droves by people who think machine alternatives are Good Enough™️, no matter how many wrongfully denied claims this causes. The twofold problem is apparent here: either the people deploying these tools don't understand what that missing 20% costs, or they understand and simply don't care because the savings are worth it to them.
While Hanlon's razor says the former is probably correct, we also can't forget that these systems are orders of magnitude cheaper than paying for teams of specialised, skilled staff who understand things that automated tools don't, like the minutiae of language, cultural context, translation vs localisation, etc.
Source: @mikilanguages on Twitter (dead account).
I've seen friends of mine who are graphic designers or other UI/UX professionals be told their skills are becoming useless because of DALL-E, Midjourney, and other generative image systems. The people who praise these tools understand them through their own use case of generating pretty pictures. But a graphic designer doesn't just make pictures. They understand the context and branding of the project, they ensure designs fit with the theme and constraints, they build accessibility into contrast ratios and font choices, they make sure their output can be converted from screen to print, can be scaled and vectorised, and they ensure a consistent visual language across all parts of the project. This is the skill difference between the amateur/hobbyist and the professional: the huge depth and breadth of work that goes into a profession that outsiders don't see. These are things that AI systems don't understand.
I've seen discussions about AI 3D modelling tools (from text prompts or images) as the next great revolution in gaming, but making Maya/Blender models is not all a 3D modeller in game design does. For sure it is a part of their job, just like a graphic designer makes pictures, but it's only one part. The 3D modeller needs to ensure the creation fits the thematic style of the existing assets, but they also need to care about efficiency. An AI generator that creates a 40,000-polygon couch is a cool novelty, but who will do the retopology to bring it down to an acceptable poly count for inclusion in a game? We have existing simplification algorithms in 3D design and vector art, but they are very limited. They're a tool that assists experts, but the hard work is still there; often half of it is just fixing the mistakes of the simplification algorithm. That's not even getting into efficient UV unwrapping, texture packing, or normal map generation, things these tools can't do not for lack of training data but for lack of R&D into making a tool that specialises in that area.
I doubt any of the aforementioned professions (along with voice actors, copywriters, etc.) will completely disappear. But these technologies are used as an excuse to devalue skilled work, reduce wages, and cut contracts. I already have friends in copywriting and design who have lost long-standing contracts and retainers thanks to ChatGPT/DALL-E, only to be offered piecemeal work at reduced rates to just "tidy up" the output from those tools; a task which often takes more time and effort than starting from scratch.
The solution often touted is that "the tools will get better with more training data/computational power!" But this shows a naive understanding of how the technology works. I've discussed previously why neither more training data nor more computational power will solve the accuracy issues of general purpose AI tools, and my comments still stand. But when we talk about implementing these general purpose tools in specialist settings, we come up against the 80/20 barrier: solving the last 20% of a complex problem would end up costing 80% of the total investment in development, and no company has any reason to justify that expense.
Because the average Joe thinks the existing systems are sufficient, they've already purchased subscriptions or licenses and are happy with the results. The companies making these tools have already captured that market with only 20% of the development cost of a complete system, so why would they spend the other 80% if there's no demand? And because the use cases which cause problems are edge cases, the potential market share gains from trying to perfect them wouldn't provide a return on investment.
Systems like ChatGPT cost over $700,000 a day on inference (producing text) alone, not including the millions spent on training. But that's still a far cheaper option than trying to develop effective, vetted, properly labelled training datasets and researching how to solve edge cases. Why acquire and license and categorise and tag imagery/music/code/text when you can just crawl the web and use everyone's data without their permission (while violating license agreements) and just hope your tool doesn't end up creating accidental CSAM? Why research how to solve problems with racial bias when DALL-E can shadow-modify your prompt to include ethnicities where it thinks appropriate? Why hire linguistic experts and translators to provide supervised feedback for training when you can just use underpaid "ghost labour" who don't have the skills or support to effectively refine a training dataset?
Update 22/10/2023: A day after I posted this 404 Media dropped the news that the LAION-5B training dataset used for imagery GenAI contains CSAM.
If there's one benefit to rising interest rates, it's that tech investment may get more cautious. Throwing large amounts of cash at startups that are unlikely to ever turn a profit is a lot harder to justify as the zero interest rate era dies out, so we might see fewer half-baked tools that cost as much in software/hardware as they do in wages. The major competing platforms will probably feel the pinch and raise their prices, which could make them less attractive compared to trained professionals who can do the whole job instead of just a bit of it. And given how expensive so many of these systems are to develop, train, and maintain, a few might eventually shutter as the hype surrounding them dies down.
It’d also be nice to see more regulation as the dangers of such systems become apparent. There’s been some success in limiting the use of facial recognition systems without oversight and if the trend continues we may see regulation looking into generative AI too.
Without this they're about as useful as tits on a bull.
A friend asking for “good industrial tunes for while I’m filing TPS reports” is an easy suggestion to make if you both go to the same gigs. You have shared history in the relevant topic, you know the context (genre, purpose, etc), and you’re probably going to suggest stuff that you’ve at least heard.
This is a simple example of a useful recommendation, but the problem is the world is full of our own terrible recommendations that we don’t always realise; so let’s look at what violations of the three principles look like.
“What phone should I get?”
This is the best question to get curs to come out of the woodwork and show their whole ass. Immediately answering "iPhone", "Samsung", or "Google" is terrible without more knowledge of the situation. What do they need the phone for? My retired mother-in-law mostly uses a phone for calls, texts, and occasional photos. An entry-level smartphone from AusPost absolutely nails her needs and is cheap as chips; anyone who suggests she go and drop nearly $2k on the latest Samsung Galaxy is a fool and a knave. Even an older model refurbished iPhone is unnecessary because she has no other Apple gear. The correct thing to do when faced with this question is to pry into what they need it for and what they've used in the past, and tailor your response once you have more information.
A good principle in life is not to answer a question if you don’t have enough information to do so accurately.
In areas where there is so much brand-fawning, people are terrible at providing useful recommendations. Whether it's phones, computer operating systems, or cars, people will happily give advice that doesn't satisfy any of the criteria I opened with.
Why would you recommend a Google Pixel to someone who has a MacBook Air, an Apple Watch, and a big-ass iCloud account? This person would be better off with some type of iPhone unless you have more information that says otherwise.
In a similar vein, telling someone who thinks a package manager is another name for an Amazon warehouse to swap to Linux makes you a bad person. Your recommendation should be tailored to your audience. NixOS may be the greatest thing since sliced bread and your repo of dotfiles may have more GitHub stars than some galaxies, but you are talking to someone without the time and resources to develop the skills needed to look after Linux. FOSS operating systems are amazing because if anything goes wrong you can fix it yourself, but they are also terrible because if anything goes wrong you have to fix it yourself. It should be a legal requirement that if you want to convince a layperson to switch to a *nix derivative, you must be at their beck and call to solve any problem they have with it, in perpetuity.
If requirements and context are provided and you decide to ignore it, I will nailgun your ears to the minute hand of a clock tower.
I asked friends for suggestions for a good beginner automatic watch, because I didn't want fitness tracking, notifications, or dealing with recharging, and was told to get an Apple Watch. Why someone would do this I don't know; my leading theory is something to do with microplastics in the food supply.
Please don't come up to me and, unprompted, tell me to use Schwarzkopf got2b Glued if you've never had a mohawk. I am very happy if you say "a YouTuber said this was the best hairspray for mohawks, is this true?" because then I can happily say it's effective but overpriced compared to better options, and with a strong scent that can be overwhelming. Now you are no longer providing a recommendation for something you don't understand, but are learning something new.
In this vein, recommending things you don't have experience with and have only heard about second- or third-hand is bad and potentially dangerous. You do not have enough unbiased knowledge to know that your suggestion meets their needs, and you are parroting something the recommendee would have found with a quick online search (something they have almost certainly already done).
Going back to the phone example, I always see the most ill-informed opinions from people who have either never experienced alternatives or have had extremely irrelevant experience which has confirmed their existing biases.
I will have no end of people who have only ever used iPhones tell me that they have the best cameras. For clearer low-light photography and better image pre-processing, the Google Pixel has won out in every generation where I've needed a new phone. That said, there are types of photography where the iPhone beats the competition, so I won't recommend a Pixel to someone embedded in the Apple ecosystem who mostly takes nice landscapes on their morning hikes.
The flipside to not having any experience with alternatives is having irrelevant experience. I am guilty of the following myself:
“The iMac G3 I used in high school sucked compared to my new $3000 Windows gaming computer.”
I made an irrelevant comparison! The battered iMac G3 was purchased at the cheapest possible price and was already dated by the time I used it, and I was comparing it to something I went hog wild on specs for to get the highest frame rate in AAA's latest release: Realistic Brown Texture And Lens Flare Shoot'em'up 7. You see the same with phone recommendations; borrowing your Mum's refurbished Samsung Galaxy J5 and hating it compared to your iPhone 15 Pro Max Ultra doesn't mean that iPhones are better than Samsungs. It just means you don't have relevant experience with comparable devices.
We can’t expect to have used every alternative option for every potential purpose, but we can provide our own context and experiences, tell people where we lack information, and warn people of what we see as pitfalls with our own suggestion.
Your recommendation is a lot more valuable and trustworthy if you can look at it critically and show where it sucks.
A lot of friends are migrating away from Twitter because it’s a fucking trashfire and looking for recommendations. I prefer Mastodon because I use it for a lot of nerdy shit and have a decent community there. It’s not nearly as good for weird shitposting as Bluesky in my experience, and I haven’t used cohost before. I try and ask what people like from Twitter and then tailor my suggestion and also let them know where my knowledge gaps are on the subject.
You too should provide your context and caveats on recommendations, noting knowledge gaps and areas of experience or the lack thereof. Tell the recommendee that they should get some other opinions on the areas where you're fuzzy. Don't assume they haven't done the bare minimum of personal research first.
Give useful recommendations.
Give considered recommendations.
Social media makes this harder because it incentivises short-form responses and aggressive side-taking, but fight back against this.