Read Later

Often I stumble upon articles but don’t have time to read them at work, or they’re longform and I don’t want to spend that much time looking at text on an emissive screen. To solve this, I decided to build something that would wrap these articles up and deliver them to my Kindle to browse at my leisure.

I was partially inspired by “reader mode”, a common feature in many browsers that lets you strip away anything that isn’t directly related to the article you’re reading. Often it removes menus, navigation bars, cluttered style elements, and if you’re lucky it takes care of ads too. I wasn’t aiming to get it as perfect as that to begin with, but something that would package a webpage — including related inline images and content — for later reading would be great.

Planning

My initial plan was a Python web app, probably Flask or Django, that would accept a URL, extract the text, perform some conversion, and send the result to my Kindle. But as I started sketching out a solution I encountered some problems:

I only wanted to download one page (not scrape a site), but still have all image content.
I needed somewhere to host it.
I needed authentication/authorisation to prevent misuse.
I needed conversion to appropriate Kindle formats.

I remembered that Pandoc did HTML to ebook pretty well (as long as you didn’t care about maintaining styling information), and calibre will take an EPUB format ebook and change it to the Mobipocket format that is readable by Kindle. And honestly if I’m relying on a bunch of software to do this, how much code do I realistically want to be writing, especially to do authentication that isn’t a complete pain?

I had started considering using Docker Compose to handle all the extra applications, which I’d used before in an advanced version of my basic site crawler (repo), as well as the NS3000 file storage project that got shelved after I started working full time again. It had the key ability to wind together a bunch of different applications, with a lot less yak shaving than if I was doing it manually.

It was at this point that I realised I might not need to write any code, or at least nothing more than a few lines of shell script. I run a GitLab instance and am familiar enough with it to have written a book on it (you can buy it too!), and indeed one of its major drawcards is the “kitchen sink” approach that gives it a powerful, well-rounded CI/CD system. GitLab has its own authentication system, and I’m usually logged into it to check various sites and projects that I have on the hop. So why not wire together a few stages of a CI/CD pipeline to convert a web page to an ebook and then email it to Amazon?

Solution

Using just GitLab CI, a few Docker containers, and some shell commands (mostly sed), we can manually start a pipeline from the GitLab web interface and set the SITE_TO_READ_LATER variable to the URL we want to read later.

The final creation is relatively simple:

stages:
  - download_html
  - convert_page_to_epub
  - convert_epub_to_mobi
  - send_email_to_kindle

variables:
  SITE_PROTOCOL: https
  SITE_TO_READ_LATER: https://az.id.au/ops/claude-the-laptop/

downloadHtml:
  image: alpine:latest
  stage: download_html
  only:
    - web
  script:
    - apk add wget
    - mkdir public
    - cd public
    - wget -E -H -k -p --content-on-error --restrict-file-names=nocontrol $SITE_TO_READ_LATER || true
  artifacts:
    paths:
      - public

convertPageToEpub:
  image:
    name: pandoc/core:latest
    entrypoint: ["/bin/sh", "-c"]
  stage: convert_page_to_epub
  only:
    - web
  script:
    - mkdir /epub
    - cd public
    # wget -E is expected to save the page as <host>/<path>.html, or <host>/<path>/index.html for a trailing slash.
    # Print the derived directory and file name so they show up in the job log.
    - echo `echo $SITE_TO_READ_LATER | sed 's/^https:\/\///' | sed 's/$/.html/' | sed 's/\/.html$/\/index.html/' | sed 's/\/[^/]*.html$//'`
    - echo `echo $SITE_TO_READ_LATER | sed 's/^https:\/\///' | sed 's/$/.html/' | sed 's/\/.html$/\/index.html/' | sed 's/^.*\///'`
    # Work from the page's own directory, then convert the saved HTML file to EPUB.
    - cd `echo $SITE_TO_READ_LATER | sed 's/^https:\/\///' | sed 's/$/.html/' | sed 's/\/.html$/\/index.html/' | sed 's/\/[^/]*.html$//'`
    - pandoc -s -r html `echo $SITE_TO_READ_LATER | sed 's/^https:\/\///' | sed 's/$/.html/' | sed 's/\/.html$/\/index.html/' | sed 's/^.*\///'` -o $CI_PROJECT_DIR/readlater.epub
  dependencies:
    - downloadHtml
  artifacts:
    paths:
      - readlater.epub

convertEpubToMobi:
  image: linuxserver/calibre:latest
  stage: convert_epub_to_mobi
  only:
    - web
  script:
    - ebook-convert readlater.epub readlater.mobi
  dependencies:
    - convertPageToEpub
  artifacts:
    paths:
      - readlater.mobi

sendEmailToKindle:
  image: alpine:latest
  stage: send_email_to_kindle
  only:
    - web
  dependencies:
    - convertEpubToMobi
  script:
    - apk add bash coreutils msmtp sed
    - bash -c "sed -i 's/\[MSMTPSERV\]/$(echo \"$MSMTPSERV\")/' mailconfig"
    - bash -c "sed -i 's/\[MSMTPFROM\]/$(echo \"$MSMTPFROM\")/' mailconfig"
    - bash -c "sed -i 's/\[MSMTPUSER\]/$(echo \"$MSMTPUSER\")/' mailconfig"
    - bash -c "sed -i 's/\[MSMTPPASS\]/$(echo \"$MSMTPPASS\")/' mailconfig"
    - cp mailconfig /etc/msmtprc
    - bash -c "sed -i 's/\[MESSAGEIDREPLACE\]/$(date | sha1sum - | awk '{print $1}')11/' readlater.msg"
    - base64 -i readlater.mobi >> readlater.msg
    - echo "--=_[MESSAGEBOUNDARYREPLACE]--" >> readlater.msg
    - bash -c "sed -i 's/\[MESSAGEBOUNDARYREPLACE\]/$(date | sha1sum - | awk '{print $1}')22/' readlater.msg"
    - bash -c "sed -i 's/filename=readlater.mobi/filename=readlater.mobi;\n size=$(ls -l $CI_PROJECT_DIR | grep readlater.mobi | awk '{print $5}')/' readlater.msg"
    - bash -c "sed -i 's/readlater.mobi/$(date | sha1sum - | awk '{print $1}')33.mobi/' readlater.msg"
    - cat readlater.msg | msmtp --account=yourmail $KINDLEEMAIL
  artifacts:
    paths:
      - readlater.msg

Three simple steps across the four stages:

Download the website and use Pandoc to convert to EPUB.
Convert the EPUB to MOBI using calibre.
Send the MOBI to my Kindle via Amazon’s Kindle Personal Documents Service.
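
The first step depends on predicting where wget saves the page, which is what the chained sed commands in the convertPageToEpub job work out. A minimal sketch of the same derivation as reusable shell functions follows. The function names (page_path, page_dir, page_file) are made up for illustration, and like the chain it mirrors it assumes an https:// URL with a path that does not already end in .html.

```shell
#!/bin/sh
# Sketch: derive the local path that wget -E -p is expected to create for a URL.
# Function names are hypothetical; the sed steps mirror the pipeline's chain.

# page_path URL: print <host>/<path>.html, or <host>/<path>/index.html
# when the URL ends with a slash.
page_path() {
  printf '%s\n' "$1" \
    | sed -e 's|^https://||' \
          -e 's|$|.html|' \
          -e 's|/\.html$|/index.html|'
}

# Directory and file name parts of that path.
page_dir()  { dirname "$(page_path "$1")"; }
page_file() { basename "$(page_path "$1")"; }

page_path 'https://az.id.au/ops/claude-the-laptop/'
# prints az.id.au/ops/claude-the-laptop/index.html
```

With helpers like these, the job could run cd "$(page_dir "$SITE_TO_READ_LATER")" once and hand "$(page_file "$SITE_TO_READ_LATER")" to pandoc, rather than repeating the whole chain four times.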

The only extra files in the repo are a config for msmtp (mailconfig), which is used to send the email, and a template plaintext email message (readlater.msg).

You’ll notice the bunch of sed commands in the last stage. They fill in the mail server, sender address, username, and password from CI/CD variables so we’re not storing them in the repo, and they replace some tokens to make sure the email boundary and message ID are unique.
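
As a rough sketch of that token replacement, the snippet below fills [TOKEN] placeholders in a config template. The template layout is hypothetical (the repo’s real mailconfig may look different), the helper names are invented, and the values are placeholders standing in for the CI/CD variables. Unlike the one-liners in the pipeline, it escapes characters that sed treats specially in a replacement, so a password containing & or | should come through unchanged. Values containing newlines are not handled.

```shell
#!/bin/sh
# Sketch: fill [TOKEN] placeholders in a config template without committing secrets.
# The template layout below is hypothetical; the repo's mailconfig may differ.

# Escape the characters that are special in a sed replacement (\ and &)
# plus the | delimiter used below, so values such as passwords survive intact.
escape_replacement() {
  printf '%s' "$1" | sed -e 's/[\\&|]/\\&/g'
}

# fill_token FILE TOKEN VALUE: replace [TOKEN] in FILE with VALUE.
fill_token() {
  sed "s|\[$2\]|$(escape_replacement "$3")|" "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

# Demo with placeholder values standing in for the CI/CD variables.
config=$(mktemp)
cat > "$config" <<'EOF'
account yourmail
host [MSMTPSERV]
from [MSMTPFROM]
user [MSMTPUSER]
password [MSMTPPASS]
EOF

fill_token "$config" MSMTPSERV 'smtp.example.com'
fill_token "$config" MSMTPFROM 'me@example.com'
fill_token "$config" MSMTPUSER 'me@example.com'
fill_token "$config" MSMTPPASS 'p&ss/w|rd'

cat "$config"
```

In the pipeline the filled-in file is then copied to /etc/msmtprc, where msmtp picks it up.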

We also encode the MOBI file using Base64 to add it to the multipart email message as an attachment.
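
To illustrate the attachment step, here is a minimal sketch that builds a two-part MIME message with a Base64-encoded attachment. The header layout, the function names, the example addresses, and the application/octet-stream content type are all assumptions, since the repo’s readlater.msg template is not reproduced in this post, and this is not claimed to satisfy whatever Amazon requires. It assumes GNU coreutils (base64, sha1sum), as installed in the pipeline’s Alpine image.

```shell
#!/bin/sh
# Sketch: build a two-part MIME message with a Base64 attachment.
# The header layout is an assumption; the repo's readlater.msg template may differ.

# A throwaway unique token, in the spirit of the pipeline's date | sha1sum.
unique_token() {
  date | sha1sum | awk '{print $1}'
}

# build_message ATTACHMENT FROM TO: write the full message to stdout.
build_message() {
  attachment=$1
  boundary="=_$(unique_token)22"
  cat <<EOF
From: $2
To: $3
Subject: Read Later
Message-ID: <$(unique_token)11@example.invalid>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="$boundary"

--$boundary
Content-Type: text/plain; charset=utf-8

Article attached.

--$boundary
Content-Type: application/octet-stream; name="$(basename "$attachment")"
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="$(basename "$attachment")"

$(base64 < "$attachment")
--$boundary--
EOF
}

# Demo with a stand-in file rather than a real MOBI.
demo_dir=$(mktemp -d)
printf 'not really a MOBI file\n' > "$demo_dir/readlater.mobi"
build_message "$demo_dir/readlater.mobi" 'me@example.com' 'my-kindle@example.com' \
  > "$demo_dir/readlater.msg"
```

The resulting file can be piped to msmtp in the same way as the pipeline’s last script line. Each body part starts with a line of two hyphens followed by the boundary, and the final line adds two trailing hyphens to close the message.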

You can find this repo online at GitLab.com and GitHub.

Afterword

The majority of this project was spent fighting email issues. While email can be an ancient and cantankerous protocol, in this case most of the problem came down to Amazon having extra, unpublished requirements that had to be worked out manually. It took a bunch of hours, 60+ pipeline runs, and a 9 hour break to get the email part working, but I’m glad it finally paid off.

I guess now I have to find some articles I want to send to it!

A Kindle with a website repackaged as an ebook. A monitor in the background shows the same website article.

Updated 2021-09-11: the example .gitlab-ci.yml in the post was updated to match more edge cases like the repo version.

Az
