tl;dr summary furry.engineer and pawb.fun will be down for several hours this evening (5 PM Mountain Time onward) as we migrate data from the cloud to local storage. We’ll post updates via our announcements channel at https://t.me/pawbsocial.


In order to reduce costs and expand our storage pool, we’ll be migrating data from our existing Cloudflare R2 buckets to local replicated network storage, and from Proxmox-based LXC containers to Kubernetes pods.

Currently, according to Mastodon, we’re using about 1 TB of media storage, but according to Cloudflare, we’re using near 6 TB. This appears to be due to Cloudflare R2’s implementation of the underlying S3 protocol that Mastodon uses for cloud-based media storage, which is preventing Mastodon from properly cleaning up no longer used files.

As part of the move, we’ll be creating / using new Docker-based images for Glitch-SOC (the fork of Mastodon we use) and hooking that up to a dedicated set of database nodes and replicated storage through Longhorn. This should allow us to seamlessly move the instances from one Kubernetes node to another for performing routine hardware and system maintenance without taking the instances offline.

We’re planning to roll out the changes in several stages:

  1. Taking furry.engineer and pawb.fun down for maintenance to prevent additional media being created.

  2. Initiating a transfer from R2 to the new local replicated network storage for locally generated user content first, then remote media. (This will happen in parallel to the other stages, so some media may be unavailable until the transfer fully completes).

  3. Exporting and re-importing the databases from their LXC containers to the new dedicated database servers.

  4. Creating and deploying the new Kubernetes pods, and bringing one of the two instances back online, pointing at the new database and storage.

  5. Monitoring for any media-related issues, and bringing the second instance back online.

We’ll be beginning the maintenance window at 5 PM Mountain Time (4 PM Pacific Time) and have no ETA at this time. We’ll provide updates through our existing Telegram announcements channel at https://t.me/pawbsocial.

During this maintenance window, furry.engineer and pawb.fun will be unavailable until the maintenance concluded. Our Lemmy instance at pawb.social will remain online, though you may experience longer than normal load times due to high network traffic.


Finally and most importantly, I want to thank those who have been donating through our Ko-Fi page as this has allowed us to build up a small war chest to make this transfer possible through both new hardware and the inevitable data export fees we’ll face bringing content down from Cloudflare R2.

Going forward, we’re looking into providing additional fediverse services (such as Pixelfed) and extending our data retention length to allow us to maintain more content for longer, but none of this would be possible if it weren’t for your generous donations.

  • Exec
    link
    fedilink
    English
    arrow-up
    1
    ·
    10 months ago

    node 1 runs out of memory and basically crashes, node 2 then tries to pick up the services that are suddenly offline, immediately causing it to run out of memory and crash

    Oof, that’s pretty much a cascading failure

    • natebluehoovesMA
      link
      fedilink
      English
      arrow-up
      2
      ·
      10 months ago

      Actually yes. Recovery was slow and painful, but I have policies in place to handle these failures now. I’m sure we will find another failure mode as we go forward!