[Maintenance] Feb 7 - Mastodon Data Migration

Crashdoom (he/him) · 1 year ago


huxley · 1 year ago

Looks like furry.engineer is down?

Stefen Auris · 1 year ago

I’m seeing the same here, something about an Argo tunnel error. @crashdoom@pawb.social

Crashdoom (he/him) · 1 year ago

Aware and investigating!

Stefen Auris · 1 year ago

and that’s why you’re the best <3

liquidparasyte · 1 year ago

pawb.fun as well. Something got fucky wucky during the migration, it seems.

natebluehooves · 1 year ago

Correct! To give a bit of background while I wait for backups…

Last night we had what appears to be an out-of-memory error. Our Cloudflare tunnels broke around the same time that the internet went out (probably related), and we also didn’t have our nodes configured to keep some RAM reserved to allow Kubernetes to keep running. Additionally, we still only had 1 replica of the data for furry.engineer and pawb.fun, which we were still building/downloading from other instances (mostly cached images).

So it was the perfect storm. Node 1 runs out of memory and basically crashes, node 2 then tries to pick up the services that are suddenly offline, immediately causing it to run out of memory and crash. There’s only one copy of the data, so there’s nothing offline to check against for corruption. All the storage with 2 replicas was unaffected.

I’ve done an announcement post on the Telegram channel to try and keep people apprised, but this restore is probably going to take another couple of hours because I’m trying not to repeat my mistakes by setting things to 1 replica or skipping backups for expediency. My impatience pretty directly caused this issue.
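
For anyone following along, the cascade described above (node 1 dies, node 2 inherits its services and dies too) can be sketched as a toy model. This is only an illustration under stated assumptions: the `simulate` function, the 64 GB node size, the 50 GB loads, and the 4 GB reservation are all made-up values, not the actual cluster configuration. The `enforce_limits` flag loosely stands in for the idea of keeping some RAM reserved and only placing workloads that fit, in which case the leftover services stay pending instead of taking the next node down.

```python
def simulate(capacity_gb, loads_gb, reserved_gb=0.0, enforce_limits=False):
    """Crash node 0 and push its load onto the remaining nodes in order.

    Returns (indices of crashed nodes, GB of load left without a home).
    All numbers are hypothetical and only illustrate the failure mode.
    """
    crashed = [0]
    displaced = loads_gb[0]  # services that were running on the failed node

    for i in range(1, len(loads_gb)):
        if displaced <= 0:
            break
        if enforce_limits:
            # Only place what fits below capacity minus the reserved memory;
            # whatever does not fit stays pending and moves on to the next node.
            free = max(capacity_gb - reserved_gb - loads_gb[i], 0.0)
            displaced -= min(displaced, free)
        elif loads_gb[i] + displaced > capacity_gb:
            # No limits: the node accepts everything, runs out of memory,
            # and its own services join the displaced pile.
            crashed.append(i)
            displaced += loads_gb[i]
        else:
            displaced = 0.0  # the node absorbed the whole load

    return crashed, displaced


# Two 64 GB nodes, each already running 50 GB of services.
print(simulate(64, [50, 50]))
# -> ([0, 1], 100): both nodes down, everything offline

print(simulate(64, [50, 50], reserved_gb=4, enforce_limits=True))
# -> ([0], 40): node 1 survives, 40 GB of services wait for capacity
```

In the second run a chunk of the services is still unavailable, but the surviving node stays up, so recovery means waiting for capacity rather than restoring from backups.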

Vincent Hayes · 1 year ago

SysAdmin lesson learned, always make the backups :3

natebluehooves · 1 year ago

Lessons do stick around when you have to learn the hard way!

Exec · 1 year ago

“node 1 runs out of memory and basically crashes, node 2 then tries to pick up the services that are suddenly offline, immediately causing it to run out of memory and crash”

Oof, that’s pretty much a cascading failure

natebluehooves · 1 year ago

Actually yes. Recovery was slow and painful, but I have policies in place to handle these failures now. I’m sure we will find another failure mode as we go forward!