Network outage

January 10, 2024 at 8:50 AM UTC

builds.sr.ht chat.sr.ht git.sr.ht hg.sr.ht lists.sr.ht man.sr.ht meta.sr.ht sr.ht todo.sr.ht paste.sr.ht pages.sr.ht

Update 2024-01-17 10:30 UTC: full service has been restored.

Update 2024-01-16 19:45 UTC: builds.sr.ht is now running at full capacity.

Update 2024-01-16 18:00 UTC: builds.sr.ht is now available for full service. We are running it at half of the planned capacity, with the remainder coming online shortly, so you may experience longer queue times while waiting for builds to be processed.

Update 2024-01-16 10:30 UTC: Our priority for today is to bring builds.sr.ht and chat.sr.ht online, as well as address a number of less user-impacting tasks we have on our todo list.

Update 2024-01-15 22:15 UTC: Object storage for git.sr.ht (e.g. releases attached to git tags) is now available for read and write operations.

Update 2024-01-15 15:50 UTC: pages.sr.ht is now available for full service, albeit with degraded performance for publish operations.

Update 2024-01-15 08:20 UTC: hg.sr.ht is now available for full service. We are working on builds availability today.

Update 2024-01-15 07:45 UTC: We got in touch with hg.sr.ht’s community maintainer and put the finishing touches on it. It is now available in read-only mode.

Update 2024-01-14 14:45 UTC: The mail system is coming back online, and with it lists and todo are entering full service. We do not believe any emails were lost; queues should process normally and catch up on emails sent during outage intervals.

For users with questions about billing: the billing system has been shut off for the duration of the outage. Before we turn it on again, we are going to credit all paying users with free service for the duration of the outage period.

Update 2024-01-14 14:00 UTC: We have restored git.sr.ht and man.sr.ht to full service.

Note that we are still working with our existing transit provider and are experimenting with a new solution for mitigating the DDoS, but we are not certain that this approach is reliable and we are still working on setting up a more permanent transit solution in the background.

Update 2024-01-14 08:50 UTC: We have partially restored service once again, and are working on restoring more services. lists and todo are operational from the web, but we have disabled the mail system for the time being.

Update 2024-01-13 14:17 UTC: We will need to find a new transit provider to mitigate the problem. We are in talks with a provider to address the problem and we will begin the engineering work shortly. Further updates to come as we have them. Thank you again for your patience.

Update 2024-01-13 13:00 UTC: Our new transit provider has notified us that the DDoS has followed us to our new network. They have deployed mitigations and we do not expect an interruption in service. Update: service is impacted, we are investigating further.

Note that our new transit solution utilizes end-to-end encryption such that traffic between you and SourceHut is received and processed directly by our colocated servers and is not handled in plaintext by third-parties.

Update 2024-01-13 12:40 UTC: Mail service to lists and todo has been restored, subject to DNS propagation delays. Emails queued during the outage should resume processing now – we do not believe any emails have been lost.

Update 2024-01-13 11:42 UTC: We have brought pages.sr.ht into service for read-only operations. All custom domains should be working soon, subject to DNS propagation delays, with the exception of custom domains using apex records (i.e. top level domains such as example.org rather than subdomain.example.org). Manual intervention is required for affected users.

We have established a temporary IP address for serving custom domains using apex records. Users can change their apex record to the following IP address to restore read-only pages service:

@   IN  A   141.95.4.185

Note that we may change this IP address in the future. You will be notified by email if later changes are required for your domain.
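For users who keep their zone as a text file, the following is a minimal sketch of checking that the apex record points at the temporary address above. It assumes a simplified one-record-per-line zone syntax; the function names `apex_a_records` and `points_at_temporary_ip` are hypothetical and not part of any SourceHut tooling.

```python
import ipaddress

# Temporary apex address quoted in the update above; it may change later.
EXPECTED_APEX_IP = ipaddress.ip_address("141.95.4.185")


def apex_a_records(zone_text):
    """Return the IPv4 addresses of A records whose owner is the apex ("@").

    Assumes a simplified zone syntax: one record per line, laid out as
    "owner [ttl] [class] A address", with ";" starting a comment.
    """
    addresses = []
    for raw_line in zone_text.splitlines():
        line = raw_line.split(";", 1)[0].strip()  # drop zone-file comments
        fields = line.split()
        if len(fields) < 3 or fields[0] != "@" or fields[-2].upper() != "A":
            continue
        try:
            addresses.append(ipaddress.IPv4Address(fields[-1]))
        except ValueError:
            continue  # last field is not an IPv4 address; ignore the line
    return addresses


def points_at_temporary_ip(zone_text):
    """True if any apex A record matches the temporary address."""
    return EXPECTED_APEX_IP in apex_a_records(zone_text)
```

This only inspects local zone text. It does not query DNS, so a change may still be subject to propagation delays after the check passes.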

Update 2024-01-13 11:00 UTC: We have enabled read/write access to most services. hg is still read-only, and todo and lists process requests from the web but the mail system is still a work-in-progress.

Update 2024-01-13 09:26 UTC: Good morning. We have brought hg.sr.ht up in read-only mode and are working towards enabling read/write for available services today. We have also finished importing the diff from backup for git.sr.ht, and the diff for hg should finish soon.

Update 2024-01-12 20:49 UTC: We are now putting the finishing touches on our goals for today. We have 7 primary services up and running in read-only mode. An earlier issue with git clone/fetch was also fixed; this should be working again.

We have planned a lighter workload for this weekend; we need the rest. Our goal is to have hg.sr.ht online in read-only mode. We will try to get read/write service partially restored for these 7 services, plus hg, this weekend. chat.sr.ht, pages.sr.ht, and builds.sr.ht have special considerations for resumption of service which we will be planning for early next week.

Thank you for your patience, and have a good night.

Update 2024-01-12 16:38 UTC: todo and lists are coming online in read-only mode.

Update 2024-01-12 16:13 UTC: We are beginning to bring some services online. meta and git will become available as DNS propagates to your local nameserver.

My name is Drew, I’m the founder of SourceHut and one of three SourceHut staff members working on the outage, alongside my colleagues Simon and Conrad. As you have noticed, SourceHut is down. I offer my deepest apologies for this situation. We have made a name for ourselves for reliability, and this is the most severe and prolonged outage we have ever faced. We spend a lot of time planning to make sure this does not happen, and we failed. We have all hands on deck working the problem to restore service as soon as possible.

In our emergency planning models, we have procedures in place for many kinds of eventualities. What has happened this week is essentially our worst-case scenario: “what if the primary datacenter just disappeared tomorrow?” We ask this question of ourselves seriously, and make serious plans for what we’d do if this were to pass, and we are executing those plans now – though we had hoped that we would never have to.

I humbly ask for your patience and support as we deal with a very difficult situation, and, again, I offer my deepest apologies that this situation has come to pass.

What is happening?

At 06:30 UTC on January 10th, two days prior to the time of writing, a distributed denial of service attack (DDoS) began targeting SourceHut. We still do not know many details – we don’t know who the attackers are or why they are targeting us, but we do know that they are targeting SourceHut specifically.

We deal with ordinary DDoS attacks in the normal course of operations, and we are generally able to mitigate them on our end. However, this is not an ordinary DDoS attack; the attacker possesses considerable resources and is operating at a scale beyond that which we have the means to mitigate ourselves. In response, before we could do much ourselves to understand or mitigate the problem, our upstream network provider null routed SourceHut entirely, rendering both the internet at large, and SourceHut staff, unable to reach our servers.
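To illustrate why a null route cuts off legitimate users and staff along with the attacker, here is a minimal sketch of longest-prefix-match routing with a discard entry. The routing table, the next-hop names, and the `203.0.113.0/24` prefix (a documentation range) are all hypothetical. In practice, routers do this in hardware and black holes are typically signalled over BGP, which this sketch does not model.

```python
import ipaddress

DISCARD = "discard"  # marker for a null route: matching packets are dropped

# Hypothetical routing table of (prefix, next hop) pairs.
ROUTES = [
    (ipaddress.ip_network("0.0.0.0/0"), "transit"),
    (ipaddress.ip_network("203.0.113.0/24"), "customer-port"),
]


def lookup(routes, destination):
    """Longest-prefix match: the most specific route containing the address wins."""
    addr = ipaddress.ip_address(destination)
    matches = [(net, hop) for net, hop in routes if addr in net]
    if not matches:
        return None  # no route at all
    return max(matches, key=lambda match: match[0].prefixlen)[1]


def null_route(routes, prefix):
    """Return a new table in which all traffic to `prefix` is discarded."""
    net = ipaddress.ip_network(prefix)
    kept = [(n, hop) for n, hop in routes if n != net]
    return kept + [(net, DISCARD)]
```

The lookup depends only on the destination address, so once the prefix is null routed every packet addressed to it is dropped, whatever its source.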

The primary datacenter, PHL, was affected by this problem. We rent colocation space from our PHL supplier, where we have our own servers installed. We purchase networking through our provider, who allocates us a block out of their AS, and who upstreams with Cogent, which is the upstream that ultimately black holed us. Unfortunately, our colocation provider went through two acquisitions in the past year, and we failed to notice that our account had been forgotten as they migrated between ticketing systems through one of these acquisitions. Thus unable to page them, we were initially forced to wait until their normal office hours began to contact them, 7 hours after the start of the incident.

When we did get them on the phone, our access to support ticketing was restored, they apologised profusely for the mistake, and we were able to work with them on restoring service and addressing the problems we were facing. This led to SourceHut’s availability being partially restored on the evening of January 10th, until the DDoS escalated in the early hours of January 11th, after which point our provider was forced to null route us again.

We have seen some collateral damage as well. You may have noticed that Hacker News was down on January 10th; we believe that was ultimately due to Cogent’s heavy-handed approach to mitigating the DDoS targeting SourceHut (sorry, HN, glad you got it sorted). Last night, a non-profit free software forge, Codeberg, also became subject to a DDoS, which is still ongoing and may be caused by the same actors. This caused our status page to go offline – Codeberg has been kind enough to host it for us so that it’s reachable during an outage. We’re not sure if Codeberg was targeted because they hosted our status page or if this is part of a broader attack on free software forge platforms.

What are we doing about it?

We maintain three sites, PHL, FRE, and AMS. PHL is our primary and is offline, FRE is our backup site, and AMS is a research installation we eventually hoped to use to migrate our platform to European hosting. As we initially had no access whatsoever to PHL, we began restoring from backups to AMS to set up a parallel installation of SourceHut from scratch.

We have since received some assistance from our PHL provider in regaining access to our PHL servers out of band, which is speeding up affairs, but we do not expect to get PHL online soon and we are proceeding with the AMS installation for now.

The prognosis on user data loss is good. Our backups are working and regularly tested, the last full backup of git and hg was taken a few hours before the DDoS began, and we have out-of-band access to the live PHL servers where all changes which occurred since the most recent backup are safely preserved. The database is replicated in real-time and was only seconds behind production before it went offline.

We have replicated the production database in AMS and started spinning up SourceHut services there: we have meta, todo, lists, paste, and the project hub fully operational against production data in our staging environment here. We are still working on the following services in order of priority:

git.sr.ht
hg.sr.ht
pages.sr.ht
chat.sr.ht
man.sr.ht
builds.sr.ht

These services, particularly git and hg, require large transfers of data across our networks to restore from backups, and will take some time. Chat does not require particularly large amounts of data to be managed, but has special networking concerns that we are addressing as well.

Our goal is to enable read-only access for the community as quickly as possible, then work on full read/write access following that. Object storage (used for git/hg releases, build artifacts, and SourceHut pages) presents a special set of problems; we are working on those separately. Finding suitable compute to run build jobs is another issue which requires special attention, but we have a plan for this as well.

One of our main concerns right now is finding a way of getting back online on a new network without the DDoS immediately following us there, and we have reason to believe that it will. A layer 3 DDoS like the one we are facing is complex and expensive to mitigate. We spoke to Cloudflare and were quoted a number we cannot reasonably achieve within our financial means, but we are investigating other solutions which may be more affordable and have a few avenues for research today, though we cannot disclose too many details without risking alerting the attackers to our plans.

How you can help

What we need the most right now is your patience and understanding. Mitigating this sort of attack is a marathon, not a sprint, and we have to be careful not to overwork our staff, ensure we’re getting enough sleep, and so on – we are working as hard as we can. There are many people hard at work on this problem for you – I’d like to thank Simon and Conrad in particular for their work, as well as the datacenter and network operators upstream of us who are doing their best as well.

You can receive updates on this page, so long as we’re able to keep it online (low priority), as well as on Mastodon. Mastodon is also a good place to share your words of support and encouragement, as is the #sr.ht IRC channel on Libera Chat. My inbox at sir@cmpwn.com is also working (not without some effort, I’ll add), if you wish to send your support or offer any resources that might help.

Thank you for your patience and support. We are working to make things right with you.

Note: at this point our status page went offline. We prepared a temporary status page; the text above has been imported from this status page.

The migration is proceeding. We are making good progress. We have a staging environment with production data (with no loss of user data) working for the following services:

meta.sr.ht
sr.ht (project hub)
lists.sr.ht
todo.sr.ht
paste.sr.ht

We will strive to have these services available to you soon. We are still working on the following services:

git.sr.ht
hg.sr.ht
chat.sr.ht
pages.sr.ht
builds.sr.ht
man.sr.ht

These are currently being restored from a backup taken several hours prior to the start of this incident, two days ago, which will require several more hours to complete. Following this, we have a procedure planned which will restore any data changed in the two-day window since this backup was taken.

We are prioritizing the following services, in order:

git repositories
Mercurial repositories
Object storage – git/hg releases, build artifacts, and SourceHut pages
chat.sr.ht logs
builds.sr.ht logs

The object storage may take as much as a few days to restore to full service; however, we may have read-only service working sooner.

We do not anticipate that we will have builds fully operational until we provision more compute to handle user builds, which may take another day or two.

Thank you for your patience. (18:49 UTC — Jan 11)

PHL service unreliable. The network to our PHL installation is unreliable. When available, it is configured in a read-only mode. We are still proceeding with the migration for the time being. (13:43 UTC — Jan 11)

PHL network restored. Cogent removed the black hole a few minutes ago, restoring access to our PHL installation. However, we are partway through our migration and have placed PHL into a read-only configuration for the time being. Further updates to come. (13:19 UTC — Jan 11)

Migration is in progress. We are partway through setting up a new installation in one of our secondary datacenters. We do not have an ETA but work is proceeding apace. (11:32 UTC — Jan 11)

We are preparing a migration. We do not see a near-term resolution being possible with our upstream network provider and we are preparing to restore service from backups in a new installation. It will take a while, but we are all-hands to address the issue. (08:48 UTC — Jan 11)

Cogent strikes again. They broke it again this morning. (07:00 UTC — Jan 11)

Service is partially restored. All services save for the project hub are now operational. Work proceeds to finish restoring all services. (18:23 UTC — Jan 10)

Root cause identified, mitigations underway. We have received another update from our NOC. Their upstream provider black holed our ASN to mitigate a DDoS coming into our network by hijacking our BGP routes to point to their black hole ASN. Untangling this requires some work, but we should be coming partially online as the network is reconfigured shortly. Full restoration of service may require some time still. (16:17 UTC — Jan 10)

Investigation still ongoing. We just received an update from the NOC. They’re working the issue but there is no ETA on the resolution. “The team is working on the problem as our sole priority”. (15:04 UTC — Jan 10)

NOC is investigating the issue. We got in touch with network operations and the issue is under investigation. (14:11 UTC — Jan 10)

We are unable to reach our emergency NOC. We have been attempting to reach the emergency datacenter operations contacts for the past hour without success; there is an issue with their emergency ticketing system. We will have to wait until their non-emergency line is available, which should be at 9 AM EST (ETA 4 hours). (09:56 UTC — Jan 10)

Issues with our upstream network provider are causing intermittent outages. We are having an issue with our upstream network provider which is causing intermittent outages. We are investigating the issue. (08:50 UTC — Jan 10)