2006-04-09: 19:10 UTC     Kernel update maintenance

Between 18:00 and 18:10 EDT (22:00 and 22:10 UTC) today, the IMAP/POP server with most of our customers' mailboxes, m3.mxes.net, will be rebooted to introduce a kernel parameter change. IMAP and POP services will not be available for approximately 5 minutes during this 10-minute maintenance window.

Further investigation shows that, in all probability, the IMAP server panic on March 10 was due to hitting a compiled-in kernel limit and not due to a bug in the file system code, as was reported after the March 10 panic.

Our apologies in advance for this brief service disruption.

2006-03-23: 08:20 UTC     Server down

Server is back up at 03:55 EST (08:55 UTC).

Web services have been moved to a machine running a new kernel with a fix for a bug that most likely caused the file system corruption on the machine running web services on the 21st. The problem on the 21st was compounded by the RAID system not taking a failing SCSI drive offline and rebuilding on a hot spare drive.

The crash this morning was caused by a hardware problem. Replacement hardware will arrive tomorrow.

The IMAP server panic on the 10th was most likely caused by the same kernel bug that caused the file system corruption on the 21st. The kernel on this IMAP server will be updated this weekend or next when system usage is low.

At 03:15 EST (08:15 UTC) web services are not available due to a server crash. IMAP and SMTP services are not affected.

2006-03-21: 15:30 UTC     RAID array issues

20:20 EST Due to a configuration error, the standard HTTP port, port 80, was not permitted through the firewall. The secure HTTPS port, port 443, was allowed and customers using HTTPS were able to access the web services.

20:05 EST Web services restored at 20:02 EST (01:02 UTC).

This morning at 10:07 EST (15:07 UTC) a machine developed a disk space problem that was resolved by 10:20 EST (15:20 UTC). There is an ongoing issue with the RAID system that will require shutting down this machine.

The affected machine runs two customer-visible services: a server in the SMTP cluster and a web server for the Manager and the production web clients. Taking this machine offline will be transparent to customers except for web services, which will be unavailable for approximately two minutes between 20:00 EST and 20:05 EST (01:00 and 01:05 UTC). Some customers may lose their login session when web services are restored. Our apologies in advance for this disruption.

2006-03-10: 22:50 UTC     IMAP server problem

Update: 18:56 EST (23:56 UTC) The IMAP server holding most of our customers' mailboxes panicked, resulting in damage to the file system containing the Cyrus metadata. The file system has been repaired but some metadata files have been lost. These files can be reconstructed but some customers will not be able to receive new mail or access [...]. No existing mail was lost and new mail is queued on the MX servers.

We have elected not to switch to a replica IMAP server and we will continue to investigate the cause of the panic.

The affected IMAP server should be back online by 19:10 EST (00:10 UTC). We apologize for this outage.

We are having trouble with one IMAP server. We are working the problem but there is no ETR at this time.

2006-03-10: 15:13 UTC     Changes in forwarding update

Effective today at 12:00 EST (17:00 UTC), all forwarded mail scoring 6.0 and higher is being discarded.

2006-02-11: 14:40 UTC     Changes in forwarding

Today and over the next few weeks we will be introducing some changes that will improve the deliverability of forwarded mail. There are two problems that these changes will solve or at least minimize.

We are seeing more systems that are rejecting mail that violates the sender's published SPF policy. Mail from a domain with a published SPF policy that we forward on behalf of our customer may violate the sender's SPF policy because the sender has no way to know that the recipient is forwarding mail to another system. The recipient that is forwarding the mail has no way to know in advance that the sender does not wish mail from their domain to be forwarded.
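
The failure mode can be illustrated with a toy model. This is only a sketch, not a real SPF evaluator: it treats a policy as a plain list of permitted networks, and the addresses and the `spf_permits` helper are hypothetical, taken from the documentation address ranges.

```python
import ipaddress


def spf_permits(policy_networks, connecting_ip):
    """Return True if the connecting IP falls inside any permitted network.

    Toy stand-in for an SPF check: a real policy is published in DNS and
    supports many more mechanisms than a list of networks.
    """
    ip = ipaddress.ip_address(connecting_ip)
    return any(ip in ipaddress.ip_network(net) for net in policy_networks)


# Hypothetical values for illustration only.
SENDER_POLICY = ["192.0.2.0/24"]      # networks the sending domain permits
ORIGINAL_SENDER_IP = "192.0.2.10"     # the sender's own mail server
FORWARDER_IP = "198.51.100.25"        # a forwarding server the sender never listed

# Delivered directly, the message comes from a permitted address.
direct_ok = spf_permits(SENDER_POLICY, ORIGINAL_SENDER_IP)

# Forwarded with the envelope sender unchanged, the final system sees the
# forwarder's address, which the sender's policy does not permit.
forwarded_ok = spf_permits(SENDER_POLICY, FORWARDER_IP)

print(direct_ok, forwarded_ok)
```

The final system checks the address of whichever server connects to it, so the same message passes when delivered directly and fails once it has been forwarded.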

In almost all cases the rejected mail is spam with a forged sender address, but we have seen a small percentage of legitimate mail rejected. It is this small percentage that we would like to see accepted by the system that the mail is forwarded to.

The SPF forwarding problem is well known and the debate about whether SPF breaks forwarding or forwarding is inherently broken still rages on.

To ensure deliverability of mail that is forwarded, SRS will be enabled for all forwarded mail. Currently SRS is enabled for mail forwarded to AOL and a few dozen other systems.
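
For readers unfamiliar with SRS, the general idea of envelope sender rewriting can be sketched as follows. This is a simplified illustration and not the implementation used on this system: the secret, the forwarder domain, and the fixed timestamp field are assumptions, and real SRS implementations encode a day counter in the timestamp and handle already-rewritten addresses.

```python
import base64
import hashlib
import hmac


def _tag(secret, timestamp, domain, local):
    """Short keyed hash binding the rewritten address to our secret."""
    payload = "".join([timestamp, domain, local]).lower().encode()
    digest = hmac.new(secret, payload, hashlib.sha1).digest()
    return base64.b64encode(digest).decode()[:4]


def srs_forward(sender, forwarder_domain, secret, timestamp="AA"):
    """Rewrite an envelope sender so it belongs to the forwarder's domain.

    The timestamp is a fixed placeholder here (an assumption of this sketch).
    """
    local, _, domain = sender.rpartition("@")
    if not local or not domain:
        raise ValueError("sender must look like local@domain")
    tag = _tag(secret, timestamp, domain, local)
    return f"SRS0={tag}={timestamp}={domain}={local}@{forwarder_domain}"


def srs_reverse(address, secret):
    """Recover the original sender from a rewritten address, e.g. for a bounce."""
    local_part = address.rpartition("@")[0]
    fields = local_part.split("=", 4)
    if len(fields) != 5 or fields[0] != "SRS0":
        raise ValueError("not a rewritten address")
    _, tag, timestamp, domain, local = fields
    if not hmac.compare_digest(tag, _tag(secret, timestamp, domain, local)):
        raise ValueError("hash mismatch")
    return f"{local}@{domain}"


# Hypothetical addresses and secret, for illustration only.
rewritten = srs_forward("alice@example.org", "forwarder.example", b"not-a-real-secret")
original = srs_reverse(rewritten, b"not-a-real-secret")
print(rewritten, original)
```

Because the rewritten sender is in the forwarder's own domain, the final system evaluates the forwarder's SPF policy rather than the original sender's, and the keyed hash lets the forwarder route bounces back without accepting arbitrary rewritten addresses.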

Another problem is that more and more systems are rejecting spam that is forwarded to them. When a system that we are forwarding mail to rejects a message, a Delivery Status Notification must be delivered to the sender. If the rejected message is truly spam, and it almost always is, the sender is most likely forged and the DSN is sent to the innocent forgery victim, adding to the bad and growing backscatter problem.

Another issue with forwarding spam is that many systems pay attention to what IP addresses are sending them spam in the same way that this system does. The recipient system does not know or care that the spam is forwarded on behalf of a customer. To the receiving system the message is just spam. Too much spam from an IP address and the IP address is blocked. Blame the spammers, not the mail system operators.

To reduce the backscatter that this system generates and to reduce the possibility of our forwarding servers being blocked and preventing mail from being forwarded, messages with a spam score of 6.0 and higher will no longer be forwarded. This change has been in effect for mail to AOL and its subsidiaries since May 2005 with no customer complaints.

Starting today messages scoring 6.0 and over will not be forwarded to adelphia.net, yausi.com, yahoo.com, hotmail.com and cox.net. SRS envelope sender re-writing is also being applied to mail forwarded to those domains.
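
The rule described above amounts to a simple decision per message. The following is a minimal sketch of that logic, assuming a per-message spam score is already available; the function and constant names are hypothetical, and AOL's subsidiaries are represented by aol.com alone.

```python
# Score at or above which a forwarded message is discarded.
SPAM_SCORE_THRESHOLD = 6.0

# Destination domains the rule applies to, as listed in this announcement.
# (aol.com stands in for AOL and its subsidiaries.)
FILTERED_DOMAINS = {
    "aol.com",
    "adelphia.net",
    "yausi.com",
    "yahoo.com",
    "hotmail.com",
    "cox.net",
}


def forwarding_action(score, destination):
    """Return "discard" or "forward" for a message bound for destination."""
    domain = destination.rpartition("@")[2].lower()
    if domain in FILTERED_DOMAINS and score >= SPAM_SCORE_THRESHOLD:
        return "discard"
    return "forward"


print(forwarding_action(18.2, "somebody@aol.com"))
print(forwarding_action(2.1, "somebody@aol.com"))
```

Discarding at this point, rather than forwarding and letting the destination reject the message, is what avoids generating a Delivery Status Notification to a forged sender.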

Along with this change is a real-time report showing forwarded messages that have been discarded.

To: somebody@aol.com
recipient@domain-hosted-here
1 18.2 chancei@ambienttech.com
1 28.5 charlenehipes@asamoacpa.com
1 20.3 daiangelag@0451.com
1 15.5 fpeters@globalfriends.us
1 12.5 gar@directwest.com
1 18.2 gunnersoe@rednet.com
1 12.5 kph@gfj.bdstnetd.com
1 23.9 meeny.beatrix9hm@gmail.com
Select Account -> Real-Time Reports -> Forwarded Discards to view the reports for your account.

2005-11-11: 18:00 UTC     Fastmail box poll errors

The errors customers are seeing for box polls configured to fetch mail from Fastmail are due to a server failure at Fastmail and not due to problems with this system.

2005-10-25: 10:30 UTC     Office telephone outage

Update Oct 31, 2005: Telephone service was restored this past weekend.

We had dialtone and working lines yesterday but this morning nothing. The problem is most likely due to batteries failing in one of the BellSouth fiber huts. We have no estimate of when telephone service will be restored.


2005-10-24: 17:30 UTC     Desktop client access

Late this morning a problem with the IMAP and POP multiplexors caused some connections to time out. Webmail and SMTP services were not affected. The exact time and duration of this problem is lost in the blur of the day resulting from dealing with Hurricane Wilma last night and this morning. The duration was most likely less than 90 minutes.

2005-10-15: 14:00 UTC     Real time reports

Real-time reporting of messages rejected by the MX servers, messages infected with viruses that were discarded, and messages delivered to mailboxes was added to the Account Manager last week in an expanded beta test. Real-time reporting is available for hosted email accounts and will be available for MX relay customers later this month.

The last known bug was fixed today and with that fix the delivery reports now include messages accepted for forwarding to external accounts as well as messages that will be delivered to mailboxes. Fixing this bug required deleting all delivery data prior to 14:00 EDT (18:00 UTC) today.

Because of the huge number of unknown address rejections due to backscatter from addresses forged by spammers and viruses, the reject reports include only a count of those messages.


2005-09-22: 19:15 UTC     Another Power Failure

Update: All services were back online by 15:55 EDT (19:55 UTC). The outage was caused when the building electricians accidentally tripped the main panel breaker while removing the panel cover for further investigation into why the breaker failed yesterday.

The datacenter has contracted with an engineering firm to determine where the weak points are in their power distribution systems and what they can do to improve the power system.

Starting tomorrow evening we will begin moving servers and routers that provide load sharing and redundant routing into another cage. This work will take several days to complete and service outages are expected to be very brief. This work will be done between 20:00 and 00:00 EDT (00:00 - 04:00 UTC) tomorrow evening and between 16:00 and 20:00 EDT (20:00 - 00:00 UTC) Saturday and Sunday, with the same schedule the following weekend.

Update on the power failure during Hurricane Katrina. Lightning struck one of the generators supplying power to the first floor, taking out the fuel pump that feeds fuel from the large ground level storage tanks to the generator day tank. The generator continued to operate until its day tank was empty. To help prevent failures of this nature in the future, additional monitoring and pumps are being installed.

15:10 EDT (19:10 UTC) Power has been restored and services will be back online shortly.


2005-09-21: 15:50 UTC     Power Failure

Update: Power was restored around 12:10 EDT (16:10 UTC) and IMAP access via the webmail clients was restored shortly after that. Full access to the IMAP and POP servers followed and finally SMTP access was restored. We are still working to bring up the backup RAID systems.

The cause of the failure was an overheated 208V 3-phase circuit breaker in the distribution panel that supplies our racks. The breaker failure was complete and it had to be replaced.

Our second cage in the same datacenter is powered from a different 208V distribution panel. The redundant servers, routing, and switching will be moved into that cage starting later this week.

We have lost all power to one of our cages at the datacenter. We will be moving enough servers to our second cage to bring services back online.