I work with and on
SmokeDetector (SD). Over the last several days (starting last Monday, 2020-05-111), we've noticed times when SmokeDetector has stopped scanning posts on Stack Exchange. After investigating why this was happening, I found there are now times when SmokeDetector does connect to the Stack Exchange
WebSocket, but doesn't get any notifications of activity anywhere on the SE network.2
What happens: no errors, but no data for about 9.67% of connection attempts
The WebSocket is opened, a subscription is sent, but the only data that is sent from SE on the WebSocket is the heartbeat.3 No data is sent other than the heartbeat, no mater how long the socket is left open, even hours. No errors are raised on the socket. There's just no data.
In testing, I've been able to duplicate the problem in both Python and JavaScript (multiple browsers and IP addresses). I've run 331 trials of opening the WebSocket, sending the
155-questions-active
channel specifier and then waiting for data. Out of those trials, 32 times the WebSocket connected, but failed to send data, other than the heartbeat.4 That works out to failing for 9.67% of connection attempts.
Does this affect other things than SmokeDetector? Yes, it affects SE in general.
Stack Exchange uses WebSockets for push notifications of status change on a wide variety of data across the network. This problem is likely affecting a significant percentage of times where an endpoint connects to the server.
The problem is probably affecting roughly 10% of page views, but it's not something that's going to be readily noticed as a specific, consistent problem, because it affects live page updates after the page is loaded (much less noticeable), and will usually resolve itself when the page is reloaded.
On each page load, the JavaScript on the page connects to the WebSocket in order to keep the page "alive" (i.e. get notifications of actions done elsewhere). On the random page loads where the problem will occur, it will just feel like the page is a bit more static than normal. The page will feel "Dead". Given that it's something that has a 90% chance of being resolved when the page is reloaded, it's likely that, when it's noticed, it's just being written off by people as a glitch.
Over the last few days, I've definitely noticed times when there have been missing notifications within a SE page for events that happen on the network (e.g. not informed of post score changes, a post being deleted or edited, inbox notifications, reputation changes, a substantial difference between updates on the same page in two tabs, etc.). The problem has been noticed
by at least one other person. It's likely been noticed by many other people, but if you're not aware that there's a systemic problem, it's easy to just wave it off as some intermittent issue, as it's 90% likely to go away when the page is reloaded.
Mitigation
We've written code to mitigate the issue for SmokeDetector (detect that we're not getting data and reconnect). For normal viewing of Stack Exchange pages, just reload the page.
Duplication
You can try connecting to the SE WebSocket for yourself using the Stack Snippet below:
Code:
(() => {
'use strict';
let webSocket = null;
let noAdd = true;
let startTime = 0;
const passEl = document.getElementById('pass');
const failEl = document.getElementById('fail');
const failPercentEl = document.getElementById('failPercent');
const startTimeEl = document.getElementById('startTime');
const elapsedTimeEl = document.getElementById('elapsedTime');
const diffLabel = {
days: 'd',
hours: 'h',
minutes: 'm',
seconds: 's',
};
const diffOrder = [
'days',
'hours',
'minutes',
'seconds',
];
function incrementAndShowPercent(addPass, addFail) {
if (noAdd) {
return;
}
const pass = +passEl.textContent + addPass;
passEl.textContent = pass;
const fail = +failEl.textContent + addFail;
failEl.textContent = fail;
const failPercent = Math.round((10000 * fail) / (pass + fail))/100
failPercentEl.textContent = failPercent;
if (fail) {
failEl.classList.add('isFailing');
failPercentEl.classList.add('isFailing');
}
/* The elapsed time code was copied, then significantly modified, from
* https://codereview.stackexchange.com/a/160240/53535 by Przemek
* https://codereview.stackexchange.com/users/97934/przemek */
const elapsed = (Date.now() - startTime) / 1000;
const diff = {
days: Math.floor(elapsed / 86400),
hours: Math.floor((elapsed / 3600) % 24),
minutes: Math.floor((elapsed / 60) % 60),
seconds: Math.floor(elapsed % 60),
};
const elapsedText = diffOrder.reduce((sum, key) => {
if (sum || diff[key] || key === 'seconds') {
sum += ` ${diff[key]}${diffLabel[key]}`;
}
return sum;
}, '').trim();
elapsedTimeEl.textContent = elapsedText;
}
function registerWebSocket() { // eslint-disable-line no-unused-vars
if (webSocket) {
if (typeof webSocket.removeEventListener === 'function') {
//There can be additional events that mess up the count.
webSocket.removeEventListener('message', socketOnMessage);
}
if (typeof webSocket.close === 'function') {
// Close any existing WebSocket.
webSocket.close();
}
}
webSocket = new WebSocket('wss://qa.sockets.stackexchange.com');
webSocket.addEventListener('message', socketOnMessage);
webSocket.addEventListener('open', () => {
//console.error('WS: Open:');
webSocket.send('155-questions-active');
console.log('vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv');
console.log('WS: Open & Subscribed to active questions');
noAdd = false;
});
webSocket.addEventListener('close', event => {
//console.log('WS: Closed: event.code', event.code, ':: event:', event);
console.log('WS: Closed');
});
webSocket.addEventListener('error', event => {
console.error('WS: Error:');
console.log('WS: Error: event', event);
});
}
function socketOnMessage(event) {
const messageObject = JSON.parse(event.data);
const data = messageObject.data;
let dataObject = data;
try {
dataObject = JSON.parse(data);
} catch(error) {
// Ignore any errors
}
messageObject.data = dataObject;
console.log('WS: messageObject:', messageObject);
if (messageObject.action === 'hb') {
//At least for the 155-questions-active socket, the hb is every 5 minutes.
incrementAndShowPercent(0, 1);
noAdd = true;
console.error('WS: heartbeat Received:');
//console.error('WS: Responding to heartbeat:');
//webSocket.send(data);
registerWebSocket();
} else {
incrementAndShowPercent(1, 0);
noAdd = true;
registerWebSocket();
}
}
document.getElementById('start').addEventListener('click', () => {
passEl.textContent = 0;
failEl.textContent = 0;
failEl.classList.remove('isFailing');
failPercentEl.textContent = 0;
failPercentEl.classList.remove('isFailing');
startTime = Date.now();
startTimeEl.textContent = (new Date(startTime)).toISOString();
registerWebSocket();
});
document.getElementById('stop').addEventListener('click', () => {
webSocket.close();
});
})();
Code:
span:not([id]) {
margin-left: 10px;
}
span.extraSpace {
margin-left: 20px;
}
#fail.isFailing,
#failPercent.isFailing {
color: red;
font-weight: bold;
}
#pass,
#failPercent {
color: #00d000;
font-weight: bold;
}
Code:
<button id="start">Start/Restart</button>
<button id="stop">Stop</button>
<span> Pass: <span id="pass">0</span></span>
<span> Fail: <span id="fail">0</span></span>
<span>Fail %: <span id="failPercent">0</span> %</span>
<span class="extraSpace">Started: <span id="startTime"></span></span>
<span class="extraSpace">Elapsed: <spam id="elapsedTime"></span></span>
</br>You may want to open the Web Console (F12), as it's easier to read the output there.</br>
If you don't see anything in the console for 5 minutes after "WS: Open & Subscribed to active questions", that means that attempt to connect (probably) failed.</br>
- See gap in "API quota rolled over with" messages in this chat search. There should be one message per day, every day at nearly the same time, unless we've switched instances. This represents days where SmokeDetector stalled out in processing posts. In addition, the following image shows times when there was no scanning or update to the reported SE API quota (both indicated with red circles). The green circle shows what the rising edge of the API quota graph should normally look like. The fact that it isn't completely vertical in the red circled areas is due to data missing as a result of no scans being performed
To be clear, the dates and times shown above are only show the points in time at which the problem randomly affected SmokeDetector and the length of time from then to when the bot was rebooted. Those dates and times don't reflect when the problem was more likely to occur.
- SD uses the
155-questions-active
channel on the WebSocket available from wss://qa.sockets.stackexchange.com/
, which notifies about (almost) all post activity on all sites.
- The "heartbeat" is a message that SE sends which says "hb". After receiving the heartbeat, the consumer of the WebSocket must send some data to indicate that the WebSocket is still active. The "heartbeat" is administrative and doesn't have any data for the user to consume.
- "Failure" here is defined never receiving any data for a full 5 minutes from the opening of the WebSocket to receiving the first heartbeat on the socket. While it's possible that some of these failures were that there just wasn't any activity on any SE site for 5 minutes, that's … very rare.