Description of Internet Outage Datasets

This page describes the format of our Internet outage datasets.

We have three primary formats:

  • raw probing output, which is collected at each observer
  • outages data, an integrated output that merges data from all observers
  • outagedownup format, a more “cooked” form of outages

We recommend using outagedownup data for most purposes, since it includes post-processing that cleans up some known flaws that occur in the raw data.

Sites

As of Oct. 2017, our sites are:

w: ISI-West in Los Angeles;
c: Colorado data from Ft. Collins;
j: Japan data from Keio University (SJF campus) near Tokyo;
e: ISI-East data from near Washington, DC;
g: Greek data from Athens University of Economics and Business;
n: Netherlands data from SurfNet.

Outage Probing Results

Outage probing output is provided for each site (see “Sites” above for details).

Each dataset includes the input to the prober in several formats and the output.

Output is in tab-separated text (FSDB format), with the following schema:

block: hex format of /24 IP block, with trailing zeros (A7 omits trailing zeros)
round_no: round number in this batch (will reset each time we restart)
round_start_epoch: when the round began, in seconds since 1970
a_short: the short term estimate of availability
a_oper: the operational estimate of A value (long term and reflecting variance)
status: status of this block: A12 and later: 0 for down, 1 for up, 2 for unknown
belief: our belief the block is down
n_pos: number of positive responses in this round
n_neg: number of negative responses in this round
probe_log: A base-64-encoded list of what specific addresses were probed. (Only in a18 and later).
rtt_us: estimated round-trip time in microseconds. (Only in a20 and later.)

A sample of raw data from dataset internet_outage_adaptive_a30w-20171006, file data/pinger-w4.e1507326545.a30w.2.r0.001.fsdb.bz2:

#fsdb -F t block round_no round_start_epoch a_short a_oper status belief n_pos n_neg probe_log rtt_us
58bae200        0       1507326545      0.8766  0.4383  1       0.01    1       0       CiQ=    219051
bd378a00        0       1507326545      0.3644  0.1822  1       0.01    1       0       CqQ=    204900
342e1a00        0       1507326545      0.2879  0.1439  1       0.01    1       0       CgQ=    158242
83c1c200        0       1507326545      0.1323  0.06614 1       0.01    1       0       CqU=    64556
d02afa00        0       1507326545      0.7865  0.3932  1       0.01    1       0       CuQ=    60168
...

The data shows the schema (the #fsdb line), followed by data for block 0x58bae200, which is 88.186.226.0/24, taken at 1507326545 seconds past the Unix epoch (2017-10-06t21:49:05Z). The block was detected as up (status is 1), and the positive response arrived with a round-trip time of 219.051 ms. Other lines show other blocks, all probed at this time.
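
As a rough illustration of the decoding described above, the following Python sketch reads column names from the #fsdb header line, converts the hex block to dotted-quad notation, and converts the epoch and RTT fields. The function names (parse_fsdb_header, hex_block_to_cidr, decode_probe_row) are hypothetical and not part of any released tool. The sketch assumes whitespace-separated fields, blocks written with trailing zeros (so not A7), the A12-and-later status codes, and header flags that each take one argument (as in "-F t").

```python
import ipaddress
from datetime import datetime, timezone

STATUS_NAMES = {"0": "down", "1": "up", "2": "unknown"}  # A12 and later


def parse_fsdb_header(line):
    """Return column names from a '#fsdb -F t col1 col2 ...' header line."""
    fields = line.split()
    if not fields or fields[0] != "#fsdb":
        raise ValueError("not an FSDB header line")
    cols = []
    i = 1
    while i < len(fields):
        if fields[i].startswith("-"):
            i += 2  # assumption: every flag is followed by one argument
        else:
            cols.append(fields[i])
            i += 1
    return cols


def hex_block_to_cidr(block_hex):
    """Convert an 8-digit hex /24 block (e.g. '58bae200') to CIDR text."""
    return f"{ipaddress.IPv4Address(int(block_hex, 16))}/24"


def decode_probe_row(cols, line):
    """Decode one raw probing row into human-readable values."""
    row = dict(zip(cols, line.split()))
    when = datetime.fromtimestamp(int(row["round_start_epoch"]), tz=timezone.utc)
    return {
        "block": hex_block_to_cidr(row["block"]),
        "time": when.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "status": STATUS_NAMES[row["status"]],
        "rtt_ms": int(row["rtt_us"]) / 1000.0,  # microseconds to milliseconds
    }
```

Applied to the header and first data line of the sample, decode_probe_row should report block 88.186.226.0/24 as up, with the time and RTT given in the paragraph above.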

“Outages” Format Raw Outages

“Outages” format data merges all observing sites for one time period (see “Sites” above for details; time periods are typically quarters).

Output is in tab-separated text (FSDB format), with the following schema:

block: block address of the /24 in hex (with trailing zeros)
start: when the status takes effect (seconds since the Unix epoch)
duration: how long the status is in effect
uncertainty: our confidence in the precision of the start time. In non-raw data uncertainty is sometimes lowered when we merge observations from multiple observers.
precision_improvement: is either unused (‘-’) or precision improvement of the onset of a state change resulting from merging data from multiple vantage points
status: vantage point that saw the outage (each letter ‘c’, ‘j’, ‘w’, ‘g’ is one of the sites from our observers; the corresponding capital letter ‘W’, ‘C’, ‘J’, ‘G’ means the vantage point saw no outage; the order is fixed to [wW][cC][jJ][gG])

A sample of outages data from dataset internet_outage_adaptive_a30all-20171006, file a30all.outages.fsdb.bz2:

#fsdb -F t block start duration uncertainty status
01000400        1507326957      23      660     W
01000400        1507326980      890349  594     WCJGEN
...
01000400        1512522614      2242390 660     WJGEN
01000400        1514765004      839     242     WEN
01000400        1514765843      1082    331     E
01000500        1507326628      23      660     W
01000500        1507326651      1957    591     WCJGEN
01000500        1507328608      692     637     Wcjgen
...

These two segments show outages for two blocks. The first block, 0x01000400 (1.0.4.0/24), was up (capital letters in status), as detected by site W at time 1507326957 (2017-10-06t21:55:57Z), and seen as up by all 6 sites in the next line, 23 seconds later.

The second block, 0x01000500 (1.0.5.0/24), was detected as up by site W at time 1507326628 (2017-10-06t21:50:28Z), followed by all the other sites 23 seconds later. However, at time 1507328608 (2017-10-06t22:23:28Z) all sites except W failed to detect it as up.
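
The letter-case convention of the status column can be decoded mechanically. The sketch below is a minimal illustration, assuming only what the schema states (a capital letter means the site saw no outage, a lowercase letter means it saw one). The function names decode_status and sites_reporting are hypothetical, and sites that do not appear in the string are simply not reported.

```python
def decode_status(status):
    """Map each site letter in a status string to 'up' or 'down'.

    A capital letter means that site saw no outage (up);
    a lowercase letter means that site saw the outage (down).
    """
    return {ch.lower(): ("up" if ch.isupper() else "down") for ch in status}


def sites_reporting(status, state):
    """Return the sorted site letters whose decoded state equals `state`."""
    decoded = decode_status(status)
    return sorted(site for site, value in decoded.items() if value == state)
```

For the status Wcjgen from the sample, sites_reporting("Wcjgen", "up") gives only w, matching the description above that every site except W failed to see the block as up.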

“Outagedownup” Format Integrated Outages

“Outagedownup” format data merges all observing sites for one time period (see “Sites” above for details; time periods are typically quarters). It also includes several post-processing steps:

  1. insufficient VP detection
  2. merging roles
  3. unmeasurability detection
  4. hole filling of periods with insufficient observers

Output is in tab-separated text (FSDB format), with the following schema:

block: block address of the /24 in hex (with trailing zeros)
start: when the status takes effect, in seconds since the Unix epoch.
duration: how long the status is in effect, in seconds.
uncertainty: our confidence in the precision of the start time. The true start time is sometime between start and start-uncertainty. The true duration is between duration-NextEventUncertainty and duration+ThisEventUncertainty. In non-raw data uncertainty is sometimes lowered when we merge observations from multiple observers.
downup: up (1), down (0), unmeasurable (-1, typically due to insufficient active observers), or gone dark (-2, typically out for more than 10 days)
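
The uncertainty bounds stated in the schema can be written out as simple arithmetic. The following sketch is only an illustration of those two rules; the function names are hypothetical, and it assumes the "next event" is the following row for the same block.

```python
def start_bounds(start, uncertainty):
    """Earliest and latest possible true start time of an event.

    Per the schema, the true start lies between start - uncertainty and start.
    """
    return (start - uncertainty, start)


def duration_bounds(duration, this_uncertainty, next_uncertainty):
    """Shortest and longest possible true duration of an event.

    Per the schema: between duration - NextEventUncertainty
    and duration + ThisEventUncertainty.
    """
    return (duration - next_uncertainty, duration + this_uncertainty)


# Example values: an event starting at 1510596747 that lasts 32046 seconds,
# with uncertainty 7920 on both this event and the next one.
earliest, latest = start_bounds(1510596747, 7920)
shortest, longest = duration_bounds(32046, 7920, 7920)
```

With these example values, the true start lies between 1510588827 and 1510596747, and the true duration lies between 24126 and 39966 seconds.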

Sample data, from dataset internet_outage_adaptive_a30all-20171006, file a30all.outagedownup.fsdb.bz2:

#fsdb -F t block start duration uncertainty downup
01000400        1507326957      7439968 331     1
01000500        1507326628      7439966 660     1
01000600        1507327123      7439309 333     1
01005000        1507326901      3269846 12540   1
01005000        1510596747      32046   7920    0
01005000        1510628793      47225   7920    1
01005000        1510676018      35681   11972   0
...

This data shows that blocks 0x01000400 (1.0.4.0/24), 0x01000500 (1.0.5.0/24), and 0x01000600 (1.0.6.0/24) were up (downup is 1) for the entire observation period (for the first block, starting at 1507326957, 2017-10-06t21:55:57Z, and continuing for 7439968 seconds, just over 86 days).

Block 0x01005000 (1.0.80.0/24) was up starting at 1507326901 (2017-10-06t21:55:01Z) for 3269846 seconds (37.8 days), then down for 32046 seconds (8.9 hours), then up for 47225 seconds (13.1 hours), etc.
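
To turn rows like these into per-block totals, one could sum the duration column by downup state. The sketch below is a minimal example under the assumption that rows have already been parsed into tuples of integers (with the block kept as a hex string); the names DOWNUP_LABELS and summarize are hypothetical.

```python
from collections import defaultdict

# Meaning of the downup column, as listed in the schema.
DOWNUP_LABELS = {1: "up", 0: "down", -1: "unmeasurable", -2: "gone dark"}


def summarize(rows):
    """Sum seconds spent in each state, per block.

    rows: iterable of (block, start, duration, uncertainty, downup) tuples.
    Returns {block: {state_label: total_seconds}}.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for block, _start, duration, _uncertainty, downup in rows:
        totals[block][DOWNUP_LABELS[downup]] += duration
    return {block: dict(states) for block, states in totals.items()}


# The four sample rows shown for block 01005000.
sample = [
    ("01005000", 1507326901, 3269846, 12540, 1),
    ("01005000", 1510596747, 32046, 7920, 0),
    ("01005000", 1510628793, 47225, 7920, 1),
    ("01005000", 1510676018, 35681, 11972, 0),
]
totals = summarize(sample)
```

Over the four rows shown, block 01005000 accumulates 3317071 seconds up and 67727 seconds down; the elided rows that follow in the file would add to these totals.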

Versions

We have had several different versions of our outage data processing pipeline as we have learned more. In general, we have two goals in our datasets: to be as accurate as possible to what really happened, and to provide a long-term result for others to use.

These two goals are in conflict. To resolve that conflict, we sometimes update our datasets with recomputed results while preserving the old results as different files in the same database.

All datasets now include a “vX” tag that indicates the version.

Here is our summary:

Each version is described by four stages (“same” means unchanged from the previous version):

  input
  raw Trinocular (icmptrain, per-site data)
  aXXall.vYY.outages.fsdb.bz2 (FBS+LABR, raw to outages, merge, precision improvement)
  aXXall.vYY.outagedownup.fsdb.bz2 (disagreement resolution, hole filling, gone-dark)

v1
  input:
    target blocks: |E(b)| >= 15 and |A(E(b))| equal to 0.1, from Quan13c
  raw Trinocular:
    a_oper: do not include down events to calculate a_oper
    probing order: per-block probe order is randomized from full round (FR) to FR
  outages:
    survey edges: no pre-staging of before & after quarter data
    hole filling: raw to outages: a single unknown state between 2 rounds of equal status has its status set to the same status as the other 2
    precision improvement: forward precision improvement, from Quan13c section 4.5
  outagedownup:
    gone-dark: 1 week windows need 0.8 up time, otherwise set to -2, from Alwabel15a
    multi-site resolution: any-up

v1b
  input: same
  raw Trinocular: same
  outages: same
  outagedownup:
    gone-dark: improvements in downup_to_unmeasurable.py code, window increase from 1 to 3 weeks

v2
  input: same
  raw Trinocular: same
  outages:
    raw to outages: unknowns are not fixed
    precision improvement: backward precision improvement
  outagedownup:
    gone-dark: same as in v1 but fixed some bugs

v3
  input: same
  raw Trinocular: same
  outages: same
  outagedownup: internal only

v4
  input: same
  raw Trinocular: same
  outages:
    a_oper: a_oper as a new column in outages format (from Quan14c)
  outagedownup:
    gone dark: outages longer than 1 week set to -2
    a_oper: adds a_oper as a new column in outagedownup format (from Quan14c)

v5
  input: same
  raw Trinocular: same
  outages:
    survey edges: added from 1 week before/after survey for proper gone-dark filtering, motivated by gone-dark in Alwabel15a
    FBS: full block scanning over flaky blocks; outages in sparse blocks sometimes mapped to up, from Baltra19b
    LABR: lone block recovery algorithm, single addresses down events mapped to unknown, from Baltra19b
  outagedownup:
    multi-site resolution: majority voting, from Baltra19a

v5b
  input: same
  raw Trinocular: same
  outages:
    FBS: a_short bug fixed
    FBS: full round (FR) completion (after a non-UP round): we are willing to count down probes in the first TR that includes a positive response against the FR accumulation
    FBS: windowing - require 2 FRs due to round reordering, for data on or before 2019q4
  outagedownup: same

v5c
  input:
    target blocks: blocks with |E(b)| >= 3
  raw Trinocular:
    extra probe: send 16th probe if to old known replier block changes state
    probing order: no longer change order each round
  outages:
    FBS: windowing - FBS defaults to 1 FR (although when we run on datasets on 2019q4 or earlier, we need to manually override to 2 FR)
  outagedownup: same
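
Two of the rules named in the summary are simple enough to sketch: the v1 hole filling of single unknown rounds, and the difference between any-up and majority-voting multi-site resolution. The code below is a loose illustration only, not the pipeline's implementation. All function names are hypothetical, the per-round status codes are borrowed from the raw probing schema (0 down, 1 up, 2 unknown), the multi-site functions operate on an outages-style status string, and the handling of ties in majority voting is an assumption.

```python
UNKNOWN = 2  # status code for "unknown" in the raw probing schema


def fill_single_unknowns(statuses):
    """Fill a lone unknown round that sits between two rounds of equal status.

    Neighbors are read from the original list, so a run of two or more
    unknowns is left untouched.
    """
    filled = list(statuses)
    for i in range(1, len(statuses) - 1):
        prev, cur, nxt = statuses[i - 1], statuses[i], statuses[i + 1]
        if cur == UNKNOWN and prev == nxt and prev != UNKNOWN:
            filled[i] = prev
    return filled


def resolve_any_up(status):
    """Any-up: the block is up (1) if at least one site reports it up."""
    return 1 if any(ch.isupper() for ch in status) else 0


def resolve_majority(status):
    """Majority voting: up (1) only if more sites report up than down.

    Assumption: ties are treated as down; the source does not say.
    """
    ups = sum(ch.isupper() for ch in status)
    downs = sum(ch.islower() for ch in status)
    return 1 if ups > downs else 0
```

For a status such as Wcjgen, the two resolution rules disagree: any-up reports the block as up because W saw it, while majority voting reports it as down.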