ANT provides a number of datasets in different formats.
See our separate datasets requests page for steps to take to get access to our data, or contact IMPACT.
See also our list of all datasets, and pointers to their formats.
We have several categories of dataset types:
Address Space Allocation Data contains Internet addresses that have some properties that characterize Internet topology (for example, addresses that respond with different codes, or that appear to be dynamic, etc.). The IP addresses in this dataset are not typically anonymized because they are determined from measurement traffic and not actual sender-receiver communications and so are not associated with specific individuals. This data can be used to better understand the Internet topology and address usage.
Specific sub-categories of address space allocation data include:
Internet address censuses (format):
we ping every allocated IP address one time over about two months and report the results.
A census is useful for studying address allocation policies.
We have collected 2 censuses every 2-4 months since 2006.
(Example: internet_address_survey_it39w-20110322
was the first census
after all IPv4 blocks were allocated.)
Internet address surveys (format):
we pick 1-2% of the /24 blocks in allocated Internet address space and
ping every address in those blocks many times (approximately every 11
minutes) for up to two weeks.
An address survey is useful for studying address usage (for example, static vs. dynamic allocation).
We have collected 2 surveys every 2-4 months since 2006.
(Example: internet_address_survey_reprobing_it43j-20111012
is our first survey from Japan.)
IP Address Hitlists (format):
a hitlist provides, for each /24 IPv4 address block, the representative address
that is judged most likely to reply to pings.
Our goal is to provide representatives that are responsive, complete and stable:
that is, addresses that respond to pings and traceroutes (with high probability),
that cover every allocated IPv4 /24 prefix,
and that do not change much over time.
Our hitlists are derived from our census data.
Hitlists are useful input for Internet topology studies.
We have generated hitlists every 2-4 months since 2009.
(Example: internet_address_hitlist_it29w-20091102
was our first production hitlist.)
IP Address History (format):
The IP address history contains all IPv4 addresses that ever responded in
16 prior IPv4 address censuses, along with a score for each address
indicating its likelihood of responding to future pings.
Address histories are useful for doing customized selection of IP
addresses for topology and reliability studies.
(Example: internet_address_history_it43w-20110913
was our first address history.)
IP address accumulation (format):
IP accumulation datasets report counts of the number of active addresses
per /24 IPv4 block over time, estimated from Trinocular probing.
(Example: ip_accumulation_a39-20200101
)
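The idea behind a hitlist, one representative address per /24 block, can be illustrated with a minimal Python sketch. This is not the method ANT uses to build its hitlists: the per-address scores, the "highest score wins" rule, and the names block_of and pick_representatives are all assumptions made for illustration, and the addresses come from the reserved documentation ranges.

```python
import ipaddress


def block_of(addr):
    """Return the /24 block containing addr, e.g. '192.0.2.0/24'."""
    return str(ipaddress.ip_network(f"{addr}/24", strict=False))


def pick_representatives(scores):
    """Map each /24 block to the address with the highest score in it.

    `scores` maps dotted-quad addresses to a hypothetical responsiveness
    score (higher means more likely to answer a ping).
    """
    best = {}
    for addr, score in scores.items():
        block = block_of(addr)
        if block not in best or score > scores[best[block]]:
            best[block] = addr
    return best


# Made-up scores for addresses in documentation ranges (assumption).
example_scores = {
    "192.0.2.1": 0.9,
    "192.0.2.77": 0.4,
    "198.51.100.5": 0.2,
    "198.51.100.200": 0.7,
}
representatives = pick_representatives(example_scores)
```

A real hitlist also has to cover blocks with no responsive address and keep its representatives stable over time, which this sketch does not attempt.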
Internet outage data: datasets in this category record information about Internet outages, that is, address blocks that become unreachable. Typically, outages are inferred from active probing. The data may include /24 block-level outages over time, or lists of inferred outages that affect larger parts of the Internet. Outage data can be useful for understanding Internet reliability.
For details on these datasets, please refer to the description page and outage formats.
Specific sub-categories of Internet outage data include:
all outages: Datasets like internet_outage_adaptive_a28all-20170403
represent Trinocular’s evaluation of outages for a quarter.
raw probing output: Datasets like internet_outage_adaptive_a28w-20170403
represent the “raw” results from one observation location.
additional raw probing output: To provide timely address status reconstruction, we do additional probing of some blocks. The format is the same as raw probing output. (Example: internet_outage_adaptive_additional_a49w-20220701
)
supplemental outage data:
Starting with internet_outage_adaptive_a16_supplement-20140407
we provide supplemental data that maps IP addresses to AS numbers, DNS names, and organizations. We used to provide geolocation data but no longer can; however, we have a tool that can process MaxMind data to match our outage data.
Geolocation data: IP Geolocation datasets provide the inferred geographic location (latitude/longitude) of specific IP addresses, and in some cases the source data that was used to make that inference. Geolocation data can be used to understand where Internet traffic comes from and how it can be influenced by local policies. More information.
AS-to-Organization maps:
we relate ASes (from BGP routing) to organizations (companies).
(Example: as_to_org_mapping_inferred_truth-20100507
)
Internet Topology data:
Internet topology data is created by a program that tries to map the
Internet. The program is able to determine which routers are capable of
talking to other routers. Internet topology data only shows router
connectivity within the Internet core and to external enterprise borders;
it does not contain any identifiable information or internal enterprise
topology information. This dataset can be used for worm outbreak modeling
and simulation, worm containment and countermeasures, zombie distribution
for DDoS attacks, vulnerability assessments, longitudinal studies of the
evolution of Internet topology and address distribution, and Internet
topology and address map inference.
(Example: internet_router_map_planetlab-20030412
)
Much of our data is IP Packet Headers: these datasets are comprised of headers of traffic data, containing information such as anonymized source and destination IP addresses and other IP and transport (e.g., TCP, UDP, ICMP, SCTP) header fields. No packet contents are included. Depending on the specific dataset, this category of data can be used for characterization of typical Internet traffic, or of traffic anomalies such as DDoS attacks, port scans, or worm outbreaks.
general anonymized packet headers:
collections of packet headers from regional network access links.
(Example: lander_sample-20080903
)
anonymized attack traffic (packet headers):
collections of packet headers containing known denial-of-service attacks
taken from a regional network access link.
(Example: attack-tcpsyn-20061106
)
artificial attacks over real background traffic:
we have generated a set of artificial attacks of varying intensity
overlaid on real network background traffic.
(Example: UniformAttack_Traces_Generated20070821-20041202
)
anomalous events from real B-Root DNS traffic:
we have curated a number of events observed at B-Root,
including DDoS attacks and flash traffic.
(Example: B_Root_Anomaly_message_question-20151130
)
We also have traffic flow data: network traffic can be represented as flows between two endpoints. These datasets contain traffic flow information, which includes attributes such as source and destination IP address, source and destination port, protocol type, and packet and byte counts. The data can be in different formats generated by a range of collection tools such as NetFlow, IPFIX, and Argus, or variants. IP addresses in these files are anonymized on a per-dataset or per-time-interval basis. These datasets are useful for research in areas such as network economics and accounting, network planning, analysis, security, denial-of-service attacks, network monitoring, and traffic visualization.
generalized flows:
we provide some flow data in
Argus
format.
(Example: FRGPContinuousFlowData_sample-20130923
)
long-flows (format):
we provide network traffic flow data,
derived from packet headers at regional network access links,
in a format designed to support examination of short and long flows.
(Example: long_flows_D1_2_days-20090605
)
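To make the relationship between packet headers and flow records concrete, here is a minimal Python sketch that groups packets by their 5-tuple and accumulates packet counts, byte counts, and start and end times. It does not reproduce NetFlow, IPFIX, Argus, or the long-flows format; the Packet fields, the aggregate_flows name, and the sample values are assumptions made for illustration.

```python
from typing import NamedTuple


class Packet(NamedTuple):
    """A hypothetical, already-parsed packet header."""
    time: float   # seconds since the start of the trace
    src: str
    dst: str
    sport: int
    dport: int
    proto: str
    size: int     # bytes


def aggregate_flows(packets):
    """Group packets into unidirectional flows keyed by 5-tuple."""
    flows = {}
    for p in packets:
        key = (p.src, p.dst, p.sport, p.dport, p.proto)
        flow = flows.setdefault(
            key, {"packets": 0, "bytes": 0, "start": p.time, "end": p.time}
        )
        flow["packets"] += 1
        flow["bytes"] += p.size
        flow["start"] = min(flow["start"], p.time)
        flow["end"] = max(flow["end"], p.time)
    return flows


# Made-up packets using documentation-range addresses (assumption).
example_packets = [
    Packet(0.0, "192.0.2.1", "198.51.100.7", 40000, 53, "udp", 70),
    Packet(2.5, "192.0.2.1", "198.51.100.7", 40000, 53, "udp", 80),
    Packet(1.0, "192.0.2.9", "198.51.100.7", 51000, 443, "tcp", 1500),
]
example_flows = aggregate_flows(example_packets)
```

The difference between a flow's end and start times gives its duration, which is the kind of attribute one would use to separate short flows from long ones.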
We have DNS data (Domain Name System), showing DNS-protocol lookups. We have more information about DNS datasets pertaining to specific papers and data about anycast.
Our DNS data comes in three flavors:
Public DNS data: this dataset consists of Domain Name System data derived from public sources, such as public DNS servers. It is not associated with users and has no privacy constraints.
An example public DNS dataset is our reverse DNS data rdns_ipv4-20160312
Anonymized DNS data: this dataset consists of Domain Name System data that contains no identifying information for individuals, either because it was anonymized, because it is aggregated to a level at which individuals' queries are obscured, or because it contains only experimental data from test programs (not individuals).
An example anonymized DNS dataset is our DITL data DITL_B_Root-20160405
Limited DNS data: this dataset consists of Domain Name System data that does not directly identify individuals, but that cannot be combined with other data sources.
Our service enumeration data consists of our efforts to enumerate different Internet services, now including:
Anycast enumeration datasets: Active probing data for DNS anycast services such as root DNS. Typically, probes are made from many vantage points with the goal of enumerating all anycast nodes in the service. Anycast enumeration datasets are useful for understanding the operational status and geographic reach of anycast services and nodes. For detailed information about the dataset, please refer to the dataset description page.
Google front-ends enumeration and mapping: Active DNS queries with EDNS-client-subnet allow enumeration of Google front-end IP addresses. With all the front-end IP addresses, we use a new technique to geolocate the front-ends and cluster them into serving sites. For detailed information about the dataset, please refer to the dataset description page.
Reverse DNS data: We collect and provide a crawl of the IPv4 Reverse DNS domain names.
An example reverse DNS dataset is rdns_ipv4-20160312
Other DNS data: we have data related to DNS backscatter.
other paper-specific datasets: We have several other datasets specific to papers we have published, including p2p traffic detection, TCP SYNs, etc.
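For readers unfamiliar with reverse DNS, the names covered by an IPv4 reverse DNS crawl live under in-addr.arpa, with the address octets reversed. The short Python sketch below shows how that query name is formed using the standard ipaddress module; it performs no network lookups and does not describe how the crawl itself is collected. The ptr_name helper and the sample address are illustrative assumptions.

```python
import ipaddress


def ptr_name(addr):
    """Return the reverse-DNS (PTR) query name for an IP address."""
    return ipaddress.ip_address(addr).reverse_pointer


# A documentation-range address, used purely as an example.
example_name = ptr_name("192.0.2.5")
print(example_name)  # prints 5.2.0.192.in-addr.arpa
```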
Some things to look for: