ANT datasets

ANT provides a number of datasets in different formats.

getting datasets

See our separate datasets requests page for steps to take to get access to our data, or contact IMPACT.

See also our list of all datasets, and pointers to their formats.

dataset categories

We have several categories of dataset types:

Address-Space-Allocation Data

Address Space Allocation Data contains Internet addresses that have some properties that characterize Internet topology (for example, addresses that respond with different codes, or that appear to be dynamic, etc.). The IP addresses in this dataset are not typically anonymized because they are determined from measurement traffic and not actual sender-receiver communications and so are not associated with specific individuals. This data can be used to better understand the Internet topology and address usage.

Specific sub-categories of address space allocation data include:

Internet address censuses (format): we ping every allocated IP address one time over about two months and report the results. A census is useful for studying address allocation policies. We have collected 2 censuses every 2-4 months since 2006. (Example: internet_address_survey_it39w-20110322 was the first census after all IPv4 blocks were allocated.)
Internet address surveys (format): we pick 1-2% of the /24 blocks in allocated Internet address space and ping every address in those blocks many times (approximately every 11 minutes) for up to two weeks. An address survey is useful for studying address usage (for example, static vs. dynamic allocation). We have collected 2 surveys every 2-4 months since 2006. (Example: internet_address_survey_reprobing_it43j-20111012 is our first survey from Japan.)
IP Address Hitlists (format): a hitlist provides a representative for each /24 IPv4 address block that is judged most likely to reply to pings. Our goal is to provide representatives that are responsive, complete, and stable: that is, addresses that respond to pings and traceroutes (with high probability), that cover every allocated IPv4 /24 prefix, and that do not change much over time. Our hitlists are derived from our census data. Hitlists are useful input for Internet topology studies. We have generated hitlists every 2-4 months since 2009. (Example: internet_address_hitlist_it29w-20091102 was our first production hitlist.)
IP Address History (format): the IP address history contains all IPv4 addresses that ever responded in 16 prior IPv4 address censuses, along with a score for each address indicating its likelihood of responding to future pings. Address histories are useful for doing customized selection of IP addresses for topology and reliability studies. (Example: internet_address_history_it43w-20110913 was our first address history.)
IP address accumulation (format): IP accumulation datasets report counts of the number of active addresses per /24 IPv4 block over time, estimated from Trinocular probing. (Example: ip_accumulation_a39-20200101.)
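
The idea behind a hitlist (one representative per /24 block, chosen by how likely it is to answer) can be sketched in a few lines. This is only an illustration under stated assumptions, not the method used to build the ANT hitlists: the input is a hypothetical list of (address, score) pairs, the scores are made up, and the function names are invented for this example.

```python
import ipaddress
from typing import Dict, Iterable, Tuple


def block_of(addr: str) -> str:
    """Return the /24 block (as text) that contains an IPv4 address."""
    return str(ipaddress.ip_network(f"{addr}/24", strict=False))


def pick_representatives(scored: Iterable[Tuple[str, float]]) -> Dict[str, str]:
    """Map each /24 block to its highest-scoring address.

    `scored` is a hypothetical input: (address, score) pairs where a
    higher score means "more likely to reply to a ping".
    """
    best: Dict[str, Tuple[str, float]] = {}
    for addr, score in scored:
        key = block_of(addr)
        if key not in best or score > best[key][1]:
            best[key] = (addr, score)
    return {block: addr for block, (addr, _) in best.items()}


# Made-up sample using documentation address ranges.
sample = [
    ("192.0.2.1", 0.20),
    ("192.0.2.77", 0.95),
    ("198.51.100.9", 0.60),
]
representatives = pick_representatives(sample)
print(representatives)
```

A real hitlist also has to balance stability over time against responsiveness, which this sketch ignores.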

Internet Outage Data

Datasets in this category record information about Internet outages: address blocks that become unreachable. Outages are typically inferred from active probing. The data may include /24 block-level outages over time, or lists of inferred outages that affect larger parts of the Internet. Outage data can be useful for understanding Internet reliability.

For details about these datasets, please refer to the description page and outages formats.

Specific sub-categories of Internet outage data include:

all outages: Datasets like internet_outage_adaptive_a28all-20170403 represent Trinocular’s evaluation of outages for a quarter.
raw probing output: Datasets like internet_outage_adaptive_a28w-20170403 represent the “raw” results from one observation location.
additional raw probing output: To provide timely address status reconstruction, we do additional probing of some blocks. The format is the same as raw probing output. (Sample dataset: internet_outage_adaptive_additional_a49w-20220701.)
supplemental outage data: Starting with internet_outage_adaptive_a16_supplement-20140407, we provide supplemental data that maps IP addresses to AS numbers, DNS names, and organizations. We can no longer provide geolocation data, but we have a tool that can process Maxmind data to match our outage data.
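
To make "block-level outages over time" concrete, the sketch below collapses a time-ordered series of up/down observations for one block into outage intervals. The record layout, a (timestamp, is_up) pair, is an assumption made for illustration and is not the format of the outage datasets; see the format documentation for the actual fields.

```python
from typing import List, Sequence, Tuple


def outage_intervals(observations: Sequence[Tuple[int, bool]]) -> List[Tuple[int, int]]:
    """Collapse time-ordered (timestamp, is_up) samples into outage intervals.

    Each interval is (first time seen down, first time seen up again).
    An outage still open at the end is closed at the last timestamp.
    The input layout is hypothetical, not the dataset format.
    """
    intervals: List[Tuple[int, int]] = []
    down_since = None
    for ts, is_up in observations:
        if not is_up and down_since is None:
            down_since = ts                      # outage begins
        elif is_up and down_since is not None:
            intervals.append((down_since, ts))   # outage ends
            down_since = None
    if down_since is not None:
        intervals.append((down_since, observations[-1][0]))
    return intervals


# Made-up timeline for a single block (timestamps in seconds).
timeline = [(0, True), (660, False), (1320, False), (1980, True), (2640, False)]
outages = outage_intervals(timeline)
print(outages)
```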

Partial Reachability in IPv4 and IPv6

We have new data about partial reachability in the Internet, complementing our above study of full outages, with data for Trinocular and RIPE Atlas:

IPv4 partial reachability from Trinocular: Datasets like outages_partial_a28_20170403 are derived from Trinocular.
RIPE Atlas Islands and Peninsulas: These datasets provide daily analysis of all RIPE Atlas VPs, showing which are islands (isolated from the public Internet, either in IPv4 or IPv6) and which see peninsulas (partial connectivity in IPv4 or IPv6).

IPv6 Deployment Data

IPv6 Deployment Data: Datasets like tbd are custom crawls.

Other Internet Topology Data

Geolocation data: IP Geolocation datasets provide the inferred geographic location (latitude/longitude) of specific IP addresses, and in some cases the source data that was used to make that inference. Geolocation data can be used to understand where Internet traffic comes from and how it can be influenced by local policies. More information.
AS-to-Organization maps: we relate ASes (from BGP routing) to organizations (companies). (Example: as_to_org_mapping_inferred_truth-20100507)
Internet Topology data: Internet topology data is created by a program that tries to map the Internet. The program is able to determine which routers are capable of talking to other routers. Internet topology data only shows router connectivity within the Internet core and to external enterprise borders; it does not contain any identifiable information or internal enterprise topology information. This dataset can be used for worm outbreak modeling and simulation, worm containment and countermeasures, zombie distribution for DDoS attacks, vulnerability assessments, longitudinal studies of the evolution of Internet topology and address distribution, and Internet topology and address map inference. (Example: internet_router_map_planetlab-20030412)

Much of our data is IP Packet Headers: these datasets are comprised of headers of traffic data, containing information such as anonymized source and destination IP addresses and other IP and transport (e.g., TCP, UDP, ICMP, SCTP) header fields. No packet contents are included. Depending on the specific dataset, this category of data can be used for characterization of typical Internet traffic, or of traffic anomalies such as DDoS attacks, port scans, or worm outbreaks.

general anonymized packet headers: collections of packet headers from regional network access links. (Example: lander_sample-20080903)
anonymized attack traffic (packet headers): collections of packet headers containing known denial-of-service attacks taken from a regional network access link. (Example: attack-tcpsyn-20061106)
artificial attacks over real background traffic: we have generated a set of artificial attacks of varying intensity overlaid on real network background traffic. (Example: UniformAttack_Traces_Generated20070821-20041202)
anomalous events from real B-Root DNS traffic: we have curated a number of B-Root events, including DDoS attacks and flash traffic. (Example: B_Root_Anomaly_message_question-20151130)

We also have traffic flow data: Network traffic can be represented as flows between two endpoints. This dataset contains traffic flow information, which includes a variety of attributes such as source and destination IP address, source and destination port, protocol type, and packet and byte counts. This data can be in different formats generated by a range of different collection tools such as NetFlow, IPFIX, and Argus, or variants. IP addresses in these files are anonymized on a per-dataset or per-time interval basis. These datasets are useful for research such as network economics and accounting, network planning, analysis, security, denial of service attacks, network monitoring, as well as traffic visualization.

generalized flows: we provide some flow data in Argus format. (Example: FRGPContinuousFlowData_sample-20130923)
long-flows (format): we provide network traffic flow data, derived from packet headers at regional network access links, in a format designed to support examination of short and long flows. (Example: long_flows_D1_2_days-20090605)
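
The relationship between packet headers and flow records can be sketched as follows: packets that share the same source and destination address, source and destination port, and protocol are grouped, and packet and byte counts are accumulated per group. This is a minimal illustration with a hypothetical `Packet` record and made-up values; it does not reproduce the NetFlow, IPFIX, or Argus formats, and it treats each direction as a separate flow.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class Packet(NamedTuple):
    """Hypothetical, simplified packet-header record."""
    src: str
    dst: str
    sport: int
    dport: int
    proto: str
    length: int  # bytes


FlowKey = Tuple[str, str, int, int, str]


def aggregate_flows(packets: Iterable[Packet]) -> Dict[FlowKey, Tuple[int, int]]:
    """Group packets by 5-tuple and return (packet count, byte count) per flow."""
    counters = defaultdict(lambda: [0, 0])
    for p in packets:
        key = (p.src, p.dst, p.sport, p.dport, p.proto)
        counters[key][0] += 1
        counters[key][1] += p.length
    return {key: (pkts, nbytes) for key, (pkts, nbytes) in counters.items()}


# Made-up packets: two in one direction, one reply in the other.
packets = [
    Packet("10.0.0.1", "10.0.0.2", 1234, 80, "tcp", 100),
    Packet("10.0.0.1", "10.0.0.2", 1234, 80, "tcp", 1500),
    Packet("10.0.0.2", "10.0.0.1", 80, 1234, "tcp", 60),
]
flows = aggregate_flows(packets)
for key, (pkts, nbytes) in flows.items():
    print(key, pkts, nbytes)
```

Real flow collectors also apply timeouts to decide when a flow ends, which this sketch leaves out.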

DNS Data

We have DNS data (Domain Name System), showing DNS-protocol lookups. We have more information about DNS datasets pertaining to specific papers and data about anycast.

Our DNS data comes in three flavors:

Public DNS data: this dataset consists of Domain Name System data derived from public sources, such as from public DNS servers. It is not associated with users and has no privacy constraints.

An example public DNS dataset is our reverse DNS data rdns_ipv4-20160312.
Anonymized DNS data: this dataset consists of Domain Name System data that contains no identifying information for individuals, either because it was anonymized, or because it is aggregated to the level that individuals’ queries are obscured, or because it contains only experimental data from test programs (not individuals).

An example anonymized DNS dataset is our DITL data DITL_B_Root-20160405.
Limited DNS data: this dataset consists of Domain Name System data that does not directly identify individuals, but cannot be combined with other data sources.

Service Enumeration Data

Our service enumeration data consists of our efforts to enumerate different Internet services, now including:

Anycast enumeration datasets: Active probing information to DNS anycast services such as root DNS. Typically probes are made from many vantage points with the goal to enumerate all anycast nodes in the service. Anycast enumeration datasets are useful to understand the operational status and geographic reach of anycast services and nodes. For detailed information about the dataset, please refer to dataset description page.
Google front-ends enumeration and mapping: Active DNS queries with EDNS-client-subnet allow enumeration of Google front-end IP addresses. With all the front-end IP addresses, we use a new technique to geolocate the front-ends and cluster them into serving sites. For detailed information about the dataset, please refer to dataset description page.
Reverse DNS data: We collect and provide a crawl of the IPv4 Reverse DNS domain names.

An example reverse DNS dataset is rdns_ipv4-20160312.
Other DNS data: we have data related to DNS backscatter.
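
For readers unfamiliar with reverse DNS: the name looked up for an IPv4 address is formed by reversing its octets and appending in-addr.arpa. The sketch below builds that name with Python's standard `ipaddress` module. The address is a documentation example and the helper name is invented here; this says nothing about how the crawl itself is run or how the dataset is formatted.

```python
import ipaddress


def reverse_name(addr: str) -> str:
    """Return the reverse-DNS (PTR) query name for an IP address."""
    return ipaddress.ip_address(addr).reverse_pointer


# Documentation address, used purely as an example.
name = reverse_name("192.0.2.1")
print(name)
```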

Other Types of Data

other paper-specific datasets: We have several other datasets specific to papers we have published, including p2p traffic detection, TCP SYNs, etc.

Some things to look for:

ad delivery: datasets about algorithmic fairness in ad delivery on social media

Data Formats

Documentation about dataset formats is now here.