LANDER:as to org mapping inferred truth-20110901

From Predict

README version: 4045, last modified: 2014-06-6.

This file describes the trace dataset "as_to_org_mapping_inferred_truth-20110901" provided by the LANDER project.

Contents

  • 1 LANDER Metadata
  • 2 Dataset Contents
  • 3 Data Format
      • 3.1 Syntax
      • 3.2 Schema
      • 3.3 How Organization vs. AS files relate
  • 4 Clustering Method
  • 5 Citation
  • 6 Results Using This Dataset
  • 7 User Annotations

LANDER Metadata

┌───────────────────────────┬────────────────────────────────────────────────────────────────────────────────────┐
│ dataSetName               │ as_to_org_mapping_inferred_truth-20110901                                          │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ status                    │ usc-web-and-predict                                                                │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ shortDesc                 │ Mapping from ASes to organizations                                                 │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ longDesc                  │ This dataset provides a mapping from ASes to organizations, i.e., identifies which │
│                           │ ASes belong to which organizations. We determined the mapping by manual inspection │
│                           │ of RIR whois information, using AS names and external information (company web     │
│                           │ pages, wikipedia, etc.) to infer a feasible ground truth. This dataset comprises   │
│                           │ 109 organizations and their 4019 ASes in total.                                    │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ datasetClass              │ Quasi-Restricted                                                                   │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ commercialAllowed         │ true                                                                               │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ requestReviewRequired     │ true                                                                               │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ productReviewRequired     │ false                                                                              │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ ongoingMeasurement        │ false                                                                              │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ submissionMethod          │ Upload                                                                             │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ collectionStartDate       │ 2011-09-01                                                                         │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ collectionStartTime       │ 00:00:00                                                                           │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ collectionEndDate         │ 2011-09-01                                                                         │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ collectionEndTime         │ 00:00:00                                                                           │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ availabilityStartDate     │ 2013-03-04                                                                         │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ availabilityStartTime     │ 18:10:02                                                                           │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ availabilityEndDate       │ 2030-01-01                                                                         │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ availabilityEndTime       │ 00:00:00                                                                           │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ anonymization             │ none                                                                               │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ archivingAllowed          │ false                                                                              │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ keywords                  │ category:internet-topology-data, subcategory:as-organizational-data, internet,     │
│                           │ topology, one-time                                                                 │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ format                    │ text                                                                               │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ access                    │ https                                                                              │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ hostName                  │ USC-LANDER                                                                         │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ providerName              │ USC                                                                                │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ groupingId                │                                                                                    │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ groupingSummaryFlag       │ false                                                                              │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ retrievalInstructions     │ download                                                                           │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ byteSize                  │ 1048576                                                                            │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ expirationDays            │ 14                                                                                 │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ uncompressedSize          │ 254665                                                                             │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ impactDoi                 │ 10.23721/109/1353841                                                               │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ useAgreement              │ dua-ni-160816                                                                      │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ irbRequired               │ false                                                                              │
├───────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ privateAccessInstructions │ See https://ant.isi.edu/datasets/#getting-datasets for information on obtaining    │
│                           │ this dataset.                                                                      │
│                           │ See                                                                                │
└───────────────────────────┴────────────────────────────────────────────────────────────────────────────────────┘

Dataset Contents

as_to_org_mapping_inferred_truth-20110901.README.txt
     copy of this README
9org_orgs.fsdb
     9 large Internet organizations and their corresponding AS cluster IDs
9org_ases.fsdb
     ASes belonging to the 9 organizations, each annotated with a cluster ID
randtop_orgs.fsdb
     50 random large Internet organizations and their corresponding AS cluster IDs
randtop_ases.fsdb
     ASes belonging to the 50 large organizations, each annotated with a cluster ID
randall_orgs.fsdb
     50 random Internet organizations and their corresponding AS cluster IDs
randall_ases.fsdb
     ASes belonging to the 50 organizations, each annotated with a cluster ID
.sha1sum
     SHA-1 checksum

The file ".sha1sum" contains SHA1 checksums of individual compressed files. The integrity of the distribution thus can be checked by independently calculating SHA1 sums of files and comparing them with those listed in the file.
If you have the sha1sum utility installed on your system, you can do that by executing:

    sha1sum --check .sha1sum

Data Format

Syntax

Each of the *.fsdb files is in FSDB file format: a simple, whitespace-separated text database format in which each line is a database row and whitespace separates columns.

Schema

Each file is a simple database. In *_orgs.fsdb, each row is an organization, and the 3 columns provide information about it.

┌───────────┬────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ clusterid │ the unique identifier of an organization (an AS cluster), in the format "method-id", where         │
│           │ "method" indicates how the organization was selected ("manual": intentionally selected;            │
│           │ "randtop": randomly selected from the top; "randall": randomly selected from all) and "id"         │
│           │ is a unique identifier within that method, such as "manual-1".                                     │
├───────────┼────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ orgname   │ the name of the organization.                                                                      │
├───────────┼────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ orgdomain │ the domain of the organization.                                                                    │
└───────────┴────────────────────────────────────────────────────────────────────────────────────────────────────┘

In *_ases.fsdb, each row is an AS; 3 columns provide information about it, and the remaining column (clusterid) provides the link between *_ases.fsdb and *_orgs.fsdb.

┌────────┬───────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ asn    │ the AS Number, the unique identifier of an Autonomous System (AS).                                    │
├────────┼───────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ rir    │ the Regional Internet Registry (RIR) the AS belongs to; should be one of {ARIN, RIPE, APNIC, LACNIC,  │
│        │ AFRINIC}.                                                                                             │
├────────┼───────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ asname │ the name of the AS.                                                                                   │
└────────┴───────────────────────────────────────────────────────────────────────────────────────────────────────┘

If the value in a column is "-", the information is not available for that organization/AS.

How Organization vs. AS files relate

The organization files (*_orgs.fsdb) and the AS files (*_ases.fsdb) are paired: 9org_orgs.fsdb relates to 9org_ases.fsdb, randtop_orgs.fsdb to randtop_ases.fsdb, and randall_orgs.fsdb to randall_ases.fsdb. The organization file lists organizations, one organization per row. The AS file lists ASes and their affiliated organizations, one AS per row. To see which ASes belong to the same organization (that is, share the same clusterid), join the organization file with the AS file on clusterid.

Clustering Method

This dataset provides a mapping from ASes to organizations, i.e., identifies which ASes belong to which organizations. We determined the mapping by manual inspection of RIR whois information, using AS names and external information (company web pages, wikipedia, etc.) to infer a feasible ground truth.

We inferred the ground truth around September 1st, 2011 (the suffix of this dataset, 20110901). Thus, as a snapshot, one should expect this dataset to be correct ONLY around this time.

This dataset comprises 109 organizations and their 4019 ASes in total, and is further broken down into three subsets. These three subsets are of varying degrees of quality, unbiasedness and size for different evaluation purposes.
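
To inspect any of the three subsets, the matching organization and AS files can be joined on clusterid, as described under "How Organization vs. AS files relate". The following is a minimal Python sketch, not part of the dataset distribution. It assumes that each file starts with a header line beginning with "#fsdb" that names the columns, that short header tokens such as "-F" take one argument, and that other lines beginning with "#" are comments; none of this is stated in this README, so check it against the actual files. The function names (parse_fsdb, ases_by_org) and the sample rows are hypothetical; the sample uses made-up names and AS numbers from the range reserved for documentation, not data from this dataset.

```python
from collections import defaultdict

# Hypothetical sample rows, NOT taken from the dataset.
ORGS_SAMPLE = (
    "#fsdb -F t clusterid orgname orgdomain\n"
    "manual-1\tExample Org\texample.org\n"
    "manual-2\tAnother Example\t-\n"
)
ASES_SAMPLE = (
    "#fsdb -F t asn rir asname clusterid\n"
    "64496\tARIN\tEXAMPLE-1\tmanual-1\n"
    "64497\tRIPE\tEXAMPLE-2\tmanual-1\n"
    "64498\tAPNIC\t-\tmanual-2\n"
)


def parse_fsdb(text):
    """Parse FSDB-style text into a list of dicts keyed by column name.

    Assumed layout: a "#fsdb" header naming the columns, with two-character
    option tokens (e.g. "-F") each followed by one argument; any other line
    starting with "#" is treated as a comment.
    """
    columns = None
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("#"):
            if columns is None and line.startswith("#fsdb"):
                columns = []
                skip_next = False
                for token in line.split()[1:]:
                    if skip_next:
                        skip_next = False
                    elif token.startswith("-"):
                        # "-F t" style: the next token is the option value.
                        skip_next = len(token) == 2
                    else:
                        columns.append(token)
            continue
        if columns is None:
            raise ValueError("no #fsdb header found before data rows")
        # Split on tabs when present, otherwise on runs of whitespace.
        fields = line.split("\t") if "\t" in line else line.split()
        rows.append(dict(zip(columns, fields)))
    return rows


def ases_by_org(orgs, ases):
    """Join AS rows to organization rows on clusterid."""
    members = defaultdict(list)
    for as_row in ases:
        members[as_row["clusterid"]].append(as_row["asn"])
    return {
        org["clusterid"]: (org["orgname"], members.get(org["clusterid"], []))
        for org in orgs
    }


if __name__ == "__main__":
    joined = ases_by_org(parse_fsdb(ORGS_SAMPLE), parse_fsdb(ASES_SAMPLE))
    for clusterid, (orgname, asns) in joined.items():
        print(clusterid, orgname, len(asns), ",".join(asns))
```

To use the sketch on the real data, read a pair of files (for example 9org_orgs.fsdb and 9org_ases.fsdb) into strings and pass them to parse_fsdb in place of the samples. A value of "-" is kept as-is, so callers can treat it as "not available".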
The "9org" subset contains 9 big U.S.-based public companies with plenty of information online, so it is of fairly good quality. Although hand-picked, it sheds light on key players in today’s Internet: four large telecommunications companies, four content providers, and a root-DNS provider.

In contrast, the "randtop" and "randall" subsets are randomly chosen, and each consists of 50 organizations. "randtop" is a random sample of size 50 from the 100 largest organizations we find, where the size of an organization is given by the number of its ASes. From manual inspection, this subset contains large ISPs, big research networks, media conglomerates and multi-national financial companies. By comparison, "randall" is a randomly selected set of 50 organizations from all 36,463 clusters our method produces. Most "randall" organizations are small, private organizations, often without even a website. Although less interesting, "randall" represents a completely unbiased sample.

We identified ASes as part of the same organization by manual inspection of keywords in AS names, such as organization names, subsidiary names and merger and acquisition company names. Although this data is the best we could infer and we have used it as the best available ground truth to test automated algorithms, we cannot guarantee its completeness or accuracy.

Citation

If you use this trace to conduct additional research, please cite it as:

    PREDICT ID: USC-LANDER/as_to_org_mapping_inferred_truth-20110901/rev4045.
    Traces generated on 2012-06-18.
    Provided by the USC/LANDER project (http://www.isi.edu/ant/lander).

Results Using This Dataset

This dataset has been used in the following previously published work:

  • Xue Cai, John Heidemann, Balachander Krishnamurthy, and Walter Willinger. An Organization-Level View of the Internet and its Implications (Extended). Technical Report ISI-TR-2012-679, USC/Information Sciences Institute, June, 2012.

User Annotations

This dataset was used in the paper:

Aaron D. Jaggard, Aaron Johnson, Paul Syverson, and Joan Feigenbaum. Representing Network Trust and Using It to Improve Anonymous Communication. In Proceedings of the Privacy Enhancing Technologies Symposium, Amsterdam, Netherlands, July, 2014. [1]

Categories:

  • Datasets
  • LANDER
  • LANDER:Datasets
  • LANDER:Datasets:AddressSpace:Adaptive Probing
  • LANDER:Datasets:AddressSpace