LANDER:as to org mapping subsidiary linkage-20101019

From Predict

README version: 4030, last modified: 2014-06-06.

This file describes the trace dataset "as_to_org_mapping_subsidiary_linkage-20101019" provided by the LANDER project. This is a derived dataset processed on 2012-06-18, with data obtained from the sources below:

  • U.S. Securities and Exchange Commission’s (SEC) EDGAR database. http://www.sec.gov/investor/pubs/edgarguide.htm, 2010.
  • Regional Internet Registry (RIR) WHOIS databases. http://www.afrinic.net/, http://www.apnic.net/, http://www.arin.net/, http://www.lacnic.net/, http://www.ripe.net/, October 2010.

Contents

  • 1 LANDER Metadata
  • 2 Dataset Contents
  • 3 Data Format
      • 3.1 Syntax
      • 3.2 Schema
      • 3.3 Included Data
      • 3.4 How organization vs. subsidiary files relate
      • 3.5 How subsidiary, AS and link files relate
  • 4 Linking Method
  • 5 Citation
  • 6 Results Using This Dataset
  • 7 User Annotations

LANDER Metadata

dataSetName                as_to_org_mapping_subsidiary_linkage-20101019
status                     usc-web-and-predict
shortDesc                  Links between ASes and subsidiaries
longDesc                   This dataset provides a linking between ASes and company subsidiaries. It is
                           derived from the WHOIS database and Form 10-K filings.

                           The linking is useful to associate ASes that belong to different subsidiaries of
                           the same organization. We determined the links with automatic record linkage
                           algorithms, followed by manual verification and pruning. The general idea is to
                           compare how similar the name of an AS is to the name of a subsidiary. Because
                           automatic linkage is inaccurate, we then manually verified and pruned the links
                           for the selected most important organizations.
datasetClass               Quasi-Restricted
commercialAllowed          true
requestReviewRequired      true
productReviewRequired      false
ongoingMeasurement         false
submissionMethod           Upload
collectionStartDate        2010-10-19
collectionStartTime        00:00:00
collectionEndDate          2010-12-31
collectionEndTime          00:00:00
availabilityStartDate      2013-03-04
availabilityStartTime      18:10:02
availabilityEndDate        2030-01-01
availabilityEndTime        00:00:00
anonymization              none
archivingAllowed           false
keywords                   category:internet-topology-data, subcategory:as-organizational-data, internet,
                           topology, AS, organization, subsidiary, linking
format                     text
access                     https
hostName                   USC-LANDER
providerName               USC
groupingId
groupingSummaryFlag        false
retrievalInstructions      download
byteSize                   2709520384
expirationDays             14
uncompressedSize           22377760512
impactDoi                  10.23721/109/1353818
useAgreement               dua-ni-160816
irbRequired                false
privateAccessInstructions  See http://www.isi.edu/ant/traces/index.html#getting_datasets for information on
                           obtaining this dataset.

Dataset Contents

as_to_org_mapping_subsidiary_linkage-20101019.README.txt
      Copy of this README.
orgs.fsdb
      The IDs and names of organizations. Only US public companies are included.
orgs_selected.fsdb
      The selected most important organizations, a subset of orgs.fsdb.
subsidiaries.fsdb
      The subsidiaries of organizations in orgs.fsdb, including names of both organizations and their subsidiaries.
ases.fsdb
      The ASNs and names of ASes.
links_selected.fsdb
      The manually verified and pruned links between ASes and subsidiaries of the selected organizations.
form10k.tar.bz2
      The original Form 10-K filings of organizations in orgs.fsdb.
ex21.tar.bz2
      The extracted Exhibit 21 contained in each Form 10-K filing that provides information about organization subsidiaries.
.sha1sum
      SHA-1 checksums.

The file ".sha1sum" contains SHA-1 checksums of the individual compressed files. The integrity of the distribution can thus be checked by independently calculating the SHA-1 sums of the files and comparing them with those listed in the file. If you have the sha1sum utility installed on your system, you can do that by executing:

      sha1sum --check .sha1sum

Data Format

Syntax

Each of the *.fsdb files is in FSDB file format. This is a simple, whitespace-separated text database format, where each line is a database row and whitespace separates columns.

Schema

Each *.fsdb file is a simple database.

In orgs.fsdb and orgs_selected.fsdb, each row is an organization, and 3 columns provide information about it.

/* the following fields are derived from the EDGAR database */
cik         Central Index Key (CIK), the unique identifier of the organization in the EDGAR database.
orgname     The name of the organization.
accession   The accession number that identifies the Form 10-K filing of the organization. Use this
            number to find the original filing in form10k.tar.bz2 and the extracted Exhibit 21 in
            ex21.tar.bz2. For example, if the accession number is "0000002178-10-000008", then
            decompress form10k.tar.bz2, and the original Form 10-K filing will be
            "form10k/0000002178-10-000008.txt". We also provide the already extracted Exhibit 21 of
            each filing. Decompress ex21.tar.bz2, and the Exhibit 21 of the organization will be
            "ex21/0000002178-10-000008-ex21.htm".

In subsidiaries.fsdb, each row is a subsidiary, and 2 columns provide information about it.

/* the following fields are derived from the EDGAR database */
cik         Central Index Key (CIK), the unique identifier of the organization to which the
            subsidiary belongs.
subsidiary  The name of the subsidiary.

In ases.fsdb, each row is an AS, and 2 columns provide information about it.

/* the following fields are derived from the WHOIS database */
asn         The unique identifier of the AS.
asname      The name of the AS.

In links_selected.fsdb, each row is a link between a subsidiary and an AS, and 4 columns provide information about it.

/* the following fields are derived from the EDGAR database */
cik         Central Index Key (CIK), the unique identifier of the organization to which the
            subsidiary belongs.
subsidiary  The name of the subsidiary.
/* the following fields are derived from the WHOIS database */
asn         The unique identifier of the AS.
asname      The name of the AS.

If the value in a column is "-", the information is not available.

Included Data

We also include the raw SEC 10-K data obtained from the EDGAR database. This data is distributed freely by the SEC at [1] with the statement "Anyone can access and download this information for free".

form10k.tar.bz2 contains the original Form 10-K filings of organizations in orgs.fsdb. Each filing is in text format. Each filing contains an Exhibit 21 that lists the organization's subsidiaries. The Exhibit 21 is in HTML format. We extracted these exhibits from all 10-K filings and stored them in ex21.tar.bz2. Use the "accession" number stored in orgs.fsdb to find the corresponding Form 10-K and Exhibit 21 of an organization.

How organization vs. subsidiary files relate

The organization file (orgs.fsdb) and the subsidiary file (subsidiaries.fsdb) relate to each other. The organization file lists all US public companies, one organization per row. The subsidiary file lists the subsidiaries of these organizations that we extracted from their 10-K filings, one subsidiary per row. To see which organization a subsidiary belongs to, join the organization file with the subsidiary file on cik.
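
The join on cik described above can be sketched in Python as follows. This is a minimal illustration, not the tooling used to build the dataset. It assumes each file begins with an FSDB header line of the form "#fsdb [options] col1 col2 ..." (or the older "#h col1 col2 ..."), that "-F t" in the header marks tab-separated fields, and that other lines beginning with "#" are comments. The function names read_fsdb and join_on_cik, and all sample rows, are hypothetical.

```python
import io

# Hypothetical sample rows; the real orgs.fsdb and subsidiaries.fsdb will differ.
ORGS_TEXT = (
    "#fsdb -F t cik orgname accession\n"
    "1001\tExample Corp\t0000001001-10-000001\n"
    "1002\tSample Inc\t-\n"
)
SUBS_TEXT = (
    "#fsdb -F t cik subsidiary\n"
    "# a comment line, skipped by the reader\n"
    "1001\tExample Networks LLC\n"
    "1001\tExample Hosting Ltd\n"
    "1002\tSample Telecom\n"
    "9999\tOrphan Co\n"
)


def read_fsdb(stream):
    """Yield one dict per data row, keyed by the column names in the header.

    Assumed layout: the first line is the header, later lines starting
    with "#" are comments, and "-F t" in the header means tab-separated.
    """
    columns, sep = None, None  # sep=None splits on any run of whitespace
    for raw in stream:
        line = raw.rstrip("\n")
        if columns is None:
            tokens = line.split()
            if not tokens or tokens[0] not in ("#fsdb", "#h"):
                raise ValueError("expected an FSDB header line")
            tokens = tokens[1:]
            columns = []
            while tokens:
                tok = tokens.pop(0)
                if tok.startswith("-"):
                    # Option written as "-Ft" or "-F t"; fetch its argument.
                    arg = tok[2:] or (tokens.pop(0) if tokens else "")
                    if tok[:2] == "-F" and arg == "t":
                        sep = "\t"
                else:
                    columns.append(tok)
            continue
        if not line or line.startswith("#"):
            continue  # skip blank and comment lines
        yield dict(zip(columns, line.split(sep)))


def join_on_cik(orgs, subsidiaries):
    """Attach the organization name to every subsidiary row via cik."""
    org_by_cik = {row["cik"]: row for row in orgs}
    joined = []
    for sub in subsidiaries:
        org = org_by_cik.get(sub["cik"])
        joined.append({
            "cik": sub["cik"],
            # "-" follows the README convention for unavailable values.
            "orgname": org["orgname"] if org else "-",
            "subsidiary": sub["subsidiary"],
        })
    return joined


if __name__ == "__main__":
    orgs = read_fsdb(io.StringIO(ORGS_TEXT))
    subs = read_fsdb(io.StringIO(SUBS_TEXT))
    for row in join_on_cik(orgs, subs):
        print(row["cik"], row["orgname"], row["subsidiary"], sep="\t")
```

To run this against the real files, replace the io.StringIO objects with open("orgs.fsdb") and open("subsidiaries.fsdb"). Check the actual header line first, since the field separator assumed here may not match the distributed files.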

How subsidiary, AS and link files relate

The subsidiary file (subsidiaries.fsdb), the AS file (ases.fsdb) and the link file (links_selected.fsdb) relate to each other. The subsidiary file lists subsidiaries and their organization IDs, one subsidiary per row. The AS file lists ASes, one AS per row. We link subsidiaries with ASes by their names, and these links are stored in the link file. Because manual verification is labor-intensive, only links to subsidiaries of the selected most important organizations in orgs_selected.fsdb are included.

Linking Method

This dataset provides a linking between ASes and company subsidiaries. The linking is useful to associate ASes that belong to different subsidiaries of the same organization. We determined the links with automatic record linkage algorithms, followed by manual verification and pruning. The general idea is to compare how similar the name of an AS is to the name of a subsidiary. Because automatic linkage is inaccurate, we then manually verified and pruned the links for the selected most important organizations. Details about our methodology are in the technical report:

  • Xue Cai, John Heidemann, Balachander Krishnamurthy, and Walter Willinger. An Organization-Level View of the Internet and its Implications (Extended). Technical Report ISI-TR-2012-679, USC/Information Sciences Institute, June, 2012. ftp://ftp.isi.edu/isi-pubs/tr-679.pdf

Citation

If you use this trace to conduct additional research, please cite it as:

      PREDICT ID: USC-LANDER/as_to_org_mapping_subsidiary_linkage-20101019/rev4030. Traces generated on 2012-06-18. Provided by the USC/LANDER project (http://www.isi.edu/ant/lander).

Results Using This Dataset

This dataset has been used in the following previously published work:

  • Xue Cai, John Heidemann, Balachander Krishnamurthy, and Walter Willinger. An Organization-Level View of the Internet and its Implications (Extended). Technical Report ISI-TR-2012-679, USC/Information Sciences Institute, June, 2012.
    ftp://ftp.isi.edu/isi-pubs/tr-679.pdf

User Annotations

Currently no annotations.