SIAM 2007 Text Mining Competition dataset
**Subject Area:**
Text Mining
**Description:**
This is the dataset used for the SIAM 2007 Text Mining competition. This competition focused on developing text mining algorithms for document classification. The documents in question were aviation safety reports that documented one or more problems that occurred during certain flights. The goal was to label the documents with respect to the types of problems that were described. This is a subset of the Aviation Safety Reporting System (ASRS) dataset, which is publicly available.
**How Data Was Acquired:**
The data for this competition came from human-generated reports on incidents that occurred during a flight.
**Sample Rates, Parameter Description, and Format:**
There is one document per incident. The datasets are in raw text format. All documents for each set are contained in a single file, and each row in that file corresponds to a single document. Each line begins with the document number, and a tilde (~) separates the document number from the text itself.
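The line format described above can be read with a few lines of Python. This is a minimal sketch, not part of the official contest software: the function names `parse_documents` and `load_documents` are hypothetical, the UTF-8 encoding is an assumption (undecodable bytes are replaced rather than raising), and the document number is kept as a string because the description does not specify its exact form. Only the first tilde on a line is treated as the separator, so tildes inside the report text are preserved.

```python
import gzip


def parse_documents(lines):
    """Yield (doc_id, text) pairs from lines shaped like '<number>~<text>'."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        # Split on the FIRST tilde only; later tildes belong to the text.
        doc_id, sep, text = line.partition("~")
        if not sep:
            raise ValueError(f"no tilde separator in line: {line[:40]!r}")
        yield doc_id.strip(), text


def load_documents(path):
    """Read a gzip-compressed document file into a {doc_id: text} dict.

    Assumes UTF-8 text (an assumption, not stated in the dataset notes).
    """
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        return dict(parse_documents(fh))
```

For example, `load_documents("TrainingData.txt.gz")` would return a dictionary keyed by document number, assuming the file has been downloaded to the working directory.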
**Anomalies/Faults:**
This is a document category classification problem: each report may be labeled with one or more problem types.
Complete Metadata
| @type | dcat:Dataset |
|---|---|
| accessLevel | public |
| accrualPeriodicity | irregular |
| bureauCode | ["026:00"] |
| contactPoint | {"fn": "Nikunj Oza", "@type": "vcard:Contact", "hasEmail": "mailto:Nikunj.C.Oza@nasa.gov"} |
| description | **Subject Area:** Text Mining **Description:** This is the dataset used for the SIAM 2007 Text Mining competition. This competition focused on developing text mining algorithms for document classification. The documents in question were aviation safety reports that documented one or more problems that occurred during certain flights. The goal was to label the documents with respect to the types of problems that were described. This is a subset of the Aviation Safety Reporting System (ASRS) dataset, which is publicly available. **How Data Was Acquired:** The data for this competition came from human-generated reports on incidents that occurred during a flight. **Sample Rates, Parameter Description, and Format:** There is one document per incident. The datasets are in raw text format. All documents for each set are contained in a single file, and each row in that file corresponds to a single document. Each line begins with the document number, and a tilde (~) separates the document number from the text itself. **Anomalies/Faults:** This is a document category classification problem. |
| distribution | [{"@type": "dcat:Distribution", "title": "Contest_Description_and_Rules.pdf", "format": "PDF", "mediaType": "application/pdf", "description": "Contest Description and Rules", "downloadURL": "https://c3.nasa.gov/dashlink/static/media/dataset/Contest_Description_and_Rules.pdf"}, {"@type": "dcat:Distribution", "title": "ScoringSoftware.tar.gz", "format": "GZ", "mediaType": "application/x-gzip", "description": "Software to calculate contest scoring metrics", "downloadURL": "https://c3.nasa.gov/dashlink/static/media/dataset/ScoringSoftware.tar.gz"}, {"@type": "dcat:Distribution", "title": "TestTruth.csv.gz", "format": "GZ", "mediaType": "application/x-gzip", "description": "Test Document Labels", "downloadURL": "https://c3.nasa.gov/dashlink/static/media/dataset/TestTruth.csv.gz"}, {"@type": "dcat:Distribution", "title": "TestData.txt.gz", "format": "GZ", "mediaType": "application/x-gzip", "description": "Test Documents", "downloadURL": "https://c3.nasa.gov/dashlink/static/media/dataset/TestData.txt.gz"}, {"@type": "dcat:Distribution", "title": "TrainCategoryMatrix.csv.gz", "format": "GZ", "mediaType": "application/x-gzip", "description": "Training Document Labels", "downloadURL": "https://c3.nasa.gov/dashlink/static/media/dataset/TrainCategoryMatrix.csv.gz"}, {"@type": "dcat:Distribution", "title": "TrainingData.txt.gz", "format": "GZ", "mediaType": "application/x-gzip", "description": "Training Documents", "downloadURL": "https://c3.nasa.gov/dashlink/static/media/dataset/TrainingData.txt.gz"}] |
| identifier | DASHLINK_138 |
| issued | 2010-09-22 |
| keyword | ["ames", "dashlink", "nasa"] |
| landingPage | https://c3.nasa.gov/dashlink/resources/138/ |
| modified | 2025-04-01 |
| programCode | ["026:029"] |
| publisher | {"name": "Dashlink", "@type": "org:Organization"} |
| title | SIAM 2007 Text Mining Competition dataset |