• Data Sets
    • Frequent Itemset Mining Implementations Repository - This repository is the result of the 1st International Workshop on Frequent Itemset Mining Implementations (FIMI'03), which took place at IEEE ICDM'03 on November 19, 2003, in Melbourne, Florida, USA. This website serves as the FIMI repository, containing the source code of all implementations accepted at the FIMI workshop together with several publicly available datasets.
    • KDD Cup
      • KDD Cup 1999
        The competition task was to build a network intrusion detector, a predictive model capable of distinguishing between "bad" connections, called intrusions or attacks, and "good" normal connections. This database contains a standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment.
      • KDD Cup 2000
        WebView1, WebView2, WebPOS
      • KDD Cup 2001
        Because of the rapid growth of interest in mining biological databases, KDD Cup 2001 was focused on data from genomics and drug design. Sufficient (yet concise) information was provided so that detailed domain knowledge was not a requirement for entry. A total of 136 groups participated to produce a total of 200 submitted predictions over the 3 tasks: 114 for Thrombin, 41 for Function, and 45 for Localization.
      • KDD Cup 2002
        This year the competition included two tasks that involved data mining in molecular biology domains. The first task focused on constructing models that can assist genome annotators by automatically extracting information from scientific articles. The second task focused on learning models that characterize the behavior of individual genes in a hidden experimental setting.
      • KDD Cup 2003
        The first task involves predicting the future; contestants predict how many citations each paper will receive during the three months leading up to the KDD 2003 conference. For the second task, contestants must build a citation graph of a large subset of the archive from only the LaTeX sources. In the third task, each paper's popularity will be estimated based on partial download logs. And the last task is open! Given the large amount of data, contestants can devise their own questions and the most interesting result is the winner.
    • UCI Knowledge Discovery in Databases Archive
    • StatLib (Department of Statistics at Carnegie Mellon University)
    • ILP applications and Datasets
    • Review of Available ILP Datasets
    • Workgroup KDD-SISYPHUS
    • University of Toronto's Delve datasets
    • Statlog Datasets
    • RISE - Repository of online Information Sources used in information Extraction tasks. (RISE is a distributed repository of online information sources that are used for the empirical analysis of [machine] learning algorithms that generate extraction patterns)
    • The UC Irvine Database Repository
    • Oxford University Computing Laboratory ILP Datasets
    • Pattern recognition datasets from Universal Problem Solvers Inc
• Synthetic Data Generator
    • Synthetic Data Generation Code for Associations and Sequential Patterns
    • Synthetic Data Generator for Associations and Sequential Patterns (supports VC++ version 5 or above)
    • Synthetic Data Generation Code for Classification