Chinese Web test collection

    Note 1: The Chinese Web Test collection with 100 GB of web pages (CWT100g) is now available for a small fee (400 RMB in 2004) that covers distribution costs. How to apply.
    Note 2: The CWT100g is the designated test collection of the SEWM-2004 Chinese Web Track. Procedure for taking part in the track. Evaluation results.
    Evaluation is a major force in research, development, and application related to information retrieval (IR). Following TREC, it has four goals: 1) to encourage retrieval research based on large test collections; 2) to increase communication among industry, academia, and government by creating an open forum for the exchange of research ideas; 3) to speed the transfer of technology from research labs into commercial products by demonstrating substantial improvements in retrieval methodologies on real-world problems; and 4) to increase the availability of appropriate evaluation techniques for use by industry and academia, including the development of new evaluation techniques more applicable to current systems.
    The Web is becoming a universal repository of human knowledge and culture, allowing ideas and information to be shared on an unprecedented scale. IR has accordingly taken a place at center stage alongside other technologies. Google's quick success further confirmed that Web search is a major direction for IR.
    Since 1992, TREC has been dedicated to the evaluation of IR and has greatly accelerated improvements in IR performance. TREC has a Web track and has provided an English Web test collection since 1999. NII has provided a Japanese Web test collection since 2002. The lack of a large Chinese Web collection holds back the improvement of Chinese IR technologies. We therefore look forward to building and refining the CWT100g with the help of all interested research groups, and to accelerating the development of Chinese IR technologies.

    The CWT100g is composed of three parts: the documents, the queries and relevance judgments.
    Documents: based on the Tianwang Search Engine statistics of Feb. 1, 2004, we sampled 17,683 sites from 1,000,614 Web sites and crawled 5,712,710 Web pages totaling 90 GB. Every page in the collection has a "text/html" or "text/plain" MIME type, as received in the HTTP server's response message. Tianwang storage format for raw web pages. A sample.
    Queries: sampled from the Tianwang user query log between Apr. 2002 and June 2004, then compiled by human editors.
    The queries cover the topic distillation task and the home/named page finding task.
    SEWM-2004 Chinese Web Track Guidelines
    Relevance judgments: we combine Tianwang's strengths with TREC's pooling method to obtain the final results. Traditional judgment pools are created as follows: first, participants submit their retrieval runs; second, assessors choose a number of runs to be merged; third, for each selected run, the top X documents are added to the pools and judged. To deal more effectively with incomplete relevance information when there are few participants, we propose a method, Pooling Plus, that yields effective judgments with the help of many virtual participants (search engines).
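    The three pooling steps can be illustrated with a minimal Python sketch. It is not the track's actual implementation: the function name build_pools, the run layout (topic ID mapped to a ranked list of document IDs), and the sample IDs are all hypothetical. The details of Pooling Plus are not given here, so the sketch simply assumes that search-engine result lists are treated as additional runs alongside the participants' runs.

```python
from typing import Dict, List, Set

# Assumed layout: a run maps each topic ID to a ranked list of document IDs.
Run = Dict[str, List[str]]


def build_pools(runs: List[Run], depth: int) -> Dict[str, Set[str]]:
    """Merge the top `depth` documents of every selected run into one pool per topic."""
    pools: Dict[str, Set[str]] = {}
    for run in runs:
        for topic, ranked_docs in run.items():
            # Only the top-X documents of each run enter the pool to be judged.
            pools.setdefault(topic, set()).update(ranked_docs[:depth])
    return pools


# Hypothetical runs submitted by real participants.
participant_runs = [
    {"T1": ["d1", "d2", "d3"]},
    {"T1": ["d2", "d4", "d5"]},
]
# Hypothetical result lists from search engines, assumed to act as "virtual participants".
engine_runs = [
    {"T1": ["d6", "d1", "d7"]},
]

pools = build_pools(participant_runs + engine_runs, depth=2)
print(sorted(pools["T1"]))  # ['d1', 'd2', 'd4', 'd6']
```

    With few real participants the pool for a topic stays small; under the assumption above, each extra search-engine run can only add documents to it, which is one way more relevant documents could reach the assessors.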

    Other links

Last Updated: Nov 15, 2004
Hosted by Network Group, Peking University