init repo

2022-10-15 11:36:03 +08:00
parent 180767d408
commit 0e475aa135
37 changed files with 204019 additions and 162 deletions
--- a/Questions/Q3/aclImdb/README
+++ b/Questions/Q3/aclImdb/README
@ -0,0 +1,92 @@
+Large Movie Review Dataset v1.0
+
+Overview
+
+This dataset contains movie reviews along with their associated binary
+sentiment polarity labels. It is intended to serve as a benchmark for
+sentiment classification. This document outlines how the dataset was
+gathered, and how to use the files provided. 
+
+Dataset 
+
+The core dataset contains 50,000 reviews split evenly into 25k train
+and 25k test sets. The overall distribution of labels is balanced (25k
+pos and 25k neg). We also include an additional 50,000 unlabeled
+documents for unsupervised learning. 
+
+In the entire collection, no more than 30 reviews are allowed for any
+given movie because reviews for the same movie tend to have correlated
+ratings. Further, the train and test sets contain a disjoint set of
+movies, so no significant performance is obtained by memorizing
+movie-unique terms and their associated with observed labels.  In the
+labeled train/test sets, a negative review has a score <= 4 out of 10,
+and a positive review has a score >= 7 out of 10. Thus reviews with
+more neutral ratings are not included in the train/test sets. In the
+unsupervised set, reviews of any rating are included and there are an
+even number of reviews > 5 and <= 5.
+
+Files
+
+There are two top-level directories [train/, test/] corresponding to
+the training and test sets. Each contains [pos/, neg/] directories for
+the reviews with binary labels positive and negative. Within these
+directories, reviews are stored in text files named following the
+convention [[id]_[rating].txt] where [id] is a unique id and [rating] is
+the star rating for that review on a 1-10 scale. For example, the file
+[test/pos/200_8.txt] is the text for a positive-labeled test set
+example with unique id 200 and star rating 8/10 from IMDb. The
+[train/unsup/] directory has 0 for all ratings because the ratings are
+omitted for this portion of the dataset.
+
+We also include the IMDb URLs for each review in a separate
+[urls_[pos, neg, unsup].txt] file. A review with unique id 200 will
+have its URL on line 200 of this file. Due the ever-changing IMDb, we
+are unable to link directly to the review, but only to the movie's
+review page.
+
+In addition to the review text files, we include already-tokenized bag
+of words (BoW) features that were used in our experiments. These 
+are stored in .feat files in the train/test directories. Each .feat
+file is in LIBSVM format, an ascii sparse-vector format for labeled
+data.  The feature indices in these files start from 0, and the text
+tokens corresponding to a feature index is found in [imdb.vocab]. So a
+line with 0:7 in a .feat file means the first word in [imdb.vocab]
+(the) appears 7 times in that review.
+
+LIBSVM page for details on .feat file format:
+http://www.csie.ntu.edu.tw/~cjlin/libsvm/
+
+We also include [imdbEr.txt] which contains the expected rating for
+each token in [imdb.vocab] as computed by (Potts, 2011). The expected
+rating is a good way to get a sense for the average polarity of a word
+in the dataset.
+
+Citing the dataset
+
+When using this dataset please cite our ACL 2011 paper which
+introduces it. This paper also contains classification results which
+you may want to compare against.
+
+
+@InProceedings{maas-EtAl:2011:ACL-HLT2011,
+  author    = {Maas, Andrew L.  and  Daly, Raymond E.  and  Pham, Peter T.  and  Huang, Dan  and  Ng, Andrew Y.  and  Potts, Christopher},
+  title     = {Learning Word Vectors for Sentiment Analysis},
+  booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
+  month     = {June},
+  year      = {2011},
+  address   = {Portland, Oregon, USA},
+  publisher = {Association for Computational Linguistics},
+  pages     = {142--150},
+  url       = {http://www.aclweb.org/anthology/P11-1015}
+}
+
+References
+
+Potts, Christopher. 2011. On the negativity of negation. In Nan Li and
+David Lutz, eds., Proceedings of Semantics and Linguistic Theory 20,
+636-659.
+
+Contact
+
+For questions/comments/corrections please contact Andrew Maas
+amaas@cs.stanford.edu