init repo
92
Questions/Q3/aclImdb/README
Normal file
@@ -0,0 +1,92 @@
Large Movie Review Dataset v1.0

Overview

This dataset contains movie reviews along with their associated binary
sentiment polarity labels. It is intended to serve as a benchmark for
sentiment classification. This document outlines how the dataset was
gathered, and how to use the files provided.

Dataset

The core dataset contains 50,000 reviews split evenly into 25k train
and 25k test sets. The overall distribution of labels is balanced (25k
pos and 25k neg). We also include an additional 50,000 unlabeled
documents for unsupervised learning.

In the entire collection, no more than 30 reviews are allowed for any
given movie because reviews for the same movie tend to have correlated
ratings. Further, the train and test sets contain disjoint sets of
movies, so no significant performance gain can be obtained by memorizing
movie-unique terms and their association with observed labels. In the
labeled train/test sets, a negative review has a score <= 4 out of 10,
and a positive review has a score >= 7 out of 10. Thus reviews with
more neutral ratings are not included in the train/test sets. In the
unsupervised set, reviews of any rating are included and there are
equal numbers of reviews with ratings > 5 and <= 5.
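
The labeling rule described above can be sketched in Python as follows.
This is only an illustration and is not code shipped with the dataset;
the function name polarity_from_rating and the returned strings "neg"
and "pos" are assumptions chosen to match the directory names.

```python
from typing import Optional


def polarity_from_rating(rating: int) -> Optional[str]:
    """Map a 1-10 star rating to a binary polarity label.

    Returns "neg" for ratings <= 4, "pos" for ratings >= 7, and None
    for the more neutral ratings (5 and 6), which are left out of the
    labeled train/test sets.
    """
    if not 1 <= rating <= 10:
        raise ValueError(f"rating must be between 1 and 10, got {rating}")
    if rating <= 4:
        return "neg"
    if rating >= 7:
        return "pos"
    return None  # neutral rating: excluded from the labeled sets
```

Returning None for ratings 5 and 6 keeps the excluded middle range
explicit instead of forcing it into one of the two classes.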

Files

There are two top-level directories [train/, test/] corresponding to
the training and test sets. Each contains [pos/, neg/] directories for
the reviews with binary labels positive and negative. Within these
directories, reviews are stored in text files named following the
convention [[id]_[rating].txt] where [id] is a unique id and [rating] is
the star rating for that review on a 1-10 scale. For example, the file
[test/pos/200_8.txt] is the text for a positive-labeled test set
example with unique id 200 and star rating 8/10 from IMDb. The
[train/unsup/] directory has 0 for all ratings because the ratings are
omitted for this portion of the dataset.
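
As an illustration of the naming convention above, the following Python
sketch splits a review path into its parts. It is not part of the
dataset; the names ReviewFile and parse_review_path are hypothetical,
and the sketch assumes paths use forward slashes and end in
[split]/[label]/[id]_[rating].txt.

```python
import re
from pathlib import PurePosixPath
from typing import NamedTuple


class ReviewFile(NamedTuple):
    split: str      # e.g. "train" or "test"
    label: str      # e.g. "pos", "neg", or "unsup"
    review_id: int  # the unique id from the file name
    rating: int     # star rating on a 1-10 scale; 0 in train/unsup/


# File names look like "200_8.txt": digits, underscore, digits, ".txt".
_NAME_RE = re.compile(r"(\d+)_(\d+)\.txt")


def parse_review_path(path: str) -> ReviewFile:
    """Parse a path such as "test/pos/200_8.txt" into its components."""
    parts = PurePosixPath(path).parts
    if len(parts) < 3:
        raise ValueError(f"expected split/label/file, got {path!r}")
    split, label, name = parts[-3:]
    match = _NAME_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"file name does not match [id]_[rating].txt: {name!r}")
    return ReviewFile(split, label, int(match.group(1)), int(match.group(2)))
```

For the example file above, parse_review_path("test/pos/200_8.txt")
yields split "test", label "pos", review_id 200, and rating 8.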

We also include the IMDb URLs for each review in a separate
[urls_[pos, neg, unsup].txt] file. A review with unique id 200 will
have its URL on line 200 of this file. Because IMDb is ever-changing, we
are unable to link directly to the review, but only to the movie's
review page.

In addition to the review text files, we include already-tokenized bag
of words (BoW) features that were used in our experiments. These
are stored in .feat files in the train/test directories. Each .feat
file is in LIBSVM format, an ASCII sparse-vector format for labeled
data. The feature indices in these files start from 0, and the text
token corresponding to a feature index is found in [imdb.vocab]. So a
line with 0:7 in a .feat file means the first word in [imdb.vocab]
(the) appears 7 times in that review.

LIBSVM page for details on .feat file format:
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
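
A minimal Python sketch of reading one .feat line is shown below. It is
an illustration under stated assumptions, not the code used in our
experiments: it assumes each line has the usual LIBSVM shape
"label index:value index:value ...", that the label and the values are
integers, and that [imdb.vocab] lists one token per line so that a
token's position in the list is its feature index. The function names
and the toy vocabulary are hypothetical.

```python
from typing import Dict, List, Tuple


def parse_feat_line(line: str) -> Tuple[int, Dict[int, int]]:
    """Parse one LIBSVM-style line into (label, {feature_index: count})."""
    fields = line.split()
    label = int(fields[0])  # assumption: the label is an integer
    counts: Dict[int, int] = {}
    for field in fields[1:]:
        index, value = field.split(":")
        counts[int(index)] = int(value)  # assumption: values are word counts
    return label, counts


def counts_to_tokens(counts: Dict[int, int], vocab: List[str]) -> Dict[str, int]:
    """Replace 0-based feature indices with the tokens they stand for."""
    return {vocab[index]: count for index, count in counts.items()}


# Toy vocabulary standing in for the lines of imdb.vocab (hypothetical).
toy_vocab = ["the", "and", "a", "of"]
label, counts = parse_feat_line("9 0:7 3:2")
tokens = counts_to_tokens(counts, toy_vocab)  # {"the": 7, "of": 2}
```

Here the entry 0:7 maps to the first vocabulary token, matching the
example in the paragraph above.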

We also include [imdbEr.txt] which contains the expected rating for
each token in [imdb.vocab] as computed by (Potts, 2011). The expected
rating is a good way to get a sense for the average polarity of a word
in the dataset.

Citing the dataset

When using this dataset please cite our ACL 2011 paper which
introduces it. This paper also contains classification results which
you may want to compare against.


@InProceedings{maas-EtAl:2011:ACL-HLT2011,
  author    = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher},
  title     = {Learning Word Vectors for Sentiment Analysis},
  booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
  month     = {June},
  year      = {2011},
  address   = {Portland, Oregon, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {142--150},
  url       = {http://www.aclweb.org/anthology/P11-1015}
}

References

Potts, Christopher. 2011. On the negativity of negation. In Nan Li and
David Lutz, eds., Proceedings of Semantics and Linguistic Theory 20,
636-659.

Contact

For questions/comments/corrections please contact Andrew Maas
amaas@cs.stanford.edu
89527
Questions/Q3/aclImdb/imdb.vocab
Normal file
File diff suppressed because it is too large
25000
Questions/Q3/aclImdb/test/labeledBow.feat
Normal file
File diff suppressed because one or more lines are too long
25000
Questions/Q3/aclImdb/train/labeledBow.feat
Normal file
File diff suppressed because it is too large
50000
Questions/Q3/aclImdb/train/unsupBow.feat
Normal file
File diff suppressed because one or more lines are too long