Word2Vec dictionary for 65000 Gutenberg E-books
Authors
Egense, Thomas
Abstract
Description:
55,000 e-books from Project Gutenberg (http://www.gutenberg.org/).
About 35,000 of the books are in English, but over 50 different languages are represented.
The word2vec algorithm does a good job of separating the different languages, so the result is almost like 50 different word2vec dictionaries.
Corpus size:
30 GB of text, split into 230 million sentences with all punctuation removed.
Word2Vec took about 1.5 weeks of CPU time to build the dictionary.
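The exact preprocessing pipeline used for the corpus is not given; the sketch below only illustrates the kind of step described above (splitting text into sentences and stripping punctuation), using naive sentence boundaries as an assumption.

```python
import re

def sentences_without_punctuation(text):
    """Split raw text into sentences and strip punctuation.

    A minimal sketch of the preprocessing described above; the real
    pipeline behind the 230 million sentences is not specified here.
    """
    # Naive sentence split on ., ! and ?
    sentences = re.split(r"[.!?]+", text)
    cleaned = []
    for s in sentences:
        # Keep only word characters and whitespace, collapse spaces
        s = re.sub(r"[^\w\s]", " ", s)
        s = " ".join(s.split()).lower()
        if s:
            cleaned.append(s)
    return cleaned
```

For example, `sentences_without_punctuation("Hello, world! It was the best of times.")` yields `["hello world", "it was the best of times"]`.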
Word2Vec parameters:
Software implementation: Google word2vec
Model: Skip-Gram
Word window size: 5
Iterations: 10
Minimum word frequency: 100
Dimensions: 300
Output format: text
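With Google's word2vec tool, the parameters listed above would correspond roughly to an invocation like the following sketch; the input and output file names are placeholders, not the actual files used.

```shell
# Hypothetical invocation of Google's word2vec tool with the parameters
# above: Skip-Gram (-cbow 0), 300 dimensions, window 5, 10 iterations,
# minimum word frequency 100, text output (-binary 0).
# corpus.txt and vectors.txt are placeholder file names.
./word2vec -train corpus.txt -output vectors.txt \
  -cbow 0 -size 300 -window 5 -iter 10 -min-count 100 -binary 0
```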
Word2vec dictionary file:
1.4 million different words, from over 50 different languages.
Requirements:
Opening the dictionary in a word2vec implementation will require 16GB of memory.
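The raw vectors alone are around 1.7 GB (1.4 million words × 300 dimensions × 4 bytes per float); the 16 GB figure reflects the additional overhead of typical implementations. A library such as gensim would normally be used to load a file in this text format; the self-contained sketch below is only meant to illustrate the layout of the format: a header line with the vocabulary size and dimension count, then one line per word followed by its vector components.

```python
def load_word2vec_text(path):
    """Minimal reader for the word2vec text format (illustrative only).

    The file starts with a "vocab_size dimensions" header line,
    followed by one line per word: the word, then its components.
    """
    vectors = {}
    with open(path, encoding="utf-8") as fh:
        vocab_size, dims = map(int, fh.readline().split())
        for line in fh:
            parts = line.rstrip().split(" ")
            word = parts[0]
            components = [float(x) for x in parts[1:]]
            assert len(components) == dims
            vectors[word] = components
    assert len(vectors) == vocab_size
    return vectors
```

A production loader would memory-map or stream the file rather than build a plain dict of Python lists, which is part of why real implementations need far more memory than the raw vector data.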
Keywords
word2vec, machine learning, NLP, Gutenberg