Word2Vec dictionary for 65000 Gutenberg E-books
dc.contributor.author | Egense, Thomas | |
dc.date.accessioned | 2018-06-26T11:04:29Z | |
dc.date.available | 2018-06-26T11:04:29Z | |
dc.description.abstract | Description: 55,000 e-books from Project Gutenberg (http://www.gutenberg.org/). About 35,000 of the books are in English, but over 50 different languages are represented. The word2vec algorithm does a good job of separating the languages, so the result behaves almost like 50 separate word2vec dictionaries. Corpus size: 30 GB of text, split into 230 million sentences with all punctuation removed. Building the dictionary takes about 1.5 weeks of CPU time. Word2Vec parameters: Software implementation: Google; Model: Skip-Gram; Word window size: 5; Iterations: 10; Minimum word frequency: 100; Dimensions: 300; Output format: text. Word2vec dictionary file: 1.4 million different words from over 50 different languages. Requirements: Opening the dictionary in a word2vec implementation requires 16 GB of memory. | en_US |
dc.identifier.uri | https://loar.kb.dk/handle/1902/327 | |
dc.identifier.uri | http://dx.doi.org/10.21994/loar157 | |
dc.language.iso | en | en_US |
dc.rights | CC Public Domain | * |
dc.rights.uri | https://creativecommons.org/publicdomain/mark/1.0/deed.en | * |
dc.subject | word2vec | en_US |
dc.subject | machine learning | en_US |
dc.subject | NLP | en_US |
dc.subject | Gutenberg | en_US |
dc.title | Word2Vec dictionary for 65000 Gutenberg E-books | en_US |
dc.type | Dataset | en_US |
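Because the abstract notes that opening the whole dictionary needs 16 GB of memory, it can help to stream the text file and keep only the words of interest. The following is a minimal sketch, not the author's tooling. It assumes the file follows the usual word2vec text layout (a header line with vocabulary size and dimension count, then one word per line followed by its space-separated vector components). The function names read_vectors and cosine and the tiny SAMPLE data are hypothetical and exist only for illustration.

```python
import io
import math
from typing import Dict, List, Optional, Set, TextIO


def read_vectors(handle: TextIO, wanted: Optional[Set[str]] = None) -> Dict[str, List[float]]:
    """Stream a word2vec text file, keeping only the words in `wanted`.

    Assumed layout: first line "<vocab_size> <dimensions>", then one line
    per word: "<word> <v1> <v2> ... <vN>". If `wanted` is None, every
    word is kept (which needs a lot of memory for the full file).
    """
    header = handle.readline().split()
    dims = int(header[1])  # header[0] is the vocabulary size
    vectors: Dict[str, List[float]] = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dims + 1:
            continue  # skip lines that do not match the assumed layout
        word = parts[0]
        if wanted is None or word in wanted:
            vectors[word] = [float(x) for x in parts[1:]]
            # Stop early once every requested word has been found.
            if wanted is not None and len(vectors) == len(wanted):
                break
    return vectors


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Made-up two-dimensional sample in the assumed format (not real data).
SAMPLE = "3 2\nking 1.0 0.0 \nqueen 0.8 0.6 \nhaus 0.0 1.0 \n"

vecs = read_vectors(io.StringIO(SAMPLE), wanted={"king", "queen"})
sim = cosine(vecs["king"], vecs["queen"])
print(sorted(vecs), round(sim, 3))
```

For the real download, the StringIO object would be replaced by an open file handle on gutenberg_65K_books.txt. The file's character encoding is not stated in this record, so opening it as UTF-8 with errors="replace" is an assumption to verify.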
Files
Original bundle
- Name: gutenberg_65K_books.txt
- Size: 7.99 GB
- Format: Plain Text
- Description: Word2Vec dictionary file in text format for 65K Gutenberg E-books
License bundle
- Name: license.txt
- Size: 4.47 KB
- Format: Item-specific license agreed upon to submission
- Description: