Ondrej Novak: Learning Convolutional Neural Networks for Age Estimation.
Master's thesis. Czech Technical University in Prague,
Faculty of Information Technology, 2018.
1,231,822 source HTML files of Wikipedia biography articles
Date downloaded: July 7–17, 2017
Saved in format {article_id}_{revision_id}.html
E.g. 3259263_789091946.html corresponding to Demis Hassabis
646,188 source HTML files of Wikipedia image file pages
Date downloaded: December 10–11, 2017
Saved in format {article_id}_{revision_id}_{sequence_number}.html
E.g. 507174_788292444_0.html corresponding to Geoffrey Hinton's first image file page
646,134 images as displayed on the respective File pages
Date downloaded: December 12, 2017
Saved in format {article_id}_{revision_id}_{sequence_number}_thumb.{ext}
E.g. 165492_789661774_0_thumb.jpg corresponding to Daniel Kahneman's first image as displayed on its file page
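The naming schemes above are regular enough to parse mechanically. A minimal sketch for the thumbnail filenames (the helper name is my own, not part of the dataset tooling):

```python
import re

# Thumbnails are saved as {article_id}_{revision_id}_{sequence_number}_thumb.{ext}
THUMB_RE = re.compile(
    r"^(?P<article_id>\d+)_(?P<revision_id>\d+)_(?P<seq>\d+)_thumb\.(?P<ext>\w+)$"
)

def parse_thumb_name(filename: str) -> dict:
    """Split a thumbnail filename into its id components."""
    m = THUMB_RE.match(filename)
    if m is None:
        raise ValueError(f"unexpected filename: {filename}")
    # Keep the extension as a string, convert the ids to integers
    return {k: (v if k == "ext" else int(v)) for k, v in m.groupdict().items()}

# Example from the listing above:
print(parse_thumb_name("165492_789661774_0_thumb.jpg"))
# {'article_id': 165492, 'revision_id': 789661774, 'seq': 0, 'ext': 'jpg'}
```

Dropping the `_thumb` part and the `_{sequence_number}` group gives the corresponding patterns for the image file pages and the biography articles.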
Extracted information, 1st level key is {article_id}
Additional characteristics (demography)
See how to work with these description files (download the source ipynb file)
#filepath;facebox;trn/val/tst=0/1/2;gender=M/F;age;labels=caption,date,desc,name,exif
39/339_788113021_0_thumb.jpg;26,75,151,65,161,190,36,200;0;F;20;20,20,,,
filepath
Source directory (in this case 39, the last two digits of {article_id}) followed by the filename.
facebox
{x1,y1,x2,y2,x3,y3,x4,y4} Coordinates of the bounding-box corners, going clockwise from the top left ([0,0] is the top left of the image). Note: a 25% margin is added to each side (making the facebox 1.5×1.5 the size of the original) when the faces are actually extracted.
trn/val/tst
{0/1/2} Value designating the split into training / validation / testing samples.
gender
{M/F} M for males, F for females, as recorded in the respective Wikidata item.
age
{16–75} Age label as eventually computed (extracted) from the available weak labels (annotations). For training samples it is the median of all present annotations; for validation and testing samples it is the value of the caption annotation.
labels
{age1,...,age5} Age annotations as extracted from the article and the respective image file page. Each sample in this dataset is guaranteed to have at least one such annotation. Refer to the thesis for more information about the annotations.
There is also a version of the dataset containing the extracted faces directly. You can choose between 128×128 color images and 100×100 grayscale images. Note: 26,577 of the 217,800 images in total (12 %) are natively grayscale.
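Putting the field definitions together, one line of a description file can be parsed as follows (a sketch; the function name and the split labels are my own, but the field handling follows the format described above):

```python
def parse_description_line(line: str) -> dict:
    """Parse one semicolon-separated sample line of a description file."""
    filepath, facebox, split, gender, age, labels = line.strip().split(";", 5)
    return {
        "filepath": filepath,
        # eight integers: four (x, y) corner points, clockwise from top left
        "facebox": [int(v) for v in facebox.split(",")],
        # 0/1/2 -> training / validation / testing
        "split": ("trn", "val", "tst")[int(split)],
        "gender": gender,  # 'M' or 'F'
        "age": int(age),
        # up to five weak annotations; an empty field means "not present"
        "labels": [int(v) if v else None for v in labels.split(",")],
    }

sample = "39/339_788113021_0_thumb.jpg;26,75,151,65,161,190,36,200;0;F;20;20,20,,,"
rec = parse_description_line(sample)
print(rec["split"], rec["age"], rec["labels"])
# trn 20 [20, 20, None, None, None]
```

Lines starting with `#` (the header) should be skipped before parsing.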
There are in total 217,800 subjects, of which 174,583 are men and 43,217 are women (an imbalanced 80 % / 20 % split); see the detailed distribution per age–gender category. The validation and testing sets each comprise 15,000 examples chosen by stratified random sampling (so both sets follow the same distribution). This leaves 187,800 samples for training.
Caption and Nocaption variants of the dataset are also provided, in case you are interested in re-creating the experiments described in the thesis.
The use of the Wikipeople database is not restricted in any way, provided you cite this work and adhere to the copyright of the original authors (especially for the database of images); it is OK for research purposes.
In the case of commercial use, please keep in mind that the database also contains non-free images which may not be allowed to be used for such purposes (and the same may apply even to free-licensed images). See the description file for how to distinguish between free and non-free images (line 106, key [52469][imgs][0][link][full][src]).