Wikipeople | Homepage

Ondrej Novak: Learning Convolutional Neural Networks for Age Estimation.
Master's thesis. Czech Technical University in Prague, Faculty of Information Technology, 2018.


Database

Metadatabase aka Wikipeople corpus

Article pages

1,231,822 source HTML files of Wikipedia biography articles

Date downloaded: July 7–17, 2017

Saved in format {article_id}_{revision_id}.html

E.g. 3259263_789091946.html corresponding to Demis Hassabis

File pages

646,188 source HTML files of Wikipedia image file pages

Date downloaded: December 10–11, 2017

Saved in format {article_id}_{revision_id}_{sequence_number}.html

E.g. 507174_788292444_0.html corresponding to Geoffrey Hinton's first image file page

Images

646,134 images as displayed on the respective File pages

Date downloaded: December 12, 2017

Saved in format {article_id}_{revision_id}_{sequence_number}_thumb.{ext}

E.g. 165492_789661774_0_thumb.jpg corresponding to Daniel Kahneman's first image as on the file page
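
The three naming schemes above share a common prefix, so one pattern can cover all of them. The following is a minimal sketch, not part of the original tooling; the names `WikipeopleName` and `parse_name` are hypothetical and chosen only for illustration.

```python
import re
from typing import NamedTuple, Optional


class WikipeopleName(NamedTuple):
    """Parts of a file name (hypothetical helper type, not from the database)."""
    article_id: int
    revision_id: int
    sequence_number: Optional[int]  # None for article pages
    is_thumb: bool                  # True only for downloaded images
    ext: str


# {article_id}_{revision_id}[_{sequence_number}][_thumb].{ext}
_NAME_PATTERN = re.compile(
    r"^(?P<article>\d+)_(?P<revision>\d+)"
    r"(?:_(?P<seq>\d+))?"
    r"(?P<thumb>_thumb)?"
    r"\.(?P<ext>[A-Za-z0-9]+)$"
)


def parse_name(filename: str) -> WikipeopleName:
    """Split an article page, file page or image file name into its parts."""
    match = _NAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    seq = match.group("seq")
    return WikipeopleName(
        article_id=int(match.group("article")),
        revision_id=int(match.group("revision")),
        sequence_number=int(seq) if seq is not None else None,
        is_thumb=match.group("thumb") is not None,
        ext=match.group("ext"),
    )


article = parse_name("3259263_789091946.html")
file_page = parse_name("507174_788292444_0.html")
image = parse_name("165492_789661774_0_thumb.jpg")
```

Because the `{article_id}` and `{revision_id}` fields are shared, the parsed values can be used to join article pages, file pages and images.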

Description

Wikipedia

Extracted information, 1st level key is {article_id}

Wikidata

Additional characteristics (demography)

Showcase

Datasets

Wikipeople aka Anycaption

Anycaption
Description format

#filepath;facebox;trn/val/tst=0/1/2;gender=M/F;age;labels=caption,date,desc,name,exif
39/339_788113021_0_thumb.jpg;26,75,151,65,161,190,36,200;0;F;20;20,20,,,

  1. filepath Path to the image; the source directory is (in this case) the last two digits of {article_id}
  2. facebox {x1,y1,x2,y2,x3,y3,x4,y4} Coordinates of the four bounding-box corners, listed clockwise from the top left ([0,0] is the top-left corner of the image). Note: A 25% margin is added to each side (making the facebox 1.5x1.5 the size of the original) when the faces are actually extracted.
  3. trn/val/tst {0/1/2} Value designating the split into training / validation / testing samples
  4. gender {M/F} M for males, F for females as in the respective Wikidata item
  5. age {16–75} Age label computed from the available weak labels (annotations). For training samples it is the median of all annotations that are present; for validation and testing samples it is the value of the caption annotation.
  6. labels {age1,...,age5} Age annotations as extracted from the article and the respective image file page. Each sample in this dataset is guaranteed to have at least one such annotation. Refer to the thesis for more information about the annotations.
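
The description format can be read with a few lines of Python. This is a minimal sketch based only on the header and the sample line shown above; the names `Sample`, `parse_line` and `median_of_present` are hypothetical. Ages are parsed as floats as a precaution, on the assumption (not confirmed by the source) that a median could be fractional.

```python
from dataclasses import dataclass
from statistics import median
from typing import Dict, List, Optional, Tuple

# Order of the annotations in the last field, as given in the header line.
LABEL_NAMES = ("caption", "date", "desc", "name", "exif")
SPLIT_NAMES = {0: "trn", 1: "val", 2: "tst"}


@dataclass
class Sample:
    filepath: str
    facebox: List[Tuple[int, int]]      # four (x, y) corners, clockwise from top left
    split: str                          # "trn", "val" or "tst"
    gender: str                         # "M" or "F"
    age: float
    labels: Dict[str, Optional[float]]  # None where an annotation is missing


def parse_line(line: str) -> Sample:
    """Parse one non-comment line of the description file."""
    filepath, facebox, split, gender, age, labels = line.strip().split(";")
    coords = [int(v) for v in facebox.split(",")]
    ages = [float(v) if v else None for v in labels.split(",")]
    return Sample(
        filepath=filepath,
        facebox=list(zip(coords[0::2], coords[1::2])),
        split=SPLIT_NAMES[int(split)],
        gender=gender,
        age=float(age),
        labels=dict(zip(LABEL_NAMES, ages)),
    )


def median_of_present(labels: Dict[str, Optional[float]]) -> float:
    """Median of the annotations that are present (the training-set rule)."""
    return median(v for v in labels.values() if v is not None)


sample = parse_line(
    "39/339_788113021_0_thumb.jpg;26,75,151,65,161,190,36,200;0;F;20;20,20,,,"
)
```

For the sample line, which belongs to the training split, the median of the two present annotations (caption and date) equals the stored age label. Lines starting with `#` are header comments and should be skipped before calling `parse_line`.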

Face images

There is also a version of the dataset that contains the extracted faces directly. You can choose between 128x128 color images and 100x100 grayscale images. Note: 26,577 of the total 217,800 images (12 %) are natively in grayscale.
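
The facebox note in the description format says that a 25% margin is added to each side before extraction, which makes the box 1.5x1.5 the original size. One plausible reading is to scale the four corners about the box center, which also works for rotated boxes. This is a sketch under that assumption, not the code used to build the dataset; `expand_facebox` is a hypothetical name.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def expand_facebox(points: Sequence[Point], margin: float = 0.25) -> List[Point]:
    """Scale the box corners about their centroid by (1 + 2 * margin).

    With margin=0.25 each side grows by 25% of the box size in both
    directions, so width and height become 1.5 times the original.
    """
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    scale = 1.0 + 2.0 * margin
    return [(cx + (x - cx) * scale, cy + (y - cy) * scale) for x, y in points]


# A 4x4 axis-aligned square becomes a 6x6 square with the same center.
square = [(0, 0), (4, 0), (4, 4), (0, 4)]
expanded = expand_facebox(square)
```

The expanded corners can fall outside the image, so a real extraction step would still need to clip or pad; how the original pipeline handled that is not stated here.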

Distribution

There are 217,800 subjects in total, of which 174,583 are men and 43,217 are women (an imbalanced 80 % / 20 % split); see the detailed distribution per age-gender category. The validation and testing sets each consist of 15,000 examples chosen by stratified random sampling (so each of the sets follows the same distribution). This leaves 187,800 samples for training.
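
The counts quoted in this section can be cross-checked with simple arithmetic. The snippet below only restates numbers from this page; the variable names are illustrative.

```python
total = 217_800
men, women = 174_583, 43_217
validation = testing = 15_000
grayscale = 26_577

# Gender counts add up to the total number of subjects.
assert men + women == total

# Whatever is not held out for validation or testing is used for training.
training = total - validation - testing
assert training == 187_800

# Shares in percent, rounded to one decimal place.
men_share = round(100 * men / total, 1)
women_share = round(100 * women / total, 1)
grayscale_share = round(100 * grayscale / total, 1)
print(men_share, women_share, grayscale_share)
```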

Other

The Caption and Nocaption datasets are also provided in case you are interested in re-creating the experiments described in the thesis.

Caption
Nocaption

License

The use of the Wikipeople database is not restricted in any way, provided you cite this work and respect the copyright of the original authors (especially for the database of images). It is OK for research purposes.

For commercial use, please keep in mind that the database also contains non-free images, which may not be allowed for such purposes (this may also be the case for free-licensed images anyway). See the description file on how to distinguish between free and non-free images (line 106, key [52469][imgs][0][link][full][src]).

Errata

image_errors.txt: 295 corrupted images (as detected when loading with scikit-image instead of OpenCV)