Persistent Identifier
|
doi:10.23708/N2UY4C |
Publication Date
|
2023-01-13 |
Title
| COCO-style geographically unbiased image dataset for computer vision applications |
Subtitle
| URLs from Flickr platform with specific geographic location focus |
Alternative Title
| COCO_World_URLs |
Author
| Bayet, Theophile (UMI UMMISCO - IRD, Sorbonne Univ., Univ. Cady Ayyad, Univ. Cheikh Anta Diop, Univ. Sciences et Techniques de Hanoi, Univ. Yaoundé 1, Univ. Gaston Berger - France) - ORCID: 0000-0001-5518-4570 |
Point of Contact
|
Bayet, Theophile (UMI UMMISCO - IRD, Sorbonne Univ., Univ. Cady Ayyad, Univ. Cheikh Anta Diop, Univ. Sciences et Techniques de Hanoi, Univ. Yaoundé 1, Univ. Gaston Berger - France) |
Description
| Many datasets already exist for computer vision tasks (ImageNet, MS COCO, Pascal VOC, OpenImages, and numerous others), but they all suffer from significant biases. One bias that matters to us is the origin of the data: most datasets are composed of data from developed countries.
Given this situation, and the need for data with local context in developing countries, we try here to adapt a common data generation process to produce inclusive data, meaning data drawn from locations and cultural contexts that are unseen or poorly represented.
We chose to replicate MS COCO's data generation process, as it is well documented and easy to implement. Data was collected from January to April 2022 through the Flickr platform.
This dataset contains the results of our data collection process, as follows:
- 23 text files containing comma-separated URLs, one file for each of the 23 geographic zones identified in the UN M49 standard. Each text file is named after the geographic zone it covers.
- Annotations for 400 images per geographic zone. These annotations are COCO-style and indicate the presence or absence of 91 categories of objects or concepts in the images. They are shared in JSON format.
- Licenses for the 400 annotations per geographic zone, based on the original licenses of the data and specified per image. These licenses are shared in CSV format.
- A document explaining the objectives and methodology underlying the data collection, also describing the different components of the dataset.
(2022-11-02) |
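The files described above could be read along the following lines. This is only a minimal sketch, not the authors' tooling: the function names and any file paths are hypothetical, and the annotation layout is assumed to follow the standard COCO keys ("categories" and "annotations" with "image_id" and "category_id"), which the record does not confirm in detail.

```python
import json
from collections import Counter
from pathlib import Path


def parse_url_list(text):
    """Split a comma-separated string of URLs, dropping blanks and whitespace."""
    return [u.strip() for u in text.split(",") if u.strip()]


def load_zone_urls(path):
    """Read one zone's text file (the path is hypothetical) and return its URLs."""
    return parse_url_list(Path(path).read_text(encoding="utf-8"))


def load_annotations(path):
    """Load a JSON annotation file (the path is hypothetical) into a dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def category_presence(coco):
    """Count, per category name, the number of distinct images it appears in.

    Assumes standard COCO keys; the actual files may be organised differently.
    """
    names = {c["id"]: c["name"] for c in coco.get("categories", [])}
    # Deduplicate (image, category) pairs so repeated annotations count once.
    seen = {(a["image_id"], a["category_id"]) for a in coco.get("annotations", [])}
    return Counter(names.get(cat_id, str(cat_id)) for _, cat_id in seen)
```

Since the text files contain only URLs, the images themselves would still have to be downloaded separately, subject to the per-image licenses listed in the CSV files.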
Subject
| Computer and Information Science |
Keyword
| Images
Computer Vision
Developing countries
Bias
Classification |
Related Publication
| Theophile Bayet, Christophe Denis, Alassane Bah, Jean-Daniel Zucker. Distribution Shift nested in Web Scraping : Adapting MS COCO for Inclusive Data. ICML Workshop on Principles of Distribution Shift 2022, Jul 2022, Baltimore, United States. ⟨hal-03777066⟩ doi: hal-03777066 https://hal.archives-ouvertes.fr/hal-03777066 |
Language
| English |
Producer
| Bayet, Theophile (IRD) |
Production Date
| 2022-01-01 |
Depositor
| Bayet, Theophile |
Deposit Date
| 2022-11-02 |
Time Period
| Start Date: 2022-01-01 ; End Date: 2022-04-01 |
Software
| MS_COCO_world_url_scraper |
Related Material
| MS_COCO_world_url_scraper (source code available on GitHub) |
Data Source
| Flickr platform |