Object Detection with Street View Images for Demographic Analysis
In a study conducted at Stanford University [1], Google Street View (GSV) images were analyzed with an artificial intelligence model to understand the political tendencies of people in the United States. The model predicted whether a neighborhood was likely to vote Republican or Democrat in the presidential elections by analyzing the cars in the neighborhood, classifying them by model, year and color. A comparable demographic study, the American Community Survey (ACS), costs about $250 million every year. This study showed that artificial intelligence models can be used for demographic analyses at a fraction of that cost.
Inspired by this study, we set out to build our own model to predict a neighborhood’s welfare level from factors observable within it. Our project was supported by and presented at inzva’s AI Projects #6 Showcase. Project contributors are Alara Hergün, Başak Ekinci, Efehan Danışman and Sefa Kurtipek.
The Dataset
Stanford’s study was based on socioeconomic data on political leanings acquired from the ACS; no such dataset was available in our case. We did, however, find two datasets with demographic analyses of Istanbul: Mahallem İstanbul (2016) and İlçelerin Sosyo-Ekonomik Gelişmişlik Sıralaması (2017). The first, published by the Istanbul Metropolitan Municipality (IBB), assigns a grade from A+ to F to every neighborhood in Istanbul based on features such as population, access to health and education, and economic level. We started with Mahallem İstanbul, but obtaining evenly distributed photos of every neighborhood and determining each neighborhood’s boundary coordinates proved demanding, and working at a larger scale promised to be more detailed and systematic, so we switched to our second dataset: İlçelerin Sosyo-Ekonomik Gelişmişlik Sıralaması (the Socio-Economic Development Ranking of Districts, whose SEGE index scores we use as labels). This dataset looks at districts rather than neighborhoods, uses more or less the same features, and assigns every district in Turkey a score ranging from -1.74 to 7.73. You can read further in [4]. For this study we worked with photos and scores from Istanbul, but in the future we can (and most probably will) extend the same demographic analysis to other cities in Turkey.
Istanbul’s demographic makeup is complex and varies greatly from district to district, so we planned to train our model primarily on GSV images taken only from Istanbul, and to expand to other Turkish cities later. After reading related studies [1, 2], we looked for a way to collect evenly distributed photos from regions with different demographic outcomes. To download the photos we used the Google Street View API [3]: given a latitude and longitude, you specify heading and pitch values to orient the camera and retrieve an image. Our biggest problem was taking non-overlapping photos, so we divided each district into a grid of equal squares and took photos at the corners of every square. To find district boundaries we used the Google Geocoding API, performing reverse geocoding to obtain the upper-left and lower-right coordinates of each bounding box.
We inspected these coordinates in geojson.io, since some districts have non-rectangular shapes; those exceptional districts were split into several rectangular grids based on their appearance in geojson.io.
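To illustrate the download step, here is a minimal sketch of fetching one image from the Street View Static API with the requests library. The coordinate, file names and API key are placeholders, not values from our pipeline:

```python
import requests

# Placeholder key; the Street View Static API returns a JPEG for a given
# location, heading (compass direction) and pitch (camera tilt).
API_KEY = "YOUR_API_KEY"

def fetch_gsv_image(lat, lng, heading=0, pitch=0, size="640x640"):
    """Download one Street View image for the given coordinate and camera pose."""
    url = "https://maps.googleapis.com/maps/api/streetview"
    params = {
        "size": size,          # maximum 640x640 on the standard tier
        "location": f"{lat},{lng}",
        "heading": heading,    # 0-360, clockwise from north
        "pitch": pitch,        # -90 (down) to 90 (up)
        "key": API_KEY,
    }
    resp = requests.get(url, params=params)
    resp.raise_for_status()
    return resp.content        # raw JPEG bytes

# Example: one photo per cardinal direction at a grid corner (placeholder coordinate).
for heading in (0, 90, 180, 270):
    jpeg = fetch_gsv_image(40.9830, 29.0290, heading=heading)
    with open(f"kadikoy_{heading}.jpg", "wb") as f:
        f.write(jpeg)
```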
In order to divide each region into a grid, we used the formula from [2]. It computes the haversine distance between two coordinates. Since the Earth is not flat, distances and areas should strictly be computed along arcs; over the small extent of a district, however, the locally flat approximation used in the computation introduces only a small error and greatly simplifies the work. The resulting distance tells us how many equal grid cells fit between two coordinates.
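Below is a minimal sketch of the grid construction, using the standard haversine formula; the step size and helper names are our own for illustration, not taken verbatim from [2]:

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres

def haversine(lat1, lng1, lat2, lng2):
    """Great-circle (arc) distance in metres between two coordinates."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lng2 - lng1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def grid_corners(top_left, bottom_right, step_m=500):
    """Split a bounding box into a grid with roughly step_m spacing and
    return the corner coordinates where photos are taken."""
    lat1, lng1 = top_left
    lat2, lng2 = bottom_right
    # Number of steps along each axis, from the arc lengths of the two edges.
    n_lat = max(1, round(haversine(lat1, lng1, lat2, lng1) / step_m))
    n_lng = max(1, round(haversine(lat1, lng1, lat1, lng2) / step_m))
    return [
        (lat1 + (lat2 - lat1) * i / n_lat, lng1 + (lng2 - lng1) * j / n_lng)
        for i in range(n_lat + 1)
        for j in range(n_lng + 1)
    ]
```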
For visualization and for manipulating a district’s coordinates we used geojson.io. It can be driven either manually or by uploading a file in GeoJSON format.
Below, you can see the difference between the visualizations in Google Maps and geojson.io.
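To inspect a grid in geojson.io, the corner points can be dumped as a GeoJSON FeatureCollection and dragged onto the site. A small sketch, reusing the hypothetical grid_corners helper from the previous snippet:

```python
import json

def corners_to_geojson(corners, path="grid.geojson"):
    """Write grid corners as a GeoJSON FeatureCollection for geojson.io.
    Note that GeoJSON expects [longitude, latitude] order."""
    features = [
        {"type": "Feature",
         "properties": {},
         "geometry": {"type": "Point", "coordinates": [lng, lat]}}
        for lat, lng in corners
    ]
    with open(path, "w") as f:
        json.dump({"type": "FeatureCollection", "features": features}, f)

# Example with a placeholder bounding box:
# corners_to_geojson(grid_corners((41.005, 28.94), (40.99, 28.97)))
```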
The Model
For detection we used a YOLO v3 model pre-trained on the COCO dataset, via the ImageAI library.
YOLO Object Detection
YOLO (“You Only Look Once”) is a real-time object detection algorithm built on convolutional neural networks. It is called “You Only Look Once” because it passes the whole image through the network in a single step, which makes detection very fast. YOLO divides the input image into an NxN grid, whose size may vary with the image. Each grid cell decides whether an object’s center point falls within its area; if so, the cell predicts the object’s class and the height and width of a bounding box around it. These bounding boxes are the detection output of the YOLO algorithm.
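For reference, running detection with ImageAI’s pre-trained YOLO v3 looks roughly like the sketch below (ImageAI 2.x-style API); the image file names are placeholders, and the yolo.h5 weights are assumed to have been downloaded separately:

```python
from imageai.Detection import ObjectDetection

# Assumes the pre-trained YOLO v3 weights ("yolo.h5", trained on COCO)
# have been downloaded from the ImageAI releases page.
detector = ObjectDetection()
detector.setModelTypeAsYOLOv3()
detector.setModelPath("yolo.h5")
detector.loadModel()

detections = detector.detectObjectsFromImage(
    input_image="kadikoy_90.jpg",
    output_image_path="kadikoy_90_annotated.jpg",  # copy with boxes drawn on it
    minimum_percentage_probability=40,
)
for d in detections:
    # Each detection is a dict with the class name, confidence and box corners.
    print(d["name"], round(d["percentage_probability"], 1), d["box_points"])
```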
With the YOLO v3 model we were able to detect everyday objects such as people, cars, traffic objects, animals and much more. Of these classes we mainly used motorcycles, cars and people. Using these counts as features and SEGE scores as the label, we obtained a correlation of about 0.27–0.29, and our primary aim was to increase it with more complex features such as tree detection, building detection, and estimating a building’s age with a model. As these features are not common in existing datasets, we followed a tutorial to build them; the building model has not looked very promising so far and is still in development. Our tree detection model worked well, but we could not find many green areas in our images, as Istanbul is a very crowded, urban place. We are still trying to integrate these features into the project in a more useful way.
Results
We acquired approximately 200–250 Google Street View images from each of 10 districts of Istanbul with similar sizes but different SEGE levels. We detected various objects with YOLO v3, but only a handful of classes appeared in meaningful numbers. We therefore grouped them into two categories: motor vehicles (motorcycle, car, bus, truck) and person. After aggregating the object counts for each district, we plotted them on scatterplots to check whether there is any correlation between the SEGE score and the number of objects. Below you can see the number of people, motor vehicles and total objects versus the socioeconomic development score of each district.
The first graph shows the relationship between SEGE and the number of motor vehicles: a weak correlation of 0.18 across our 10 districts.
Second, we looked at the relationship between SEGE and the number of people. The correlation of 0.24 looks more promising, but is still not robust enough to serve as the only factor in our model.
In the third graph we combined the two factors hoping for a stronger correlation, but the combination did not change our conclusions either: the correlation between SEGE and the combined number of people and motor vehicles is a weak 0.23.
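For readers who want to reproduce the numbers, the grouping and correlation step amounts to the following sketch; the district totals shown are made-up illustrative values, not our measured counts:

```python
from collections import Counter
from scipy.stats import pearsonr

MOTOR_VEHICLES = {"car", "motorcycle", "bus", "truck"}  # COCO class names

def count_groups(detections):
    """Collapse raw YOLO detections for one district into the two groups we analyze."""
    names = Counter(d["name"] for d in detections)
    vehicles = sum(n for cls, n in names.items() if cls in MOTOR_VEHICLES)
    return vehicles, names["person"]

# Made-up numbers for illustration only, not our measured district totals.
sege_scores    = [2.10, 1.40, 0.90, 0.30, -0.20]
vehicle_counts = [310, 205, 180, 240, 150]
person_counts  = [120, 90, 140, 60, 70]

for label, counts in [("motor vehicles", vehicle_counts), ("people", person_counts)]:
    r, p = pearsonr(sege_scores, counts)
    print(f"SEGE vs {label}: r = {r:.2f} (p = {p:.2f})")
```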
In short, we reached only a weak correlation between object counts and SEGE scores, as the plots above show. This does not stop us from exploring the topic further: the project will continue with the detection of more features (e.g. building age, trees) and the acquisition of more GSV images from more districts, to extract a clearer relationship between SEGE scores and object counts.
Conclusion
To summarize, our study is essentially a pipeline: extract evenly spaced photos from Istanbul’s districts via the Google Street View API and geojson.io, feed these photos into a YOLO v3 model to count different features (such as people and motor vehicles), and correlate the results with the district scores published by the Ministry of Industry and Technology.
Extracting insights about a neighborhood or district through traditional survey methods is costly. All the prior work we have encountered so far was expensive in terms of both money and human resources. Using object detection methods, we would like to extract similar insights from GSV images instead. While we have not yet found the relationship we were looking for, this is an ongoing project and we aim to extend its scope.
Code: https://github.com/inzva/object-detection-with-street-view
References
[1] Gebru, T., Krause, J., Wang, Y., Chen, D., Deng, J., Aiden, E. L., & Fei-Fei, L. (2017). Using deep learning and Google Street View to estimate the demographic makeup of neighborhoods across the United States. Proceedings of the National Academy of Sciences, 114(50), 13108–13113. doi:10.1073/pnas.1700035114
[2] Diou, C., Lelekas, P., & Delopoulos, A. (2018). Image-Based Surrogates of Socio-Economic Status in Urban Neighborhoods Using Deep Multiple Instance Learning. Journal of Imaging, 4, 125.
[3] https://developers.google.com/maps/documentation/streetview/overview
[4] https://kisi.deu.edu.tr/yunusemre.ozer/lce_sege-2017.pdf