Project 3 - OpenStreetMap Data Case Study

Map Area

North part of Seattle, Washington

Audit Data

  • Types of tags and the count of each tag in the dataset.

The north_seattle.osm file contains the following tags:

{'member': 18333,
 'meta': 1,
 'nd': 1237490,
 'node': 1122168,
 'note': 1,
 'osm': 1,
 'relation': 1278,
 'tag': 907212,
 'way': 128150}

According to the OSM documentation, there are three major tags: "node", "way", and "relation". "member", "nd", and "tag" are sublevel tags.
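These counts can be produced by streaming the file and tallying element names. A minimal sketch follows, assuming iterparse; count_tags is an illustrative name, not the project's exact code.

import xml.etree.ElementTree as ET
from collections import defaultdict

def count_tags(osmfile):
    # Tally every element name encountered while streaming the file.
    counts = defaultdict(int)
    for event, elem in ET.iterparse(osmfile, events=("start",)):
        counts[elem.tag] += 1
    return dict(counts)

# Example: print(count_tags("north_seattle.osm"))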

  • Audit street type

Street types can be treated as units and should have consistent labels. After auditing street types by extracting the street names and comparing them against an expected list, the audit finds unexpected abbreviations that need to be replaced with full names, such as "E", "Ave", "SW", "St", "W", etc. A sketch of this audit follows.
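This is a minimal sketch, assuming the standard "addr:street" key; the expected list here is a typical subset, and the function names are illustrative rather than the project's exact code.

import xml.etree.ElementTree as ET
import re
from collections import defaultdict

street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)
expected = ["Street", "Avenue", "Boulevard", "Drive", "Court", "Place",
            "Lane", "Road", "Way", "North", "South", "East", "West"]

def audit_street_type(street_types, street_name):
    # Group unexpected trailing words (street types) with example names.
    m = street_type_re.search(street_name)
    if m and m.group() not in expected:
        street_types[m.group()].add(street_name)

def audit_streets(osmfile):
    street_types = defaultdict(set)
    for event, elem in ET.iterparse(osmfile, events=("start",)):
        for tag in elem.iter("tag"):
            if tag.attrib["k"] == "addr:street":
                audit_street_type(street_types, tag.attrib["v"])
    return street_types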

  • Audit post code

Post codes could be another potential problem, for example incorrect values or an inconsistent format. Similarly to the street name audit, the function and code below extract and output each unique post code.

Before auditing, I decided to cut the data into smaller datasets, so that each audit runs faster and the cleaning can target specific data without scanning the entire file. As an example, I extract all "relation" tags, together with the sub-tags under each "relation" (a sketch of the split is shown below). After that, instead of running against north_seattle.osm, I audit and clean "node.osm", "way.osm", and "relation.osm".
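One way to produce such a split is to stream the source file and copy each matching top-level element, together with its sub-tags, into a new document. This is a minimal sketch; split_osm is an illustrative name, not the project's exact code.

import xml.etree.ElementTree as ET

def split_osm(input_osm, output_osm, tag_name):
    # Collect every element named tag_name under a fresh <osm> root.
    root_out = ET.Element("osm")
    for event, elem in ET.iterparse(input_osm, events=("end",)):
        if elem.tag == tag_name:
            root_out.append(elem)
    ET.ElementTree(root_out).write(output_osm)

# Example: split_osm("north_seattle.osm", "relation.osm", "relation")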

The following function outputs the unique post codes in an osm file; the is_postcode helper that it relies on is included for completeness.

import xml.etree.ElementTree as ET

def is_postcode(tag):
    # A tag holds a post code when its key is "addr:postcode"
    # (assumed, standard OSM addressing key).
    return tag.attrib["k"] == "addr:postcode"

def audit(osmfile):
    osm_file = open(osmfile, "r")
    postcode = set()
    for event, elem in ET.iterparse(osm_file, events=("start",)):
        for tag in elem.iter("tag"):
            if is_postcode(tag):
                postcode.add(tag.attrib["v"])
    osm_file.close()
    return postcode
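A quick usage sketch, assuming the files produced by the split above:

for filename in ["node.osm", "way.osm", "relation.osm"]:
    print(filename, sorted(audit(filename)))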

Running the function for each file shows which one could have a potential problem. The format turns out to be consistent and correct for "way.osm" and "relation.osm", while "node.osm" contains a few post codes that are not in the Seattle area, for example:

'98230','98271','98354','98516','98531','98532','98902'

Those post codes are incorrect and need to be cleaned.

Clean Data

  • Clean street names

The function below substitutes each abbreviated street name with its full name; it is run for each osm file.
import xml.etree.ElementTree as ET

def update_street(input_osm, output_osm):
    # Parse the (already split) osm file and rewrite abbreviated street names.
    osm_file = ET.parse(input_osm)
    root = osm_file.getroot()
    for tag in root.iter("tag"):
        if is_street_name(tag):
            name = tag.attrib["v"]
            updated_name = update_name(name, mapping)
            tag.set("v", updated_name)
    osm_file.write(output_osm)
    return output_osm
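The helpers is_street_name and update_name and the mapping dictionary are referenced above but not shown. Below is a minimal sketch of what they might look like, with mapping entries based on the abbreviations found during the audit; the exact entries are assumptions.

import re

street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

# Abbreviation -> full name; extend with whatever the audit uncovers.
mapping = {"St": "Street", "Ave": "Avenue", "E": "East",
           "W": "West", "SW": "Southwest"}

def is_street_name(tag):
    return tag.attrib["k"] == "addr:street"

def update_name(name, mapping):
    # Replace the trailing street-type word when it appears in the mapping.
    m = street_type_re.search(name)
    if m and m.group() in mapping:
        name = name[:m.start()] + mapping[m.group()]
    return name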
  • Clean post codes in node.osm

Since only node.osm contains incorrect post codes that wrongly describe their nodes, run the following function to remove the wrong post code tags while keeping the rest of each node. The function compares each post code with the Seattle post codes saved in a list and drops the ones not included.
import xml.etree.ElementTree as ET

def drop_bad_postcode(input_osm, output_osm, parent_tag):
    osm_file = ET.parse(input_osm)
    root = osm_file.getroot()
    i = 0
    for elem in root.iter(parent_tag):
        # Collect offending tags first; removing while iterating is unsafe.
        bad_tags = [tag for tag in elem.iter("tag")
                    if is_postcode(tag) and tag.get("v") not in seattle_zipcode]
        for tag in bad_tags:
            elem.remove(tag)
            i += 1
    if i == 0:
        return input_osm
    osm_file.write(output_osm)
    return (output_osm, i)
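Here seattle_zipcode is the whitelist of Seattle post codes mentioned above. The values below are illustrative placeholders, not the project's actual list:

# Illustrative, non-exhaustive whitelist of north Seattle post codes.
seattle_zipcode = ["98103", "98105", "98107", "98115",
                   "98117", "98125", "98133", "98155"]

# Example (output file name hypothetical):
# drop_bad_postcode("node.osm", "node_cleaned.osm", "node")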

Overview of Data

north_seattle.osm---------268.9MB
node.osm------------------193.7MB (after cleaning)
way.osm-------------------77.1MB (after cleaning)
relation.osm--------------1.3MB (after cleaning)
nodes.csv-----------------96.8MB
nodes_tags.csv------------11.6MB
ways.csv------------------7.9MB
ways_nodes.csv------------29.8MB
ways_tags.csv-------------22.7MB
relations.csv-------------76KB
relations_members.csv-----535KB
relations_tags.csv--------175KB
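The cleaned osm files were converted to the CSV files above and then loaded into a SQL database for querying. A minimal sketch of loading one CSV into SQLite with Python follows; the database name, the CSV column names, and the table schema are assumptions, not the project's exact setup.

import csv
import sqlite3

conn = sqlite3.connect("north_seattle.db")
cur = conn.cursor()
# Assumed schema; column names must match the CSV header.
cur.execute("""CREATE TABLE IF NOT EXISTS nodes
               (id, lat, lon, user, uid, version, changeset, timestamp)""")
with open("nodes.csv") as f:
    reader = csv.DictReader(f)
    rows = [(r["id"], r["lat"], r["lon"], r["user"], r["uid"],
             r["version"], r["changeset"], r["timestamp"]) for r in reader]
cur.executemany("INSERT INTO nodes VALUES (?,?,?,?,?,?,?,?)", rows)
conn.commit()
conn.close()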
  • How many nodes, ways, relations?
select count(*) from nodes;
1122168
select count(*) from ways;
128150
select count(*) from relations;
1278
  • List the top 10 users by number of node contributions.
select user,count(*) as n from nodes group by user order by n desc limit 10;
Glassman|357069
SeattleImport|357024
seattlefyi|88077
jeffmeyer|47446
sctrojan79|44101
Sudobangbang|28462
Foundatron|28110
lukobe|23917
chronomex|20729
Omnific|16237

Other Ideas

Consistency of data structure

The map data is generally large, and only the top few levels have a uniform structure; for example, every node tag includes a key, a value, and a type, but there are no consistent sub-levels for tags. This makes it difficult to understand the features of a node, so it is not clear what information the data can provide. This could be improved by adding categorized sub-levels for keys and values, so that users who upload map data would only choose from those uniform keys and values. Those sub-levels should also be specific to nodes, ways, and relations, so they do not contain repeated information.
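As a purely hypothetical illustration of the idea, tag keys could be validated against a controlled vocabulary per element type at upload time; none of the names below come from the project.

# Hypothetical controlled vocabulary: allowed tag keys per element type.
allowed_keys = {
    "node": {"amenity", "name", "cuisine", "addr:street", "addr:postcode"},
    "way": {"highway", "name", "bridge", "building"},
    "relation": {"type", "name", "leisure"},
}

def validate_tag(element_type, key):
    # Accept a tag only if its key belongs to that element type's vocabulary.
    return key in allowed_keys.get(element_type, set())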

Another idea to simplify the data is to restrict tags to common and useful features such as post code.

Last but not least, customer ratings for restaurants and other amenities could be included, to make the data more practically useful.

However, making these restrictions may reduce the number of input sources and would require more effort to maintain and update the structure.

Additional Data Exploration

  • How many bridges are in the selected area of Seattle?
select value,count(*) as n 
from ways_tags 
where key="bridge" 
group by value;


movable|8
viaduct|32
yes|497
  • How many parks are in the selected area of Seattle?
select count(*) 
from relations_tags 
where key="leisure" and value="park";


48
  • Top 10 restaurant cuisine types in the selected area of Seattle.
select value,count(*) as n 
from nodes_tags 
where key="cuisine" 
group by value 
order by n desc 
limit 10;


coffee_shop|209
pizza|76
mexican|73
sandwich|73
american|50
thai|47
italian|46
chinese|41
vietnamese|41
japanese|38

Conclusion

The map data for the Seattle area is relatively clean. It is interesting to see how the data can provide information that relates to daily life.