North part of Seattle, Washington
(node(47.5858,-122.4423,47.6929,-122.2438);<;);out meta;
The north_seattle.osm file contains the following tags:
{'member': 18333,
'meta': 1,
'nd': 1237490,
'node': 1122168,
'note': 1,
'osm': 1,
'relation': 1278,
'tag': 907212,
'way': 128150}
According to the OSM documentation, there are three major tags, "node", "way", and "relation"; "member", "nd", and "tag" are sub-level tags.
Street types can be treated as units, and they should have consistent labels. After auditing the street types by extracting the street names and comparing them with an expected list, I found unexpected abbreviations that need to be replaced with full names, such as "E", "Ave", "SW", "St", "W", etc.
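The replacement relies on a mapping from abbreviations to full names and an update helper. A minimal sketch of both (the mapping entries shown here are an illustrative subset, not the full list used in the project):

```python
import re

# Matches the last word of a street name, e.g. "St" in "Pine St".
street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

# Illustrative subset of the abbreviation -> full-name mapping.
mapping = {"St": "Street",
           "Ave": "Avenue",
           "E": "East",
           "W": "West",
           "SW": "Southwest"}

def update_name(name, mapping):
    """Replace an abbreviated street type with its full name."""
    m = street_type_re.search(name)
    if m and m.group() in mapping:
        name = name[:m.start()] + mapping[m.group()]
    return name
```

For example, update_name("Pine St", mapping) returns "Pine Street", while names whose last word is not in the mapping are returned unchanged.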
Before auditing, I decided to cut the data into smaller datasets, so that the scripts run faster and the cleaning can target specific data without scanning the entire file. As an example, I extracted all "relation" tags, along with the sub-tags under each "relation". In the end, instead of running north_seattle.osm, I run "node.osm", "way.osm", and "relation.osm" for auditing and cleaning.
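The splitting step can be sketched with iterparse, copying every top-level element of one type into its own smaller OSM file (a minimal sketch for illustration; it keeps the extracted elements in memory, which is fine for a one-off split but could be streamed instead):

```python
import xml.etree.ElementTree as ET

def split_osm(osmfile, tag_name, output_osm):
    """Copy every top-level element of the given type into a smaller OSM file."""
    root = ET.Element("osm")
    for event, elem in ET.iterparse(osmfile, events=("end",)):
        # The "end" event fires once an element and all its children are parsed,
        # so appending a <node> or <way> here carries its sub-tags along.
        if elem.tag == tag_name:
            root.append(elem)
    ET.ElementTree(root).write(output_osm)
```

Calling split_osm("north_seattle.osm", "node", "node.osm") and likewise for "way" and "relation" produces the three smaller files.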
The following function outputs the post codes found in each osm file.
import xml.etree.ElementTree as ET

def is_postcode(tag):
    # Assumed helper: a <tag> element carries a post code under the "addr:postcode" key.
    return tag.attrib["k"] == "addr:postcode"

def audit(osmfile):
    """Collect the set of post codes that appear in an OSM file."""
    osm_file = open(osmfile, "r")
    postcodes = set()
    for event, elem in ET.iterparse(osm_file, events=("start",)):
        for tag in elem.iter("tag"):
            if is_postcode(tag):
                postcodes.add(tag.attrib["v"])
    osm_file.close()
    return postcodes
Run the function on each of "node.osm", "way.osm", and "relation.osm" to see which file could have a potential problem. It turns out that the format is consistent and correct for "way.osm" and "relation.osm", while "node.osm" contains a few wrong post codes that are not in the Seattle area, for example:
'98230','98271','98354','98516','98531','98532','98902'
Those post codes are incorrect and need to be cleaned.
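The seattle_zipcode whitelist used in the cleaning function below can be built in several ways. One simple sketch, assuming Seattle-proper ZIP codes share the "981" prefix (an assumption for illustration, which happens to reject all of the bad codes above; the project could instead use an explicit list):

```python
def is_seattle_zip(postcode):
    # Assumed heuristic: Seattle-proper ZIP codes start with "981".
    return postcode.startswith("981")

# Illustrative subset of the audit() output: two valid north Seattle codes
# plus two of the out-of-area codes flagged above.
audited = {'98103', '98115', '98230', '98902'}
seattle_zipcode = {z for z in audited if is_seattle_zip(z)}
```

Here seattle_zipcode keeps only '98103' and '98115', dropping the out-of-area codes.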
def update_street(input_osm, output_osm):
    """Replace abbreviated street types with their full names and write a new file."""
    osm_file = ET.parse(input_osm)
    root = osm_file.getroot()
    for tag in root.iter("tag"):
        if is_street_name(tag):
            name = tag.attrib["v"]
            updated_name = update_name(name, mapping)  # mapping: abbreviation -> full name
            tag.set("v", updated_name)
    osm_file.write(output_osm)
    return output_osm
def drop_bad_postcode(input_osm, output_osm, tag_name):
    """Remove <tag> children whose post code is outside the Seattle area."""
    osm_file = ET.parse(input_osm)
    root = osm_file.getroot()
    removed = 0
    for elem in root.iter(tag_name):
        # Iterate over a copy of the children so removal during the loop is safe.
        for tag in list(elem):
            if tag.tag == "tag" and is_postcode(tag):
                postcode = tag.get("v")
                if postcode not in seattle_zipcode:
                    elem.remove(tag)
                    removed += 1
    if removed == 0:
        return input_osm
    osm_file.write(output_osm)
    return (output_osm, removed)
north_seattle.osm---------268.9MB
node.osm------------------193.7MB (after cleaning)
way.osm-------------------77.1MB (after cleaning)
relation.osm--------------1.3MB (after cleaning)
nodes.csv-----------------96.8MB
nodes_tags.csv------------11.6MB
ways.csv------------------7.9MB
ways_nodes.csv------------29.8MB
ways_tags.csv-------------22.7MB
relations.csv-------------76KB
relations_members.csv-----535KB
relations_tags.csv--------175KB
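The cleaned OSM files were converted to the CSV files above before loading into SQL. A minimal sketch of that conversion for nodes (the field names are assumed to follow the usual id/lat/lon/user layout of the CSVs listed above):

```python
import csv
import xml.etree.ElementTree as ET

# Assumed column layout for nodes.csv.
NODE_FIELDS = ["id", "lat", "lon", "user", "uid", "version", "changeset", "timestamp"]

def nodes_to_csv(input_osm, output_csv):
    """Write one CSV row per <node>, taking the values from its attributes."""
    with open(output_csv, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=NODE_FIELDS)
        writer.writeheader()
        for event, elem in ET.iterparse(input_osm, events=("end",)):
            if elem.tag == "node":
                writer.writerow({field: elem.get(field) for field in NODE_FIELDS})
                elem.clear()  # free memory, important for the 190MB+ node file
```

The nodes_tags, ways, and relations CSVs follow the same pattern with their own field lists.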
select count(*) from nodes;
1122168
select count(*) from ways;
128150
select count(*) from relations;
1278
select user,count(*) as n from nodes group by user order by n desc limit 10;
Glassman|357069
SeattleImport|357024
seattlefyi|88077
jeffmeyer|47446
sctrojan79|44101
Sudobangbang|28462
Foundatron|28110
lukobe|23917
chronomex|20729
Omnific|16237
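The queries above assume the CSVs have been imported into a SQL database. A sketch of loading nodes.csv into SQLite with Python's sqlite3 module and running the contributor query (the table schema here is an assumption matching the CSV layout):

```python
import csv
import sqlite3

def load_nodes(db_path, csv_path):
    """Create a nodes table and bulk-insert rows from the CSV export."""
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS nodes
                    (id INTEGER PRIMARY KEY, lat REAL, lon REAL, user TEXT,
                     uid INTEGER, version TEXT, changeset INTEGER, timestamp TEXT)""")
    with open(csv_path, newline="") as f:
        rows = [(r["id"], r["lat"], r["lon"], r["user"], r["uid"],
                 r["version"], r["changeset"], r["timestamp"])
                for r in csv.DictReader(f)]
    conn.executemany("INSERT INTO nodes VALUES (?,?,?,?,?,?,?,?)", rows)
    conn.commit()
    return conn

# The top-contributor query above then runs as:
# conn.execute("SELECT user, COUNT(*) AS n FROM nodes "
#              "GROUP BY user ORDER BY n DESC LIMIT 10").fetchall()
```

The other CSVs load the same way into their own tables (ways, relations, and the *_tags tables).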
The map data is generally large and has a uniform structure only at a few top levels; for example, the uniform structure for node tags includes key, value, and type, but there are no consistent sub-levels for tags. This makes it difficult to understand the features of nodes, so it's not clear what information the data can provide. This could be improved by adding categorized sub-levels for keys and values, so that users who upload map data choose only from those uniform keys and values. Those sub-levels should be specified separately for nodes, ways, and relations, so that they don't contain repeated information.
Besides, another idea to simplify the data is to restrict the tags to common and useful features, such as post codes.
Last but not least, customer ratings could be added for restaurants and other amenities, to make the data more practically useful.
However, making these restrictions may reduce the input sources and would require more effort to maintain and update the structure.
select value,count(*) as n
from ways_tags
where key="bridge"
group by value;
movable|8
viaduct|32
yes|497
select count(*)
from relations_tags
where key="leisure" and value="park";
48
select value,count(*) as n
from nodes_tags
where key="cuisine"
group by value
order by n desc
limit 10;
coffee_shop|209
pizza|76
mexican|73
sandwich|73
american|50
thai|47
italian|46
chinese|41
vietnamese|41
japanese|38
The map data for the Seattle area is relatively clean. It's interesting to see how the data can provide information that relates to daily life.